- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle
No need to backport anything since this class in not used yet
Closesscylladb/scylladb#24657
* github.com:scylladb/scylladb:
s3_test: Add s3_client test for non-retryable error handling
s3_test: Add trace logging for default_retry_strategy
s3_client: Fix edge case when the range is exhausted
s3_client: Fix indentation in try..catch block
s3_client: Stop retries in chunked download source
s3_client: Enhance test coverage for retry logic
s3_client: Add test for Content-Range fix
s3_client: Fix missing negation
s3_client: Refine logging
s3_client: Improve logging placement for current_range output
This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except `counter` type as that is not comparable. The byte-comparable format is a pre-requisite for implementing the trie based index format for our sstables(https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md
Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR.
Refs https://github.com/scylladb/scylladb/issues/19407
New feature - backport not required.
Closesscylladb/scylladb#23541
* github.com:scylladb/scylladb:
types/comparable_bytes: add testcase to verify compatibility with cassandra
types/comparable_bytes: support variable-length natively byte-ordered data types
types/comparable_bytes: support decimal cql3 types
types/comparable_bytes: introduce count_digits() method
types/comparable_bytes: support uuid and timeuuid cql3 types
types/comparable_bytes: support varint cql3 type
types/comparable_bytes: support skipping sign byte write in decode_signed_long_type
types/comparable_bytes: introduce encode/decode_varint_length
types/comparable_bytes: support float and double cql3 types
types/comparable_bytes: support date, time and timestamp cql3 types
types/comparable_bytes: support bigint cql3 type
types/comparable_bytes: support fixed length signed integers
types/comparable_bytes: support boolean cql3 type
types: introduce comparable_bytes class
bytes_ostream: overload write() to support writing from FragmentedView
docs: fix minor typo in docs/dev/cql3-type-mapping.md
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.
Fixesscylladb/scylladb#24018
Backport: all of the supported versions are affected, so we want to backport the changes there.
Closesscylladb/scylladb#24151
* github.com:scylladb/scylladb:
cql3/description: Serialize only rvalues of description
cql3: Represent create_statement using managed_string
cql3/statements/describe_statement.cc: Don't copy descriptions
cql3: Use managed_bytes instead of bytes in DESCRIBE
utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
The following cql3 data types - ascii, blob, duration, inet, and text -
are natively byte-ordered in their serialized forms. To encode them into
a byte-comparable format, zeros are escaped, and since these types have
variable lengths, the encoded form is terminated in an escaped state to
mark its end.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The decimal cql3 type is internally stored as a scale and an unscaled
integer. To convert them into a byte comparable format, they are first
normalized into a base-100 exponent and a mantissa that lies in [0.01, 1)
and then encoded into a byte sequence that preserves the numerical order.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implemented a method `count_digits()` to return the number of significant
digits in a given boost::multiprecision:cpp_int. This is required to
convert big_decimal to a byte comparable format.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The uuid type values are composed of two fixed-length unsigned integers:
an msb and an lsb. The msb contains a version digit, which must be
pulled first in a byte-comparable representation. For version 1 uuids,
in addition to extracting the version digit first, the msb must be
rearranged to make it byte comparable. The lsb is written as is.
For the timeuuid type, the msb is handled simliar to the version 1 uuid
values. The lsb however is treated differently - the sign bits of all
bytes are inverted to preserve the legacy comparison order, which
compared individual bytes as signed values.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Any varint value less than 7 bytes is encoded using the signed long
encoding format and remaining values are all encoded using the full form
encoding :
<signbyte><length as unsigned integer - 7><7 or more bytes>,
where <signbyte> is 00 for negative numbers and FF for positive ones,
and the length's bytes are inverted if the number is negative (so that
longer length sorts smaller).
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The length of a varint value is encoded separately as an unsigned
variable-length integer. For negative varint values, the encoded bytes
are flipped to ensure that longer lengths sort smaller. This patch
implements both encoding and decoding logic for varint lengths and will
be used by the subsequent patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The sign bit is flipped for positive values to ensure that they are
ordered after negative values. For negative values, all the bytes are
inverted, allowing larger negative values to be ordered before smaller
ones.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Both the date and time cql3 types are internally unsigned fixed length
integers. Their serialized form is already byte comparable, so the
encoder and decoder return the serialized bytes as it is.
The timestamp type is encoded using the fixed length signed integer
encoding.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The bigint type, internally implemented as a long data type, is encoded
using a variable-length encoding similar to UTF-8. This enables a
significant amount of space to be saved when smaller numbers are
frequently used, while still permitting large values to be efficiently
encoded.
The first bit of the encoding represents the inverted sign (i.e., 1 for
positive, 0 for negative), followed by length encoded as a sequence of
bits matching the inverted sign. This is then followed by a differing
bit (except for 9-byte encodings) and the bits of the number's two's
complement.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
To encode fixed-length signed integers in a byte-comparable format, the
first bit of each value is inverted. This ensures that negative numbers
are ordered before positive ones during comparison. This patch adds
support for the data types : byte_type (tinyint), short_type (smallint),
and int32_type (int). Although long_type (bigint) is a fixed length
integer type, it has different byte comparable encoding and will be
handled separately in another patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This patch implements a new class, `comparable_bytes`, designed to
implement methods for converting data values to and from byte-comparable
formats. The class stores the comparable bytes as `managed_bytes` and
currently provides the structure for all required methods. The actual
logic for converting various data types will be implemented in subsequent
patches.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Overloaded write() method to support writing a FragmentedView into
bytes_ostream. Also added a testcase to verify the implementation.
The new helper will be used by the byte_comparable implementation
during the encode/decode process.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.
This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.
main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.
Fixes https://github.com/scylladb/scylladb/issues/17043
backport is not needed. this is preliminary work for support of MVs and CDC with tablets.
Closesscylladb/scylladb#22906
* github.com:scylladb/scylladb:
tablets: validate no clustering row mutations on co-located tables
raft_group0_client: extend validate_change to mixed_change type
docs: topology-over-raft: document co-located tables
tablet-mon.py: visual indication for co-located tablets
tablet-mon.py: handle co-located tablets
test/boost/view_schema_test.cc: fix race in wait_until_built
boost/tablets_test: test load balancing and resize of co-located tablets
test/tablets: test tablets colocation
tablets: co-locate view tablets with base when the partition keys match
test/pylib/tablets: common get_tablet_count api
test_mv_tablets: use get_tablet_replicas from common tablets api
test/pylib/tablets: fix test api to read tablet replicas from base table
tablets: allocator: create co-located tables in a single operation
alternator: prepare all new tables in a single announcement
migration_manager: add notification for creating multiple tables
tablets: read_tablet_transition_stage: read from base table
storage service: allow repair request only on base tables
tablets: keyspace_rf_change: apply on base table
storage service: generate tablet migration updates on base tables
tablets: replace all_tables method
tablets: split when all co-located tablets are ready
tablets: load balancer: sizing plan for table groups
tablets: load balancer: handle co-located tablets
tablets: allocate co-located tablets
tablets: handle migration of co-located tablets
storage service: add repair colocated tablets rpc
tablets: save and read tablet metadata of co-located tables
tablets: represent co-located tables in tablet metadata
tablets: add base_table column to system.tablets
docs: update system.tablets schema
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by
```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```
in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.
However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.
In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.
We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.
Fixesscylladb/scylladb#24018
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.
Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, doesn't matter if they
are part of a co-located group and have a base table.
We replace all_tables with new methods that can be used for each of the
cases.
Fixes#24574
* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.
Closesscylladb/scylladb#24633
* github.com:scylladb/scylladb:
encryption_at_rest_test: Add exception handler to ensure proxy stop
encryption: Ensure stopping timers in provider cache objects
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.
Fixes: https://github.com/scylladb/scylladb/issues/24637Closesscylladb/scylladb#24638
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
Closesscylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.
Fixes: https://github.com/scylladb/scylladb/issues/24506
Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.
Closesscylladb/scylladb#24497
* github.com:scylladb/scylladb:
mutation: check key of inserted rows
compound: optimize is_full() for single-component types
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.
This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.
Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.
Fixes: #24581
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#24640
Before we can eradicate the numerical sstable generations,
This series completes https://github.com/scylladb/scylladb/issues/20337
by disabling the use of numerical sstable generations where we can
and making sure the feature is never disabled.
Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables.
Refs #24248
* Enhancement. No backport required
Closesscylladb/scylladb#24554
* github.com:scylladb/scylladb:
feature_service: never disable UUID_SSTABLE_IDENTIFIERS
test: sstable_move_test: always use uuid sstable generation
test: sstable_directory_test: always use uuid sstable generation
sstables: sstable_generation_generator: set last_generation=0 by default
test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
test: lib: test_env: always use uuid sstable generation
test: sstable_test: always use uuid sstable generation
test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
test: sstable_compaction_test: always use uuid sstable generation
Some tests want to switch between sched groups. For that there's
cql-test-env facility to create and use them. However, there's a test
that uses replica::database as sched groups provider, which is not nice.
Fix it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24615
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data
Make sure the keys are full prefixes as it is expected to be the case
for rows. At severeal occasions we have seen empty row keys make their
ways into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.
The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this tests are rejected by the
new checks.
Fixes: https://github.com/scylladb/scylladb/issues/24506
cql, schema: Extend name length limit from 48 to 192 bytes
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.
The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
255 bytes (common filesystem limit for a path component)
- 32 bytes (for the 32-character UUID string)
- 1 byte (for the '-' separator)
- 15 bytes (for the '_scylla_cdc_log' suffix)
- 15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).
This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.
Fixes#4480
Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release.
Closesscylladb/scylladb#24500
* github.com:scylladb/scylladb:
cql, schema: Extend name length limit from 48 to 192 bytes
replica: Remove unused keyspace::init_storage()
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
Fixesscylladb/scylladb#21413
Backport candidate because it fixes a crash that can happen in existing stable branches.
Closesscylladb/scylladb#21638
* github.com:scylladb/scylladb:
memtable: ensure _flushed_memory doesn't grow above total memory usage
replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).
Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.
But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.
But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.
To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.
This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.
This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.
Thus, cancel the timer, when calling the drain method to disable
the compaction manager.
Fixes https://github.com/scylladb/scylladb/issues/24504
All versions are affected. So it's a good candidate for a backport.
Closesscylladb/scylladb#24505
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.
The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
255 bytes (common filesystem limit for a path component)
- 32 bytes (for the 32-character UUID string)
- 1 byte (for the '-' separator)
- 15 bytes (for the '_scylla_cdc_log' suffix)
- 15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).
This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.
The contract in mutation_reader.hh says:
```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
future<> fast_forward_to(const dht::partition_range& pr) {
```
`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.
Fix that.
Fixesscylladb/scylladb#24542Closesscylladb/scylladb#24543