Compare commits

...

20 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
1303567fa4 Fix race conditions in atomic_vector by adding synchronization and snapshots
- Add write lock to add() method to prevent concurrent modifications
- Remove insufficient locked() check in thread_for_each_nested()
- Add vector snapshots in all iteration methods to prevent races
- Update class documentation to reflect atomic operations
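The locking-plus-snapshot approach in the bullets above can be sketched in Python (an illustrative analogue only, not the actual C++ `atomic_vector`):

```python
import threading

class AtomicVector:
    """Toy analogue of a vector whose add() and iteration are race-free:
    mutations take a lock, and iteration walks a snapshot taken under the
    same lock, so concurrent add() calls cannot invalidate the traversal."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def add(self, item):
        with self._lock:          # write lock prevents concurrent modification
            self._items.append(item)

    def for_each(self, fn):
        with self._lock:          # take a snapshot under the lock...
            snapshot = list(self._items)
        for item in snapshot:     # ...then iterate lock-free over the copy
            fn(item)

v = AtomicVector()
for i in range(3):
    v.add(i)
out = []
v.for_each(out.append)
print(out)  # [0, 1, 2]
```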

Co-authored-by: mykaul <4655593+mykaul@users.noreply.github.com>
2025-12-21 16:06:11 +00:00
copilot-swe-agent[bot]
e7d10b5be0 Initial plan 2025-12-21 16:00:55 +00:00
Anna Stuchlik
f65db4e8eb doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756
2025-12-19 12:53:40 +01:00
Botond Dénes
df2ac0f257 Merge 'test: dtest: schema_management_test.py: migrate from dtest' from Dario Mirovic
This PR migrates schema management tests from dtest to this repository.

One reason is the ongoing effort to migrate tests from dtest to this repository.

Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed once with a timeout error. The main suspects so far are infra-related problems, such as infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932), show the following:
- `populate` works as intended - it starts, then the drop column happens during populate/insert, then an exception is raised and intentionally ignored in the test, so there is no `Finish populate DB` for the 50 x 1490 records - expected
- drop column works as intended - interrupts `populate` and proceeds to flush
- flush **probably** works as intended - logs are consistent with what we expect and what I got in local test runs
- `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start

Migrating the test to this repo will also give us test start and end times on CI machines, in the SQL report database, which records start and end timestamps for each executed test. We will be able to see how long the test usually takes when it succeeds; this cannot be seen from the logs, because logs are not kept for successful tests.

This PR also adds a log message at the end of `database::flush_all_tables`. It will tell us whether a thread got stuck inside or finished successfully, addressing the **probably** part of the flush analysis above. If the issue reoccurs, we will have more information.

The test `test_large_partition_with_add_column` has not been executed for ~5 years. It was never migrated to pytest: it was left under the name `large_partition_with_add_column_test` and skipped. Now it is updated and enabled.

Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column`
receive small performance improvements:
- Regex compilation extracted from the stress function to the module level, to avoid recompilation.
- Do not materialize a list in the `stress_object` for loop; use a generator expression.
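The two tweaks can be illustrated with a minimal sketch (hypothetical pattern and function names, not the actual test code):

```python
import re

# Compiled once at module import instead of on every call to the stress helper.
RESULT_LINE = re.compile(r"^\s*([^:]+?)\s*:\s*(\S.*)$")

def parse_results(lines):
    """Parse "key : value" result lines; the generator expression avoids
    materializing an intermediate list of match objects."""
    matches = (RESULT_LINE.match(line) for line in lines)  # generator, not list
    return [(m.group(1), m.group(2)) for m in matches if m]

print(parse_results(["op rate : 1,234", "latency : 2.5"]))
# [('op rate', '1,234'), ('latency', '2.5')]
```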

The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the conditions that triggered a bug fixed around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10-second sleep
- there are too many reads - most of them are executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time
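The stop-event mechanism can be sketched like this (hypothetical names; the real test stops both `populate` and `read` this way):

```python
import itertools
import threading
import time

stop_populating = threading.Event()
written = []

def populate(max_ops):
    """Lazily generate write data; stop when signalled or at max_ops."""
    for i in itertools.count():        # lazy - nothing is materialized up front
        if stop_populating.is_set() or i >= max_ops:
            break
        written.append(i)              # stand-in for one INSERT

t = threading.Thread(target=populate, args=(10**9,))
t.start()
time.sleep(0.05)                       # the rest of the test logic runs here
stop_populating.set()                  # e.g. 1 second after the node flush
t.join()
print(len(written) < 10**9)  # True - stopped long before the max
```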

Test execution time has been reduced severalfold: on a dev machine, the tests' runtime dropped
from 110 seconds to 34 seconds.

scylla-dtest PR that removes migrated tests:
[schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427)

Fixes #26932

This is a migration of existing tests to this repository. No need for backport.

Closes scylladb/scylladb#27106

* github.com:scylladb/scylladb:
  test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests
  test: dtest: schema_management_test.py: fix large partition add column test
  test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare`
  test: dtest: schema_management_test.py: test enhancements
  test: dtest: schema_management_test.py: make the tests work
  test: dtest: migrate setup and tools from dtest
  test: dtest: copy unmodified schema_management_test.py
  replica: database: flush_all_tables log on completion
2025-12-19 12:30:00 +02:00
Botond Dénes
093e97a539 Merge 'test: increase num of requests in driver_service_level tests' from Andrzej Jackowski
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
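Why raising the request count helps can be shown with purely illustrative numbers (the patch raises the count from 1000 to 3000; the noise figure below is hypothetical):

```python
n_requests = 3000    # raised from 1000 by this patch
noise_tasks = 1200   # hypothetical background tasks on the "unused" group

# The flaky assertion: the unused group must rise by less than the number
# of requests. Raising n_requests enlarges the noise budget.
old_pass = noise_tasks < 1000        # would have failed before the patch
new_pass = noise_tasks < n_requests  # passes after the patch
print(old_pass, new_pass)  # False True
```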

Additionally, this commit extends logging to make future investigation
easier.

Fixes: https://github.com/scylladb/scylladb/issues/27715

No backport, fix for test on master.

Closes scylladb/scylladb#27735

* github.com:scylladb/scylladb:
  test: remove unused `get_processed_tasks_for_group`
  test: increase num of requests in driver_service_level tests
2025-12-19 10:54:14 +02:00
Emil Maskovsky
fa6e5d0754 test/random_failures: fix handling of banned notification
After 39cec4a, a node join may fail with either the "init - Startup failed"
notification or, occasionally, a ban notification, depending on timing.

The change updates the test to handle both cases.

Fixes: scylladb/scylladb#27697

No backport: This failure is only present in master.

Closes scylladb/scylladb#27768
2025-12-19 09:55:31 +02:00
Emil Maskovsky
08518b2c12 test/raft: fix test_joining_old_node_fails flakiness
When a node without the required feature attempts to join a Raft-based
cluster with the feature enabled, there is a race between the join
rejection response ("Feature check failed") and the ban notification
("received notification of being banned"). Depending on timing, either
message may appear in the joining node's log.

This starts to happen after 39cec4a (which introduced informing the
nodes about being banned).

Updated the test to accept both error messages as valid, making the test
robust against this race condition, which is more likely in debug mode
or under slow execution.

Fixes: scylladb/scylladb#27603

No backport: This failure is only present in master.

Closes scylladb/scylladb#27760
2025-12-19 09:44:09 +02:00
Emil Maskovsky
2a75b1374e test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
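Seastar's `sleep_abortable()` bundles the timer and the abort signal into one awaitable, so no hand-rolled promise can outlive its owner. A rough asyncio analogue (illustrative only, not Seastar code):

```python
import asyncio

async def sleep_abortable(delay: float, abort: asyncio.Event) -> bool:
    """Sleep for `delay` seconds, returning early if `abort` is set.
    Returns True if the full delay elapsed, False if aborted."""
    try:
        await asyncio.wait_for(abort.wait(), timeout=delay)
        return False  # abort was signalled before the delay elapsed
    except asyncio.TimeoutError:
        return True   # full delay elapsed, nobody aborted

async def main():
    abort = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, abort.set)
    completed = await sleep_abortable(5.0, abort)
    print(completed)  # False - aborted long before the 5s elapsed

asyncio.run(main())
```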

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737
2025-12-19 09:42:19 +02:00
Łukasz Paszkowski
2cb9bb8f3a test_user_writes_rejection: Disable speculative retries
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes

By default, tables use speculative retries to determine when coordinators may
query additional replicas.

Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
  already contains data, making it appear as if data reached the critically
  utilized node
- the second check may fail if the additional request is sent to the critically
  utilized node, making it appear as if data did not reach the healthy node

The patch fixes the flakiness by disabling speculative retries.
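Disabling speculative retries is a one-line per-table setting; a sketch of the kind of statement involved (hypothetical keyspace/table names; executing it needs a live driver session, e.g. `session.execute(stmt)`):

```python
# 'NONE' turns off the coordinator's extra replica requests that made the
# CL=ONE validation checks flaky.
stmt = "ALTER TABLE ks.tbl WITH speculative_retry = 'NONE'"
print(stmt)
```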

Fixes https://github.com/scylladb/scylladb/issues/27212

Closes scylladb/scylladb#27488
2025-12-19 09:39:09 +02:00
Dario Mirovic
f1d63d014c test: dtest: schema_management_test.py: speed up TestLargePartitionAlterSchema tests
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the conditions that triggered a bug fixed around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10-second sleep
- there are too many reads - most of them are executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold: on a dev machine, the tests' runtime dropped
from 110 seconds to 34 seconds.

The patch also introduces a few small improvements:
- `cs_run` renamed to `run_stress` for clarity
- Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use
- `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly
- Added explanation comment on why we do `data[i].append(None)`
- Replaced `alter_table` inner function with its body, for simplicity
- Removed unnecessary `ck_rows` variable in `populate`
- Removed unnecessary `isinstance(self.cluster, ScyllaCluster)`
- Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed
- Replaced functional-style expressions for `new_versions` and `columns_list` with
  Python comprehensions/generator expressions, improving readability
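The style change in the last bullet, illustrated with made-up data (the real `new_versions`/`columns_list` expressions live in the test):

```python
versions = [3, 1, 4, 1, 5]

# Functional style (before):
new_versions_func = list(map(lambda v: v + 1, filter(lambda v: v > 1, versions)))

# Comprehension style (after): same result, easier to read.
new_versions = [v + 1 for v in versions if v > 1]

print(new_versions == new_versions_func)  # True
```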

Refs #26932

2025-12-18 17:07:27 +01:00
Michael Litvak
33f7bc28da docs: document restrictions of colocated tables
Currently, some operations are not supported for colocated tables: it is not
possible to repair a colocated table, and consequently it is also not
possible to use the `tombstone_gc=repair` mode on a colocated table.

Extend the documentation to explain what colocated tables are and
document these restrictions.

Fixes scylladb/scylladb#27261

Closes scylladb/scylladb#27516
2025-12-18 15:38:29 +01:00
Dario Mirovic
f831ca5ab5 test: dtest: schema_management_test.py: fix large partition add column test
`large_partition_with_add_column_test` and `large_partition_with_drop_column_test`
were added on August 17th, 2020 in scylladb/scylla-dtest#1569.

Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed
to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051.
Since then this test has not been running.

This patch fixes it - the test is updated and renamed and the testing environment
now properly picks it up.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
1fe0509a9b test: dtest: schema_management_test.py: add TestSchemaManagement.prepare
Extract repeated cluster initialization code in `TestSchemaManagement`
into a separate `prepare` method. It holds all the common code for
cluster preparation, with just the necessary parameters.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
e7d76fd8f3 test: dtest: schema_management_test.py: test enhancements
Extract regex compilation from the stress functions to the module level,
to avoid unnecessary regex compilation repetition.

Add descriptions to the stress functions.

Do not materialize a list in the `stress_object` for loop; use a generator expression.

Make `_set_stress_val` an object method.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
700853740d test: dtest: schema_management_test.py: make the tests work
Remove unused function markers.
Add `wait_other_notice=True` to the cluster start method in the
`TestSchemaManagement.prepare` function to make the test stable.

Enable the test in suite.yaml for dev and debug modes.

Fixes #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
3c5dd5e5ae test: dtest: migrate setup and tools from dtest
Migrate several functionalities from dtest. These will be used by
the schema_management_test.py tests when they are enabled.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
5971b2ad97 test: dtest: copy unmodified schema_management_test.py
Copy schema_management_test.py from scylla-dtest to
test/cluster/dtest/schema_management_test.py.

Add license header.

Disable it for debug, dev, and release mode.

Refs #26932
2025-12-18 12:54:42 +01:00
Dario Mirovic
f89315d02f replica: database: flush_all_tables log on completion
In `database::flush_all_tables`, add a log message on completion.
This slightly improves the readability of logs when debugging an issue.

Refs #26932
2025-12-18 12:54:42 +01:00
Andrzej Jackowski
6ad10b141a test: remove unused get_processed_tasks_for_group
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
2025-12-17 20:45:53 +01:00
Andrzej Jackowski
8cf8e6c87d test: increase num of requests in driver_service_level tests
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: scylladb/scylladb#27715
2025-12-17 20:45:48 +01:00
18 changed files with 875 additions and 51 deletions

View File

@@ -1043,6 +1043,8 @@ The following modes are available:
* - ``immediate``
- Tombstone GC is immediately performed. There is no wait time or repair requirement. This mode is useful for a table that uses the TWCS compaction strategy with no user deletes. After data is expired after TTL, ScyllaDB can perform compaction to drop the expired data immediately.
.. warning:: The ``repair`` mode is not supported for :term:`Colocated Tables <Colocated Table>` in this version.
.. _cql-per-table-tablet-options:
Per-table tablet options

View File

@@ -25,8 +25,7 @@ Getting Started
:id: "getting-started"
:class: my-panel
* `Install ScyllaDB (Binary Packages, Docker, or EC2) <https://www.scylladb.com/download/#core>`_ - Links to the ScyllaDB Download Center
* :doc:`Install ScyllaDB </getting-started/install-scylla/index/>`
* :doc:`Configure ScyllaDB </getting-started/system-configuration/>`
* :doc:`Run ScyllaDB in a Shared Environment </getting-started/scylla-in-a-shared-environment>`
* :doc:`Create a ScyllaDB Cluster - Single Data Center (DC) </operating-scylla/procedures/cluster-management/create-cluster/>`

View File

@@ -3,8 +3,7 @@
ScyllaDB Housekeeping and how to disable it
============================================
It is always recommended to run the latest version of ScyllaDB.
The latest stable release version is always available from the `Download Center <https://www.scylladb.com/download/>`_.
It is always recommended to run the latest stable version of ScyllaDB.
When you install ScyllaDB, it installs by default two services: **scylla-housekeeping-restart** and **scylla-housekeeping-daily**. These services check for the latest ScyllaDB version and prompt the user if they are using a version that is older than what is publicly available.
Information about your ScyllaDB deployment, including the ScyllaDB version currently used, as well as unique user and server identifiers, are collected by a centralized service.

View File

@@ -9,6 +9,8 @@ Running ``cluster repair`` on a **single node** synchronizes all data on all nod
To synchronize all data in clusters that have both tablets-based and vnodes-based keyspaces, run :doc:`nodetool repair -pr </operating-scylla/nodetool-commands/repair/>` on **all**
of the nodes in the cluster, and :doc:`nodetool cluster repair </operating-scylla/nodetool-commands/cluster/repair/>` on **any** of the nodes in the cluster.
.. warning:: :term:`Colocated Tables <Colocated Table>` cannot be synchronized using cluster repair in this version.
To check if a keyspace enables tablets, use:
.. code-block:: cql

View File

@@ -202,3 +202,7 @@ Glossary
The name comes from two basic operations, multiply (MU) and rotate (R), used in its inner loop.
The MurmurHash3 version used in ScyllaDB originated from `Apache Cassandra <https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/digest/MurmurHash3.html>`_, and is **not** identical to the `official MurmurHash3 calculation <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/MurmurHash.java#L31-L33>`_. More `here <https://github.com/russss/murmur3-cassandra>`_.
Colocated Table
An internal table of a special type in a :doc:`tablets </architecture/tablets>` enabled keyspace that is colocated with another base table, meaning it always has the same tablet replicas as the base table.
Current types of colocated tables include CDC log tables, local indexes, and materialized views that have the same partition key as their base table.

View File

@@ -2793,6 +2793,7 @@ future<> database::flush_all_tables() {
});
_all_tables_flushed_at = db_clock::now();
co_await _commitlog->wait_for_pending_deletes();
dblog.info("Forcing new commitlog segment and flushing all tables complete");
}
future<db_clock::time_point> database::get_all_tables_flushed_at(sharded<database>& sharded_db) {

View File

@@ -604,18 +604,14 @@ async def test_driver_service_creation_failure(manager: ManagerClient) -> None:
service_level_names = [sl.service_level for sl in service_levels]
assert "driver" not in service_level_names
def get_processed_tasks_for_group(metrics, group):
res = metrics.get("scylla_scheduler_tasks_processed", {'group': group})
if res is None:
return 0
return res
@pytest.mark.asyncio
async def _verify_tasks_processed_metrics(manager, server, used_group, unused_group, func):
number_of_requests = 1000
number_of_requests = 3000
def get_processed_tasks_for_group(metrics, group):
res = metrics.get("scylla_scheduler_tasks_processed", {'group': group})
logger.info(f"group={group}, tasks_processed={res}")
if res is None:
return 0
return res
@@ -627,8 +623,10 @@ async def _verify_tasks_processed_metrics(manager, server, used_group, unused_gr
await asyncio.gather(*[asyncio.to_thread(func) for i in range(number_of_requests)])
metrics = await manager.metrics.query(server.ip_addr)
assert get_processed_tasks_for_group(metrics, used_group) - initial_tasks_processed_by_used_group > number_of_requests
assert get_processed_tasks_for_group(metrics, unused_group) - initial_tasks_processed_by_unused_group < number_of_requests
tasks_processed_by_used_group = get_processed_tasks_for_group(metrics, used_group)
tasks_processed_by_unused_group = get_processed_tasks_for_group(metrics, unused_group)
assert tasks_processed_by_used_group - initial_tasks_processed_by_used_group > number_of_requests
assert tasks_processed_by_unused_group - initial_tasks_processed_by_unused_group < number_of_requests
@pytest.mark.asyncio
async def test_driver_service_level_not_used_for_user_queries(manager: ManagerClient) -> None:

View File

@@ -52,6 +52,18 @@ KNOWN_LOG_LEVELS = {
"OFF": "info",
}
# Captures the aggregate metric before the "[READ ..., WRITE ...]" block.
STRESS_SUMMARY_PATTERN = re.compile(r'^\s*([\d\.\,]+\d?)\s*\[.*')
# Extracts the READ metric number inside the "[READ ..., WRITE ...]" block.
STRESS_READ_PATTERN = re.compile(r'.*READ:\s*([\d\.\,]+\d?)[^\d].*')
# Extracts the WRITE metric number inside the "[READ ..., WRITE ...]" block.
STRESS_WRITE_PATTERN = re.compile(r'.*WRITE:\s*([\d\.\,]+\d?)[^\d].*')
# Splits a "key : value" line into key and value.
STRESS_KEY_VALUE_PATTERN = re.compile(r'^\s*([^:]+)\s*:\s*(\S.*)\s*$')
class NodeError(Exception):
def __init__(self, msg: str, process: int | None = None):
@@ -528,6 +540,15 @@ class ScyllaNode:
return self.cluster.manager.server_get_workdir(server_id=self.server_id)
def stress(self, stress_options: list[str], **kwargs):
"""
Run `cassandra-stress` against this node.
This method does not do any result parsing.
:param stress_options: List of options to pass to `cassandra-stress`.
:param kwargs: Additional arguments to pass to `subprocess.Popen()`.
:return: Named tuple with `stdout`, `stderr`, and `rc` (return code).
"""
cmd_args = ["cassandra-stress"] + stress_options
if not any(opt in cmd_args for opt in ("-d", "-node", "-cloudconf")):
@@ -549,6 +570,73 @@ class ScyllaNode:
except KeyboardInterrupt:
pass
def _set_stress_val(self, key, val, res):
"""
Normalize a stress result string and populate aggregate/read/write metrics.
Removes comma-thousands separators from numbers, converts to float,
stores the aggregate metric under `key`.
If the value contains a "[READ ..., WRITE ...]" block, also stores the
read and write metrics under `key:read` and `key:write`.
:param key: The metric name
:param val: The metric value string
:param res: The dictionary to populate
"""
def parse_num(s):
return float(s.replace(',', ''))
if "[" in val:
p = STRESS_SUMMARY_PATTERN
m = p.match(val)
if m:
res[key] = parse_num(m.group(1))
p = STRESS_READ_PATTERN
m = p.match(val)
if m:
res[key + ":read"] = parse_num(m.group(1))
p = STRESS_WRITE_PATTERN
m = p.match(val)
if m:
res[key + ":write"] = parse_num(m.group(1))
else:
try:
res[key] = parse_num(val)
except ValueError:
res[key] = val
def stress_object(self, stress_options=None, ignore_errors=None, **kwargs):
"""
Run stress test and return results as a structured metrics dictionary.
Runs `stress()`, finds the `Results:` section in `stdout`, and then
processes each `key : value` line, putting it into a dictionary.
:param stress_options: List of stress options to pass to `stress()`.
:param ignore_errors: Deprecated (no effect).
:param kwargs: Additional arguments to pass to `stress()`.
:return: Dictionary of stress test results.
"""
if ignore_errors:
self.warning("passing `ignore_errors` to stress_object() is deprecated")
ret = self.stress(stress_options, **kwargs)
p = STRESS_KEY_VALUE_PATTERN
res = {}
start = False
for line in (s.strip() for s in ret.stdout.splitlines()):
if start:
m = p.match(line)
if m:
self._set_stress_val(m.group(1).strip().lower(), m.group(2).strip(), res)
else:
if line == 'Results:':
start = True
return res
def flush(self, ks: str | None = None, table: str | None = None, **kwargs) -> None:
cmd = ["flush"]
if ks:

View File

@@ -0,0 +1,690 @@
#
# Copyright (C) 2015-present The Apache Software Foundation
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import functools
import logging
import string
import threading
import time
from concurrent import futures
from typing import NamedTuple
import pytest
from cassandra import AlreadyExists, ConsistencyLevel, InvalidRequest
from cassandra.concurrent import execute_concurrent_with_args
from cassandra.query import SimpleStatement, dict_factory
from concurrent.futures import ThreadPoolExecutor
from dtest_class import Tester, create_cf, create_ks, read_barrier
from tools.assertions import assert_all, assert_invalid
from tools.cluster_topology import generate_cluster_topology
from tools.data import create_c1c2_table, insert_c1c2, query_c1c2, rows_to_list
logger = logging.getLogger(__name__)
class TestSchemaManagement(Tester):
def prepare(self, racks_num: int, has_config: bool = True):
cluster = self.cluster
cluster_topology = generate_cluster_topology(rack_num=racks_num)
if has_config:
config = {
"ring_delay_ms": 5000,
}
cluster.set_configuration_options(values=config)
cluster.populate(cluster_topology)
cluster.start(wait_other_notice=True)
return cluster
def test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns(self):
cluster = self.prepare(racks_num=3)
[node1, node2, node3] = cluster.nodelist()
session = self.patient_cql_connection(node1)
logger.debug("Creating schema...")
create_ks(session, "ks", 3)
session.execute(
"""
CREATE TABLE users (
id int,
firstname text,
lastname text,
PRIMARY KEY (id)
);
"""
)
insert_statement = session.prepare("INSERT INTO users (id, firstname, lastname) VALUES (?, 'A', 'B')")
insert_statement.consistency_level = ConsistencyLevel.ALL
session.execute(insert_statement, [0])
logger.debug("Altering schema")
session.execute("ALTER TABLE users WITH comment = 'updated'")
logger.debug("Restarting node2")
node2.stop(gently=True)
node2.start(wait_for_binary_proto=True)
logger.debug("Restarting node3")
node3.stop(gently=True)
node3.start(wait_for_binary_proto=True, wait_other_notice=True)
n_partitions = 20
for i in range(n_partitions):
session.execute(insert_statement, [i])
rows = session.execute("SELECT * FROM users")
res = sorted(rows)
assert len(res) == n_partitions
for i in range(n_partitions):
expected = [i, "A", "B"]
assert list(res[i]) == expected, f"Expected {expected}, got {res[i]}"
def test_dropping_keyspace_with_many_columns(self):
"""
Exploits https://github.com/scylladb/scylla/issues/1484
"""
cluster = self.prepare(racks_num=1, has_config=False)
node1 = cluster.nodelist()[0]
session = self.patient_cql_connection(node1)
session.execute("CREATE KEYSPACE testxyz WITH replication = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1 }")
for i in range(8):
session.execute(f"CREATE TABLE testxyz.test_{i} (k int, c int, PRIMARY KEY (k),)")
session.execute("drop keyspace testxyz")
for node in cluster.nodelist():
s = self.patient_cql_connection(node)
s.execute("CREATE KEYSPACE testxyz WITH replication = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1 }")
s.execute("drop keyspace testxyz")
def test_multiple_create_table_in_parallel(self):
"""
Run multiple create table statements via different nodes
1. Create a cluster of 3 nodes
2. Run create table with different table names in parallel - check all complete
3. Run create table with the same table name in parallel - check if they complete
"""
logger.debug("1. Create a cluster of 3 nodes")
nodes_count = 3
cluster = self.prepare(racks_num=nodes_count)
sessions = [self.patient_exclusive_cql_connection(node) for node in cluster.nodelist()]
ks = "ks"
create_ks(sessions[0], ks, nodes_count)
def create_table(session, table_name):
create_statement = f"CREATE TABLE {ks}.{table_name} (p int PRIMARY KEY, c0 text, c1 text, c2 text, c3 text, c4 text, c5 text, c6 text, c7 text, c8 text, c9 text);"
logger.debug(f"create_statement {create_statement}")
session.execute(create_statement)
logger.debug("2. Run create table with different table names in parallel - check all complete")
step2_tables = [f"t{i}" for i in range(nodes_count)]
with ThreadPoolExecutor(max_workers=nodes_count) as executor:
list(executor.map(create_table, sessions, step2_tables))
for table in step2_tables:
sessions[0].execute(SimpleStatement(f"INSERT INTO {ks}.{table} (p) VALUES (1)", consistency_level=ConsistencyLevel.ALL))
rows = sessions[0].execute(SimpleStatement(f"SELECT * FROM {ks}.{table}", consistency_level=ConsistencyLevel.ALL))
assert len(rows_to_list(rows)) == 1, f"Expected 1 row but got rows:{rows} instead"
logger.debug("3. Run create table with the same table name in parallel - check if they complete")
step3_table = "test"
step3_tables = [step3_table for i in range(nodes_count)]
with ThreadPoolExecutor(max_workers=nodes_count) as executor:
res_futures = [executor.submit(create_table, *args) for args in zip(sessions, step3_tables)]
for res_future in res_futures:
try:
res_future.result()
except AlreadyExists as e:
logger.info(f"expected cassandra.AlreadyExists error {e}")
sessions[0].execute(SimpleStatement(f"INSERT INTO {ks}.{step3_table} (p) VALUES (1)", consistency_level=ConsistencyLevel.ALL))
sessions[0].execute(f"SELECT * FROM {ks}.{step3_table}")
rows = sessions[0].execute(SimpleStatement(f"SELECT * FROM {ks}.{step3_table}", consistency_level=ConsistencyLevel.ALL))
assert len(rows_to_list(rows)) == 1, f"Expected 1 row but got rows:{rows} instead"
@pytest.mark.parametrize("case", ("write", "read", "mixed"))
def test_alter_table_in_parallel_to_read_and_write(self, case):
"""
Create a table and write into while altering the table
1. Create a cluster of 3 nodes and populate a table
2. Run write/read/read_and_write" statement in a loop
3. Alter table while inserts are running
"""
logger.debug("1. Create a cluster of 3 nodes and populate a table")
cluster = self.prepare(racks_num=3)
col_number = 20
[node1, node2, node3] = cluster.nodelist()
session = self.patient_exclusive_cql_connection(node1)
def run_stress(stress_type, col=col_number - 2):
node2.stress_object([stress_type, "n=10000", "cl=QUORUM", "-schema", "replication(factor=3)", "-col", f"n=FIXED({col})", "-rate", "threads=1"])
logger.debug("Populate")
run_stress("write", col_number)
with ThreadPoolExecutor(max_workers=1) as executor:
logger.debug(f"2. Run {case} statement in a loop")
statement_future = executor.submit(functools.partial(run_stress, case))
logger.debug(f"let's {case} statement work some time")
time.sleep(2)
logger.debug("3. Alter table while inserts are running")
alter_statement = f'ALTER TABLE keyspace1.standard1 DROP ("C{col_number - 1}", "C{col_number - 2}")'
logger.debug(f"alter_statement {alter_statement}")
alter_result = session.execute(alter_statement)
logger.debug(alter_result.all())
logger.debug(f"wait till {case} statement finished")
statement_future.result()
rows = session.execute(SimpleStatement("SELECT * FROM keyspace1.standard1 LIMIT 1;", consistency_level=ConsistencyLevel.ALL))
assert len(rows_to_list(rows)[0]) == col_number - 1, f"Expected {col_number - 1} columns but got rows:{rows} instead"
logger.debug("read and check data")
run_stress("read")
@pytest.mark.skip("unimplemented")
def commitlog_replays_after_schema_change(self):
"""
Commitlog can be replayed even though schema has been changed
1. Create a table and insert data
2. Alter table
3. Kill node
4. Boot node and verify that the commitlog has been replayed and that all data is restored
"""
raise NotImplementedError
@pytest.mark.parametrize("case", ("create_table", "alter_table", "drop_table"))
def test_update_schema_while_node_is_killed(self, case):
"""
Check that a node that is killed during a table creation/alter/drop is able to rejoin and to sync on schema
"""
logger.debug("1. Create a cluster and insert data")
cluster = self.prepare(racks_num=3)
[node1, node2, node3] = cluster.nodelist()
session = self.patient_cql_connection(node1)
def create_table_case():
try:
logger.debug("Creating table")
create_c1c2_table(session)
logger.debug("Populating")
insert_c1c2(session, n=10)
except AlreadyExists:
# the CQL command can be called multiple times in case of retries
pass
def alter_table_case():
try:
session.execute("ALTER TABLE ks.cf ADD (c3 text);", timeout=180)
except InvalidRequest as exc:
# the CQL command can be called multiple times in case of retries
assert "Invalid column name c3" in str(exc)
def drop_table_case():
try:
session.execute("DROP TABLE cf;", timeout=180)
except InvalidRequest as exc:
# the CQL command can be called multiple times in case of retries
assert "Cannot drop non existing table" in str(exc)
logger.debug("Creating keyspace")
create_ks(session, "ks", 3)
if case != "create_table":
create_table_case()
case_map = {
"create_table": create_table_case,
"alter_table": alter_table_case,
"drop_table": drop_table_case,
}
with ThreadPoolExecutor(max_workers=1) as executor:
logger.debug(f"2. kill node during {case}")
kill_node_future = executor.submit(node2.stop, gently=False, wait_other_notice=True)
case_map[case]()
kill_node_future.result()
logger.debug("3. Start the stopped node2")
node2.start(wait_for_binary_proto=True)
session = self.patient_exclusive_cql_connection(node2)
read_barrier(session)
def create_or_alter_table_expected_result(col_num):
rows = session.execute(SimpleStatement("SELECT * FROM ks.cf LIMIT 1;", consistency_level=ConsistencyLevel.QUORUM))
assert len(rows_to_list(rows)[0]) == col_num, f"Expected {col_num} columns but got rows:{rows} instead"
for key in range(10):
query_c1c2(session=session, key=key, consistency=ConsistencyLevel.QUORUM)
expected_case_result_map = {
"create_table": functools.partial(create_or_alter_table_expected_result, 3),
"alter_table": functools.partial(create_or_alter_table_expected_result, 4),
"drop_table": functools.partial(assert_invalid, session, "SELECT * FROM test1"),
}
logger.debug("verify that commitlog has been replayed and that all data is restored")
expected_case_result_map[case]()
@pytest.mark.parametrize("is_gently_stop", [True, False])
def test_nodes_rejoining_a_cluster_synch_on_schema(self, is_gently_stop):
"""
Nodes rejoining the cluster sync on schema changes
1. Create a cluster and insert data
2. Stop a node
3. Alter table
4. Insert additional data
5. Start the stopped node
6. Verify the stopped node syncs on the updated schema
"""
logger.debug("1. Create a cluster and insert data")
cluster = self.prepare(racks_num=3)
[node1, node2, node3] = cluster.nodelist()
session = self.patient_cql_connection(node1)
logger.debug("Creating schema")
create_ks(session, "ks", 3)
create_c1c2_table(session)
create_cf(session, "cf", key_name="p", key_type="int", columns={"v": "text"})
logger.debug("Populating")
insert_c1c2(session, n=10, consistency=ConsistencyLevel.ALL)
logger.debug("2. Stop node1")
node1.stop(gently=is_gently_stop, wait_other_notice=True)
logger.debug("3. Alter table")
session = self.patient_cql_connection(node2)
session.execute("ALTER TABLE ks.cf ADD (c3 text);", timeout=180)
logger.debug("4. Insert additional data")
session.execute(SimpleStatement("INSERT INTO ks.cf (key, c1, c2, c3) VALUES ('test', 'test', 'test', 'test')", consistency_level=ConsistencyLevel.QUORUM))
logger.debug("5. Start the stopped node1")
node1.start(wait_for_binary_proto=True)
logger.debug("6. Verify the stopped node syncs on the updated schema")
session = self.patient_exclusive_cql_connection(node1)
read_barrier(session)
rows = session.execute(SimpleStatement("SELECT * FROM ks.cf WHERE key='test'", consistency_level=ConsistencyLevel.ALL))
expected = [["test", "test", "test", "test"]]
assert rows_to_list(rows) == expected, f"Expected {expected} but got {rows} instead"
for key in range(10):
query_c1c2(session=session, key=key, consistency=ConsistencyLevel.ALL)
def test_reads_schema_recreated_while_node_down(self):
cluster = self.prepare(racks_num=3)
[node1, node2, node3] = cluster.nodelist()
session = self.patient_cql_connection(node1)
logger.debug("Creating schema")
create_ks(session, "ks", 3)
session.execute("CREATE TABLE cf (p int PRIMARY KEY, v text);")
logger.debug("Populating")
session.execute(SimpleStatement("INSERT INTO cf (p, v) VALUES (1, '1')", consistency_level=ConsistencyLevel.ALL))
logger.debug("Stopping node2")
node2.stop(gently=True)
logger.debug("Re-creating schema")
session.execute("DROP TABLE cf;")
session.execute("CREATE TABLE cf (p int PRIMARY KEY, v1 bigint, v2 text);")
logger.debug("Restarting node2")
node2.start(wait_for_binary_proto=True)
session2 = self.patient_cql_connection(node2)
read_barrier(session2)
rows = session.execute(SimpleStatement("SELECT * FROM cf", consistency_level=ConsistencyLevel.ALL))
assert rows_to_list(rows) == [], f"Expected an empty result set, got {rows}"
def test_writes_schema_recreated_while_node_down(self):
cluster = self.prepare(racks_num=3)
[node1, node2, node3] = cluster.nodelist()
session = self.patient_cql_connection(node1)
logger.debug("Creating schema")
create_ks(session, "ks", 3)
session.execute("CREATE TABLE cf (p int PRIMARY KEY, v text);")
logger.debug("Populating")
session.execute(SimpleStatement("INSERT INTO cf (p, v) VALUES (1, '1')", consistency_level=ConsistencyLevel.ALL))
logger.debug("Stopping node2")
node2.stop(gently=True, wait_other_notice=True)
logger.debug("Re-creating schema")
session.execute("DROP TABLE cf;")
session.execute("CREATE TABLE cf (p int PRIMARY KEY, v text);")
logger.debug("Restarting node2")
node2.start(wait_for_binary_proto=True)
session2 = self.patient_cql_connection(node2)
read_barrier(session2)
session.execute(SimpleStatement("INSERT INTO cf (p, v) VALUES (2, '2')", consistency_level=ConsistencyLevel.ALL))
rows = session.execute(SimpleStatement("SELECT * FROM cf", consistency_level=ConsistencyLevel.ALL))
expected = [[2, "2"]]
assert rows_to_list(rows) == expected, f"Expected {expected}, got {rows_to_list(rows)}"
class TestLargePartitionAlterSchema(Tester):
# Issue scylladb/scylla#5135:
#
# Issue: Cache reads may miss some writes if schema alter followed by a read happened concurrently with preempted
# partition entry update
# Affects only tables with multi-row partitions, which are the only ones that can experience the update of partition
# entry being preempted.
#
# The scenario in which the problem could have happened has to involve:
# - a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
# - appending writes to the partition (not overwrites)
# - scans of the partition
# - schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
# column lands in the middle (in alphabetical order) of the old column set.
#
# Memtable flush has to happen after a schema alter concurrently with a read.
#
# The bug could result in cache corruption which manifests as some past writes being missing (not visible to reads).
PARTITIONS = 50
STRING_VALUE = string.ascii_lowercase
def prepare(self, cluster_topology: dict[str, dict[str, int]], rf: int):
if not self.cluster.nodelist():
self.cluster.populate(cluster_topology)
self.cluster.start(wait_other_notice=True)
node1 = self.cluster.nodelist()[0]
session = self.patient_cql_connection(node=node1)
self.create_schema(session=session, rf=rf)
return session
def create_schema(self, session, rf):
logger.debug("Creating schema")
create_ks(session=session, name="ks", rf=rf)
session.execute(
"""
CREATE TABLE lp_table (
pk int,
ck1 int,
val1 text,
val2 text,
PRIMARY KEY (pk, ck1)
);
"""
)
def populate(self, session, data, ck_start, ck_end=None, stop_populating: threading.Event = None):
ck = ck_start
def _populate_loop():
nonlocal ck
while True:
if stop_populating is not None and stop_populating.is_set():
return
if ck_end is not None and ck >= ck_end:
return
for pk in range(self.PARTITIONS):
row = [pk, ck, self.STRING_VALUE, self.STRING_VALUE]
data.append(row)
yield tuple(row)
ck += 1
logger.debug(f"Start populate DB: {self.PARTITIONS} partitions with {ck_end - ck_start if ck_end else 'infinite'} records in each partition")
parameters = _populate_loop()
stmt = session.prepare("INSERT INTO lp_table (pk, ck1, val1, val2) VALUES (?, ?, ?, ?)")
execute_concurrent_with_args(session=session, statement=stmt, parameters=parameters, concurrency=100)
records_written = ck - ck_start
logger.debug(f"Finish populate DB: {self.PARTITIONS} partitions with {records_written} records in each partition")
return data
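populate() above counts written records via the nonlocal ck that the generator mutates; that count is only meaningful once execute_concurrent_with_args has fully consumed the generator. A minimal stdlib-only sketch (hypothetical names) of the pattern:

```python
# A counter updated via `nonlocal` inside a generator only reflects the
# produced rows after the generator has been fully consumed -- here list()
# stands in for the driver's execute_concurrent_with_args().
def make_rows(n):
    produced = 0

    def gen():
        nonlocal produced
        for i in range(n):
            produced += 1
            yield (i,)

    return gen(), lambda: produced

rows, produced_count = make_rows(5)
assert produced_count() == 0   # nothing consumed yet
consumed = list(rows)          # consume the generator
assert produced_count() == 5   # count is accurate only after consumption
```

This is why the record count must be read after the concurrent execution call, not before it.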
def read(self, session, ck_max, stop_reading: threading.Event = None):
def _read_loop():
while True:
for ck in range(ck_max):
for pk in range(self.PARTITIONS):
if stop_reading is not None and stop_reading.is_set():
return
session.execute(f"select * from lp_table where pk = {pk} and ck1 = {ck}")
if stop_reading is None:
return
logger.debug("Start reading...")
_read_loop()
logger.debug("Finish reading...")
def add_column(self, session, column_name, column_type):
logger.debug(f"Add {column_name} column")
session.execute(f"ALTER TABLE lp_table ADD {column_name} {column_type}")
def drop_column(self, session, column_name):
logger.debug(f"Drop {column_name} column")
session.execute(f"ALTER TABLE lp_table DROP {column_name}")
def test_large_partition_with_add_column(self):
cluster_topology = generate_cluster_topology()
session = self.prepare(cluster_topology, rf=1)
data = self.populate(session=session, data=[], ck_start=0, ck_end=10)
threads = []
timeout = 300
ck_end = 5000
if self.cluster.scylla_mode == "debug":
timeout = 900
ck_end = 500
with ThreadPoolExecutor(max_workers=2) as executor:
stop_populating = threading.Event()
stop_reading = threading.Event()
# Insert new rows in background
threads.append(executor.submit(self.populate, session=session, data=data, ck_start=10, ck_end=None, stop_populating=stop_populating))
threads.append(executor.submit(self.read, session=session, ck_max=ck_end, stop_reading=stop_reading))
# Wait for running load
time.sleep(10)
self.add_column(session, "new_clmn", "int")
# Memtable flush has to happen after a schema alter concurrently with a read
logger.debug("Flush data")
self.cluster.nodelist()[0].flush()
# Stop populating and reading soon after flush
time.sleep(1)
logger.debug("Stop populating and reading")
stop_populating.set()
stop_reading.set()
for future in futures.as_completed(threads, timeout=timeout):
try:
future.result()
except Exception as exc: # noqa: BLE001
pytest.fail(f"Generated an exception: {exc}")
# Add 'null' values for the new column `new_clmn` in the expected data
for i, _ in enumerate(data):
data[i].append(None)
assert_all(session, "select pk, ck1, val1, val2, new_clmn from lp_table", data, ignore_order=True, print_result_on_failure=False)
def test_large_partition_with_drop_column(self):
cluster_topology = generate_cluster_topology()
session = self.prepare(cluster_topology, rf=1)
data = self.populate(session=session, data=[], ck_start=0, ck_end=10)
threads = []
timeout = 300
ck_end = 5000
if self.cluster.scylla_mode == "debug":
timeout = 900
ck_end = 500
with ThreadPoolExecutor(max_workers=2) as executor:
stop_populating = threading.Event()
stop_reading = threading.Event()
# Insert new rows in background
threads.append(executor.submit(self.populate, session=session, data=data, ck_start=10, ck_end=None, stop_populating=stop_populating))
threads.append(executor.submit(self.read, session=session, ck_max=ck_end, stop_reading=stop_reading))
# Wait for running load
time.sleep(10)
self.drop_column(session=session, column_name="val1")
# Memtable flush has to happen after a schema alter concurrently with a read
logger.debug("Flush data")
self.cluster.nodelist()[0].flush()
# Stop populating and reading soon after flush
time.sleep(1)
logger.debug("Stop populating and reading")
stop_populating.set()
stop_reading.set()
result = []
for future in futures.as_completed(threads, timeout=timeout):
try:
result.append(future.result())
except Exception as exc: # noqa: BLE001
# "Unknown identifier val1" is expected error
if not len(exc.args) or "Unknown identifier val1" not in exc.args[0]:
pytest.fail(f"Generated an exception: {exc}")
class HistoryVerifier:
def __init__(self, table_name="table1", keyspace_name="lwt_load_ks"):
"""
Initialize parameters for further verification of schema history.
:param table_name: the table whose schema we change; schema history is verified accordingly.
:param keyspace_name: the keyspace containing the table.
"""
self.table_name = table_name
self.keyspace_name = keyspace_name
self.versions = []
self.versions_dict = {}
self.query = ""
def verify(self, session, expected_current_diff, expected_prev_diff, query):
"""
Verify current schema history entry by comparing to previous schema entry.
:param session: python cql session
:param expected_current_diff: difference of current schema from previous schema
:param expected_prev_diff: difference of previous schema from current schema
:param query: The query that created new schema
"""
def get_table_id(session, keyspace_name, table_name):
assert keyspace_name, f"Input keyspace_name should have a value, keyspace_name={keyspace_name}"
assert table_name, f"Input table_name should have value, table_name={table_name}"
query = "select keyspace_name,table_name,id from system_schema.tables"
query += f" WHERE keyspace_name='{keyspace_name}' AND table_name='{table_name}'"
current_rows = session.execute(query).current_rows
assert len(current_rows) == 1, f"Not found table description, ks={keyspace_name} table_name={table_name}"
res = current_rows[0]
return res["id"]
def read_schema_history_table(session, cf_id):
"""
Read system.scylla_table_schema_history and verify the current version's diff from the previous version.
:param session: python cql session
:param cf_id: uuid of the table whose schema we changed
"""
query = f"select * from system.scylla_table_schema_history WHERE cf_id={cf_id}"
res = session.execute(query).current_rows
new_versions = list({
entry["schema_version"]
for entry in res
if str(entry["schema_version"]) not in self.versions
})
msg = f"Expect 1, got len(new_versions)={len(new_versions)}"
assert len(new_versions) == 1, msg
current_version = str(new_versions[0])
logger.debug(f"New schema_version {current_version} after executing '{self.query}'")
columns_list = (
{"column_name": entry["column_name"], "type": entry["type"]}
for entry in res
if entry["kind"] == "regular" and current_version == str(entry["schema_version"])
)
self.versions_dict[current_version] = {}
for item in columns_list:
self.versions_dict[current_version][item["column_name"]] = item["type"]
self.versions.append(current_version)
if len(self.versions) > 1:
current_id = self.versions[-1]
previous_id = self.versions[-2]
set_current = set(self.versions_dict[current_id].items())
set_previous = set(self.versions_dict[previous_id].items())
current_diff = set_current - set_previous
previous_diff = set_previous - set_current
msg1 = f"Expect diff(new schema,old schema) to be {expected_current_diff} got {current_diff}"
msg2 = f" query is '{self.query}' versions={current_id},{previous_id}"
if current_diff != expected_current_diff:
logger.debug(msg1 + msg2)
assert current_diff == expected_current_diff, msg1 + msg2
msg1 = f"Expect diff(old schema,new schema) to be {expected_prev_diff} got {previous_diff}"
assert previous_diff == expected_prev_diff, msg1 + msg2
self.query = query
cf_id = get_table_id(session, keyspace_name=self.keyspace_name, table_name=self.table_name)
read_schema_history_table(session, cf_id)
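The diff check in verify() boils down to symmetric set differences over (column_name, type) pairs. A minimal standalone sketch, using hypothetical before/after column sets:

```python
# Hypothetical column sets, as {column_name: type} dicts, before and after
# an "ALTER TABLE ... ALTER v TYPE varint".
old_columns = {"v": "int", "int_col": "int"}
new_columns = {"v": "varint", "int_col": "int"}

# Compare as sets of (name, type) pairs, as verify() does.
current_diff = set(new_columns.items()) - set(old_columns.items())
previous_diff = set(old_columns.items()) - set(new_columns.items())

print(current_diff)   # columns added or retyped in the new schema
print(previous_diff)  # the old definitions they replaced
```

A type change thus shows up on both sides of the comparison ({("v", "varint")} vs. {("v", "int")}), while a pure ADD shows up only in current_diff and a pure DROP only in previous_diff.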
class DDL(NamedTuple):
ddl_command: str
expected_current_diff: set | None
expected_prev_diff: set | None
class TestSchemaHistory(Tester):
def prepare(self):
cluster = self.cluster
# in case tablets and rf-rack-valid keyspaces are supported,
# create a cluster with 3 racks with 1 node in each rack
cluster_topology = generate_cluster_topology(rack_num=3)
rf = 3
cluster.populate(cluster_topology).start(wait_other_notice=True)
self.session = self.patient_cql_connection(self.cluster.nodelist()[0], row_factory=dict_factory)
create_ks(self.session, "lwt_load_ks", rf)
def test_schema_history_alter_table(self):
"""test schema history changes following alter table cql commands"""
self.prepare()
verifier = HistoryVerifier(table_name="table2")
queries_and_expected_diffs = [
DDL(ddl_command="CREATE TABLE IF NOT EXISTS lwt_load_ks.table2 (pk int PRIMARY KEY, v int, int_col int)", expected_current_diff=None, expected_prev_diff=None),
DDL(ddl_command="ALTER TABLE lwt_load_ks.table2 ALTER v TYPE varint", expected_current_diff={("v", "varint")}, expected_prev_diff={("v", "int")}),
DDL(ddl_command="ALTER TABLE lwt_load_ks.table2 ADD (v2 int, v3 int)", expected_current_diff={("v2", "int"), ("v3", "int")}, expected_prev_diff=set()),
DDL(ddl_command="ALTER TABLE lwt_load_ks.table2 ALTER int_col TYPE varint", expected_current_diff={("int_col", "varint")}, expected_prev_diff={("int_col", "int")}),
DDL(ddl_command="ALTER TABLE lwt_load_ks.table2 DROP int_col", expected_current_diff=set(), expected_prev_diff={("int_col", "varint")}),
DDL(ddl_command="ALTER TABLE lwt_load_ks.table2 ADD int_col bigint", expected_current_diff={("int_col", "bigint")}, expected_prev_diff=set()),
DDL(ddl_command="ALTER TABLE lwt_load_ks.table2 DROP (int_col,v)", expected_current_diff=set(), expected_prev_diff={("int_col", "bigint"), ("v", "varint")}),
]
for ddl in queries_and_expected_diffs:
self.session.execute(ddl.ddl_command)
verifier.verify(self.session, ddl.expected_current_diff, ddl.expected_prev_diff, query=ddl.ddl_command)

View File

@@ -218,6 +218,18 @@ def assert_row_count_in_select_less(
assert count < max_rows_expected, f'Expected a row count < of {max_rows_expected} in query "{query}", but got {count}'
+def assert_length_equal(object_with_length, expected_length):
+"""
+Assert an object has a specific length.
+@param object_with_length The object whose length will be checked
+@param expected_length The expected length of the object
+Examples:
+assert_length_equal(res, nb_counter)
+"""
+assert len(object_with_length) == expected_length, f"Expected {object_with_length} to have length {expected_length}, but instead is of length {len(object_with_length)}"
def assert_lists_equal_ignoring_order(list1, list2, sort_key=None):
"""
asserts that the contents of the two provided lists are equal

View File

@@ -14,6 +14,7 @@ from cassandra.query import SimpleStatement
from cassandra.concurrent import execute_concurrent_with_args
from test.cluster.dtest.dtest_class import create_cf
+from test.cluster.dtest.tools import assertions
logger = logging.getLogger(__name__)
@@ -51,6 +52,27 @@ def insert_c1c2( # noqa: PLR0913
execute_concurrent_with_args(session, statement, [[f"k{k}"] for k in keys], concurrency=concurrency)
+def query_c1c2( # noqa: PLR0913
+session,
+key,
+consistency=ConsistencyLevel.QUORUM,
+tolerate_missing=False,
+must_be_missing=False,
+c1_value="value1",
+c2_value="value2",
+ks="ks",
+cf="cf",
+):
+query = SimpleStatement(f"SELECT c1, c2 FROM {ks}.{cf} WHERE key='k{key}'", consistency_level=consistency)
+rows = list(session.execute(query))
+if not tolerate_missing and not must_be_missing:
+assertions.assert_length_equal(rows, 1)
+res = rows[0]
+assert len(res) == 2 and res[0] == c1_value and res[1] == c2_value, res
+if must_be_missing:
+assertions.assert_length_equal(rows, 0)
def rows_to_list(rows):
new_list = [list(row) for row in rows]
return new_list

View File

@@ -181,11 +181,14 @@ async def test_random_failures(manager: ManagerClient,
LOGGER.info("Found following message in the coordinator's log:\n\t%s", matches[-1][0])
await manager.server_stop(server_id=s_info.server_id)
+BANNED_NOTIFICATION = "received notification of being banned from the cluster from"
+STARTUP_FAILED_PATTERN = f"init - Startup failed:|{BANNED_NOTIFICATION}"
if s_info in await manager.running_servers():
LOGGER.info("Wait until the new node initialization completes or fails.")
-await server_log.wait_for("init - (Startup failed:|Scylla version .* initialization completed)", timeout=120)
+await server_log.wait_for(f"init - (Startup failed:|Scylla version .* initialization completed)|{BANNED_NOTIFICATION}", timeout=120)
-if await server_log.grep("init - Startup failed:"):
+if await server_log.grep(STARTUP_FAILED_PATTERN):
LOGGER.info("Check that the new node is dead.")
expected_statuses = [psutil.STATUS_DEAD]
else:
@@ -216,7 +219,7 @@ async def test_random_failures(manager: ManagerClient,
else:
if s_info in await manager.running_servers():
LOGGER.info("The new node is dead. Check if it failed to startup.")
-assert await server_log.grep("init - Startup failed:")
+assert await server_log.grep(STARTUP_FAILED_PATTERN)
await manager.server_stop(server_id=s_info.server_id) # remove the node from the list of running servers
LOGGER.info("Try to remove the dead new node from the cluster.")

View File

@@ -26,6 +26,7 @@ skip_in_release:
- test_raft_cluster_features
- test_cluster_features
- dtest/limits_test
+- dtest/schema_management_test
skip_in_debug:
- test_shutdown_hang
- test_replace

View File

@@ -146,13 +146,13 @@ async def test_joining_old_node_fails(manager: ManagerClient) -> None:
# Try to add a node that doesn't support the feature - should fail
new_server_info = await manager.server_add(start=False, property_file=servers[0].property_file())
-await manager.server_start(new_server_info.server_id, expected_error="Feature check failed")
+await manager.server_start(new_server_info.server_id, expected_error="Feature check failed|received notification of being banned from the cluster from")
# Try to replace with a node that doesn't support the feature - should fail
await manager.server_stop_gracefully(servers[0].server_id)
replace_cfg = ReplaceConfig(replaced_id=servers[0].server_id, reuse_ip_addr=False, use_host_id=False)
new_server_info = await manager.server_add(start=False, replace_cfg=replace_cfg, property_file=servers[0].property_file())
-await manager.server_start(new_server_info.server_id, expected_error="Feature check failed")
+await manager.server_start(new_server_info.server_id, expected_error="Feature check failed|received notification of being banned from the cluster from")
@pytest.mark.asyncio

View File

@@ -131,7 +131,7 @@ async def test_major_compaction_flush_all_tables(manager: ManagerClient, compact
await manager.api.keyspace_compaction(server.ip_addr, ks, cf)
flush_log = await log.grep("Forcing new commitlog segment and flushing all tables", from_mark=mark)
-assert len(flush_log) == (1 if expect_all_table_flush else 0)
+assert len(flush_log) == (2 if expect_all_table_flush else 0)
# all tables should be flushed the first time unless compaction_flush_all_tables_before_major_seconds == 0
await check_all_table_flush_in_major_compaction(compaction_flush_all_tables_before_major_seconds != 0)

View File

@@ -40,16 +40,8 @@ struct test_pinger: public direct_failure_detector::pinger {
co_return;
}
-promise<> p;
-auto f = p.get_future();
-auto sub = as.subscribe([&, p = std::move(p)] () mutable noexcept {
-p.set_value();
-});
-if (!sub) {
-throw abort_requested_exception{};
-}
-co_await std::move(f);
-throw abort_requested_exception{};
+// Simulate a blocking ping that only returns when aborted.
+co_await sleep_abortable(std::chrono::hours(1), as);
}, as);
co_return ret;
}

View File

@@ -64,7 +64,7 @@ async def test_user_writes_rejection(manager: ManagerClient, volumes_factory: Ca
for server in servers:
await manager.api.disable_autocompaction(server.ip_addr, ks)
-async with new_test_table(manager, ks, "pk int PRIMARY KEY, t text") as cf:
+async with new_test_table(manager, ks, "pk int PRIMARY KEY, t text", " WITH speculative_retry = 'NONE'") as cf:
logger.info("Create a big file on the target node to reach critical disk utilization level")
disk_info = psutil.disk_usage(workdir)

View File

@@ -15,8 +15,8 @@
#include <vector>
-// This class supports atomic removes (by using a lock and returning a
-// future) and non atomic insert and iteration (by using indexes).
+// This class supports atomic inserts, removes, and iteration.
+// All operations are synchronized using a read-write lock.
template <typename T>
class atomic_vector {
std::vector<T> _vec;
@@ -24,6 +24,10 @@ class atomic_vector {
public:
void add(const T& value) {
+auto lock = _vec_lock.for_write().lock().get();
+auto unlock = seastar::defer([this] {
+_vec_lock.for_write().unlock();
+});
_vec.push_back(value);
}
seastar::future<> remove(const T& value) {
@@ -44,11 +48,14 @@ public:
auto unlock = seastar::defer([this] {
_vec_lock.for_read().unlock();
});
-// We grab a lock in remove(), but not in add(), so we
-// iterate using indexes to guard against the vector being
-// reallocated.
-for (size_t i = 0, n = _vec.size(); i < n; ++i) {
-func(_vec[i]);
+// Take a snapshot of the current contents while holding the read lock,
+// so that concurrent add() calls and possible reallocations won't
+// affect our iteration.
+auto snapshot = _vec;
+// We grab locks in both add() and remove(), so we iterate using
+// indexes on the snapshot to avoid concurrent modifications.
+for (size_t i = 0, n = snapshot.size(); i < n; ++i) {
+func(snapshot[i]);
}
}
@@ -59,16 +66,17 @@ public:
void thread_for_each_nested(seastar::noncopyable_function<void(T)> func) const {
// When called in the context of thread_for_each, the read lock is
// already held, so we don't need to acquire it again. Acquiring it
-// again could lead to a deadlock.
-if (!_vec_lock.locked()) {
-utils::on_internal_error("thread_for_each_nested called without holding the vector lock");
-}
+// again could lead to a deadlock. This function must only be called
+// while holding the read lock on _vec_lock.
-// We grab a lock in remove(), but not in add(), so we
-// iterate using indexes to guard against the vector being
-// reallocated.
-for (size_t i = 0, n = _vec.size(); i < n; ++i) {
-func(_vec[i]);
+// Take a snapshot of the current contents while the read lock is held,
+// so that concurrent add() calls and possible reallocations won't
+// affect our iteration.
+auto snapshot = _vec;
+// We grab locks in both add() and remove(), so we iterate using
+// indexes on the snapshot to avoid concurrent modifications.
+for (size_t i = 0, n = snapshot.size(); i < n; ++i) {
+func(snapshot[i]);
}
}
@@ -79,11 +87,14 @@ public:
// preemption.
seastar::future<> for_each(seastar::noncopyable_function<seastar::future<>(T)> func) const {
auto holder = co_await _vec_lock.hold_read_lock();
-// We grab a lock in remove(), but not in add(), so we
-// iterate using indexes to guard against the vector being
-// reallocated.
-for (size_t i = 0, n = _vec.size(); i < n; ++i) {
-co_await func(_vec[i]);
+// Take a snapshot of the current contents while holding the read lock,
+// so that concurrent add() calls and possible reallocations won't
+// affect our iteration.
+auto snapshot = _vec;
+// We grab locks in both add() and remove(), so we iterate using
+// indexes on the snapshot to avoid concurrent modifications.
+for (size_t i = 0, n = snapshot.size(); i < n; ++i) {
+co_await func(snapshot[i]);
}
}
};
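The snapshot-under-lock technique in the hunks above can be sketched as a Python analog (a hypothetical illustration using stdlib threading, not the Seastar rwlock/coroutine implementation): mutations take the lock, and iteration copies the vector while the lock is held so concurrent add()/remove() calls and list reallocation cannot disturb the loop.

```python
import threading

class AtomicVector:
    """Simplified analog of atomic_vector: a lock guards every mutation,
    and iteration works on a snapshot taken while the lock is held."""

    def __init__(self):
        self._vec = []
        self._lock = threading.Lock()

    def add(self, value):
        with self._lock:  # like taking the write lock in add()
            self._vec.append(value)

    def remove(self, value):
        with self._lock:  # like taking the write lock in remove()
            self._vec.remove(value)

    def for_each(self, func):
        with self._lock:  # like holding the read lock while snapshotting
            snapshot = list(self._vec)
        # The snapshot is private to this call, so iterating it is safe
        # even if other threads add or remove elements concurrently.
        for item in snapshot:
            func(item)
```

The trade-off matches the C++ comments: the callback may still observe an element that a concurrent remove() has already taken out, but the iteration itself can never crash on a reallocated or shrunk container.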