scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 22:43:15 +00:00

Author	SHA1	Message	Date
Botond Dénes	53a6ec05ed	Merge 'replica: remove rwlock for protecting iteration over storage group map' from Raphael "Raph" Carvalho rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Fixes #18821. 
``` WRITE ===== ./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets --write - BEFORE 65559.52 tps ( 59.6 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 52841 insns/op, 30946 cycles/op, 0 errors) 67408.05 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53018 insns/op, 30874 cycles/op, 0 errors) 67714.72 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53026 insns/op, 30881 cycles/op, 0 errors) 67825.57 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53015 insns/op, 30821 cycles/op, 0 errors) 67810.74 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53009 insns/op, 30828 cycles/op, 0 errors) throughput: mean=67263.72 standard-deviation=967.40 median=67714.72 median-absolute-deviation=547.02 maximum=67825.57 minimum=65559.52 instructions_per_op: mean=52981.61 standard-deviation=79.09 median=53014.96 median-absolute-deviation=36.54 maximum=53025.79 minimum=52840.56 cpu_cycles_per_op: mean=30869.90 standard-deviation=50.23 median=30874.06 median-absolute-deviation=42.11 maximum=30945.94 minimum=30820.89 - AFTER 65448.76 tps ( 59.5 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 52788 insns/op, 31013 cycles/op, 0 errors) 67290.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53025 insns/op, 30950 cycles/op, 0 errors) 67646.81 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53025 insns/op, 30909 cycles/op, 0 errors) 67565.90 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53058 insns/op, 30951 cycles/op, 0 errors) 67537.32 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 52983 insns/op, 30963 cycles/op, 0 errors) throughput: mean=67097.93 standard-deviation=931.44 median=67537.32 median-absolute-deviation=467.97 maximum=67646.81 minimum=65448.76 instructions_per_op: mean=52975.85 standard-deviation=108.07 median=53024.55 median-absolute-deviation=49.45 maximum=53057.99 minimum=52788.49 cpu_cycles_per_op: mean=30957.17 standard-deviation=37.43 median=30951.31 
median-absolute-deviation=7.51 maximum=31013.01 minimum=30908.62 READ ===== ./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets - BEFORE 79423.36 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41840 insns/op, 26820 cycles/op, 0 errors) 81076.70 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41837 insns/op, 26583 cycles/op, 0 errors) 80927.36 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41829 insns/op, 26629 cycles/op, 0 errors) 80539.44 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41841 insns/op, 26735 cycles/op, 0 errors) 80793.10 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41864 insns/op, 26662 cycles/op, 0 errors) throughput: mean=80551.99 standard-deviation=661.12 median=80793.10 median-absolute-deviation=375.37 maximum=81076.70 minimum=79423.36 instructions_per_op: mean=41842.20 standard-deviation=13.26 median=41840.14 median-absolute-deviation=5.68 maximum=41864.50 minimum=41829.29 cpu_cycles_per_op: mean=26685.88 standard-deviation=93.31 median=26662.18 median-absolute-deviation=56.47 maximum=26820.08 minimum=26582.68 - AFTER 79464.70 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41799 insns/op, 26761 cycles/op, 0 errors) 80954.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41803 insns/op, 26605 cycles/op, 0 errors) 81160.90 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41811 insns/op, 26555 cycles/op, 0 errors) 81263.10 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41814 insns/op, 26527 cycles/op, 0 errors) 81162.97 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41806 insns/op, 26549 cycles/op, 0 errors) throughput: mean=80801.25 standard-deviation=755.54 median=81160.90 median-absolute-deviation=361.72 maximum=81263.10 minimum=79464.70 instructions_per_op: mean=41806.47 standard-deviation=5.85 median=41806.05 median-absolute-deviation=4.05 maximum=41813.86 minimum=41799.36 cpu_cycles_per_op: mean=26599.22 
standard-deviation=94.84 median=26554.54 median-absolute-deviation=50.51 maximum=26761.06 minimum=26527.05 ``` Closes scylladb/scylladb#19469 * github.com:scylladb/scylladb: replica: remove rwlock for protecting iteration over storage group map replica: get rid of fragile compaction group intrusive list	2024-07-12 15:45:36 +03:00
Emil Maskovsky	b9abad0515	test: raft: fix the topology failure recovery test flakiness Setting the error condition for all nodes in the cluster to avoid having to check which one is the coordinator. This should make the test more stable and avoid the flakiness observed when the coordinator node is the one that got the error condition injected. Randomizing the retrieved running servers to reproduce the issue more frequently and to avoid making any assumptions about the order of the servers. Note that only the "raft_topology_barrier_fail" needs to run on a non-coordinator node, the other error "stream_ranges_fail" can be injected on any node (including the coordinator). Fixes: scylladb/scylladb#18614 Closes scylladb/scylladb#19663	2024-07-11 16:23:26 +02:00
Piotr Dulikowski	188b4ac0fc	Merge 'service_level_controller: update configuration on raft change' from Michał Jadwiszczak This patch is a follow-up to scylladb/scylladb#16585. Once we have service levels on raft, we can get rid of update loop, which updates the configuration in a configured interval (default is 10s). Instead, this PR introduces methods to `group0_state_machine` which look through table ids in mutations in `write_mutation` and update submodules based on that ids. Fixes: scylladb/scylladb#18060 Closes scylladb/scylladb#18758 * github.com:scylladb/scylladb: test: remove `sleep()`s which were required to reload service levels configuration test/cql_test_env: remove unit test service levels data accessors service/storage_service: reload SL cache on topology_state_load() service/qos/service_level_controller: move semaphore breaking to stop service/qos/service_level_controller: maybe start and stop legacy update loop service/qos/service_level_controller: make update loop legacy raft/group0_state_machine: update submodules based on table_id service/storage_service: add a proxy method to reload sl cache	2024-07-11 16:18:48 +02:00
Michał Chojnowski	1a8ee69a43	sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's There are two schemas associated with an sstable writer: the sstable's schema (i.e. the schema of the table at the time when the sstable object was created), and the writer's schema (equal to the schema of the reader which is feeding into the writer). It's easy to mix up the two and break something as a result. The writer's schema is needed to correctly interpret and serialize the data passing through the writer, and to populate the on-disk metadata about the on-disk schema. The sstable's schema is used to configure some parameters for the newly created sstable, such as bloom filter false positive ratio, or compression. The problem fixed by this patch is that the writer was wrongly creating the compressor objects based on its own schema, but using them based on the sstable's schema. This patch forces the writer to use the sstable's schema for both.	2024-07-11 12:53:54 +02:00
Michał Chojnowski	d10b38ba5b	sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's There are two schemas associated with an sstable writer: the sstable's schema (i.e. the schema of the table at the time when the sstable object was created), and the writer's schema (equal to the schema of the reader which is feeding into the writer). It's easy to mix up the two and break something as a result. The writer's schema is needed to correctly interpret and serialize the data passing through the writer, and to populate the on-disk metadata about the on-disk schema. The sstable's schema is used to configure some parameters for the newly created sstable, such as bloom filter false positive ratio, or compression. The problem fixed by this patch is that the writer was wrongly creating the filter based on its own schema, while the layer outside the writer was interpreting it as if it was created with the sstable's schema. This patch forces the writer to pick the filter's parameters based on the sstable's schema instead.	2024-07-11 12:53:54 +02:00
Piotr Dulikowski	19c5e1807c	Merge 'schema: fix describe of indexes on collections' from Michał Jadwiszczak If the index was created on collection (both frozen or not), its description wasn't a correct create statement. This patch fixes the bug and includes functions like `full()`, `keys()`, `values()`, ... used to create index on collections. Fixes scylladb/scylladb#19278 Closes scylladb/scylladb#19381 * github.com:scylladb/scylladb: cql-pytest/test_describe: add a test for describe indexes schema/schema: fix column names in index description	2024-07-11 09:11:01 +02:00
Michał Chojnowski	fdd8b03d4b	scylla-gdb.py: add $coro_frame() Adds a convenience function for inspecting the coroutine frame of a given seastar task. Short example of extracting a coroutine argument: ``` (gdb) p $coro_frame(seastar::local_engine->_current_task) $1 = { __resume_fn = 0x2485f80 <sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&)>, ... PointerType_7 = 0x601008e67880, ... __coro_index = 0 '\000' ... (gdb) p $downcast_vptr($->PointerType_7) $2 = (schema *) 0x601008e67880 ``` Closes scylladb/scylladb#19479	2024-07-10 21:46:27 +03:00
Avi Kivity	45e27c0da2	config, enum_option: allow round-trip string conversion The default configuration for replication_strategy_warn_list is ["SimpleStrategy"], but one cannot set this via CQL: cqlsh> select * from system.config where name = 'replication_strategy_warn_list'; name \| source \| type \| value --------------------------------+---------+---------------------------+-------------------- replication_strategy_warn_list \| default \| replication strategy list \| ["SimpleStrategy"] (1 rows) cqlsh> update system.config set value = '[NetworkTopologyStrategy]' where name = 'replication_strategy_warn_list'; cqlsh> select * from system.config where name = 'replication_strategy_warn_list'; name \| source \| type \| value --------------------------------+--------+---------------------------+----------------------------- replication_strategy_warn_list \| cql \| replication strategy list \| ["NetworkTopologyStrategy"] (1 rows) cqlsh> update system.config set value = '["NetworkTopologyStrategy"]' where name = 'replication_strategy_warn_list'; WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for system.config - received 0 responses and 1 failures from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1} Fix by allowing quotes in enum_set parsing. Bug present since `8c464b2ddb` ("guardrails: restrict replication strategy (RS)", 6.0). Fixes #19604. Closes scylladb/scylladb#19605	2024-07-10 20:39:01 +03:00
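The round-trip problem described in the entry above (a value is printed with quotes but could only be parsed without them) can be illustrated with a small standalone sketch. This is not the `enum_option` code from the commit; `parse_enum_list` and `format_enum_list` are hypothetical helpers that only show the idea of accepting each list element with or without surrounding double quotes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Strip leading and trailing whitespace.
inline std::string trim(const std::string& s) {
    const char* ws = " \t\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return "";
    }
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical parser: accepts [A, B] as well as ["A", "B"].
inline std::vector<std::string> parse_enum_list(const std::string& text) {
    const std::string t = trim(text);
    if (t.size() < 2 || t.front() != '[' || t.back() != ']') {
        throw std::invalid_argument("expected a bracketed list");
    }
    const std::string body = t.substr(1, t.size() - 2);
    std::vector<std::string> out;
    std::size_t start = 0;
    while (start <= body.size()) {
        std::size_t comma = body.find(',', start);
        if (comma == std::string::npos) {
            comma = body.size();
        }
        std::string item = trim(body.substr(start, comma - start));
        // Quotes are optional: drop one matching pair if present.
        if (item.size() >= 2 && item.front() == '"' && item.back() == '"') {
            item = item.substr(1, item.size() - 2);
        }
        if (!item.empty()) {
            out.push_back(item);
        }
        start = comma + 1;
    }
    return out;
}

// Hypothetical formatter: always prints quoted elements.
inline std::string format_enum_list(const std::vector<std::string>& items) {
    std::string out = "[";
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i != 0) {
            out += ", ";
        }
        out += '"' + items[i] + '"';
    }
    return out + "]";
}
```

With a parser of this shape, feeding the output of the formatter back into the parser yields the original list, which is the round-trip property the commit title refers to.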
Michał Jadwiszczak	375499b727	test: remove `sleep()`s which were required to reload service levels configuration Previously, some service level tests had to sleep in order to ensure that the in-memory configuration of service levels was updated. Now that we update the configuration as the raft log is applied, performing a read barrier (for instance by executing `DROP TABLE IF EXISTS non_existing_table`) is enough and the sleeps are not needed.	2024-07-10 10:42:21 +02:00
Michał Jadwiszczak	23bebb8037	test/cql_test_env: remove unit test service levels data accessors Unit test data accessors were created to avoid starting the update loop in unit tests and to update the controller's configuration directly. With the raft data accessor and configuration updates on applying the raft log, we can get rid of the unit test data accessors and use the raft one. This also makes the unit test env a bit more like a real Scylla environment.	2024-07-10 10:42:21 +02:00
Nadav Har'El	c6cffe36dd	Merge 'cql: forbid having counter columns in tablets tables' from Piotr Smaron Counter updates break under tablet migration (#18180), and for this reason counters need to be disabled until the problem is fixed. It's enough to forbid creating a table with counters, as altering a table without counters already cannot result in the table having counters: 1) Adding a counter column to a table without counters: ``` cqlsh> ALTER TABLE temp.cf ADD (col_name counter); ConfigurationException: Cannot add a counter column (col_name) in a non counter column family ``` 2) Altering a column to be of the counter type: ``` cqlsh> ALTER TABLE temp.cf ALTER col_name TYPE counter; ConfigurationException: Cannot change col_name from type int to type counter: types are incompatible. ``` Fixes: #19449 Fixes: https://github.com/scylladb/scylladb/issues/18876 Need to backport to 6.0, as this is broken there. Closes scylladb/scylladb#19518 * github.com:scylladb/scylladb: doc: add notes to feature pages which don't support tablets cql: adjust warning about tablets cql: forbid having counter columns in tablets tables	2024-07-10 10:18:30 +03:00
Michał Jadwiszczak	b65a4c66f0	cql-pytest/test_describe: add a test for describe indexes	2024-07-10 07:14:46 +02:00
Raphael S. Carvalho	c539b7c861	replica: remove rwlock for protecting iteration over storage group map rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Check documentation added to compaction_group.hh to understand how concurrent iterations and updates to the map work without the rwlock. Yielding variants that iterate over groups are no longer returning group id since id stability can no longer be guaranteed without serializing split finalization and iteration. Fixes #18821. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:59:24 -03:00
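The approach described in the entry above (iterate over a copy of the group map, guard each group with a gate, and skip groups whose gate was closed by cleanup) can be sketched without Seastar. Everything here is a simplified, single-threaded assumption: `simple_gate`, `storage_group`, `flush_all` and `cleanup` are hypothetical names, and a real gate would also wait for current holders when closing.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical stand-in for a gate: counts holders and can be closed.
class simple_gate {
    bool _closed = false;
    std::size_t _holders = 0;
public:
    // Returns false once the gate has been closed.
    bool try_enter() {
        if (_closed) {
            return false;
        }
        ++_holders;
        return true;
    }
    void leave() {
        assert(_holders > 0);
        --_holders;
    }
    void close() { _closed = true; }
    bool closed() const { return _closed; }
};

struct storage_group {
    simple_gate gate;
    int flushes = 0;
};

using group_map = std::map<std::size_t, std::shared_ptr<storage_group>>;

// Iterate over a snapshot of the map so that insertions or removals in the
// live map cannot invalidate the iteration; skip groups that were closed.
inline std::size_t flush_all(const group_map& groups) {
    group_map snapshot = groups; // the map is expected to be small (~100 entries)
    std::size_t flushed = 0;
    for (auto& entry : snapshot) {
        storage_group& group = *entry.second;
        if (!group.gate.try_enter()) {
            continue; // closed by cleanup: the group is being migrated out
        }
        ++group.flushes;
        ++flushed;
        group.gate.leave();
    }
    return flushed;
}

// Cleanup closes the gate and drops the group from the live map.
inline void cleanup(group_map& groups, std::size_t id) {
    auto it = groups.find(id);
    if (it != groups.end()) {
        it->second->gate.close();
        groups.erase(it);
    }
}
```

Because the snapshot holds shared pointers, a group removed from the live map stays valid for any iteration that already copied it; the closed gate is what tells that iteration to skip it.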
Raphael S. Carvalho	ad5c5bca5f	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:53:35 -03:00
Avi Kivity	f31d5e3204	Merge 'repair/streaming: enable toggling tombstone gc with a config item' from Botond Dénes We currently disable tombstone GC for compaction done on the read path of streaming and repair, because those expired tombstones can still prevent data resurrection. With time-based tombstone GC, missing a repair for long enough can cause data resurrection because a tombstone is potentially GC'd before it could be spread to every node by repair. So repair disseminating these expired tombstones helps clusters which missed repair for long enough. It is not a guarantee because compaction could have done the GC itself, but it is better than nothing. This last resort is getting less important with repair-based tombstone GC. Furthermore, we have seen this cause huge repair amplification in a cluster, where expired tombstones triggered repair replicating otherwise identical rows. This series makes tombstone GC on the streaming/repair compaction path configurable with a config item. This new config item defaults to `false` (current behaviour), setting it to `true`, will enable tombstone GC. Fixes: https://github.com/scylladb/scylladb/issues/19015 Not a regression, no backport needed Closes scylladb/scylladb#19016 * github.com:scylladb/scylladb: test/topology_custom/test_repair: add test for enable_tombstone_gc_for_streaming_and_repair replica/table: maybe_compact_for_streaming(): toggle tombstone GC based on the control flag replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming() db/config: introduce enable_tombstone_gc_for_streaming_and_repair	2024-07-09 19:04:11 +03:00
Piotr Smaron	c70f321c6f	cql: forbid having counter columns in tablets tables Counter updates break under tablet migration (#18180), and for this reason they need to be disabled until the problem is fixed. It's enough to forbid creating a table with counters, as altering a table without counters already cannot result in the table having counters: 1) Adding a counter column to a table without counters: ``` cqlsh> ALTER TABLE temp.cf ADD (col_name counter); ConfigurationException: Cannot add a counter column (col_name) in a non counter column family ``` 2) Altering a column to be of the counter type: ``` cqlsh> ALTER TABLE temp.cf ALTER col_name TYPE counter; ConfigurationException: Cannot change col_name from type int to type counter: types are incompatible. ``` Fixes: #19449	2024-07-09 18:01:31 +02:00
Patryk Wrobel	a89e3d10af	code-cleanup: add missing header guards The following command had been executed to get the list of headers that did not contain '#pragma once': 'grep -rnw . -e "#pragma once" --include *.hh -L' This change adds missing include guard to headers that did not contain any guard. Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com> Closes scylladb/scylladb#19626	2024-07-09 18:31:35 +03:00
Calle Wilund	8295980d14	commitlog: Make max data lifetime runtime-configurable	2024-07-09 12:30:49 +00:00
Calle Wilund	55d6afda6e	commitlog: Add optional max lifetime parameter to cl instance If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happens will, after X seconds, flush data to free up the commit log.	2024-07-09 12:30:48 +00:00
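A minimal sketch of the rule stated in the entry above, assuming a much simplified segment model: any segment whose oldest data exceeds the optional maximum lifetime is selected for flushing, independent of data pressure. The `segment` struct and `segments_to_flush` function are hypothetical and do not mirror the commitlog implementation.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <vector>

using clock_type = std::chrono::steady_clock;

// Hypothetical, simplified view of a commitlog segment.
struct segment {
    int id;
    clock_type::time_point oldest_data; // timestamp of the oldest unflushed entry
};

// Returns the ids of segments whose oldest data is older than max_lifetime.
// An empty optional means the limit is not set, so nothing is selected.
inline std::vector<int> segments_to_flush(const std::vector<segment>& segments,
                                          clock_type::time_point now,
                                          std::optional<std::chrono::seconds> max_lifetime) {
    std::vector<int> result;
    if (!max_lifetime) {
        return result;
    }
    for (const auto& s : segments) {
        if (now - s.oldest_data > *max_lifetime) {
            result.push_back(s.id);
        }
    }
    return result;
}
```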
Tomasz Grabiec	252110bc54	Merge 'mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion' from Michał Chojnowski apply_monotonically() is run with reclaim disabled. So with some bad luck, sentinel insertion might fail with bad_alloc even on a perfectly healthy node. We can't deal with the failure of sentinel insertion, so this will result in a crash. This patch prevents the spurious OOM by reserving some memory (1 LSA segment) and only making it available right before the critical allocations. Fixes https://github.com/scylladb/scylladb/issues/19552 Closes scylladb/scylladb#19617 * github.com:scylladb/scylladb: mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion logalloc: add hold_reserve logalloc: generalize refill_emergency_reserve()	2024-07-09 13:09:01 +02:00
Michael Litvak	ed33e59714	storage_proxy: remove response handler if no targets When writing a mutation, it might happen that there are no live targets to send the mutation to, yet the request can be satisfied. For example, when writing with CL=ANY to a dead node, the request is completed by storing a local hint. Currently, in that case, a write response handler is created for the request and it remains active until it times out because it is not removed anywhere, even though the write is completed successfully after storing the hint. The response handler is usually removed when responses from all targets have been received, but in this case there are no targets to trigger the removal. In this commit we check whether there are any live targets to send the mutation to. If there are none, we remove the response handler immediately. Fixes scylladb/scylladb#19529 Closes scylladb/scylladb#19586	2024-07-09 12:11:05 +03:00
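The lifetime issue in the entry above can be shown with a toy registry: handlers are normally removed when the last target responds, so a handler registered with zero targets would linger unless it is removed right away. `handler_registry` and its methods are hypothetical names for illustration, not the `storage_proxy` interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry of in-flight write response handlers.
class handler_registry {
    // response id -> targets that have not answered yet
    std::map<int, std::vector<std::string>> _pending;
    int _next_id = 0;
public:
    int start_write(std::vector<std::string> live_targets) {
        const int id = _next_id++;
        const bool no_targets = live_targets.empty();
        _pending.emplace(id, std::move(live_targets));
        if (no_targets) {
            // Nobody will ever respond, so nothing would trigger the usual
            // removal path: drop the handler immediately.
            _pending.erase(id);
        }
        return id;
    }

    void got_response(int id, const std::string& from) {
        auto it = _pending.find(id);
        if (it == _pending.end()) {
            return;
        }
        auto& targets = it->second;
        targets.erase(std::remove(targets.begin(), targets.end(), from), targets.end());
        if (targets.empty()) {
            _pending.erase(it); // usual path: last target answered
        }
    }

    std::size_t active() const { return _pending.size(); }
};
```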
Kamil Braun	98c18d8904	Merge 'Add API for read barrier' from Emil Maskovsky Introduce REST API for triggering a read barrier. This is to make sure the database schema is up to date on the node where the read barrier is triggered. One of the use cases is the database backup via the Scylla Manager, which requires that the schema backed up is matching the data or newer (data can be migrated, but an older schema would cause issues). Fixes scylladb/scylladb#19213 Closes scylladb/scylladb#19597 * github.com:scylladb/scylladb: raft: add the read barrier REST API raft: use `raft_timeout` in trigger_snapshot raft: use bad_param_exception for consistency test: raft: verify schema updated after read barrier	2024-07-09 10:58:21 +02:00
Kefu Chai	6af989782c	test: sstable_directory_test: use THREADSAFE_BOOST_REQUIRE_EQUAL when appropriate for better debugging experience. before this change, we have ``` fatal error: in "sstable_directory_test_generation_sanity": critical check sst->generation() == sst1->generation() has failed ``` after this change, we have ``` fatal error: in "sstable_directory_test_generation_sanity": critical check sst->generation() == sst1->generation() has failed [3ghm_0ntw_29vj625yegw7jodysc != 3ghm_0ntw_29vj625yegw7jodysd] ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19639	2024-07-09 10:54:23 +03:00
Kefu Chai	30e82a81e8	test: do not define boost_test_print_type() for types with operator<< before this change, we provide `boost_test_print_type()` for all types which can be formatted using {fmt}. these types includes those who fulfill the concept of range, and their element can be formatted using {fmt}. if the compilation unit happens to include `fmt/ranges.h`. the ranges are formatted with `boost_test_print_type()` as well. this is what we expect. in other words, we use {fmt} to format types which do not natively support {fmt}, but they fulfill the range concept. but `boost::unit_test::basic_cstring` is one of them - it can be formatted using operator<<, but it does not provide fmt::format specialization - it fulfills the concept of range - and its element type is `char const`, which can be formatted using {fmt} that's why it's formatted like: ``` test/boost/sstable_directory_test.cc(317): fatal error: in "sstable_directory_test_generation_sanity": critical check ['s', 's', 't', '-', '>', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'o', 'n', '(', ')', ' ', '=', '=', ' ', 's', 's', 't', '1', '-', '>', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'o', 'n', '(', ')'] has failed` ``` where the string is formatted as a sequence-alike container. this is far from readable. so, in this change, we do not define `boost_test_print_type()` for the types which natively support `operator<<` anymore. so they can be printed with `operator<<` when boost::test prints them. Fixes scylladb/scylladb#19637 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19638	2024-07-09 10:34:37 +03:00
Botond Dénes	9544c364be	scylla-gdb.py: introduce scylla large-objects The equivalent of small-objects, but for large objects (spans). Allows listing object of a large-class, and therefore investigating a run-away class, by attempting to identify the owners of the objects in it. Written to investigate #16493 Closes scylladb/scylladb#16711	2024-07-09 10:21:09 +03:00
Emil Maskovsky	a9e985fcc9	raft: add the read barrier REST API This allows triggering the read barrier directly via the API, instead of using work-arounds (like dropping a non-existent table). The intended use-case is in the Scylla Manager, to make sure that the database schema is up to date after the data has been backed up and before attempting to back up the database schema. The database schema in particular is being backed up just on a single node, which might not yet have the schema at least as new as the data (data can be migrated to a newer schema, but not vice versa). The read barrier issued on the node should ensure that the node has a schema at least as new as the data. Closes #19213	2024-07-08 18:16:27 +02:00
Michał Chojnowski	7b3f55a65f	logalloc: add hold_reserve mutation_partition_v2::apply_monotonically() needs to perform some allocations in a destructor, to ensure that the invariants of the data structure are restored before returning. But it is usually called with reclaiming disabled, so the allocations might fail even in a perfectly healthy node with plenty of reclaimable memory. This patch adds a mechanism which allows reserving some LSA memory (by asking the allocator to keep it unused) and making it available for allocation right when we need to guarantee allocation success.	2024-07-08 16:08:27 +02:00
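The reservation idea in the entry above can be sketched as a small RAII helper: set some memory aside up front, then hand it back to the allocator just before the allocations that must not fail. This is only a model under stated assumptions. `segment_pool` and `reserve_holder` are hypothetical types that count segments; they are not the logalloc `hold_reserve` interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pool that only tracks how many free segments remain.
class segment_pool {
    std::size_t _free;
public:
    explicit segment_pool(std::size_t free_segments) : _free(free_segments) {}
    std::size_t free_segments() const { return _free; }
    bool try_alloc() {
        if (_free == 0) {
            return false;
        }
        --_free;
        return true;
    }
    bool take(std::size_t n) {
        if (_free < n) {
            return false;
        }
        _free -= n;
        return true;
    }
    void give_back(std::size_t n) { _free += n; }
};

// Keeps n segments out of circulation until release() or destruction.
class reserve_holder {
    segment_pool& _pool;
    std::size_t _held;
public:
    reserve_holder(segment_pool& pool, std::size_t n)
        : _pool(pool), _held(pool.take(n) ? n : 0) {}
    reserve_holder(const reserve_holder&) = delete;
    reserve_holder& operator=(const reserve_holder&) = delete;
    ~reserve_holder() { release(); }

    std::size_t held() const { return _held; }

    // Make the reserved segments available right before critical allocations.
    void release() {
        _pool.give_back(_held);
        _held = 0;
    }
};
```

In this model, ordinary allocations can exhaust the pool while the reserve is held, and the critical allocation still succeeds once `release()` has been called.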
Emil Maskovsky	80986c17c3	test: raft: verify schema updated after read barrier Regression test for #19213.	2024-07-08 10:50:32 +02:00
Piotr Dulikowski	3c535641fd	Merge 'service/storage_proxy: Add metrics keeping track of incoming hints' from Dawid Mędrek Although Scylla already exposes metrics keeping track of various information related to hinted handoff, all of them correspond to either storing or sending hints. However, when debugging, it's also crucial to be aware of how many hints are coming to a given node and what their size is. Unfortunately, the existing metrics are not enough to obtain that information. This PR introduces the following new metrics: * `sent_bytes_total` – the total size of the hints that have been sent from a given shard, * `received_hints_total` – the total number of hints that a given shard has received, * `received_hints_bytes_total` – the total size of the hints a given shard has received. It also renames `hints_manager_sent` to `hints_manager_sent_total` to avoid conflicts of prefixes between that metric and `sent_bytes_total` in tests. Fixes scylladb/scylladb#10987 Closes scylladb/scylladb#18976 * github.com:scylladb/scylladb: db/hints: Add a metric for the size of sent hints service/storage_proxy: Add metrics for received hints	2024-07-08 10:29:53 +02:00
Botond Dénes	56c194e52c	Merge 'compaction: not include unused headers' from Kefu Chai these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#19581 * github.com:scylladb/scylladb: .github: add compaction to iwyu's CLEANER_DIR compaction: not include unused headers	2024-07-08 10:03:51 +03:00
Michael Litvak	407274e828	view: drain view builder before database The view builder is doing write operations to the database. In order for the view builder to shutdown gracefully without errors, we need to ensure the database can handle writes while it is drained. The commit changes the drain order, so that view builder is drained before the database shuts down. Fixes scylladb/scylladb#18929 Closes scylladb/scylladb#19609	2024-07-05 22:17:40 +03:00
Avi Kivity	0626e0487d	Merge 'Add copy on write to functions schema code' from Marcin Maliszkiewicz This is the first patch from a series which would allow us to unify raft command code. The property we want to achieve is that all modifications performed by a single raft command can be made visible atomically. This helps to exclude accidental dependencies across subsystem updates and makes it easier to reason about state. Here we alter the functions schema code so that changes are first applied to a copy of declared functions and then made visible atomically. Later work will apply a similar strategy to the whole schema. Relates scylladb/scylladb#19153 Closes scylladb/scylladb#19598 * github.com:scylladb/scylladb: cql3: functions: make modification functions accessible only via batch class db: replica: batch functions schema modifications cql3: functions: introduce class for batching functions modifications cql3: functions: make functions class non-static cql3: functions: remove reduntant class access specifiers cql3: functions: remove unused java snippet	2024-07-04 17:40:23 +03:00
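The copy-on-write strategy described in the entry above (apply changes to a copy, then make them visible atomically) can be sketched as follows. The names `functions_registry` and `batch`, and the use of a function name to arity map, are assumptions for illustration only; the actual cql3 classes are not reproduced here.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical content: function name -> arity.
using function_map = std::map<std::string, int>;

class functions_registry {
    std::shared_ptr<const function_map> _current = std::make_shared<function_map>();
public:
    // Readers get an immutable snapshot that never changes under them.
    std::shared_ptr<const function_map> snapshot() const { return _current; }

    // All modifications go through a batch that works on a private copy.
    class batch {
        functions_registry& _owner;
        function_map _copy;
    public:
        explicit batch(functions_registry& owner)
            : _owner(owner), _copy(*owner._current) {}
        void add(const std::string& name, int arity) { _copy[name] = arity; }
        void remove(const std::string& name) { _copy.erase(name); }
        // Publish every change of the batch in a single pointer swap.
        void commit() {
            _owner._current = std::make_shared<function_map>(std::move(_copy));
        }
    };
};
```

Until `commit()` runs, readers see none of the batch's changes; afterwards they see all of them, while snapshots taken earlier remain unchanged.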
Nadav Har'El	96dff367f8	Merge 'storage_proxy: update view update backlog on correct shard when writing' from Wojciech Mitros This series is another approach to https://github.com/scylladb/scylladb/pull/18646 and https://github.com/scylladb/scylladb/pull/19181. In this series we only change where the view backlog gets updated - we do not assure that the view update backlog returned in a response is necessarily the backlog that increased due to the corresponding write, the returned backlog may be outdated up to 10ms. Because this series does not include this change, it's considerably less complex and it doesn't modify the common write path, so no particular performance considerations were needed in that context. The issue being fixed is still the same, the full description can be seen below. When a replica applies a write on a table which has a materialized view it generates view updates. These updates take memory which is tracked by `database::_view_update_concurrency_sem`, separate on each shard. The fraction of units taken from the semaphore to the semaphore limit is the shard's view update backlog. Based on these backlogs, we want to estimate how busy a node is with its view updates work. We do that by taking the max backlog across all shards. 
To avoid excessive cross-shard operations, the node's (max) backlog isn't calculated each time we need it, but up to 1 time per 10ms (the `_interval`) with an optimization where the backlog of the calculating shard is immediately up-to-date (we don't need cross-shard operations for it): ``` update_backlog node_update_backlog::fetch() { auto now = clock::now(); if (now >= _last_update.load(std::memory_order_relaxed) + _interval) { _last_update.store(now, std::memory_order_relaxed); auto new_max = boost::accumulate( _backlogs, update_backlog::no_backlog(), [] (const update_backlog& lhs, const per_shard_backlog& rhs) { return std::max(lhs, rhs.load()); }); _max.store(new_max, std::memory_order_relaxed); return new_max; } return std::max(fetch_shard(this_shard_id()), _max.load(std::memory_order_relaxed)); } ``` For the same reason, even when we do calculate the new node's backlog, we don't read from the `_view_update_concurrency_sem`. Instead, for each shard we also store a update_backlog atomic which we use for calculation: ``` struct per_shard_backlog { // Multiply by 2 to defeat the prefetcher alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog(); need_publishing need_publishing = need_publishing::no; update_backlog load() const { return backlog.load(std::memory_order_relaxed); } }; std::vector<per_shard_backlog> _backlogs; ``` Due to this distinction, the update_backlog atomic need to be updated separately, when the `_view_update_concurrency_sem` changes. 
This is done by calling `storage_proxy::update_view_update_backlog`, which reads the `_view_update_concurrency_sem` of the shard (in `database::get_view_update_backlog`) and then calls `node_update_backlog::add`, where the read backlog is stored in the atomic:
```
void storage_proxy::update_view_update_backlog() {
    _max_view_update_backlog.add(get_db().local().get_view_update_backlog());
}
void node_update_backlog::add(update_backlog backlog) {
    _backlogs[this_shard_id()].backlog.store(backlog, std::memory_order_relaxed);
    _backlogs[this_shard_id()].need_publishing = need_publishing::yes;
}
```
For this implementation of calculating the node's view update backlog to work, we need the atomics to be updated correctly when the semaphores of corresponding shards change. The main event where the view update backlog changes is an incoming write request. That's why when handling the request and preparing a response we update the backlog by calling `storage_proxy::get_view_update_backlog` (also because we want to read the backlog and send it in the response):
backlog update after local view updates (`storage_proxy::send_to_live_endpoints` in `mutate_begin`)
```
auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable {
    return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state())
            .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] {
        // make mutation alive until it is processed locally, otherwise it
        // may disappear if write timeouts before this future is ready
        got_response(response_id, my_address, get_view_update_backlog());
    });
};
```
backlog update after remote view updates (`storage_proxy::remote::handle_write`)
```
auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr,
        shard, response_id, p->get_view_update_backlog()));
```
Now assume that on a certain node we have a write request received on shard A, which updates a row on shard B (A!=B). 
As a result, shard B will generate view updates and consume units from its `_view_update_concurrency_sem`, but will not update its atomic in `_backlogs` yet. Because both shards in the example are on the same node, shard A will perform a local write calling `lmutate` shown above. In the `lmutate` call, the `apply_locally` will initiate the actual write on shard B and the `storage_proxy::update_view_update_backlog` will be called back on shard A. In no place will the backlog atomic on shard B get updated even though it increased in size due to the view updates generated there. Currently, what we calculate there doesn't really matter - it's only used for the MV flow control delays, so currently, in this scenario, we may only overload a replica causing failed replica writes which will be later retried as hints. However, when we add MV admission control, the calculated backlog will be the difference between an accepted and a rejected request. Fixes: https://github.com/scylladb/scylladb/issues/18542 Without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none Closes scylladb/scylladb#19341 * github.com:scylladb/scylladb: test: add test for view backlog not being updated on correct shard test: move auxiliary methods for waiting until a view is built to util mv: update view update backlog when it increases on correct shard	2024-07-04 11:40:09 +03:00
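The cross-shard scenario described in this commit message can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `node_backlog_model`, `backlog_after_write` and the plain `double` backlog are hypothetical stand-ins, not the Scylla types, and the "fixed" branch merely assumes that the owning shard publishes its own backlog.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model: a backlog is a fraction in [0, 1].
struct node_backlog_model {
    // One published value per shard, playing the role of `_backlogs`.
    std::vector<std::atomic<double>> published;

    explicit node_backlog_model(std::size_t shards) : published(shards) {
        for (auto& p : published) {
            p.store(0.0, std::memory_order_relaxed);
        }
    }

    // A shard publishes the backlog it read from its own semaphore.
    void publish(std::size_t shard, double backlog) {
        assert(shard < published.size());
        published[shard].store(backlog, std::memory_order_relaxed);
    }

    // The node's backlog is the maximum over the published per-shard values.
    double node_max() const {
        double result = 0.0;
        for (const auto& p : published) {
            result = std::max(result, p.load(std::memory_order_relaxed));
        }
        return result;
    }
};

// A write arrives on shard A (coordinator) and is applied on shard B (owner).
// `publish_on_owner` selects between the behavior described as buggy (false)
// and the assumed fixed behavior (true).
double backlog_after_write(bool publish_on_owner) {
    node_backlog_model node(2);
    const std::size_t coordinator = 0;
    const std::size_t owner = 1;
    const double owner_sem_backlog = 0.75;       // units consumed on shard B
    const double coordinator_sem_backlog = 0.0;  // shard A generated no view updates
    if (publish_on_owner) {
        node.publish(owner, owner_sem_backlog);
    }
    // Shard A always publishes its own (unchanged) backlog when replying.
    node.publish(coordinator, coordinator_sem_backlog);
    return node.node_max();
}
```

In this model, `backlog_after_write(false)` leaves the node's maximum at zero even though shard B consumed units, which is the symptom the message describes.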
Marcin Maliszkiewicz	16b770ff1a	cql3: functions: make functions class non-static This is done to ease code reuse in the following commit. It would also help should we ever want to properly mount the functions class onto the schema object instead of static storage.	2024-07-04 10:24:57 +02:00
Kefu Chai	35e7a0b36f	test/cql-pytest: use offset-aware API to avoid deprecate warning to avoid warning like ``` DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC). ``` and to be future-proof, let's use the offset-aware timestamp. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#19536	2024-07-04 10:48:00 +03:00
Botond Dénes	e3e5f8209d	Merge 'alternator: fix "/localnodes" to use broadcast_rpc_address' from Nadav Har'El This short series fixes Alternator's "/localnodes" request to allow a node's external IP address - configured with `broadcast_rpc_address` - to be listed instead of its usual, internal, IP address. The first patch fixes a bug in gossiper::get_rpc_address(), which the second patch needs to implement the feature. The second patch also contains regression tests. Fixes #18711. Closes scylladb/scylladb#18828 * github.com:scylladb/scylladb: alternator: fix "/localnodes" to use broadcast_rpc_address gossiper: fix get_rpc_address() for this node	2024-07-04 10:37:28 +03:00
Wojciech Mitros	1fdc65279d	test: add test for view backlog not being updated on correct shard This patch adds a test for reproducing issue https://github.com/scylladb/scylladb/issues/18542 The test performs writes on a table with a materialized view and checks that the view backlog increases. To get the current view update backlog, a new metric "view_update_backlog" is added to the `storage_proxy` metrics. It differs from the `database` metric with the same name by taking the backlog from the max_view_update_backlog, which keeps view update backlogs from all shards and may be a bit outdated, instead of directly checking the view_update_semaphore on which the backlog is based.	2024-07-03 23:18:52 +02:00
Wojciech Mitros	c4f5659c11	test: move auxiliary methods for waiting until a view is built to util In many materialized view tests we need to wait until a view is built before actually working on it, future tests will also need it. In existing tests we use the same, duplicated method for achieving that. In this patch the method is deduplicated and moved to pylib/util.py and existing tests are modified to use it instead.	2024-07-03 23:18:52 +02:00
Avi Kivity	3fc4e23a36	forward_service: rename to mapreduce_service forward_service is nondescriptive and misnamed, as it does more than forward requests. It's a classic map/reduce algorithm (and in fact one of its parameters is "reducer"), so name it accordingly. The name "forward" leaked into the wire protocol for the messaging service RPC isolation cookie, so it's kept there. It's also maintained in the name of the logger (for "nodetool setlogginglevel") for compatibility with tests. Closes scylladb/scylladb#19444	2024-07-03 19:29:47 +03:00
Michael Litvak	08b29460fc	mv: skip building view updates on a pending replica Currently, a pending replica that applies a write on a table that has materialized views, will build all the view updates as a normal replica, only to realize at a late point, in db::view::get_view_natural_endpoint(), that it doesn't have a paired view replica to send the updates to. It will then either drop the view updates, or send them to a pending view replica, if such exists. This work is unnecessary since it may be dropped, and even if there is a pending view replica to send the updates to, the updates that are built by the pending replica may be wrong since it may have incomplete information. This commit fixes the inefficiency by skipping the view update building step when applying an update on a pending replica. The metric total_view_updates_on_wrong_node is added to count the cases that a view update is determined to be unnecessary. The test reproduces the scenario of writing to a table and applying the update on a pending replica, and verifies that the pending replica doesn't try to build view updates. Fixes scylladb/scylladb#19152 Closes scylladb/scylladb#19488	2024-07-02 13:10:18 +02:00
Nadav Har'El	d61513c41c	Merge 'reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of them finishing later than they could if they were the only active read in the cache. This was observed to backfire in the case where reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read holds up the fast reads in the queue, increasing the 99th percentile latencies significantly. This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread. Fixes: https://github.com/scylladb/scylladb/issues/19017 This fixes a regression introduced in 5.0, so we have to backport to all currently supported releases Closes scylladb/scylladb#19018 * github.com:scylladb/scylladb: test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency test/boost/reader_concurrency_semaphore_test: hoist require_can_admit reader_concurrency_semaphore: wire in the configurable cpu concurrency reader_concurrency_semaphore: add cpu_concurrency constructor parameter db/config: introduce reader_concurrency_semaphore_cpu_concurrency	2024-07-02 13:39:00 +03:00
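The admission rule described in this merge message can be sketched as a simple counter check. This is a minimal model, assuming that a read which needs only CPU is admitted while the number of such active reads is below the configured limit; `cpu_admission_model` and its members are hypothetical names, not the `reader_concurrency_semaphore` API.

```cpp
#include <cassert>

// Hypothetical model of CPU-bound read admission.
struct cpu_admission_model {
    unsigned cpu_concurrency;  // 1 reproduces the previously hard-coded limit
    unsigned active = 0;       // reads currently declaring that they need only CPU

    bool can_admit() const {
        return active < cpu_concurrency;
    }

    // Admits a CPU-bound read if the limit allows it; returns whether it was admitted.
    bool try_admit() {
        if (!can_admit()) {
            return false;
        }
        ++active;
        return true;
    }

    // Called when an admitted read completes.
    void finish() {
        assert(active > 0);
        --active;
    }
};
```

With `cpu_concurrency` set to 1, a single slow read blocks every later read until it finishes; with a value of 2, one fast read can still be admitted alongside it.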
Kefu Chai	e87b64b7bb	compaction: not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-07-02 14:06:42 +08:00
Nadav Har'El	44e036c53c	alternator: fix "/localnodes" to use broadcast_rpc_address Alternator's non-standard "/localnodes" HTTP request returns a list of live nodes on this DC, to consider for load balancing. The returned node addresses should be external IP addresses usable by the clients. Scylla has a configuration parameter - broadcast_rpc_address - which defines for a node an external IP address. If such a configuration exists, we need to use those external IP addresses, not the internal ones. Finding these broadcast_rpc_address of all nodes is easy, because the gossiper already gossips them. This patch also tests the new feature: 1. The existing single-node test is extended to verify that without broadcast_rpc_address we get the usual IP address. 2. A new two-node test is added to check that when broadcast_rpc_address is configured, we get that address and not the usual internal IP addresses. Fixes #18711. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-06-30 18:38:15 +03:00
Gleb Natapov	3f136cf2eb	test: add test that checks that local address cannot expire between join request placement and its processing	2024-06-30 15:52:23 +03:00
Pavel Emelyanov	d034cde01f	Merge 'build: update C++ standard to C++23' from Avi Kivity Switch the C++ standard from C++20 to C++23. This is straightforward, but there are a few fallouts (mostly due to std::unique_ptr that became constexpr) that need to be fixed first. Internal enhancement - no backport required Closes scylladb/scylladb#19528 * github.com:scylladb/scylladb: build: switch to C++23 config: avoid binding an lvalue reference to an rvalue reference readers: define query::partition_slice before using it in default argument test: define table_for_tests earlier compaction: define compaction_group::table_state earlier compaction: compaction_group: define destructor out-of-line compaction_manager: define compaction_manager::strategy_control earlier	2024-06-28 18:02:33 +03:00
Piotr Dulikowski	f00c4eaf72	Merge '[test.py] add --extra-scylla-cmdline-options argument for test.py' from Artsiom Mishuta this PR has 2 commits - [test: pass Scylla extra CMD args from test.py args](`6b367a04b5`) - [test: adjust scylla_cluster.merge_cmdline_options behavior](`c60b36090a`) the main goal is to solve [test.py: provide an easy-to-remember, univeral way to run scylla with trace level logging](https://github.com/scylladb/scylladb/issues/14960) issue but also can be used to easily apply additional arguments for all UnitTests and PythonTests on the fly from the test.py CMD Closes scylladb/scylladb#19509 * github.com:scylladb/scylladb: test: adjust scylla_cluster.merge_cmdline_options behavior test: pass scylla extra CMD args from test.py args	2024-06-28 11:11:29 +02:00
Piotr Smaron	88eda47f13	cql: forbid switching from tablets to vnodes in ALTER KS This check is already in place, but isn't fully working, i.e. switching from a vnode KS to a tablets KS is not allowed, but this check doesn't work in the other direction. To fix the latter, `ks_prop_defs::get_initial_tablets()` has been changed to handle 3 states: (1) init_tablets is set, (2) it was skipped, (3) tablets are disabled. These couldn't fit into std::optional, so a new local struct to hold these states has been introduced. Callers of this function have been adjusted to set init_tablets to an appropriate value according to the circumstances, i.e. if tablets are globally enabled, but have been skipped in the CQL, init_tablets is automatically set to 0, but if someone executes ALTER KS and doesn't provide tablets options, they're inherited from the old KS. I tried various approaches and this one resulted in the least lines of code changed. I also provided testcases to explain how the code behaves. Fixes: #18795 Closes scylladb/scylladb#19368	2024-06-28 11:41:41 +03:00
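The three states mentioned in this commit message (set, skipped, disabled) can be sketched as below. All names here (`tablets_request`, `init_tablets_spec`, `keyspace_state`, `apply_alter`) are hypothetical, and the sketch models only the ALTER KS case as the message describes it: skipped options are inherited, and switching in either direction is rejected.

```cpp
#include <cassert>

// Hypothetical names; only the three states come from the commit message.
enum class tablets_request { set, skipped, disabled };

struct init_tablets_spec {
    tablets_request kind;
    unsigned count = 0;  // meaningful only when kind == tablets_request::set
};

struct keyspace_state {
    bool uses_tablets;
    unsigned initial_tablets;  // ignored for vnode keyspaces
};

// Returns true and fills `out` when the ALTER is acceptable.
bool apply_alter(const keyspace_state& old_ks, init_tablets_spec req, keyspace_state& out) {
    switch (req.kind) {
    case tablets_request::skipped:
        out = old_ks;  // no tablets options given: inherit from the old keyspace
        return true;
    case tablets_request::set:
        if (!old_ks.uses_tablets) {
            return false;  // vnodes -> tablets is rejected
        }
        out = keyspace_state{true, req.count};
        return true;
    case tablets_request::disabled:
        if (old_ks.uses_tablets) {
            return false;  // tablets -> vnodes is rejected
        }
        out = old_ks;
        return true;
    }
    return false;
}
```

A plain `std::optional<unsigned>` can express only "set" and "absent", which is presumably why the message says the three states did not fit into it.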
Piotr Dulikowski	f9abe52d3b	Merge 'test: auth: add random tag to resources in test_auth_v2_migration' from Marcin Maliszkiewicz Those tests are sometimes failing on CI and we have two hypotheses: 1. Something is wrong with the consistency of statements 2. Interruption from another test run (e.g. same queries performed concurrently or data remaining after a previous run) To exclude or confirm 2. we add a random marker to avoid potential collisions; in such a case it will be clearly visible that wrong data comes from a different run. Related scylladb/scylladb#18931 Related scylladb/scylladb#18319 backport: no, just a test fix Closes scylladb/scylladb#19484 * github.com:scylladb/scylladb: test: auth: add random tag to resources in test_auth_v2_migration test: extend unique_name with random suffix	2024-06-27 17:35:14 +02:00
Avi Kivity	e5807555bd	test: define table_for_tests earlier C++23 made std::unique_ptr constexpr. A side effect of this (presumably) is that the compiler compiles it more eagerly, requiring the full definition of the class in std::make_unique, while it previously was content with finding the definition later. One victim of this change is table_for_tests; define it earlier to build with C++23.	2024-06-27 17:54:12 +03:00
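The ordering constraint described here can be sketched with a holder class whose `std::unique_ptr` member refers to a type that is only forward-declared at first. The names `table_like` and `fixture` are hypothetical; the point of the sketch is that the constructor, the destructor and the `std::make_unique` call are all placed after the full class definition, which builds under both C++20 and C++23.

```cpp
#include <cassert>
#include <memory>

class table_like;  // hypothetical type, forward declaration only

class fixture {
    std::unique_ptr<table_like> _table;
public:
    fixture();
    ~fixture();  // declared here, defined once table_like is complete
    int value() const;
};

// The full definition comes before any code that creates or destroys the object.
class table_like {
public:
    int v = 42;
};

fixture::fixture() : _table(std::make_unique<table_like>()) {}
fixture::~fixture() = default;  // out-of-line: the deleter sees a complete type
int fixture::value() const { return _table->v; }

// Small usage example.
inline void example() {
    fixture f;
    assert(f.value() == 42);
}
```

Defining the destructor out of line, as one of the related commits in this series does for `compaction_group`, follows the same idea: code that needs the complete type is moved to a point where the type is fully defined.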
Botond Dénes	b4f3809ad2	test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency	2024-06-27 09:57:11 -04:00

11801 Commits