This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.
This helps ensure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped (a sketch of the guard follows the commit list below).
* Enhancement, no backport needed
Closes scylladb/scylladb#23113
* github.com:scylladb/scylladb:
main: stop system keyspace
system_keyspace: call shutdown from stop
system_keyspace: shutdown: allow calling more than once
database, compaction_manager, large_data_handler: use pluggable<system_keyspace>
utils: add class pluggable
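For illustration, a minimal sketch of what such a pluggable<> guard could look like, assuming a seastar::gate is used for draining; the names and interface here are guesses, not the actual utils/pluggable.hh API:

```cpp
#include <seastar/core/gate.hh>
#include <seastar/core/future.hh>

// Hypothetical sketch of a pluggable<> guard. A user takes a permit for
// each async call; unplug() waits until all permits are released.
template <typename Service>
class pluggable {
    Service* _service = nullptr;
    seastar::gate _gate;
public:
    class permit {
        Service* _service;
        seastar::gate::holder _holder;
    public:
        permit(Service* s, seastar::gate::holder h)
            : _service(s), _holder(std::move(h)) {}
        explicit operator bool() const noexcept { return _service != nullptr; }
        Service* operator->() const noexcept { return _service; }
    };

    void plug(Service& s) noexcept { _service = &s; }

    // Returns a disengaged permit once the service has been unplugged.
    permit get() {
        if (!_service || _gate.is_closed()) {
            return permit(nullptr, seastar::gate::holder());
        }
        return permit(_service, _gate.hold());
    }

    // Unplugs the service and drains all outstanding permits.
    seastar::future<> unplug() {
        _service = nullptr;
        return _gate.close();
    }
};
```

Callers would check the permit before use, so once sys_ks is unplugged, new calls fail fast while in-flight ones are drained by unplug().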
"
This series starts the conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host ids. There are also some unrelated cleanups that
were done while working on the series. One of them is dropping code
related to the old shadow round. We replaced the shadow round with an explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588,
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.
I tested manually that old and new nodes can co-exist in the same
cluster.
"
* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
gossiper: drop unneeded code
gossiper: move _expire_time_endpoint_map to host_id
gossiper: move _just_removed_endpoints to host id
gossiper: drop unused get_msg_addr function
messaging_service: change connection dropping notification to pass host id only
messaging_service: pass host id to remove_rpc_client in down notification
treewide: pass host id to endpoint_lifecycle_subscriber
treewide: drop endpoint life cycle subscribers that do nothing
load_meter: move to host id
treewide: use host id directly in endpoint state change subscribers
treewide: pass host id to endpoint state change subscribers
gossiper: drop deprecated unsafe_assassinate_endpoint operation
storage_service: drop unused code in handle_state_removed
treewide: drop endpoint state change subscribers that do nothing
gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
gossiper: start using host ids to send messages earlier
messaging_service: add temporary address map entry on incoming connection
topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
storage_proxy: drop unused template
...
Draining hints may occur in one of two scenarios:
* a node leaves the cluster and the local node drains all of the hints
saved for that node,
* the local node is being decommissioned.
Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node has started draining
hints saved for another node and is now being shut down.
There are two reasons for that:
* Generally, in situations like that, we'd like to be able to shut down
nodes as fast as possible. The data stored in the hints won't
disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
have the highest priority, which is reflected in the scheduling groups we
use as well as in the explicitly enforced throughput. If there is a large
number of hints to be replayed, it might affect our tests.
It's already happened, see: scylladb/scylladb#21949.
To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To make up for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.
That should ensure all data is retained, while possibly speeding up
the shutdown process.
There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have had a chance to replay the hints it might have replayed
before these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.
We also provide tests verifying that the implementation works as intended.
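A rough sketch of the new stop semantics, using a seastar::abort_source; the names (drain_mode, send_remaining_hints) are illustrative, not the actual hinted-handoff API:

```cpp
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

using namespace std::chrono_literals;

enum class drain_mode { decommission, shutdown };

struct endpoint_manager_sketch {
    seastar::abort_source as;
    int remaining_hints = 1000; // stand-in for the hint queue

    // Replays hints one by one, stopping early if aborted.
    seastar::future<> send_remaining_hints() {
        while (remaining_hints > 0 && !as.abort_requested()) {
            co_await seastar::sleep(10ms); // placeholder for one replay
            --remaining_hints;
        }
    }

    seastar::future<> stop(drain_mode mode) {
        if (mode == drain_mode::shutdown) {
            // Plain shutdown: cancel draining right away. The hints stay
            // on disk, and draining resumes when the manager is recreated
            // (including for endpoints that have already left the cluster).
            as.request_abort();
        }
        // Decommission: the drain runs to completion so no data is lost.
        return send_remaining_hints();
    }
};
```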
Fixes scylladb/scylladb#21949
Closes scylladb/scylladb#22811
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
1) tablets have the same size
2) shards have the same capacity
That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.
After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
One consequence is that, to be able to balance, the load balancer needs
to know every node's capacity, which is collected with the same
RPC that collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down, due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless given
the above, so it is not implemented.
Also, the per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below the limit and nodes with more capacity will
be slightly above it. This shouldn't be a significant problem in practice; we could
compensate for it by increasing the limit.
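Conceptually, the new balancing criterion looks like this (illustrative names only; the real balancer code differs):

```cpp
#include <cstdint>

struct shard_stats {
    uint64_t tablet_count;
    uint64_t capacity_bytes; // collected via the load_stats RPC
};

// Estimated storage utilization, assuming each tablet has the
// cluster-average size (the remaining assumption mentioned above).
double utilization(const shard_stats& s, uint64_t avg_tablet_size_bytes) {
    return double(s.tablet_count) * double(avg_tablet_size_bytes)
         / double(s.capacity_bytes);
}

// The balancer now moves tablets from shards with higher utilization()
// to shards with lower utilization(), instead of equalizing tablet_count.
```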
Refs #23042
Closes scylladb/scylladb#23079
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
Intended for testing, or hot-fixing out-of-space issues in production.
The tablet load balancer uses this information for determining per-shard load,
so reducing capacity will cause tablets to be migrated away from the node.
To allow safe plugging and unplugging of the system_keyspace.
This patch follows up on 917fdb9e53
(more specifically - f9b57df471),
since just keeping a shared_ptr<system_keyspace> doesn't prevent
stopping the system_keyspace shards, while using the `pluggable`
interface allows safe draining of outstanding async calls
on shutdown, before stopping the system_keyspace.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Range scans are expected to go through lots of tombstones; no need to
spam the logs about this. The tombstone warning log is demoted to debug
level - if somebody wants to see it, they can bump the logger to debug
level.
Fixes: https://github.com/scylladb/scylladb/issues/23093
Closes scylladb/scylladb#23094
This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and on the metrics-collecting server (Prometheus or similar).
The idea in this series is to create an equivalent of levels with a label.
First, label a subset of the metrics used by the dashboards.
Second, the per-table metrics that are now off by default will be marked with a different label.
The specific optional features CDC, CAS, and Alternator now have dedicated labels.
This will allow users to disable all metrics of features that are not in use.
All the rest of the metrics are left unlabeled.
Without any changes, users get the same metrics they are getting today.
But you can pass `__level=1` and get only the metrics the dashboards need. That reduces the metric count by between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies).
The labels themselves are not reported, relying on the Seastar feature of hiding labels that start with an underscore.
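The mechanism uses Seastar's metric labels; a sketch of the pattern (the service class and metric names here are made up for illustration):

```cpp
#include <seastar/core/metrics.hh>
#include <cstdint>

namespace sm = seastar::metrics;

// Marks a metric as part of the basic (dashboard) level. Because the
// label name starts with an underscore, Seastar hides the label itself
// from the exported metrics.
const sm::label_instance basic_level("__level", "1");

class my_service_sketch {           // hypothetical service, for illustration
    sm::metric_groups _metrics;
    uint64_t _requests = 0;
    uint64_t _retries = 0;
public:
    my_service_sketch() {
        _metrics.add_group("my_service", {
            // Dashboard metric: kept when only level-1 metrics are requested.
            sm::make_counter("requests", _requests,
                sm::description("Requests served"), {basic_level}),
            // Unlabeled metric: still reported by default.
            sm::make_counter("retries", _retries,
                sm::description("Request retries")),
        });
    }
};
```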
Closes scylladb/scylladb#12246
* github.com:scylladb/scylladb:
db/view/view.cc: label metrics with basic_level
transport/server.cc: label metrics with basic_level
service/storage_proxy.cc: label metrics with basic_level and cas
main.cc: label metrics with basic_level
streaming/stream_manager.cc: label metrics with basic_level
repair/repair.cc: label metrics with basic_level
service/storage_service.cc: label metrics with basic_level
gms/gossiper.cc: label metrics with basic_level
replica/database.cc: label metrics with basic_level
cdc/log.cc: label metrics with basic_level and cdc
alternator: label metrics with basic_level and alternator
row_cache.cc: label metrics with basic_level
query_processor.cc: label metrics with basic_level
sstables.cc: label metrics with basic_level
utils/logalloc.cc: label metrics with basic_level
commitlog.cc: label metrics with basic_level
compaction_manager.cc: label metrics with basic_level
Adding the __level and features labels
Refs #22916
Adds an "enable_session_tickets" option to TLS setup for our server
endpoints (not documented for internode RPC, as we don't handle it
on the client side there), allowing enabling of TLS3 client session
ticket, i.e. quicker reconnect.
Session tickets are valid within a time frame or until a node
restarts, whichever comes first.
v2:
Use "TLS1.3" in help message
Closes scylladb/scylladb#22928
The following metrics will be marked with basic_level label:
scylla_commitlog_segments
scylla_commitlog_allocating_segments
scylla_commitlog_unused_segments
scylla_commitlog_alloc
scylla_commitlog_flush
scylla_commitlog_bytes_written
scylla_commitlog_pending_allocations
scylla_commitlog_requests_blocked_memory
scylla_commitlog_flush_limit_exceeded
scylla_commitlog_disk_total_bytes
scylla_commitlog_disk_active_bytes
scylla_commitlog_disk_slack_end_bytes
Replace boost::accumulate() calls with std::ranges facilities. This
change reduces external dependencies and modernizes the codebase.
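For illustration, the typical shape of such a conversion (assuming C++23's std::ranges::fold_left; actual call sites vary):

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> sizes{1, 2, 3};

    // Before: auto total = boost::accumulate(sizes, 0);
    // After, with C++23 ranges:
    auto total = std::ranges::fold_left(sizes, 0, std::plus{});

    // The classic iterator-pair algorithm is an equivalent alternative:
    auto total2 = std::accumulate(sizes.begin(), sizes.end(), 0);

    return total == total2 ? 0 : 1;
}
```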
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#23062
Fixes #22314
Adds the expected schema extensions to the tools extension set (if used). Also uses the source config extensions in the schema loader instead of a temporary one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in it.
Bundles the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static registration via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.
Closes scylladb/scylladb#22327
* github.com:scylladb/scylladb:
tools: Add standard extensions and propagate to schema load
cql_test_env: Use add all extensions instead of individually
main: Move extensions adding to function
tombstone_gc: Make validate work for tools
class clustering_range is a range of Clustering Key Prefixes implemented
as interval<clustering_key_prefix>. However, due to the nature of
Clustering Key Prefixes, the ordering of clustering_range is complex and
does not satisfy the invariant of interval<>. To be more specific, as a
comment in the interval<> implementation states: “The end bound can never be
smaller than the start bound”. As a range of CKPs violates the invariant,
some algorithms, like intersection(), can return incorrect results (see the
example after the list below).
For more details refer to scylladb#8157, scylladb#21604, scylladb#22817.
This commit:
- Adds a WARNING comment to discourage usage of clustering_range
- Adds WARNING comments to potentially incorrect uses of
interval<clustering_key_prefix> non-trivial methods
- Adds a FIXME comment to an incorrect use of
interval<clustering_key_prefix_view>::deoverlap and WARNING comments
to related interval<clustering_key_prefix_view> misuse.
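To make the pitfall concrete, an illustrative example (not actual Scylla code) of why no single prefix ordering satisfies interval<>'s invariant:

```cpp
// With a clustering key (a, b), a bare prefix (1) denotes the whole set
// of keys it prefixes: as a start bound it stands before all rows with
// a = 1, while as an end bound it stands after all of them. A single
// comparator over prefixes cannot express both meanings. For example:
//
//   range_1 = [ (1),    (1)    ]   // all rows with a = 1
//   range_2 = [ (1, 2), (1, 2) ]   // just the row (1, 2)
//
// A comparator that treats a prefix as equal to every key it prefixes
// sees both ranges as having "equal" bounds, so interval<> algorithms
// such as intersection() cannot tell that range_2 is strictly contained
// in range_1 and can produce incorrect bounds - hence the WARNING and
// FIXME comments rather than relying on the generic interval<> methods.
```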
Closes scylladb/scylladb#22913
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22997
When a backup upload is aborted due to instance shutdown, change the log
level from ERROR to INFO since this is expected behavior. Previously,
`abort_requested_exception` during upload would trigger an ERROR log, causing
test failures since error logs indicate unexpected issues.
This change:
- Catches `abort_requested_exception` specifically during file uploads
- Logs these shutdown-triggered aborts at INFO level instead of ERROR
- Aligns with how `abort_requested_exception` is handled elsewhere in the service
This prevents false test failures while still informing administrators
about aborted uploads during shutdown.
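The pattern, sketched (illustrative code, not the actual backup task):

```cpp
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/util/log.hh>

seastar::logger bk_logger("backup_sketch"); // hypothetical logger

seastar::future<> finish_upload(seastar::future<> upload) {
    try {
        co_await std::move(upload);
    } catch (const seastar::abort_requested_exception&) {
        // Expected during instance shutdown: inform, don't alarm.
        bk_logger.info("upload aborted: shutting down");
    } catch (...) {
        // Anything else is still an unexpected failure.
        bk_logger.error("upload failed: {}", std::current_exception());
        throw;
    }
}
```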
Fixes scylladb/scylladb#22391
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22995
This series achieves two things:
1) changes the default number of tablet replicas per shard to 10 in order to reduce load imbalance between shards
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes https://github.com/scylladb/scylladb/issues/21967
2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count
The per-shard goal is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
The scaling is applied after computing desired tablet count due to
all other factors: per-table tablet count hints, defaults, average tablet size.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
When creating a new table, its tablet count is determined by the tablet
scheduler using the scheduler logic, as if the table were already created.
So any scaling due to the per-shard tablet count goal is reflected immediately
when creating a table. It may, however, still take some time for the system
to shrink existing tables. We don't reject requests to create new tables.
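The scale-down arithmetic, sketched (illustrative names, not the scheduler's actual code):

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>

// If the DC-wide average per-shard tablet count exceeds the goal, each
// contributing table's desired tablet count is scaled down by the same
// factor.
uint64_t scaled_tablet_count(uint64_t desired_count,
                             double avg_tablets_per_shard,
                             double per_shard_goal) {
    double scale = 1.0;
    if (avg_tablets_per_shard > per_shard_goal) {
        scale = per_shard_goal / avg_tablets_per_shard;
    }
    auto target = uint64_t(double(desired_count) * scale);
    // Tablet counts are rounded up to the nearest power of 2, which is
    // why the per-shard goal may be overshot by at most a factor of 2.
    return std::bit_ceil(std::max<uint64_t>(target, 1));
}
```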
Fixes #21458
Closes scylladb/scylladb#22522
* github.com:scylladb/scylladb:
config, tablets: Allow tablets_initial_scale_factor to be a fraction
test: tablets_test: Test scaling when creating lots of tables
test: tablets_test: Test tablet count changes on per-table option and config changes
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
config: Make tablets_initial_scale_factor live-updateable
tablets: load_balancer: Pick initial_scale_factor from config
tablets, load_balancer: Fix and improve logging of resize decisions
tablets, load_balancer: Log reason for target tablet count
tablets: load_balancer: Move hints processing to tablet scheduler
tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
tablets: load_balancer: Determine desired count from size separately from count from options
tablets: load_balancer: Determine resize decision from target tablet count
tablets: load_balancer: Allow splits even if table stats not available
tablets: load_balancer: Extract make_sizing_plan()
tablets: Add formatter for resize_decision::way_type
tablets: load_balancer: Simplify resize_urgency_cmp()
tablets: load_balancer: Keep config items as instance members
locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
tablets: Set default initial tablet count scale to 10
tablets: network_topology_strategy: Coroutinize calculate_initial_tablets_from_topology()
tablets: load_balancer: Extract get_schema_and_rs()
tablets: load_balancer: Drop test_mode
The problem discovered is that there is a time
gap between group0 creation and the raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry reads wrong values
from disk (null) and updates the in-memory variables to wrong values.
During that time gap, the in-memory variables hold wrong values and
can trigger incorrect actions.
This PR removes the variable `_manage_topology_change_kind_from_group0`,
which was used earlier as a workaround for correctly handling the
`topology_change_kind` variable; it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug is
that _manage_topology_change_kind used to block reading from disk and
was only enabled after group0 initialization and after starting the raft
server in the restart case. It was likewise hard to manage
`topology_change_kind` correctly, in a bug-free manner, using
`_manage_topology_change_kind_from_group0`.
After the removal of `_manage_topology_change_kind_from_group0`, careful
management of the `topology_change_kind` variable was needed to maintain
a correct `topology_change_kind` in all scenarios. So this PR also
performs a refactoring to populate all init data to system tables even
before group0 creation (via the `raft_initialize_discovery_leader` function).
Now, because `raft_initialize_discovery_leader` happens before group0
creation, we write mutations directly to system tables instead of issuing
a group0 command. Hence, after group0 creation, the node can read the
correct values from system tables, and correct values are maintained
throughout.
Added a new function, `initialize_done_topology_upgrade_state`, which
takes care of updating the correct upgrade state in system tables before
starting the group0 server. This ensures that the node can read the correct
values from system tables and that correct values are maintained throughout.
By moving the `raft_initialize_discovery_leader` logic to happen before
starting the group0 server, and not as a group0 command after server start, we
also get rid of the potential problem of the init group0 command not being
the first command on the server, hence ensuring full integrity as expected
by the programmer.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#21114
Closes scylladb/scylladb#22484
* https://github.com/scylladb/scylladb:
storage_service: Remove the variable _manage_topology_change_kind_from_group0
storage_service: fix indentation after the previous commit
raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
service/raft: Refactor mutation writing helper functions.
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
Currently the scale is applied after rounding up the tablet count so
that the tablet count per shard is at least 1. In order to be able to use
the scale to increase the tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot the per-shard tablet
replica count.
Example:
4 nodes, -c1, rf=3, initial_tablets_scale=10
Before: initial_tablet_count=20, tablets-per-shard=15
After: initial_tablet_count=14, tablets-per-shard=10.5
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes #21967
In some tests, we explicitly set the initial scale to 1, as some of the
existing tests assume 1 compaction group per shard.
test.py uses a lower default. Having many tablets per shard slows down
certain topology operations like decommission/replace/removenode,
where the running time is proportional to tablet count, not data size,
because the constant cost (latency) of migration dominates. This latency
is due to group0 operations and barriers. This is especially
pronounced in debug mode. The scheduler allows at most 2 migrations per
shard, so this latency becomes a determining factor for decommission
speed.
To avoid this problem in tests, we use a lower default for tablet count per
shard: 2 in debug/dev mode and 4 in release mode. Alternatively, we
could compensate by allowing more concurrency when migrating small
tablets, but there's no infrastructure for that yet.
I observed that with 10 tablets per shard, the debug-mode
topology_custom.mv/test_mv_topology_change starts to time out during
removenode (30 s).
result_set_row is a heavyweight object containing multiple cell types:
regular columns, partition keys, and static values. To prevent expensive
accidental copies, delete the copy constructor and replace it with:
1. A move constructor for efficient vector reallocation
2. An explicit copy() method when copies are actually needed
This change reduces overhead in some non-hot paths by eliminating implicit
deep copies. Please note that previously, in `create_view_from_mutation()`,
we kept a copy of `result_set_row` and then reused `table_rs` for
holding the mutation for `scylla_tables`. Because we no longer copy
the `result_set_row` with this change, in order to avoid invalidating
the `row` after reusing `table_rs` in the outer scope, we define a
new `table_rs` shadowing the one in the outer scope.
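The shape of the change, sketched (the member list is a stand-in, not the real class):

```cpp
#include <cstdint>
#include <vector>

class result_set_row_sketch {
    std::vector<int64_t> _cells; // stand-in for the heavyweight contents
public:
    explicit result_set_row_sketch(std::vector<int64_t> cells)
        : _cells(std::move(cells)) {}

    // No accidental deep copies:
    result_set_row_sketch(const result_set_row_sketch&) = delete;
    result_set_row_sketch& operator=(const result_set_row_sketch&) = delete;

    // Cheap moves keep vector reallocation efficient:
    result_set_row_sketch(result_set_row_sketch&&) noexcept = default;
    result_set_row_sketch& operator=(result_set_row_sketch&&) noexcept = default;

    // Copies are still possible, but only when asked for explicitly:
    result_set_row_sketch copy() const {
        return result_set_row_sketch(_cells);
    }
};
```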
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22741
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22857
Previously, the topology_change_kind variable was being handled using the
_manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs (e.g. in the restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0).
After the removal of _manage_topology_change_kind_from_group0, careful management
of the topology_change_kind variable was needed to maintain a correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation (via
the raft_initialize_discovery_leader function). Now, because raft_initialize_discovery_leader
happens before group0 creation, we write mutations directly to system
tables instead of issuing a group0 command. Hence, after group0 creation, the node
can read the correct values from system tables, and correct values are
maintained throughout.
Added a new function, initialize_done_topology_upgrade_state, which takes
care of updating the correct upgrade state in system tables before starting
the group0 server. This ensures that the node can read the correct values from
system tables and that correct values are maintained throughout.
By moving the raft_initialize_discovery_leader logic to happen before starting
the group0 server, and not as a group0 command after server start, we also get rid
of the potential problem of the init group0 command not being the first command on
the server, hence ensuring full integrity as expected by the programmer.
Fixes: scylladb/scylladb#21114
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages explain, this required changing how Alternator stores materialized view keys - instead of insisting that these keys must be real columns (which is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes #11567.
We also fix the IndexStatus and Backfilling attributes returned by DescribeTable, as DynamoDB API users use this API to discover when a newly added GSI has completed its "backfilling" (what we call "view building") stage. Fixes #11471.
This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.
Closes scylladb/scylladb#21989
* github.com:scylladb/scylladb:
alternator: fix view build on oversized GSI key attribute
mv: clean up do_delete_old_entry
test/alternator: unflake test for IndexStatus
test/alternator: work around unrelated bug causing test flakiness
docs/alternator: adding a GSI is no longer an unimplemented feature
test/alternator: remove xfail from all tests for issue 11567
alternator: overhaul implementation of GSIs and support UpdateTable
mv: support regular_column_transformation key columns in view
alternator: add new materialized-view computed column for item in map
build: in cmake build, schema needs alternator
build: build tests with Alternator
alternator: add function serialized_value_if_type()
mv: introduce regular_column_transformation, a new type of computed column
alternator: add IndexStatus/Backfilling in DescribeTable
alternator: add "LimitExceededException" error type
docs/alternator: document two more unimplemented Alternator features
In c5668d99, a new source file row_cache.cc was added to the `db` target,
but with an extraneous trailing comma. In CMake's target_sources(),
source files should be space-separated - any comma is interpreted as
part of the filename, causing build failures like:
```
CMake Error at db/CMakeLists.txt:2 (target_sources):
Cannot find source file:
row_cache.cc,
```
Fix the issue by removing the trailing comma.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22754
This series extends the table schema with per-table tablet options.
The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions,
when the table size changes.
* New feature, no backport required
Closes scylladb/scylladb#22090
* github.com:scylladb/scylladb:
tablets: resize_decision: get rid of initial_decision
tablet_allocator: consider tablet options for resize decision
tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member
network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0
cql3/create_keyspace_statement: add deprecation warning for initial tablets
test: cqlpy: test_tablets: add tests for per-table tablet options
schema: add per-table tablet options
feature_service: add TABLET_OPTIONS cluster schema feature
This pull request is an implementation of a vector data type similar to the one used by Apache Cassandra.
The patch contains:
- implementation of vector_type_impl class
- necessary functionalities similar to other data types
- support for serialization and deserialization of vectors
- support for Lua and JSON format
- valid CQL syntax for `vector<>` type
- `type_parser` support for vectors
- expression adjustments such as:
- add `collection_constructor::style_type::vector`
- rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector`
- vector type encoding (for drivers)
- unit tests
- cassandra compatibility tests
- necessary documentation
Co-authored-by: @janpiotrlakomy
Fixes https://github.com/scylladb/scylladb/issues/19455
Closes scylladb/scylladb#22488
* github.com:scylladb/scylladb:
docs: add vector type documentation
cassandra_tests: translate tests covering the vector type
type_codec: add vector type encoding
boost/expr_test: add vector expression tests
expression: adjust collection constructor list style
expression: add vector style type
test/boost: add vector type cql_env boost tests
test/boost: add vector type_parser tests
type_parser: support vector type
cql3: add vector type syntax
types: implement vector_type_impl
These unused includes were identified by clang-include-cleaner. After
auditing these source files, all of the reports have been confirmed.
Also took this opportunity to remove an unused namespace alias, and to
add an include which is actually used. Please note,
`std::ranges::pop_heap()` and friends are actually provided by
`<algorithm>`, not `<ranges>`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22716
Before this patch, the regular_column_transformation constructor, which
we use in Alternator GSIs to generate a view key from a regular-column
cell, accepted a cell of any size. As a reviewer (Avi) noticed, very
long cells are possible, well beyond what Scylla allows for keys (64KB),
and because regular_column_transformation stores such values in a
contiguous "bytes" object it can cause stalls.
But allowing oversized attributes creates an even more acute problem:
While view building (backfilling in DynamoDB jargon), if we encounter
an oversized (>64KB) key, the view building step will fail and the
entire view building will hang forever.
This patch fixes both problems by adding to regular_column_transformation's
constructor a check that, if the cell is 64KB or larger, returns an empty
value for the key. This causes the backfilling to silently skip
this item, which is what we expect to happen (backfilling cannot do
anything to fix or reject the pre-existing items in the base table).
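The guard itself is conceptually simple; a sketch under the assumption of a plain byte container (illustrative, not the real regular_column_transformation code):

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using bytes = std::vector<int8_t>; // stand-in for Scylla's bytes type

constexpr size_t max_view_key_size = 64 * 1024;

// Returns an empty (disengaged) value for oversized cells so that view
// building silently skips the item instead of failing forever.
std::optional<bytes> view_key_value(const bytes& cell_value) {
    if (cell_value.size() >= max_view_key_size) {
        return std::nullopt;
    }
    return cell_value;
}
```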
A test test_gsi_updatetable.py::test_gsi_backfill_oversized_key is
introduced to reproduce this problem and its fix. The test adds a 65KB
attribute to a base table, and then adds GSIs to this table with this
attribute as its partition key or its sort key. Before this patch, the
backfilling process for the new GSIs hangs, and never completes.
After this patch, the backfilling completes and as expected contains
other base-table items but not the item with the oversized attribute.
The new test also passes on DynamoDB.
However, while implementing this fix I realized that issue #10347 also
exists for GSIs. Issue #10347 is about the fact that DynamoDB limits
partition key and sort key attributes to 2048 and 1024 bytes,
respectively. In the fix described above we only handled the acute case
of lengths above 64 KB, but we should actually skip items whose GSI
keys are over 2048 or 1024 bytes - not 64KB. This extra checking is
not handled in this patch, and is part of a wider existing issue:
Refs #10347
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function do_delete_old_entry() had an if() which was supposedly for
the case of collection column indexing, and which our previous patch
that improved this function to support caller-specified deletion_ts
left behind.
As a reviewer noticed, the new tombstone-setting code was in an "else"
of that existing if(), and it wasn't clear what happens if we get to that
else in the collection column indexing case. So I reviewed the code, added
breakpoints, and realized that, in fact, do_delete_old_entry() is never
called for the collection-indexing case, which has its own
update_entry_for_computed_column() which view_updates::generate_update()
calls instead of the do_delete_old_entry() function and its friends.
So it appears that do_delete_old_entry() doesn't need that special
case at all, which simplifies it.
We should eventually simplify this code further. In particular, the
function generate_update() already knows the key of the rows it
adds or deletes so do_delete_old_entry() and its friends don't need
to call get_view_rows() to get it again. But these simplifications
and others will need to come in a later patch series; this one is
already long enough :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In an earlier patch, we introduced regular_column_transformation,
a new type of computed column that does a computation on a cell in a
regular column in the base table and returns a potentially transformed cell
(value or deletion, timestamp and ttl). In *this* patch, we wire the
materialized view code to support this new kind of computed column that
is usable as a materialized-view key column. This new type of computed
column is not yet used in this patch - this will come in the next
patch, where we will use it for Alternator GSIs.
Before this patch, the logic of deciding when the view update needs
to create a new row or delete an existing one, and which timestamp and ttl
to give to the new row, could depend on one (or two - in Alternator)
cells read from base-table regular columns. In this patch, this logic
is rewritten - the notion of "base table regular columns" is generalized
to the notion of "updatable view key columns" - these are view key
columns that an update may change - because they really are base regular
columns, or a computed function of one (regular_column_transformation).
In some sense, the new code is easier to understand - there is no longer
a separate "compute_row_marker()" function, rather the top-level
generate_update() is now in charge of finding the "updatable view key
columns" and calculate the row marker (timestamp and ttl) as part
of deciding what needs to be done.
But unfortunately the code still has separate code paths for "collection
secondary indexing", and also for old-style column_computation (basically,
only token_column_computation). Perhaps in the future this can be further
simplified.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the patches that follow, we want Alternator to be able to use as a
key for a materialized view (GSI) not a real column from the schema,
but rather an attribute value deserialized from a member of the ":attrs"
map.
For this, we need the ability for a materialized view to define a key
column which is computed as a function of a real column (":attrs").
We already have an MV feature which we called "computed column"
(column_computation), but it is wholly inadequate for this job:
column_computation can only take a partition key and produce a value -
while we need it to take a regular column (one member of ":attrs"),
not just the partition key, and return a cell - value or deletion,
timestamp and TTL.
So in this patch we introduce a new type of computed column, which we
called "regular_column_transformation" since it intends to perform some
sort of transformation on a single column (or more accurately, a single
atomic cell). The limitation that this function transforms a single
column only is important - if we had a function of multiple columns,
we wouldn't know which timestamp or ttl it should use for the result
if the two columns had different timestamps or TTLs.
The new class isn't wired to anything yet: The MV code cannot handle
it yet, and the Alternator code will not use it yet.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Unlike with vnodes, each tablet is served only by a single
shard, and it is associated with a memtable that, when
flushed, creates sstables whose token range is confined
to the tablet owning them.
On one hand, this allows for far better agility and elasticity,
since migration of tablets between nodes or shards does not
require rewriting most if not all of the sstables, as is required
with vnodes (at the cleanup phase).
On the other hand, having too few tablets might limit performance,
due to the table not being served by all shards or due to imbalance
between shards caused by quantization. The number of tablets per table has to be
a power of 2 with the current design, and when divided by the
number of shards, some shards will serve N tablets, while others
may serve N+1; when N is small, (N+1)/N may be significantly
larger than 1. For example, with N=1, some shards will serve
2 tablet replicas and some will serve only 1, causing an imbalance
of 100%.
Now, simply allocating a lot more tablets for each table may
theoretically address this problem, but practically:
a. Each tablet has memory overhead, and having too many tablets
in the system, with many tables and many tablets for each of them,
may overwhelm the system's memory and cause out-of-memory errors.
b. Too-small tablets cause a proliferation of small sstables
that are less efficient to access, have higher metadata overhead
(due to per-sstable overhead), and might exhaust the system's
open file-descriptor limits.
The options introduced in this change can help the user tune
the system in two ways:
1. Sizing the table to prevent unnecessary tablet splits
and migrations. This can be done when the table is created,
or later on, using ALTER TABLE.
2. Controlling min_per_shard_tablet_count to improve
tablet balancing, for hot tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound read to run concurrently will just
result in time-sharing, with both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.
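The mechanism amounts to a counting semaphore sized by the config item; a sketch (illustrative names, not the actual reader-concurrency code):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/util/noncopyable_function.hh>

// New default: 2 concurrent CPU-bound reads instead of 1.
seastar::semaphore cpu_bound_reads{2};

seastar::future<> run_cpu_bound_read(
        seastar::noncopyable_function<seastar::future<>()> read) {
    // With 2 units, one slow read can no longer block a quick read queued
    // behind it, at the cost of some time-sharing between the two.
    auto units = co_await seastar::get_units(cpu_bound_reads, 1);
    co_await read();
}
```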
Fixes: #22450
Closes scylladb/scylladb#22679
This PR enhances the internode_compression configuration option in two ways:
1. Add validation for option values
Previously, we silently defaulted to 'none' when given invalid values. Now we
explicitly validate against the three supported values (all, dc, none) and
reject invalid inputs. This provides better error messages when users
misconfigure the option (see the sketch after this list).
2. Fix documentation rendering
The help text for this option previously used C++ escape sequences which
rendered incorrectly in Sphinx-generated HTML. We now use bullet points with
'*' prefix to list the available values, matching our documentation style
for other config options. This ensures consistent rendering in both CLI
and HTML outputs.
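For item 1, the validation boils down to a check like the following sketch (illustrative; the real code hooks into Scylla's config machinery rather than a free function):

```cpp
#include <stdexcept>
#include <string>

void validate_internode_compression(const std::string& value) {
    if (value != "all" && value != "dc" && value != "none") {
        throw std::invalid_argument(
            "invalid internode_compression value '" + value +
            "': expected one of: all, dc, none");
    }
}
```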
Note: The current documentation format puts type/default/liveness information
in the same bullet list as option values. This affects other config options
as well and will need to be addressed in a separate change.
---
This improves the handling of invalid option values and improves the doc rendering, neither of which is critical; hence, no need to backport.
Closes scylladb/scylladb#22548
* github.com:scylladb/scylladb:
config: validate internode_compression option values
config: start available options with '*'