scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 04:06:59 +00:00

Author	SHA1	Message	Date
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Nadav Har'El	3e16b80014	Merge 'Reject create table with compact storage' from Benny Halevy As discussed in https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813, compact storage tables are deprecated. Yet, there's is nothing in the code that prevents users from creating such tables. This patch adds a live-updateable config option: `enable_create_table_with_compact_storage`, set to `false` by default, that require users to opt-in in order to create new tables WITH COMPACT STORAGE. Refs scylladb/scylladb#12263, scylladb/scylladb#16375 * Since this guardrail is an enhancement, no backport is needed Closes scylladb/scylladb#16403 * github.com:scylladb/scylladb: docs: ddl: document the deprecation of compact tables test: enable_create_table_with_compact_storage for tests that need it config: add enable_create_table_with_compact_storage	2025-01-20 22:02:02 +02:00
Asias He	8208688178	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22034	2025-01-20 16:43:21 +02:00
Benny Halevy	f3ab00e61c	test: enable_create_table_with_compact_storage for tests that need it Now enable_create_table_with_compact_storage can be set to `false` by default in db/config. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-20 08:14:37 +02:00
Benny Halevy	0110eb0506	config: add enable_create_table_with_compact_storage As discussed in https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813, compact storage tables are deprecated. Yet, there's is nothing in the code that prevents users from creating such tables. This patch adds a live-updateable config option: `enable_create_table_with_compact_storage` that require users to opt-in in order to create new tables WITH COMPACT STORAGE. The option is currently set to `true` by default in db/config to reduce the churn to tests and to `false` in scylla.yaml, for new clusters. TODO: once regressions tests that use compact storage are converted to enable the option, change the default in db/config to false. A unit test was added to test/cql-pytest that checks that the respective cql query fails as expected with the default option or when it is explicitly set to `false`, and that the query succeeds when the option is set to `true`. Note that `check_restricted_table_properties` already returns an optional warning, but it is only logged but not returned in the `prepared_statement`. Fixing that is out of the scope of this patch. See https://github.com/scylladb/scylladb/issues/20945 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-20 08:03:25 +02:00
Piotr Dulikowski	6aa962f5f4	Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski Introduces a comprehensive audit system to track database operations for security and compliance purposes. This change includes: Core Components: - New audit subsystem for logging database operations - Service level integration for proper resource management - CQL statement tracking with operation categories - Login process integration for tenant management Key Features: - Configurable audit logging (syslog/table) - Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN) - Selective auditing by keyspace/table - Password sanitization in audit logs - Service level shares support (1-1000) for workload prioritization - Proper lifecycle management and cleanup I ran the dtests for audit (manually enabled) and they pass. The in-repo tests pass. Notably, there should be no non-whitespace changes between this and scylla-enterprise Fixes scylladb/scylla-enterprise#4999 Closes scylladb/scylladb#22147 * github.com:scylladb/scylladb: audit: Add shares support to service level management audit: Add service level support to CQL login process audit: Add support to CQL statements audit: Integrate audit subsystem into Scylla main process audit: Add documentation for the audit subsystem audit: Add the audit subsystem	2025-01-17 13:14:55 +01:00
Avi Kivity	d6f7f873d0	utils: config_file: don't use extern fully specialized variable templates Declaring-but-not-defining a fully specialized template is a great way to cut dependencies between users and providers, but unfortunately not supported for variable templates. Clang 18 does support it, but apparently it is a misinterpretation of the standard, and was removed in clang 19. We started using this non-feature in `7ed89266b3`. The fix is to use function templates. This is more verbose as each specialization needs to define a static variable to return, but is fully supported. Closes scylladb/scylladb#22299	2025-01-17 11:06:50 +03:00
Paweł Zakrzewski	384641194a	audit: Add the audit subsystem This change introduces a new audit subsystem that allows tracking and logging of database operations for security and compliance purposes. Key features include: - Configurable audit logging to either syslog or a dedicated system table (audit.audit_log) - Selective auditing based on: - Operation categories (QUERY, DML, DDL, DCL, AUTH, ADMIN) - Specific keyspaces - Specific tables - New configuration options: - audit: Controls audit destination (none/syslog/table) - audit_categories: Comma-separated list of operation categories to audit - audit_tables: Specific tables to audit - audit_keyspaces: Specific keyspaces to audit - audit_unix_socket_path: Path for syslog socket - audit_syslog_write_buffer_size: Buffer size for syslog writes The audit logs capture details including: - Operation timestamp - Node and client IP addresses - Operation category and query - Username - Success/failure status - Affected keyspace and table names	2025-01-15 11:10:35 +01:00
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identifier by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Avi Kivity	814942505f	Merge 'Introduce Encryption-at-Rest (EAR) for sstables and commitlog' from Calle Wilund Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631 EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data. Introduces OpenSSL based file level encrypted storage, managed via a set of providers ranging from local files to cloud KMS providers. For a more comprehensive explanation, see the included docs (or if possible, original source tree). Manual bulk merge of EAR feature from enterprise repo to main scylla repo. Breaks some features apart, but main EAR is still a humongous commit, because to separate this I would have to mess with code incrementally, adding time and risk. This PR includes the local file gen tool, tests and also p11 validation. Note: CI will not execute the full tests unless master CI is set to provide the same environment as the enterprise one. Not sure about the status of this ATM. Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality will be enabled, otherwise not. Closes scylladb/scylladb#22233 * github.com:scylladb/scylladb: docs: Add EAR docs main/build: Add p11-kit and initialize tools: Add local-file-key-generator tool tests: Add EAR tests tmpdir: shorten test tempdir path EAR: port the ear feature from enterprise cql_test_env: Add optional query timeout schema/migration_manager: Add schema validate sstables: add get_shared_components accessor config/config_file: Add exports and definitions of config_type_for<>	2025-01-12 16:10:46 +02:00
Benny Halevy	8d2ff8a915	utils: add disk_space_monitor Instantiated only on shard 0. Currently, only subscribe from unit test Manual unit test using loop mount was added. Note that the test requires sudo access and root access to /dev/loop, so it cannot run in rootless podman instance, and it'd fail with Permission denied. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21523	2025-01-12 14:51:15 +02:00
Piotr Smaron	288f9b2b15	Introduce LDAP role manager & saslauthd authenticator This PR extends authentication with 2 mechanisms: - a new role_manager subclass, which allows managing users via LDAP server, - a new authenticator, which delegates plaintext authentication to a running saslauthd daemon. The features have been ported from the enterprise repository with their test.py tests and the documentation as part of changing license to source available. Fixes: scylladb/scylla-enterprise#5000 Fixes: scylladb/scylla-enterprise#5001 Closes scylladb/scylladb#22030	2025-01-12 14:50:29 +02:00
Calle Wilund	7ed89266b3	config/config_file: Add exports and definitions of config_type_for<> Required for implementors. Other than config.cc.	2025-01-08 12:50:03 +00:00
Michał Chojnowski	9f639b176f	db/config: increase the default value of internode_compression_zstd_min_message_size from 0 to 1024 Usually, the smaller the messsage, the higher the CPU cost per each network byte saved by compression, so it often makes sense to reserve heavier compression for bigger messages (where it can make the biggest impact for a given CPU budget) and use ligher compression for smaller messages. There is a knob -- internode_compression_zstd_min_message_size -- which excludes RPC messages below certain size from being compressed with zstd. We arbitrarily set its default to 0 bytes before. Now we want to arbitrarily set it to 1024 bytes. This is based purely on intuition and isn't backed by any solid data. Fixes scylladb/scylla-enterprise#4731 Closes scylladb/scylla-enterprise#4990 Closes scylladb/scylladb#22204	2025-01-07 18:14:01 +02:00
Wojciech Mitros	d04f376227	mv: add an experimental feature for creating views using tablets We still have a number of issues to be solved for views with tablets. Until they are fixed, we should prevent users from creating them, and use the vnode-based views instead. This patch prepares the feature for enabling views with tablets. The feature is disabled by default, but currently it has no effect. After all tests are adjusted to use the feature, we should depend on the feature for deciding whether we can create materialized views in tablet-enabled keyspaces. The unit tests are adjusted to enable this feature explicitly, and it's also added to the scylla sstable tool config - this tool treats all tables as if they were tablet-based (surprisingly, with SimpleStrategy), so for it to work on views, the new feature must be enabled. Refs scylladb/scylladb#21832 Closes scylladb/scylladb#21833	2025-01-07 15:52:36 +01:00
Asias He	d719f423e5	config: Enable enable_small_table_optimization_for_rbno by default Since the problematic dtests are with the enable_small_table_optimization_for_rbno turn off now, we can enable the flag by default. https://github.com/scylladb/scylla-dtest/pull/5383 Refs: #19131 Closes scylladb/scylladb#21861	2025-01-07 16:20:36 +02:00
Michael Litvak	0617564123	db/commitlog: make the commit log hard limit mandatory mark the config parameter --commitlog-use-hard-size-limit as deprecated so the default 'true' is always used, making the hard limit mandatory. Fixes scylladb/scylladb#16471 Closes scylladb/scylladb#21804	2025-01-07 15:03:56 +02:00
Michał Chojnowski	fdb2d2209c	messaging_service: use advanced_rpc_compression::tracker for compression This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`, `dict_sampler` and `dictionary_service` in `main()`, and wires them to each other and to `messaging_service`. `messaging_service` compresses its network traffic with compressors managed by the `advanced_rpc_compression::tracker`. All this traffic is passed as a single merged "stream" through `dict_sampler`. `dictionary_service` has access to `dict_sampler`. On chosen nodes (by default: the Raft leader), it uses the sampler to maintain a random multi-megabyte sample of the sampler's stream. Every several minutes, it copies the sample, trains a compression dictionary on it (by calling zstd's training library via the `alien_worker` thread) and publishes the new dictionary to `system.dicts` via Raft. This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes, which updates the dictionary used by the compressors it manages.	2024-12-27 10:17:58 +01:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Botond Dénes	34a8b492be	Merge 'materialized view: make flow-control maximum delay configurable' from Piotr Dulikowski This pull request is continuation of scylladb/scylladb#20688 - contents of the main commit are the same, the only change is the additional commit with a test. Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays. This hard-coded one maximum second delay was considered huge - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one second latecy for writes wasn't having any noticable affect on slowing down the client. So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before. Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it. The new parameter's help string mentions both these use cases of the parameter. Fixes #18187 This is new functionality, no need to backport to any open source release. Closes scylladb/scylladb#21647 * github.com:scylladb/scylladb: materialized views: test for the MV delay configuration parameter service: add injection for skipping view update backlog materialized view: make flow-control maximum delay configurable	2024-12-16 14:20:33 +02:00
Nadav Har'El	49f11f655c	materialized view: make flow-control maximum delay configurable Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays. This hard-coded one maximum second delay was considered huge - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one second latecy for writes wasn't having any noticable affect on slowing down the client. So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before. Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it. The new parameter's help string mentions both these use cases of the parameter. Fixes #18187 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-12-05 09:51:56 +01:00
Botond Dénes	ccb433d767	Merge 'tasks: add api_task_ttl for tasks started with API' from Aleksandra Martyniuk When users start an operation asynchronously with API, they are expected to check the operation's status. Hence, the status should be kept in task manager for reasonable time after the operation is done. The operations that are started internally usually don't need to stay in task manager for that long. Add api_task_ttl that will be used for tasks started with API. By default it's 1 hour. The time for which non-API tasks stay in task manager isn't changed. Fixes: #21499. Refs: #21425. No backport needed - previous versions may use task_ttl Closes scylladb/scylladb#21505 * github.com:scylladb/scylladb: test: add test to check user_task_ttl tasks: api: move make_task method docs: nodetool: update backup and restore commands docs docs: update task manager docs nodetool: add nodetool tasks user-ttl command node_ops: use user task ttl for node ops virtual task tasks: use user_task_ttl for tasks started by user api: task_manager: add /task_manager/user_ttl to get and set user task ttl tasks: add task_manager::task::is_user_task method tasks: keep updateable_value of task_ttl in task manager db: config: add user_task_ttl_seconds named value	2024-11-27 09:57:57 +02:00
Aleksandra Martyniuk	1bf073704c	db: config: add user_task_ttl_seconds named value Add user_task_ttl_seconds config option and keep the value in task manager. In the following patches tasks started by user will be kept in task manager for user_task_ttl_seconds after they are finished.	2024-11-25 14:16:06 +01:00
Asias He	9ace191616	repair: Enable small table optimization for RBNO bootstrap and decommission The non local strategy system keyspaces usually contain very litte data. All the tables within them have to be repaired for all the token ranges, which could be large in clusters with a large number of nodes. In multiple DC setup, the repair in RBNO is dominated by the network latency. As a result, it takes a long time to repair those tables even if they are almost empty. To speed up the RBNO bootstrap, especially for starting empty clusters, this patch enables small table optimization for RBNO for system tables. We could enable it for small user tables as a follow up. Tests: 1) A 5ms latency is added to simulate cross dc network delay, 256 tokens per node, 10 nodes: - Before topology_custom dev topology_custom.test_boot_time.1 1287.06s - After topology_custom dev topology_custom.test_boot_time.1 12.48s The test shows 100X boot time improvement 2) A SCT test to bootstrap 3 DCs, 3 nodes in each DC. - Before Time to bootstrap = 1h23m - After Time to bootstrap = 13m The test shows 6X bootstrap time improvement Fixes #19131	2024-11-25 13:46:17 +08:00
Kefu Chai	00810e6a01	treewide: include seastar/core/format.hh instead of seastar/core/print.hh The later includes the former and in addition to `seastar::format()`, `print.hh` also provides helpers like `seastar::fprint()` and `seastar::print()`, which are deprecated and not used by scylladb. Previously, we include `seastar/core/print.hh` for using `seastar::format()`. and in seastar 5b04939e, we extracted `seastar::format()` into `seastar/core/format.hh`. this allows us to include a much smaller header. In this change, we just include `seastar/core/format.hh` in place of `seastar/core/print.hh`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21574	2024-11-14 17:45:07 +02:00
Avi Kivity	b58dbe57aa	Merge 'repair: introduce and use buffer size hint for mixed-shard multishard reader' from Botond Dénes Add a buffer hint to the multishard reader. This is an internal hint, used by the multishard reader to provide a hint to the shard reader, on how much data exactly is needed by the multishard reader from the respective shard. This hint allows eliminating extraneous cross-shard round-trips and possible shard reader evict-recreate cycles. Building on this, repair sets its own row buffer size as the max buffer size on the multishard reader, ensuring that the row buffer is filled with the minimum amount of cross-shard round trips and minimal reader recreation. To further eliminate unnecessary evictions, this PR also disables the multishard reader's read-ahead which is a mechanism that was designed to reduce latency for user-reads but it can be too aggressive for repair, causing unnecessary extra congestion on the already struggling streaming semaphores. Refs: https://github.com/scylladb/scylladb/issues/18269 Fixes: https://github.com/scylladb/scylladb/issues/21113 The performance impact was measured with an SCT test, which creates a cluster of 3 nodes with 16 shards, then adds a 4th one with 12 shards. Currently, it is the bootstrap time which is the worse in the case of mixed shard clusters, see below for the improvement measured during bootstrap: \| \| master \| buffer-hint \| metric \| \| ------------ \| ------------- \| ------------- \| --------------------------------------------------- \| \| evictions \| 0.9M \| 93.0K \| scylla_database_paused_reads_permit_based_evictions \| \| read (bytes) \| 9.0T \| 3.9T \| scylla_reactor_aio_bytes_read \| \| read (ops) \| 88.0M \| 33.5M \| scylla_reactor_aio_reads \| \| time \| 56min \| 20min \| N/A \| This is a performance improvement, no backport required. Closes scylladb/scylladb#20815 * github.com:scylladb/scylladb: test/boost/mutation_reader_test: add test for multishard reader buffer hint repair/row_level: disable read-ahead db/config: introduce repair_multishard_reader_enable_read_ahead readers/multishard: implement the read_ahead flag replica/database: make_multishard_streaming_reader(): expose the read_ahead parameter readers/multishard: add read_ahead parameter repair/row_level: set max buffer size on multishard reader replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter db/config: introduce enable_repair_multishard_reader_buffer_hint readers/multishard: multishard_reader: pass hint to shard_reader readers/multishard: shard_reader_v2::fill_reader_buffer(): respect the hint readers/multishard: propagate fill_buffer_hint to shard_reader:fill_reader_buffer() readers/multishard: shard_reader: extract buffer-fill into its own method	2024-11-10 12:55:19 +02:00
Benny Halevy	974b0f2080	feature_service: prevent enabling both tablets and gossip topology changes Tablets require raft consistent topology changes. Therefore, document that they are incompatible in the config help and prevent their usage in `feature_config_from_db_config` Fixes scylladb/scylladb#21075 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-11-07 13:56:59 +02:00
Botond Dénes	a248520201	db/config: introduce repair_multishard_reader_enable_read_ahead Not used yet.	2024-11-07 02:47:54 -05:00
Botond Dénes	3c25e6fcb4	db/config: introduce enable_repair_multishard_reader_buffer_hint Allows enabling/disabling the multishard reader buffer hint optimization. Not wired yet.	2024-11-06 08:51:00 -05:00
Asias He	b3b3e880d3	repair: Reduce hints and batchlog flush The hints and batchlog flush requests are issued to all nodes for each repair request when tombstone_gc repair mode is used. The amount of such flush requests is high when all nodes in the cluster run repair. It is observed it takes a long time, up to 15s, for a repair request to finish such a flush request. To reduce overhead of the flush, each node caches the flush and only executes the real flush when the cahce time has passed. It is safe to do so because the real flush_time is returned. Repair uses the smallest flush_time returned from peers as the repair time. The nice thing about the cache on the receiver side is that all senders can hit the cache. It is better than cache on the sender side. A slightly smaller flush_time compared to the real flush time will be used with the benefits of significantly dropped hints and batchlog flush. The trade-off looks reasonable. Tests: 2 nodes, with 1s batchlog delay: Before: Repair nr_repairs=20 cache_time_in_ms=0 total_repair_duration=40.04245328903198 After: Repair nr_repairs=20 cache_time_in_ms=5000 total_repair_duration=1.252073049545288 Fixes #20259	2024-10-30 11:07:57 +08:00
Botond Dénes	1635525526	db/config: introduce batchlog_replay_cleanup_after_replays Not used yet.	2024-10-30 11:07:57 +08:00
Avi Kivity	73b1f66b70	Revert "Merge 'Allow explicitly enabling or disabling tablets when creating a new keyspace' from Benny Halevy" This reverts commit `c286434e4c`, reversing changes made to `6712fcc316`. The commit causes memtable_test to be very flaky in debug mode. Specifically, subtests test_exceptions_in_flush_on_sstable_open and test_exceptions_in_flush_on_sstable_write).	2024-10-30 00:55:29 +02:00
Benny Halevy	bc62407421	feature_service: prevent enabling both tablets and gossip topology changes Tablets require raft consistent topology changes. Therefore, document that they are incompatible in the config help and prevent their usage in `feature_config_from_db_config` Fixes scylladb/scylladb#21075 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-10-24 10:18:42 +03:00
Nadav Har'El	5fd3177057	Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view udpate workload. The issue of view updates causing increased latencies most often occurs in the following scenario: * we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows * we have any read workload * one replica is slower or is handling more writes due to an imbalance of data distribution * we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it. * each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued * the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario: * the tables were prepared satisfying the conditions above * we use a medium write workload and a very low read workload * the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update * to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond. In the test above: * without the fix, the latency of reads increases over 50s * with the fix, the latency of reads stays below 20ms Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805 The patch is not that small and it isn't fixing a regression, so no backports Closes scylladb/scylladb#20887 * github.com:scylladb/scylladb: test: add test for high view update concurrency causing bad_allocs test: add test for high view update concurrency degrading read latency mv: add a dedicated read concurrency semaphore for view update read before writes	2024-10-22 22:17:23 +03:00
Kefu Chai	6ead5a4696	treewide: move log.hh into utils/log.hh the log.hh under the root of the tree was created keep the backward compatibility when seastar was extracted into a separate library. so log.hh should belong to `utils` directory, as it is based solely on seastar, and can be used all subsystems. in this change, we move log.hh into utils/log.hh to that it is more modularized. and this also improves the readability, when one see `#include "utils/log.hh"`, it is obvious that this source file needs the logging system, instead of its own log facility -- please note, we do have two other `log.hh` in the tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-22 06:54:46 +03:00
Wojciech Mitros	242079d70b	mv: add a dedicated read concurrency semaphore for view update read before writes When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805	2024-10-21 11:02:06 +02:00
Emil Maskovsky	a03e98d6e8	raft: fast tombstone GC for group0-managed tables Set the tombstone GC time for group0-managed tables to the minimal state id of the group0 nodes. The check is being done based on a timer, iterating through each node (according to the group0 topology configuration) and taking the minimum across all nodes. This miminum timestamp is then be used to set the tombstone GC time for the tombstone GC of all the group0-managed tables. Fixes: scylladb/scylla#15607	2024-10-08 21:07:30 +02:00
Sergey Zolotukhin	3b9033423d	utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists - utils::split_comma_separated_list now accepts a reference to sstring instead of a copy to avoid extra memory allocations. Additionally, the results of trimming are moved to the resulting vector instead of being copied. - service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps, parse_node_list and api/storage_service::set_storage_service were changed to use std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as std::vector is a more cache-friendly structure, resulting in better performance.	2024-10-02 11:56:59 +02:00
Kefu Chai	3e84d43f93	treewide: use seastar::format() or fmt::format() explicitly before this change, we rely on `using namespace seastar` to use `seastar::format()` without qualifying the `format()` with its namespace. this works fine until we changed the parameter type of format string `seastar::format()` from `const char*` to `fmt::format_string<...>`. this change practically invited `seastar::format()` to the club of `std::format()` and `fmt::format()`, where all members accept a templated parameter as its `fmt` parameter. and `seastar::format()` is not the best candidate anymore. despite that argument-dependent lookup (ADT for short) favors the function which is in the same namespace as its parameter, but `using namespace` makes `seastar::format()` more competitive, so both `std::format()` and `seastar::format()` are considered as the condidates. that is what is happening scylladb in quite a few caller sites of `format()`, hence ADT is not able to tell which function the winner in the name lookup: ``` /__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous 265 \| return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id()); \| ^~~~~~ /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 4290 \| format(format_string<_Args...> __fmt, _Args&&... __args) \| ^ /__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 143 \| format(fmt::format_string<A...> fmt, A&&... a) { \| ^ ``` in this change, we change all `format()` to either `fmt::format()` or `seastar::format()` with following rules: - if the caller expects an `sstring` or `std::string_view`, change to `seastar::format()` - if the caller expects an `std::string`, change to `fmt::format()`. because, `sstring::operator std::basic_string` would incur a deep copy. we will need another change to enable scylladb to compile with the latest seastar. namely, to pass the format string as a templated parameter down to helper functions which format their parameters. to miminize the scope of this change, let's include that change when bumping up the seastar submodule. as that change will depend on the seastar change. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-11 23:21:40 +03:00
Calle Wilund	238a0236e5	features/config: Add feature for fragmented commitlog entries Hides the functionality behind a cluster feature, i.e. postspones using it until an upgrade is complete etc. This to allow rolling back even with dirty nodes, at least until a cluster is commited. Feature can also be disabled by scylla option, just in case. This will lock it out of whole cluster, but this is probably good, because depending on off or on, certain schema/raft ops might fail or succeed (due to large mutations), and this should probably be equivalent across nodes.	2024-09-03 16:38:28 +00:00
Botond Dénes	52bed81a1e	Merge 'cql3: add option to not unify bind variables with the same name' from Avi Kivity Bind variables in CQL have two formats: positional (`?`) where a variable is referred to by its relative position in the statement, and named (`:var`), where the user is expected to supply a name->value mapping. In `19a6e69001` we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to. However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection. Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the `dialect` and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables. A unit test is added. Fixes #15559 This may be useful to users transitioning from Cassandra, so merits a backport. Closes scylladb/scylladb#19493 * github.com:scylladb/scylladb: cql3: add option to not unify bind variables with the same name cql3: introduce dialect infrastructure cql3: prepared_statement_cache: drop cache key default constructor	2024-09-02 08:34:24 +03:00
Avi Kivity	ea8441dfa3	cql3: add option to not unify bind variables with the same name Bind variables in CQL have two formats: positional (`?`) where a variable is referred to by its relative position in the statement, and named (`:var`), where the user is expected to supply a name->value mapping. In `19a6e69001` we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to. However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection. Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the `dialect` and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables. A unit test is added. Fixes #15559	2024-09-01 17:27:48 +03:00
Patryk Jędrzejczak	22d907e721	treewide: introduce support for zero-token nodes in Raft topology We revive the `join_ring` option. We support it only in the Raft-based topology, as we plan to remove the gossip-based topology when we fix the last blocker - the implementation of the manual recovery tool. In the Raft-based topology, a node can be assigned tokens only once when it joins the cluster. Hence, we disallow joining the ring later, which is possible in Cassandra. The main idea behind the solution is simple. We make the unsupported special case of zero tokens a supported normal case. Nodes with zero tokens assigned are called "zero-token nodes" from now on. From the topology point of view, zero-token nodes are the same as token-owning nodes. They can be in the same states, etc. From the data point of view, they are different. They are not members of the token ring, so they are not present in `token_metadata::_normal_token_owners`. Hence, they are ignored in all non-local replication strategies. The tablet load balancer also ignores them. Topology operations involving zero-token nodes are simplified: - `add` and `replace` finish in the `join_group0` state, so creating a new CDC generation and streaming are skipped, - `removenode` and `decommission` skip streaming, - `rebuild` does not even contact the topology coordinator as there is nothing to rebuild, Also, if the topology operation involves a token-owning node, zero-token nodes are ignored in streaming. Zero-token nodes can be used as coordinator-only nodes, just like in Cassandra. They can handle requests just like token-owning nodes. The main motivation behind zero-token nodes is that they can prevent the Raft majority loss efficiently. Zero-token nodes are group 0 voters, but they can run on much weaker and cheaper machines because they do not replicate data and handle client requests by default (drivers ignore them). For example, if there are two DCs, one with 4 nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes, every DC will contain less than half of the nodes, so we won't lose the majority when any DC dies. Another way of preventing the Raft majority loss is changing the voter set, which is tracked by scylladb/scylladb#18793. That approach can be used together with zero-token nodes. In the example above, if we choose equal numbers of voters in both DCs, then a DC with one zero-token node will be sufficient. However, in the typical setup of 2 DCs with the same number of nodes it is enough to add a DC with only one zero-token node without changing the voter set. Zero-token nodes could also be used as load balancers in the Alternator.	2024-08-29 10:37:07 +02:00
Łukasz Paszkowski	b270097f1f	config: drop reversed_reads_auto_bypass_cache Reverse reads have already been with us for a while, thus this back door option to bypass in-memory data cache for reversed queries can be retired.	2024-08-13 10:02:42 +02:00
Łukasz Paszkowski	80df313f49	config: drop enable_optimized_reversed_reads Reverse reads have already been with us for a while, thus this back door option to read entire paritions forward and reversing them after can be retired.	2024-08-13 10:02:42 +02:00
Nadav Har'El	9eb47b3ef0	Merge 'config: round-trip boolean configuration variables' from Avi Kivity When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in both directions. Not a regression, so a backport isn't strictly necessary. Closes scylladb/scylladb#19792 * github.com:scylladb/scylladb: config: specialize from-string conversion for bool config: wrap boost::lexical_cast<> when converting from strings	2024-07-22 17:53:02 +03:00
Botond Dénes	d3135db457	Merge 'commitlog: Add optional max lifetime parameter to cl instance' from Calle Wilund If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log. Related to #15820 The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc. Closes scylladb/scylladb#15971 * github.com:scylladb/scylladb: commitlog: Make max data lifetime runtime-configurable db::config: Expose commitlog_max_data_lifetime_in_s parameter commitlog: Add optional max lifetime parameter to cl instance	2024-07-22 17:21:33 +03:00
Avi Kivity	0780228aa2	config: specialize from-string conversion for bool The yaml/json representation for bool is true/false, but boost::lexical_cast is 1/0. Specialize bool conversion to accept true/false (for yaml/json compatibilty) and 1/0 (for backward compatibility). This provides round-trip conversion for bool configs in system.config.	2024-07-18 18:38:22 +03:00
Avi Kivity	33eaa61cdd	config: wrap boost::lexical_cast<> when converting from strings Configuration uses boost::lexical_cast to convert strings to native values (e.g. bools/ints). However, boost::lexical_cast doesn't recognize true/false for bool. Since we can't change boost::lexical_cast, replace it with a wrapper that forwards directly to boost::lexical_cast. In the next step, we'll specialize it for bool.	2024-07-18 18:38:19 +03:00

1 2 3 4 5 ...

379 Commits