scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 01:50:35 +00:00

Author	SHA1	Message	Date
Kefu Chai	481397317d	sstables, test: migrate from boost::copy() to std::ranges::copy() Replace boost::copy() with the standard library's std::ranges::copy() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22789	2025-02-11 14:55:25 +03:00
Botond Dénes	51a273401c	Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec This PR converts boost load balancer tests in preparation for load balancer changes which add per-table tablet hints. After those changes, load balancer consults with the replication strategy in the database, so we need to create proper schema in the database. To do that, we need proper topology for replication strategies which use RF > 1, otherwise keyspace creation will fail. Topology is created in tests via group0 commands, which is abstracted by the new `topology_builder` class. Tests cannot modify token_metadata only in memory now as it needs to be consistent with the schema and on-disk metadata. That's why modifications to tablet metadata are now made under group0 guard and save back metadata to disk. Closes scylladb/scylladb#22648 * github.com:scylladb/scylladb: test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario tests: tablets: Set initial tablets to 1 to exit growing mode test: tablets_test: Create proper schema in load balancer tests test: lib: Introduce topology_builder test: cql_test_env: Expose topology_state_machine topology_state_machine: Introduce lock transition	2025-02-10 16:08:41 +02:00
Nadav Har'El	a492e239e3	Merge 'test.py: Add the possibility to run boost and unit tests with pytest ' from Andrei Chekun Add the possibility to run boost and unit tests with pytest test.py should follow the next paradigm - the ability to run all test cases sequentially by ONE pytest command. With this paradigm, to have the better performance, we can split this 1 command into 2,3,4,5,100,200... whatever we want It's a new functionality that does not touch test.py way of executing the boost and unit tests. It supports the main features of test.py way of execution: automatic discovery of modes, repeats. There is an additional requirement to execute tests in parallel: pytest-xdist. To install it, execute `pip install pytest-xdist` To run test with pytest execute `pytest test/boost`. To execute only one file, provide the path filename `pytest test/boost/aggregate_fcts_test.cc` since it's a normal path, autocompletion will work on the terminal. To provide a specific mode, use the next parameter `--mode dev`, if parameter will not be provided pytest will try to use `ninja mode_list` to find out the compiled modes. Parallel execution controlled by pyest-xdist and the parameter `-n 12`. The useful command to discover the tests in the file or directory is `pytest --collect-only -q --mode dev test/boost/aggregate_fcts_test.cc`. That will return all test functions in the file. To execute only one function from the test, you can invoke the output from the previous command, but suffix for mode should be skipped, for example output will be `test/boost/aggregate_fcts_test.cc::test_aggregate_avg.dev`, so to execute this specific test function, please use the next command `pytest --mode dev test/boost/aggregate_fcts_test.cc::test_aggregate_avg` There is a parameter `--repeat` that used to repeat the test case several times in the same way as test.py did. It's not possible to run both boost and unit tests directories with one command, so we need to provide explicitly which directory should be executed. Like this `pytest --mode dev test/unit` or `pytest --mode dev test/boost` Fixes: https://github.com/scylladb/qa-tasks/issues/1775 Closes scylladb/scylladb#21108 * github.com:scylladb/scylladb: test.py: Add possibility to run ldap tests from pytest test.py: Add the possibility to run unit tests from pytest test.py: Add the possibility to run boost test from pytest test.py: Add discovery for C++ tests for pytest test.py: Modify s3 server mock test.py: Add method to get environment variables from MinIO wrapper test.py: Move get configured modes to common lib	2025-02-09 11:56:24 +01:00
Avi Kivity	9712390336	Merge 'Add per-table tablet options in schema' from Benny Halevy This series extends the table schema with per-table tablet options. The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions, when the table size changes. * New feature, no backport required Closes scylladb/scylladb#22090 * github.com:scylladb/scylladb: tablets: resize_decision: get rid of initial_decision tablet_allocator: consider tablet options for resize decision tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member network_topology_strategy: allocate_tablets_for_new_table: consider tablet options network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0 cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0 cql3/create_keyspace_statement: add deprecation warning for initial tablets test: cqlpy: test_tablets: add tests for per-table tablet options schema: add per-table tablet options feature_service: add TABLET_OPTIONS cluster schema feature	2025-02-08 20:32:19 +02:00
Avi Kivity	9db9b0963f	Merge ' reader_concurrency_semaphore: set_notify_handler(): disable timeout ' from Botond Dénes `set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which his typical). Disable the timeout before setting the TTL to prevent premature eviction. Fixes: https://github.com/scylladb/scylladb/issues/22629 Backport required to all active releases, they are all affected. Closes scylladb/scylladb#22701 * github.com:scylladb/scylladb: reader_concurrency_semaphore: set_notify_handler(): disable timeout reader_permit: mark check_abort() as const	2025-02-08 20:05:03 +02:00
Andrei Chekun	8ef840a1c5	test.py: Add the possibility to run boost test from pytest Add the possibility to run boost test from pytest. Boost facade based on code from https://github.com/pytest-dev/pytest-cpp, but enhanced and rewritten to suite better.	2025-02-07 21:40:25 +01:00
Tomasz Grabiec	1854ea2165	test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario This scenario is invoked in a loop in the test_load_balancing_merge_colocation_with_random_load test case, which will cause accumulation of tablet maps making each reload slower in subsequent iterations. It wasn't a problem before because we overwritten tablet_metadata in each iteration to contain only tablets for the current table, but now we need to keep it consistent with the schema and don't do that.	2025-02-07 17:13:52 +01:00
Tomasz Grabiec	58460a8863	tests: tablets: Set initial tablets to 1 to exit growing mode After tablet hints, there is no notion of leaving growing mode and tablet count is sustained continuously by initial tablet option, so we need to lower it for merge to happen.	2025-02-07 17:13:52 +01:00
Tomasz Grabiec	ca6159fbe2	test: tablets_test: Create proper schema in load balancer tests This is in preparation for load balancer changes needed to respect per-table tablet hints and respecting per-shard tablet count goal. After those changes, load balancer consults with the replication strategy in the database, so we need to create proper schema in the database. To do that, we need proper topology for replication strategies which use RF > 1, otherwise keyspace creation will fail.	2025-02-07 17:13:52 +01:00
Botond Dénes	9174f27cc8	reader_concurrency_semaphore: set_notify_handler(): disable timeout set_notify_handler() is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which his typical). Disable the timeout before setting the TTL to prevent premature eviction. Fixes: #scylladb/scylladb#22629	2025-02-07 02:31:01 -05:00
Avi Kivity	861fb58e14	Merge 'vector: add support for vector type' from Dawid Pawlik This pull request is an implementation of vector data type similar to one used by Apache Cassandra. The patch contains: - implementation of vector_type_impl class - necessary functionalities similar to other data types - support for serialization and deserialization of vectors - support for Lua and JSON format - valid CQL syntax for `vector<>` type - `type_parser` support for vectors - expression adjustments such as: - add `collection_constructor::style_type::vector` - rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector` - vector type encoding (for drivers) - unit tests - cassandra compatibility tests - necessary documentation Co-authored-by: @janpiotrlakomy Fixes https://github.com/scylladb/scylladb/issues/19455 Closes scylladb/scylladb#22488 * github.com:scylladb/scylladb: docs: add vector type documentation cassandra_tests: translate tests covering the vector type type_codec: add vector type encoding boost/expr_test: add vector expression tests expression: adjust collection constructor list style expression: add vector style type test/boost: add vector type cql_env boost tests test/boost: add vector type_parser tests type_parser: support vector type cql3: add vector type syntax types: implement vector_type_impl	2025-02-06 20:36:50 +02:00
Benny Halevy	20c6ca2813	tablet_allocator: consider tablet options for resize decision Do not merge tablets if that would drop the tablet_count below the minimum provided by hints. Split tablets if the current tablet_count is less than the minimum tablet count calculated using the table's tablet options. TODO: override min_tablet_count if the tablet count per shard is greater than the maximum allowed. In this case the tables tablet counts should be scaled down proportionally. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 18:43:35 +02:00
Pavel Emelyanov	951625ca13	Merge 's3 client: add aws credentials providers' from Ernest Zaslavsky This update introduces four types of credential providers: 1. Environment variables 2. Configuration file 3. AWS STS 4. EC2 Metadata service The first two providers should only be used for testing and local runs. They must NEVER be used in production. The last two providers are intended for use on real EC2 instances: - AWS STS: Preferred method for obtaining temporary credentials using IAM roles. - EC2 Metadata Service: Should be used as a last resort. Additionally, a simple credentials provider chain is created. It queries each provider sequentially until valid credentials are obtained. If all providers fail, it returns an empty result. fixes: #21828 Closes scylladb/scylladb#21830 * github.com:scylladb/scylladb: docs: update the `object_storage.md` and `admin.rst` aws creds: add STS and Instance Metadata service credentials providers aws creds: add env. and file credentials providers s3 creds: move credentials out of endpoint config	2025-02-06 11:12:37 +03:00
Benny Halevy	32c2f7579f	network_topology_strategy: allocate_tablets_for_new_table: consider tablet options Use the keyspace initial_tablets for min_tablet_count, if the latter isn't set, then take the maximum of the option-based tablet counts: - min_tablet_count - and expected_data_size_in_gb / target_tablet_size - min_per_shard_tablet_count (via calculate_initial_tablets_from_topology) If none of the hints produce a positive tablet_count, fall back to calculate_initial_tablets_from_topology * initial_scale. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:32 +02:00
Benny Halevy	c5668d99c9	schema: add per-table tablet options Unlike with vnodes, each tablet is served only by a single shard, and it is associated with a memtable that, when flushed, it creates sstables which token-range is confined to the tablet owning them. On one hand, this allows for far better agility and elasticity since migration of tablets between nodes or shards does not require rewriting most if not all of the sstables, as required with vnodes (at the cleanup phase). Having too few tablets might limit performance due not being served by all shards or by imbalance between shards caused by quantization. The number of tabelts per table has to be a power of 2 with the current design, and when divided by the number of shards, some shards will serve N tablets, while others may serve N+1, and when N is small N+1/N may be significantly larger than 1. For example, with N=1, some shards will serve 2 tablet replicas and some will serve only 1, causing an imbalance of 100%. Now, simply allocating a lot more tablets for each table may theoretically address this problem, but practically: a. Each tablet has memory overhead and having too many tablets in the system with many tables and many tablets for each of them may overwhelm the system's and cause out-of-memory errors. b. Too-small tablets cause a proliferation of small sstables that are less efficient to acces, have higher metadata overhead (due to per-sstable overhead), and might exhaust the system's open file-descriptors limitations. The options introduced in this change can help the user tune the system in two ways: 1. Sizing the table to prevent unnecessary tablet splits and migrations. This can be done when the table is created, or later on, using ALTER TABLE. 2. Controlling min_per_shard_tablet_count to improve tablet balancing, for hot tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Tomasz Grabiec	3bb19e9ac9	locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables For example, nodes which are being decommissioned should not be consider as available capacity for new tables. We don't allocate tablets on such nodes. Would result in higher per-shard load then planned. Closes scylladb/scylladb#22657	2025-02-05 23:59:41 +02:00
Kefu Chai	9a20fb43ab	tree: replace boost::min_element() with std::ranges::min_element() in order to reduce the external header dependency, let's switch to the standardlized std::ranges::min_element(). Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22572	2025-02-05 21:54:01 +02:00
Tomasz Grabiec	e22e3b21b1	locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes In that case, new_racks will be used, but when we discover no candidates, we try to pop from existing_racks. Fixes #22625 Closes scylladb/scylladb#22652	2025-02-05 20:13:05 +02:00
Raphael S. Carvalho	ce65164315	test: Use linux-aio backend again on seastar-based tests Since mid December, tests started failing with ENOMEM while submitting I/O requests. Logs of failed tests show IO uring was used as backend, but we never deliberately switched to IO uring. Investigation pointed to it happening accidentaly in commit `1bac6b75dc`, which turned on IO uring for allowing native tool in production, and picked linux-aio backend explicitly when initializing Scylla. But it missed that seastar-based tests would pick the default backend, which is io_uring once enabled. There's a reason we never made io_uring the default, which is that it's not stable enough, and turns out we made the right choice back then and it apparently continue to be unstable causing flakiness in the tests. Let's undo that accidental change in tests by explicitly picking the linux-aio backend for seastar-based tests. This should hopefully bring back stability. Refs #21968. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22695	2025-02-05 15:19:24 +02:00
Ernest Zaslavsky	dee4fc7150	aws creds: add STS and Instance Metadata service credentials providers This commit introduces two new credentials providers: STS and Instance Metadata Service. The S3 client's provider chain has been updated to incorporate these new providers. Additionally, unit tests have been added to ensure coverage of the new functionality.	2025-02-05 14:57:19 +02:00
Ernest Zaslavsky	d534051bea	aws creds: add env. and file credentials providers This commit entirely removes credentials from the endpoint configuration. It also eliminates all instances of manually retrieving environment credentials. Instead, the construction of file and environment credentials has been moved to their respective providers. Additionally, a new aws_credentials_provider_chain class has been introduced to support chaining of multiple credential providers.	2025-02-05 14:57:19 +02:00
Botond Dénes	f2d5819645	reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload with_permit() creates a permit, with a self-reference, to avoid attaching a continuation to the permit's run function. This self-reference is used to keep the permit alive, until the execution loop processes it. This self reference has to be carefully cleared on error-paths, otherwise the permit will become a zombie, effectively leaking memory. Instead of trying to handle all loose ends, get rid of this self-reference altogether: ask caller to provide a place to save the permit, where it will survive until the end of the call. This makes the call-site a little bit less nice, but it gets rid of a whole class of possible bugs. Fixes: #22588 Closes scylladb/scylladb#22624	2025-02-04 21:27:16 +02:00
Ernest Zaslavsky	c911fc4f34	s3 creds: move credentials out of endpoint config This commit refactors the way AWS credentials are managed in Scylla. Previously, credentials were included in the endpoint configuration. However, since credentials and endpoint configurations serve different purposes and may have different lifetimes, it’s more logical to manage them separately. Moving forward, credentials will be completely removed from the endpoint_config to ensure clear separation of concerns.	2025-02-04 16:45:23 +02:00
Ran Regev	edd56a2c1c	moved cache files to db As requested in #22097, moved the files and fixed other includes and build system. Fixes: #22097 Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#22495	2025-02-04 12:21:31 +03:00
Botond Dénes	af46894bb7	Merge 'Rack aware view pairing' from Benny Halevy Enabled with the tablets_rack_aware_view_pairing cluster feature rack-aware pairing pairs base to view replicas that are in the same dc and rack, using their ordinality in the replica map We distinguish between 2 cases: - Simple rack-aware pairing: when the replication factor in the dc is a multiple of the number of racks and the minimum number of nodes per rack in the dc is greater than or equal to rf / nr_racks. In this case (that includes the single rack case), all racks would have the same number of replicas, so we first filter all replicas by dc and rack, retaining their ordinality in the process, and finally, we pair between the base replicas and view replicas, that are in the same rack, using their original order in the tablet-map replica set. For example, nr_racks=2, rf=4: base_replicas = { N00, N01, N10, N11 } view_replicas = { N11, N12, N01, N02 } pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 } Note that we don't optimize for self-pairing if it breaks pairing ordinality. - Complex rack-aware pairing: when the replication factor is not a multiple of nr_racks. In this case, we attempt best-match pairing in all racks, using the minimum number of base or view replicas in each rack (given their global ordinality), while pairing all the other replicas, across racks, sorted by their ordinality. For example, nr_racks=4, rf=3: base_replicas = { N00, N10, N20 } view_replicas = { N11, N21, N31 } pairing would be: { N00, N31 }\, { N10, N11 }, { N20, N21 } \ cross-rack pair If we'd simply stable-sort both base and view replicas by rack, we might end up with much worse pairing across racks: { N00, N11 }\, { N10, N21 }\, { N20, N31 }\* \* cross-rack pair Fixes scylladb/scylladb#17147 * This is an improvement so no backport is required Closes scylladb/scylladb#21453 * github.com:scylladb/scylladb: network_topology_strategy_test: add tablets rack_aware_view_pairing tests view: get_view_natural_endpoint: implement rack-aware pairing for tablets view: get_view_natural_endpoint: handle case when there are too few view replicas view: get_view_natural_endpoint: track replica locator::nodes locator: topology: consult local_dc_rack if node not found by host_id locator: node: add dc and rack getters feature_service: add tablet_rack_aware_view_pairing feature view: get_view_natural_endpoint: refactor predicate function view: get_view_natural_endpoint: clarify documentation view: mutate_MV: optimize remote_endpoints filtering check view: mutate_MV: lookup base and view erms synchronously view: mutate_MV: calculate keyspace-dependent flags once	2025-01-30 11:32:19 +02:00
Botond Dénes	5dd6fcfe6f	Merge 'encrypted_file_impl: Check for reads on or past actual file length in transform' from Calle Wilund Fixes #22236 If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size+<1-15>) (if crossing block boundary) and try to decrypt this buffer (last one). Simplest example: Actual data size: 4095 Physical file size: 4095 + key block size (typically 16) Read from 4096: -> 15 bytes (padding) -> transform return `_file_size` - `read offset` -> wraparound -> rather larger number than we expected (not to mention the data in question is junk/zero). Check on last block in `transform` would wrap around size due to us being >= file size (l). Just do an early bounds check and return zero if we're past the actual data limit. Closes scylladb/scylladb#22395 * github.com:scylladb/scylladb: encrypted_file_test: Test reads beyond decrypted file length encrypted_file_impl: Check for reads on or past actual file length in transform	2025-01-29 20:09:32 +02:00
Dawid Pawlik	489ab1345e	boost/expr_test: add vector expression tests Add and adjust tests using vector and list_or_vector style types. Implemented utilities used in expr_test similar to those added in `8f6309bd66`.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	ed49093a01	expression: adjust collection constructor list style Like mentioned in the previous commit, this changes introduce usage of vector style type and adjusts the functions using list style type to distinguish vectors from lists. Rename collection constructor style list to list_or_vector.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	7554e55c2c	test/boost: add vector type cql_env boost tests These tests check serialization and deserialization (including JSON), basic inserts and selects, aggregate functions, element validation, vector usage in user defined types and functions. test_vector_between_user_types is a translated Apache Cassandra test to check if it is handled properly internally.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	c3a1760a44	test/boost: add vector type_parser tests Contains two type_parser tests: one for a valid vector and another for invalid vector.	2025-01-28 21:14:49 +01:00
Nikos Dragazis	2fb95e4e2f	encrypted_file_test: Test reads beyond decrypted file length Add a test to reproduce a bug in the read DMA API of `encrypted_file_impl` (the file implementation for Encryption-at-Rest). The test creates an encrypted file that contains padding, and then attempts to read from an offset within the padding area. Although this offset is invalid on the decrypted file, the `encrypted_file_impl` makes no checks and proceeds with the decryption of padding data, which eventually leads to bogus results. Refs #22236. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `8f936b2cbc`)	2025-01-27 13:19:37 +00:00
Calle Wilund	e96cc52668	encrypted_file_impl: Check for reads on or past actual file length in transform Fixes #22236 If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size+1-15) (block boundary) and try to decrypt this buffer (last one). Check on last block in `transform` would wrap around size due to us being >= file size (l). Simplest example: Actual data size: 4095 Physical file size: 4095 + key block size (typically 16) Read from 4096: -> 15 bytes (padding) -> transform return _file_size - read offset -> wraparound -> rather larger number than we expected (not to mention the data in question is junk/zero). Just do an early bounds check and return zero if we're past the actual data limit. v2: * Moved check to a min expression instead * Added lengthy comment * Added unit test v3: * Fixed read_dma_bulk handling of short, unaligned read * Added test for unaligned read v4: * Added another unaligned test case	2025-01-27 13:19:37 +00:00
Botond Dénes	9fc14f203b	Merge 'Simplify loading_cache_test and use manual_clock' from Benny Halevy This series exposes a Clock template parameter for loading_cache so that the test could use the manual_clock rather than the lowres_clock, since relying on the latter is flaky. In addition, the test load function is simplified to sleep some small random time and co_return the expected string, rather than reading it from a real file, since the latter's timing might also be flaky, and it out-of-scope for this test. Fixes #20322 * The test was flaky forever, so backport is required for all live versions. Closes scylladb/scylladb#22064 * github.com:scylladb/scylladb: tests: loading_cache_test: use manual_clock utils: loading_cache: make clock_type a template parameter test: loading_cache_test: use function-scope loader test: loading_cache_test: simlute loader using sleep test: lib: eventually: add sleep function param test: lib: eventually: make *EVENTUALLY_EQUAL inline functions	2025-01-27 13:13:41 +01:00
Avi Kivity	a23a3110b5	utils: config_file: forward_declare boost::program_options classes Avoid pulling in boost dependencies when all we need is the class name. Closes scylladb/scylladb#22453	2025-01-27 10:45:43 +03:00
Kefu Chai	d1c222d9bd	config: specialize config_from_string() for sstring Specialize config_from_string() for sstring to resolve lexical_cast stream state parsing limitation. This enables correct handling of empty string configurations, such as setting an empty value in CQL: ```cql UPDATE system.config SET value='' WHERE name='allowed_repair_based_node_ops'; ``` Previous implementation using boost::lexical_cast would fail due to EOF stream state, incorrectly rejecting valid empty string conversions. Fixes scylladb/scylladb#22491 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22492	2025-01-26 15:53:12 +02:00
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. - The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
Botond Dénes	e60e575cb0	test/boost/bptree_validation.hh: add missing include <fmt/format.h>	2025-01-23 06:05:57 -05:00
Benny Halevy	32b7cab917	tests: loading_cache_test: use manual_clock Relying on a real-time clock like lowres_clock can be flaky (in particular in debug mode). Use manual_clock instead to harden the test against timing issues. Fixes #20322 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:08 +02:00
Benny Halevy	b258f8cc69	test: loading_cache_test: use function-scope loader Rather than a global function, accessing a thread-local `load_count`. The thread-local load_count cannot be used when multiple test cases run in parallel. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:07 +02:00
Benny Halevy	d68829243f	test: loading_cache_test: simlute loader using sleep This test isn't about reading values from file, but rather it's about the loading_cache. Reading from the file can sometimes take longer than the expected refresh times, causing flakiness (see #20322). Rather than reading a string from a real file, just sleep a random, short time, and co_return the string. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:07 +02:00
Benny Halevy	b509644972	test: lib: eventually: make *EVENTUALLY_EQUAL inline functions rather then macros. This is a first cleanup step before adding a sleep function parameter to support also manual_clock. Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL, respectively, to make an error more visible in the test log since those entry points print the offending values when not equal. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 12:47:33 +02:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Benny Halevy	dd21d591f6	network_topology_strategy_test: add tablets rack_aware_view_pairing tests Test the simple case of base/view pairing with replication_factor that is a multiple of the number of racks. As well as the complex case when simple_tablets_rack_aware_view_pairing is not possible. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Nadav Har'El	a8805c4fc1	Merge 'cql3, test, utils: switch from boost::adaptors::uniqued to utils::views:unique ' from Kefu Chai In order to reduce the dependency on external libraries, and for better integration with ranges in C++ standard library. let's use the homebrew `utils::views::unique()` before unique is accepted by the C++ standard. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22393 * github.com:scylladb/scylladb: cql3, test: switch from boost::adaptors::uniqued to utils::views:unique utils: implement drop-in replacement for replacing boost::adaptors::uniqued	2025-01-21 19:06:21 +02:00
Kefu Chai	ccb7b4e606	cql3, test: switch from boost::adaptors::uniqued to utils::views:unique In order to reduce the dependency on external libraries, and for better integration with ranges in C++ standard library. let's use the homebrew `utils::views::unique()` before unique is accepted by the C++ standard. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-21 16:24:45 +08:00
Kefu Chai	d5d251da9a	utils: implement drop-in replacement for replacing boost::adaptors::uniqued Add a custom implementation of boost::adaptors::uniqued that is compatible with C++20 ranges library. This bridges the gap between Boost.Range and the C++ standard library ranges until std::views::unique becomes available in C++26. Currently, the unique view is included in [P2214](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2760r0.html) "A Plan for C++ Ranges Evolution", which targets C++26. The implementation provides: - A lazy view adaptor that presents unique consecutive elements - No modification of source range - Compatibility with C++20 range views and concepts - Lighter header dependencies compared to Boost This resolves compilation errors when piping C++20 range views to boost::adaptors::uniqued, which fails due to concept requirements mismatch. For example: ```c++ auto range = std::views::take(n) \| boost::adaptors::uniqued; // fails ``` This change also offers us a lightweight solution in terms of smaller header dependency. While std::ranges::unique exists in C++23, it's an eager algorithm that modifies the source range in-place, unlike boost::adaptors::uniqued which is a lazy view. The proposed std::views::unique (P2214) targeting C++26 would provide this functionality, but is not yet available. This implementation serves as an interim solution for filtering consecutive duplicate elements using range views until std::views::unique is standardized. For more details on the differences between `std::ranges::unique` and `boost::adaptors::uniqued`: - boost::adaptors::uniqued is a view adaptor that creates a lazy view over the original range. It: * Doesn't modify the source range * Returns a view that presents unique consecutive elements * Is non-destructive and lazy-evaluated * Can be composed with other views - std::ranges::unique is an algorithm that: * Modifies the source range in-place * Removes consecutive duplicates by shifting elements * Returns an iterator to the new logical end * Cannot be used as a view or composed with other range adaptors Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-21 16:24:45 +08:00
Tomasz Grabiec	8059090a29	Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 Closes scylladb/scylladb#21862 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-01-21 00:17:54 +01:00
Nadav Har'El	3e16b80014	Merge 'Reject create table with compact storage' from Benny Halevy As discussed in https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813, compact storage tables are deprecated. Yet, there's is nothing in the code that prevents users from creating such tables. This patch adds a live-updateable config option: `enable_create_table_with_compact_storage`, set to `false` by default, that require users to opt-in in order to create new tables WITH COMPACT STORAGE. Refs scylladb/scylladb#12263, scylladb/scylladb#16375 * Since this guardrail is an enhancement, no backport is needed Closes scylladb/scylladb#16403 * github.com:scylladb/scylladb: docs: ddl: document the deprecation of compact tables test: enable_create_table_with_compact_storage for tests that need it config: add enable_create_table_with_compact_storage	2025-01-20 22:02:02 +02:00
Tomasz Grabiec	c7f78edc78	Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507 New feature. No backport is needed. Closes scylladb/scylladb#21896 * github.com:scylladb/scylladb: repair: Stop using rpc to update repair time for repairs scheduled by scheduler repair: Wire repair_time in system.tablets for tombstone gc test: Disable flush_cache_time for two tablet repair tests test: Introduce guarantee_repair_time_next_second helper repair: Return repair time for repair_service::repair_tablet service: Add tablet_operation.hh	2025-01-20 18:08:49 +01:00
Benny Halevy	88ae067ddb	everywhere: add skeletal support for the in_memory_tables feature Forward-ported from scylla-enterprise. Note that the feature has been deprecated and the implementation is provided only for backward compatibility with pre-existing features and schema. Tested manually after adding the following to feature_service: ``` gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv }; ``` Launched a single-node cluster running 2023.1.10 ``` cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'}; ``` log: ``` Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ... ... INFO 2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}] INFO 2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 ``` Upgraded to this branch and started scylla. Verified that ks.test was successfuly loaded: log: ``` INFO 2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ... ... WARN 2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy. ... INFO 2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8 ``` Then, tested: ``` cqlsh> describe KEYSPACE ks; CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.test ( pk int, val int, PRIMARY KEY (pk) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'InMemoryCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE'; cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'}; cqlsh> describe KEYSPACE ks; CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.test ( pk int, val int, PRIMARY KEY (pk) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE' AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'}; ``` log: ``` INFO 2024-12-27 19:56:40,465 [shard 0:stmt] migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}] INFO 2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4 ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22068	2025-01-20 16:55:17 +02:00

1 2 3 4 5 ...

3725 Commits