scylladb

Author	SHA1	Message	Date
Calle Wilund	bb5dc0771c	commitlog: Fix buffer_list_bytes not updated correctly Fixes #20862 With the change in `60af2f3cb2` the bookkeep for buffer memory was changed subtly, the problem here that we would shrink buffer size before we after flush use said buffer's size to decrement the buffer_list_bytes value, previously inc:ed by the full, allocated size. I.e. we would slowly grow this value instead of adjusting properly to actual used bytes. Test included. (cherry picked from commit `ee5e71172f`) Closes scylladb/scylladb#20902	2024-10-01 17:41:02 +03:00
Calle Wilund	ad595e4d6a	commitlog_test: Add more oversized cases Also adds some randomization to the tests.	2024-09-03 16:38:28 +00:00
Calle Wilund	05bf2ae5d7	commitlog: Handle oversized entries Refs #18161 Yet another approach to dealing with large commitlog submissions. We handle oversize single mutation by adding yet another entry type: fragmented. In this case we only add a fragment (aha) of the data that needs storing into each entry, along with metadata to correlate and reconstruct the full entry on replay. Because these fragmented entries are spread over N segments, we also need to add references from the first segment in a chain to the subsequent ones. These are released once we clear the relevant cf_id count in the base. * This approach has the downside that due to how serialization etc works w.r.t. mutations, we need to create an intermediate buffer to hold the full serialized target entry. This is then incrementally written into entries of < max_mutation_size, successively requesting more segments. On replay, when encountering a fragment chain, the fragment is added to a "state", i.e. a mapping of currently processing frag chains. Once we've found all fragments and concatenated the buffers into a single fragmented one, we can issue a replay callback as usual. Note that a replay caller will need to create and provide such a state object. Old signature replay function remains for tests and such. This approach bumps the file format (docs to come). To ensure "atomicity" we both force syncronization, and should the whole op fail, we restore segment state (rewinding), thus discarding data all we wrote. v2: * Improve some bookeep, ensure we keep track of segments and flush properly, to get counter correct	2024-09-03 16:38:27 +00:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Botond Dénes	d3135db457	Merge 'commitlog: Add optional max lifetime parameter to cl instance' from Calle Wilund If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log. Related to #15820 The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc. Closes scylladb/scylladb#15971 * github.com:scylladb/scylladb: commitlog: Make max data lifetime runtime-configurable db::config: Expose commitlog_max_data_lifetime_in_s parameter commitlog: Add optional max lifetime parameter to cl instance	2024-07-22 17:21:33 +03:00
Raphael S. Carvalho	ad5c5bca5f	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:53:35 -03:00
Calle Wilund	8295980d14	commitlog: Make max data lifetime runtime-configurable	2024-07-09 12:30:49 +00:00
Calle Wilund	55d6afda6e	commitlog: Add optional max lifetime parameter to cl instance If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.	2024-07-09 12:30:48 +00:00
Avi Kivity	fdc1449392	treewide: rename flat_mutation_reader_v2 to mutation_reader flat_mutation_reader_v2 was introduced in a pair of commits in 2021: `e3309322c3` "Clone flat_mutation_reader related classes into v2 variants" `08b5773c12` "Adapt flat_mutation_reader_v2 to the new version of the API" as a replacement for flat_mutation_reader, using range_tombstone_change instead of range_tombstone to represent represent range tombstones. See those commits for more information. The transition was incremental; the last use of the original flat_mutation_reader was removed in 2022 in commit `026f8cc1e7` "db: Use mutation_partition_v2 in mvcc" In turn, flat_mutation_reader was introduced in 2017 in commit `748205ca75` "Introduce flat_mutation_reader" To transition from a mutation_reader that nested rows within a partition in a separate stream, to a flat reader that streamed partitions and rows in the same stream. Here, we reclaim the original name and rename the awkward flat_mutation_reader_v2 to mutation_reader. Note that mutation_fragment_v2 remains since we still use the original for compatibilty, sometimes. Some notes about the transition: - files were also renamed. In one case (flat_mutation_reader_test.cc), the rename target already existed, so we rename to mutation_reader_another_test.cc. - a namespace 'mutation_reader' with two definitions existed (in mutation_reader_fwd.hh). Its contents was folded into the mutation_reader class. As a result, a few #includes had to be adjusted. Closes scylladb/scylladb#19356	2024-06-21 07:12:06 +03:00
Kefu Chai	222dbf2ce4	test/boost: include test/lib/test_utils.hh this change was created in the same spirit of 505900f18f. because we are deprecating the operator<< for vector and unorderd_map in Seastar, some tests do not compile anymore if we disable these operators. so to be prepared for the change disabling them, let's include test/lib/test_utils.hh for accessing the printer dedicated for Boost.test. and also '#include <fmt/ranges.h>' when necessary, because, in order to format the ranges using {fmt}, we need to use fmt/ranges.h. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-05-26 12:32:43 +08:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Kefu Chai	40ce52c3cc	test: use generic boost_test_print_type() in this change, we trade the `boost_test_print_type()` overloads for the generic template of `boost_test_print_type()`, except for those in the very small tests, which presumably want to keep themselves relative self-contained. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18727	2024-05-20 12:56:20 +03:00
Tomasz Grabiec	32a191384a	test: Avoid using deprecated sharded API There is not tablet migration in unit tests, so shard_of() can be safely replaced with shard_for_reads(). Even if it's used for writes.	2024-05-16 00:28:47 +02:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Kefu Chai	8f38bd5376	commitlog: add formatter for db::replay_position before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `db::replay_position`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17014	2024-01-29 09:59:30 +02:00
Calle Wilund	af0772d605	commitlog: Add wait_for_pending_deletes Refs #16757 Allows waiting for all previous and pending segment deletes to finish. Useful if a caller of `discard_completed_segments` (i.e. a memtable flush target) not only wants to ensure segments are clean and released, but thoroughly deleted/recycled, and hence no treat to resurrecting data on crash+restart. Test included. Closes scylladb/scylladb#16801	2024-01-17 09:30:55 +02:00
Lakshmi Narayanan Sreethar	76f0d5e35b	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658	2024-01-11 08:37:56 +02:00
Calle Wilund	b34366957e	commitlog_test::test_commitlog_reader: handle segment_truncation Fixes #16312 This test replays a segment before it might be closed or even fully flushed, thus it can (with the new semantics) generate a segment_truncation exception if hitting eof earlier than expected. (Note: test does not use pre-allocated segments).	2023-12-11 11:53:12 +00:00
Calle Wilund	d85c0ea26f	commitlog_test: coroutinize test_commitlog_reader To make it easier to read and modify.	2023-12-11 11:47:48 +00:00
Calle Wilund	dba39b47bd	commitlog: Fix allocation size check to take sector overhead into account. Fixes #16301 The calculation on whether data may be added is based on position vs. size of incoming data. However, it did not take sector overhead into account, which lead us to writing past allowed segment end, which in turn also leads to metrics overflows.	2023-12-07 07:36:27 +00:00
Calle Wilund	0d35c96ef4	commitlog: Fix commitlog_segment::buffer_position() calculation and replay counterpart Fixes #16298 The adjusted buffer position calculation in buffer_position(), introduced in #15494 was in fact broken. It calculated (like previously) a "position" based on diff between underlying buffer size and ostream size() (i.e. avail), then adjusted this according to sector overhead rules. However, the underlying buffer size is in unadjusted terms, and the ostream is adjusted. The two cannot be compared as such, which means the "positions" we get here are borked. Luckily for us (sarcasm), the position calculation in replayer made a similar error, in that it adjusts up current position by one sector overhead to much, leading to us more or less getting the same, erroneous results in both ends. However, when/iff one needs to adjust the segment file format further, one might very quickly realize that this does not work well if, say, one needs to be able to safely read some extra bytes before first chunk in a segment. Conversely, trying to adjust this also exposes a latent potential error in the skip mechanism, manifesting here. Issue fixed by keeping track of the initial ostream capacity for segment buffer, and use this for position calculation, and in the case of replayer, move file pos adjustment from read_data() to subroutine (shared with skipping), that better takes data stream position vs. file position adjustment. In implementaion terms, we first inc the "data stream" pos (i.e. pos in data without overhead), then adjust for overhead. Also fix replayer::skip, so that we handle the buffer/pos relation correctly now. Added test for intial entry position, as well as data replay consistency for single entry_writer paths.	2023-12-07 07:36:27 +00:00
Calle Wilund	e94070db64	commitlog_test: Add test for commit log replay skip past EOF Refs #15269 Unit test to check that trying to skip past EOF in a borked segment will not crash the process. file_data_input_impl asserts iff caller tries this.	2023-12-04 20:50:42 +02:00
Avi Kivity	60af2f3cb2	Merge 'New commitlog file format using tagged pages' from Calle Wilund Prototype implementation of format suggested/requested by @avikivity: Divides segments into disk-write-alignment sized pages, each tagged with segment ID + CRC of data content. When read, we both verify sector integrity (CRC) to detect corruption, as well as matching ID read with expected one. If the latter mismatches we have a prematurely terminated segment (read truncation), which, depending on whether the CL is written in batch or periodic mode, as well as explicit sync, can mean data loss. Note: all-zero pages are treated as kosher, both to align with newly allocated segments, as well as fully terminated (zero-page) ones. Note: This is a preview/RFC - the rest of the file format is not modified. At least parts of entry CRC could probably be removed, but I have not done so yet (needs some thinking). Note: Some slight abstraction breaks in impl. and probably less than maximal efficiency. v2: * Removed entry CRC:s in file format. * Added docs on format v3 * Added one more test for recycling-truncation v3: * Fixed typos in size calc and docs * Changed sect metadata order * Explicit iter type Closes scylladb/scylladb#15494 * github.com:scylladb/scylladb: commitlog_test: Add test for replaying large-ish mutation commitlog_test: Add additional test for segmnent truncation docs: Add docs on commitlog format 3 commitlog: Remove entry CRC from file format commitlog: Implement new format using CRC:ed sectors commitlog: Add iterator adaptor for doing buffer splitting into sub-page ranges fragmented_temporary_buffer: Add const iterator access to underlying buffers commitlog_replayer: differentiate between truncated file and corrupt entries	2023-12-04 13:31:13 +01:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Calle Wilund	3b70fde3cd	commitlog: Make named_files in delete_segments have updated size Fixes #16207 commitlog::delete_segments deletes (or recycles) segments replayed. The actual file size here is added to footprint so actual delete then can determine iff things should be recycled or removed. However, we build a pending delete list of named_files, and the files we added did not have size set. Bad. Actual deletion then treated files as zero-byte sized, i.e. footprint calculations borked. Simple fix is just filling in the size of the objects when addind. Added unit test for the problem. Closes scylladb/scylladb#16210	2023-11-29 09:58:47 +02:00
Calle Wilund	33fba28265	commitlog_test: Add test for replaying large-ish mutation (i.e. cross several normal-sized buffers).	2023-11-21 08:50:57 +00:00
Calle Wilund	0d41769daa	commitlog_test: Add additional test for segmnent truncation Emulate replay of a non-sealed segment, verifying we don't get data beyond termination point, as well as the correct exception.	2023-11-21 08:50:57 +00:00
Calle Wilund	e29bf6f9e8	commitlog: Implement new format using CRC:ed sectors Breaks the file into individually tagged + crc:ed pages. Each page (sized as disk write alignment) gets a trailing 12-byte metadata, including CRC of the first page-12 bytes, and the ID of the segment being written. When reading, each page read is CRC:ed and checked to be part of the expected segment by comparing ID:s. If crc is broken, we have broken data. If crc is ok, but ID does not match, we have a prematurely terminated segment (truncated), which, depending on whether we use batch mode or not, implied data loss.	2023-11-21 08:50:54 +00:00
Calle Wilund	862f4f2ed3	commitlog_replayer: differentiate between truncated file and corrupt entries Refs #11845 When replaying, differentiate between the two cases for failure we have: - A broken actual entry - i.e. entry header/data does not hold up to crc scrutiny - Truncated file - i.e. a chunk header is broken or unreadable. This can be due to either "corruption" (i.e. borked write, post-corruption, hw whatever), or simply an unterminated segment. The difference is that the former is recoverable, the latter is not. We now signal and report the two separately. The end result for a user is not much different, in either case they imply data loss and the need for repair. But there is some value in differentiating which of the two we encountered. Modifies and adds test cases.	2023-11-21 08:42:33 +00:00
Wojciech Mitros	f08e7aad61	test: account for multiple flushes of commitlog segments Currently, when we calculate the number of deactivated segments in test_commitlog_delete_when_over_disk_limit, we only count the segments that were active during the first flush. However, during the test, there may have been more than one flush, and a segment could have been created between them. This segment would sometimes get deactivated and even destroyed, and as a result, the count of destroyed segments would appear larger than the count of deactivated ones. This patch fixes this behavior by accounting for all segments that were active during any flush instead of just segments active during the first flush. Fixes #10527 Closes scylladb/scylladb#14610	2023-10-29 18:30:32 +02:00
Tomasz Grabiec	e48ec6fed3	db, storage_proxy: Drop mutation/frozen_mutation ::shard_of() dht::shard_of() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should use erm::get_sharder().	2023-06-21 00:58:24 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritants to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-autimatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicatble) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairry -- it needs to use task queues list for IO classes names and shares, but to detect it should it checks for the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Botond Dénes	156e5d346d	reader_permit: keep trace_state pointer on permit And propagate it down to where it is created. This will be used to add trace points for semaphore related events, but this will come in the next patches.	2023-03-22 04:58:01 -04:00
Pavel Emelyanov	47b61389b5	commitlog: Create commitlog_replayer with system keyspace The replayer code needs system keyspace to fetch truncation records from, thus it needs this explicit dependency. By the time it runs system keyspace is fully initialized already Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-03-06 13:28:36 +03:00
Kefu Chai	3ae11de204	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-28 21:56:53 +08:00
Calle Wilund	64102780fe	commitlog: Use static (reused) regex for (left over) descriptor parse Refs #11710 Allows reusing regex for segment matching (for opening left-over segments after crash). Should remove any stalls caused by commitlog replay preparation. v2: Add unit test for descriptor parsing Closes #12112	2023-02-21 18:34:04 +02:00
Tomasz Grabiec	ccc8e47db1	Merge 'test/lib: introduce key_utils.hh' from Botond Dénes We currently have two method families to generate partition keys: * make_keys() in test/lib/simple_schema.hh * token_generation_for_shard() in test/lib/sstable_utils.hh Both work only for schemas with a single partition key column of `text` type and both generate keys of fixed size. This is very restrictive and simplistic. Tests, which wanted anything more complicated than that had to rely on open-coded key generation. Also, many tests started to rely on the simplistic nature of these keys, in particular two tests started failing because the new key generation method generated keys of varying size: * sstable_compaction_test.sstable_run_based_compaction_test * sstable_mutation_test.test_key_count_estimation These two tests seems to depend on generated keys all being of the same size. This makes some sense in the case of the key count estimation test, but makes no sense at all to me in the case of the sstable run test. Closes #12657 * github.com:scylladb/scylladb: test/lib/sstable_utils: remove now unused token_generation_for_shard() and friends test/lib/simple_schema: remove now unused make_keys() and friends test: migrate to tests::generate_partition_key[s]() test/lib/test_services: add table_for_tests::make_default_schema() test/lib: add key_utils.hh test/lib/random_schema.hh: value_generator: add min_size_in_bytes	2023-02-06 18:11:32 +01:00
Raphael S. Carvalho	3c5afb2d5c	test: Enable Scylla test command line options for boost tests We have enabled the command line options without changing a single line of code, we only had to replace old include with scylla_test_case.hh. Next step is to add x-log-compaction-groups options, which will determine the number of compaction groups to be used by all instantiations of replica::table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-02-01 20:14:51 -03:00
Botond Dénes	4ad3ba52b0	test: migrate to tests::generate_partition_key[s]() Use the newly introduced key generation facilities, instead of the the old inflexible alternatives and hand-rolled code. Most of the migrations are mechanic, but there are two tests that were tricky to migrate: * sstable_compaction_test.sstable_run_based_compaction_test * sstable_mutation_test.test_key_count_estimation These two tests seems to depend on generated keys all being of the same size. This makes some sense in the case of the key count estimation test, but makes no sense at all to me in the case of the sstable run test.	2023-01-30 05:03:42 -05:00
Raphael S. Carvalho	ef8f542d75	replica: Adapt table::active_memtable() to compaction groups active_memtable() was fine to a single group, but with multiple groups, there will be one active memtable per group. Let's change the interface to reflect that. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:14 -03:00
Benny Halevy	257d74bb34	schema, everywhere: define and use table_id as a strong type Define table_id as a distinct utils::tagged_uuid modeled after raft tagged_id, so it can be differentiated from other uuid-class types, in particular from table_schema_version. Fixes #11207 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:41 +03:00
Calle Wilund	aab7794c31	commitlog_test: Change timeout handling to do abort() Refs #10805 To help debug spurious failures, ensure to do abort() for debugger/core ease. Closes #10843.	2022-06-29 13:26:51 +03:00
Calle Wilund	688fd31e64	commitlog: Add counters for actual pending allocations + segment wait Fixes #9367 The CL counters pending_allocations and requests_blocked_memory are exposed in graphana (etc) and often referred to as metrics on whether we are blocking on commit log. But they don't really show this, as they only measure whether or not we are blocked on the memory bandwidth semaphore that provides rate back pressure (fixed num bytes/s - sortof). However, actual tasks in allocation or segment wait is not exposed, so if we are blocked on disk IO or waiting for segments to become available, we have no visible metrics. While the "old" counters certainly are valid, I have yet to ever see them be non-zero in modern life. Closes #9368	2022-06-28 08:36:27 +03:00
Benny Halevy	e88871f4ec	replica: database: move shard_of implementation to mutation layer We don't need the database to determine the shard of the mutation, only its schema. So move the implementation to the respecive definitions of mutation and frozen_mutation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10430	2022-04-27 14:40:24 +03:00
Calle Wilund	d478896d46	commitlog: kill non-recycled segment management It has been default for a while now. Makes no sense to not do it. Even hints can use it (even if it makes no difference there)	2022-04-11 16:34:00 +00:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Avi Kivity	e2c27ee743	Merge 'commitlog: recalculate disk footprint on delete_segment exceptions' from Calle Wilund If we get errors/exceptions in delete_segments we can (and probably will) loose track of disk footprint counters. This can in turn, if using hard limits, cause us to block indefinitely on segment allocation since we might think we have larger footprint than we actually do. Of course, if we actually fail deleting a segment, it is 100% true that we still technically hold this disk footprint (now unreachable), but for cases where for example outside forces (or wacky tests) delete a file behind our backs, this might not be true. One could also argue that our footprint is the segments and file names we keep track of, and the rest is exterior sludge. In any case, if we have any exceptions in delete_segments, we should recalculate disk footprint based on current state, and restart all new_segment paths etc. Fixes #9348 (Note: this is based on previous PR #9344 - so shows these commits as well. Actual changes are only the latter two). Closes #9349 * github.com:scylladb/scylla: commitlog: Recalculate footprint on delete_segment exceptions commitlog_test: Add test for exception in alloc w. deleted underlying file commitlog: Ensure failed-to-create-segment is re-deleted commitlog::allocate_segment_ex: Don't re-throw out of function	2021-11-16 17:44:56 +02:00
Avi Kivity	9166d1ab1d	test: commitlog_test: don't use deprecated seastar::unaligned_cast unaligned_cast is deprecated, and gcc complains that it violates strict aliasing rules. Switch to std::copy_n() instead.	2021-10-10 18:17:53 +03:00
Calle Wilund	dcc73c5d4e	commitlog: Recalculate footprint on delete_segment exceptions Fixes #9348 If we get exceptions in delete_segments, we can, and probably will, loose track of footprint counters. We need to recompute the used disk footprint, otherwise we will flush too often, and even block indefinately on new_seg iff using hard limits.	2021-09-15 11:53:03 +00:00
Calle Wilund	8a326638af	commitlog_test: Add test for exception in alloc w. deleted underlying file Tests that we can handle exception-in-alloc cleanup if the file actually does not exist. This however uncovers another weakness (addressed in next patch) - that we can loose track of disk footprint here, and w. hard limits end up waiting for disk space that never comes. Thus test does not use hard limit.	2021-09-15 11:51:05 +00:00

1 2

80 Commits