scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 20:05:10 +00:00

Author	SHA1	Message	Date
Avi Kivity	aee94b7176	Merge "Convert remaining mutation sources to v2" from Botond " After the recent conversion of the row-cache, two v1 mutation sources remained: the memtable and the kl sstable reader. This series converts both to a native v2 implementation. The conversion is shallow: both continue to read and process the underlying (v1) data in v1, the fragments are converted to v2 right before being pushed to the reader's buffer. This conversion is simple, surgical and low-risk. It is also better than the upgrade_to_v2() used previously. Following this, the remaining v1 reader implementations are removed, with the exception of the downgrade_to_v1(), which is the only one left at this point. Removing this requires converting all mutation sinks to accept a v2 stream. upgrade_to_v2() is now not used in any production code. It is still needed to properly test downgrade_to_v1() (which is still used), so we can't remove it yet. Instead it is hidden as a private method of mutation_source. This still allows for the above mentioned testing to continue, while preventing anyone from being tempted to introduce new usage. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191 " * 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla: readers: make upgrade_to_v2() private test/lib/mutation_source_test: remove upgrade_to_v2 tests readers: remove v1 forwardable reader readers: remove v1 empty_reader readers: remove v1 delegating_reader sstables/kl: make reader impl v2 native sstables/kl: return v2 reader from factory methods sstables: move mp_row_consumer_reader_k_l to kl/reader.cc partition_snapshot_reader: convert implementation to native v2 mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()	2022-04-28 20:31:23 +03:00
Botond Dénes	a22b02c801	sstables/kl: return v2 reader from factory methods This just moves the upgrade_to_v2() calls to the other side of said factory methods, preparing the ground for converting the kl reader impl to a native v2 one.	2022-04-28 14:12:24 +03:00
Raphael S. Carvalho	791403e4bb	sstables: Fix deletion of partial SSTables If SSTable write fails, it will leave a partial sst which contains a temporary TOC in addition to other components partially written. temporary TOC content is written upfront, to allow us to delete all partial components using the former content if write fails. After commit `e5fc4b6`, partial sst cannot be deleted because deletion procedure is incorrectly assuming all SSTs being deleted unconditionally have TOC, but partial SSTs only have TMP TOC instead. That happens because parent_path() requires all path components to exist due to its usage of fs::path::canonical. The consequence of this is that space of partial files cannot be reclaimed, making it worse for Scylla to recover from ENOSPC, which could happen by selecting a set of files for compaction with higher chance of succeeding given the free space. This is fixed by only calling parent_path() on TMP TOC, which is guaranteed to exist prior to calling fsync_directory(). Fixes #10410. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 11:00:27 -03:00
Raphael S. Carvalho	0be44b1035	sstables: Fix fsync_directory() fsync_directory() is broken because it's unconditionally performing fsync on parent directory, not on the directory that it was called with. To fix, let's remove wrong parent_path() usage. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 11:00:27 -03:00
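The fsync_directory() bug described above comes down to which path gets synced. The following is a minimal sketch using plain `std::filesystem`; the function names and the example paths are hypothetical and are not Scylla's helpers (Scylla's own parent_path() additionally canonicalizes the path, which this sketch does not do).

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical stand-in for the pre-fix behaviour: the caller passes a
// directory, but the parent of that directory is what ends up being synced.
fs::path fsync_target_before_fix(const fs::path& dir) {
    return dir.parent_path();
}

// Hypothetical stand-in for the post-fix behaviour: sync exactly the
// directory the function was called with.
fs::path fsync_target_after_fix(const fs::path& dir) {
    return dir;
}

// For a component file (for example a temporary TOC), taking the parent
// path is the right way to find the directory that contains it.
fs::path containing_dir(const fs::path& component_file) {
    assert(component_file.has_filename());
    return component_file.parent_path();
}
```

Taking the parent is appropriate for a file path and wrong for a directory path, which is presumably why the later rename from dirname() to parent_path() makes call sites easier to audit.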
Raphael S. Carvalho	ca8f5dcdb7	sstables: Rename dirname() to a more descriptive name dirname() is confusing because if it's called on a directory, parent path is retrieved. By renaming it to parent_path(), it's clearer what the function will do exactly. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-04-26 11:00:27 -03:00
Piotr Sarna	bce2933d99	sstables: remove unnecessary throws Throws are translated to passing the exceptions directly.	2022-04-12 13:09:54 +02:00
Avi Kivity	af07519928	Merge "Remove reader from mutations v1" from Botond " First migrate all users to the v2 variant, all of which are tests. However, to be able to properly migrate all tests off it, a v2 variant of the restricted reader is also needed. All restricted reader users are then migrated to the freshly introduced v2 variant and the v1 variant is removed. Users include: * replica::table::make_reader_v2() * streaming_virtual_table::as_mutation_source() * sstables::make_reader() * tests This allows us to get rid of a bunch of conversions on the query path, which was mostly v2 already. With a few tests we did kick the can down the road by wrapping the v2 reader in `downgrade_to_v1()`, but this series is long enough already. Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug) " * 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla: readers: remove now unused v1 reader from mutations test: move away from v1 reader from mutations test/boost/mutation_reader_test: use fragment_scatterer test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh test/boost: mutation_fragment_test: refactor fragment_scatterer readers: remove now unused v1 reversing reader test/boost/flat_mutation_reader_test: convert to v2 frozen_mutation: fragment_and_freeze(): convert to v2 frozen_mutation: coroutinize fragment_and_freeze() readers: migrate away from v1 reversing reader db/virtual_table: use v2 variant of reversing and forwardable readers replica/table: use v2 variant of reversing reader sstables/sstable: remove unused make_crawling_reader_v1() sstables/sstable: remove make_reader_v1() readers: add v2 variant of reversing reader readers/reversing: remove FIXME readers: reader from mutations: use mutation's own schema when slicing	2022-03-31 13:29:11 +03:00
Botond Dénes	3b67c25e49	sstables/sstable: remove unused make_crawling_reader_v1()	2022-03-31 09:57:48 +03:00
Botond Dénes	219cb881a4	sstables/sstable: remove make_reader_v1() No external users, only used internally, by make_reader(), who delegates cases currently unsupported by v2 to it. The code needed from make_reader_v1() is inlined into make_reader() and the former is removed.	2022-03-31 09:57:48 +03:00
Botond Dénes	b029bd3db7	tree: remove mutation_reader.hh include In most files it was unused. We should move these to the patch which moved out the last interesting reader from mutation_reader.hh (and added the corresponding new header include) but it's probably not worth the effort. Some other files still relied on mutation_reader.hh to provide reader concurrency semaphore and some other misc reader related definitions.	2022-03-30 15:42:51 +03:00
Avi Kivity	585c0841c3	Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page. For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing. This patch adds a `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used. As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`. Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of the `_entries` `chunked_managed_vector`. A big difference can be seen when we're reading the entire table, frequently skipping a few rows; which we can test using perf_fast_forward: Before: ``` running: small-partition-skips on dataset small-part Testing scanning small partitions with skips. 
Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu -> 1 0 0.899447 4 1000000 1111794 12284 1113248 1096537 975.5 972 124356 1 0 0 0 0 0 0 0 12032202 29103 8967 100.0% -> 1 1 1.805811 4 500000 276884 907 278214 275977 3655.8 3654 135084 2688 0 3161 4548 5935 0 0 0 7225100 140466 27010 75.6% -> 1 8 0.927339 4 111112 119818 357 120465 119461 3654.0 3654 135084 2685 0 2133 4548 6963 0 0 0 1749663 107922 57502 50.2% -> 1 16 0.790630 4 58824 74401 782 74617 73497 3654.0 3654 135084 2695 0 1975 4548 7121 0 0 0 1019189 109349 90832 42.7% -> 1 32 0.717235 4 30304 42251 243 42266 41975 3654.0 3654 135084 2689 0 1871 4548 7225 0 0 0 619876 109199 156751 37.3% -> 1 64 0.681624 4 15385 22571 244 22815 22286 3654.0 3654 135084 2685 0 1870 4548 7226 0 0 0 407671 105798 285688 34.0% -> 1 256 0.630439 4 3892 6173 24 6214 6150 3549.0 3549 135116 2581 0 1313 3927 6505 0 0 0 232541 100803 1022454 29.1% -> 1 1024 0.313303 4 976 3115 219 3126 2766 1956.0 1956 130608 986 0 0 987 1962 0 0 0 81165 41385 1724979 29.1% -> 1 4096 0.083688 4 245 2928 85 3012 2134 738.8 737 17212 492 244 0 247 491 0 0 0 30500 19406 1999263 24.6% -> 64 1 1.509011 4 984616 652491 2746 660930 649745 3673.5 3654 135084 2687 0 4507 4548 4589 0 0 0 11075882 117074 13157 68.9% -> 64 8 1.424147 4 888896 624160 4446 625675 617713 3654.0 3654 135084 2691 0 4248 4548 4848 0 0 0 10019098 117383 13700 66.5% -> 64 16 1.343276 4 800000 595559 5834 605880 589725 3654.0 3654 135084 2698 0 3989 4548 5107 0 0 0 9043830 124022 14206 64.9% -> 64 32 1.249721 4 666688 533469 5056 536638 526212 3654.0 3654 135084 2688 0 3616 4548 5480 0 0 0 7570848 123043 15377 60.9% -> 64 64 1.154549 4 500032 433097 10215 443312 415001 3654.0 3654 135084 2703 0 3161 4548 5935 0 0 0 5718758 110657 17787 53.2% -> 64 256 1.005309 4 200000 198944 1179 199338 
196989 3935.0 3935 137216 2966 0 690 4048 5592 0 0 0 2398359 110510 27855 51.3% -> 64 1024 0.441913 4 58880 133239 8094 135471 120467 2161.0 2161 131820 1190 0 0 1192 1848 0 0 0 725092 45449 33740 59.7% -> 64 4096 0.124826 4 15424 123564 5958 126814 95101 795.5 794 17400 553 240 0 312 482 0 0 0 199943 20869 46621 41.9% ``` After: ``` running: small-partition-skips on dataset small-part Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu -> 1 0 0.917468 4 1000000 1089956 1422 1091378 1073112 975.5 972 124356 1 0 0 0 0 0 0 0 12032761 29721 8972 100.0% -> 1 1 1.311446 4 500000 381259 3212 384470 377238 1087.0 1083 138420 2 0 4445 4548 4651 0 0 0 7096216 55681 20869 100.0% -> 1 8 0.467975 4 111112 237432 1446 239372 235985 1121.2 1119 143124 9 0 4344 4548 4752 0 0 0 1619944 23502 28844 98.7% -> 1 16 0.337085 4 58824 174508 3410 178451 171099 1117.5 1120 143276 11 0 4319 4548 4777 0 0 0 883692 19152 37460 96.8% -> 1 32 0.262798 4 30304 115313 1222 116535 112400 1070.2 1066 135620 166 26 4354 4548 4742 0 0 0 483185 18856 54275 94.9% -> 1 64 0.283954 4 15385 54181 531 56177 53650 2022.5 2040 137036 319 19 4351 4548 4745 0 0 0 292766 32998 102276 84.9% -> 1 256 0.207020 4 3892 18800 575 19105 17520 1315.5 1334 136072 418 24 3703 3927 4115 0 0 0 118400 27427 292146 82.1% -> 1 1024 0.164396 4 976 5937 57 5993 5842 1208.2 1195 135384 568 14 932 987 1030 0 0 0 62999 27554 503559 70.0% -> 1 4096 0.085079 4 245 2880 108 2987 2714 635.8 634 26468 248 246 233 247 258 0 0 0 31264 12872 1546404 37.4% -> 64 1 1.073331 4 984616 917346 7614 923983 909314 1812.2 1824 136792 11 20 4544 4548 4552 0 0 0 10971661 54538 9919 99.6% -> 64 8 1.024389 4 888896 867733 6327 870429 845215 3027.2 3072 138212 31 0 4523 4548 4573 0 0 0 9933078 68059 10050 99.5% 
-> 64 16 0.978754 4 800000 817366 7802 827665 809564 3012.2 3008 139884 39 0 4486 4548 4610 0 0 0 8947041 64050 10302 98.1% -> 64 32 0.837266 4 666688 796267 10312 806579 785370 2275.8 2266 139672 29 0 4465 4548 4631 0 0 0 7458644 50754 10564 97.8% -> 64 64 0.645627 4 500032 774490 4713 779203 768432 1136.8 1137 145428 8 0 4438 4548 4658 0 0 0 5593168 29982 10938 98.4% -> 64 256 0.386192 4 200000 517877 22509 544067 495368 1134.8 1136 145300 109 0 2135 4048 4147 0 0 0 2270291 22840 13682 94.5% -> 64 1024 0.238617 4 58880 246755 55856 305110 190899 1176.0 1118 135324 451 13 625 1192 1223 0 0 0 701262 24418 17323 71.1% -> 64 4096 0.133340 4 15424 115674 14837 117978 99072 974.0 961 27132 366 347 99 312 383 0 0 0 209595 20657 43096 50.4% ``` For single partition reads, the index_reader is modified to behave in practically the same way, as before the change (not reading ahead past the page with the partition). For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in traces below). If we enabled read aheads in that case, we would perform 2 16KB reads. 
Relevant traces: Before: ``` ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] \| 2021-07-23 15:22:25.847362 \| 127.0.0.1 \| 148 \| 127.0.0.1 ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] \| 2021-07-23 15:22:25.900996 \| 127.0.0.1 \| 53782 \| 127.0.0.1 ``` After: ``` ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] \| 2021-07-23 15:19:37.380033 \| 127.0.0.1 \| 149 \| 127.0.0.1 ./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] \| 2021-07-23 15:19:37.433662 \| 127.0.0.1 \| 53777 \| 127.0.0.1 ``` Tests: unit(dev) Closes #9063 * github.com:scylladb/scylla: sstables: index_reader: optimize single partition reads sstables: use read-aheads in the index reader sstables: index_reader: remove unused members from index reader context	2022-03-21 13:47:28 +02:00
Avi Kivity	975b0c0b03	Merge "tools/scylla-sstable: add validate-checksums and decompress" from Botond " This patchset adds two new operations to scylla-sstable: * validate-checksums - helps identify whether an sstable is intact or not, by checking the digest and the per-chunk checksums against the data on disk. * decompress - helps when one wants to manually examine the content of a compressed sstable. Refs: #497 Tests: unit(dev) " * 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla: tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/ tools/scylla-sstable: add decompress operation tools/scylla-sstables: add validate-checksums operation sstables/sstable: add validate_checksums() sstables/sstable: add raw_stream option to data_stream() sstables/sstable: make data_stream() and data_read() public utils/exceptions: add maybe_rethrow_exception()	2022-03-16 18:56:48 +02:00
Botond Dénes	ddf9dee9d8	sstables/sstable: add validate_checksums() Sstables have two kinds of checksums: per-chunk checksums and a full checksum (digest) calculated over the entire content of Data.db. The full checksum (digest) is stored in Digest.crc (component_type::Digest). When compression is used, the per-chunk checksum is stored directly inside Data.db, after each compressed chunk. These are validated on read, when decompressing the respective chunks. When no compression is used, the per-chunk checksum is stored separately in CRC.db (component_type::CRC). Chunk size is defined and stored in said component as well. In both compressed and uncompressed sstables, checksums are calculated on the data that is actually written to disk, so in case of compressed data, on the compressed data. This method validates both the full checksum and the per-chunk checksum for the entire Data.db.	2022-03-15 14:52:15 +02:00
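The two levels of checksumming described above can be sketched roughly as follows. This is an illustration only: Adler-32 is used because it is short to write out, not because the commit says which algorithm applies, and `checksums`, `compute_checksums()` and `validate()` are hypothetical names rather than the sstable API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Plain Adler-32; chosen for brevity only (an assumption, see the lead-in).
uint32_t adler32(std::string_view data) {
    uint32_t a = 1, b = 0;
    for (unsigned char c : data) {
        a = (a + c) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

// Hypothetical container for both kinds of checksums.
struct checksums {
    std::vector<uint32_t> per_chunk; // one entry per chunk of on-disk bytes
    uint32_t full = 0;               // digest over the whole on-disk content
};

// Both kinds are computed over the bytes as they are written to disk
// (i.e. over the compressed bytes when compression is used).
checksums compute_checksums(std::string_view on_disk_bytes, size_t chunk_size) {
    assert(chunk_size > 0);
    checksums result;
    for (size_t off = 0; off < on_disk_bytes.size(); off += chunk_size) {
        result.per_chunk.push_back(adler32(on_disk_bytes.substr(off, chunk_size)));
    }
    result.full = adler32(on_disk_bytes);
    return result;
}

// Validation recomputes both and compares them with the stored values.
bool validate(std::string_view on_disk_bytes, size_t chunk_size, const checksums& expected) {
    const checksums actual = compute_checksums(on_disk_bytes, chunk_size);
    return actual.full == expected.full && actual.per_chunk == expected.per_chunk;
}
```

Keeping the per-chunk values separate from the digest means a mismatch can be narrowed down to a chunk instead of only being reported for the file as a whole.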
Botond Dénes	bf335c9e7a	sstables/sstable: add raw_stream option to data_stream() Optionally provide access to the underlying data as-is, without decompression.	2022-03-15 14:47:27 +02:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting optimizes both iterative compilation times, as touching rarely used readers doesn't recompile large chunks of codebase. Total compilation times are also improved, as the size of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes real 29m14.051s user 168m39.071s sys 5m13.443s Without changes real 30m36.203s user 175m43.354s sys 5m26.376s Closes #10194	2022-03-14 13:20:25 +02:00
Botond Dénes	2e0610e459	sstables/sstable: remove now unused v1 write_components() variant Supplanted by the v2 variant.	2022-03-10 09:16:33 +02:00
Botond Dénes	fed5b73147	sstables/sstable: expose v2 variant of write_components() In parallel to the existing v1 one. In the next patches we start migrating users to the v2 variant incrementally and finally remove the v1 variant.	2022-03-10 07:03:49 +02:00
Botond Dénes	105bf8888a	sstables: convert mx writer to v2 The sstables::sstable class has two methods for writing sstables: 1) sstable_writer get_writer(...); 2) future<> write_components(flat_mutation_reader, ...); (1) directly exposes the writer type, so we have to update all users of it (there are not that many) in this same patch. We defer updating users of (2) to follow-up commits.	2022-03-10 07:03:49 +02:00
Nadav Har'El	ef43531fb6	materialized views: allow empty strings in views and indexes Although Cassandra generally does not allow empty strings as partition keys (note they are allowed as clustering keys!), it does allow empty strings in regular columns to be indexed by a secondary index, or to become an empty partition-key column in a materialized view. As noted in issues #9375 and #9364 and verified in a few xfailing cql-pytest tests, Scylla didn't allow these cases - and this patch fixes that. The patch mostly removes unnecessary code: In one place, code prevented an sstable with an empty partition key from being written. Another piece of removed code was a function is_partition_key_empty() which the materialized-view code used to check whether the view's row will end up with an empty partition key, which was supposedly forbidden. But in fact, such rows should have been allowed, as they are in Cassandra and as required for the secondary-index implementation, and the entire function wasn't necessary. Note that the removed function is_partition_key_empty() was NOT required for the "IS NOT NULL" feature of materialized views - this continues to work as expected after this patch, and we add another test to confirm it. Being null and being an empty string are two different things. This patch also removes a part of a unit test which enshrined the wrong behavior. After this patch we are left with one interesting difference from Cassandra: Though Cassandra allows a user to create a view row with an empty-string partition key, and this row is fully visible when scanning the view, this row can not be queried individually because "WHERE v=''" is forbidden when v is the partition key (of the view). Scylla does not reproduce this anomaly - and such a point query does work in Scylla after this patch. We add a new test to check this case, and mark it "cassandra_bug", i.e., it's a Cassandra behavior which we consider wrong and don't want to emulate. 
This patch relies on #9352 and #10178 having been fixed in previous patches, otherwise the WHERE v='' does not work when reading from sstables. We add to the already existing tests we had for empty materialized-views keys a lookup with WHERE v='' which failed before fixing those two issues. Fixes #9364 Fixes #9375 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-03-08 15:34:26 +02:00
Benny Halevy	eff5076dd5	sstables: close_files: auto-remove temporary sstable directory If the sstable is marked for deletion, e.g. when writing the sstable fails for any reason before it's sealed, make sure to remove the sstable's temporary directory, if present, besides the sstable's files. This condition is benign as these empty temp dirs are removed when scylla starts up, but they do accumulate and we better remove them too. Fixes #9522 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302161827.2448980-1-bhalevy@scylladb.com>	2022-03-03 16:13:03 +02:00
Michael Livshin	0caa21079d	sstables: refrain from throwing on host id mismatch This makes host id mismatch cause a warning and stop being fatal, to un-break node replacement dtests. Should be revisited if/when the underlying problem (double setting of local host id on a replacing node) is fixed. Refs #10148 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>	2022-03-03 15:53:19 +02:00
Michael Livshin	a389cc520b	system_keyspace, sstable: log local host id in key places Specifically: when it is generated, when it is loaded from `system.local`, and when there is a mismatch during sstable validation; in the latter case log the in-sstable host id also. Refs #10148 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220301123925.257766-1-michael.livshin@scylladb.com>	2022-03-02 09:49:37 +02:00
Avi Kivity	cbba80914d	memtable: move to replica module and namespace Memtables are a replica-side entity, and so are moved to the replica module and namespace. Memtables are also used outside the replica, in two places: - in some virtual tables; this is also in some way inside the replica, (virtual readers are installed at the replica level, not the coordinator), so I don't consider it a layering violation - in many sstable unit tests, as a convenient way to create sstables with known input. This is a layering violation. We could make memtables their own module, but I think this is wrong. Memtables are deeply tied into replica memory management, and trying to make them a low-level primitive (at a lower level than sstables) will be difficult. Not least because memtables use sstables. Instead, we should have a memtable-like thing that doesn't support merging and doesn't have all other funky memtable stuff, and instead replace the uses of memtables in sstable tests with some kind of make_flat_mutation_reader_from_unsorted_mutations() that does the sorting that is the reason for the use of memtables in tests (and live with the layering violation meanwhile). Test: unit (dev) Closes #10120	2022-02-23 09:05:16 +02:00
Wojciech Mitros	c81992c665	sstables: use read-aheads in the index reader Currently, when advancing one of index_reader's bounds, we're creating a new index_consume_entry_context with a new underlying file input_stream for each new page. For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing. This patch adds an index_consume_entry_context to each of index_reader's bounds, so that for each new page, the same file input_stream is used. As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the input_stream's read aheads, decreasing the number of blocking reads and increasing the throughput of the index_reader. Fixes #2388 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-02-22 01:51:33 +01:00
Michael Livshin	79bf79ebd3	sstables: validate originating host id Add an additional sstable validation step to check that originating host id matches the local host id. This is only done for ME-and-up sstables, which do not come from upload/, and when the local host id is known. When local host id is unknown, check that the sstable belongs to a system keyspace, i.e. whether it is plausible that Scylla is still booting up and hasn't loaded/generated the local host id yet. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	3511d7cd21	sstable: add is_uploaded() predicate Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	c96708d262	add support for the ME sstable format The ME format has been introduced in Cassandra 3.11.11: `11952fae77/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java (L123)` `d84c6e9810` It adds originating host id to sstable metadata in support of fixing loss of commit log data when moving sstables between nodes: https://issues.apache.org/jira/browse/CASSANDRA-16619 In Scylla: * The supported way to ingest sstables is via upload/, where stored commit log replay position should be disregarded (but see https://github.com/scylladb/scylla/issues/10080). * A later commit in this series implements originating host id validation for native ME sstables. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	26bae0cd39	sstables: add ability to write and parse optionals (that is, instances of `std::optional`). The ME sstable format includes optional originating host id in stats metadata. We know how to write and parse uuids, but not how to write and parse optionals. The format is (used by C* in this case, and also happens to be consistent with how booleans are serialized): first a boolean indicating whether the contents are present (0 or 1, as a byte), then the contents (if any). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:23 +02:00
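The on-disk layout for optionals described above (a presence byte, then the contents if any) can be sketched as below. The helper names are hypothetical and are not the sstables write()/parse() overloads, and the payload is a big-endian `uint64_t` purely for illustration, whereas the commit is about an optional host id (a UUID).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical writer: one presence byte (0 or 1), then the payload
// if present. The big-endian uint64_t payload is an assumption made
// only to keep the sketch short.
void write_optional(std::vector<uint8_t>& out, const std::optional<uint64_t>& v) {
    out.push_back(v ? 1 : 0);
    if (v) {
        for (int shift = 56; shift >= 0; shift -= 8) {
            out.push_back(static_cast<uint8_t>(*v >> shift));
        }
    }
}

// Hypothetical parser: reads one optional starting at `pos` and
// advances `pos` past everything it consumed.
std::optional<uint64_t> parse_optional(const std::vector<uint8_t>& in, size_t& pos) {
    assert(pos < in.size());
    const bool present = in[pos++] != 0;
    if (!present) {
        return std::nullopt;
    }
    assert(pos + 8 <= in.size());
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        v = (v << 8) | in[pos++];
    }
    return v;
}
```

An absent value therefore costs a single byte on disk, and a present one costs the payload size plus one.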
Michael Livshin	c00d272b16	globalize sstables::write(..., utils::UUID) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:23 +02:00
Benny Halevy	67580c0855	sstables: get rid of remove_sstable_with_temp_toc It is unused since `e40aa042a7` (version 4.2) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214140029.1513522-2-bhalevy@scylladb.com>	2022-02-14 18:57:40 +02:00
Benny Halevy	e5fc4b6f5d	sstables: coroutinize remove_by_toc_name Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214140029.1513522-1-bhalevy@scylladb.com>	2022-02-14 18:57:39 +02:00
Benny Halevy	8f417b8021	sstable: coroutinize seal_sstable Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214105214.1337361-1-bhalevy@scylladb.com>	2022-02-14 17:49:52 +02:00
Benny Halevy	c75e63e480	sstable: coroutinize move_to_new_dir Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220214154403.1590022-1-bhalevy@scylladb.com>	2022-02-14 17:47:09 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except for licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Avi Kivity	7260d8abed	Merge "index_reader: improve verify_end_state()" from Botond " Said method should take care of checking that parsing stopped in a valid state. This patch-set expands the existing but very lacking implementation by improving the existing error message and adding an additional check for prematurely exiting the parser in the middle of parsing an index entry, something we've seen recently in #9446. To help in debugging such issues, some additional information is added to the trace messages. The series also fixes a bug in the error handling code of the partition index cache. Refs: #9446 Tests: unit(dev) " * 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla: sstables/index_reader: process_state(): add additional information to trace logging sstables/index_reader: verify_end_state(): add check for premature EOS sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception sstables/index_reader: add const sstable& to index_consume_entry_context sstables/index_reader: remove unused members from index_consume_entry_context	2022-01-18 12:13:08 +02:00
Benny Halevy	2ae69447b5	sstables: update_info_for_opened_data: accumulate allocated_size into bytes_on_disk bytes_on_disk is intended to reflect the bytes allocated for the sstable files on disk. Accumulating the files logical size, as done today, causes a discrepancy between information retrieved over the storage_service/sstables_info api, like nodetool status or nodetool cfstats and command line tools like df -H /var/lib/scylla. Fixes #9941 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220118070208.3963076-1-bhalevy@scylladb.com>	2022-01-18 11:33:36 +02:00
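The difference between a file's logical size and the space allocated for it, which the commit above is about, can be illustrated with POSIX `stat` fields. This is a sketch under the assumption that allocated space is derived from `st_blocks` (counted in 512-byte units); the function names are hypothetical and the commit does not show how Scylla obtains the value.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/stat.h>

// Logical size: the byte length of the file's content.
uint64_t logical_size(const struct stat& st) {
    assert(st.st_size >= 0);
    return static_cast<uint64_t>(st.st_size);
}

// Allocated size: what the filesystem actually reserved for the file.
// POSIX reports st_blocks in units of 512 bytes.
uint64_t allocated_size(const struct stat& st) {
    assert(st.st_blocks >= 0);
    return static_cast<uint64_t>(st.st_blocks) * 512;
}
```

A 100-byte file on a filesystem with 4 KiB blocks typically has a logical size of 100 and an allocated size of 4096, and it is the allocated figure that tools like `df` account for.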
Botond Dénes	7508b4fd22	sstables/index_reader: add const sstable& to index_consume_entry_context To be used by the next patches to throw malformed sstable exception.	2022-01-18 10:38:11 +02:00
Botond Dénes	9f3e5ae801	sstables/index_reader: remove unused members from index_consume_entry_context The unused members are: _s and _file_name.	2022-01-18 10:38:11 +02:00
Nadav Har'El	3cc058d193	sstables: add missing include of seastar/core/metrics.hh sstables/sstables.cc uses seastar::metrics but was missing an include of <seastar/core/metrics.hh>. It probably received this include through some other random included Seastar header (e.g., smp.hh). Now that we're reducing the unnecessary inclusions in Seastar (an ongoing effort of Seastar patches), it is no longer included implicitly, and we need to include it explicitly in sstables.cc. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220109162823.511781-1-nyh@scylladb.com>	2022-01-09 18:30:50 +02:00
Raphael S. Carvalho	9a1fdb0635	sstables: stop including unused expensive headers database.hh is expensive to include, and turns out it's no longer needed. also stop including other unused ones. build time of sstables.o reduces by ~3% (cleared all caches and set cpu frequency to a fixed value before building sstables.o from scratch) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220104175908.98833-1-raphaelsc@scylladb.com>	2022-01-04 20:14:01 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode: if no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Benny Halevy	f7b8b809d0	sstables: parse chunked_vector<std::integral Members>: maximize chunk size Currently this parse function reads only 100KB worth of members in each iteration. Since the default max_chunk_capacity is 128KB, 100KB underutilizes the chunk capacity, and it could be safely increased to the max to reduce the number of allocations and corresponding calls to read_exactly for large arrays. Expose utils::chunked_vector::max_chunk_capacity so that the caller wouldn't have to guess this number and use it in parse(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211222103126.1819289-2-bhalevy@scylladb.com>	2021-12-22 15:47:37 +02:00
Benny Halevy	d95f6602a7	sstables: coroutinize parse functions Simplify the implementation using coroutines. This also has the potential to coalesce multiple allocations into one. test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211222103126.1819289-1-bhalevy@scylladb.com>	2021-12-22 15:47:37 +02:00
Benny Halevy	bbe275f37d	compaction: scrub_sstables_validate_mode: quarantine invalid sstables When invalid sstables are detected, move them to the quarantine subdirectory so they won't be selected for regular compaction. Refs #7658 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:14:16 +02:00
Benny Halevy	13e7b00f2e	sstables: add is_quarantined Quarantined sstables will reside in a "quarantine" subdirectory and are also not eligible for regular compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	bdc53880d4	sstables: define symbolic names for table subdirectories Define the "staging", "upload", and "snapshots" subdirectory names as named const expressions in the sstables namespace rather than relying on their string representation, which is prone to typos. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Kamil Braun	8722e0d23c	sstables: mx: enable position fast-forwarding in reverse mode Most of the machinery was already implemented since it was used when jumping between clustering ranges of a query slice. We need only perform one additional thing when performing an index skip during fast-forwarding: reset the stored range tombstone in the consumer (which may only be stored in fast-forwarding mode, so it didn't matter that it wasn't reset earlier). Comments were added to explain the details.	2021-11-29 11:10:49 +01:00
Raphael S. Carvalho	4271c4edcd	sstables: Fix metric currently_open_for_writing metric currently_open_for_writing, used to inform # of sstables opened for writing, holds the same value as total_open_for_writing. that means we aren't actually decreasing the counter, so it is bogus. Moved to sstable_writer, because sstable is used by writer to open files, which are then extracted from sstable object, and later the same object is reused for read-only mode. Fixes #9455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211013134812.177398-1-raphaelsc@scylladb.com>	2021-10-18 18:29:33 +03:00
Tomasz Grabiec	cc56a971e8	database, treewide: Introduce partition_slice::is_reversed() Cleanup, reduces noise. Message-Id: <20211014093001.81479-1-tgrabiec@scylladb.com>	2021-10-14 12:39:16 +03:00
Botond Dénes	1b7b3a81e6	sstables: entry_descriptor::make_descriptor(): add overload with provided ks/cf Not necessitating these to be extracted from the sstable dir path. This practically allows for la/mx sstables at non-standard paths to be opened. This will be used by the `scylla-sstable` tool which wants to be flexible about where the sstables it opens are located.	2021-10-12 11:43:23 +03:00

991 Commits