Commit Graph

2762 Commits

Author SHA1 Message Date
Botond Dénes
e8f3d7dd13 sstables/index_reader: short-circuit fast-forward-to when at EOF
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
2022-05-05 14:42:37 +03:00
Avi Kivity
aee94b7176 Merge "Convert remaining mutation sources to v2" from Botond
"
After the recent conversion of the row-cache, two v1 mutation sources
remained: the memtable and the kl sstable reader.
This series converts both to a native v2 implementation. The conversion
is shallow: both continue to read and process the underlying (v1) data
in v1, the fragments are converted to v2 right before being pushed to
the reader's buffer. This conversion is simple, surgical and low-risk.
It is also better than the upgrade_to_v2() used previously.

Following this, the remaining v1 reader implementations are removed,
with the exception of the downgrade_to_v1(), which is the only one left
at this point. Removing this requires converting all mutation sinks to
accept a v2 stream.

upgrade_to_v2() is now not used in any production code. It is still
needed to properly test downgrade_to_v1() (which is till used), so we
can't remove it yet. Instead it hidden as a private method of
mutation_source. This still allows for the above mentioned testing to
continue, while preventing anyone from being tempted to introduce new
usage.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191
"

* 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla:
  readers: make upgrade_to_v2() private
  test/lib/mutation_source_test: remove upgrade_to_v2 tests
  readers: remove v1 forwardable reader
  readers: remove v1 empty_reader
  readers: remove v1 delegating_reader
  sstables/kl: make reader impl v2 native
  sstables/kl: return v2 reader from factory methods
  sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
  partition_snapshot_reader: convert implementation to native v2
  mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()
2022-04-28 20:31:23 +03:00
Botond Dénes
70d019116f sstables/kl: make reader impl v2 native
The conversion is shallow: the meat of the logic remains v1, fragments
are converted to v2 right before being pushed into the buffer. This
approach is simple, surgical and is still better then a full
upgrade_to_v2().
2022-04-28 14:12:24 +03:00
Botond Dénes
a22b02c801 sstables/kl: return v2 reader from factory methods
This just moves the upgrade_to_v2() calls to the other side of said
factory methods, preparing the ground for converting the kl reader impl
to a native v2 one.
2022-04-28 14:12:24 +03:00
Botond Dénes
4b222e7f37 sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
Its only user is in said file, so that is a better place for it.
2022-04-28 14:12:24 +03:00
Raphael S. Carvalho
791403e4bb sstables: Fix deletion of partial SSTables
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.

After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.

The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.

This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().

Fixes #10410.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Raphael S. Carvalho
0be44b1035 sstables: Fix fsync_directory()
fsync_directory() is broken because it's unconditionally performing
fsync on parent directory, not on the directory that it was called
with. To fix, let's remove wrong parent_path() usage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Raphael S. Carvalho
ca8f5dcdb7 sstables: Rename dirname() to a more descriptive name
dirname() is confusing because if it's called on a directory, parent
path is retrieved. By renaming it to parent_path(), it's clearer
what the function will do exactly.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Piotr Sarna
bce2933d99 sstables: : remove unnecessary throws
Throws are translated to passing the exceptions directly.
2022-04-12 13:09:54 +02:00
Avi Kivity
af07519928 Merge "Remove reader from mutations v1" from Botond
"
First migrate all users to the v2 variant, all of which are tests.
However, to be able to properly migrate all tests off it, a v2 variant
of the restricted reader is also needed. All restricted reader users are
then migrated to the freshly introduced v2 variant and the v1 variant is
removed.
Users include:
* replica::table::make_reader_v2()
* streaming_virtual_table::as_mutation_source()
* sstables::make_reader()
* tests

This allows us to get rid of a bunch of conversions on the query path,
which was mostly v2 already.

With a few tests we did kick the can down the road by wrapping the v2
reader in `downgrade_to_v1()`, but this series is long enough already.

Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug)
"

* 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla:
  readers: remove now unused v1 reader from mutations
  test: move away from v1 reader from mutations
  test/boost/mutation_reader_test: use fragment_scatterer
  test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh
  test/boost: mutation_fragment_test: refactor fragment_scatterer
  readers: remove now unused v1 reversing reader
  test/boost/flat_mutation_reader_test: convert to v2
  frozen_mutation: fragment_and_freeze(): convert to v2
  frozen_mutation: coroutinize fragment_and_freeze()
  readers: migrate away from v1 reversing reader
  db/virtual_table: use v2 variant of reversing and forwardable readers
  replica/table: use v2 variant of reversing reader
  sstables/sstable: remove unused make_crawling_reader_v1()
  sstables/sstable: remove make_reader_v1()
  readers: add v2 variant of reversing reader
  readers/reversing: remove FIXME
  readers: reader from mutations: use mutation's own schema when slicing
2022-03-31 13:29:11 +03:00
Botond Dénes
3b67c25e49 sstables/sstable: remove unused make_crawling_reader_v1() 2022-03-31 09:57:48 +03:00
Botond Dénes
219cb881a4 sstables/sstable: remove make_reader_v1()
No external users, only used internally, by make_reader(), who delegates
cases currently unsupported by v2 to it. The code needed from
make_reader_v1() is inlined into make_reader() and the former is
removed.
2022-03-31 09:57:48 +03:00
Botond Dénes
b029bd3db7 tree: remove mutation_reader.hh include
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but its probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
2022-03-30 15:42:51 +03:00
Botond Dénes
f8015d9c26 readers: move combined reader into readers/
Since the combined reader family weighs more than 1K SLOC, it gets its
own .cc file.
2022-03-30 15:42:51 +03:00
Botond Dénes
a69d98c3d0 Merge "Improve efficiency of cleanup compaction by making it bucket aware" from Raphael S. Carvalho
"
Cleanup compaction works by rewriting all sstables that need clean up, one at
a time.
This approach can cause bad write amplification because the output data is
being made incrementally available for regular compaction.
Cleanup is a long operation on large data sets, and while it's happening,
new data can be written to buckets, triggering regular compaction.
Cleanup fighting for resources with regular compaction is a known problem.
With cleanup adding one file at a time to buckets, regular may require multiple
rounds to compact the data in a given bucket B, producing bad writeamp.

To fix this problem, cleanup will be made bucket aware. As each compaction
strategy has its own definition of bucket, strategies will implement their
own method to retrieve cleanup jobs. The method will be implemented such that
all files in a bucket B will be cleaned up together, and on completion,
they'll be made available for regular at once.

For STCS / ICS, a bucket is a size tier.
For TWCS, a bucket is a window.
For LCS, a bucket is a level.

In this way, writeamp problem is fixed as regular won't have to perform
multiple rounds to compact the data in a given bucket. Additionally, cleanup
will now be able to deduplicate data and will become way more efficient at
garbage collecting expired data.

The space requirement shouldn't be an issue, as compacting an entire bucket
happens during regular compaction anyway.
With leveled strategy, compacting an entire level is also not a problem because
files in a level L don't overlap and therefore incremental compaction is
employed to limit the space requirement.

By the time being, only STCS cleanup was made bucket aware. The others will be
using a default method, where one file is cleaned up at a time. Making cleanup
of other strategies bucket aware is relatively easy now and will be done soon.

Refs #10097.
"

* 'cleanup-compaction-revamp/v3' of https://github.com/raphaelsc/scylla:
  test: sstable_compaction_test: Add test for strategy cleanup method
  compaction: STCS: Implement cleanup strategy
  compaction_manager: Wire cleanup task into the strategy cleanup method
  compaction_strategy: Allow strategies to define their own cleanup strategy
  compaction: Introduce compaction_descriptor::sstables_size
  compaction: Move decision of garbage collection from strategy to task type
2022-03-25 16:30:28 +02:00
Mikołaj Sielużycki
6f1b6da68a compile: Fix headers so that *-headers targets compile cleanly.
Closes #10273
2022-03-25 16:19:26 +02:00
Raphael S. Carvalho
c25d8f6770 compaction: Move decision of garbage collection from strategy to task type
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.

That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.

The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.

As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.

can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().

The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-21 12:14:04 -03:00
Avi Kivity
585c0841c3 Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros
Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page.

For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing.

This patch adds a `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used.
As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`.

Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of  the `_entries` `chunked_managed_vector`.

A big difference can be seen when we're reading the entire table, frequently skipping a few rows; which we can test using perf_fast_forward:

Before:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.899447            4   1000000    1111794      12284    1113248    1096537      975.5    972     124356       1       0        0        0        0        0        0        0  12032202   29103    8967 100.0%
-> 1       1         1.805811            4    500000     276884        907     278214     275977     3655.8   3654     135084    2688       0     3161     4548     5935        0        0        0   7225100  140466   27010  75.6%
-> 1       8         0.927339            4    111112     119818        357     120465     119461     3654.0   3654     135084    2685       0     2133     4548     6963        0        0        0   1749663  107922   57502  50.2%
-> 1       16        0.790630            4     58824      74401        782      74617      73497     3654.0   3654     135084    2695       0     1975     4548     7121        0        0        0   1019189  109349   90832  42.7%
-> 1       32        0.717235            4     30304      42251        243      42266      41975     3654.0   3654     135084    2689       0     1871     4548     7225        0        0        0    619876  109199  156751  37.3%
-> 1       64        0.681624            4     15385      22571        244      22815      22286     3654.0   3654     135084    2685       0     1870     4548     7226        0        0        0    407671  105798  285688  34.0%
-> 1       256       0.630439            4      3892       6173         24       6214       6150     3549.0   3549     135116    2581       0     1313     3927     6505        0        0        0    232541  100803 1022454  29.1%
-> 1       1024      0.313303            4       976       3115        219       3126       2766     1956.0   1956     130608     986       0        0      987     1962        0        0        0     81165   41385 1724979  29.1%
-> 1       4096      0.083688            4       245       2928         85       3012       2134      738.8    737      17212     492     244        0      247      491        0        0        0     30500   19406 1999263  24.6%
-> 64      1         1.509011            4    984616     652491       2746     660930     649745     3673.5   3654     135084    2687       0     4507     4548     4589        0        0        0  11075882  117074   13157  68.9%
-> 64      8         1.424147            4    888896     624160       4446     625675     617713     3654.0   3654     135084    2691       0     4248     4548     4848        0        0        0  10019098  117383   13700  66.5%
-> 64      16        1.343276            4    800000     595559       5834     605880     589725     3654.0   3654     135084    2698       0     3989     4548     5107        0        0        0   9043830  124022   14206  64.9%
-> 64      32        1.249721            4    666688     533469       5056     536638     526212     3654.0   3654     135084    2688       0     3616     4548     5480        0        0        0   7570848  123043   15377  60.9%
-> 64      64        1.154549            4    500032     433097      10215     443312     415001     3654.0   3654     135084    2703       0     3161     4548     5935        0        0        0   5718758  110657   17787  53.2%
-> 64      256       1.005309            4    200000     198944       1179     199338     196989     3935.0   3935     137216    2966       0      690     4048     5592        0        0        0   2398359  110510   27855  51.3%
-> 64      1024      0.441913            4     58880     133239       8094     135471     120467     2161.0   2161     131820    1190       0        0     1192     1848        0        0        0    725092   45449   33740  59.7%
-> 64      4096      0.124826            4     15424     123564       5958     126814      95101      795.5    794      17400     553     240        0      312      482        0        0        0    199943   20869   46621  41.9%
```
After:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.917468            4   1000000    1089956       1422    1091378    1073112      975.5    972     124356       1       0        0        0        0        0        0        0  12032761   29721    8972 100.0%
-> 1       1         1.311446            4    500000     381259       3212     384470     377238     1087.0   1083     138420       2       0     4445     4548     4651        0        0        0   7096216   55681   20869 100.0%
-> 1       8         0.467975            4    111112     237432       1446     239372     235985     1121.2   1119     143124       9       0     4344     4548     4752        0        0        0   1619944   23502   28844  98.7%
-> 1       16        0.337085            4     58824     174508       3410     178451     171099     1117.5   1120     143276      11       0     4319     4548     4777        0        0        0    883692   19152   37460  96.8%
-> 1       32        0.262798            4     30304     115313       1222     116535     112400     1070.2   1066     135620     166      26     4354     4548     4742        0        0        0    483185   18856   54275  94.9%
-> 1       64        0.283954            4     15385      54181        531      56177      53650     2022.5   2040     137036     319      19     4351     4548     4745        0        0        0    292766   32998  102276  84.9%
-> 1       256       0.207020            4      3892      18800        575      19105      17520     1315.5   1334     136072     418      24     3703     3927     4115        0        0        0    118400   27427  292146  82.1%
-> 1       1024      0.164396            4       976       5937         57       5993       5842     1208.2   1195     135384     568      14      932      987     1030        0        0        0     62999   27554  503559  70.0%
-> 1       4096      0.085079            4       245       2880        108       2987       2714      635.8    634      26468     248     246      233      247      258        0        0        0     31264   12872 1546404  37.4%
-> 64      1         1.073331            4    984616     917346       7614     923983     909314     1812.2   1824     136792      11      20     4544     4548     4552        0        0        0  10971661   54538    9919  99.6%
-> 64      8         1.024389            4    888896     867733       6327     870429     845215     3027.2   3072     138212      31       0     4523     4548     4573        0        0        0   9933078   68059   10050  99.5%
-> 64      16        0.978754            4    800000     817366       7802     827665     809564     3012.2   3008     139884      39       0     4486     4548     4610        0        0        0   8947041   64050   10302  98.1%
-> 64      32        0.837266            4    666688     796267      10312     806579     785370     2275.8   2266     139672      29       0     4465     4548     4631        0        0        0   7458644   50754   10564  97.8%
-> 64      64        0.645627            4    500032     774490       4713     779203     768432     1136.8   1137     145428       8       0     4438     4548     4658        0        0        0   5593168   29982   10938  98.4%
-> 64      256       0.386192            4    200000     517877      22509     544067     495368     1134.8   1136     145300     109       0     2135     4048     4147        0        0        0   2270291   22840   13682  94.5%
-> 64      1024      0.238617            4     58880     246755      55856     305110     190899     1176.0   1118     135324     451      13      625     1192     1223        0        0        0    701262   24418   17323  71.1%
-> 64      4096      0.133340            4     15424     115674      14837     117978      99072      974.0    961      27132     366     347       99      312      383        0        0        0    209595   20657   43096  50.4%
```
For single partition reads, the index_reader is modified to behave in practically the same way, as before the change (not reading ahead past the page with the partition).
For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in traces below). If we enabled read aheads in that case, we would perform 2 16KB reads.
Relevant traces:
Before:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:22:25.847362 | 127.0.0.1 |            148 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:22:25.900996 | 127.0.0.1 |          53782 | 127.0.0.1
```
After:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:19:37.380033 | 127.0.0.1 |            149 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:19:37.433662 | 127.0.0.1 |          53777 | 127.0.0.1
```
Tests: unit(dev)

Closes #9063

* github.com:scylladb/scylla:
  sstables: index_reader: optimize single partition reads
  sstables: use read-aheads in the index reader
  sstables: index_reader: remove unused members from index reader context
2022-03-21 13:47:28 +02:00
Avi Kivity
975b0c0b03 Merge "tools/scylla-sstable: add validate-checksums and decompress" from Botond
"
This patchset adds two new operations to scylla-sstable:
* validate-checksums - helps identifying whether an sstable is intact or
  not, but checking the digest and the per-chunk checksums against the
  data on disk.
* decompress - helps when one wants to manually examine the content of a
  compressed sstable.

Refs: #497

Tests: unit(dev)
"

* 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla:
  tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
  tools/scylla-sstable: add decompress operation
  tools/scylla-sstables: add validate-checksums operation
  sstables/sstable: add validate_checksums()
  sstables/sstable: add raw_stream option to data_stream()
  sstables/sstable: make data_stream() and data_read() public
  utils/exceptions: add maybe_rethrow_exception()
2022-03-16 18:56:48 +02:00
Pavel Solodovnikov
95c8d65949 treewide: fix compilation issues with fmtlib 8.1.0+
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.

A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:31:50 +03:00
Botond Dénes
ddf9dee9d8 sstables/sstable: add validate_checksums()
Sstables have two kind of checksums: per-chunk checksums and
full-checksum (digest) calculated over the entire content of Data.db.

The full-checksum (digest) is stored in Digest.crc
(component_type::Digest).

When compression is used, the per-chunk checksum is stored directly
inside Data.db, after each compressed chunk. These are validated on
read, when decompressing the respective chunks.
When no compression is used, the per-chunk checksum is stored separately
in CRC.db (component_type::CRC). Chunk size is defined and stored in said
component as well.

In both compressed and uncompressed sstables, checksums are calculated
on the data that is actually written to disk, so in case of compressed
data, on the compressed data.

This method validates both the full checksum and the per-chunk checksum
for the entire Data.db.
2022-03-15 14:52:15 +02:00
Botond Dénes
bf335c9e7a sstables/sstable: add raw_stream option to data_stream()
Optionally provide access to the underlying data as-is, without
decompression.
2022-03-15 14:47:27 +02:00
Botond Dénes
9bc80b42cd sstables/sstable: make data_stream() and data_read() public 2022-03-15 14:42:45 +02:00
Benny Halevy
90edddd7e3 everywhere: use make_flat_mutation_reader_from_mutations_v2
Rather than upgrade_to_v2(make_flat_mutation_reader_from_mutations)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220315083425.2786228-1-bhalevy@scylladb.com>
2022-03-15 11:41:10 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Botond Dénes
7e0b51ff23 Merge 'Overhaul compaction_manager::task' from Benny Halevy
The series overhauls the compaction_manager::task design and implementation
by properly layering the functionality between the compaction_manager
that deals with generic task execution, and the per-task business logic that is defined
in a set of classes derived from the generic task class.

While at it, the series introduces `task::state` and a set of helper functions to manage it
to prevent leaks in the statistics, fixing #9974.

Two more stats counter were exposed: `completed_tasks` and a new `postponed_tasks`.

Test: sstable_compaction_test
Dtest: compaction_test.py compaction_additional_test.py

Fixes #9974

Closes #10122

* github.com:scylladb/scylla:
  compaction_manager: use coroutine::switch_to
  compaction_manager::task: drop _compaction_running
  compaction_manager: move per-type logic to derived task
  compaction_manager: task: add state enum
  compaction_manager: task: add maybe_retry
  compaction_manager: reevaluate_postponed_compactions: mark as noexcept
  compaction_manager: define derived task types
  compaction_manager: register_metrics: expose postponed_compactions
  compaction_manager: register_metrics: expose failed_compactions
  compaction_manager: register_metrics: expose _stats.completed_tasks
  compaction: add documentation for compaction_type to string conversions
  compaction: expose to_string(compaction_type)
  compaction_manager: task: standardize task description in log messages
  compaction_manager: refactor can_proceed
  compaction_manager: pass compaction_manager& to task ctor
  compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
  compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
  compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
2022-03-10 13:33:56 +02:00
Benny Halevy
72162ed653 compaction_manager: define derived task types
Turn task into a class, defining a clear hierarchy
of private, protected, and public methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 11:35:35 +02:00
Botond Dénes
2e0610e459 sstables/sstable: remove now unused v1 write_components() variant
Supplanted by the v2 variant.
2022-03-10 09:16:33 +02:00
Botond Dénes
fed5b73147 sstables/sstable: expose v2 variant of write_components()
In parallel to the existing v1 one. In the next patches we start
migrating users to the v2 variant incrementally and finally remove the
v1 variant.
2022-03-10 07:03:49 +02:00
Botond Dénes
105bf8888a sstables: convert mx writer to v2
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);

(1) directly exposes the writer type, so we have to update all users of
it (there is not that many) in this same patch. We defer updating
users of (2) to a follow-up commits.
2022-03-10 07:03:49 +02:00
Botond Dénes
11adb404c6 sstables/metadata_collector: use position_in_partition for min/max keys
Instead of naked clustering keys. Working with the latter is dangerous
because it cannot accurately represent the entire clustering domain: it
cannot represent positions between (before/after) keys. For this reason
the metadata collector had a separate update_min_max_components()
overload for range tombstones because the positions of these cannot be
represented by clustering keys alone.
Moving to position_in_partition solves this problem and it is now enough
to have a single overload with position_in_partition_view. This is also
more future proof as it will work with range tombstone changes without
any additional changes.
2022-03-10 07:03:49 +02:00
Nadav Har'El
ef43531fb6 materialized views: allow empty strings in views and indexes
Although Cassandra generally does not allow empty strings as partition
keys (note they are allowed as clustering keys!), it *does* allow empty
strings in regular columns to be indexed by a secondary index, or to
become an empty partition-key column in a materialized view. As noted in
issues #9375 and #9364 and verified in a few xfailing cql-pytest tests,
Scylla didn't allow these cases - and this patch fixes that.

The patch mostly *removes* unnecessary code: In one place, code
prevented an sstable with an empty partition key from being written.
Another piece of removed code was a function is_partition_key_empty()
which the materialized-view code used to check whether the view's
row will end up with an empty partition key, which was supposedly
forbidden. But in fact, should have been allowed like they are allowed
in Cassandra and required for the secondary-index implementation, and
the entire function wasn't necessary.

Note that the removed function is_partition_key_empty() was *NOT* required
for the "IS NOT NULL" feature of materialized views - this continues to
work as expected after this patch, and we add another test to confirm it.
Being null and being an empty string are two different things.

This patch also removes a part of a unit test which enshrined the
wrong behavior.

After this patch we are left with one interesting difference from
Cassandra: Though Cassandra allows a user to create a view row with an
empty-string partition key, and this row is fully visible in when
scanning the view, this row can *not* be queried individually because
"WHERE v=''" is forbidden when v is the partition key (of the view).
Scylla does not reproduce this anomaly - and such point query does work
in Scylla after this patch. We add a new test to check this case, and mark
it "cassandra_bug", i.e., it's a Cassandra behavior which we consider
wrong and don't want to emulate.

This patch relies on #9352 and #10178 having been fixed in previous patches,
otherwise the WHERE v='' does not work when reading from sstables.
We add to the already existing tests we had for empty materialized-views
keys a lookup with WHERE v='' which failed before fixing those two issues.

Fixes #9364
Fixes #9375

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-08 15:34:26 +02:00
Benny Halevy
eff5076dd5 sstables: close_files: auto-remove temporary sstable directory
If the sstable is marked for deletion, e.g. when
writing the sstable fails for any reason before it's
sealed, make sure to remove the sstable's temporary
directory, if present, besides the sstables files.

This condition is benign as these empty temp dirs
are removed when scylla starts up, but the do accumulate
and we better remove them too.

Fixes #9522

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302161827.2448980-1-bhalevy@scylladb.com>
2022-03-03 16:13:03 +02:00
Michael Livshin
0caa21079d sstables: refrain from throwing on host id mismatch
This makes host id mismatch cause a warning and stop being fatal,
to un-break node replacement dtests.

Should be revisited if/when the underlying problem (double setting of
local host id on a replacing node) is fixed.

Refs #10148

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>
2022-03-03 15:53:19 +02:00
Michael Livshin
a389cc520b system_keyspace, sstable: log local host id in key places
Specifically: when it is generated, when it is loaded from
`system.local`, and when there is a mismatch during sstable
validation; in the latter case log the in-sstable host id also.

Refs #10148

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220301123925.257766-1-michael.livshin@scylladb.com>
2022-03-02 09:49:37 +02:00
Nadav Har'El
7cf2e5ee5c Merge 'directory_lister: drop abort method and simplify close semantics' from Benny Halevy
This series contains:
- lister: move to utils
  - tidy up the clutter in the root dir
Based on Avi's feedback to `[PATCH 1/1] utils: directory_lister: close: always abort queue` that was sent to the mailing list:
  - directory_lister: drop abort method
  - lister: do not require get after close to fail
- test: lister_test: test_directory_lister_close simplify indentation
  - cosmetic cleanup

Closes #10142

* github.com:scylladb/scylla:
  test: lister_test: test_directory_lister_close simplify indentation
  lister: do not require get after close to fail
  directory_lister: drop abort method
  lister: move to utils
2022-03-01 16:23:47 +02:00
Benny Halevy
ebbbf1e687 lister: move to utils
There's nothing specific to scylla in the lister
classes, they could (and maybe should) be part of
the seastar library.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 12:36:03 +02:00
Pavel Emelyanov
3f884fbdd7 sstables: Remove excessive type-match assertions
The primitive_consumer method templates overcomplicate the
declaration of the fact that one of the method arguments is
the sub-type of a template argument

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:49:20 +03:00
Pavel Emelyanov
b8401f2ddd code: Convert is_integral assertions to concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:44:29 +03:00
Avi Kivity
cbba80914d memtable: move to replica module and namespace
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.

Memtables are also used outside the replica, in two places:
 - in some virtual tables; this is also in some way inside the replica,
   (virtual readers are installed at the replica level, not the
   cooordinator), so I don't consider it a layering violation
 - in many sstable unit tests, as a convenient way to create sstables
   with known input. This is a layering violation.

We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).

Test: unit (dev)

Closes #10120
2022-02-23 09:05:16 +02:00
Wojciech Mitros
7f590a3686 sstables: index_reader: optimize single partition reads
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underyling file
input_stream to the range of the page.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 02:16:52 +01:00
Wojciech Mitros
c81992c665 sstables: use read-aheads in the index reader
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.

For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.

This patch adds a index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.

Fixes #2388

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 01:51:33 +01:00
Pavel Emelyanov
49c5d5b7e8 Merge 'lister: add directory_lister' from Benny Halevy
directory_lister provides a simpler interface compared to lister.

After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.

This is especially suitable for coroutines
as demonstrated in the unit tests that were added.

For example:
```c++
        auto dl = directory_lister(path);
        while (auto de = co_await dl.get()) {
            co_await process(*de);
        }
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #9835

* github.com:scylladb/scylla:
  sstable_directory: process_sstable_dir: use directory_lister
  sstable_directory: process_sstable_dir: fixup indentation
  sstable_directory: coroutinize process_sstable_dir
  lister: add directory_lister
2022-02-21 12:24:28 +03:00
Wojciech Mitros
0a1500acd2 sstables: index_reader: remove unused members from index reader context
The _file_name and _index_file fields in index_consume_entry_context
are no longer used anywhere in the class (_file_name isn't even set,
and _index_file was previously used when creating a promoted_index,
which doesn't store the file object anymore)
2022-02-20 16:24:27 +01:00
Michael Livshin
79bf79ebd3 sstables: validate originating host id
Add an additional sstable validation step to check that originating
host id matches the local host id.

This is only done for ME-and-up sstables, which do not come from
upload/, and when the local host id is known.

When local host id is unknown, check that the sstable belongs to a
system keyspace, i.e. whether it is plausible that Scylla is still
booting up and hasn't loaded/generated the local host id yet.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3511d7cd21 sstable: add is_uploaded() predicate
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
dd4e330cc5 sstables: store originating host id in stats metadata
With this change, ME sstables start carrying their originating host
id, which makes ME format feature-complete so it can be made default.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3fef604075 sstables_manager: add get_local_host_id() method and support
Since ME sstable format includes originating host id in stats
metadata, local host id needs to be made available for writing and
validation.

Both Scylla server (where local host id comes from the `system.local`
table) and unit tests (where it is fabricated) must be accomodated.
Regardless of how the host id is obtained, it is stored in the db
config instance and accessed through `sstables_manager`.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0895188851 sstables_manager: formalize inheritability
The class is already inherited from in tests (along with overriding a
non-virtual method), so this seems to be called for.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
c96708d262 add support for the ME sstable format
The ME format has been introduced in Cassandra 3.11.11:

11952fae77/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java (L123)
d84c6e9810

It adds originating host id to sstable metadata in support of fixing
loss of commit log data when moving sstables between nodes:

https://issues.apache.org/jira/browse/CASSANDRA-16619

In Scylla:

* The supported way to ingest sstables is via upload/, where stored
  commit log replay position should be disregarded (but see
  https://github.com/scylladb/scylla/issues/10080).

* A later commit in this series implements originating host id
  validation for native ME sstables.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00