Commit Graph

2475 Commits

Raphael S. Carvalho
a7cdd846da compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
The compaction manager can start tons of compactions of fully expired sstables in
parallel, which may consume a significant amount of resources.
This problem is caused by the weight being released too early in compaction: after
the data is all compacted, but before the table is called to update its state, e.g. by
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can easily be worse depending on the amount of sstables
with fully expired data across all tables. This high parallelism can happen
even with only a couple of tables, if there are many time windows with expired
data, as they can be compacted in parallel.

Prior to 55a8b6e3c9, the weight was released earlier in compaction, before the
last sstable was sealed, but right now there's no need to release the weight
early. The weight can be released in a much simpler way, after the compaction is
actually done. Such compactions will therefore be serialized from now on.

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]
2021-05-30 23:22:51 +03:00
Avi Kivity
0acf5bfca6 build: enable -Wreturn-std-move
Clang warns when "return std::move(x)" is needed to elide a copy,
but the call to std::move() is missing. We disabled the warning during
the migration to clang. This patch re-enables the warning and fixes
the places it points out, usually by adding std::move() and in one
place by converting the returned variable from a reference to a local,
so normal copy elision can take place.

Closes #8739
2021-05-27 21:16:26 +03:00
Raphael S. Carvalho
ee39eb9042 sstables: Fix slow off-strategy compaction on STCS tables
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
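The factor-of-2 amplification follows from compacting in batches: 256 inputs compacted 32 at a time produce 8 intermediate sstables that must be compacted again, so every byte is written twice. A tiny illustrative sketch of the round count:

```python
import math

def rewrite_rounds(n_sstables, fan_in):
    """Number of times each byte is rewritten when n_sstables are
    repeatedly compacted in batches of at most fan_in inputs."""
    rounds = 0
    while n_sstables > 1:
        n_sstables = math.ceil(n_sstables / fan_in)
        rounds += 1
    return rounds
```

rewrite_rounds(256, 32) is 2, while compacting all disjoint sstables at once (fan-in of 256 or more) rewrites each byte only once.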

Fixes #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
2021-05-25 11:24:42 +03:00
Benny Halevy
56d3cb514a sstables: parse statistics: improve error handling
Properly return malformed_sstable_exception if the
statistics file fails to parse.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210524113808.973951-1-bhalevy@scylladb.com>
2021-05-24 15:12:48 +03:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneeded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Avi Kivity
047b3f85d3 sstables: mx reader: drop unused _column_value_length field 2021-05-21 21:02:55 +03:00
Avi Kivity
32d9ba2fbb sstables: index_consumer: drop unused max_quantity field 2021-05-21 21:02:16 +03:00
Avi Kivity
cb587aaa5c compaction: resharding_compaction: drop unused _shard field 2021-05-21 21:01:54 +03:00
Avi Kivity
f62469b7c5 compaction: compaction_read_monitor: drop unused _compaction_manager field
A constructor that now takes one argument is made explicit.
2021-05-21 21:00:47 +03:00
Pavel Solodovnikov
d7a77a993f sstables/writer: remove extraneous service/storage_service.hh include
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:03:24 +03:00
Pavel Solodovnikov
c3a7b55507 treewide: remove extraneous database.hh includes from headers
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:59:14 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Avi Kivity
6db826475d Merge "Introduce segregate scrub mode" from Botond
"
The current scrub compaction has a serious drawback: while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing them: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to lose data, even at the risk of retaining
some business-logic-level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view; it
can only do so on the database level. So in certain cases it might be
desirable to have a less heavy-handed approach to cleansing the data,
one that tries as hard as it can not to lose any data.

This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to lose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.

The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost, every scrub that does more than just discard data
  that is suspected to be corrupt (and even those, to a certain degree)
  has to consider the possibility that it is rehabilitating corruptions,
  leaving them in the system without a warning, in the sense that the
  user won't see any more problems due to low-level corruptions and
  hence might think everything is alright, while the data is still
  corrupt from the business-logic point of view. It is very hard to draw
  a line between what scrub should and shouldn't do, yet there is demand
  from users for a scrub that can restore data without losing any of it.
  Note that anybody executing such a scrub is already in a bad shape:
  even if they can read their data (they often can't), it is already
  corrupt; scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
  enum, which now selects the scrub mode. This means that
  `skip_corrupted` cannot be combined with segregate to throw out what
  the latter can't fix. This was chosen for simplicity: a bunch of
  flags all interacting with each other is very hard to see through, in
  my opinion, while a linear mode selector is much easier to follow.
* The new segregate mode goes all-in by trying to fix even
  fragment-level disorder. Maybe it should only do so on the partition
  level, or maybe this should be made configurable, allowing the user to
  select what happens to the data that cannot be fixed.

Tests: unit(dev), unit(sstable_datafile_test:debug)
"

* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
  test: boost/sstable_datafile_test: add tests for segregate mode scrub
  api: storage_service/keyspace_scrub: expose new segregate mode
  sstables: compaction/scrub: add segregate mode
  mutation_fragment_stream_validator: add reset methods
  mutation_writer: add segregate_by_partition
  api: /storage_service/keyspace_scrub: add scrub mode param
  sstables: compaction/scrub: replace skip_corrupted with mode enum
  sstables: compaction/scrub: prevent infinite loop when last partition end is missing
  tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
2021-05-18 13:43:01 +03:00
Raphael S. Carvalho
10ae77966c compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong, because a failure in resharding or reshape would be incorrectly
interpreted as success, meaning the upper layer would continue as if
everything were ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded; when that sstable
reaches a table, scylla would abort, as shared sstables are no longer
accepted in the main sstable set.
Let's allow the exception to propagate, so failure will be
communicated, and resharding and reshape will be all-or-nothing, as
originally intended.
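The difference can be sketched in a few lines of Python (run_custom_job here is a stand-in for the real C++ routine, not its actual API):

```python
def run_custom_job(job, swallow_exceptions):
    """Old behaviour reported success even when the job threw;
    new behaviour lets the exception propagate to the caller."""
    try:
        job()
        return "success"
    except Exception:
        if swallow_exceptions:
            return "success"    # failure silently turned into success
        raise
```

With swallow_exceptions=True a failed resharding looks identical to a successful one to the caller, which is exactly the bug.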

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
2021-05-17 13:57:05 +02:00
Michael Livshin
357ab759ee statistics: add global bloom filter memory gauge
Refs #251.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Michael Livshin
5abeadde4d statistics: add some sstable management metrics
Add the following metrics, as part of #251:

- open for writing (a.k.a. "created", unless I'm missing something?)

- open for reading

- deleted

- currently open for reading/writing (gauges)

Refs #251.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Michael Livshin
9a2b54fcf6 sstables: make the _open field more useful
The field was hitherto only used in scylla-gdb.py.  Let it store the
open mode (if any).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Michael Livshin
1f83251b2b sstables: stats: noexcept all accessors
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-12 03:48:07 +03:00
Piotr Sarna
00e59a9823 sstables: disambiguate boost::find
There are multiple functions named `find` in boost,
so to avoid future clashes, this one is explicitly marked
as belonging to boost::range.
2021-05-10 11:48:14 +02:00
Raphael S. Carvalho
8480839932 LCS/reshape: Don't reshape single sstable in level 0 with strict mode
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.

This is fixed by respecting the offstrategy threshold. Unit test is
added.

Fixes #8573.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
2021-05-09 11:09:54 +03:00
Lauro Ramos Venancio
15f72f7c9e TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t, so it has to be explicitly
initialized before use.

This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590
2021-05-05 17:31:05 +03:00
Botond Dénes
674a77ead0 sstables: compaction/scrub: add segregate mode
In segregate mode, scrub will segregate the content of the input sstables
into potentially multiple output sstables such that they respect
partition-level and fragment-level monotonicity requirements. This can
be used to fix data where partitions or even fragments are out-of-order
or duplicated. In this case no data is lost, and after the scrub each
sstable contains valid data.
Out-of-order partitions are fixed by simply being written into a
separate output, compared to the last one compaction was writing into.
Out-of-order fragments are fixed by injecting a
partition-end/partition-start pair right before them, effectively
moving them into a separate (duplicate) partition which is then treated
in the above mentioned way.
This mode can fix corruptions where partitions are out-of-order or
duplicated.
This mode cannot fix corruptions where partitions were merged: although
the data will be made valid on the database level, it won't be valid on
the business-logic level.
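The segregation idea can be sketched in Python as a first-fit assignment (an assumption for illustration; the actual logic in mutation_writer may differ in details):

```python
def segregate(partition_keys):
    """Split a possibly out-of-order stream of partition keys into
    first-fit outputs such that every output is strictly increasing,
    i.e. a valid sstable stream. Returns the per-key output index and
    the number of outputs opened."""
    last_keys = []          # last key written to each output
    assignment = []
    for key in partition_keys:
        for i, last in enumerate(last_keys):
            if last < key:          # key extends this output in order
                last_keys[i] = key
                assignment.append(i)
                break
        else:                       # out-of-order or duplicate key:
            last_keys.append(key)   # open a new output for it
            assignment.append(len(last_keys) - 1)
    return assignment, len(last_keys)
```

An in-order stream yields a single output; out-of-order or duplicated partitions land in additional outputs instead of being dropped.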
2021-05-05 14:33:49 +03:00
Benny Halevy
ead96e21c3 compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:37 +03:00
Benny Halevy
c1681cb9ea compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
SSTables are added in increasing size order so the bucket's
average might drift upwards.
Don't let it drift too high, to a point where the smallest
SSTable might fall out of range.

For example, here's a simulation run of the algorithm for these sstable sizes:
    [21, 123, 252, 363, 379, 394, 407, 428, 463, 467, 470, 523, 752, 774]

the simulated compaction strategy options are:
min_sstable_size = 4
bucket_low = 0.66667
bucket_high = 1.5

For each bucket, the following is printed: (avg * bucket_low) avg (avg * bucket_high)

UNCHANGED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 276.4)  414.6 ( 621.9): [252, 363, 379, 394, 407, 428, 463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}

IMPROVED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 247.0)  370.5 ( 555.8): [252, 363, 379, 394, 407, 428]
    ( 320.5)  480.8 ( 721.1): [463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}
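A Python re-implementation of the bucketing loop reproduces the IMPROVED buckets above (a sketch; the real code is C++ and structured differently): a candidate sstable is rejected when admitting it would raise bucket_low * average above the bucket's smallest sstable.

```python
def get_buckets(sizes, min_sstable_size=4, bucket_low=0.66667, bucket_high=1.5):
    """STCS bucketing over sizes processed in increasing order.
    Only the most recent bucket is considered, the running average is
    kept as a float, and the average is not allowed to drift so high
    that the smallest sstable in the bucket falls below
    bucket_low * average."""
    buckets = []
    avg = 0.0
    for size in sorted(sizes):
        if buckets:
            bucket = buckets[-1]
            new_avg = (avg * len(bucket) + size) / (len(bucket) + 1)
            fits = (bucket_low * avg <= size <= bucket_high * avg
                    or (size < min_sstable_size and avg < min_sstable_size))
            if fits and bucket[0] >= bucket_low * new_avg:
                bucket.append(size)     # smallest member stays in range
                avg = new_avg
                continue
        buckets.append([size])          # start a new bucket
        avg = float(size)
    return buckets
```

Running it on the simulated sizes yields exactly the five IMPROVED buckets: 463 is not admitted into the [252..428] bucket because the new average (383.7) would push bucket_low * avg to 255.8, above the smallest member 252.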

Fixes #8584

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:28 +03:00
Benny Halevy
d3aa5265ab compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
    auto total_size = bucket.size() * old_average_size;
    auto new_average_size = (total_size + size) / (bucket.size() + 1);
```

We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.

Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.

Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
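The accumulation is easy to demonstrate with the two running averages side by side (Python sketch; `//` mimics the C++ integer division):

```python
def average_drift(sizes):
    """Running bucket average computed two ways: rounded down at every
    step (old integer code) versus kept as a double (new code)."""
    int_avg = 0
    float_avg = 0.0
    for count, size in enumerate(sizes):
        # count items are already in the bucket before this insertion
        int_avg = (count * int_avg + size) // (count + 1)
        float_avg = (count * float_avg + size) / (count + 1)
    return int_avg, float_avg
```

For a bucket [5, 6, 6, ..., 6] the integer average stays stuck at 5, because the fractional part is discarded at every step, while the float average converges toward 6.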

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:25 +03:00
Benny Halevy
44b094f9a5 compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
The variable is now a reference used to update the bucket's average
size after a new sstable is inserted into it, so rename it accordingly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:20 +03:00
Benny Halevy
336a4dc0fd compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.

Instead, just consider the most recently inserted bucket.

Once we see a sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.

Note, `old_average_size` should be renamed, since this change
turns it into a reference and it's assigned the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:05 +03:00
Botond Dénes
03728f5c26 sstables: compaction/scrub: replace skip_corrupted with mode enum
We want to add more modes than the current two, so replace the current
boolean mode selector with an enum which allows for easy extensions.
2021-05-05 12:03:42 +03:00
Botond Dénes
ba75115e20 sstables: compaction/scrub: prevent infinite loop when last partition end is missing
Scrub compaction will add the missing last partition-end in a stream
when allowed to modify the stream. This however can cause an infinite
loop:
1) user calls fill_buffer()
2) process fragments until underlying is at EOS
3) add missing partition end
4) set EOS
5) user sees that last buffer wasn't empty
6) calls fill_buffer() again
7) goto (3)

To prevent this cycle, break out of `fill_buffer()` early when both the
scrub reader and the underlying is at EOS.
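The cycle and the fix can be modelled with a toy reader (hypothetical Python; the real reader is a C++ flat_mutation_reader):

```python
class ScrubReader:
    """Appends a missing final partition-end to the underlying stream.
    Without the early-out, fill_buffer() keeps emitting a partition-end
    forever; with it, the stream terminates once both the scrub reader
    and the underlying stream are at EOS."""
    def __init__(self, fragments, early_out=True):
        self.underlying = list(fragments)   # stream missing its 'end'
        self.early_out = early_out
        self.eos = False
        self.buffer = []

    def fill_buffer(self):
        self.buffer = []
        if self.early_out and self.eos and not self.underlying:
            return                          # fix: leave the buffer empty
        while self.underlying:
            self.buffer.append(self.underlying.pop(0))
        self.buffer.append('partition-end') # repair the missing end
        self.eos = True


def drain(reader, max_calls=10):
    """Consume the reader the way a user would: call fill_buffer()
    until it comes back empty."""
    out = []
    for _ in range(max_calls):
        reader.fill_buffer()
        if not reader.buffer:               # empty buffer signals EOS
            return out
        out.extend(reader.buffer)
    return out                              # gave up: would loop forever
```

With the early-out, draining terminates after the repaired stream; without it, every call re-emits a partition-end and the loop only stops at the artificial call cap.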
2021-05-05 12:03:42 +03:00
Pavel Emelyanov
13b07a3c58 sstables: Make checksum sink report buffer size from lower sink
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-04 12:01:30 +03:00
Pavel Emelyanov
01b979beca sstables: Report buffer size from compressed file sink
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-04 12:01:27 +03:00
Botond Dénes
9fc3cba055 sstables: improve error message for invalid sstable paths
The error message currently complains about "invalid version" and later
says the reason is that the path is not recognized. This is confusing so
change the error message to start with "invalid path" instead. It is the
path that is invalid, not the version, after all.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429092749.52659-1-bdenes@scylladb.com>
2021-04-29 12:50:48 +03:00
Asias He
60ba8eb9b8 sstables: Add debug info when create_sharding_metadata generates zero ranges
The range passed to create_sharding_metadata is supposed to be owned or
at least partially owned by the shard. Log keys, range and split
ranges for debugging if the range does not belong to the shard.

This is helpful for debugging "Failed to generate sharding
metadata for foo.db" issues reported.

Refs #7056

Closes #8557
2021-04-28 11:22:06 +03:00
Benny Halevy
3e7075a739 compaction: setup: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
90a7a8ff0e compaction: close reader when done consuming
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
7d42a71310 mutation_reader: position_reader_queue: add close method
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
3c05529329 sstables: scrub_compaction: reader: close underlying reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
75eed563bc sstables: write_components: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
8c585ccb5c sstables: sstable_mutation_reader: implement close
Close both the _index_reader and _context, if they are engaged.
Warn about and ignore any errors from close, as it may be called
either from the destructor or from f_m_r close.

Call close() to close in the background if needed when destroyed,
and warn about it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Benny Halevy
6a82e9f4be sstables: index_reader: mark close noexcept
We'd like that to simplify the soon-to-be-introduced
sstable_mutation_reader::close error handling path.

close_index_list can be marked noexcept since parallel_for_each is,
with that index_reader::close can be marked noexcept too.

Note that since reader close cannot fail,
both the lower and upper bounds are closed (as
closing the lower bound cannot fail and skip the upper one).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:16:10 +03:00
Avi Kivity
350f79c8ce Merge 'sstables: remove large allocations when parsing cells' from Wojciech Mitros
sstable cells are parsed into temporary_buffers, which causes large contiguous allocations for some cells.
This is fixed by storing fragments of the cell value in a fragmented_temporary_buffer instead.
To achieve this, this patch also adds new methods to fragmented_temporary_buffer (size(), ostream operator<<()) and adds methods to the underlying parser (primitive_consumer) for parsing byte strings into fragmented buffers.

Fixes #7457
Fixes #6376

Closes #8182

* github.com:scylladb/scylla:
  primitive_consumer: keep fragments of parsed buffer in a small_vector
  sstables: add parsing of cell values into fragmented buffers
  sstables: add non-contiguous parsing of byte strings to the primitive_consumer
  utils: add ostream operator<<() for fragmented_temporary_buffer::view
  compound_type: extend serialize_value for all FragmentedView types
2021-04-22 15:38:10 +02:00
Avi Kivity
a063173ace Merge "Fix unbounded memory usage and high write amplification in TWCS reshape" from Raphael
"
Memory usage is considerably reduced by making reshape switch to partitioned set,
given that input sstables are disjoint. This will benefit reshape for all
strategies, not only TWCS.

Write amplification is reduced a lot by compacting all input sstables at once,
which is possible given that unbounded memory usage is fixed too.

With both these issues fixed, TWCS reshape will be much more efficient.

tests: mode(dev).
"

* 'twcs_reshape_fixes' of github.com:raphaelsc/scylla:
  tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently
  TWCS: Reshape all sstables in a time window at once if they're disjoint
  sstables: Extract code to count amount of overlapping into a function
  LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
  compaction: Make reshape compaction always use partitioned_sstable_set
  compaction: Allow a compaction type to override the sstable_set for input sstables
2021-04-22 11:24:49 +03:00
Raphael S. Carvalho
d5fc2f3839 TWCS: Reshape all sstables in a time window at once if they're disjoint
With repair-based operations, each window will have 256 disjoint
sstables due to data segregation which produces N sstables for each
vnode range, where N = # of existing windows. So each window ends up
with one sstable per vnode range = 256.
Given that reshape now unconditionally uses partitioned set's incremental
selector, all the 256 sstables can be compacted at once as compaction
essentially becomes a copy operation, where only one sstable will be
opened at a time, making its memory usage very efficient.
By compacting all sstables at once, write amplification is greatly
reduced, because each byte is now rewritten only once.
Previously, with the initial set of 256 sstables, write amp could be
up to 8, which made reshape for TWCS very slow.

Refs #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:16 -03:00
Raphael S. Carvalho
0f7774a6f8 sstables: Extract code to count amount of overlapping into a function
This function will be reused by TWCS reshape when checking if all
sstables in a window are disjoint and can be all compacted together.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:16 -03:00
Raphael S. Carvalho
39ecddbd34 LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
The wrong comparison operator was used when checking for overlap. It
would miss an overlap when the last key of one sstable equals the first
key of the sstable that comes next in the set, which is sorted by
first key.
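A minimal Python version of the check (a sketch; key comparison is simplified to integers, whereas the real code compares partition keys):

```python
def is_disjoint(sstables, buggy=False):
    """sstables: (first_key, last_key) pairs. Sorts by first key and
    checks each neighbouring pair for overlap. The buggy variant used
    '<', so an sstable whose first key equals its predecessor's last
    key was not reported as overlapping."""
    ssts = sorted(sstables)
    for (_, last), (first, _) in zip(ssts, ssts[1:]):
        overlaps = first < last if buggy else first <= last
        if overlaps:
            return False
    return True
```

The buggy variant wrongly declares a set disjoint when two sstables share a boundary key.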

Fixes #8531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-21 11:03:07 -03:00
Piotr Sarna
2ad09d0bf8 Merge 'treewide: remove inclusions of storage_proxy.hh from headers' from Avi Kivity
Reduce rebuilds and build time by removing unnecessary includes. Along the way,
improve header sanity.

Ref #1.

Test: dev-headers, unit(dev).

Closes #8524

* github.com:scylladb/scylla:
  treewide: remove inclusions of storage_proxy.hh from headers
  storage_proxy: unnest coordinator_query_result
  treewide: make headers self-sufficient
  utils: intrusive_btree: add missing #pragma once
2021-04-21 08:22:52 +02:00
Benny Halevy
7130e2e7ff sstables: harden unlink
Make sure that sstable::unlink will never fail.

It will terminate in the unlikely case that toc_filename
throws (e.g. on bad_alloc); otherwise it ignores any other error
and just warns about it.

Make unlink a coroutine to simplify the implementation
without introducing additional allocations.

Note that remove_by_toc_name and maybe_delete_large_data_entries
are executed asynchronously and concurrently.
Waiting for them to finish is serialized by co_await,
making sure that both are being waited on so not to leave
abandoned futures behind.
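The co_await structure translates roughly into this asyncio sketch (remove_by_toc_name and maybe_delete_large_data_entries are stand-ins passed as callables, not the real signatures):

```python
import asyncio

async def unlink(remove_by_toc_name, maybe_delete_large_data_entries):
    """Start both cleanup steps concurrently, then await each in turn
    so neither future is left abandoned; errors are logged and
    swallowed so unlink itself never fails."""
    tasks = [asyncio.ensure_future(remove_by_toc_name()),
             asyncio.ensure_future(maybe_delete_large_data_entries())]
    for task in tasks:
        try:
            await task
        except Exception as exc:
            print(f"sstable unlink step failed (ignored): {exc!r}")
```

Both steps run concurrently, yet the sequential awaits guarantee that unlink only returns after both have finished, even when one of them throws.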

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210420135020.102733-1-bhalevy@scylladb.com>
2021-04-21 08:22:52 +02:00
Raphael S. Carvalho
678e4c0bb9 compaction: Make reshape compaction always use partitioned_sstable_set
Reshape compaction potentially works with disjoint sstables, so it will
benefit a lot from using partitioned_sstable_set, which is able to
incrementally open the disjoint sstables. Without it, all sstables are
opened at once, which means unbounded memory usage.
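The memory benefit can be illustrated with a sweep over the input key ranges (Python sketch): an incremental selector keeps open only the sstables covering the current position, so disjoint inputs peak at a single open reader.

```python
def peak_open_readers(sstables):
    """sstables: (first_key, last_key) intervals. Simulates reading
    the merged stream in key order, opening each sstable at its first
    key and closing it after its last key; returns the peak number of
    simultaneously open readers."""
    events = []
    for first, last in sstables:
        events.append((first, 0, +1))   # open sorts before a same-key close
        events.append((last, 1, -1))
    events.sort()
    peak = open_now = 0
    for _, _, delta in events:
        open_now += delta
        peak = max(peak, open_now)
    return peak
```

256 disjoint inputs peak at one open reader, whereas opening everything up front costs 256 at once; overlapping inputs raise the peak accordingly.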

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-20 15:39:51 -03:00
Avi Kivity
14a4173f50 treewide: make headers self-sufficient
In preparation for some large header changes, fix up any headers
that aren't self-sufficient by adding needed includes or forward
declarations.
2021-04-20 21:23:00 +03:00
Raphael S. Carvalho
ad9bc808b9 compaction: Allow a compaction type to override the sstable_set for input sstables
By default, compaction will pick an implementation of sstable_set as
defined by the underlying compaction strategy.
However, reshape compaction potentially works with disjoint sstables
and will benefit a lot from always using partitioned set.
For example, when reshaping a TWCS table, it's better to use the
partitioned set rather than the time window set, as the former will
be much more memory efficient by incrementally selecting sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-04-20 12:03:44 -03:00