Commit Graph

32267 Commits

Author SHA1 Message Date
Botond Dénes
079e425ef1 test/cql-pytest: add regression test for "IDL frame truncated" error 2022-07-28 09:02:28 +03:00
Botond Dénes
b23ce76b27 query: query_result_builder: add check for missing partition-end
If the reader feeding the result builder is missing a partition-end
between two partition, or at end-of-stream, the result builder will
write a corrupt partition-entry into the result, ending up in an
"IDL Frame truncated" error.
It is trivial to add a check for this and this will result in a much
more clear error message, then the mysterious frame truncated error
mentioned above.
2022-07-28 09:02:28 +03:00
Botond Dénes
f119554106 mutation_compactor: detach_state(): make it no-op if partition was exhausted
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.

This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
2022-07-28 09:02:26 +03:00
Botond Dénes
afa694a20c querier: use full_position in shard_mutation_querier
Instead of a separate partition key and position-in-partition.
This continues the recently started effort to standardize storing of
full positions on `full_position`.

This patch is also a hidden preparation for read_context::save_readers()
multishard_mutation_query.cc) no longer being able to get partition key
from compaction state in the future.
2022-07-28 08:19:23 +03:00
Botond Dénes
c54d19427d mutation_compactor: don't ignore consumer's stop request on range tombstone
Broken since the v2 output support was introduced (ad435dc).
No known adverse affects, besides mutation reads stopping a little later
than desired (on the next non-range-tombstone-change fragment) and hence
consuming more memory than the limit set for them.

Fixes: #11138

Closes #11139
2022-07-27 22:24:29 +03:00
Avi Kivity
2c0932cc41 Merge 'Reduce the amount of per-table metrics' from Amnon Heiman
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.

The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).

Closes #11058

* github.com:scylladb/scylla:
  Add summary_test
  database: Reduce the number of per-table metrics
  replica/table.cc: Do not register per-table metrics for system
  histogram_metrics_helper.hh: Add to_metrics_summary function
  Unified histogram, estimated_histogram, rates, and summaries
  Split the timed_rate_moving_average into data and timer
  utils/histogram.hh: should_sample should use a bitmask
  estimated_histogram: add missing getter method
2022-07-27 22:01:08 +03:00
Avi Kivity
2d4caa0134 Update tools/java submodule
* tools/java d0143b447c...1e7b872a61 (2):
  > scylla-tools-java: Update "six" library used by cqlsh/python driver
  > Add Scylla-specific table options to Option enum
Fixes scylladb/scylla#10856.
2022-07-27 21:41:18 +03:00
Avi Kivity
4438865a26 Merge 'memtable flush error handling' from Benny Halevy
The series unifies memtable flush error handling into table::seal_active_memtable
following up on f6d9d6175f.

The goal here is to prevent an infinite retry loop as in #10498
by aborting on any error that is not bad_alloc.

Fixes #10498

Closes #10691

* github.com:scylladb/scylla:
  test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
  test: memtable_test: failed_flush_prevents_writes: extend error injection
  table: seal_active_memtable: abort if retried for too long
  table: seal_active_memtable: abort on unexpected error
  table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
  dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
  dirty_memory_manager: flush_permit: add has_sstable_write_permit
  dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
  dirty_memory_manager: flush_permit: make _sstable_write_permit optional
  table: reindent seal_active_memtable
  table: coroutinize seal_active_memtable
  memtable_list: mark functions noexcept
  commitlog: make discard_completed_segments and friends noexcept
  dirty_memory_manager: flush_when_needed: target error handling at flush_one
  database: delete unused seal_delayed_fn_type
  dirty_memory_manager: mark functions noexcept
  memtable: mark functions noexcept
  memtable: memtable_encoding_stats_collector: mark functions noexcept
  encoding_state: mark functions noexcept
  logalloc: mark free functions noexcept
  logalloc: allocating_section: mark functions noexcept
  logalloc: allocating_section: guard: mark constructor noexcept
  logalloc: reclaim_lock: mark functions noexcept
  logalloc: tracker_reclaimer_lock: mark constructor noexcept
  logalloc: mark shard_tracker noexcept
  logalloc: region: mark functions const/noexcept
  logalloc: basic_region_impl: mark functions noexcept
  logalloc: region_impl: mark functions noexcept
  utils: log_heap: mark functions noexcept
  logalloc: region_impl: object_descriptor: mark functions noexcept
  logalloc: region_group: mark functions noexcept
  logalloc: tracker: mark functions const/noexcept
  logalloc: tracker::impl: make region_occupancy and friends const
  logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
  logalloc: tracker::impl: mark functions noexcept
  logalloc: segment: mark functions const / noexcept
  logalloc: segment_pool: add const variant of descriptor method
  logalloc: segment_pool: move descriptor method to class definition
  logalloc: segment_pool: mark functions const/noexcept
  logalloc: segment_pool: delete unused free_or_restore_to_reserve method
  utils: dynamic_bitset: mark functions noexcept
  utils: dynamic_bitset: delete unused members
  logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
  logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
  logalloc: segment_store: make can_allocate_more_segments const
  logalloc: segment_store: mark functions noexcept
  logalloc: segment_descriptor: mark functions noexcept
  logalloc: occupancy_stats: mark functions noexcept
  min_max_tracker: mark functions noexcept
  gc_clock, db_clock: mark functions noexcept
  dirty_memory_manager: region_group: mark functions noexcept
  dirty_memory_manager: region_group: make simple constructor noexcept
  dirty_memory_manager: region_group_reclaimer mark functions noexcept
  logalloc: lsa_buffer: mark functions noexcept
2022-07-27 19:08:59 +03:00
Amnon Heiman
3658aa9ec2 Add summary_test
This patch adds unit tests for the summary implementation.
2022-07-27 16:58:52 +03:00
Amnon Heiman
99a060126d database: Reduce the number of per-table metrics
This patch reduces the number of metrics that is reported per table, when
the per-table flag is on.

When possible, it moves from time_estimated_histogram and
timed_rate_moving_average_and_histogram to use the unified timer.

Instead of a histogram per shard, it will now report a summary per shard
and a histogram per node.

Counters, histograms, and summaries will not be reported if they were
never used.

The API was updated accordingly so it would not break.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
c31a58f2e9 replica/table.cc: Do not register per-table metrics for system
There is a set of per-table metrics that should only be registered for
user tables.
As time passes there are more keyspaces that are not for the user
keyspace and there is now a function that covers all those cases.

This patch replaces the implementation to use is_internal_keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
9a3e70adfb histogram_metrics_helper.hh: Add to_metrics_summary function
The to_metrics_summary is a helper function that create a metrics type
summary from a timed_rate_moving_average_with_summary object.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
c220e3a00f Unified histogram, estimated_histogram, rates, and summaries
Currently, there are two metrics reporting mechanisms: the metrics layer
and the API. In most cases, they use the same data sources. The main
difference is around histograms and rate.

The API calculates an exponentially weighted moving average using a
timer that decays the average on each time tick. It calculates a
poor-man histogram by holding the last few entries (typically the last
256 entries). The caller to the API uses those last entries to build a
histogram.

We want to add summaries to Scylla. Similar to the API rate and
histogram, summaries are calculated per time interval.

This patch creates a unified mechanism by introducing an object that
would hold both the old-style histogram and the new
(estimated_histogram). On each time tick, a summary would be calculated.
In the future, we'll replace the API to report summaries instead of the
old-style histogram and deprecate the old style completely.

summary_calculator uses two estimated_histogram to calculate a summary.

timed_rate_moving_average_summary_and_histogram is a unifed class for
ihistogram, rates, summary, and estimated_histogram and will replace
timed_rate_moving_average_and_histogram.

Follow-up patches would move code from using
timed_rate_moving_average_and_histogram to
timed_rate_moving_average_summary_and_histogram.  By keeping the API it
would make the transition easy.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:25 +03:00
Avi Kivity
a03a33dcaf Merge 'forward_service: reduce allocations in forward_service' from Piotr Sarna
This series refactors the code to get rid of unnecessary
allocations by extracing a helper requires_thread() function,
as well as by removing std::optional usage in forward_result,
now that it's possible to merge empty results with each other,
both ways (#11064).

Closes #11120

* github.com:scylladb/scylla:
  forward_service: remove redundant optional from forward_service
  forward_service: open-code running a Sestar thread
  forward_service: add requires_thread helper
2022-07-27 16:29:00 +03:00
Avi Kivity
71bec22117 Merge 'Speed up bootstrap with large number of tokens in the cluster 10X' from Asias He
=== Setup ===
1) start node1 with
```
scylla --num-tokens 20000 --smp 1
```
The large number of tokens per node is used to simulate large number of nodes in the cluster (large total number of tokens for the cluster).

2) start node2 with
```
scylla --num-tokens 20000 --smp 1
```
3) Measure the time to finish bootstrap

=== Result ===

1) With speed up patch:
```
node1 (16s)
INFO  2022-06-21 14:30:00,038 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ...
INFO  2022-06-21 14:30:16,019 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed.
node2 (bootstrap node,174s)
INFO  2022-06-21 14:30:40,954 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ...
INFO  2022-06-21 14:33:34,899 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed.
```
2) Without speed up patch:
```
node1 (171s)
INFO  2022-06-21 14:38:49,065 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ...
INFO  2022-06-21 14:41:40,601 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed.
node2 (bootstrap node, 1181s)
INFO  2022-06-21 14:41:46,997 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ...
INFO  2022-06-21 15:01:27,507 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed.
```

The improvements for bootstrap time:

node1: 171s  / 16s = 10.68X
node2: 1181s / 174s = 6.78X

Refs #10337
Refs #10817
Refs #10836
Refs #10837

Closes #10850

* github.com:scylladb/scylla:
  locator: Speed up abstract_replication_strategy::get_address_ranges
  locator: Speed up simple_strategy::calculate_natural_endpoint
  token_metadata: Speed up count_normal_token_owners
2022-07-27 16:04:09 +03:00
Raphael S. Carvalho
0796b8c97a sstables: Enforce disjoint invariant in sstable_run
We know that sstable_run is supposed to contain disjoint files only,
but this assumption can temporarily break when switching strategies
as TWCS, for example, can incorrectly pick the same run id for
sstables in different windows during segregation. So when switching
from TWCS to ICS, it could happen a sstable_run won't contain disjoint
files. We should definitely fix TWCS and any other strategy doing
that, but sstable_run should have disjointness as actual invariant,
not be relaxed on it. Otherwise, we cannot build readers on this
assumption, so more complicated logic have to be added to merge
overlapping files.
After this patch, sstable_run will reject insertion of a file that
will cause the invariant to break, so caller will have to check
that and push that file into a different sstable run.

Closes #11116
2022-07-27 14:48:28 +03:00
Benny Halevy
bb9eddc67f test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
Now that memtable flush error handling was moved entirely
to table::seal_active_memtable, we don't need to notify_soft_pressure
to keep retry going.  The inifinite retry loop should
eventually either succeed or die (by isolating the node or aborting)
on its own.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
b5abbb971f test: memtable_test: failed_flush_prevents_writes: extend error injection
Inject errors into all seal_active_memtable distinct error
handling sites.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
a5911619c0 table: seal_active_memtable: abort if retried for too long
If we haven't been able to flush the memtable
in ~30 minutes (based on the number of retries)
just abort assuming that the OOM
condition is permanent rather than transient.

Refs #4344

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
bc18f750c6 table: seal_active_memtable: abort on unexpected error
Currently when we can't write the flushed sstable
due to corruption in the memtable we get into
an infinite retry loop (see #10498).

Until we can go into maintenance mode, the next best thing
would be to abort, though there is still a risk that
commitlog replay will reproduce the corruption in the
memtable and we's end up with an infinite crash loop.
(hence #10498 is not Fixed with this patch)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:57 +03:00
Benny Halevy
f0a597a252 table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
And let seal_active_memtable decide about how to handle them
as now all flush error handling logic is implemented there.

In particular, unlike today, sstable write errors will
cause internal error rather than loop forever.

Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.

Refs #10498

(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:04:55 +03:00
Benny Halevy
d55a2ac762 dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.

Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.

Refs #4174
Refs #10498

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
93f835a2dd dirty_memory_manager: flush_permit: add has_sstable_write_permit
after release_sstable_write_permit is called,
_sstable_write_permit will have no value.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
b3dcc77c66 dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
Neither exchanging the std:;optional nor moving
the sstable_write_permit throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
53355eb95d dirty_memory_manager: flush_permit: make _sstable_write_permit optional
So we can safely test whether it was released or not
by release_sstable_write_permit in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
67479e4243 table: reindent seal_active_memtable 2022-07-27 13:43:17 +03:00
Benny Halevy
00941452d5 table: coroutinize seal_active_memtable
As a first step to making it robust using
state machine driven retries.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
d3acd80cf5 memtable_list: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
5991482049 commitlog: make discard_completed_segments and friends noexcept
To simplify table::seal_active_memtable error handling
and retry logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
863e9d9e6a dirty_memory_manager: flush_when_needed: target error handling at flush_one
Now that everything prior to flush_one is noexcept
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().

The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
73e50bc97d database: delete unused seal_delayed_fn_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
73e5cd0448 dirty_memory_manager: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
fcb3347c7a memtable: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
2d1ba0d7d8 memtable: memtable_encoding_stats_collector: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
ad85e720f9 encoding_state: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
6e961ead3b logalloc: mark free functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
705b42efe2 logalloc: allocating_section: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
f9db708376 logalloc: allocating_section: guard: mark constructor noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
5416808367 logalloc: reclaim_lock: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
95b0e41abb logalloc: tracker_reclaimer_lock: mark constructor noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
ed9e036509 logalloc: mark shard_tracker noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
d6e6ffc741 logalloc: region: mark functions const/noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
2beee4a6cd logalloc: basic_region_impl: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
3ba85c3bbd logalloc: region_impl: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
d838456be2 utils: log_heap: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
3f96818c03 logalloc: region_impl: object_descriptor: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
0866548b27 logalloc: region_group: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
fe50c76dbc logalloc: tracker: mark functions const/noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:40:50 +03:00
Benny Halevy
71c21a83ad logalloc: tracker::impl: make region_occupancy and friends const
No that they don't modify the tracker impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:40:18 +03:00
Benny Halevy
1c0c01cc24 logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
It was added in d20fae96a2
as a precaution not to invalidate iterators while
traversing _regions.  However it is not requried as no allocation
is done on this synchronous path - therefore there is no
point in preventing reclaim.

This will allow making the respective functions const
as they merely return stats and do not modify the tracker impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:39:18 +03:00