Commit Graph

4972 Commits

Author SHA1 Message Date
Piotr Dulikowski
0fd36e2579 api: allow changing hinted handoff configuration
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.

To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
220a2ca800 hints_manager: implement change_host_filter
Implements a function which is responsible for changing hints manager
configuration while it is running.

It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.

The function is blocking and waits until all relevant ep managers are
started or stopped.
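
The start-then-stop ordering described above can be sketched as follows. This is only an illustration under stated assumptions: `endpoint`, `host_predicate`, `filter_change` and `plan_filter_change` are hypothetical names, not the actual hints manager API, and the real function also waits for the managers to finish starting or stopping.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real endpoint and filter types.
using endpoint = std::string;
using host_predicate = std::function<bool(const endpoint&)>;

struct filter_change {
    std::vector<endpoint> to_start; // rejected by the old filter, allowed by the new one
    std::vector<endpoint> to_stop;  // allowed by the old filter, rejected by the new one
};

// Computes which endpoint managers to start and which to stop.
// A caller would process to_start first and to_stop afterwards.
filter_change plan_filter_change(const std::set<endpoint>& endpoints,
                                 const host_predicate& old_filter,
                                 const host_predicate& new_filter) {
    filter_change plan;
    for (const auto& ep : endpoints) {
        const bool was_allowed = old_filter(ep);
        const bool is_allowed = new_filter(ep);
        if (!was_allowed && is_allowed) {
            plan.to_start.push_back(ep);
        } else if (was_allowed && !is_allowed) {
            plan.to_stop.push_back(ep);
        }
    }
    return plan;
}
```

Endpoints whose status does not change appear in neither list, so their managers are left untouched.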
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
1302f1b5bf storage_proxy: always create hints manager
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
cefe5214ff config: plug in hints::host_filter object into configuration
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.

Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.

Now, hinted_handoff_enabled has type of db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
2020-11-17 10:24:42 +01:00
Piotr Dulikowski
5c3c7c946b db/hints: introduce host_filter
Adds a db::hints::host_filter structure, which determines if generating
hints towards a given target is currently allowed. It supports
serialization to and deserialization from the format of the
hinted_handoff_enabled configuration/CLI option.

This patch only introduces this structure, but does not make other code
use it. It will be plugged into the configuration architecture in the
following commits.
2020-11-17 10:15:47 +01:00
Piotr Dulikowski
a4f03d72b3 hints/resource_manager: allow registering managers after start
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.

This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
2020-11-17 10:15:47 +01:00
Piotr Dulikowski
40710677d0 hints: introduce db::hints::directory_initializer
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
2020-11-17 10:15:47 +01:00
Kamil Braun
d74f303406 cdc: ensure that CDC generation write is flushed to commitlog before ack
When a node bootstraps or upgrades from a pre-CDC version, it creates a
new CDC generation, writes it to a distributed table
(system_distributed.cdc_generation_descriptions), and starts gossiping
its timestamp. When other nodes see the timestamp being gossiped, they
retrieve the generation from the table.

The bootstrapping/upgrading node therefore assumes that the generation
is made durable and other nodes will be able to retrieve it from the
table. This assumption could be invalidated if periodic commitlog mode
was used: replicas would acknowledge the write and then immediately
crash, losing the write if they were unlucky (i.e. commitlog wasn't
synced to disk before the write was acknowledged).

This commit enforces all writes to the generations table to be
synced to commitlog immediately. It does not matter for performance as
these writes are very rare.

Fixes https://github.com/scylladb/scylla/issues/7610.

Closes #7619
2020-11-17 00:01:13 +02:00
Piotr Jastrzebski
d2897d8f8b alternator: guard streams with an experimental flag
Add new alternator-streams experimental flag for
alternator streams control.

CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-12 12:36:16 +01:00
Piotr Jastrzebski
e9072542c1 Mark CDC as GA
Enable CDC by default.
Rename CDC experimental feature to UNUSED_CDC to keep accepting cdc
flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-12 12:36:13 +01:00
Piotr Sarna
d43ac783c6 db,view: degrade helper message from error to warn
When a missing base column happens to be named `idx_token`,
an additional helper message is printed in logs.
This additional message does not need to have `error` severity,
since the previous, generic message is already marked as `error`.
This patch simply makes it easier to write tests, because in case
this error is expected, only one message needs to be explicitly
ignored instead of two.

Closes #7597
2020-11-12 12:28:26 +02:00
Benny Halevy
3fab0f8694 storage_proxy: convert to shared_token_metadata
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.

expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
6d06853e6c abstract_replication_strategy: convert to shared_token_metadata
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
8bcdf39a18 hints/manager: scan_for_hints_dirs: fix use-after-move
This use-after-move was apparently exposed after switching to clang
in commit eb861e68e9.

The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.

This shows in the node logs as a "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.
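
The general shape of the fix can be sketched as below: read everything needed from the entry before moving it, so that both the parse and the error path still see a valid name. `directory_entry` here is a simplified stand-in and `scan_entry` is a hypothetical helper, not the code in scan_for_hints_dirs.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the real directory_entry type.
struct directory_entry {
    std::string name;
};

// Parses the shard number from the entry name and only then hands the
// entry over. Returns std::nullopt for names that are not numbers.
std::optional<int> scan_entry(directory_entry de,
                              std::vector<directory_entry>& accepted) {
    try {
        const int shard = std::stoi(de.name); // de is still intact here
        accepted.push_back(std::move(de));    // move only after the last use
        return shard;
    } catch (const std::logic_error&) {
        // std::stoi threw before the move, so de.name is still usable
        // here, for example in an "Ignore invalid directory" log message.
        return std::nullopt;
    }
}
```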

Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
2020-11-09 16:32:54 +01:00
Piotr Wojtczak
72c7f25a29 db: add TransitionalAuthorizer and TransitionalAuthenticator...
... to config descriptions

We allow setting the transitional auth as one of the options
in scylla.yaml, but don't mention it at all in the field's
description. Let's change that.

Closes #7565
2020-11-09 10:51:54 +01:00
Avi Kivity
6b4a7fa515 Revert "Revert "config: Do not enable repair based node operations by default""
This reverts commit 71d0d58f8c. Repair based
node operations are still not ready and will be re-enabled after more
testing and fixes.
2020-11-08 14:09:50 +02:00
Tomasz Grabiec
6d0d55aa72 Merge "Unglobal query processor instance" from Pavel Emelyanov
The query processor is present in the global namespace and is
widely accessed with global get(_local)?_query_processor().
There's a long-term task to get rid of this globality and make
services and components reference each other and, because of
this, start and stop in a specific order. This set does
this for the query processor.

The remaining users of it are -- alternator, controllers for
client services, schema_tables and sys_dist_ks. All of them
except for the schema_tables are fixed just by passing the
reference on query processor with small patches. The schema
tables accessing qp sit deep inside the paxos code, but can
be "fixed" with the qctx thing until the qctx itself is
de-globalized.

* https://github.com/xemul/scylla/tree/br-rip-global-query-processor:
  code: RIP global query processor instance
  cql test env: Keep query processor reference on board
  system distributed keyspace: Start sharded service earlier
  schema_tables: Use qctx to make internal requests
  transport: Keep sharded query processor reference on controller
  thrift: Keep sharded query processor reference on controller
  alternator: Use local query processor reference to get keys
  alternator: Keep local query processor reference in server
2020-11-06 14:24:41 +01:00
Piotr Sarna
b61d4bc8d0 db: degrade view building progress loading error to warning
When the view builder cannot read view building progress from an
internal CQL table it produces an error message, but that only confuses
the user and the test suite -- this situation is entirely recoverable,
because the builder simply assumes that there is no progress and the
view building should start from scratch.

Fixes #7527

Closes #7558
2020-11-06 10:19:11 +02:00
Nadav Har'El
7ff72b0ba5 Merge 'secondary_index: fix returned rows token ordering' from Piotr Grabowski
Fixes returned rows ordering to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive tokens.

Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation.

Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens.

Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`.

Switch token computation in secondary indexes to (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if cluster supports `correct_idx_token_in_secondary_index` feature to make sure that all nodes
will be able to compute new `token_column_computation`. Also old indexes will need to be rebuilt to take advantage of this fix, as new token column computation type is only set for new indexes.

Fix tests according to new token ordering and add one new test to validate this aspect explicitly.

Fixes #7443

Tested manually a scenario when someone created an index on old version of Scylla and then migrated to new Scylla. Old index continued to work properly (but returning in wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order.

Closes #7534

* github.com:scylladb/scylla:
  tests: add token ordering test of indexed selects
  tests: fix tests according to new token ordering
  secondary_index: use new token_column_computation
  feature: add correct_idx_token_in_secondary_index
  column_computation: add token_column_computation
  token_column_computation: rename as legacy
2020-11-05 18:44:49 +01:00
Piotr Grabowski
b1350af951 token_column_computation: rename as legacy
Rename token_column_computation to legacy_token_column_computation, as
it will be replaced with new column_computation. The reason is that this
computation returns bytes, but all tokens in Scylla can now be
represented by int64_t. Moreover, returning bytes causes invalid token
ordering as bytes comparison is done in unsigned way (not signed as
int64_t). See issue:

https://github.com/scylladb/scylla/issues/7443
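
The ordering problem can be reproduced in a few lines. The sketch below assumes the token is serialized as 8 big-endian two's-complement bytes, which is an assumption made for illustration; `to_big_endian`, `bytes_less` and `token_less` are hypothetical helpers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using token_bytes = std::array<std::uint8_t, 8>;

// Serializes a token as 8 big-endian bytes (assumed layout).
token_bytes to_big_endian(std::int64_t token) {
    const auto u = static_cast<std::uint64_t>(token);
    token_bytes out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(u >> (56 - 8 * i));
    }
    return out;
}

// Byte-wise comparison: lexicographic over unsigned bytes.
bool bytes_less(const token_bytes& a, const token_bytes& b) {
    return a < b;
}

// Signed comparison: the intended token ordering.
bool token_less(std::int64_t a, std::int64_t b) {
    return a < b;
}
```

With these definitions, `token_less(-1, 1)` holds, while `bytes_less(to_big_endian(-1), to_big_endian(1))` does not, because the leading byte of a negative token is 0xFF and therefore sorts after every non-negative token.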
2020-11-04 12:00:18 +01:00
Piotr Sarna
b66c285f94 schema_tables: fix fixing old secondary index schemas
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.

Fixes #7515

Closes #7516
2020-11-02 12:30:20 +02:00
Pavel Emelyanov
021b905773 schema_tables: Use qctx to make internal requests
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.

To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).

The qctx itself is a global pointer, which waits for its fix too, of course.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 18:50:01 +03:00
Piotr Sarna
35887bf88b view: add printing missing base column on errors
When a write operation attempts to use an out-of-sync view,
the whole operation needs to be aborted with an error. After this patch,
the error contains more context - namely, the missing column.
2020-10-31 12:22:07 +01:00
Piotr Sarna
ef3470fa34 view: simplify creating base-dependent info for reads only
The code which created base-dependent info for materialized views
can be expressed with fewer branches. Also, the constructor
which takes a single parameter is made explicit.
2020-10-31 12:22:07 +01:00
Piotr Sarna
71b28d69b3 view: fix typo: s/dependant/dependent
2020-10-31 12:22:07 +01:00
Piotr Sarna
669e2ada92 view: add error logs if a view is out of sync with its base
When Scylla finds out that a materialized view contains columns
which are not present in the base table (and they are not computed),
it now presents comprehensible errors in the log.
2020-10-31 12:22:07 +01:00
Tomasz Grabiec
158ae99c89 Merge 'view info: preserve integrity by allowing base info for reads only and by initializing base info' from Eliran Sinvani
The purpose of this PR is to handle schema integrity issues that can arise in races involving materialized views.
The possibility of such integrity issues was found in #7420, where a view schema was used for reading without
its _base_info member initialized, resulting in a segfault.
We handle this by doing three things:
1. First, guard against using an uninitialized base info. This is considered an internal error, as it indicates that
there is a path in our code that creates a view schema to be used for reads or writes but does not initialize the base info.
2. We allow the base info to be initialized also from a partially matching base (most likely a newer one than the one used to create the view).
3. We fix the suspected path that creates such a view schema so that it initializes it (in the migration manager).

It is worth mentioning that this PR is a workaround to a probable design flaw in our materialized views which requires the base
table's information to be retrieved in the first place instead of just being self contained.

Refs #7420

Closes #7469

* github.com:scylladb/scylla:
  materialized views: add a base table reference if missing
  view info: support partial match between base and view for only reading from view.
  view info: guard against null dereference of the base info
2020-10-21 16:21:00 +02:00
Eliran Sinvani
70e04c1123 view info: support partial match between base and view for only reading from view.

The current implementation of materialized views does
not keep the version to which a specific version of a materialized
view schema corresponds. This complicates things, especially for
old view versions that the schema doesn't support anymore. However,
the views, being independent tables as well, should allow reading from
them as long as they exist, even if the base table has changed since then.
For reading purposes, we don't need to know the exact composition
of view primary key columns that are not part of the base primary
key; we only need to know whether there are any, and this is a much
looser constraint on the schema.
We can rely on table invariants, such as the fact that pk columns are
not going to disappear in newer versions of the table.
This means that if we don't find a view column in the base table, it is
not a part of the base table primary key.
This information is enough for us to perform reads on the view.
This commit adds support for being able to rely on such partial
information along with a validation that it is not going to be used for
writes. If it is, we simply abort since this means that our schema
integrity is compromised.
2020-10-21 15:20:43 +03:00
Eliran Sinvani
372051c97d view info: guard against null dereference of the base info
The change's purpose is to guard against segfault that is the
result of dereferencing the _base_info member when it is
uninitialized. We already know this can happen (#7420).
The only purpose of this change is to treat this condition as
an internal error, because it indicates a schema integrity
problem.
Besides this change, other measures should be taken to ensure that
the _base_table member is initialized before calling methods that
rely on it.
We call the internal_error as a last resort.
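
A minimal sketch of such a guard, assuming a simplified `view_info` with an optional base info. The class layout is hypothetical, and `std::logic_error` stands in for the project's internal error reporting:

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical, simplified stand-in for the real base-dependent info.
struct base_info {
    bool has_extra_pk_columns;
};

class view_info {
    std::optional<base_info> _base_info;
public:
    void set_base_info(base_info bi) { _base_info = bi; }

    // Fails loudly instead of dereferencing an uninitialized member.
    const base_info& base() const {
        if (!_base_info) {
            // Stand-in for reporting an internal error.
            throw std::logic_error("view_info: base info used before initialization");
        }
        return *_base_info;
    }
};
```

The guard turns a segfault into a diagnosable error. It does not remove the need to initialize the base info on every path that creates a view schema.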
2020-10-21 12:12:51 +03:00
Avi Kivity
8e386a5f48 schema_tables: adjust altered_schema construction for clang
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
altered_schema.

As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
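
The workaround can be illustrated with a small aggregate. `altered_entry` and `collect` below are hypothetical names, not the actual altered_schema code:

```cpp
#include <string>
#include <vector>

// An aggregate: no user-declared constructors.
struct altered_entry {
    std::string name;
    int version;
};

std::vector<altered_entry> collect() {
    std::vector<altered_entry> v;
    // v.emplace_back("ks.t", 1);
    // The line above relies on P0960R3, because emplace_back() constructs
    // the element with parentheses internally.
    v.emplace_back(altered_entry{"ks.t", 1}); // braced initialization at the call site
    return v;
}
```

The braced form compiles with or without P0960R3 support, at the cost of one extra move of the element.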
2020-10-19 14:57:21 +03:00
Avi Kivity
cb9a9584ac db: hints/manager: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
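
A sketch of the pattern, using a hypothetical `make_tasks` helper: instead of `auto& [name, n] = entry;` followed by a capture of `name` and `n`, ordinary named variables are declared and captured.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Builds one deferred task per map entry. The lambda captures plain
// variables rather than structured bindings, so it does not depend on
// compiler support for P1091R3.
std::vector<std::function<std::size_t()>> make_tasks(
        const std::map<std::string, std::size_t>& entries) {
    std::vector<std::function<std::size_t()>> tasks;
    for (const auto& entry : entries) {
        const std::string name = entry.first; // copied so the task owns its data
        const std::size_t n = entry.second;
        tasks.push_back([name, n] { return name.size() + n; });
    }
    return tasks;
}
```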
2020-10-16 15:24:09 +03:00
Avi Kivity
1986a74cc4 db: commitlog_replayer: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:01 +03:00
Tomasz Grabiec
f893516e55 Merge "lwt: store column_mapping's for each table schema version upon a DDL change" from Pavel Solodovnikov
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such a situation may arise under the following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such a mismatch happens, prior to this patch the stored
`frozen_mutation` could be applied only if we were lucky enough
and the column_mapping in the mutation was "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

* git@github.com:ManManson/scylla.git feature/table_schema_history_v7:
  lwt: add column_mapping history persistence tests
  schema: add equality operator for `column_mapping` class
  lwt: store column_mapping's for each table schema version upon a DDL change
  schema_tables: extract `fill_column_info` helper
  frozen_mutation: introduce `unfreeze_upgrading` method
2020-10-15 20:48:29 +02:00
Pavel Solodovnikov
055fd3d8ad lwt: store column_mapping's for each table schema version upon a DDL change
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such a situation may arise under the following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such a mismatch happens, prior to this patch the stored
`frozen_mutation` could be applied only if we were lucky enough
and the column_mapping in the mutation was "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

In case we don't find a column_mapping we just return an error
from the learn stage.

Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-10-15 19:24:30 +03:00
Nadav Har'El
509a41db04 alternator: change name of Alternator's SSL options
When Alternator is enabled over HTTPS - by setting the
"alternator_https_port" option - it needs to know some SSL-related options,
most importantly where to pick up the certificate and key.

Before this patch, we used the "server_encryption_options" option for that.
However, this was a mistake: Although it sounds like these are the "server's
options", in fact prior to Alternator this option was only used when
communicating with other servers - i.e., connections between Scylla nodes.
For CQL connections with the client, we used a different option -
"client_encryption_options".

This patch introduces a third option "alternator_encryption_options", which
controls only Alternator's HTTPS server. Making it separate from the
existing CQL "client_encryption_options" allows both Alternator and CQL to
be active at the same time but with different certificates (if the user
so wishes).

For backward compatibility, we temporarily continue to allow
server_encryption_options to control the Alternator HTTPS server if
alternator_encryption_options is not specified. However, this generates
a warning in the log, urging the user to switch. This temporary workaround
should be removed in a future version.

This patch also:
1. fixes the test run code (which has an "--https" option to test over
   https) to use the new name of the option.
2. Adds documentation of the new option in alternator.md and protocols.md -
   previously the information on how to control the location of the
   certificate was missing from these documents.

Fixes #7204.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930123027.213587-1-nyh@scylladb.com>
2020-10-14 18:13:57 +03:00
Avi Kivity
86bbf1763d Merge "reader concurrency semaphore: dump permit diagnostics on timeout or queue overflow" from Botond
"
The reader concurrency semaphore timing out or its queue being overflown
are fairly common events both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate and the investigation is lengthy...
either involves looking at metrics or coredumps or both.

This patch intends to jumpstart this process by dumping diagnostics on
semaphore timeout or queue overflow. The diagnostics are printed to the
log with debug level to avoid excessive spamming. They contain a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.

Example:

DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory  count   name
3499M   27      ks.test:data-query

3499M   27      total

Permits with state waiting, sorted by count
count   memory  name
1       0B      ks.test:drain
7650    0B      ks.test:data-query

7651    0B      total

Permits with state registered, sorted by count
count   memory  name

0       0B      total

Total: permits: 7678, memory: 3499M

This allows determining several things at a glance:
* What are the tables involved
* What are the operations involved
* Where is the memory

This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.

Tests: unit(dev, debug)
"

* 'dump-diagnostics-on-semaphore-timeout/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: dump permit diagnostics on timeout or queue overflow
  utils: add to_hr_size()
  reader_concurrency_semaphore: link permits into an intrusive list
  reader_concurrency_semaphore: move expiry_handler::operator()() out-of-line
  reader_concurrency_semaphore: move constructors out-of-line
  reader_concurrency_semaphore: add state to permits
  reader_concurrency_semaphore: name permits
  querier_cache_test: test_immediate_evict_on_insert: use two permits
  multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader()
  multishard_combining_reader: add permit parameter
  multishard_combining_reader: shard_reader: use multishard reader's permit
2020-10-13 12:44:23 +03:00
Botond Dénes
ff623e70b3 reader_concurrency_semaphore: name permits
Require a schema and an operation name to be given to each permit when
created. The schema is that of the table the read is executed against, and
the operation name identifies the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kinds of
reads, but also the different code paths the read took.

As not all reads can be associated with one schema, the schema is allowed
to be null.

The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
2020-10-13 12:32:13 +03:00
Piotr Dulikowski
77a0f1a153 hints: don't read hint files when it's not allowed to send
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:

- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
  target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
  this operation after sleeping 1 second.

This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.

This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.

First observed in #6964

Tests:
- unit(dev)
- hinted handoff dtests

Closes #7407
2020-10-12 19:09:57 +03:00
Botond Dénes
dd372c8457 flat_mutation_reader: de-virtualize buffer_size()
The main user of this method, the one which required this method to
return the collective buffer size of the entire reader tree, is now
gone. The remaining two users just use it to check the size of the
reader instance they are working with.
So de-virtualize this method and reduce its responsibility to just
returning the buffer size of the current reader instance.
2020-10-06 08:22:56 +03:00
Avi Kivity
fd1dd0eac7 Merge "Track the memory consumption of reader buffers" from Botond
"
The last major untracked area of the reader pipeline is the reader
buffers. These scale with the number of readers as well as with the size
and shape of data, so their memory consumption is unpredictable and varies
wildly. For example many small rows will trigger larger buffers
allocated within the `circular_buffer<mutation_fragment>`, while few
larger rows will consume a lot of external memory.

This series covers this area by tracking the memory consumption of both
the buffer and its content. This is achieved by passing a tracking
allocator to `circular_buffer<mutation_fragment>` so that each
allocation it makes is tracked. Additionally, we now track the memory
consumption of each and every mutation fragment through its whole
lifetime. Initially I contemplated just tracking the `_buffer_size` of
`flat_mutation_reader::impl`, but concluded that as our reader trees are
typically quite deep, this would result in a lot of unnecessary
`signal()`/`consume()` calls, that scales with the number of mutation
fragments and hence adds to the already considerable per mutation
fragment overhead. The solution chosen in this series is to instead
track the memory consumption of the individual mutation fragments, with
the observation that these are typically always moved and very rarely
copied, so the number of `signal()`/`consume()` calls will be minimal.

This additional tracking introduces an interesting dilemma however:
readers will now have significant memory on their account even before
being admitted. So it may happen that they can prevent their own
admission via this memory consumption. To prevent this, memory
consumption is only forwarded to the semaphore upon admission. This
might be solved when the semaphore is moved to the front -- before the
cache.
Another consequence of this additional, more complete tracking is that
evictable readers now consume memory even when the underlying reader is
evicted. So it may happen that even though no reader is currently
admitted, all memory is consumed from the semaphore. To prevent any such
deadlocks, the semaphore now admits a reader unconditionally if no
reader is admitted -- that is if all count resources are available.

Refs: #4176

Tests: unit(dev, debug, release)
"
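
The unconditional-admission rule from the cover letter above can be sketched as follows. `resources` and `can_admit` are hypothetical simplifications, not the semaphore's actual interface:

```cpp
#include <cstdint>

// Hypothetical, simplified resource pair tracked by the semaphore.
struct resources {
    int count;
    std::int64_t memory;
};

// Decides whether a reader needing `needed` may be admitted.
// `initial` holds the semaphore's total resources and `available` what is
// currently left; memory may be negative once buffers are tracked.
bool can_admit(const resources& initial,
               const resources& available,
               const resources& needed) {
    // All count resources available means no reader is currently admitted:
    // admit unconditionally to avoid a deadlock.
    if (available.count == initial.count) {
        return true;
    }
    return available.count >= needed.count && available.memory >= needed.memory;
}
```

The first branch covers the case where evicted readers still hold tracked memory but no reader is admitted.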

* 'track-reader-buffers/v2' of https://github.com/denesb/scylla: (37 commits)
  test/manual/sstable_scan_footprint_test: run test body in statement sched group
  test/manual/sstable_scan_footprint_test: move test main code into separate function
  test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
  test/manual/sstable_scan_footprint_test: make clustering row size configurable
  test/manual/sstable_scan_footprint_test: document sstable related command line arguments
  mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*()
  test: simple_schema: add make_static_row()
  reader_permit: reader_resources: add operator==
  mutation_fragment: memory_usage(): remove unused schema parameter
  mutation_fragment: track memory usage through the reader_permit
  reader_permit: resource_units: add permit() and resources() accessors
  mutation_fragment: add schema and permit
  partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
  mutation_fragment: remove as_mutable_end_of_partition()
  mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
  mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
  mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
  mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
  flat_mutation_reader: make _buffer a tracked buffer
  mutation_reader: extract the two fill_buffer_result into a single one
  ...
2020-09-29 16:08:16 +03:00
Eliran Sinvani
925cdc9ae1 consistency level: fix wrong quorum calculation when RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0; in this situation quorum should
also be 0.
This commit adds the missing corner case.
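
As a sketch of the corrected formula (the helper name `quorum_for` is
illustrative and not necessarily the one used in the code):

```cpp
#include <cstddef>

// Hypothetical helper: number of endpoints required for quorum at a given RF.
constexpr std::size_t quorum_for(std::size_t rf) {
    // With RF = 0 there are no replicas, so quorum must be 0 rather than
    // the 1 that (rf / 2) + 1 would yield.
    return rf ? (rf / 2) + 1 : 0;
}
```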

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296
2020-09-29 13:25:41 +03:00
Piotr Sarna
4b856cf62d transport: make max_concurrent_requests_per_shard reloadable
This configuration entry is expected to be used as a quick fix
for an overloaded node, so it should be possible to reload this value
without having to restart the server.
2020-09-29 10:11:36 +02:00
Piotr Sarna
b4db6d2598 transport,config: add a param for max request concurrency
The newly introduced parameter - max_concurrent_requests_per_shard
- can be used to limit the number of in-flight requests a single
coordinator shard can handle. Each surplus request will be
immediately refused by returning OverloadedException error to the client.
The default value for this parameter is large enough to never
actually shed any requests.
Currently, the limit is only applied to CQL requests - other frontends
like alternator and redis are not throttled yet.
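
A minimal sketch of this kind of load shedding, assuming a per-shard counter
of in-flight requests. The `request_limiter` and `overloaded_error` names are
hypothetical and do not come from the transport code.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical error type standing in for the OverloadedException response.
struct overloaded_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical per-shard limiter; a shard is single-threaded, so no locking.
class request_limiter {
    std::uint32_t _max;
    std::uint32_t _in_flight = 0;
public:
    explicit request_limiter(std::uint32_t max) : _max(max) {}

    // The limit can be changed while requests are in flight.
    void set_max(std::uint32_t max) { _max = max; }

    // Refuse surplus requests immediately instead of queueing them.
    void begin_request() {
        if (_in_flight >= _max) {
            throw overloaded_error("too many in-flight requests on this shard");
        }
        ++_in_flight;
    }

    void end_request() { --_in_flight; }

    std::uint32_t in_flight() const { return _in_flight; }
};
```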
2020-09-29 09:59:30 +02:00
Botond Dénes
6ca0464af5 mutation_fragment: add schema and permit
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and passed to
the permit.

In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
2020-09-28 11:27:23 +03:00
Botond Dénes
4f5ccf82cb mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
We will soon want to update the memory consumption of a mutation fragment
after each modification done to it. To do that safely, we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_clustering_row() &&`.
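
The pattern can be sketched as follows. `toy_fragment` is a hypothetical,
heavily simplified stand-in that uses a string as its payload; only the shape
of the API mirrors the description above.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical fragment: the payload is only reachable for modification
// through a lambda, so the tracked size can be refreshed afterwards.
class toy_fragment {
    std::string _row;
    std::size_t _tracked = 0;
    void update_tracking() { _tracked = _row.size(); }
public:
    explicit toy_fragment(std::string row) : _row(std::move(row)) {
        update_tracking();
    }

    // Read-only access on lvalues.
    const std::string& as_clustering_row() const & { return _row; }

    // Moving the payload away, for callers that consume the fragment.
    std::string as_clustering_row() && { return std::move(_row); }

    // All in-place modifications go through here.
    template <typename Func>
    void mutate_as_clustering_row(Func&& f) {
        f(_row);
        update_tracking();
    }

    std::size_t tracked_size() const { return _tracked; }
};
```

A caller would then write something like
`f.mutate_as_clustering_row([] (std::string& row) { row += "x"; });`
instead of obtaining a mutable reference.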
2020-09-28 10:53:56 +03:00
Botond Dénes
3fab83b3a1 flat_mutation_reader: impl: add reader_permit parameter
Not used yet, this patch does all the churn of propagating a permit
to each impl.

In the next patch we will use it to track the memory
consumption of `_buffer`.
2020-09-28 10:53:48 +03:00
Piotr Dulikowski
39771967bb hinted handoff: fix race - decommission vs. endpoint mgr init
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.

The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decomissioned or removed, it instead drains hints managers for all
endpoints.

In the case of decommission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.

End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.

To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
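
A rough sketch of the guard, assuming a synchronous model in which the race
window is collapsed into a single call. In the real code the drain is
asynchronous, which is what opens the window; `toy_hints_manager` and its
members are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, synchronous model of the hints manager.
class toy_hints_manager {
    // Endpoint -> pending hints; entries are created lazily on first store.
    std::map<std::string, std::vector<std::string>> _ep_managers;
    bool _draining_all = false;
public:
    // Returns false if the hint was rejected.
    bool store_hint(const std::string& endpoint, std::string hint) {
        if (_draining_all) {
            // Do not lazily create a manager that clear() would then
            // destroy without draining it first.
            return false;
        }
        _ep_managers[endpoint].push_back(std::move(hint));
        return true;
    }

    // Models the removenode/decommission path of drain_for.
    void drain_all() {
        _draining_all = true;
        for (auto& entry : _ep_managers) {
            entry.second.clear(); // stands in for sending the hints out
        }
        _ep_managers.clear();
    }

    std::size_t manager_count() const { return _ep_managers.size(); }
};
```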

Fixes #7257

Closes #7278
2020-09-24 14:51:24 +03:00
Avi Kivity
844b675520 view: view_update_generator: drop references to sstables when stopping
sstable_manager will soon wait for all sstables under its
control to be deleted (if so marked), but that can't happen
if someone is holding on to references to those sstables.

To allow sstables_manager::stop() to work, drop remaining
queued work when terminating.
2020-09-23 20:55:02 +03:00
Avi Kivity
a0ffcabd66 view: use nonwrapping_interval instead of nonwrapping_range to avoid clang deduction failure
We use class template argument deduction (CTAD) in a few places, but it appears
not to work for alias templates in clang. While it looks like a clang bug, using
the class name is an improvement, so let's do that.
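
A small illustration of the difference, using hypothetical `toy_interval` and
`toy_range` types in place of the real ones. Deduction through an alias
template is a C++20 feature, so compilers that do not implement it reject the
commented-out line.

```cpp
// Hypothetical stand-in for the interval class template.
template <typename T>
struct toy_interval {
    T start;
    T end;
    toy_interval(T s, T e) : start(s), end(e) {}
};

// Hypothetical alias template, analogous to the old range name.
template <typename T>
using toy_range = toy_interval<T>;

inline int width() {
    // CTAD on the class template itself: deduces toy_interval<int>.
    toy_interval iv(1, 5);
    // CTAD through the alias needs C++20 alias-template deduction:
    // toy_range r(1, 5);
    return iv.end - iv.start;
}
```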
2020-09-21 16:32:53 +03:00
Pavel Solodovnikov
6e10f2b530 schema_registry: make grace period configurable
Introduce new database config option `schema_registry_grace_period`
describing the amount of time in seconds after which unused schema
versions will be cleaned up from the schema registry cache.

Default value is 1 second, the same value as was hardcoded before.

Tests: unit(debug)
Refs: #7225

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200915131957.446455-1-pa.solodovnikov@scylladb.com>
2020-09-15 17:53:27 +02:00