Commit Graph

3193 Commits

Author SHA1 Message Date
Botond Dénes
3a51053e66 Merge 'De-static system_keyspace::*_group0_* methods' from Pavel Emelyanov
These are users of global `qctx` variable or call `(get|set)_scylla_local_param(_as)?` which, in turn, also reference the `qctx`. Unfortunately, the latter(s) are still in use by other code and cannot be marked non-static in this PR
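
The "de-static" pattern these commits apply can be sketched roughly as below. This is a simplified illustration, not ScyllaDB code: `query_processor_stub`, `global_qctx`, and both `system_keyspace_*` classes are hypothetical stand-ins invented for the example.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the real query processor.
struct query_processor_stub {
    std::map<std::string, std::string> rows;
};

// Before: a static method has no object state, so it has to reach for a
// global (the role `qctx` plays in the commit message).
inline query_processor_stub* global_qctx = nullptr;

struct system_keyspace_before {
    static std::optional<std::string> get_raft_group0_id() {
        auto it = global_qctx->rows.find("raft_group0_id");
        if (it == global_qctx->rows.end()) return std::nullopt;
        return it->second;
    }
};

// After: a non-static method uses the reference injected at construction,
// so callers must hold a system keyspace object but no global is needed.
class system_keyspace_after {
    query_processor_stub& _qp;
public:
    explicit system_keyspace_after(query_processor_stub& qp) : _qp(qp) {}

    std::optional<std::string> get_raft_group0_id() const {
        auto it = _qp.rows.find("raft_group0_id");
        if (it == _qp.rows.end()) return std::nullopt;
        return it->second;
    }
    void set_raft_group0_id(std::string id) {
        _qp.rows["raft_group0_id"] = std::move(id);
    }
};
```

The cost of the change is that every caller needs access to the object, which is why the series threads a system keyspace argument through `raft_group0::join_group0()`.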

Closes #14869

* github.com:scylladb/scylladb:
  system_keyspace: De-static set_raft_group0_id()
  system_keyspace: De-static get_raft_group0_id()
  system_keyspace: De-static get_last_group0_state_id()
  system_keyspace: De-static group0_history_contains()
  raft: Add system_keyspace argument to raft_group0::join_group0()
2023-07-28 14:53:22 +03:00
Pavel Emelyanov
d311784721 system_keyspace: De-static set_raft_group0_id()
The caller is group0 code with sys_ks local variable

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:13:59 +03:00
Pavel Emelyanov
7837bc7d5a system_keyspace: De-static get_raft_group0_id()
The callers are in group0 code that have sys_ks local variable/argument

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:13:11 +03:00
Pavel Emelyanov
26dd7985a8 system_keyspace: De-static get_last_group0_state_id()
The caller is raft_group0_client with sys.ks. dependency reference and
group0_state_machine with raft_group0_client exposing its sys.ks.

This makes it possible to instantly drop one more qctx reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:12:04 +03:00
Pavel Emelyanov
3de0efd32c system_keyspace: De-static group0_history_contains()
The caller is raft_group0_client with sys.ks. dependency reference.
This allows dropping one qctx reference right away

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 13:11:08 +03:00
Avi Kivity
cf81eef370 Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change introduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
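
The effect of ignoring empty partitions can be sketched as follows. This is only an illustration under stated assumptions: the `partition` struct, `feed()`, and `table_digest()` are hypothetical, and FNV-1a stands in for whatever hash the real digest code uses.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of a schema-table partition after compaction.
struct partition {
    std::string key;
    std::vector<std::string> live_rows;  // empty once all tombstones are dropped
};

// FNV-1a, used purely as a stand-in hash for this sketch.
inline void feed(std::uint64_t& h, const std::string& s) {
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
}

// With `insensitive_to_expiry` set, a partition that became empty is skipped,
// so the digest no longer depends on whether it was already compacted away.
inline std::uint64_t table_digest(const std::vector<partition>& parts,
                                  bool insensitive_to_expiry) {
    std::uint64_t h = 14695981039346656037ULL;
    for (const auto& p : parts) {
        if (insensitive_to_expiry && p.live_rows.empty()) {
            continue;
        }
        feed(h, p.key);
        for (const auto& r : p.live_rows) {
            feed(h, r);
        }
    }
    return h;
}
```

With the flag off, a node that still holds an empty `computed_columns` partition and a node where it has been compacted away produce different digests; with the flag on, both produce the same value.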

A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.

Manually tested using ccm on cluster upgrade scenarios and node restarts.

Closes #14441

* github.com:scylladb/scylladb:
  test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
  schema_mutations, migration_manager: Ignore empty partitions in per-table digest
  migration_manager, schema_tables: Implement migration_manager::reload_schema()
  schema_tables: Avoid crashing when table selector has only one kind of tables
2023-07-28 00:01:33 +03:00
Pavel Emelyanov
e9218e6873 system_keyspace: Don't update schema version in .setup()
The db.get_version() called that early returns the value that the database
got at construction time, i.e. the empty_version. It makes little sense to
commit it into the system k.s., all the more so as the "real" version is
calculated and updated a few steps after .setup().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14833
2023-07-27 09:38:57 +03:00
Pavel Emelyanov
c017117340 system_keyspace: Remove qctx usage from load_topology_state()
Fortunately, this is pretty simple -- the only caller is storage_service
that has sharded<system_keyspace> dependency reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14824
2023-07-27 08:56:40 +03:00
Avi Kivity
615544a09a Merge 'Init messaging service preferred IP cache via config' from Pavel Emelyanov
This is to make m.s. initialization more solid and simplify sys.ks.::setup()

Closes #14832

* github.com:scylladb/scylladb:
  system_keyspace: Remove unused snitch arg from setup()
  messaging_service: Setup preferred IPs from config
2023-07-26 22:12:28 +03:00
Nadav Har'El
056d04954c Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size, containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
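
A minimal sketch of the accounting fix, assuming a much simplified consumer: `mutation_stub`, `buffering_consumer`, and the flush-at-end-of-partition policy are hypothetical and only mirror the idea of adding the new mutation object's size to `_buffer_size`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real `mutation` class.
struct mutation_stub {
    std::string key;
    std::vector<std::string> rows;
    // Rough footprint of an otherwise empty mutation object.
    std::size_t memory_usage() const { return sizeof(*this) + key.size(); }
};

class buffering_consumer {
    std::vector<mutation_stub> _buffer;
    std::size_t _buffer_size = 0;
    std::size_t _limit;
    std::size_t _flushes = 0;
public:
    explicit buffering_consumer(std::size_t limit) : _limit(limit) {}

    void consume_new_partition(std::string key) {
        _buffer.push_back(mutation_stub{std::move(key), {}});
        // The fix: account for the freshly created mutation object itself.
        _buffer_size += _buffer.back().memory_usage();
    }
    void consume_row(std::string row) {
        _buffer_size += row.size();
        _buffer.back().rows.push_back(std::move(row));
    }
    void consume_end_of_partition() {
        if (_buffer_size >= _limit) {
            _buffer.clear();  // stands in for handing the batch to view building
            _buffer_size = 0;
            ++_flushes;
        }
    }
    std::size_t flushes() const { return _flushes; }
    std::size_t buffered_partitions() const { return _buffer.size(); }
};
```

Without the line in `consume_new_partition()`, a stream of empty partitions would never move `_buffer_size` and the buffer would grow without bound.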

Fixes: #14819

Closes #14821

* github.com:scylladb/scylladb:
  test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
  db/view/view_updating_consumer: account for the size of mutations
  mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
  mutation/mutation: add memory_usage()
2023-07-26 20:04:28 +03:00
Pavel Emelyanov
6b82071064 system_keyspace: Remove unused snitch arg from setup()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-26 16:05:26 +03:00
Pavel Emelyanov
0fba57a3e8 messaging_service: Setup preferred IPs from config
Population of messaging service preferred IPs cache happens inside
system keyspace setup() call and it needs m.s. per se and additionally
snitch. Moving preferred ip cache to initial configuration keeps m.s.
start more self-contained and keeps system_keyspace::setup() simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-26 16:03:23 +03:00
Botond Dénes
d66b07823b db/view/view_updating_consumer: account for the size of mutations
All partitions will have a corresponding mutation object in the buffer.
These objects have non-negligible sizes, yet the consumer did not bump
the _buffer_size when a new partition was consumed. This resulted in
empty partitions not moving the _buffer_size at all, and thus they could
accumulate without bounds in the buffer, never triggering a flush just
by themselves. We have recently seen this causing OOM.
This patch fixes that by bumping the _buffer_size with the size of the
freshly created mutation object.
2023-07-26 03:07:25 -04:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and using query_processor reference from system_keyspace object instead of global qctx

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Kamil Braun
e6099c4685 Merge 'config: set schema_commitlog_segment_size_in_mb to 128 ' from Patryk Jędrzejczak
Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`.

Additionally, we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable
2023-07-24 10:23:34 +02:00
Pavel Emelyanov
db1c6e2255 system_keyspace: Make save_truncation_record() non-static
... and stop using qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:12:50 +03:00
Pavel Emelyanov
eaeffcdb81 code: Pass sharded<db::system_keyspace>& to database::truncate()
The argument goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from

- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:11:59 +03:00
Pavel Emelyanov
1ef34a5ada db: Add sharded<system_keyspace>& to legacy_schema_migrator
One of the class' methods calls db::drop_table_on_all_shards() that will
need sys.ks. in the next patch.

The reference in question is provided from the only caller -- main.cc

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 12:38:46 +03:00
Botond Dénes
53da97416a Merge 'Remove qctx from system.paxos table access methods' from Pavel Emelyanov
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.

Closes #14758

* github.com:scylladb/scylladb:
  db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
  db/system_keyspace: Make paxos methods non-static
  service/paxos: Add db::system_keyspace& argument to some methods
  test: Optionally initialize proxy remote for cql_test_env
  proxy/remote: Keep sharded<db::system_keyspace>& dependency
2023-07-20 16:53:25 +03:00
Pavel Emelyanov
8a87c87824 db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
This template call is only used by system keyspace paxos methods. All
those methods are no longer static and can use system_keyspace::_qp
reference to real query processor instead of global qctx. The
execute_cql_with_timeout() wrapper is moved to system_keyspace to make
it work

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Pavel Emelyanov
b9ef16c06f db/system_keyspace: Make paxos methods non-static
The service::paxos_state methods that call those already have system
keyspace reference at hand and can call method on an object

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-19 19:32:10 +03:00
Patryk Jędrzejczak
b3be9617dc config: set schema_commitlog_segment_size_in_mb to 128
We increase the default schema commitlog segment size so that the
large mutations do not fail. We have agreed that 128 MB is sufficient.
2023-07-19 14:16:49 +02:00
Patryk Jędrzejczak
5b167a4ad7 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
2023-07-19 14:16:41 +02:00
Kefu Chai
8f390997cb db: do not use std::cmp_not_equal() when appropriate
this change is a follow-up of 3129ae3c8c.
since in both cases in this change, the `num_ranges` should always
be greater than zero, there is no need to use `int` for its type,
and "num_ranges" returned by the CQL query should always be greater
or equal to zero, so there is no need to check if it is positive.

in this change, we

* change the type of `num_ranges` to `size_t`
* change std::cmp_not_equal() to !=

to avoid using the verbose `std::cmp_not_equal()` helper, for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14754
2023-07-19 13:25:21 +03:00
Asias He
c29e7e4644 Revert "Revert "view_update_generator: Increase the registration_queue_size""
This reverts commit 4cee8206f8.

The test is fixed.

Closes #14750
2023-07-19 11:46:28 +03:00
Kamil Braun
eb6202ef9c Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak
Fixes #9405

`sync_point` API provided with an incorrect sync point id might allocate
a crazy amount of memory and fail with `std::bad_alloc`.

To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing it with
a checksum calculated in `db::hints::decode` before decoding.
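
The append-and-verify idea can be sketched as below. This is not the actual `db::hints` encoding: the function names, the 4-byte little-endian layout, and the FNV-1a checksum are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// FNV-1a (32-bit), used here only as a stand-in checksum.
inline std::uint32_t checksum(const std::string& data) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

constexpr std::size_t checksum_size = 4;

// Append the checksum of the payload as 4 little-endian bytes.
inline std::string encode_with_checksum(const std::string& payload) {
    std::string out = payload;
    const std::uint32_t sum = checksum(payload);
    for (std::size_t i = 0; i < checksum_size; ++i) {
        out.push_back(static_cast<char>((sum >> (8 * i)) & 0xff));
    }
    return out;
}

// Verify the trailing checksum before touching the payload; return nullopt
// for input that is too short or has been modified.
inline std::optional<std::string> decode_with_checksum(const std::string& encoded) {
    if (encoded.size() < checksum_size) {
        return std::nullopt;
    }
    std::string payload = encoded.substr(0, encoded.size() - checksum_size);
    std::uint32_t stored = 0;
    for (std::size_t i = 0; i < checksum_size; ++i) {
        const auto byte = static_cast<unsigned char>(encoded[payload.size() + i]);
        stored |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    if (stored != checksum(payload)) {
        return std::nullopt;
    }
    return payload;
}
```

Rejecting the input before decoding means a corrupted length field inside the payload is never interpreted, which is what avoids the huge allocation.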

Closes #14534

* github.com:scylladb/scylladb:
  db: hints: add checksum to sync point encoding
  db: hints: add the version_size constant
2023-07-18 13:05:10 +02:00
Botond Dénes
21ff6efd74 test/boost/view_build_test: improve test_view_update_generator_register_semaphore_unit_leak
By making it independent of the number of units the view update
generator's registration semaphore is created with. We want to increase
this number significantly and that would destabilize this test
significantly. To prevent this, detach the test from the number of units
completely, while still preserving the original intent behind it, as best
as it could be determined.

Closes #14727
2023-07-18 09:18:28 +03:00
Kefu Chai
fa3129fa29 treewide: use unsigned variable to compare with unsigned
sometimes we initialize a loop variable like

auto i = 0;

or

int i = 0;

but since the type of `0` is `int`, what we get is a variable of
`int` type, and later we compare it with an unsigned number. if we
compile the source code with the `-Werror=sign-compare` option, the
compiler would warn at seeing this. in general, this is a false
alarm, as we are not likely to have a wrong comparison result
here. but in order to prevent issues due to the integer promotion
for comparison in other places, and to prepare for enabling
`-Werror=sign-compare`, let's use unsigned to silence this warning.
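
A small illustration of the kind of loop this change targets; `sum_all()` is a hypothetical helper, not code from the tree.

```cpp
#include <cstddef>
#include <vector>

inline int sum_all(const std::vector<int>& v) {
    int total = 0;
    // `auto i = 0` would deduce `int`, making `i < v.size()` a signed/unsigned
    // comparison; an unsigned index type avoids the warning.
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}
```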

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Kefu Chai
3129ae3c8c treewide: compare signed and unsigned using std::cmp_*()
when comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned type,
so they can be compared. most of the time this is harmless, but for
negative values the comparison yields the wrong result after the
promotion. this can be demonstrated using a short sample like:

```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    // x is converted to unsigned before the comparison, so this prints "false"
    fmt::print("{}\n", x < y);
    return 0;
}
```

this error can be identified by `-Werror=sign-compare`, but before
enabling this compiler option, let's use `std::cmp_*()` to compare
them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Patryk Jędrzejczak
02618831ef db: hints: add checksum to sync point encoding
sync point API provided with an incorrect sync point id might allocate
a crazy amount of memory and fail with std::bad_alloc.

To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing
it with a checksum calculated in db::hints::decode before decoding.
2023-07-17 16:05:07 +02:00
Patryk Jędrzejczak
0a424e1760 db: hints: add the version_size constant
The next commit changes the format of encoding sync points to V2. The
new format appends the checksum to the encoded sync points and its
implementation uses the checksum_size constant - the number of bytes
required to store the checksum. To increase consistency and readability,
we can additionally add and use the version_size constant.

Definitions of sync_point::decode and sync_point::encode are slightly
changed so that they don't depend on the version_size value and make
implementation of the V2 format easier.
2023-07-17 16:02:18 +02:00
Kefu Chai
3ed982df87 query_context: do not include unused header
in this header, none of the exceptions defined by
`exceptions/exceptions.hh` is used. so let's drop the `#include`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14718
2023-07-17 12:00:49 +03:00
Raphael S. Carvalho
d6029a195e Remove DateTieredCompactionStrategy
This is the last step of deprecation dance of DTCS.

In Scylla 5.1, users were warned that DTCS was deprecated.

In 5.2, altering or creation of tables with DTCS was forbidden.

5.3 branch was already created, so this is targeting 5.4.

Users that refused to move away from DTCS will have Scylla
falling back to the default strategy, either STCS or ICS.

See:
WARN  2023-07-14 09:49:11,857 [shard 0] schema_tables - Falling back to size-tiered compaction strategy after the problem: Unable to find compaction strategy class 'DateTieredCompactionStrategy

Then user can later switch to a supported strategy with
alter table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14559
2023-07-14 16:20:48 +03:00
Asias He
dad5caf141 streaming: Add stream_plan_ranges_percentage
This option allows the user to change the number of ranges streamed in
a batch per stream plan.

Currently, each stream plan streams 10% of the total ranges.

With more ranges per stream plan, it reduces the waiting time between
two stream plans. For example,

stream_plan1: shard0 (t0), shard1 (t1)
stream_plan2: shard0 (t2), shard1 (t3)

We start stream_plan2 after all shards finish streaming in stream_plan1.
If shard0 and shard1 in stream_plan1 finish at different times, one of
the shards will be idle.

If we stream more ranges in a single stream plan, the waiting time will
be reduced.

Previously, we retried the stream plan if one of the stream plans
failed. That's one of the reasons we wanted more stream plans. With RBNO
and 1f8b529e08 (range_streamer: Disable restream logic), the
restream factor is not important anymore.

Also, more ranges in a single stream plan will create bigger but fewer
sstables on the receiver side.

The default value is the same as before: 10% of the total ranges.
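
How a percentage could translate into per-plan batches is sketched below. The helper `plan_batches()` and its rounding (round up, at least one range per plan) are assumptions for illustration; the commit message does not say how the real code rounds.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: split `total_ranges` into per-stream-plan batch sizes,
// where each plan streams `percentage` percent of the total.
inline std::vector<std::size_t> plan_batches(std::size_t total_ranges,
                                             unsigned percentage) {
    std::vector<std::size_t> batches;
    if (total_ranges == 0 || percentage == 0) {
        return batches;
    }
    percentage = std::min(percentage, 100u);
    // Round up, and never go below one range per plan.
    const std::size_t per_plan =
        std::max<std::size_t>(1, (total_ranges * percentage + 99) / 100);
    for (std::size_t done = 0; done < total_ranges; done += per_plan) {
        batches.push_back(std::min(per_plan, total_ranges - done));
    }
    return batches;
}
```

Under these assumptions, 1000 ranges at 10% give ten plans of 100 ranges each, while 50% gives two plans and therefore only one synchronization point between plans.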

Fixes #14191

Closes #14402
2023-07-14 09:03:01 +03:00
Botond Dénes
4cee8206f8 Revert "view_update_generator: Increase the registration_queue_size"
This reverts commit d3034e0fab.

The test modified by this commit
(view_build_test.test_view_update_generator_register_semaphore_unit_leak)
often fails, breaking build jobs.
2023-07-13 16:48:50 +03:00
Asias He
d3034e0fab view_update_generator: Increase the registration_queue_size
When repair writes a sstable to disk, we check if the sstable needs view
update processing. If yes, the sstable will be placed into the staging
dir for processing, with the _registration_sem semaphore to prevent too
many pending unprocessed sstables.

We have seen multiple cases in the field where view update processing is
inefficient and way too slow, which prevents the base table repair from
finishing on time.

This patch increases the registration_queue_size to a bigger number to
mitigate the problem that slow view update processing blocks repair.

It is better to have a consistent base table + inconsistent view table
than inconsistent base table + inconsistent view table.

Currently, sstables in staging dir are not compacted. So we could not
increase the _registration_sem to too big a number, to avoid accumulating
too many sstables.

The view_build_test.cc is updated to make the test pass.

Closes #14241
2023-07-12 15:51:35 +03:00
Botond Dénes
296837120d db: move virtual tables into virtual_tables.cc
The definitions of virtual tables make up approximately a quarter of the
huge system_keyspace.cc file (almost 4K lines), pulling in a lot of
headers only used by them.
Move them to a separate source file to make system_keyspace.cc easier
for humans and compilers to digest.
This patch also moves the `register_virtual_tables()`,
`install_virtual_readers()` as well as the `virtual_tables` global.

Closes #14308
2023-07-12 15:26:54 +03:00
Avi Kivity
0cabf4eeb9 build: disable implicit fallthrough
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.

Existing intended cases were annotated.
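
A minimal example of an annotated intentional fallthrough. The enum and function are hypothetical, and the warning is assumed to be something like GCC/Clang's `-Wimplicit-fallthrough`; the commit message does not name the exact flag.

```cpp
#include <string>

enum class level { debug, info, error };

inline std::string describe(level l) {
    std::string out;
    switch (l) {
    case level::debug:
        out += "verbose ";
        [[fallthrough]];  // intended: debug also gets the info text
    case level::info:
        out += "info";
        break;
    case level::error:
        out += "error";
        break;
    }
    return out;
}
```

Removing the `[[fallthrough]];` line would make the `debug` case fall through silently, which is the situation the build change turns into a diagnostic.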

Closes #14607
2023-07-10 19:36:06 +02:00
Gleb Natapov
4f23eec44f Rename experimental raft feature to consistent-topology-changes
Make the name more descriptive

Fixes #14145

Message-Id: <ZKQ2wR3qiVqJpZOW@scylladb.com>
2023-07-07 11:08:10 +02:00
Nadav Har'El
d6aba8232b alternator: configurable override for DescribeEndpoints
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific endpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.

This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:

1. Empty string (the default) leaves the current behavior -
   DescribeEndpoints echos the request's "Host" header.

2. The string "disabled" disables the DescribeEndpoints (it will return
   an UnknownOperationException). This is how DynamoDB Local behaves,
   and the AWS C++ SDK and the Dynamodb Shell work well in this mode.

3. Any other string is a fixed string to be returned by DescribeEndpoints.
   It can be useful in setups that should return a known address.

Note that this patch does not, by default, change the current behavior
of DescribeEndpoints. But it allows us in the future to override its
behavior if a user experiences problems in the field - without code changes.
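
The three modes listed above can be sketched as a small decision function. `describe_endpoints_address()` is a hypothetical helper, not Alternator's implementation; `std::nullopt` stands in for "respond with UnknownOperationException".

```cpp
#include <optional>
#include <string>

// Returns the address DescribeEndpoints should report, or nullopt when the
// operation is disabled.
inline std::optional<std::string> describe_endpoints_address(
        const std::string& override_value,   // value of alternator_describe_endpoints
        const std::string& host_header) {    // the request's "Host" header
    if (override_value.empty()) {
        return host_header;     // 1. default: echo the Host header
    }
    if (override_value == "disabled") {
        return std::nullopt;    // 2. operation disabled
    }
    return override_value;      // 3. fixed, configured address
}
```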

Fixes #14410.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14432
2023-07-07 11:08:10 +02:00
Tomasz Grabiec
c25201c1a3 Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Fixes https://github.com/scylladb/scylladb/issues/14503

Closes #14502

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-05 21:21:43 +02:00
Michał Chojnowski
ac29b6f198 view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but this patch makes it a variable for
unit testing purposes.
2023-07-05 17:33:47 +02:00
Michał Chojnowski
5ad0846bff view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in an improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-04 20:33:21 +02:00
Tomasz Grabiec
f2ed9fcd7e schema_mutations, migration_manager: Ignore empty partitions in per-table digest
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d, it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change introduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.
2023-07-03 23:06:55 +02:00
Tomasz Grabiec
0c86abab4d migration_manager, schema_tables: Implement migration_manager::reload_schema()
Will recreate schema_ptr's from schema tables like during table
alter. Will be needed when digest calculation changes in reaction to
cluster feature at run time.
2023-07-03 20:32:59 +02:00
Tomasz Grabiec
9bfe9f0b2f schema_tables: Avoid crashing when table selector has only one kind of tables
Currently not reachable, because selectors are always constructed with
both kinds initialized. Will be triggered by the next patch.
2023-07-03 20:32:59 +02:00
Pavel Emelyanov
0d4c981423 database: Remove unused proxy arg from update_keyspace_on_all_shards()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-03 14:19:54 +03:00
Tomasz Grabiec
a9282103ba Merge 'Call storage_service notifications only after keyspace schema changes are applied on all shards' from Benny Halevy
This series aims at hardening schema merges and preventing inconsistencies across shards by
updating the database shards before calling the notification callback.

As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux.

In addition, any error while updating the keyspace will cause an abort, so as not to leave the database shards in an inconsistent state.

Other changes optimize this path by:
- updating shard 0 first, to seed the effective_replication_map.
- executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change.
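
The ordering described in this series can be sketched as below. It is a sequential simplification under stated assumptions: `shard_stub` and this free function are hypothetical, and the real code works on sharded services rather than a plain vector.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-shard database state.
struct shard_stub {
    std::vector<std::string> keyspaces;
};

inline void modify_keyspace_on_all_shards(
        std::vector<shard_stub>& shards,
        const std::function<void(shard_stub&)>& func,
        const std::function<void()>& notify) {
    if (shards.empty()) {
        return;
    }
    func(shards[0]);  // shard 0 goes first
    for (std::size_t i = 1; i < shards.size(); ++i) {
        func(shards[i]);
    }
    // Listeners run only once every shard has the change applied. An
    // exception thrown by func propagates before this point is reached.
    notify();
}
```

Because `notify` is called after the loop, a listener never observes a state in which only some shards know about the keyspace change.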

Fixes #13137

Closes #14158

* github.com:scylladb/scylladb:
  migration_manager: propagate listener notification exceptions
  storage_service: keyspace_changed: execute only on shard 0
  database: modify_keyspace_on_all_shards: execute func first on shard 0
  database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
  database: add modify_keyspace_on_all_shards
  schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
  database: create_keyspace_on_all_shards
  database: update_keyspace_on_all_shards
  database: drop_keyspace_on_all_shards
2023-06-29 12:17:53 +02:00
Avi Kivity
f86dd857ca Merge 'Certificate based authorization' from Calle Wilund
Fixes #10099

Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex.

Example:

scylla.yaml:

```
    authenticator: com.scylladb.auth.CertificateAuthenticator
    auth_superuser_name: <name>
    auth_certificate_role_query: CN=([^,\s]+)

    client_encryption_options:
      enabled: True
      certificate: <server cert>
      keyfile: <server key>
      truststore: <shared trust>
      require_client_auth: True
```
In a client, then use a certificate signed with the <shared trust> store as the auth cert, with the common name <name>. E.g. for cqlsh, set "usercert" and "userkey" to these certificate files.

No user/password needs to be sent; the role is picked up from the auth certificate. If no certificate is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables), the authenticator mechanism will reject it.

Otherwise, connection becomes the role described.
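
As an illustration of how the role query from the example config picks a role out of a certificate subject, here is a minimal Python sketch. Only the regex comes from the config above; the subject strings and the `role_from_subject` helper are hypothetical, and the real authenticator may format or match the subject differently.

```python
import re
from typing import Optional

# The pattern from auth_certificate_role_query in the example scylla.yaml.
ROLE_QUERY = re.compile(r"CN=([^,\s]+)")


def role_from_subject(subject: str) -> Optional[str]:
    """Return the first capture group of the role query, or None if no match."""
    match = ROLE_QUERY.search(subject)
    return match.group(1) if match else None


# Hypothetical subject strings, for illustration only.
print(role_from_subject("C=US,O=Example,CN=alice"))
print(role_from_subject("O=Example"))
```

The first call yields `alice`; the second yields `None`, which corresponds to the case where the authenticator rejects the connection.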

To facilitate this, the series also allows setting the superuser name and salted password via command line/config, and includes some tweaks to the SASL part of connection setup.

Closes #12214

* github.com:scylladb/scylladb:
  docs: Add documentation of certificate auth + auth_superuser_name
  auth: Add TLS certificate authenticator
  transport: Try to do early, transport based auth if possible
  auth: Allow for early (certificate/transport) authentication
  auth: Allow specifying initial superuser name + passwd (salted) in config
  roles-metadata: Coroutinuze some helpers
2023-06-27 12:52:14 +03:00
Botond Dénes
f5e3b8df6d Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
```
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

```
That shows a significant amount of work spent inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
each of them has to be replicated to almost every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
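
A toy model makes the quadratic growth concrete. The Python sketch below is not boost::icl; it is a naive interval map that splits the key space at every sstable boundary and counts how many (interval, sstable) entries result. The integer ranges and helper names are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [first, last) range, hypothetical tokens


def interval_map_entries(ranges: List[Range]) -> int:
    """Count (interval, sstable) pairs in a naive interval map."""
    # Every range boundary splits the key space into elementary intervals.
    points = sorted({p for lo, hi in ranges for p in (lo, hi)})
    total = 0
    for lo, hi in zip(points, points[1:]):
        # Each sstable covering this elementary interval is stored in it.
        total += sum(1 for r_lo, r_hi in ranges if r_lo <= lo and hi <= r_hi)
    return total


def nested_ranges(n: int) -> List[Range]:
    """L0-like sstables: wide, heavily overlapping ranges."""
    return [(i, 2 * n - i) for i in range(n)]


def disjoint_ranges(n: int) -> List[Range]:
    """Leveled-like sstables: non-overlapping ranges."""
    return [(i, i + 1) for i in range(n)]


print(interval_map_entries(nested_ranges(100)))    # 10000
print(interval_map_entries(disjoint_ranges(100)))  # 100
```

With 100 overlapping ranges the map holds 10000 entries, against 100 for the same number of disjoint ranges.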

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach, which
may incorrectly consider a staging sstable as non-staging if
the staging sstable wasn't included in the current batch for
view building.
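
The idea of passing a predicate instead of building a fresh set can be sketched as below. This is a simplified Python model under stated assumptions: the `SSTable` and `SSTableSet` classes, the `make_reader` signature, and the `excluding_staging` predicate are hypothetical and do not mirror the actual C++ interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass(frozen=True)
class SSTable:
    name: str
    staging: bool
    keys: Tuple[str, ...]


class SSTableSet:
    """Built once; readers filter it lazily instead of copying it."""

    def __init__(self, sstables: List[SSTable]):
        self._sstables = list(sstables)

    def make_reader(
        self,
        key: str,
        predicate: Callable[[SSTable], bool] = lambda sst: True,
    ) -> Iterator[str]:
        # Skip sstables rejected by the predicate; the set itself is untouched.
        for sst in self._sstables:
            if predicate(sst) and key in sst.keys:
                yield sst.name


def excluding_staging(sst: SSTable) -> bool:
    return not sst.staging


sstables = SSTableSet([
    SSTable("base-1", staging=False, keys=("k1", "k2")),
    SSTable("staging-1", staging=True, keys=("k1",)),
])
print(list(sstables.make_reader("k1", excluding_staging)))  # ['base-1']
```

Because the predicate looks at each sstable's own state, a staging sstable is excluded whether or not it belongs to the current batch.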

With this improvement, view building was measured to be 3x faster.

from
`INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`

to
`INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`
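
The 3x figure follows from the two durations logged above. A small Python check, where the `duration_ms` helper is a hypothetical convenience for pulling the millisecond count out of a log line:

```python
import re

LOG_BEFORE = (
    "INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - "
    "Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s"
)
LOG_AFTER = (
    "INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - "
    "Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s"
)


def duration_ms(line: str) -> int:
    """Extract the processing time in milliseconds from a log line."""
    return int(re.search(r"in (\d+)ms", line).group(1))


speedup = duration_ms(LOG_BEFORE) / duration_ms(LOG_AFTER)
print(f"{speedup:.2f}x")  # 3.01x
```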

Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.

Closes #14364

* github.com:scylladb/scylladb:
  table: Optimize creation of reader excluding staging for view building
  view_update_generator: Dump throughput and duration for view update from staging
  utils: Extract pretty printers into a header
2023-06-27 07:25:30 +03:00