Commit Graph

2111 Commits

Author SHA1 Message Date
Tomasz Grabiec
4b51e0bf30 row_cache: Move cache_tracker to a separate header
It will be needed by the sstable layer to get the to the LRU and the
LSA region. Split to avoid inclusion of whole row_cache.hh
2021-07-02 10:25:58 +02:00
Calle Wilund
a40b6a2f54 commitlog: Use disk file alignment info (with lower value if possible)
Previously, the disk block alignment of segments was hardcoded (due to
really old code). Now we use the value as declared in the actual file
opened. If we are using a previously written file (i.e. o_dsync), we
can even use the sometimes smaller "read" alignment.

Also allow config to completely override this with a disk alignment
config option (not exposed to global config yet, but can be).

v2:
* Use overwrite alignment if doing only overwrite
* Ensure to adjust actual alignment if/when doing file wrapping

v3:
* Kill alignment config param. Useless and unsafe.

Closes #8935
2021-06-29 16:00:49 +03:00
Piotr Jastrzebski
1bdcef6890 features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled
These features have been around for over 2 years and every reasonable
deployment should have them enabled.

The only case when those features could be not enabled is when the user
has used enable_sstables_mc_format config flag to disable MC sstable
format. This case has been eliminated by removing
enable_sstables_mc_format config flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Nadav Har'El
4d7f55a29f cql: add configurable restriction of DateTieredCompactionStrategy
DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time
(users should use TimeWindowCompactionStrategy, TWCS, instead). This
patch adds a new configuration option - restrict_dtcs - which can be used
to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE
statements. This is part of a "safe mode" effort to allow an installation
to restrict operations which are un-recommended or dangerous.

The new restrict_dtcs option has three values: "true", "false", and "warn":

For the time being, "false" is still the default, and means DTCS is not
restricted  and can still be used freely. We can easily change this
default in a followup patch.

Setting a value of "true" means that DTCS *is* restricted -
trying to create a a table or alter a table with it will fail with an error.

Setting a value of "warn" will allow the create or alter operation, but
will warn the user - both with a warning message which will immediately
appear in cqlsh (for example), and with a log message.

Fixes #8914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210624122411.435361-1-nyh@scylladb.com>
2021-06-24 20:59:27 +03:00
Kamil Braun
a3f3563828 storage_service: check for existing normal token owners before bootstrapping
The bootstrap procedure starts by "waiting for range setup", which means
waiting for a time interval specified by the `ring_delay` parameter (30s
by default) so the node can receive the tokens of other nodes before
introducing its own tokens.

However it may sometimes happen that the node doesn't receive the
tokens. There are no explicit checks for this. But the code may crash in
weird ways if the tokens-received assuption is false, and we are lucky
if it does crash (instead of, for example, allowing the node to
incorrectly bootstrap, causing data loss in the process).

Introduce an explicit check-and-throw-if-false: a bootstrapping node now
checks that there's at least one NORMAL token in the token ring, which
means that it had to have contacted at least one existing node
in the cluster, which means that it received the gossip application
states of all nodes from that node; in particular the tokens of all
nodes.

Also add an assert in CDC code which relies on that assumption
(and would cause weird division-by-zero errors if the assumption
was false; better to crash on assert than this).

Ref #8889.

Closes #8896
2021-06-24 13:19:08 +03:00
Michael Livshin
9b9efb2b42 disable caching of the system.large_* tables
The cache of system.large_{partition,rows,cells} accumulates range
tombstones (https://github.com/scylladb/scylla/issues/7750), and those
range tombstones can be evicted only together with their partition
(https://github.com/scylladb/scylla/issues/3288).

Making the system.large_* tables uncached should work around the
problem until #3288 is fixed.

Fixes #8874
Refs #7750
Refs #3288

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
2021-06-24 12:26:45 +03:00
Avi Kivity
14252c8b71 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695)​ (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.

Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).

New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-06-24 12:03:26 +03:00
Piotr Dulikowski
de1679b1b9 hints: make hints concurrency configurable and reduce the default
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
in runtime. Additionally, the default concurrency per shard is made
lower and is now 8.

The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.

Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)

Refs: #8624

Closes #8646
2021-06-22 15:58:56 +02:00
Calle Wilund
d6113912cd commitlog: Add waitable future for background sync/flush
Commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk, even if actually
clean.

This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for this
to complete in shutdown, before pruning and deleting segments
2021-06-21 06:01:19 +00:00
Pavel Emelyanov
96131349e8 schema_tables: Remove unused sharded<proxy> argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
64bb16af8a view_update_generator: Remove unused struct sstable_with_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Avi Kivity
b099e7c254 Merge "Untie hints managers and storage service" from Pavel E
"
The storage service is carried along storage proxy, hints
resource manager and hints managers (two of them) just to
subscribe the hints managers on lifecycle events (and stop
the subscription on shutdown) emitted from storage service.

This dependency chain can be greatly simplified, since the
storage proxy is already subscribed on lifecycle events and
can kick managers directly from its hooks.

tests: unit(dev),
       dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"

* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
  hints: Drop storage service from managers
  hints: Do not subscribe managers on lifecycle events directly
2021-06-17 17:12:31 +03:00
Pavel Emelyanov
92a4278cd1 hints: Drop storage service from managers
The storage service pointer is only used so (un)subscribe
to (from) lifecycle events. Now the subscription is gone,
so can the storage service pointer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:09:36 +03:00
Pavel Emelyanov
acdc568ecf hints: Do not subscribe managers on lifecycle events directly
Managers sit on storage proxy which is already subscribed on
lifecycle events, so it can "notify" hints managers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:06:26 +03:00
Nadav Har'El
45c2442f49 Merge 'Avoid large allocs in mv update code' from Piotr Sarna
This series addresses #8852 by:
 * migrating to chunked_vector in view update generation code to avoid large allocations
 * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent

Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.

Tests: unit(release)
Refs  #8852

Closes #8856

* github.com:scylladb/scylla:
  db,view: limit the number of simultaneous view update futures
  db,view: use chunked_vector for view updates
2021-06-17 14:01:38 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2
=105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and ask it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operation,
which they surely will. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and managent, and table rehashing.

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
Avi Kivity
98cdeaf0f2 schema_tables: make the_merge_lock thread_local
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0:s, and the_merge_lock will be accessed
from multiple threads. This won't work.

To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.

Closes #8858
2021-06-17 13:41:11 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / –skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with it rate 0.0 will get 5% of the work of a node with
   the perfect hit rate limited to 0.95. This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00
Piotr Sarna
1fb831c8c1 db,view: limit the number of simultaneous view update futures
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited amount of these futures
at a time.
2021-06-17 10:20:52 +02:00
Piotr Sarna
a7f7716ecf db,view: use chunked_vector for view updates
The number of view updates can grow large, especially in corner
cases like removing large base partitions. Chunked vector
prevents large allocations.
2021-06-17 10:15:17 +02:00
Tomasz Grabiec
6bdf8c4c46 Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja
Miscellaneous preparatory patches for group 0 discovery.

* scylla-dev/raft-group-0-part-2-v4:
  raft: (service) servers map is gid -> server, not sid -> server
  system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
  raft: (server) wait for configuration transition to complete
  raft: (server) implement raft::server::get_configuration()
  raft: (service) don't throw from schema state machine
  raft: (service) permit some scylla.raft cells to be empty
  raft: (service) properly handle failure to add a server
  raft: implement is_transient_error()
2021-06-17 00:15:40 +02:00
Calle Wilund
14559b5a86 commitlog: abort queues on shutdown
In case we only have a single segment active when shutting down,
the replenisher can be blocked even though we manually flush-deleted.

Add a signal type and abort queues using this to wake up waiter and
force them to check shutdown status.
2021-06-16 15:35:56 +00:00
Calle Wilund
227b573cdf commitlog: break out "abort" calls into member functions 2021-06-16 15:35:56 +00:00
Calle Wilund
5cd9691f00 commitlog: Do explicit discard+delete in shutdown
When we are shutting down, before trying to close the gate,
we should issue a discard to ensure waking up the replenish task
2021-06-16 15:35:56 +00:00
Calle Wilund
03b8baaa8d commitlog: Recycle or not should not depend on shutdown state
If we are using recycling, we should always use recycle in
delete_segments, otherwise we can cause deadlock with replenish
task, since it will be waiting for segment, then shutdown is set,
and we are called, and can't fulfil the alloc -> deadlock
2021-06-16 15:35:56 +00:00
Calle Wilund
5ebf5835b0 commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
If a segments, when finishing a flush call, is deletable, we should issue
a manual call to discard function (which moves deleteable segments off
segment list) asap, since we otherwise are dependent on more calls
from flush handlers (memtable flush). And since we could have blocked
segment allocation, this can cause deadlocks, at least in tests.
2021-06-16 15:35:56 +00:00
Calle Wilund
cbddcf46aa commitlog: Flush all segments if we only have one.
Handle test cases with borked config so we don't deadlock
in cases where we only have one segment in a commitlog
2021-06-16 15:35:56 +00:00
Calle Wilund
a0f559a44c commitlog: Always force flush if segment allocation is waiting
Refs #8270

If segement allocation is blocked, we should bypass all thresholds
and issue a flush of as much as possible.
2021-06-16 15:35:56 +00:00
Calle Wilund
bcf4d07f0b commitlog: Include segment wasted (slack) size in footprint check
Refs #8270

Since segment allocation looks at actual disk footprint, not active,
the threshold check in timer task should include slack space so we
don't mistake sparse usage for space left.
2021-06-16 15:35:56 +00:00
Calle Wilund
1187f5c181 commitlog: Adjust (lower) usage threshold
Refs #8270

Try to ensure we issue a flush as soon as we are allocating in the
last allowable segment, instead of "half through". This will make
flushing a little more eager, but should reduce latencies created
by waiting for segment delete/recycle on heavy usage.
2021-06-16 15:35:56 +00:00
Konstantin Osipov
9c93d77e74 system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
Fix a bug in definitions of system.raft, system.raft_snapshots,
group_id is TIMEUUID, not long.
2021-06-16 16:52:43 +03:00
Piotr Sarna
f832a30388 db,view,table: futurize calculating affected ranges
In order to avoid stalls on large inputs, calculating
affected ranges is now able to yield.
2021-06-16 09:51:31 +02:00
Piotr Sarna
3592d9b36e db,view: use chunked vector for view affected ranges
There were large allocation reportsa from vectors used for
calculating affected ranges. In order to reduce the pressure
on the allocator, chunked vector is used for storing intermediate
results.
2021-06-15 10:30:27 +02:00
Piotr Sarna
8a049c9116 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-14 09:36:10 +02:00
Piotr Sarna
7cdbb7951a db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
2021-06-14 09:34:40 +02:00
Piotr Sarna
88d4a66e90 db,view: pass base token by value to mutate_MV
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-14 09:30:38 +02:00
Nadav Har'El
8a4ac6914a config: add configuration option restrict_replication_simplestrategy
This patch adds a configuration option to choose whether the
SimpleStrategy replication strategy is restricted. It is a
tri_mode_restriction, allowing to restrict this strategy (true), to allow
it (false), or to just warn when it is used (warn).

After this patch, the option exists but doesn't yet do anything.
It will be used in the following two patches to restrict the
CREATE KEYSPACE and ALTER KEYSPACE operations, respectively.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:16 +03:00
Nadav Har'El
a3d6f502ad config: add "tri_mode_restriction" type of configurable value
This patch adds a new type of configurable value for our command-line
and YAML parsers - a "tri_mode_restriction" - which can be set to three
values: "true", "false", or "warn".

We will use this value type for many (but not all) of the restriction
options that we plan to start adding in the following patches.
Restriction options will allow users to ask Scylla to restrict (true),
to not restrict (false) or to warn about (warn) certain dangerous or
undesirable operations.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:44:20 +03:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Nadav Har'El
48ff641f67 Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.

This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.

In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.

Fixes #8577

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8578

* github.com:scylladb/scylla:
  commitlog: segment_manager::shutdown: abort on errors
  commitlog: allocate_segment_ex: make_checked_file
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Benny Halevy
9cf858b5fc snapshot: support skip_flush option
skip_flush is disabled by default.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 17:20:21 +03:00
Piotr Sarna
389a0a52c9 treewide: revamp workload type for service levels
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
 - instead of using 'unspecified' for declaring that there's no workload
   type for a particular service level, NULL is used for that purpose;
   NULL is the standard way of representing lack of data
 - introducing a delete marker, which accompanies NULL and makes it
   possible to distinguish between wanting to forcibly reset a workload
   type to unspecified and not wanting to change the previous value
 - updating the tests accordingly

These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.

Tests: unit(release)

Closes #8763
2021-05-31 18:18:33 +03:00
Piotr Jastrzebski
76d7c761d1 schema: Stop using deprecated constructor
This is another boring patch.

One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.

This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.

Fixes #8507

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
2021-05-30 11:58:27 +03:00
Pavel Emelyanov
1ce0682821 view: Get database from stprage_proxy
The db::view code already uses proxy rather actively, so instead of
depending on the storage service to be at hands it's better to make
db::view require the proxy. For now -- via global instance.

There's one dependency on storage service left after this patch --
to get the tokens. This piece is to be fixed later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-28 18:09:32 +03:00
Avi Kivity
0acf5bfca6 build: enable -Wreturn-std-move
Clang warns when "return std::move(x)" is needed to elide a copy,
but the call to std::move() is missing. We disabled the warning during
the migration to clang. This patch re-enables the warning and fixes
the places it points out, usually by adding std::move() and in one
place by converting the returned variable from a reference to a local,
so normal copy elision can take place.

Closes #8739
2021-05-27 21:16:26 +03:00
Avi Kivity
d3e5b37059 Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund"
This reverts commit e9c940dbbc, reversing
changes made to 6144656b25. Since it was
merged commitlog_test consistently times out in debug mode.
2021-05-27 21:16:26 +03:00
Avi Kivity
5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00