Commit Graph

26348 Commits

Author SHA1 Message Date
Tomasz Grabiec
abe3d7d7d3 Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.

This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.

Test results (perf_simple_query --smp 1 --task-quota-ms 10):

```
before: median 184810.98 tps ( 91.1 allocs/op,  20.1 tasks/op,   54564 insns/op)
after:  median 192125.99 tps ( 87.1 allocs/op,  20.1 tasks/op,   53673 insns/op)
```

4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to CPU frequency changes).

The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.

I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.
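The allocation saving comes from inline storage. A minimal sketch of the idea (this is not ScyllaDB's actual utils::small_vector, just an illustration of the technique; only the inline capacity of 3 for replica sets is taken from the message above):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Sketch of the small-vector idea: the first N elements live in inline
// storage, so no heap allocation happens in the common case; beyond N,
// it falls back to a heap-backed std::vector like a normal vector.
template <typename T, std::size_t N>
class sketch_small_vector {
    std::array<T, N> _inline{};  // inline storage for the common case
    std::vector<T> _heap;        // fallback, used once size() > N
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size < N) {
            _inline[_size] = v;
        } else {
            if (_heap.empty()) {  // first spill: move inline elements over
                _heap.assign(_inline.begin(), _inline.end());
            }
            _heap.push_back(v);
        }
        ++_size;
    }
    const T& operator[](std::size_t i) const {
        return _heap.empty() ? _inline[i] : _heap[i];
    }
    std::size_t size() const { return _size; }
    bool spilled_to_heap() const { return !_heap.empty(); }
};

// The series uses such a type for replica sets (typically 2-3 addresses),
// along the hypothetical lines of:
//   using inet_address_vector_replica_set =
//       utils::small_vector<gms::inet_address, 3>;
```

With a replication factor of 2-3, every replica list then fits in the inline buffer and the per-operation allocation disappears.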

Closes #8592

* github.com:scylladb/scylla:
  storage_proxy, treewide: use utils::small_vector inet_address_vector:s
  storage_proxy, treewide: introduce names for vectors of inet_address
  utils: small_vector: add print operator for std::ostream
  hints: messages.hh: add missing #include
2021-05-06 18:00:54 +02:00
Tomasz Grabiec
6aec8cc447 Merge "raft: fixes and improvements for snapshot transfer" from Gleb
* scylla-dev/raft-snapshot-fixes-v4:
  raft: document that add entry may throw commit_status_unknown
  raft: test: add test of a leadership change during ongoing snapshot transfer
  raft: test: retry submitting an entry if it was dropped
  raft: test: wait for the log to be fully replicated on new leader only
  raft: drop waiters with outdated terms
  raft: make snapshot transfer abortable
  raft: accept snapshots transfer from multiple nodes simultaneously
  raft: do not send probes while transferring snapshot
  raft: handle messages sending errors
  raft: test: return error from rpc module if nodes are disconnected
  raft: fix a typo in a variable name
2021-05-06 17:44:22 +02:00
Avi Kivity
d6d6758857 Merge 'Switch to use NODE_OPS_CMD for decommission and bootstrap operation' from Asias He
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.

In this patch set, we continue the work and switch the decommission and
bootstrap operations to use NODE_OPS_CMD.

Fixes #8472
Fixes #8471

Closes #8481

* github.com:scylladb/scylla:
  repair: Switch to use NODE_OPS_CMD for bootstrap operation
  repair: Switch to use NODE_OPS_CMD for decommission operation
2021-05-06 17:28:19 +03:00
Avi Kivity
f2132150c4 Merge "Extract reader concurrency semaphore tests into separate file" from Botond
"
The current home of these tests -- mutation_reader_test -- is already
one of the largest test files we have. To reduce the size of the former
and to make finding these tests easier, they are moved to a separate
file.
"

* 'reader-concurrency-semaphore-test/v2' of https://github.com/denesb/scylla:
  test: move reader_concurrency_semaphore related tests into separate file
  test: mutation_reader_test: convert restricted reader tests to semaphore tests
2021-05-06 17:13:45 +03:00
Gleb Natapov
aa7ea333da raft: document that add entry may throw commit_status_unknown 2021-05-06 11:59:36 +03:00
Gleb Natapov
3a1bff26dd raft: test: add test of a leadership change during ongoing snapshot transfer 2021-05-06 11:34:31 +03:00
Gleb Natapov
612e0f08c4 raft: test: retry submitting an entry if it was dropped 2021-05-06 11:34:31 +03:00
Gleb Natapov
0b2c9c549a raft: test: wait for the log to be fully replicated on new leader only
When forcing a new leader, it should be enough to wait for the log to be
fully replicated to that particular leader.
2021-05-06 11:34:31 +03:00
Gleb Natapov
d2f58d8656 raft: drop waiters with outdated terms
Currently an entry is declared dropped only when an entry with a
different term is committed at the same index, but if no new entries are
submitted for a long time, an already dropped entry may go unnoticed for
just as long.

Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
commit a dummy entry, which will eventually be committed and release
one of the waiters on A, but if nothing else is submitted to B, the 9
other waiters will wait forever.

The way to solve this is to drop all waiters that wait for a term
smaller than the one being committed. There is no longer any chance they
will be committed, since terms in the log may only grow.
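The rule above can be sketched as follows (the data structures and names here are hypothetical, not the actual raft implementation):

```cpp
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical sketch: waiters keyed by the log index they wait on,
// each remembering the term of the entry it submitted.
struct waiter {
    uint64_t term;
    std::function<void(bool committed)> done;
};

using waiter_map = std::multimap<uint64_t, waiter>;  // index -> waiter

// Called when an entry at (index, term) is committed. Waiters with a
// term smaller than the committed term can never be satisfied (terms
// in the log only grow), so they are dropped immediately instead of
// lingering until some entry happens to land on their exact index.
void on_commit(waiter_map& waiters, uint64_t index, uint64_t term) {
    for (auto it = waiters.begin(); it != waiters.end();) {
        if (it->second.term < term) {
            it->second.done(false);  // dropped: term superseded
            it = waiters.erase(it);
        } else if (it->first <= index && it->second.term == term) {
            it->second.done(true);   // entry committed
            it = waiters.erase(it);
        } else {
            ++it;
        }
    }
}
```

In the 10-entry example, a dummy entry committed by B at a higher term releases all of A's waiters at once, instead of just the one whose index it happens to share.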
2021-05-06 11:34:31 +03:00
Gleb Natapov
6abe2772dc raft: make snapshot transfer abortable
A snapshot transfer may take a long time, and meanwhile the leader
performing it may lose leadership. If that happens, the ongoing snapshot
transfer becomes obsolete, since the snapshot will be rejected by the
receiving node as coming from an old leader. Make snapshot transfers
abortable and abort them when the leader changes.
2021-05-06 11:34:31 +03:00
Gleb Natapov
50d545a138 raft: accept snapshots transfer from multiple nodes simultaneously
A leader may change while one of its followers is in snapshot transfer
mode, and that node may get an additional request for a snapshot
transfer from the new leader while the previous transfer is still not
aborted. Currently such a situation triggers an assert. This patch
allows active snapshot transfers from multiple nodes; only one of them
will succeed in the end, and all others will be replied to with 'fail'.
2021-05-06 11:34:31 +03:00
Gleb Natapov
073a9be4c7 raft: do not send probes while transferring snapshot
If a follower is in snapshot transfer mode there is no need to send
probe append messages to it.
2021-05-06 11:34:31 +03:00
Gleb Natapov
08077a21b7 raft: handle messages sending errors
Failure to send a message should not abort the raft server.
2021-05-06 11:34:31 +03:00
Gleb Natapov
d0ebd79deb raft: test: return error from rpc module if nodes are disconnected
Returning an error when nodes are disconnected more closely resembles
what happens with real networking.
2021-05-06 11:34:31 +03:00
Gleb Natapov
c4d87d7a23 raft: fix a typo in a variable name 2021-05-06 11:33:47 +03:00
Botond Dénes
c872a963b6 test: move reader_concurrency_semaphore related tests into separate file
The mutation_reader_test is already one of our largest test files.
Move the reader concurrency semaphore related tests to a new file,
making them easier to find and making the mutation reader test a little
smaller too.
2021-05-06 08:59:47 +03:00
Botond Dénes
5f217b6dee test: mutation_reader_test: convert restricted reader tests to semaphore tests
These two tests (restricted_reader_timeout and
restricted_reader_max_queue_length) really test the semaphore, but
through the restricted reader, which is distracting as it needlessly
brings an additional layer into the picture. Rewrite them to test the
semaphore directly, making them much lighter in the process.
2021-05-06 08:57:12 +03:00
Avi Kivity
e9802348b5 storage_proxy, treewide: use utils::small_vector inet_address_vector:s
Replace std::vector<inet_address> with a small_vector of size 3 for
replica sets (reflecting the common case of local reads, and the somewhat
less common case of single-datacenter writes). Vectors used to
describe topology changes are of size 1, reflecting that up to one
node is usually involved with topology changes. At those counts and
below we save an allocation; above those counts everything still works,
but small_vector allocates like std::vector.

In a few places we need to convert between std::vector and the new types,
but these are all out of the hot paths (or are in a hot path, but behind a
cache).
2021-05-05 18:36:54 +03:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Gleb Natapov
745f63991f raft: test: fix c&p error in a test
Message-Id: <YJKBOwBX8hqHLxsB@scylladb.com>
2021-05-05 17:18:49 +02:00
Avi Kivity
ddb1f0e6ca Merge "Choose the user max-result-size for service levels" from Botond
"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.

This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.

Fixes: #8591

Tests: unit(dev)
"

* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
  service/storage_proxy: get_max_result_size() defer to db for unlimited queries
  database: add get_unlimited_query_max_result_size()
  query_class_config: add operator== for max_result_size
  database: get_reader_concurrency_semaphore(): extract query classification logic
2021-05-05 18:11:10 +03:00
Lauro Ramos Venancio
15f72f7c9e TWCS: initialize _highest_window_seen
timestamp_type is an int64_t, so it has to be explicitly
initialized before use.

This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.
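The fix is plain in-class member initialization; a sketch (only the member name is taken from the commit title, the surrounding struct and the initial value of 0 are illustrative assumptions):

```cpp
#include <cstdint>

using timestamp_type = int64_t;

// Hypothetical surrounding class; the point is only the member init.
struct time_window_state {
    // Without an initializer the member holds an indeterminate value,
    // so comparisons against it (e.g. "have we seen a newer window?")
    // are unreliable. In-class initialization fixes that.
    timestamp_type _highest_window_seen = 0;
};
```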

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590
2021-05-05 17:31:05 +03:00
Avi Kivity
1ed3f54f4a Merge "size_tiered_compaction_strategy: get_buckets improvements" from Benny
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:

1. Consider only current bucket for each sstable.
   No need to scan all buckets using a map
   since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
   Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
   As we insert increasingly larger sstables into a bucket,
   its average size drifts up and eventually this may break
   the bucket invariant that all sstables in the bucket should
   be within the (bucket_low, bucket_high) range relative
   to the bucket average.
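The three improvements can be sketched together. The fit condition below follows the standard STCS bucketing rule and the drift guard follows the description above; the real code's details may differ:

```cpp
#include <cstdint>
#include <vector>

struct stcs_options {
    double bucket_low = 0.5;
    double bucket_high = 1.5;
    uint64_t min_sstable_size = 50;
};

// Sketch of the improved get_buckets: sstable sizes arrive sorted, so
// only the most recently opened bucket is considered (improvement 1);
// the running average is kept as a double (improvement 2); and a size
// is rejected if accepting it would drag the average so high that the
// bucket's smallest member falls below avg * bucket_low (improvement 3).
std::vector<std::vector<uint64_t>>
get_buckets(const std::vector<uint64_t>& sorted_sizes, const stcs_options& opt) {
    std::vector<std::vector<uint64_t>> buckets;
    double avg = 0;  // average size of the current (last) bucket
    for (uint64_t size : sorted_sizes) {
        if (!buckets.empty()) {
            auto& bucket = buckets.back();
            bool in_range = (size > avg * opt.bucket_low && size < avg * opt.bucket_high)
                         || (size < opt.min_sstable_size && avg < opt.min_sstable_size);
            double new_avg = (avg * bucket.size() + size) / (bucket.size() + 1);
            bool no_drift = bucket.front() >= new_avg * opt.bucket_low;
            if (in_range && no_drift) {
                bucket.push_back(size);
                avg = new_avg;
                continue;
            }
        }
        buckets.push_back({size});  // open a new bucket
        avg = size;
    }
    return buckets;
}
```

Run on the sizes from the simulation in the c1681cb9ea commit message, this sketch reproduces the IMPROVED bucketing (five buckets, with [252..428] split from [463..523]).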

Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
    compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy

Fixes #8584
"

* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
  compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
  compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
  compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
  compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
  compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
2021-05-05 16:25:12 +03:00
Avi Kivity
6977064693 dist: scylla_raid_setup: reduce xfs block size to 1k
Since Linux 5.12 [1], XFS is able to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.

That turns out to be 1024 for CRC-protected filesystems (which we want)
and it can also not be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.

Tested on AWS i3.large.

Fixes #8156.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b

Closes #8585
2021-05-05 16:07:50 +03:00
Nadav Har'El
64a4e5e059 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
2021-05-05 15:24:25 +03:00
Nadav Har'El
5fbd78ed96 CONTRIBUTING.md: add the requirement for self-contained headers
As far as I can tell, we never documented the requirement for
self-contained headers in our coding style. So let's do it now, and
explain the "ninja dev-headers" command and how to use it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
2021-05-05 15:10:46 +03:00
Benny Halevy
ead96e21c3 compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:37 +03:00
Benny Halevy
c1681cb9ea compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
SSTables are added in increasing size order so the bucket's
average might drift upwards.
Don't let it drift too high, to a point where the smallest
SSTable might fall out of range.

For example, here's a simulation run of the algorithm for these sstable sizes:
    [21, 123, 252, 363, 379, 394, 407, 428, 463, 467, 470, 523, 752, 774]

the simulated compaction strategy options are:
min_sstable_size = 4
bucket_low = 0.66667
bucket_high = 1.5

For each bucket, the following is printed: (avg * bucket_low) avg (avg * bucket_high)

UNCHANGED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 276.4)  414.6 ( 621.9): [252, 363, 379, 394, 407, 428, 463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}

IMPROVED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 247.0)  370.5 ( 555.8): [252, 363, 379, 394, 407, 428]
    ( 320.5)  480.8 ( 721.1): [463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}

Fixes #8584

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:28 +03:00
Benny Halevy
d3aa5265ab compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
    auto total_size = bucket.size() * old_average_size;
    auto new_average_size = (total_size + size) / (bucket.size() + 1);
```

we accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.

Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.

Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
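The accumulation is easy to see with a small numeric example (the sizes are hypothetical, the update formula is the one quoted above):

```cpp
#include <cstdint>
#include <vector>

// Running average computed the old way (integer division each step)
// versus kept as a double. Same update formula as in the commit:
//   total = count * avg;  new_avg = (total + size) / (count + 1)
uint64_t int_running_average(const std::vector<uint64_t>& sizes) {
    uint64_t avg = 0, count = 0;
    for (uint64_t size : sizes) {
        uint64_t total = count * avg;        // too small: avg was rounded down
        avg = (total + size) / (count + 1);  // rounded down again
        ++count;
    }
    return avg;
}

double fp_running_average(const std::vector<uint64_t>& sizes) {
    double avg = 0;
    uint64_t count = 0;
    for (uint64_t size : sizes) {
        avg = (avg * count + size) / (count + 1);
        ++count;
    }
    return avg;
}
```

For sizes {5, 6, 6, 6} the true mean is 5.75, but the integer version stays stuck at 5: each step rounds down, and the next step multiplies the already-rounded average back up, so the error never recovers.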

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:25 +03:00
Benny Halevy
44b094f9a5 compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
It has now become a reference used to update the bucket's average size
after a new sstable is inserted into it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:20 +03:00
Benny Halevy
336a4dc0fd compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.

Instead, just consider the most recently inserted bucket.

Once we see an sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.

Note, `old_average_size` should be renamed since this change
turns it into a reference and it is assigned the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:05 +03:00
Botond Dénes
9d5e958331 service/storage_proxy: get_max_result_size() defer to db for unlimited queries
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query
classification decisions. This move means that all these decisions are
now centralized in the database rather than scattered in different
places, so fixing one fixes all users.
2021-05-05 13:30:50 +03:00
Botond Dénes
992819b188 database: add get_unlimited_query_max_result_size()
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
2021-05-05 13:30:42 +03:00
Nadav Har'El
58e275e362 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
2021-05-05 13:23:00 +03:00
Avi Kivity
83a826a4de Merge 'Azure Ls v2 local disk setup' from Lubos Kosco
fixes #8325

The iotune tests were run on CentOS 8.2, both with the stock and the elrepo kernel, using Scylla 4.3 rc3.

Results are in https://docs.google.com/spreadsheets/d/1_uYq8UxY47XF5jreetrpleykLPqNGjfPXIirvTPh6rk/edit#gid=1101336711

Closes #7807

* github.com:scylladb/scylla:
  scylla_io_setup: add disk properties for L Azure instances
  scylla_util.py: add new class for Azure cloud support
2021-05-05 12:39:00 +03:00
Avi Kivity
3114f09d76 utils: small_vector: add print operator for std::ostream
In order to replace std::vector with utils::small_vector, it needs to
support this feature too.
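Such a print operator is generic boilerplate; a sketch of the shape it usually takes (this is not the actual utils::small_vector code, so std::vector stands in for the container, and the "{a, b, c}" format is an assumption):

```cpp
#include <ostream>
#include <sstream>
#include <vector>

// Sketch of a stream-printing helper for a vector-like container,
// rendering elements as "{a, b, c}". Shown for std::vector as a
// stand-in; the real patch adds the equivalent operator<< for
// utils::small_vector.
template <typename T>
std::ostream& print_vector(std::ostream& os, const std::vector<T>& v) {
    os << '{';
    const char* sep = "";
    for (const auto& e : v) {
        os << sep << e;
        sep = ", ";
    }
    return os << '}';
}
```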
2021-05-05 12:10:59 +03:00
Avi Kivity
84ea06f15b hints: messages.hh: add missing #include
Make the header self-contained.
2021-05-05 12:10:17 +03:00
Botond Dénes
e84c31fab8 query_class_config: add operator== for max_result_size 2021-05-05 11:20:22 +03:00
Botond Dénes
9313acb304 database: get_reader_concurrency_semaphore(): extract query classification logic
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
2021-05-05 10:41:04 +03:00
Tomasz Grabiec
121eb32679 Merge 'test: perf: report instructions retired per operations' from Avi Kivity
Instructions retired per op is a much more stable metric than time per
op (inverse throughput), since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected, since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.

This allows incremental changes to the code base to be compared with
more confidence.

Current results are around 55k instructions per read, and 52k for writes.
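The RAII wrapper mentioned in the patch list can be sketched like this (Linux-specific; a sketch only, not the actual perf-test code, and it tolerates the counter being unavailable under a restrictive perf_event_paranoid setting):

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// RAII wrapper around perf_event_open() counting retired instructions
// for the calling thread. The fd is closed on destruction.
class instruction_counter {
    int _fd = -1;
public:
    instruction_counter() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;        // start stopped; enabled explicitly
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        _fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    }
    ~instruction_counter() {
        if (_fd >= 0) {
            close(_fd);
        }
    }
    instruction_counter(const instruction_counter&) = delete;
    instruction_counter& operator=(const instruction_counter&) = delete;

    bool available() const { return _fd >= 0; }
    void start() {
        ioctl(_fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(_fd, PERF_EVENT_IOC_ENABLE, 0);
    }
    uint64_t stop() {
        ioctl(_fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        if (read(_fd, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count))) {
            return 0;
        }
        return count;
    }
};
```

Wrapping start()/stop() around the benchmarked operation yields the instructions-per-op figures quoted above, independent of CPU frequency.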

Closes #8563

* github.com:scylladb/scylla:
  test: perf: tidy up executor_stats snapshot computation
  test: perf: report instructions retired per operations
  test: perf: add RAII wrapper around Linux perf_event_open()
  test: perf: make executor_stats_snapshot() a member function of executor
2021-05-05 00:54:08 +02:00
Tomasz Grabiec
b8665c459d Merge "raft: replication test updates" from Alejo
Cleanups, fixes, and configuration change support for replication tests.

* alejo/raft-tests-replication-01-fixes-v13:
  raft: replication test: remove obsolete helper
  raft: replication test: add_entry with retries
  raft: replication test: support config change
  raft: replication test: add dummy command support
  raft: replication test: test both with and without prevote
  raft: replication test: make initial leader just default
  raft: replication test: create command helper
  raft: replication test: free elections as helper
  raft: replication test: fix election connectivity
  raft: replication test: fix custom election
  raft: replication test: add helpers for threshold and election
  raft: replication test: connectivity improvement
  raft: replication test: helper for server_address
  raft: replication test: use wait_log()
  raft: replication test: cycle leader more
  raft: replication test: fix a test description
  raft: replication test: remove multiple state machines
  raft: replication test: remove checksum
  raft: replication test: remove unused class param
2021-05-04 18:52:47 +02:00
Alejo Sanchez
27ad2a0f28 raft: replication test: remove obsolete helper
As we are now serially adding commands with consecutive integers, there
is no need to build vectors of commands. Remove the helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-04 11:01:07 -04:00
Alejo Sanchez
0a54fd848b raft: replication test: add_entry with retries
The current leader might have stepped down. Try again and learn if
there's a new leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-04 11:00:46 -04:00
Nadav Har'El
df65d09e08 Merge ' cdc: log: fill cdc$deleted_ columns in pre-images ' from Piotr Grabowski
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:

```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```

For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):

```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.

Therefore, to properly decode pre-images you would need to know which pre-image mode was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check the current pre-image mode).

A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:

If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```

A client library can now properly decode a pre-image row. If it sees a `NULL` value, it can check the `cdc$deleted_` column to determine whether this `NULL` value was part of the pre-image or was omitted because the column was not affected in the delta operation.

No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.

Additional example of trouble decoding pre-images before this change.
tbl2 is in `true` pre-image mode, tbl3 in `full` pre-image mode:

```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```

```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```

generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.

Closes #8568

* github.com:scylladb/scylla:
  cdc: log: assert post_image is always in full mode
  cdc: tests: check cdc$deleted_ columns in images
  cdc: log: fill cdc$deleted_ columns in pre-images
2021-05-04 14:45:27 +03:00
Lubos Kosco
c26bcf29f9 scylla_io_setup: add disk properties for L Azure instances 2021-05-04 13:13:05 +02:00
Lubos Kosco
f627fcbb0c scylla_util.py: add new class for Azure cloud support 2021-05-04 13:12:42 +02:00
Piotr Grabowski
cd6154e8bf cdc: log: assert post_image is always in full mode
Add an assertion that checks that post_image can never be in non-full
mode.
2021-05-04 12:33:15 +02:00
Piotr Grabowski
778fbb144f cdc: tests: check cdc$deleted_ columns in images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.

This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
2021-05-04 12:33:15 +02:00
Calle Wilund
7e345e37e8 cql/cdc_batch_delete_postimage_test - rename test files + fix result
The tests, when added, were not named kosher (*_test), which the
runner, apparently quaintly, requires to pick them up (instead of the
more sensible *.cql).

Thusly, the test was never run beyond initial creation, and also
bit-rotted slightly during behaviour changes.

Renamed and re-resulted.

Closes #8581
2021-05-04 12:47:33 +03:00
Avi Kivity
ef2313325b Merge "Teach sstables streams new streams API" from Pavel E
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.

The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximum IO
lengths).

The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.

tests: unit(dev)
"

* 'br-streams-napi' of https://github.com/xemul/scylla:
  sstables: Make checksum sink report buffer size from lower sink
  sstables: Report buffer size from compressed file sink
2021-05-04 12:22:38 +03:00