storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.
This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.
Test results (perf_simple_query --smp 1 --task-quota-ms 10):
```
before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op)
after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op)
```
4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).
The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.
I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.
Closes#8592
* github.com:scylladb/scylla:
storage_proxy, treewide: use utils::small_vector inet_address_vector:s
storage_proxy, treewide: introduce names for vectors of inet_address
utils: small_vector: add print operator for std::ostream
hints: messages.hh: add missing #include
* scylla-dev/raft-snapshot-fixes-v4:
raft: document that add entry my throw commit_status_unknown
raft: test: add test of a leadership change during ongoing snapshot transfer
raft: test: retry submitting an entry if it was dropped
raft: test: wait for the log to be fully replicated on new leader only
raft: drop waiters with outdated terms
raft: make snapshot transfer abortable
raft: accept snapshots transfer from multiple nodes simultaneously
raft: do not send probes while transferring snapshot
raft: handle messages sending errors
raft: test: return error from rpc module if nodes are disconnected
raft: fix a typo in a variable name
In commit 323f72e48a (repair: Switch to
use NODE_OPS_CMD for replace operation), we switched replace operation
to use the new NODE_OPS_CMD infrastructure.
In this patch set, we continue the work to switch decommission and bootstrap
operation to use NODE_OPS_CMD.
Fixes#8472Fixes#8471Closes#8481
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for bootstrap operation
repair: Switch to use NODE_OPS_CMD for decommission operation
"
The current home of these tests -- mutation_reader_test -- is already
one of the larges test files we have. To reduce the size of the former
and to make finding these tests easier they are moved to a separate
file.
"
* 'reader-concurrency-semaphore-test/v2' of https://github.com/denesb/scylla:
test: move reader_concurrency_semaphore related tests into separate file
test: mutation_reader_test: convert restricted reader tests to semaphore tests
Currently an entry is declared to be dropped only when an entry with
different term is committed with the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.
Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
commit a dummy entry which will be committed eventually and will release
one of the waiters on A, but if anything else is submitted to B 9 other
waiters will wait forever.
The way to solve that is to drop all waiters that wait for a term
smaller that one been committed. There is no chance they will be
committed any longer since terms in the log may only grow.
A snapshot transfer may take a lot of time and meanwhile a leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfer abortable and
abort them when leader changes.
A leader may change while one of its followers is in snapshot transfer
mode and that node may get additional request for snapshot transfer from
a new leader while previous transfer is still not aborted. Currently
such situation will trigger an assert. This patch allows to have active
snapshot transfers from multiple nodes, but only one of them will succeed
in the end, all other will be replied to with 'fail'.
The mutation_reader_test is already one of our largest test files.
Move the reader concurrency semaphore related tests to a new file,
making them easier to find making the mutation reader test a little bit
smaller too.
These two tests (restricted_reader_timeout and
restricted_reader_max_queue_length) are testing the semaphore in
reality, but through the restricted reader, which is distracting as it
needlessly brings in an additional layer into the picture. Rewrite them
to test the semaphore directly, getting much lighter in the process.
Replace std::vector<inet_address> with a small_vector of size 3 for
replica sets (reflecting the common case of local reads, and the somewhat
less common case of single-datacenter writes). Vectors used to
describe topology changes are of size 1, reflecting that up to one
node is usually involved with topology changes. At those counts and
below we save an allocation; above those counts everything still works,
but small_vector allocates like std::vector.
In a few places we need to convert between std::vector and the new types,
but these are all out of the hot paths (or are in a hot path, but behind a
cache).
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.
This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.
Fixes: #8591
Tests: unit(dev)
"
* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
service/storage_proxy: get_max_result_size() defer to db for unlimited queries
database: add get_unlimited_query_max_result_size()
query_class_config: add operator== for max_result_size
database: get_reader_concurrency_semaphore(): extract query classification logic
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.
This missing inicialization prevented the major compactation
from happening when a time window finishes, as described in #8569.
Fixes#8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes#8590
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:
1. Consider only current bucket for each sstable.
No need to scan all buckets using a map
since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
As we insert increasingly larger sstables into a bucket,
it's average size drifts up and eventually this may break
the bucket invariant that all sstables in the bucket should
be within the (bucket_low, bucket_high) range relative
to the bucket average.
Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy
Fixes#8584
"
* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since Linux 5.12 [1], XFS is able to to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.
That turns out to be 1024 for crc-protected filesystems (which we want)
and it can also not be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.
Tested on AWS i3.large.
Fixes#8156.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251bCloses#8585
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
As far as I can tell, we never documented requirement for self-contained
headers in our coding style. So let's do it now, and explain the
"ninja dev-headers" command and how to use it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
Using integer division lose accuracy by rounding down the result.
Each time we calculate:
```
auto total_size = bucket.size() * old_average_size;
auto new_average_size = (total_size + size) / (bucket.size() + 1);
```
We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.
Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.
Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since now it became a reference used to update the bucket's average size
after a new sstable is inserted into it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.
Instead, just consider the most recently inserted bucket.
Once we see a sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.
Note, `old_average_size` should be renamed since this change
turns it into a reference and it's assigned with the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
Instructions retired per op is a much more stable than time per op
(inverse throughput) since it isn't much affected by changes in
CPU frequencey or other load on the test system (it's still somewhat
affected since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer inststructions to execute in more time than more instructions,
but that isn't an issue for comparative tests).
This allows incremental changes to the code base to be compared with
more confidence.
Current results are around 55k instructions per read, and 52k for writes.
Closes#8563
* github.com:scylladb/scylla:
test: perf: tidy up executor_stats snapshot computation
test: perf: report instructions retired per operations
test: perf: add RAII wrapper around Linux perf_event_open()
test: perf: make executor_stats_snapshot() a member function of executor
As we are now serially adding commands with consecutive integers there
is no need to build vectors of commands. Remove helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:
```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```
For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.
Therefore, to properly decode pre-images you would need to know in which mode pre-image was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check a current mode of pre-image).
A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:
If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```
A client library now can properly decode a pre-image row. If it sees a `NULL` value, it can now check the `cdc$deleted_` column to determine if this `NULL` value was a part of pre-image or it was omitted due to not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.
Additional example:
Additional example of trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:
```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```
```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 described v being `NULL` in the pre-image.
Closes#8568
* github.com:scylladb/scylla:
cdc: log: assert post_image is always in full mode
cdc: tests: check cdc$deleted_ columns in images
cdc: log: fill cdc$deleted_ columns in pre-images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.
This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
The tests, when added, where not named kosher (*_test), which the
runner apparently quaintly, require to pick it up (instead of the more
sensisble *.cql).
Thusly, the test was never run beyond initial creation, and also
bit-rotted slightly during behaviour changes.
Renamed and re-resulted.
Closes#8581
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.
The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximim IO
lengths).
The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.
tests: unit(dev)
"
* 'br-streams-napi' of https://github.com/xemul/scylla:
sstables: Make checksum sink report buffer size from lower sink
sstables: Report buffer size from compressed file sink