"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.
This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.
Fixes: #8591
Tests: unit(dev)
"
* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
service/storage_proxy: get_max_result_size() defer to db for unlimited queries
database: add get_unlimited_query_max_result_size()
query_class_config: add operator== for max_result_size
database: get_reader_concurrency_semaphore(): extract query classification logic
The timestamp_type is an int64_t, so it has to be explicitly
initialized before use.
This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.
Fixes #8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes #8590
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:
1. Consider only current bucket for each sstable.
No need to scan all buckets using a map
since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
As we insert increasingly larger sstables into a bucket,
its average size drifts up and eventually this may break
the bucket invariant that all sstables in the bucket should
be within the (bucket_low, bucket_high) range relative
to the bucket average.
Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy
Fixes #8584
"
* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since Linux 5.12 [1], XFS is able to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.
That turns out to be 1024 for crc-protected filesystems (which we want)
and it can also not be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.
Tested on AWS i3.large.
Fixes#8156.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b
Closes #8585
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
As far as I can tell, we never documented the requirement for self-contained
headers in our coding style. So let's do it now, and explain the
"ninja dev-headers" command and how to use it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
auto total_size = bucket.size() * old_average_size;
auto new_average_size = (total_size + size) / (bucket.size() + 1);
```
We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.
Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.
Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since now it became a reference used to update the bucket's average size
after a new sstable is inserted into it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.
Instead, just consider the most recently inserted bucket.
Once we see an sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.
Note, `old_average_size` should be renamed since this change
turns it into a reference and it's assigned with the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
Instructions retired per op is a much more stable metric than time per op
(inverse throughput) since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.
This allows incremental changes to the code base to be compared with
more confidence.
Current results are around 55k instructions per read, and 52k for writes.
Closes #8563
* github.com:scylladb/scylla:
test: perf: tidy up executor_stats snapshot computation
test: perf: report instructions retired per operations
test: perf: add RAII wrapper around Linux perf_event_open()
test: perf: make executor_stats_snapshot() a member function of executor
As we are now serially adding commands with consecutive integers, there
is no need to build vectors of commands. Remove helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:
```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```
For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.
Therefore, to properly decode pre-images you would need to know which pre-image mode was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check the current mode of pre-image).
A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:
If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```
A client library can now properly decode a pre-image row. If it sees a `NULL` value, it can check the `cdc$deleted_` column to determine whether this `NULL` value was part of the pre-image or was omitted due to not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.
Additional example:
Additional example of trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:
```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```
```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.
Closes #8568
* github.com:scylladb/scylla:
cdc: log: assert post_image is always in full mode
cdc: tests: check cdc$deleted_ columns in images
cdc: log: fill cdc$deleted_ columns in pre-images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.
This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
The tests, when added, were not named kosher (*_test), which the
runner apparently, quaintly, requires in order to pick them up (instead
of the more sensible *.cql).
Thus, the tests were never run beyond initial creation, and also
bit-rotted slightly during behaviour changes.
Renamed and re-resulted.
Closes #8581
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.
The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximum IO
lengths).
The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.
tests: unit(dev)
"
* 'br-streams-napi' of https://github.com/xemul/scylla:
sstables: Make checksum sink report buffer size from lower sink
sstables: Report buffer size from compressed file sink
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.
When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.
This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches the marked point, or a configured timeout expires.
This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
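For example, a scylla.yaml fragment enabling a two-minute wait (the option name is taken from the description above; the value is just an illustration):

```yaml
# A positive value enables waiting for hints to be replayed before a
# user-issued repair starts; the value is the timeout in milliseconds.
wait_for_hint_replay_before_repair_in_ms: 120000
```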
Fixes #8102
Tests:
- unit(dev)
- some manual tests:
- shutting down repair coordinator during hints replay,
- shutting down node participating in repair during hints replay,
Closes #8452
* github.com:scylladb/scylla:
repair: introduce abort_source for repair abort
repair: introduce abort_source for shutdown
storage_proxy: add abort_source to wait_for_hints_to_be_replayed
storage_proxy: stop waiting for hints replay when node goes down
hints: dismiss segment waiters when hint queue can't send
repair: plug in waiting for hints to be sent before repair
repair: add get_hosts_participating_in_repair
storage_proxy: coordinate waiting for hints to be sent
config: add wait_for_hint_replay_before_repair option
storage_proxy: implement verbs for hint sync points
messaging_service: add verbs for hint sync points
storage_proxy: add functions for syncing with hints queue
db/hints: make it possible to wait until current hints are sent
db/hints: add a metric for counting processed files
db/hints: allow to forcefully update segment list on flush
Add support for configuration change on leader.
Keep track of servers in config in test.
Add a dummy entry to confirm configuration changed. If the add fails,
because the old leader was not in the new config and stepped down, the
config is considered changed, too.
Add a test with some configuration changes.
Add a test cycling every scenario for 1 of 4 nodes removed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use a special value as dummy entry to be ignored when seen in state
machine input.
Ignore dummy entries for count.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change the default was prevote enabled.
With this change each test is run with and without prevote.
This doubles the number of test cases.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The test suite requires an initial leader and at the moment it's always
just 0. Make it default and simplify code.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a leader was already disconnected, the election of a new leader could
re-connect it. Save the original connectivity and restore it when done
electing the new leader.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use the new specific connectivity to manage old leader disconnection
more precisely.
This fixes having elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader. For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for B.
So to handle the general case, the old leader needs to first be
disconnected from all nodes to make the desired node a candidate, and
then be connected only to the desired candidate (else, other nodes would
see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, this patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the latter waits until a candidate either
becomes a leader or reverts to follower.
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add 2 helper functions for making nodes reach the timeout threshold and
for electing a specific node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Replace simple full disconnect of a node with specific from -> to
disconnection tracking.
This will help electing new leaders.
Say there are {A,B,C} with A leader and we want to elect B.
Before this patch, we would disconnect A, run an election with just
{B,C}, and then re-connect A.
If we have {A,B} and want to elect B, this won't work as B needs 2/2+1
votes and A is disconnected, even if we made A step down. This patch
corrects this shortcoming. (@gleb-cloudius)
With this patch, we can specify other followers (not the previous or
next leader) to not see the old leader, but the new and old leaders see
each other just fine. In the example {A,B,C} above we can cut A<->B
specifically.
Also, this is closer to etcd testing and should help porting cases.
NOTE: in the current test implementation failure_detector reports
node.is_alive(other_node) if there is a connection both ways.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Checksum was removed so undo support for multiple versions added in:
test: add support for different state machines
43dc5e7dc2
NOTE: as there is a test with custom total_values, expected value cannot
be static const anymore. (line 630)
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>