If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).
Cassandra returns a invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:
* For NULL, we consider m[NULL] to result in NULL - instead of an error.
This behavior is more consistent with other expressions that contain
null - for example NULL[2] and NULL<2 both result in NULL as well.
Moreover, if in the future we allow more complex expressions, such
as m[a] (where a is a column), we can find the subscript to be null
for some rows and non-null for other rows - and throwing an "invalid
query" in the middle of the filtering doesn't make sense.
* For UNSET_VALUE, we do consider this an error like Cassandra, and use
the same error message as Cassandra. However, the current implementation
checks for this error only when the expression is evaluated - not
before. It means that if the scan is empty before the filtering, the
error will not be reported and we'll silently return an empty result
set. We currently consider this ok, but we can also change this in the
future by binding the expression only once (today we do it on every
evaluation) and validating it once after this binding.
Fixes#10361Fixes#10399
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit fbb2a41246)
When we have an filter such as "WHERE m[2] = 3" (where m is a map
column), if a row had a null value for m, our expression evaluation
code incorrectly dereferences an unset optional, and continued
processing the result of this dereference which resulted in undefined
behavior - sometimes we were lucky enough to get "marshaling error"
but other times Scylla crashed.
The fix is trivial - just check before dereferencing the optional value
of the map. We return null in that case, which means that we consider
the result of null[2] to be null. I think this is a reasonable approach
and fits our overall approach of making null dominate expressions (e.g.,
the value of "null < 2" is also null).
The test test_filtering.py::test_filtering_null_map_with_subscript,
which used to frequently fail with marshaling errors or crashes, now
passes every time so its "xfail" mark is removed.
Fixes#10417
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 808a93d29b)
The test test_null.py::test_map_subscript_null turned out to reproduce
multiple bugs related to using map subscripts in filtering expressions.
One was issue #10361 (m[null] resulted in a bizarre error) or #10399
(m[null] resulted in a crash), and a different issue was #10401 (m[2]
resulted in a bizarre error or a crash if m itself was null). Moreover,
the same test uncovered different bugs depending how it was run - alone
or with other tests - because it was using a shared table.
In this patch we introduce two separate tests in test_filtering.py
which are designed to reproduce these separate bugs instead of mixing
them into one test. The new tests also cover a few more corners which
the previous test (which focused on nulls) missed - such as UNSET_VALUE.
The two new tests (and the old test_map_subscript_null) pass on
Cassandra so still assume that the Cassandra behavior - that m[null]
should be an error - is the correct behavior. We may want to change
the desired behavior (e.g., to decide that m[null] be null, not an
error), and change the tests accordingly later - but for now the
tests follow Cassandra's behavior exactly, and pass on Cassandra
and fail on Scylla (so are marked xfail).
The bugs reproduced by these tests involve randomness or reading
uninitialized memory, so these tests sometimes pass, sometimes fail,
and sometimes even crash (as reported in #10399 and #10401). So to
reproduce these bugs run the tests multiple times. For example:
test/cql-pytest/run --count 100 --runxfail
test_filtering.py::test_filtering_null_map_with_subscript
Refs #10361
Refs #10399
Refs #10401
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 189b8845fe)
lookup_readers might fail after populating some readers
and those better be closed before returning the exception.
Fixes#10351
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10425
(cherry picked from commit 055141fc2e)
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.
Fixes#10787
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cd68b04fbf)
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.
Fix by triggering soft pressure on retries.
Fixes#10801
Refs #10793
(cherry picked from commit 0e78ad50ea)
Closes#10802
(cherry picked from commit 3bec1cc19f)
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip one a range tombstone for a certain pattern of deletions
and under certain sequence of events.
_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.
The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.
For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:
[a, c] @ t0
[a, b) @ t1, [b, d] @ t2
c > b
The proper sequence for such versions is (assuming d > c):
[a, b) @ t1,
[b, d] @ t2
Due to the bug, the reader will produce:
[a, b) @ t1,
[b, c] @ t0
The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).
The problem goes away once MVCC versions merge.
Fixes#10913Fixes#10830Closes#10914
(cherry picked from commit a6aef60b93)
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All, but azure really do so.
The azure one gets dc/rack on all shards, which's excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.
fixes: #10494
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc87d0)
With v2 having individual bounds of range tombstone as separate
fragments, out-of-order fragments become more difficult to handle,
especially in the presence of active range tombstone.
Scrub in both SKIP and SEGREGATE mode closes the partition on
seeing the first invalid fragment (SEGREAGE re-opens it immediately).
If there is an active range tombstone, scrub now also has to take care
of closing said tombstone when closing the partition. In a normal stream
it could just use the last position-in-partition to create a closing
bound. But when out-of-order fragments are on the table this is not
possible: the closing bound may be found later in the stream, with a
position smaller than that of the current position-in-partition.
To prevent extending range tombstone changes like that, Scrub now aborts
the compaction on the first invalid fragment seen *inside* an active
range tombstone.
Fixing a v2 stream with range tombstone changes is definitely possible,
but non-trivial, so we defer it until there is demand for it.
This series also makes the mutation fragment stream validator check for
open range tombstones on partition-end and adds a comprehensive
test-suite for the validator.
Fixes: #10168
Tests: unit(dev)
* scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git:
compaction/compaction: abort scrub when attempting to rectify stream with active tombstone
test/boost/mutation_test: add test for mutation_fragment_stream_validator
mutation_fragment_stream_validator: validate range tombstone changes
(cherry picked from commit edd0481b38)
This series refactors `table::snapshot` and moves the responsibility
to flush the table before taking the snapshot to the caller.
`flush_on_all` and `snapshot_on_all` helpers are added to replica::database
(by making it a peering_sharded_service) and upper layers,
including api and snapshot-ctl now call it instead of calling cf.snapshot directly.
With that, error are handed in table::snapshot and propagated
back to the callers.
Failure to allocate the `snapshot_manager` object is fatal,
similar to failure to allocate a continuation, since we can't
coordinate across the shards without it.
Test: unit(dev), rest_api(debug)
* github.com:scylladb/scylla:
table: snapshot: handle errors
table: snapshot: get rid of skip_flush param
database: truncate: skip flush when taking snapshot
test: rest_api: storage_service: verify_snapshot_details: add truncate
database: snapshot_on_all: flush before snapshot if needed
table: make snapshot method private
database: add snapshot_on_all
snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
api: storage_service: increase visibility of snapshot ops in the log
api: storage_service: coroutinize take_snapshot and del_snapshot
api: storage_service: take_snapshot: improve api help messages
test: rest_api: storage_service: add test_storage_service_snapshot
database: add flush_on_all variants
test: rest_api: add test_storage_service_flush
(cherry picked from commit 2c39c4c284)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10975
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).
Fixes#10851Closes#10855
(cherry picked from commit bc3a635c42)
To avoid failing to run scripts in non-root user, we need to set
permission explicitly on executables.
Fixes#10752Closes#10840
(cherry picked from commit 13caac7ae6)
The flush of hints and batchlog are needed only for the table with
tombstone_gc_mode set to repair mode. We should skip the flush if the
tombstone_gc_mode is not repair mode.
Fixes#10004Closes#10124
(cherry picked from commit ec59f7a079)
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
Fixes#10183Closes#10181
* github.com:scylladb/scylla:
test: add a case for LIKE operator on a descending order column
types: fix is_string for reversed types
(cherry picked from commit 733672fc54)
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.
However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.
Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.
The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.
Fixes#10151
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
(cherry picked from commit 0764e511bb)
Storage field of "coredumpctl info" changed at systemd-v248, it added
"(present)" on the end of line when coredump file available.
Fixes#10669Closes#10714
(cherry picked from commit ad2344a864)
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression paramters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.
Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax
However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.
So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).
Fixes#10332
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
(cherry picked from commit 9c1ebdceea)
* seastar 8b2c13b3...dbf79189 (1):
> Merge 'Calculate max IO lengths as lengths'
io_queue: Type alias for internal::io_direction_and_length
io_queue, fair_group: Throw instead of assert
io_queue: Keep max lengths on board
io_queue: Toss request_fq_ticket()
io_queue: Introduce make_ticket() helper
io_queue: Remove max_ticket_size
io_queue: Make make_ticket() non-brancy
io_queue: Add devid to group creation log
tests: cstress(release)
fixes: #10704
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.
In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:
scheduling_group
messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
if (must_compress) {
opts.compressor_factory = &compressor_factory;
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
- opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ // We send cookies only for non-default statement tenant clients.
+ if (idx > 3) {
+ opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+ }
This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was change to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.
Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.
Fixes#9505.
Closes#10673
(cherry picked from commit c83393e819)
This patch set adds two commits to allow trigger off strategy early for node operations.
*) repair: Repair table by table internally
This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.
Before:
```
for range in ranges
for table in tables
repair(range, table)
```
After:
```
for table in tables
for range in ranges
repair(range, table)
```
The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This allows to reduce the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.
This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.
Refs: #10462
*) repair: Trigger off strategy compaction after all ranges of a table is repaired
When the repair reason is not repair, which means the repair reason is
node operations (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.
To trigger off strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.
Refs: #10462Closes#10551
* github.com:scylladb/scylla:
repair: Trigger off strategy compaction after all ranges of a table is repaired
repair: Repair table by table internally
(cherry picked from commit e65b3ed50a)
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!
This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.
The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.
After this patch, the following test now works on Scylla without
experimental flags turned on:
test/alternator/run test_ttl.py::test_describe_ttl_without_ttl
Refs #10660
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8ecf1e306f)
Seems like 59adf05 has a bug, the regex pattern only handles first
32CPUs cpuset pattern, and ignores rest.
We should extend regex pattern to handle all CPUs.
Fixes#10523Closes#10524
(cherry picked from commit a9dfe5a8f4)
Currently, we just passes entire output of perftune.py when getting CPU
mask from the script, but it may cause parse error since the script may
also print warning message.
To avoid that, we need to extract CPU mask from the output.
Fixes#10082Closes#10107
(cherry picked from commit 59adf05951)
First patch removes incorrect usage of rwlock which should be restricted to minor and major compaction tasks.
Second patch revives a semaphore, which was lost in 6737c88045, as we want major to not wait on off-strategy completion before deciding whether or not it should proceed with execution. It wouldn't proceed with execution if user asked major to stop while waiting for a chance to run.
For master, we're going to rely on abortable variant of get_units() to allow major to be quickly aborted.
Fixes#10485.
Closes#10582
* github.com:scylladb/scylla:
compaction_manager: Revive custom job semaphore
compaction_manager: Remove rwlock usage in run_custom_job()
When entry loading fails and there is another request blocked on the
same page, attempt to erase the failed entry will abort because that
would violate entry_ptr guarantees, which is supposed to keep the
entry alive.
The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch makes a more
general approach by relaxing the assert.
The assert manifested like this:
scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.
Fixes#10617Closes#10653
(cherry picked from commit f87274f66a)
In commit 6737c88045, we started using a single semaphore for
maintenance operations, which is a good change.
However, after introduction of off-strategy, major cannot proceed
until off-strategy is done reshaping all its input files.
If user requests major to abort, the command will only return
once off-strategy is done, and that can take lots of time.
In master, we'll allow pending major to be quickly aborted, but
that's not possible here as abortable variant of get_units()
is not available yet.
Here, we'll allow major to proceed in parallel to off-strategy,
so major can decide whether or not it should run in parallel.
Fixes#10485.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The rwlock usage was introduced in 2017 commit 10eaa2339e.
Resharding was online back then and we want to serialize it with
major.
Rwlock usage should be restricted to major and minor, as clearly
stated in the documentation, but we're still using it in
run_custom_job().
It gains us nothing, it only prevents off-strategy and other
custom jobs from running concurrently to major.
Let's kill this as we want to allow off-strategy to not prevent
a major from happening in parallel, as the former works only
on the maintenance sstable set and won't interfere with
the latter.
Refs #10485.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Get all flush permits to serialize with any
ongoing flushes and preventing further flushes
during table::clear, in particular calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.
Fixes#10423
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup
Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().
_last_compacted_keys is used by regular for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.
Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.
To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.
Fixes#10378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#10508
(cherry picked from commit 8e99d3912e)
This reverts commit 4affa801a5.
In issue #10146 a write throughput drop of ~50% was reported, after
bisect it was found that the change that caused it was adding some
code to the table::disable_auto_compaction which stops ongoing
compactions and returning a future that resolves once all the compaction
tasks for a table, if any, were terminated. It turns out that this function
is used only at startup (and in REST api calls which are not used in the test)
in the distributed loader just before resharding and loading of
the sstable data. It is then reanabled after the resharding and loading
is done.
For still unknown reason, adding the extra logic of stopping ongoing
compactions made the write throughput drop to 50%.
Strangely enough this extra logic **should** (still unvalidated) not
have any side effects since no compactions for a table are supposed to
be running prior to loading it.
This regains the performance but also undo a change which eventually
should get in once we find the actual culprit.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes#10559Reopens#9313.
(cherry picked from commit 8e8dc2c930)
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.
Fixes#10487Closes#10503
(cherry picked from commit 603dd72f9e)
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.
Refs #10418
(to be closed when the fix reaches branch-4.6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10419
(cherry picked from commit 01f41630a5)
It seams that batch prepared statements always return false for
depends_on, this in turn renders the removal criteria from the
prepared statements cache to always be false which result by the
queries not being evicted.
Here we change the function to return the true state meaning,
they will return true if one of the sub queries is dependant
upon the keyspace and/ or column family.
Fixes#10129
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
purpose
Cql statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The former, took as a parameter only a table
name, which makes no sense. There could be multiple tables with the same
name each in a different keyspace and it doesn't make sense to
generalize the test - i.e to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls to one - depends on that takes a
keyspace name and optionally also a table name, that way every logical
dependency tests that makes sense is supported by a single API call.
(cherry picked from commit bf50dbd35b)
Ref #10129
An OOM failure while peeking into fragment, to determine if reader will
produce any fragment, causes Scylla to abort as flat_mutation_reader
expects reader to be closed before destroyed. Let's close it if
peek() fails, to handle the scenario more gracefully.
Fixes#10027.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220204031553.124848-1-raphaelsc@scylladb.com>
(cherry picked from commit 755cec1199)
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.
Should probably be backported to all CDC enabled versions.
Fixes#10473Closes#10474
(cherry picked from commit 78350a7e1b)
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.
This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.
New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.
Fixes#10440.
Closes#10456
* github.com:scylladb/scylla:
loading_cache_test: test prepared stmts cache
loading_cache: force minimum size of unprivileged
loading_cache: extract dropping entries to lambdas
loading_cache: separately track size of sections
loading_cache: fix typo in 'privileged'
(cherry picked from commit 5169ce40ef)
There are two issues with current implementation of remove/remove_if:
1) If it happens concurrently with get_ptr(), the latter may still
populate the cache using value obtained from before remove() was
called. remove() is used to invalidate caches, e.g. the prepared
statements cache, and the expected semantic is that values
calculated from before remove() should not be present in the cache
after invalidation.
2) As long as there is any active pointer to the cached value
(obtained by get_ptr()), the old value from before remove() will be
still accessible and returned by get_ptr(). This can make remove()
have no effect indefinitely if there is persistent use of the cache.
One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinaotr will fail
to process the value with the following exception:
User Defined Type value contained too many fields (expected 5, got 6)
The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.
The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
avaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.
Fixes#10117
Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)