These warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and `-Wno-unused-variable` from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Schema-related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes #12858
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step towards librarization or modularization of the
source base.
Closes #12788
There's a bunch of them, used mainly by view_builder and also by the API
and storage_service. All use the global qctx to do their job; now that
the callers have main-local sharded<system_keyspace> references they can
be made non-static.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view builder updates the system.scylla_views_builds_in_progress and
system.built_views tables and thus needs the system keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since we potentially search for the row_lock in parallel with acquiring the read_lock on the partition, we race with row_locker::unlock, which may erase the _row_locks entry for the same clustering key: there is no lock to protect it until the partition lock has been acquired and the lock_partition future is resolved.
This change moves the code that searches for or allocates the row lock to _after_ the partition lock has been acquired, making sure we start the read/write lock function on it synchronously, without yielding, to prevent this use-after-free.
This adds an allocation for copying the clustering key in advance, which wasn't needed before when the lock was already found, but view building is not on the hot path so we can tolerate that.
This is required on top of 5007ded2c1 as seen in https://github.com/scylladb/scylladb/issues/12632 which is closely related to #12168 but demonstrates a different race causing use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12639
* github.com:scylladb/scylladb:
view: row_lock: lock_ck: try_emplace row_lock entry
view: row_lock: lock_ck: find or construct row_lock under partition lock
We currently don't clean up the system_distributed.view_build_status
table after nodes are removed. This can cause a false-positive check
for whether view update generation is needed for streaming.
The proper fix is to clean up this table, but that will be more
involved, and even when done, it might not be immediate. So until then,
and to be on the safe side, filter out entries belonging to unknown
hosts from said table.
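As a sketch of the stop-gap filtering, assuming illustrative stand-in
types rather than ScyllaDB's actual entry and host-id types:

```
// Hypothetical sketch: drop view_build_status entries whose host is no
// longer a known cluster member. host_id here is a stand-in type.
#include <unordered_set>
#include <vector>

struct status_entry {
    int host_id;   // illustrative; the real table keys on host UUIDs
    bool built;
};

std::vector<status_entry>
filter_known_hosts(std::vector<status_entry> entries,
                   const std::unordered_set<int>& known_hosts) {
    std::erase_if(entries, [&](const status_entry& e) {
        return !known_hosts.contains(e.host_id);
    });
    return entries;
}
```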
Fixes: #11905
Refs: #11836
Closes #11860
Use the same method as the two-level lock at the
partition level. try_emplace will either use
an existing entry, if found, or create a new
entry otherwise.
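A minimal sketch of the try_emplace pattern described above; the map and
value types are illustrative, not the actual row_locker internals:

```
#include <map>
#include <string>

struct row_lock_state {
    int readers = 0;
    int writers = 0;
};

std::map<std::string, row_lock_state> row_locks;

row_lock_state& find_or_create(const std::string& clustering_key) {
    // try_emplace constructs a value only if the key is absent; if an
    // entry already exists, it is returned untouched.
    auto it = row_locks.try_emplace(clustering_key).first;
    return it->second;
}
```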
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since we potentially search for the row_lock in parallel with acquiring
the read_lock on the partition, we race with row_locker::unlock, which
may erase the _row_locks entry for the same clustering key: there is no
lock to protect it until the partition lock has been acquired and the
lock_partition future is resolved.
This change moves the code that searches for or allocates the row lock
to _after_ the partition lock has been acquired, making sure we start
the read/write lock function on it synchronously, without yielding, to
prevent this use-after-free.
This adds an allocation for copying the clustering key in advance,
even if a row_lock entry already exists, which wasn't needed before.
It only slows us down (a bit) when there is contention and the lock
already existed when we want to go locking. In the fast path there
is no contention, and then the code already had to create the lock
and copy the key. In any case, the penalty of copying the key once
is tiny compared to the rest of the work that view updates are doing.
This is required on top of 5007ded2c1 as
seen in https://github.com/scylladb/scylladb/issues/12632
which is closely related to #12168 but demonstrates a different race
causing use-after-free.
Fixes #12632
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This reverts commit ac2e2f8883. It causes
a regression ("std::bad_variant_access in load_view_build_progress").
Commit 2978052113 (a reindent) is also reverted as part of
the process.
Fixes #12395
Refactor the existing stats tracking and updating
code into struct latency_stats_tracker and, while at it,
count lock_acquisitions only on success.
Decrement operations_currently_waiting_for_lock in the destructor
so it's always balanced with the unconditional increment
in the ctor.
As for estimated_waiting_for_lock, it is always
updated in the dtor, both on success and failure, since
the wait for the lock happened, whether it
timed out or not.
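A hypothetical sketch of this RAII shape, assuming plain counters
rather than the actual Seastar metrics types:

```
#include <chrono>
#include <cstdint>

struct lock_stats {
    uint64_t operations_currently_waiting_for_lock = 0;
    uint64_t lock_acquisitions = 0;
    std::chrono::microseconds total_wait{0};  // feeds estimated_waiting_for_lock
};

class latency_stats_tracker {
    lock_stats& _stats;
    std::chrono::steady_clock::time_point _start;
public:
    explicit latency_stats_tracker(lock_stats& s)
        : _stats(s), _start(std::chrono::steady_clock::now()) {
        // unconditional increment, balanced by the dtor
        ++_stats.operations_currently_waiting_for_lock;
    }
    ~latency_stats_tracker() {
        --_stats.operations_currently_waiting_for_lock;
        // waiting time is recorded on success and failure alike
        _stats.total_wait += std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - _start);
    }
    // called only when the lock was actually acquired
    void lock_acquired() { ++_stats.lock_acquisitions; }
};
```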
Fixes #12190
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12225
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batches of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafted large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: to detect when build_some() is done, it is no
longer enough to check whether it returned zero view-update rows;
rather, it explicitly returns whether or not it is done as an
std::optional.
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
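An illustration of the control-flow change, as a sketch under assumed
names (not the actual ScyllaDB signature of build_some()):

```
#include <cstddef>
#include <optional>

// Hypothetical stand-in for build_some(): returns the number of
// view-update rows generated by one batch, or std::nullopt once the
// whole base update has been consumed.
std::optional<size_t> build_some() {
    static int batches_left = 3;   // dummy driver for the sketch
    if (batches_left-- == 0) {
        return std::nullopt;       // explicitly done
    }
    return 0;                      // 0 rows: all skippable, but not done
}

void build_all() {
    while (auto generated = build_some()) {
        // *generated may be 0 when every row in the batch was skipped
        // (see can_skip_view_updates()); that no longer ends the loop.
    }
}
```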
Fixes #12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12305
This pull request introduces support for global secondary indexes based on static columns.
Local secondary indexes based on static columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require the full partition key when querying, such indexes wouldn't be very useful and would only waste resources.
The index table for secondary indexes on static columns, unlike for other secondary indexes, does not contain clustering keys from the base table. A static column's value determines a set of full partitions, so clustering keys would be unnecessary.
The already existing logic for querying using secondary indexes works after introducing minimal modifications. The view update generation path now works on a common representation of static and clustering rows, and the new representation allowed most of the logic to be kept intact.
New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed.
Materialized views with static rows as a key are __not__ implemented in this PR.
Fixes: #2963
Closes #11166
* github.com:scylladb/scylladb:
test_materialized_view: verify that static columns are not allowed
test_secondary_index: add (currently failing) test for static index paging
test_secondary_index: add more tests for secondary indexes on static columns
cassandra_tests: enable existing tests for static columns
create_index_statement: lift restriction on secondary indexes on static rows
db/view: fetch and process static rows when building indexes
gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature
create_index_statement: disallow creation of local indexes with static columns
select_statement: prepare paging for indexes on static columns
select_statement: do not attempt to fetch clustering columns from secondary index's table
secondary_index_manager: don't add clustering key columns to index table of static column index
replica/table: adjust the view read-before-write to return static rows when needed
db/view: process static rows in view_update_builder::on_results
db/view: adjust existing view update generation path to use clustering_or_static_row
column_computation: adjust to use clustering_or_static_row
db/view: add clustering_or_static_row
deletable_row: add column_kind parameter to is_live
view_info: adjust view_column to accept column_kind
db/view: base_dependent_view_info: split non-pk columns into regular and static
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:
1. lock_pk acquires an exclusive lock on the partition.
2.a lock_ck attempts to acquire a shared lock on the partition
    and any lock on the row. Both cases currently use a fiber
    returning a future<rwlock::holder>.
2.b since the partition is locked, the lock_partition times out
returning an exceptional future. lock_row has no such problem
and succeeds, returning a future holding a rwlock::holder,
pointing to the row lock.
3.a the lock_holder previously returned by lock_pk is destroyed,
calling `row_locker::unlock`
3.b row_locker::unlock sees that the partition is not locked
and erases it, including the row locks it contains.
4.a the when_all_succeed continuation in lock_ck runs. Since
    the lock_partition future failed, it destroys both futures.
4.b the lock_row future is destroyed with the rwlock::holder value.
4.c ~holder attempts to return the semaphore units to the row rwlock,
but the latter was already destroyed in 3.b above.
Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling,
as seen above.
This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.
This way, erasing the unlocked partition is never expected
to happen while any of its row locks is held.
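A simplified sketch of the serialized order, using standard primitives
instead of Seastar futures (all names here are illustrative): the
row-lock entry is looked up or created only after the partition lock is
held, so unlock can never erase it concurrently.

```
#include <functional>
#include <map>
#include <memory>
#include <shared_mutex>
#include <string>

struct partition_entry {
    std::shared_mutex partition_lock;
    std::map<std::string, std::shared_ptr<std::shared_mutex>> row_locks;
};

void lock_ck_and_run(partition_entry& p, const std::string& ck,
                     const std::function<void()>& read_op) {
    std::shared_lock partition_held(p.partition_lock);  // first: partition
    // Only now, with the partition lock held and without yielding, find
    // or create the row lock; no unlock can erase it under us. The value
    // is constructed up front even when the entry already exists.
    auto it = p.row_locks.try_emplace(
        ck, std::make_shared<std::shared_mutex>()).first;
    std::shared_lock row_held(*it->second);             // then: the row
    read_op();  // both locks stay held for the duration
}
```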
Fixes #12168
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12208
This commit modifies the view builder and its consumer so that static
rows are always fetched and properly processed during view build.
With this change, the view builder always fetches both static and
clustering rows, regardless of the type of indexes being built. For
indexes on static columns this is wasteful and could be improved so
that only the types of rows relevant to the indexes being built are
fetched - however, doing so sounds a bit complicated and I would rather
start with something simpler which has a better chance of working.
Adjusts the read-before-write query issued in
`table::do_push_view_replica_updates` so that, when needed, it requests
static columns and makes sure that the static row is present.
The `view_update_builder::on_results()` function is changed to react to
static rows when comparing read-before-write results with the base table
mutation.
Adjusts the column_computation interface so that it is able to accept
both clustering and static rows through the common
db::view::clustering_or_static_row interface.
Adds `clustering_or_static_row`, a common, immutable
representation of either a static or a clustering row. It allows
view update generation based on static or clustering rows to be
handled in a uniform way.
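One plausible shape for such a wrapper, sketched with std::variant (a
sketch, not the verbatim ScyllaDB definition):

```
#include <utility>
#include <variant>

struct clustering_row { /* ... */ };
struct static_row { /* ... */ };

class clustering_or_static_row {
    std::variant<clustering_row, static_row> _row;
public:
    explicit clustering_or_static_row(clustering_row r) : _row(std::move(r)) {}
    explicit clustering_or_static_row(static_row r) : _row(std::move(r)) {}

    bool is_static_row() const {
        return std::holds_alternative<static_row>(_row);
    }
    bool is_clustering_row() const {
        return std::holds_alternative<clustering_row>(_row);
    }
};
```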
The `view_info::view_column()` function and `view_column` in view.cc
allow getting a view's column definition corresponding to a given
base-table column. They currently assume that the given column id
corresponds to a regular column. In preparation for secondary indexes
based on static columns, this commit adjusts those functions so that
they accept other kinds of columns, including static columns.
Currently, `base_dependent_view_info::_base_non_pk_columns_in_view_pk`
field keeps a list of non-primary-key columns from the base table which
are a part of the view's primary key. Because the current code does not
allow indexes on static columns yet, the columns kept in the
aforementioned field are always assumed to be regular columns of the
base table and are kept as `column_id`s which do not contain information
about the column kind.
This commit splits the `_base_non_pk_columns_in_view_pk` field into two,
one for regular columns and the other for static columns, so that it is
possible to keep both kinds of columns in `base_dependent_view_info` and
the structure can be used for secondary indexes on static columns.
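A sketch of the split, with member names modeled on this commit message
rather than the verbatim declaration; keeping the two kinds apart
preserves the kind information that a bare column_id lacks:

```
#include <vector>

using column_id = unsigned;  // stand-in for ScyllaDB's column_id

struct base_dependent_view_info {
    // Base-table non-primary-key columns that are part of the view's
    // primary key, now partitioned by column kind:
    std::vector<column_id> base_regular_columns_in_view_pk;
    std::vector<column_id> base_static_columns_in_view_pk;
};
```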
Get the topology from the effective replication map rather
than from the storage_proxy to ensure it's synchronized
with the natural endpoints.
Since there's currently no preemption between the two calls
there is no issue, so this is merely a cleanup
of the code and not supposed to fix anything.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.
This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.
This patch also includes a reproducing test that failed before this
patch, and passes afterwards. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.
Fixes #10026
Fixes #11542
Since they are currently not cleaned up by cleanup compaction,
filter their tokens, processing only tokens owned by the
current node (based on the keyspace replication strategy).
Refs #9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a materialized view has a key (in Alternator, this can be two
keys) which was a regular column in the base table, and a base
update modifies that regular column, there are two distinct cases:
1. If the old and new key values are different, we need to delete the
old view row, and create a new view row (with the different key).
2. If the old and new key values are the same, we just need to update
the pre-existing row.
It's important not to confuse the two cases: if we try to delete and
create the *same* view row with the same timestamp, the result will be
that the row will be deleted (a tombstone wins over data if they have the
same timestamp) instead of updated. This is what we saw in issue #11801.
We had a bug that was seen when an update set the view key column to
the old value it already had: to compare the old and new key values
we used the function compare_atomic_cell_for_merge(), but this compared
not just the values but also, incorrectly, the metadata such as
the timestamp. Because setting a column to the same value changes its
timestamp, we wrongly concluded that these were different view keys
and used the delete-and-create code for this case, resulting in the
view row being deleted (as explained above).
The simple fix is to compare just the key values - not looking at
the metadata.
See tests reproducing this bug and confirming its fix in the next patch.
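An illustrative contrast between the buggy and the fixed comparison;
the cell type and its fields are stand-ins for the real atomic_cell
interface:

```
#include <cstdint>
#include <string>

struct cell {
    std::string value;
    int64_t timestamp;  // changes on every write, even of the same value
};

// Buggy: metadata (the timestamp) participates, so rewriting the same
// value looks like a key change and triggers delete-and-create.
bool same_view_key_buggy(const cell& oldc, const cell& newc) {
    return oldc.value == newc.value && oldc.timestamp == newc.timestamp;
}

// Fixed: only the key's value decides whether it's the same view row.
bool same_view_key_fixed(const cell& oldc, const cell& newc) {
    return oldc.value == newc.value;
}
```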
Fixes #11801
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The replace_entry() function is nothing more than a convenience for
calling delete_old_entry() and then create_entry(). But it is only used
once in the code, and we can just open-code the two calls instead of
the one.
The reason I want to change it now is that the shortcut replace_entry()
helped hide a bug (#11801) - replace_entry() works incorrectly if the
old and new row have the same key, because if they do we get a deletion
and creation of the same row with the same timestamp - and the deletion
wins. Having the two calls not hidden by a convenience function makes
this potential problem more apparent.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Today, we're completely blind to the progress of view updates
on staging files. We don't know how long they will take, nor how much
progress we've made.
This patch adds visibility with a new metric that reports
the number of bytes still to be processed from staging files.
Before any work is done, the metric tells us the total size to be
processed. As view updating progresses, the metric value is
expected to decrease, unless work is being produced faster than
we can consume it.
We're piggybacking on sstables::read_monitor, which allows the
progress metric to be updated whenever the SSTable reader makes
progress.
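A hypothetical sketch of that piggybacking (names are illustrative, not
the actual sstables::read_monitor interface): a per-reader hook
decrements a shared "bytes remaining" gauge as the reader advances.

```
#include <atomic>
#include <cstdint>

std::atomic<int64_t> staging_bytes_remaining{0};

class staging_progress_monitor {
    uint64_t _last_position = 0;
public:
    explicit staging_progress_monitor(uint64_t sstable_size) {
        staging_bytes_remaining += sstable_size;  // announce total up front
    }
    // called whenever the underlying SSTable reader makes progress
    void on_read_progress(uint64_t new_position) {
        staging_bytes_remaining -= new_position - _last_position;
        _last_position = new_position;
    }
};
```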
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11751
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of these
being resurrected and the view builder generating view updates for
them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.
Fixes: #11668
Closes #11671
When resuming a build-step, the view builder injects the partition-start
fragment of the last processed partition, to bring the consumer
(compactor) into the correct state before it starts to consume the
remainder of the partition content. This results in an invalid fragment
stream when the partition was actually over or there is nothing left for
the build step. Make the injection conditional on whether the reader
contains more data for the partition.
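A sketch of the conditional injection, with illustrative names rather
than the actual view builder code:

```
#include <optional>

struct partition_start { /* key, partition tombstone, ... */ };

struct build_step {
    std::optional<partition_start> saved_partition_start;
    bool reader_in_partition = false;  // more fragments left in the
                                       // last processed partition?
};

template <typename Consumer>
void resume(build_step& step, Consumer& consumer) {
    // Inject the saved partition start only if the reader actually has
    // more data for that partition; otherwise the stream stays valid.
    if (step.reader_in_partition && step.saved_partition_start) {
        consumer.consume(*step.saved_partition_start);
    }
    // ...continue feeding the remaining fragments...
}
```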
Fixes: #11607
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and use it to access the repair history maps.
In this introductory patch, we use a default-constructed
tombstone_gc_state to access the thread-local maps
temporarily; those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The structure "bytes_with_action" was very hard to understand because of
its mysterious and general-sounding name, and no comments.
In this patch I add a large comment explaining its purpose, and rename
it to a more suitable name, view_key_and_action, which suggests that
each such object is about one view key (where to add a view row), and
an additional "action" that we need to take beyond adding the view row.
This is the best I can do to make this code easier to understand without
completely reorganizing it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For collection indexes, the logic of computing values for each of the
columns needed to change, since a single column might produce more
than one value as a result.
The liveness info from individual cells of the collection impacts the
liveness info of the resulting rows. Therefore the control flow needed
to be rewritten - instead of functions getting a row from get_view_row
and later computing row markers and applying them, they compute these
values by themselves.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the case of secondary indexes, if an index does not contain in its
key any non-primary-key column from the base table, then it is assumed
that during an update, a change to some cells of the base table cannot
mean that we're dealing with a different row in the view. This,
however, doesn't take into account the possibility of computed columns,
which in fact do depend on some non-primary-key columns. Introduce an
additional property of an index,
has_computed_column_depending_on_base_non_primary_key.
This type of column computation will be used for creating updates to
materialized views that are indexes over collections.
This type features an additional function, compute_values_with_action,
which, depending on an (optional) old row and a new row (the update to
the base table), returns multiple bytes_with_action values: a vector of
(computed value, action) pairs, where the action signifies whether the
row with a specific key needs to be deleted or created.
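A sketch of the described interface; the names follow this commit
message, and the exact signature in the tree may differ:

```
#include <optional>
#include <string>
#include <utility>
#include <vector>

using bytes = std::string;  // stand-in for ScyllaDB's bytes type

enum class row_action { create, del };

// (computed value, action) -- the action says whether the view row
// with this key must be deleted or created.
using bytes_with_action = std::pair<bytes, row_action>;

struct collection_column_computation {
    // Given the (optional) old row and the new row of a base update,
    // produce one entry per affected view row.
    std::vector<bytes_with_action>
    compute_values_with_action(const std::optional<bytes>& old_row,
                               const bytes& new_row) const;
};
```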
The compute_value function of column_computation previously had the
following signature:
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
This is superfluous: never in the history of Scylla was the last
parameter (row) used in any implementation, and never did it return an
empty bytes_opt. The absurdity of this interface can be seen
especially when looking at call sites like the following, where a dummy
empty row was created:
```
token_column.get_computation().compute_value(
*_schema, pkv_linearized, clustering_row(clustering_key_prefix::make_empty()));
```
The query result writer now counts tombstones and cuts the page
(marking it as a short one) when the tombstone limit is reached. This
is to avoid timing out on large spans of tombstones, especially
prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
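A minimal sketch of the counting logic (illustrative, not the actual
query result writer):

```
#include <cstdint>

struct page_state {
    uint64_t tombstone_limit;    // 0 means: ignore tombstones entirely
    uint64_t tombstones_seen = 0;
    bool short_page = false;
};

void on_tombstone(page_state& s) {
    if (s.tombstone_limit == 0) {
        return;  // previous behaviour: tombstones don't count
    }
    if (++s.tombstones_seen >= s.tombstone_limit) {
        s.short_page = true;  // cut the page here, marking it as short
    }
}
```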
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
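A simplified sketch of the tagged-UUID idea (utils::tagged_uuid in the
tree is more featureful): the tag parameter makes table_id and
table_schema_version distinct, non-interchangeable types.

```
#include <cstdint>

struct uuid {
    uint64_t msb = 0, lsb = 0;
    bool operator==(const uuid&) const = default;
};

template <typename Tag>
struct tagged_uuid {
    uuid id;
    bool operator==(const tagged_uuid&) const = default;
};

struct table_id_tag {};
struct table_schema_version_tag {};

using table_id = tagged_uuid<table_id_tag>;
using table_schema_version = tagged_uuid<table_schema_version_tag>;

// table_id t = table_schema_version{};  // now a compile error
```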
Fixes #11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Code that waited for all remote view updates was already there. This
commit modifies the conditions of this wait to take into account the
"synchronous mode" (enabled when db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY is
set).
This PR removes all code that used classes `restriction`, `restrictions` and their children.
There were two fields in `statement_restrictions` that needed to be dealt with: `_clustering_columns_restrictions` and `_nonprimary_key_restrictions`.
Each function was reimplemented to operate on the new expression representation and eventually these fields weren't needed anymore.
After that the restriction classes weren't used anymore and could be deleted as well.
Now all of the code responsible for analyzing WHERE clause and planning a query works on expressions.
Closes #11069
* github.com:scylladb/scylla:
cql3: Remove all remaining restrictions code
cql3: Move a function from restrictions class to the test
cql3: Remove initial_key_restrictions
cql3: expr: Remove convert_to_restriction
cql3: Remove _new from _new_nonprimary_key_restrictions
cql3: Remove _nonprimary_key_restrictions field
cql3: Reimplement uses of _nonprimary_key_restrictions using expression
cql3: Keep a map of single column nonprimary key restrictions
cql3: Remove _new from _new_clustering_columns_restrictions
cql3: Remove _clustering_columns_restrictions from statement_restrictions
cql3: Use a variable instead of dynamic cast
cql3: Use the new map of single column clustering restrictions
cql3: Keep a map of single column clustering key restrictions
cql3: Return an expression in get_clustering_columns_restrctions()
cql3: Reimplement _clustering_columns_restrictions->has_supporting_index()
cql3: Don't create single element conjunction
cql3: Add expr::index_supports_some_column
cql3: Reimplement has_unrestricted_components()
cql3: Reimplement _clustering_columns_restrictions->need_filtering()
cql3: Reimplement num_prefix_columns_that_need_not_be_filtered
cql3: Use the new clustering restrictions field instead of ->expression
cql3: Reimplement _clustering_columns_restrictions->size() using expressions
cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions
cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions
cql3: expr: Add has_only_eq_binops function
cql3: Reimplement _clustering_columns_restrictions->empty() using expressions
All parts of the code that use _nonprimary_key_restrictions
are changed to use _new_nonprimary_key_restrictions instead.
I decided not to split this into multiple commits,
as there aren't a lot of changes and they are
analogous to the ones done before for partition
and clustering columns.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
get_clustering_columns_restrctions() used to return
a shared pointer to the clustering_restrictions class.
Now everything is being converted to expressions,
so it should return an expression as well.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>