Commit Graph

639 Commits

Author SHA1 Message Date
Amnon Heiman
19a414598b db/view/view.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_view_builder_builds_in_progress

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Kefu Chai
6e4cb20a69 tree: implement boost::accumulate with std::ranges library
Replace boost::accumulate() calls with std::ranges facilities. This
change reduces external dependencies and modernizes the codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23062
2025-02-26 23:22:02 +02:00
Andrzej Jackowski
b4f0a5149a db: cql3: add comments regarding unsafe interval<clustering_key_prefix>
class clustering_range is a range of Clustering Key Prefixes implemented
as interval<clustering_key_prefix>. However, due to the nature of
Clustering Key Prefix, the ordering of clustering_range is complex and
does not satisfy the invariant of interval<>. To be more specific, as a
comment in interval<> implementation states: “The end bound can never be
smaller than the start bound”. As a range of CKP violates the invariant,
some algorithms, like intersection(), can return incorrect results.
For more details refer to scylladb#8157, scylladb#21604, scylladb#22817.

This commit:
 - Add a WARNING comment to discourage usage of clustering_range
 - Add WARNING comments to potentially incorrect uses of
   interval<clustering_key_prefix> non-trivial methods
 - Add a FIXME comment to incorrect use of
   interval<clustering_key_prefix_view>::deoverlap and WARNING comments
   to related interval<clustering_key_prefix_view> misuse.

Closes scylladb/scylladb#22913
2025-02-26 12:01:28 +01:00
Kefu Chai
7ff0d7ba98 tree: Remove unused boost headers
This commit eliminates unused boost header includes from the tree.

Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22857
2025-02-15 20:32:22 +02:00
Nadav Har'El
cae8a7222e alternator: fix view build on oversized GSI key attribute
Before this patch, the regular_column_transformation constructor, which
we used in Alternator GSIs to generate a view key from a regular-column
cell, accepted a cell of any size. As a reviewer (Avi) noticed, very
long cells are possible, well beyond what Scylla allows for keys (64KB),
and because regular_column_transformation stores such values in a
contiguous "bytes" object it can cause stalls.

But allowing oversized attributes creates an even more acute problem:
While view building (backfilling in DynamoDB jargon), if we encounter
an oversized (>64KB) key, the view building step will fail and the
entire view building will hang forever.

This patch fixes both problems by adding to regular_column_transformation's
constructor the check that if the cell is 64KB or larger, an empty value
is returned for the key. This causes the backfilling to silently skip
this item, which is what we expect to happen (backfilling cannot do
anything to fix or reject the pre-existing items in the base table).
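
The size check described above can be sketched roughly as follows. This is a minimal illustration, not the actual regular_column_transformation code: the helper name `view_key_from_cell`, the constant `max_key_size`, and the use of `std::optional` to model the "empty value" are all assumptions made for the example.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Assumed limit for this sketch: the commit message speaks of cells that are
// "64KB or larger".
constexpr std::size_t max_key_size = 64 * 1024;

// Hypothetical helper. Returns the view key derived from a regular-column
// cell, or std::nullopt when the cell is too large to serve as a key, which
// tells the caller to skip the item instead of failing the whole build.
std::optional<std::string> view_key_from_cell(std::string_view cell) {
    if (cell.size() >= max_key_size) {
        return std::nullopt;
    }
    return std::string(cell);
}
```

With this shape, a backfill loop can simply skip items for which no key is returned, so one oversized attribute no longer blocks the rest of the view build.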

A test test_gsi_updatetable.py::test_gsi_backfill_oversized_key is
introduced to reproduce this problem and its fix. The test adds a 65KB
attribute to a base table, and then adds GSIs to this table with this
attribute as its partition key or its sort key. Before this patch, the
backfilling process for the new GSIs hangs, and never completes.
After this patch, the backfilling completes and as expected contains
other base-table items but not the item with the oversized attribute.
The new test also passes on DynamoDB.

However, while implementing this fix I realized that issue #10347 also
exists for GSIs. Issue #10347 is about the fact that DynamoDB limits
partition key and sort key attributes to 2048 and 1024 bytes,
respectively. In the fix described above we only handled the acute case
of lengths above 64 KB, but we should actually skip items whose GSI
keys are over 2048 or 1024 bytes - not 64KB. This extra checking is
not handled in this patch, and is part of a wider existing issue:
Refs #10347

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:50 +01:00
Nadav Har'El
7a0027bacc mv: clean up do_delete_old_entry
The function do_delete_old_entry() had an if() which was supposedly for
the case of collection column indexing, and which our previous patch
that improved this function to support caller-specified deletion_ts
left behind.

As a reviewer noticed, the new tombstone-setting code was in an "else"
of that existing if(), and it wasn't clear what happens if we get to that
else in the collection column indexing. So I reviewed the code and added
breakpoints and realized that in fact, do_delete_old_entry() is never
called for the collection-indexing case, which has its own
update_entry_for_computed_column() which view_updates::generate_update()
calls instead of the do_delete_old_entry() function and its friends.
So it appears that do_delete_old_entry() doesn't need that special
case at all, which simplifies it.

We should eventually simplify this code further. In particular, the
function generate_update() already knows the key of the rows it
adds or deletes so do_delete_old_entry() and its friends don't need
to call get_view_rows() to get it again. But these simplifications
and others will need to come in a later patch series; this one is
already long enough :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:49 +01:00
Nadav Har'El
bc7b5926d2 mv: support regular_column_transformation key columns in view
In an earlier patch, we introduced regular_column_transformation,
a new type of computed column that does a computation on a cell in a
regular column in the base and returns a potentially transformed cell
(value or deletion, timestamp and ttl). In *this* patch, we wire the
materialized view code to support this new kind of computed column that
is usable as a materialized-view key column. This new type of computed
column is not yet used in this patch - this will come in the next
patch, where we will use it for Alternator GSIs.

Before this patch, the logic of deciding when the view update needs
to create a new row or delete an old one, and which timestamp and ttl
to give to the new row, could depend on one (or two - in Alternator)
cells read from base-table regular columns. In this patch, this logic
is rewritten - the notion of "base table regular columns" is generalized
to the notion of "updatable view key columns" - these are view key
columns that an update may change - because they really are base regular
columns, or a computed function of one (regular_column_transformation).

In some sense, the new code is easier to understand - there is no longer
a separate "compute_row_marker()" function, rather the top-level
generate_update() is now in charge of finding the "updatable view key
columns" and calculating the row marker (timestamp and ttl) as part
of deciding what needs to be done.

But unfortunately the code still has separate code paths for "collection
secondary indexing", and also for old-style column_computation (basically,
only token_column_computation). Perhaps in the future this can be further
simplified.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:49 +01:00
Nadav Har'El
c8ea9f8470 mv: introduce regular_column_transformation, a new type of computed column
In the patches that follow, we want Alternator to be able to use as a
key for a materialized view (GSI) not a real column from the schema,
but rather an attribute value deserialized from a member of the ":attrs"
map.

For this, we need the ability for a materialized view to define a key
column which is computed as a function of a real column (":attrs").

We already have an MV feature which we called "computed column"
(column_computation), but it is wholly inadequate for this job:
column_computation can only take a partition key, and produce a value -
while we need it to take a regular column (one member of ":attrs"),
not just the partition key, and return a cell - value or deletion,
timestamp and TTL.

So in this patch we introduce a new type of computed column, which we
called "regular_column_transformation" since it intends to perform some
sort of transformation on a single column (or more accurately, a single
atomic cell). The limitation that this function transforms a single
column only is important - if we had a function of multiple columns,
we wouldn't know which timestamp or ttl it should use for the result
if the two columns had different timestamps or TTLs.
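
The idea of a single-cell transformation can be sketched as below. The `cell` struct and `transform_cell` are hypothetical names invented for this illustration and do not reflect the real class interface; the point is only that the output cell inherits its timestamp and TTL from the one input cell, and that a deletion stays a deletion.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for an atomic cell: either a live value or a
// deletion, together with a write timestamp and an optional TTL in seconds.
struct cell {
    std::optional<std::string> value;  // std::nullopt models a deletion
    std::int64_t timestamp = 0;
    std::optional<std::int32_t> ttl;
};

// Applies fn to the value of a single input cell. Because there is exactly
// one input cell, there is no ambiguity about which timestamp or TTL the
// result should carry.
template <typename Fn>
cell transform_cell(const cell& in, Fn&& fn) {
    cell out;
    out.timestamp = in.timestamp;
    out.ttl = in.ttl;
    if (in.value) {
        out.value = fn(*in.value);
    }
    return out;
}
```

A function of two cells would have to pick or merge two timestamps and two TTLs, which is the ambiguity the single-column limitation avoids.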

The new class isn't wired to anything yet: The MV code cannot handle
it yet, and the Alternator code will not use it yet.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-02-06 09:59:48 +01:00
Michael Litvak
6d34125eb7 view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
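
The failure mode and the fix can be reproduced with a small simulation. This is a simplified model under stated assumptions, not the view builder code: tokens are plain integers, `min_token` and `max_token` are made-up ring bounds, and the `wrapped` flag and `passes_until_built` function exist only in this sketch.

```cpp
#include <cstdint>
#include <vector>

using token = std::int64_t;
constexpr token min_token = 0;     // assumed ring start for the sketch
constexpr token max_token = 1000;  // assumed ring end for the sketch

// Simulates building one view that started at first_token over the sorted
// tokens of the partitions held locally. Returns the pass in which the view
// is detected as built, or -1 if it is not detected within max_passes.
int passes_until_built(const std::vector<token>& local_tokens,
                       token first_token,
                       bool advance_at_end_of_stream,
                       int max_passes = 5) {
    bool wrapped = false;  // true once we went back to the minimum token
    token current = first_token;
    for (int pass = 1; pass <= max_passes; ++pass) {
        for (token t : local_tokens) {
            if (t < current) {
                continue;  // behind the current read position
            }
            current = t;   // consume this partition
            if (wrapped && current >= first_token) {
                return pass;
            }
        }
        // End of the reader's stream for this pass.
        if (advance_at_end_of_stream) {
            current = max_token;  // simulate a partition at the end of the range
            if (wrapped && current >= first_token) {
                return pass;
            }
        }
        current = min_token;  // go back to the start of the ring
        wrapped = true;
    }
    return -1;
}
```

With local tokens {10, 20, 30} and a first token of 50, the model without the end-of-stream step never detects completion, while the model with it finishes in the second pass.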

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493
2025-01-30 14:35:18 +02:00
Benny Halevy
dd21d591f6 network_topology_strategy_test: add tablets rack_aware_view_pairing tests
Test the simple case of base/view pairing with replication_factor
that is a multiple of the number of racks.

As well as the complex case when simple_tablets_rack_aware_view_pairing
is not possible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
249b793674 view: get_view_natural_endpoint: implement rack-aware pairing for tablets
Enabled with the tablets_rack_aware_view_pairing cluster feature,
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map.

We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
  is a multiple of the number of racks and the minimum number of nodes
  per rack in the dc is greater than or equal to rf / nr_racks.

  In this case (that includes the single rack case), all racks would
  have the same number of replicas, so we first filter all replicas
  by dc and rack, retaining their ordinality in the process, and
  finally, we pair between the base replicas and view replicas,
  that are in the same rack, using their original order in the
  tablet-map replica set.

  For example, nr_racks=2, rf=4:
  base_replicas = { N00, N01, N10, N11 }
  view_replicas = { N11, N12, N01, N02 }
  pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
  Note that we don't optimize for self-pairing if it breaks pairing ordinality.

- Complex rack-aware pairing: when the replication factor is not
  a multiple of nr_racks.  In this case, we attempt best-match
  pairing in all racks, using the minimum number of base or view replicas
  in each rack (given their global ordinality), while pairing all the other
  replicas, across racks, sorted by their ordinality.

  For example, nr_racks=4, rf=3:
  base_replicas = { N00, N10, N20 }
  view_replicas = { N11, N21, N31 }
  pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 }
  * cross-rack pair

  If we'd simply stable-sort both base and view replicas by rack,
  we might end up with much worse pairing across racks:
  { N00, N11 }*, { N10, N21 }*, { N20, N31 }*
  * cross-rack pair
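
  Both cases can be illustrated with one small sketch. It is a simplified
  model under assumptions, not the get_view_natural_endpoint code: the
  `replica` struct, the `pair_replicas` function and the integer rack ids
  are invented for the example, and a single dc is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical replica description: a node name and the rack it lives in.
struct replica {
    std::string name;
    int rack;
};

using pairing = std::vector<std::pair<std::string, std::string>>;

// Pairs base and view replicas inside each rack first, by their order in the
// replica lists, then pairs whatever is left across racks, again by order.
pairing pair_replicas(const std::vector<replica>& base,
                      const std::vector<replica>& view) {
    std::map<int, std::vector<std::size_t>> base_by_rack, view_by_rack;
    for (std::size_t i = 0; i < base.size(); ++i) {
        base_by_rack[base[i].rack].push_back(i);
    }
    for (std::size_t j = 0; j < view.size(); ++j) {
        view_by_rack[view[j].rack].push_back(j);
    }

    std::vector<std::string> partner(base.size());
    std::vector<bool> base_done(base.size(), false);
    std::vector<bool> view_done(view.size(), false);

    // Same-rack pairs: as many as both sides have in that rack.
    for (const auto& [rack, b_idx] : base_by_rack) {
        auto it = view_by_rack.find(rack);
        if (it == view_by_rack.end()) {
            continue;
        }
        const auto& v_idx = it->second;
        const std::size_t n = std::min(b_idx.size(), v_idx.size());
        for (std::size_t k = 0; k < n; ++k) {
            partner[b_idx[k]] = view[v_idx[k]].name;
            base_done[b_idx[k]] = true;
            view_done[v_idx[k]] = true;
        }
    }

    // Cross-rack pairs for the leftovers, in original order.
    std::size_t j = 0;
    for (std::size_t i = 0; i < base.size(); ++i) {
        if (base_done[i]) {
            continue;
        }
        while (j < view.size() && view_done[j]) {
            ++j;
        }
        if (j == view.size()) {
            break;  // too few view replicas: this base replica stays unpaired
        }
        partner[i] = view[j].name;
        base_done[i] = true;
        view_done[j] = true;
    }

    pairing result;
    for (std::size_t i = 0; i < base.size(); ++i) {
        if (base_done[i]) {
            result.emplace_back(base[i].name, partner[i]);
        }
    }
    return result;
}
```

  Feeding it the two replica sets from the examples above, with the first
  digit of each node name taken as its rack, gives the pairings listed there,
  including the single cross-rack pair in the rf=3 case.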

Fixes scylladb/scylladb#17147

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
0e388a1594 view: get_view_natural_endpoint: handle case when there are too few view replicas
Currently, when reducing RF, we may drop replicas from
the view before dropping replicas from the base table.
Since get_view_natural_endpoint is allowed to return
a disengaged optional if it can't find a pair for the
base replica, replace the existing assertion with code
handling this case, and count those events in a new
table metric: total_view_updates_failed_pairing.

Note that this does not fix the root cause for the issue
which is the unsynchronized dropping of replicas, that
should be atomic, using a single group0 transaction.

Refs scylladb/scylladb#21492

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
858b0a51f8 view: get_view_natural_endpoint: track replica locator::nodes
Rather than tracking only the replica host_id, keep
track of the locator::node& to prepare for
rack-aware pairing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
cadd33bdf6 view: get_view_natural_endpoint: refactor predicate function
Simplify the function logic by calculating the predicate
function once, before scanning all base and view replicas,
rather than testing the different options in the inner loop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
97f85e52f7 view: get_view_natural_endpoint: clarify documentation
"self-pairing" is enabled only when use_legacy_self_pairing
is enabled.  That is currently unclear in the documentation
comment for this function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
6d4de30a3a view: mutate_MV: optimize remote_endpoints filtering check
Currently we always look up both `my_address` and *target_endpoint
in remote_endpoints. But if my_address is in remote_endpoints
in some cases the second lookup is not needed, so do it only
to decide whether to swap target_endpoint with my_address, if
found in remote_endpoints, or to remove that match, if
*target_endpoint is already pending as well.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
91d3bf8ebc view: mutate_MV: lookup base and view erms synchronously
Although at the moment storage_service::replicate_to_all_cores
may yield between updating the base and view tables with
a new effective_replication_map, scylladb/scylladb#21781
was submitted to change that so that they are updated
atomically together.

This change prepares for the above change, and is harmless
at the moment.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
d04cdce0fc view: mutate_MV: calculate keyspace-dependent flags once
All views live in the same keyspace as their base
table, so calculate the keyspace-dependent flags
once, outside the per-view update loop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Kamil Braun
89ee2a6834 Merge 'drop ip addresses from token metadata' from Gleb
Now that all topology related code uses host ids there is no point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.

Fixes: scylladb/scylla#21777

* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
  hint manager: do not translate ip to id in case hint manager is stopped already
  locator: token_metadata: drop update_host_id() function that does nothing now
  locator: topology: drop indexing by ips
  repair: drop unneeded code
  storage_service: use host_id to look for a node in on_alive handler
  storage_proxy: translate ips to ids in forward array using gossiper
  locator: topology: remove unused functions
  storage_service: check for outdated ip in on_change notification in the peers table
  storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
  topology coordinator: change connection dropping code to work on host ids
  cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
  locator: drop unused function from tablet_effective_replication_map
  api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
  locator: token_metadata: remove unused ip based functions
  locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
  gossiper: drop get_unreachable_token_owners functions
  storage_service: use gossiper to map ip to id in node_ops operations
  storage_service: fix indentation after the last patch
  storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
  token_metadata: drop no longer used functions
  ...
2025-01-17 11:00:52 +01:00
Gleb Natapov
122d58b4ad api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
2025-01-16 16:37:07 +02:00
Gleb Natapov
844cb090bf view: do not use get_endpoint_for_host_id_if_known to check if a node is part of the topology
Check directly in the topology instead.
2025-01-15 16:30:28 +02:00
Michael Litvak
7a6aec1a6c view_builder: hold semaphore during entire startup
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.
2025-01-14 12:31:29 +02:00
Michael Litvak
1104411f83 view_builder: pass view name by value to write_view_build_status
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.

The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.
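
The hazard is easy to reproduce outside the view builder. In the sketch below, `run_both`, `write_status_by_ref` and `write_status_by_value` are hypothetical stand-ins, not the real functions; `run_both` plays the role of a function that may invoke both of the callbacks it is given.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a function that takes two callbacks and,
// depending on some state, may run both of them.
std::vector<std::string> run_both(std::function<std::string()> first,
                                  std::function<std::string()> second) {
    return {first(), second()};
}

// Problematic shape: both lambdas move from the same referenced string, so
// the second one sees a moved-from (valid but unspecified) value.
std::vector<std::string> write_status_by_ref(std::string& view_name) {
    return run_both(
        [&view_name] { return std::move(view_name); },
        [&view_name] { return std::move(view_name); });
}

// Safer shape: the name is taken by value and each lambda owns its own copy,
// so moving out of one copy cannot affect the other.
std::vector<std::string> write_status_by_value(std::string view_name) {
    return run_both(
        [name = view_name]() mutable { return std::move(name); },
        [name = view_name]() mutable { return std::move(name); });
}
```

Calling `write_status_by_value("my_view")` yields the name twice, whereas the by-reference variant gives no guarantee about the second element.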
2025-01-14 12:31:29 +02:00
Michael Litvak
b1be2d3c41 view_builder: write status to tables before starting to build
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes scylladb/scylladb#20638
2025-01-14 12:31:20 +02:00
Michael Litvak
2a8ff478f0 view_builder: register listener for new views before reading views
When starting the view builder, we find all existing views in
`calculate_shard_build_step` and then register a listener for new views.
Between these steps we may yield and create a new view, then we miss
initializing the view build step for the new view, and we won't start
building it.

To fix this we first register the listener and then read existing views,
so a view can't be missed.

Fixes scylladb/scylladb#20338

Closes scylladb/scylladb#22184
2025-01-09 13:18:28 +02:00
Kefu Chai
e4463b11af treewide: replace boost::algorithm::join() with fmt::join()
Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve
performance and reduce dependency on Boost. `fmt::join()` allows direct
formatting of ranges and tuples with custom separators without creating
intermediate strings.

When formatting comma-separated values into another string, fmt::join()
avoids the overhead of temporary string creation that
`boost::algorithm::join()` requires. This change also helps streamline
our dependencies by leveraging the existing fmt library instead of
Boost.Algorithm.

To avoid the ambiguity, some caller sites were updated to call
`seastar::format()` explicitly.

See also

- boost::algorithm::join():
  https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp
- fmt::join():
  https://fmt.dev/11.0/api/#ranges-api

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22082
2025-01-07 12:45:05 +02:00
Wojciech Mitros
37a25d3af4 mv: avoid stalls when calculating affected clustering ranges
Currently, when finishing db::view::calculate_affected_clustering_ranges
we deoverlap, transform and copy all ranges prepared before. This
is all done within a single continuation and can cause stalls.

We fix this by adding yields after each transform and moving elements
to the final vector one by one instead of copying them all at the end.

After this change, the longest continuation in this code will be
deoverlapping the initial ranges (and one transform). While it has
a relatively high computational complexity (we sort all ranges), it
should execute quickly because we're operating on views there and
we don't need to copy the actual bytes. If we encounter a stall there,
we'll need to implement an asynchronous `deoverlap` method.

Fixes scylladb/scylladb#21843

Closes scylladb/scylladb#21846
2024-12-19 12:50:30 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Botond Dénes
34a8b492be Merge 'materialized view: make flow-control maximum delay configurable' from Piotr Dulikowski
This pull request is a continuation of scylladb/scylladb#20688 - contents of the main commit are the same, the only change is the additional commit with a test.

Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays.

This hard-coded one-second maximum delay was considered *huge* - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one-second latency for writes wasn't having any noticeable effect on slowing down the client.

So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before.

Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it.

The new parameter's help string mentions both these use cases of the parameter.

Fixes #18187

This is new functionality, no need to backport to any open source release.

Closes scylladb/scylladb#21647

* github.com:scylladb/scylladb:
  materialized views: test for the MV delay configuration parameter
  service: add injection for skipping view update backlog
  materialized view: make flow-control maximum delay configurable
2024-12-16 14:20:33 +02:00
muthu90tech
e49381119d locator: topology: use node& instead of node*
This change goes through locator::topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in unordered_set, in
those cases the node is wrapped in std::reference_wrapper.
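
A reference cannot be stored in a standard container directly, which is why the wrapper is needed. The following sketch shows the general pattern; the `node` struct, the choice to hash and compare by `host_id`, and the helper names are assumptions for illustration and may differ from what locator::topology does.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical minimal node type for the sketch.
struct node {
    std::string host_id;
};

// Hash and equality for wrapped references. This sketch identifies a node by
// its host_id; comparing addresses would be another valid choice.
struct node_ref_hash {
    std::size_t operator()(std::reference_wrapper<const node> n) const noexcept {
        return std::hash<std::string>{}(n.get().host_id);
    }
};

struct node_ref_equal {
    bool operator()(std::reference_wrapper<const node> a,
                    std::reference_wrapper<const node> b) const noexcept {
        return a.get().host_id == b.get().host_id;
    }
};

using node_set = std::unordered_set<std::reference_wrapper<const node>,
                                    node_ref_hash, node_ref_equal>;

// Takes node& rather than node*: the signature itself says null is impossible.
// Returns true if the node was newly inserted.
bool add_node(node_set& nodes, const node& n) {
    return nodes.insert(std::cref(n)).second;
}
```

The set only refers to the nodes, so the referenced objects must outlive it.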

Fixes scylladb/scylladb#20357

Closes scylladb/scylladb#21863
2024-12-12 13:22:55 +01:00
Kefu Chai
ce2f80c227 treewide: migrate from boost::make_iterator_range to ranges::subrange
Replace boost::make_iterator_range() with std::ranges::subrange.

This change improves code modernization and reduces external dependencies:

- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>

This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21787
2024-12-09 21:31:53 +02:00
Avi Kivity
9024e4940c counters.hh: drop unused boost includes
Re-add them to source files that need them.

Closes scylladb/scylladb#21738
2024-12-05 12:27:41 +02:00
Nadav Har'El
49f11f655c materialized view: make flow-control maximum delay configurable
Until this patch, the materialized view flow-control algorithm
(https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/)
used a constant delay_limit_us hard-coded to one second, which means
that when the size of view-update backlog reached the maximum (10%
of memory), we delay every request by an additional second - while
smaller amounts of backlog will result in smaller delays.

This hard-coded one-second maximum delay was considered *huge* - it will
slow down a client with concurrency 1000 to just 1000 requests per
second - but we already saw some workloads where it was not enough -
such as a test workload running very slow reads at high concurrency
on a slow machine, where a latency of over one second was expected
for each read, so adding a one-second latency for writes wasn't having
any noticeable effect on slowing down the client.

So this patch replaces the hard-coded default with a live-updateable
configuration parameter, `view_flow_control_delay_limit_in_ms`, which
defaults to 1000ms as before.

Another useful way in which the new `view_flow_control_delay_limit_in_ms`
can be used is to set it to 0. In that case, the view-update flow
control always adds zero delay, and in effect - does absolutely
nothing. This setting can be used in emergency situations where it
is suspected that the MV flow control is not behaving properly, and
the user wants to disable it.

The new parameter's help string mentions both these use cases of
the parameter.
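
The "concurrency 1000 gives 1000 requests per second" figure follows from simple closed-loop arithmetic: a client with N requests in flight, each taking L seconds, completes at most N / L requests per second. The helper below is only a worked example of that arithmetic; its name and parameters are made up and it is not part of the flow-control code.

```cpp
#include <chrono>
#include <limits>

// Upper bound on the request rate of a closed-loop client that keeps
// `concurrency` requests in flight, when every request takes base_latency
// plus the delay added by flow control.
double max_requests_per_second(unsigned concurrency,
                               std::chrono::milliseconds added_delay,
                               std::chrono::milliseconds base_latency) {
    const std::chrono::duration<double> total = added_delay + base_latency;
    if (total.count() <= 0.0) {
        return std::numeric_limits<double>::infinity();  // nothing limits the client
    }
    return concurrency / total.count();
}
```

With a concurrency of 1000, a one-second added delay and negligible base latency, the bound is 1000 requests per second. Raising the delay limit to two seconds halves it, and a delay limit of 0 adds no throttling at all.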

Fixes #18187

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-12-05 09:51:56 +01:00
Avi Kivity
841481c202 Merge "move storage proxy and adjacent services to identify hosts by ids" from Gleb
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).

The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.

Fixes: scylladb/scylladb#6403

perf-simple-query -- smp 1 -m 1G output

Before:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41291 insns/op,   24485 cycles/op,        0 errors)
62669.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41277 insns/op,   24695 cycles/op,        0 errors)
69172.12 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41326 insns/op,   24463 cycles/op,        0 errors)
56706.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41143 insns/op,   24513 cycles/op,        0 errors)
56416.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41186 insns/op,   24851 cycles/op,        0 errors)

         throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
  cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70

After:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   40733 insns/op,   23145 cycles/op,        0 errors)
59283.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40624 insns/op,   23948 cycles/op,        0 errors)
70851.03 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40625 insns/op,   23027 cycles/op,        0 errors)
70549.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40650 insns/op,   23266 cycles/op,        0 errors)
68634.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40622 insns/op,   22935 cycles/op,        0 errors)

         throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
  cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/

Tested mixed cluster manually.
"

* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
  group0: drop unused field from replace_info struct
  test: rename raft_address_map_test to address_map_test and move it from raft tests
  raft_address_map: remove raft address map
  topology coordinator: do not modify expire state for left/new nodes any more in raft address map
  topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
  group0: drop raft address map dependency from raft_rpc
  group0: move raft_ticker_type definition from raft_address_map.hh
  storage_service: do not update raft address map on gossiper events
  group0: drop raft address map dependency from raft_server_with_timeouts
  group0: move group0 upgrade code to host ids
  repair: drop raft address map dependency
  group0: remove unused raft address map getter from raft_group0
  group0: drop raft address map from group0_state_machine dependency since it is not used there any more
  group0: remove dependency on raft address map from group0_state_id_handler
  gossiper: add get_application_state_ptr that searches by host_id
  gossiper: change get_live_token_owners to return host ids
  view: move view building to host id
  hints: use host id to send hints
  storage_proxy: remove id_vector_to_addr since it is no longer used
  db: consistency_level: change is_sufficient_live_nodes to work on host ids
  ...
2024-12-03 18:18:48 +02:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23, we have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::views::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::views::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::views::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::views::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::views::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple of places where we use
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Gleb Natapov
20d1b80535 view: move view building to host id
Use host ids in view building code as well.
2024-12-02 10:31:13 +02:00
Kefu Chai
a5ee0c896b treewide: migrate from boost::adaptors::filtered to std::views::filter
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.

Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views

Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices

Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
   - `std::ranges::to` adaptor requires rvalue references
   - Necessitated updates to variable and parameter constness
   - Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
     from `common` to enable efficient range conversion

2. Range Iteration and Mutation
   - Range views may mutate internal state during iteration
   - Cannot pass ranges by const reference in some scenarios
   - Solution: Pass ranges by rvalue reference to explicitly indicate
     state invalidation

Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
  due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change

This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21648
2024-11-26 14:26:50 +02:00
Dawid Mędrek
f913ae571f db/view: Don't generate view updates for unselected columns
The semantics of Scylla's materialized views may vary depending on how their
primary keys correspond to the base table's one. One of the differences is
how we handle writes to columns in the base table that are not selected by
a view:

* Case 1: The view's PK is a permutation of the base table's PK:

  Since the view's primary key cannot be changed in an update, a row in
  the view remains alive as long as the corresponding row in the base table
  is alive.

  The tricky part comes when the base table has columns that are NOT selected
  by the view. CQL3 used to not allow for defining a table that didn't have
  any other columns besides its primary key. Also, when inserting a row into
  a table, it was mandatory to provide at least one value aside from the
  primary key. At some point it changed [1] and the implementation of the
  solution relied on the notion of the row marker.

  Putting the details aside, consider the following scenario:

  (i)   the base table has a primary key consisting of columns
        c_1, ..., c_k, and it has regular columns rc_1, ..., rc_n,
  (ii)  the primary key of an MV defined on that table consists of
        a permutation of c_1, ..., c_k. The MV doesn't select at least
        one of the regular columns of the base table. Without loss of
        generality, let that unselected column be rc_1.
  (iii) the base table has a row R whose only non-null value is the one
        in the regular column rc_1.

  Now, what will R correspond to in the MV? The base table doesn't have a row
  marker, but all of its regular columns in the MV will be NULLs. That's NOT
  allowed.

  To solve that problem, all unselected columns have corresponding virtual
  columns in the MV; the only information they provide is whether there is
  a value in the base table or not. This way, the MV knows if a row is still
  alive or not.

  For that reason, we send view updates to virtual columns in the following
  cases:

  (i)  the value in the column changes from NULL to a value, i.e. it's
       created,
  (ii) the value in the column exists, but its TTL has been updated.

* Case 2: The view's PK has one more column than the base table's one:

  Since the primary key of the view has a regular column C from the base
  table, it is guaranteed that if there's a row in the MV, the corresponding
  row in the base table can remain alive: since C is part of the view's PK,
  it must have a value, so the row in the base table has a value in C too.
  The problem with virtual columns from the previous case doesn't manifest
  in this one. The liveness of the cell in C determines the liveness of
  the whole row in the view.

The semantics gets more complex, but the conclusion is this: in case 1,
virtual columns exist and we may need to generate view updates for them,
while in case 2 virtual columns do NOT exist and so we don't generate
view updates for them.

This patch adjusts the code accordingly. If a view has
a regular column from the base table as part of its primary key, we
no longer emit view updates when we change a column unselected by that
view. It is purely an OPTIMIZATION change.

[1]: https://issues.apache.org/jira/browse/CASSANDRA-4361

Fixes scylladb/scylladb#21652

Closes scylladb/scylladb#21653
2024-11-24 19:01:28 +02:00
Kefu Chai
33a0e5b892 treewide: replace boost::find_if with std::ranges::find_if
now that we are allowed to use C++23, we have the luxury of using
`std::ranges::find_if`.

in this change, we:

- replace `boost::find_if` with `std::ranges::find_if`
- remove all `#include <boost/range/algorithm/find_if.hpp>`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-19 10:50:01 +08:00
Nadav Har'El
da99dc3a7f cross-tree: change to_sstring_view() to to_string_view()
For historic reasons, we have (in bytes.hh) a type sstring_view which
is an alias for std::string_view - since the same standard type can hold
a pointer into both a seastar::sstring and std::string.

This alias is unnecessary and misleading to new developers (who might
assume it is somehow different from std::string_view). This patch doesn't
yet remove all occurrences of sstring_view (the request in #4062), but
begins to do it by renaming one commonly-used function, to_sstring_view(bytes),
to to_string_view(), and of course changes all its uses to the new name.
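
A minimal sketch of what the alias and the renamed function amount to is
shown below. It is an approximation under assumptions: the real `bytes`
type is not reproduced here, and `bytes_sketch` is a hypothetical stand-in
built on `std::vector<std::int8_t>`.

```cpp
#include <cstdint>
#include <string_view>
#include <type_traits>
#include <vector>

// The historic alias: the same type as std::string_view, just another name.
using sstring_view = std::string_view;
static_assert(std::is_same_v<sstring_view, std::string_view>);

// Hypothetical stand-in for the tree's `bytes` type.
using bytes_sketch = std::vector<std::int8_t>;

// View the raw bytes as characters without copying them. The returned view
// is only valid while `b` is alive and unmodified.
inline std::string_view to_string_view(const bytes_sketch& b) {
    return {reinterpret_cast<const char*>(b.data()), b.size()};
}
```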

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 14:57:49 +02:00
Kefu Chai
59eb2ab119 treewide: s/boost::algorithm::any_of/std::ranges::any_of/
now that we are allowed to use C++23, we have the luxury of using
`std::ranges::any_of`.

in this change, we replace `boost::algorithm::any_of` with
`std::ranges::any_of`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-05 14:06:09 +08:00
Avi Kivity
8c67f9b42e cql3: util: remove unneeded boost/range includes from header files
The includes are redistributed to the source files that need them.

Closes scylladb/scylladb#21391
2024-10-31 23:49:44 +01:00
Nadav Har'El
ee2d75b088 Merge 'Generalize "breakpoint" type of error injection' from Pavel Emelyanov
The pattern is: if requested by a test, suspend code execution until the requestor (the test) explicitly wakes it up. To do that, the injection site passes a lambda that is called with a so-called "handler" and tries to read a message from that handler. In many cases the lambda also prints a log message that the test waits for, to make sure the injection point was reached. In the end, such a "breakpoint" is injected like this:

```
    co_await inject("foo", [] (auto& handler) -> future<> {
        log.info("foo waiting");
        co_await handler.wait_for_message(timeout);
    });
```

This PR makes breakpoints shorter and more unified, like this

```
    co_await inject("foo", wait_for_message(timeout));
```

where `wait_for_message` is a wrapper structure used to pick new `inject()` overload.
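
The overload selection can be sketched without coroutines as shown below. This is a simplified synchronous model, not the real error injection API: `injector_sketch` and `handler_sketch` are hypothetical, and "waiting" is replaced by appending to an in-memory log.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <vector>

// Hypothetical handler: records the wait instead of actually blocking.
struct handler_sketch {
    std::vector<std::string>& log;
    void wait_for_message(std::chrono::milliseconds) { log.push_back("waited"); }
};

// Wrapper structure whose only job is to select the convenience overload.
struct wait_for_message {
    std::chrono::milliseconds timeout;
};

class injector_sketch {
    std::vector<std::string> _log;
public:
    // General overload: the caller supplies the whole lambda.
    void inject(const std::string&, const std::function<void(handler_sketch&)>& fn) {
        handler_sketch h{_log};
        fn(h);
    }
    // Convenience overload: logs a message, then waits on the handler.
    void inject(const std::string& name, wait_for_message w) {
        inject(name, [&](handler_sketch& h) {
            _log.push_back(name + ": waiting for message");
            h.wait_for_message(w.timeout);
        });
    }
    const std::vector<std::string>& log() const { return _log; }
};
```

Because `wait_for_message` is not callable, it cannot convert to the `std::function` parameter, so the two overloads never compete for the same call.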

Closes scylladb/scylladb#21342

* github.com:scylladb/scylladb:
  sstables: Use inject(wait_for_message_overload)
  treewide,error_injection: Use inject(wait_for_message) and fix tests
  treewide,error_injection: Use inject(wait_for_message) overload
  error_injection: Add inject() overload with wait_for_message wrapper
2024-10-31 21:56:27 +02:00
Wojciech Mitros
88ab8db944 mv: run view building in streaming scheduling group
View building is an expensive process that takes a long time to complete.
During the build, its impact on other work should be minimized, even at
the expense of slightly slowing it down.

Instead, view building is currently performed in the same scheduling
group (gossip) as other high-priority tasks, in particular raft processing,
which slows it down, making races more likely and increasing the number
of retries that need to be done.

While view building is still initiated in the gossip group (as it's the
result of adding a view, which is a schema change), in this patch the bulk
of the view building work is moved to a low-priority, maintenance scheduling
group (named "streaming" after its main use case).

Additionally, a test is added, where we make sure that the scheduling
group is the one most used when building a view.

Fixes https://github.com/scylladb/scylladb/issues/21232

Closes scylladb/scylladb#21326
2024-10-31 10:13:20 +01:00
Avi Kivity
907da210b6 compound_compat: replace use of boost ranges with std ranges
To reduce the dependency load, replace use of boost ranges
with the std equivalent.

Files that lost the indirect boost dependency have it added as a
direct dependency.
2024-10-30 19:58:07 +02:00
Pavel Emelyanov
7d8cc3ccc2 treewide,error_injection: Use inject(wait_for_message) overload
Many places want to inject a handler that waits for an external kick. Now
there's a convenience inject() method overload for this. It will result in
extra messages in logs, but so far no code/test cares about it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Kefu Chai
24d14b601b treewide: s/boost::adaptors::map_values/std::views::values/
now that we are allowed to use C++23, we have the luxury of using
`std::views::values`.

in this change, we:

- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because
  we cannot use `std::views::concat` yet. this helper is only
  available in C++26.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21265
2024-10-27 21:32:45 +02:00
Nadav Har'El
5fd3177057 Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros
When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload.

This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency.

The second issue fixed by this patch is the concurrency of view updates, which is currently unlimited. Because of that, view updates may take up so much memory that we may run out of memory.

This is fixed by using the read admission on the view update concurrency semaphore.
This limits the number of concurrent view update reads to
max_count_concurrent_view_update_reads, all other incoming view update reads are
queued using just a small chunk of memory. Without this, the reads would also get
queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue.

The new semaphore has half the capacity of the regular user read concurrency semaphore and is currently used only for user writes. It's used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction.
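
The isolation this gives can be modelled with a small sketch. It is not the reader concurrency semaphore from the tree: `units_sketch` and `read_admission_sketch` are hypothetical, they count abstract units only, and they reject instead of queueing when no units are left.

```cpp
#include <cstddef>

// Hypothetical unit counter standing in for a read concurrency semaphore.
class units_sketch {
    std::size_t _available;
public:
    explicit units_sketch(std::size_t n) : _available(n) {}
    // Takes one unit if any is left; a real semaphore would queue instead.
    bool try_acquire() {
        if (_available == 0) {
            return false;
        }
        --_available;
        return true;
    }
    void release() { ++_available; }
};

// Two independent pools: exhausting one does not affect the other.
struct read_admission_sketch {
    units_sketch user_reads;
    units_sketch view_update_reads;  // half the capacity of user_reads
    explicit read_admission_sketch(std::size_t user_units)
        : user_reads(user_units), view_update_reads(user_units / 2) {}
};
```

With a single shared pool, the view update read-before-writes would consume the same units that user reads need, which is the latency problem described above.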

This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view update workload.

The issue of view updates causing increased latencies most often occurs in the following scenario:
* we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows
* we have any read workload
* one replica is slower or is handling more writes due to an imbalance of data distribution
* we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it.
* each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued
* the queue is shared by regular reads and view update reads. When there are enough view update reads in the queue, regular reads start getting increased latencies

An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario:
* the tables were prepared satisfying the conditions above
* we use a medium write workload and a very low read workload
* the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update
* to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond.

In the test above:
* without the fix, the latency of reads increases over 50s
* with the fix, the latency of reads stays below 20ms

Fixes https://github.com/scylladb/scylladb/issues/8873
Fixes https://github.com/scylladb/scylladb/issues/15805

The patch is not that small and it isn't fixing a regression, so no backports

Closes scylladb/scylladb#20887

* github.com:scylladb/scylladb:
  test: add test for high view update concurrency causing bad_allocs
  test: add test for high view update concurrency degrading read latency
  mv: add a dedicated read concurrency semaphore for view update read before writes
2024-10-22 22:17:23 +03:00
Avi Kivity
847c850034 schema: add accessors for primary key columns and non-primary-key columns
It's somewhat common to ask for the partition key and clustering key
columns, or for the static and regular columns. Provide accessors for them
rather than requiring the user to glue them.

Some callers are converted.

Closes scylladb/scylladb#21191
2024-10-22 15:01:14 +02:00
Kefu Chai
6ead5a4696 treewide: move log.hh into utils/log.hh
the log.hh under the root of the tree was created to keep backward
compatibility when seastar was extracted into a separate library.
log.hh should belong to the `utils` directory, as it is based solely
on seastar, and can be used by all subsystems.

in this change, we move log.hh into utils/log.hh so that it is more
modularized. this also improves readability: when one sees
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility. please
note, we do have two other `log.hh` in the tree.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 06:54:46 +03:00