When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, each shard did this independently. We now change it to register
all shards "atomically": when a shard registers itself, it also registers all
other shards with an empty status, if they aren't registered yet. This ensures
that we never have a partial state in the table where only some of the shards
are registered; we always have a status for all shards.
The reason we want to register all shards atomically is that if only some of
the shards were registered and we then restart and load the status from the
table, things don't work well, for multiple reasons.
One example: to know how many shards we had previously, we take the maximum
shard id we see in the table. If it's different from the current shard count,
we will execute the reshard code. But of course, if the last shard is missing
from the table because it didn't register itself, this calculation will be
wrong, and we can't know the previous number of shards.
This is a problem: suppose we have two shards, and shard 0 finished building
the view but shard 1 didn't start. When we come up, we will think that we
previously had only a single shard and that it completed building everything,
when in fact we built only approximately half the view. The problem is that we
don't have enough information in the tables to know that.
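To make the miscalculation concrete, here is a minimal, self-contained sketch
(not Scylla's actual code) of the "previous shard count = highest shard id seen
in the table + 1" inference and how a missing registration row skews it:
```
#include <algorithm>
#include <cstdio>
#include <vector>

// Each row in scylla_views_builds_in_progress carries the shard id that wrote it.
// If shard 1 never registered, the table only contains rows for shard 0.
static unsigned inferred_previous_shard_count(const std::vector<unsigned>& registered_shards) {
    if (registered_shards.empty()) {
        return 0;
    }
    // The recovery logic infers the old shard count from the maximum shard id seen.
    return *std::max_element(registered_shards.begin(), registered_shards.end()) + 1;
}

int main() {
    // The node actually had 2 shards, but it restarted before shard 1 registered.
    std::vector<unsigned> rows_in_table = {0};
    // Prints 1, not 2 - so no resharding is detected and shard 1's work is never scheduled.
    std::printf("inferred previous shard count: %u\n", inferred_previous_shard_count(rows_in_table));
}
```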
There are additional problems related to resharding. In the reshard function,
whether it is executed because the node actually resharded or because we
calculated the wrong number of previous shards, if the status of some shard is
missing then the calculation of new ranges will be wrong. When some shard
didn't make progress we should start building the view from scratch. However,
this doesn't happen if we don't have a status for that shard, because the code
looks only at shards that have a status. In effect, such a shard is considered
complete even though it didn't start. This could cause the view building to
get stuck or to complete without building all token ranges.
Registering all shards atomically should solve the above problems, because we
will always have a status for every shard.
Fixes https://github.com/scylladb/scylladb/issues/22989
backport not needed - the issue is probably not common and there's a workaround
Closes scylladb/scylladb#25790
* github.com:scylladb/scylladb:
test: mv: add a test for view build interrupt during registration
view_builder: register view on all shards atomically
The load balancer introduced the idea of badness, which is a measure of how a
tablet migration affects table balance on the source and destination. This is
an abbreviated definition of the badness struct:
```
struct migration_badness {
    double src_shard_badness = 0;
    double src_node_badness = 0;
    double dst_shard_badness = 0;
    double dst_node_badness = 0;
    ...
    double node_badness() const {
        return std::max(src_node_badness, dst_node_badness);
    }
    double shard_badness() const {
        return std::max(src_shard_badness, dst_shard_badness);
    }
};
```
A negative value for either of these 4 members signifies a good
migration (improves table balance), and a positive signifies a bad
migration.
In two places in the balancer, badness for source and destination is
computed independently in two objects of type migration_badness
(src_badness and dst_badness), and later combined into a single object
similar to this:
```
return migration_badness{
    src_badness.shard_badness(),
    src_badness.node_badness(),
    dst_badness.shard_badness(),
    dst_badness.node_badness()
};
```
This is a problem: when, for instance, the source shard badness is good (less
than 0), shard_badness() returns 0 because of std::max(), so the actual
computed badness is not set in the final object.
This can lead to incorrect decisions made later by the balancer, when it
searches for the best migration among a set of candidates.
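A self-contained illustration of the loss; the struct is copied from the
description above, and the "copy each side's own fields" combination at the end
is one possible fix, not necessarily the exact patch:
```
#include <algorithm>
#include <cassert>

struct migration_badness {
    double src_shard_badness = 0;
    double src_node_badness = 0;
    double dst_shard_badness = 0;
    double dst_node_badness = 0;

    double node_badness() const { return std::max(src_node_badness, dst_node_badness); }
    double shard_badness() const { return std::max(src_shard_badness, dst_shard_badness); }
};

int main() {
    migration_badness src_badness, dst_badness;
    src_badness.src_shard_badness = -0.5;   // good: the migration improves the source shard
    dst_badness.dst_shard_badness = 0.2;    // mildly bad on the destination shard

    // Buggy combination: src_badness.shard_badness() is max(-0.5, 0) == 0,
    // so the -0.5 improvement silently disappears from the combined object.
    migration_badness lossy{src_badness.shard_badness(), src_badness.node_badness(),
                            dst_badness.shard_badness(), dst_badness.node_badness()};
    assert(lossy.src_shard_badness == 0);

    // One way to keep the computed values: copy each side's own fields directly.
    migration_badness combined{src_badness.src_shard_badness, src_badness.src_node_badness,
                               dst_badness.dst_shard_badness, dst_badness.dst_node_badness};
    assert(combined.src_shard_badness == -0.5);
}
```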
Closes scylladb/scylladb#26091
In 789a4a1ce7, we adjusted the test file
to work with the configuration option `rf_rack_valid_keyspaces`. Part of
the commit was making the two tables used in the test replicate in
separate data centers.
Unfortunately, that destroyed the point of the test because the tables
no longer competed for resources. We fix that by enforcing the same
replication factor for both tables.
We still accept different values of replication factor when provided
manually by the user (by `--rf1` and `--rf2` commandline options). Scylla
won't allow for creating RF-rack-invalid keyspaces, but there's no reason
to take away the flexibility the user of the test already has.
Fixes scylladb/scylladb#26026
Closes scylladb/scylladb#26115
Currently, plan-making runs in the gossiper scheduling group, because it's
invoked by the topology coordinator. That scheduling group has the same amount
of shares as the user workload. Plan-making can take a significant amount of
time during rebalancing, and we don't want that to impact user workload that
happens to run on the same shard.
Reduce impact by running in the maintenance scheduling group.
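A minimal sketch of the intent, assuming seastar's scheduling-group API;
`make_balancing_plan` and the exact header layout are illustrative, not
Scylla's actual code:
```
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/with_scheduling_group.hh>

seastar::future<> make_balancing_plan();   // assumed plan-making entry point

seastar::future<> make_plan_in_maintenance_group(seastar::scheduling_group maintenance_sg) {
    // Work started inside the lambda runs under maintenance_sg instead of the
    // caller's (gossiper) group, so plan-making no longer competes with the
    // user workload for that group's shares.
    return seastar::with_scheduling_group(maintenance_sg, [] {
        return make_balancing_plan();
    });
}
```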
Fixes #26037
Closes scylladb/scylladb#26046
In a multi-DC setup, the tablet load balancer might generate multiple migrations of the same tablet_id, but only one is actually committed to the `system.tablets` table.
This PR moves the aborting of view building tasks from the very start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we abort only tasks for which the tablet migration was actually started.
The PR also includes a reproducer test.
Fixes scylladb/scylladb#25912
View building coordinator hasn't been released yet, so no backport is needed.
Closes scylladb/scylladb#26144
* github.com:scylladb/scylladb:
test/test_view_building_coordinator: add reproducer
topology_coordinator: abort view building a bit later in case of tablet migration
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.
This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.
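For illustration, a small self-contained sketch (not the actual Scylla config
parsing) of accepting a comma-separated list while only the first entry is used
for now:
```
#include <string>
#include <string_view>
#include <vector>

// Split "http://vs1:6080,http://vs2:6080" into individual URIs.
// Whitespace trimming and validation are omitted for brevity.
static std::vector<std::string> parse_uri_list(std::string_view value) {
    std::vector<std::string> uris;
    while (!value.empty()) {
        auto comma = value.find(',');
        uris.emplace_back(value.substr(0, comma));
        value = (comma == std::string_view::npos) ? std::string_view{} : value.substr(comma + 1);
    }
    return uris;
}

int main() {
    auto uris = parse_uri_list("http://vs1:6080,http://vs2:6080");
    // Until redundancy/load balancing is implemented, only the first URI is contacted.
    const std::string& primary = uris.front();
    (void)primary;
}
```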
This change must be included before the next release.
Otherwise, users will be affected by a breaking change.
References: VECTOR-187
Closes scylladb/scylladb#26033
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.
Previously, each shard did this independently. We now change it to register
all shards "atomically": when a shard registers itself, it also registers all
other shards with an empty status, if they aren't registered yet. This ensures
that we never have a partial state in the table where only some of the shards
are registered; we always have a status for all shards.
The reason we want to register all shards atomically is that if only some of
the shards were registered and we then restart and load the status from the
table, things don't work well, for multiple reasons.
One example: to know how many shards we had previously, we take the maximum
shard id we see in the table. If it's different from the current shard count,
we will execute the reshard code. But of course, if the last shard is missing
from the table because it didn't register itself, this calculation will be
wrong, and we can't know the previous number of shards.
This is a problem: suppose we have two shards, and shard 0 finished building
the view but shard 1 didn't start. When we come up, we will think that we
previously had only a single shard and that it completed building everything,
when in fact we built only approximately half the view. The problem is that we
don't have enough information in the tables to know that.
There are additional problems related to resharding. In the reshard function,
whether it is executed because the node actually resharded or because we
calculated the wrong number of previous shards, if the status of some shard is
missing then the calculation of new ranges will be wrong. When some shard
didn't make progress we should start building the view from scratch. However,
this doesn't happen if we don't have a status for that shard, because the code
looks only at shards that have a status. In effect, such a shard is considered
complete even though it didn't start. This could cause the view building to
get stuck or to complete without building all token ranges.
Registering all shards atomically should solve the above problems, because we
will always have a status for every shard.
Fixes scylladb/scylladb#22989
This patch corrects the column name formatting whenever
an "Undefined column name" exception is thrown.
Until now we used the `name()` function, which returns a bytes object. This
resulted in a message with a garbled byte representation of the column name
instead of a proper string. We switch to the `text()` function, which returns
an sstring instead, making the message readable.
Tests are adjusted to confirm this behavior.
Fixes: VECTOR-228
Closes scylladb/scylladb#26120
Adds a test which reproduces the issue described
in scylladb/scylladb#25912.
The test creates a situation where a single tablet is replicated across
multiple DCs / racks, and all those tablet replicas are eligible for
migration. The tablet load balancer is unpaused at that moment which
currently causes it to attempt to generate multiple migrations for
different tablet replicas of the same tablet. Before the fix for #25912,
this used to confuse the view build coordinator which would react to
each migration attempt, pausing view building work for each tablet
replica for which there was an attempt to migrate but only unpausing it
for the tablet replica that was actually migrated. After the fix, the
view build coordinator only reacts to the migration that has "won" so
the test successfully passes.
In a multi-DC setup, the tablet load balancer might generate multiple
migrations of the same tablet_id, but only one is actually committed to
the `system.tablets` table.
This patch moves the aborting of view building tasks from the very start of
the migration (`<no tablet transition> -> allow_write_both_read_old`) to
the next step (`allow_write_both_read_old -> write_both_read_old`).
This way, we abort only tasks for which the tablet migration was
actually started.
Fixes scylladb/scylladb#25912
Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway?)
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format upgrades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponding cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` to some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with the "highest supported" format
(which used to be set by `sstables_format_selector`), we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
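To illustrate item 5, a self-contained sketch of one way such a selection could
look; this is a guess at the shape, not the actual `sstable_manager` code. The
configured format is honored only if its cluster feature is enabled, otherwise
we fall back to the newest feature-enabled format:
```
#include <algorithm>
#include <vector>

// Simplified stand-ins for sstable versions and cluster feature state.
enum class sstable_version { mc, md, me };

struct feature_state {
    std::vector<sstable_version> enabled; // versions whose cluster feature is on;
                                          // assumed to always contain at least one entry
    bool is_enabled(sstable_version v) const {
        return std::find(enabled.begin(), enabled.end(), v) != enabled.end();
    }
};

// Pick the version for new writes from the live-updatable config value plus
// the currently enabled cluster features.
sstable_version preferred_format(sstable_version configured, const feature_state& features) {
    if (features.is_enabled(configured)) {
        return configured;  // the user's choice wins, including downgrades
    }
    // Otherwise fall back to the newest version the cluster has agreed on.
    return *std::max_element(features.enabled.begin(), features.enabled.end());
}
```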
Closes scylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
Seastar recommends `sharded<>`, and `distributed<>` was left as a
compatibility alias. Latest seastar explicitly marks the alias as deprecated,
so once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
```
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
```
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26136
test_streaming_deadlock_removenode starts 10240 inserts at once,
overloading a node. Due to that, the test fails with a timeout.
Limit the insert concurrency.
Fixes: #25945.
Closes scylladb/scylladb#26102
Start using `write_body` in `rest/client` to properly set headers, due to changes applied to seastar's http client.
Seastar module update
```
b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El
74472298 http: generalize request's Content-Type setting
9fd5a1cc http: generalize reply's Content-Type setting
a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure()
d2a5a8a9 resource.cc: Remove some dead code
7ad9f424 http: Add support of multiple key repetitions for the request
a636baca task: Move task::get_backtrace() definition in its class
a0101efa Fixed "doxygen" spelling in error message
db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes
5357b434 http/reply: introduce set_cookie()
1ddcf05f http/reply: make write_reply*() public
4b782d73 http/connection: start_response(): fix indentation
720feca0 http/reply: encapsulate reply writing in write_reply()
3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes
db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov
dbb2bf3f test: Add test for input_stream::read_exactly()
a5308ec9 file/directory_lister: Correctly wrap up fallback generator
4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer
59801da7 tests: Add directory lister early drop cases
33233032 http/reply: s/write_reply_to_connection/write_reply/
69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream
56e9bda7 test: Convert directory_test into seastar test
96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov
8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream
3370e22a tutorial.md: use current_exception_as_future()
e977453a Add fixture support for seastar::testing
3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally
2a4ae7b4 io_tester: Count file size overflows
5e678bb5 io_tester: Tuneup size overflow check
d5dad8ce io_tester: Move position management code to io_class_data
5586a056 io_tester: Rename seqwrite -> overwrite
92df2fb2 io_tester: Relax return value of create_and_fill_file()
03d9500d io_tester: Dont fill file for APPEND
d6844a7b io_tester: Indentation fix after previous patch
fb9e0088 io_tester: Coroutinize create_and_fill_file()
2f802f57 exceptions: log thrown and propagated exception with distinct log levels
4971fa70 util: move log-level into own header
39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov
52d0c4fb iostream: Move output_stream::write(scattered_message) lower
7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky
d0881b7e read_first_line: Add missing license boilerplate
988a0e99 read_first_line:: Add missing `#pragma once`
42675266 http: Make client::make_request accept const request&
c7709fb5 http: Make request making API return exceptional future not throw
b68ed89b http: Move request content length header setup
1d96dac6 http: Move request version configuration
072e86f6 http: Setup request once
```
Closes scylladb/scylladb#25915
(cherry picked from commit 44d34663bc)
Closes scylladb/scylladb#26100
When cluster repair is run for an entire keyspace, nodetool makes a
repair API request for each table.
If the keyspace contains colocated tables, then the API request for the
colocated tables will fail, because currently Scylla doesn't allow making
repair requests for specific colocated tables, but only for base tables.
If the request is to repair an entire keyspace, then we can ignore this,
because we will make a repair request for all base tables, and this in
turn will also repair all the colocated tables in the keyspace.
However, if specific tables are requested and some of them are colocated,
then we should propagate the error to let the user know the request is
invalid.
Refs https://github.com/scylladb/scylladb/issues/24816
no backport - no colocated tablets in previous releases
Closes scylladb/scylladb#26051
* github.com:scylladb/scylladb:
nodetool: ignore repair request error of colocated tables
storage_service: improve error message on repair of colocated tables
Previously, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, we use top-level computed columns instead of regular ones.
This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all attributes in a row except for keys in new tables.
Refs https://github.com/scylladb/scylladb/pull/24991
Refs https://github.com/scylladb/scylladb/issues/6930
Closes scylladb/scylladb#25796
* github.com:scylladb/scylladb:
alternator: Store LSI keys in :attrs for newly created tables
alternator/test: Add LSI tests based mostly on the existing GSI tests
vector_store_client::stop did not properly clean up the
coroutine that was waiting for a notification on the refresh_client_cv
condition variable. As a result, the coroutine could try to access
`this` (via current_client) after the vector_store_client was destroyed.
To fix this, the `client_producer` tasks are wrapped by a gateway.
The `stop` method now signals the `client_producer` condition variable
and closes the gateway, which ensures that all `client_producer` tasks
are finished before the `stop` function returns.
The `wait_for_signal` return type was changed from `bool` to `void` as the return value was not used.
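A minimal sketch of the described shutdown scheme, assuming the "gateway" is a
seastar::gate and using illustrative member and helper names (this is not the
actual vector_store_client code):
```
// Fragment of a hypothetical owner class; error handling and includes omitted.
seastar::gate _producer_gate;
seastar::condition_variable _refresh_client_cv;
bool _stopped = false;

void spawn_client_producer() {
    // The background fiber only runs while the gate is open, so stop() can
    // wait for it instead of letting it touch `this` after destruction.
    (void)seastar::with_gate(_producer_gate, [this] {
        return seastar::do_until([this] { return _stopped; }, [this] {
            return _refresh_client_cv.wait([this] { return _stopped || needs_refresh(); })
                .then([this] {
                    if (!_stopped) {
                        refresh_client();   // assumed refresh step
                    }
                });
        });
    });
}

seastar::future<> stop() {
    _stopped = true;
    _refresh_client_cv.broadcast();  // wake the producer so it can observe _stopped
    return _producer_gate.close();   // resolves only after the producer fiber exits
}
```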
Fixes: VECTOR-230
Closes scylladb/scylladb#26076
The `json_mutation_stream_parser.cc` file was not included in the
`scylla-tools` CMake target. This could lead to "undefined reference"
linker errors when building with CMake.
This commit adds the missing source file to the target's source list.
Closes scylladb/scylladb#26108
Alternator, when creating a GSI, artificially adds columns that the user
did not ask for. This patch prevents those columns from showing up in
DescribeTable's output.
Fixes #5320
Closes scylladb/scylladb#25978
In the current scenario, the shard receiving the removenode REST API request performs a conditional lock depending on whether raft is enabled or not. Since a non-zero shard returns false for `raft_topology_change_enabled()`, requests routed to non-zero shards take this lock, which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft-enabled nodes.
This PR modifies the conditional lock logic and routes the removenode execution directly to shard 0, so `raft_topology_change_enabled()` is now checked on shard 0 and execution proceeds accordingly.
Earlier, `storage_service::find_raft_nodes_from_hoeps` threw an error upon observing any non-topology member present in ignore_nodes. Since we are performing concurrent removenode operations, timing can lead to one node being fully removed before the other node's remove operation begins processing, which can lead to a runtime error in `storage_service::find_raft_nodes_from_hoeps`. The error was added to prevent users from putting random non-existent nodes in the ignore_nodes list. Hence, the function was changed to account for already-removed nodes and ignore them instead of throwing an error.
A test is also added to confirm the new behaviour, where concurrent removenode operations are now performed seamlessly.
This PR doesn't fix a critical bug. No need to backport it.
Fixes: scylladb/scylladb#24737
Closes scylladb/scylladb#25713
* https://github.com/scylladb/scylladb:
raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters.
storage_service: remove assumptions and checks for ignore_nodes to be normal.
When the monitor is started, the first disk utilization value is
obtained from the actual host filesystem and not from the fake
space source function.
Thus, register a fake space source function before the monitor
is started.
Fixes: https://github.com/scylladb/scylladb/issues/26036
Backport is not required. The test has been added recently.
Closes scylladb/scylladb#26054
Initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream-switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes: we add the new streams with a timestamp that is sufficiently far in the future.
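As a rough, self-contained sketch of that vnode-like scheme (an assumed shape
for illustration, not the actual CDC metadata code): each stream set carries
the timestamp from which it becomes effective, and a write picks the newest set
whose timestamp is not after the write's timestamp, so a set created for a
future split/merge only takes effect once its timestamp is reached.
```
#include <cstdint>
#include <map>
#include <vector>

using stream_id = uint64_t;
using api_timestamp = int64_t;

// Stream sets keyed by the timestamp from which they become effective.
// A set created on tablet split/merge is inserted with a timestamp far
// enough in the future that all nodes learn about it before it activates.
using tablet_stream_sets = std::map<api_timestamp, std::vector<stream_id>>;

const std::vector<stream_id>& streams_for_write(const tablet_stream_sets& sets,
                                                api_timestamp write_ts) {
    // Newest set whose effective timestamp is <= write_ts.
    auto it = sets.upper_bound(write_ts);
    return std::prev(it)->second;   // assumes some set always predates the write
}
```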
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. Tablets are allocated for the CDC table, and a stream is created per tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136
Closes scylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
Eliminate `cql_serialization_format.hh` file by inlining it into `query-request.hh` header since the content is not used anywhere but
the aforementioned header
Removed files:
- cql_serialization_format.hh
Fixes: #22108
This is a cleanup, no need to backport
Closes scylladb/scylladb#25087
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: introducing the new components, Partitions.db and Rows.db
This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller.
This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details.
The changes are for the sake of new and unreleased code. No backporting should be done.
Closes scylladb/scylladb#26075
* github.com:scylladb/scylladb:
sstables/mx/reader: remove mx::make_reader_with_index_reader
test/boost/bti_index_test: fix indentation
sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
sstables/trie: support reader_permit and trace_state properly
sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached
sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
sstables/trie/bti_index_reader: support BYPASS CACHE
test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate
sstables/trie: change the signature of bti_partition_index_writer::finish
sstables/bti_index: improve signatures of special member functions in index writers
streaming/stream_transfer_task: coroutinize `estimate_partitions()`
types/comparable_bytes: add a missing implementation for date_type_impl
sstables: remove an outdated FIXME
storage_service: delete `get_splits()`
sstables/trie: fix some comment typos in bti_index_reader.cc
sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
This command was written for an investigation and was used exactly once.
This would have been a perfect candidate for the (also rarely used)
scylla-sstable script command, but it didn't exist yet.
Drop this command from the tool; such super-specific commands should be
written as sstable-scripts nowadays, which is what we will do if we ever
need this again.
Closes scylladb/scylladb#26062
When a staging sstable is registered with the view building worker, it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually processed).
Until now this was done using a plain `sstables::shared_sstable` (= `lw_shared_ptr`), which is not safe to move between shards.
This patch fixes this by wrapping the pointer in `foreign_ptr` and obtaining the necessary information (owner shard, last token) on the original shard (instead of on shard 0).
All of these are then put into a freshly introduced structure, `staging_sstable_task_info`, which can be safely moved between shards.
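A minimal sketch of the pattern; field and function names are assumptions for
illustration, not the actual view_building_worker code:
```
// Fragment; assumes the seastar and Scylla sstable types are available.
struct staging_sstable_task_info {
    seastar::foreign_ptr<sstables::shared_sstable> sst; // released back on its owner shard
    seastar::shard_id owner_shard;
    dht::token last_token;
};

seastar::future<> register_staging_sstable(sstables::shared_sstable sst) {
    staging_sstable_task_info info;
    info.owner_shard = seastar::this_shard_id();
    info.last_token = last_token_of(*sst);            // assumed helper; runs on the owner shard
    info.sst = seastar::make_foreign(std::move(sst)); // pin ownership to this shard
    // The plain fields plus the foreign_ptr can now hop to shard 0 and back safely.
    return seastar::smp::submit_to(0, [info = std::move(info)] () mutable {
        return create_view_building_task(std::move(info)); // assumed shard-0 entry point
    });
}
```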
Fixes https://github.com/scylladb/scylladb/issues/25859
View building coordinator isn't present in any release yet, no backport needed.
Closes scylladb/scylladb#25832
* github.com:scylladb/scylladb:
db/view/view_building_worker: fix indent
db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`
db/view/view_building_worker: use table id in `register_staging_sstable_tasks()`
db/view/view_building_worker: move helper functions higher
Previously the sharded abort_sources were stopped at the end of `batch::do_work()`, which runs in parallel with the view building worker's main loop.
This leads to races because the worker may call `batch::abort()`, which accesses the abort_sources.
This patch solves this by changing `sharded<abort_source>` into a plain `abort_source`.
Since `batch::do_work()` is now executed on the tasks' shard, all abort source checks are also done on the tasks' shard.
The only place where shard 0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on the tasks' shard exclusively.
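A condensed sketch of that scheme; member names are illustrative, not the
actual batch code:
```
// Fragment of a hypothetical batch owned by the shard-0 worker.
seastar::abort_source _as;        // a single, plain abort_source
seastar::shard_id _tasks_shard;   // the shard that runs do_work()

// Called on shard 0; the actual request_abort() runs on the tasks' shard,
// so the abort_source is only ever touched from that shard.
seastar::future<> abort() {
    return seastar::smp::submit_to(_tasks_shard, [this] {
        _as.request_abort();
    });
}

// Called from do_work(), which the worker submits to _tasks_shard,
// so this check is shard-local as well.
void check_abort() {
    _as.check();                  // throws abort_requested_exception if aborted
}
```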
Fixes https://github.com/scylladb/scylladb/issues/25805
Fixes https://github.com/scylladb/scylladb/issues/26045
View building coordinator hasn't been released yet, so no backport needed.
Closes scylladb/scylladb#26059
* github.com:scylladb/scylladb:
db/view/view_building_worker: fix indents
db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`
db/view/view_building_worker: execute entire `batch::do_work` on tasks shard
db/view/view_building_worker: store reference to sharded worker in batch
When cluster repair is run for an entire keyspace, nodetool makes a
repair API request for each table.
If the keyspace contains colocated tables, then the API request for the
colocated tables will fail, because currently Scylla doesn't allow making
repair requests for specific colocated tables, but only for base tables.
If the request is to repair an entire keyspace, then we can ignore this,
because we will make a repair request for all base tables, and this in
turn will also repair all the colocated tables in the keyspace.
However, if specific tables are requested and some of them are colocated,
then we should propagate the error to let the user know the request is
invalid.
Refs scylladb/scylladb#24816
Currently, repair requests can't be added or deleted on non-base
colocated tables. Improve the error message and comments to be
clearer and more detailed.
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411
Before `sl:driver` was introduced, service levels were assigned as
follows:
1. New connections were processed in `main`.
2. After user authentication was completed, the connection's SL was
changed to the user's SL (or `sl:default` if the user had no SL).
This commit introduces `service_level_state` to `client_state` and
implements the following logic in `transport/server`:
1. If `sl:driver` is not present in the system (for example, it was
removed), service levels behave as described above.
2. If `sl:driver` is present, the flow is:
I. New connections use `sl:driver`.
II. After user authentication is completed, the connection's SL is
changed to the user's SL (or `sl:default`).
III. If a REGISTER (for events) request is handled, this connection is
the control connection. We mark the client_state
to permanently use `sl:driver`.
The aforementioned state `2.III` is represented by the
`_control_connection` flag in `client_state`.
Fixes: scylladb/scylladb#24411
Before this change, unauthorized connections stayed in the `main`
scheduling group. This is not ideal; in such a case `sl:default`
should be used instead, to keep the behavior consistent with the scenario
where a user is authenticated but has no service level assigned.
This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.
Fixes: scylladb/scylladb#26040
Before this change, new connections were handled in a default
scheduling group (`main`), because before the user is authenticated
we do not know which service level should be used. With the new
`sl:driver` service level, creation of new connections can be moved to
`sl:driver`.
We switch the service level as early as possible, in `do_accepts`.
There is a possibility that `sl:driver` does not exist yet, for
instance in specific upgrade cases, or if it was removed. Therefore,
we also switch to `sl:driver` after a connection is accepted.
Refs: scylladb/scylladb#24411
The driver service level is a special service level that is created
automatically by the system. Therefore, it requires special handling
in DESC SCHEMA WITH INTERNALS, and these tests verify the special
behavior.
Refs: scylladb/scylladb#24411
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is ensuring the existence of `sl:driver` in new systems and in tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
This commit implements `get_create_driver_service_level_mutations`
and `migrate_to_driver_service_level` in service_level_controller.
Both methods create `sl:driver` with shares=200 and store this fact
in `system.scylla_local`. Both methods will be used later in this
patch series for automatic creation of sl:driver.
Refs: scylladb/scylladb#24411
Later in this patch series, `sl:driver` will be added as a special
service level created automatically by the system. It needs special
handling in `DESC SCHEMA ...` to ensure that during backup restore:
1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists
2. If `sl:driver` exists, its configuration is fully restored (emit
ALTER SERVICE LEVEL).
3. If `sl:driver` was removed, the information is retained (emit
DROP SERVICE LEVEL instead of CREATE/ALTER).
Refs: scylladb/scylladb#24411
This adds a reference to sl_controller so that, later in this patch
series, topology_coordinator can manage creating `sl:driver` once
group0 is fully operational.
Refs: scylladb/scylladb#24411
This commit extends the system.scylla_local table with an additional
key/value pair that can be used later in this patch series to
record that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.
A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in the state machine snapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.
Refs: scylladb/scylladb#24411
Previously, tests used the hardcoded value 7 for the maximum number of
user service levels. This commit introduces a named variable that can
be shared across tests to avoid cases where this magic number goes
out of sync.
According to the changes in the Vector Store API (VECTOR-148), the `embedding`
term should be changed to `vector`. As the term `vector` is used for the STL
class, the internal type and variable names are changed to `vs_vector` (aka
vector store vector) instead. This patch also changes the HTTP ANN JSON request
payload according to the Vector Store API changes.
Fixes: VECTOR-229
Closes scylladb/scylladb#26050
Capacity-based balancing was introduced in 2025.1. It computes balance
based on a node's capacity: the number of tablets located on a node
should be directly proportional to that node's storage capacity.
This change adds this explanation to the docs.
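For illustration, the proportionality can be written as a tiny self-contained
computation (not the balancer's actual code):
```
#include <cstdio>
#include <vector>

// Target tablet count per node: total_tablets * node_capacity / total_capacity.
std::vector<double> target_tablets(const std::vector<double>& capacities_gb, double total_tablets) {
    double total_capacity = 0;
    for (double c : capacities_gb) {
        total_capacity += c;
    }
    std::vector<double> targets;
    for (double c : capacities_gb) {
        targets.push_back(total_tablets * c / total_capacity);
    }
    return targets;
}

int main() {
    // Two 1 TB nodes and one 2 TB node sharing 1024 tablets -> 256, 256, 512.
    for (double t : target_tablets({1000, 1000, 2000}, 1024)) {
        std::printf("%.0f ", t);
    }
    std::printf("\n");
}
```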
Fixes: #25686
Closes scylladb/scylladb#25687