Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Current mechanism to deprecate config options is implemented in a hacky
way in `main.cpp` and doesn't account for existing
`db::config/boost::po` API controlling lifetime of config options, hence
it's being replaced in this PR by adding yet another `value_status`
enumerator: `Deprecated`, so that deprecation of config options is
controlled in one place in `config.cc`,i.e. when specifying config options.
Motivation: https://docs.google.com/document/d/18urPG7qeb7z7WPpMYI2V_lCOkM5YGKsEU78SDJmt8bM/edit?usp=sharing
With this change, if a `Deprecated` config option is specified as
1. a command line parameter, scylla will run and log:
```
WARN 2023-11-25 23:37:22,623 [shard 0:main] init - background-writer-scheduling-quota option ignored (deprecated)
```
(Previously it was only a message printed to standard output, not a
scylla log of warn level).
2. an option in `scylla.yaml`, scylla will run and log:
```
WARN 2023-11-27 23:55:13,534 [shard 0:main] init - Option is deprecated : background_writer_scheduling_quota
```
Fixes#15887
Incorporates dropped https://github.com/scylladb/scylladb/pull/15928Closesscylladb/scylladb#16184
This miniset, completes the prerequisites for enabling commitlog hard limit on by default.
Namely, start flushing and evacuating segments halfway to the limit in order to never hit it under normal circumstances.
It is worth mentioning that hitting the limit is an exceptional condition which it's root cause need to be resolved, however,
once we do hit the limit, the performance impact that is inflicted as a result of this enforcement is irrelevant.
Tests: unit tests.
LWT write test (#9331)
A whitebox testing has been performed by @wmitros , the test aimed at putting as much pressure as possible on the commitlog segments by using a write pattern that rewrites the partitions in the memtable keeping it at ~85% occupancy so the dirty memory manager will not kick in. The test compared 3 configurations:
1. The default configuration
2. Hard limit on (without changing the flush threshold)
3. the changes in this PR applied.
The last exhibited the "best" behavior in terms of metrics, the graphs were the flattest and less jaggy from the others.
Closesscylladb/scylladb#10974
* github.com:scylladb/scylladb:
commitlog: enforce commitlog size hard limit by default
commitlog: set flush threshold to half of the limit size
commitlog: unfold flush threshold assignment
This series adds handling for more failures during a topology operation
(we already handle a failure during streaming). Here we add handling of
tablet draining errors by aborting the operation and handling of errors
after streaming where an operation cannot be aborted any longer. If the
error happens when rollback is no longer possible we wait for ring delay
and proceed to the next step. Each individual patch that adds the sleep
has an explanation what the consequences of the patch are.
* 'gleb/topology-coordinator-failures' of github.com:scylladb/scylla-dev:
test: add test to check errro handling during tablet draining
test: fix test_topology_streaming_failure test to not grep the whole file
storage_service: add error injection into the tablet migration code
storage_service: topology coordinator: rollback on handle_tablet_migration failure during tablet_draining stage
storage_service: topology coordinator: do not retry the metadata barrier forever in write_both_read_new state
storage_service: topology coordinator: do not retry the metadata barrier forever in left_token_ring state
storage_service: topology coordinator: return a node that is being removed from get_excluded_nodes
storage_service: topology_coordinator: use new rollback_to_normal state in the rollback procedure
storage_service: topology coordinator: add rollback_to_normal node state
storage_service: topology coordinator: put fence version into the raft state
storage_service: topology coordinator: do fencing even if draining failed
This situation before this patch is that when tablets are enabled for
a keyspace, we can create a materialized view but later any write to
the base table fails with an on_internal_error(), saying that:
"Tried to obtain per-keyspace effective replication map of test
but it's per-table."
Indeed, with tablets, the replication is different for each table - it's
not the same for the entire keyspace.
So this patch changes the view update code to take the replication
map from the specific base table, not the keyspace.
This is good enough to get materialized-views reads and writes working
in a simple single-node case, as the included test demonstrates (the
test fails with on_internal_error() before this patch, and passes
afterwards).
But this fix is not perfect - the base-view pairing code really needs
to consider not only the base table's replication map, but also the
view table's replication map - as those can be different. We'll fix
this remaining problem as a followup in a separate patch - it will
require a substantially more elaborate test to reproduce the need
for the different mapping and to verify that fix.
Fixes#16209.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#16211
Fixes#16207
commitlog::delete_segments deletes (or recycles) segments replayed.
The actual file size here is added to footprint so actual delete then
can determine iff things should be recycled or removed.
However, we build a pending delete list of named_files, and the files
we added did not have size set. Bad. Actual deletion then treated files
as zero-byte sized, i.e. footprint calculations borked.
Simple fix is just filling in the size of the objects when addind.
Added unit test for the problem.
Closesscylladb/scylladb#16210
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6ec6
However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d1ce).
Flushing all sstables in the database release
all references to commitlog segments and there
it maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since when flushing all
keyspaces using `nodetool flush` the `force_keyspace_compaction`
api is invoked for keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
Fixesscylladb/scylladb#15777Closesscylladb/scylladb#15820
* github.com:scylladb/scylladb:
docs: nodetool: flush: enrich examples
docs: nodetool: compact: fix example
api: add /storage_service/compact
api: add /storage_service/flush
compaction_manager: flush_all_tables before major compaction
database: add flush_all_tables
api: compaction: add flush_memtables option
test/nodetool: jmx: fix path to scripts/scylla-jmx
scylla-nodetool, docs: improve optional params documentation
When a view schema is changed, the schema change command also includes
mutations for the corresponding base table; these mutations don't modify
the base schema but are included in case if the receiver of view
mutations somehow didn't receive base mutations yet (this may in theory
happen outside Raft mode).
There are situations where the schema change command contains both
mutations that describe the current state of the base table -- included
by a view update, as explained above -- and mutations that want to
modify the base table. Such situation arises, for example, when we
update a user-defined type which is referenced by both a view and its
corresponding base table. This triggers a schema change of the view,
which generates mutations to modify the view and includes mutations of
the current base schema, and at the same time it triggers a schema
change of the base, which generates mutations to modify the base.
These two sets of mutations are conflicting with each other. One set
wants to preserve the current state of the base table while the other
wants to modify it. And the two sets of mutations are generated using
the same timestamp, which means that conflict resolution between them is
made on a per-mutation-cell basis, comparing the values in each cell and
taking the "larger" one (meaning of "larger" depends on the type of each
cell).
Fortunately, this conflict is currently benign -- or at least there is
no known situation where it causes problems.
Unfortunately, it started causing problems when I attempted to implement
group 0 schema versioning (PR scylladb/scylladb#15331), where instead of
calculating table versions as hashes of schema mutations, we would send
versions as part of schema change command. These versions would be
stored inside the `system_schema.scylla_tables` table, `version` column,
and sent as part of schema change mutations.
And then the conflict showed. One set of mutations wanted to preserve
the old value of `version` column while the other wanted to update it.
It turned out that sometimes the old `version` prevailed, because the
`version` column in `system_schema.scylla_tables` uses UUID-based
comparison (not timeuuid-based comparison). This manifested as issue
scylladb/scylladb#15530.
To prevent this, the idea in this commit is simple: when generating
mutations for the base table as part of corresponding view update, do
not use the provided timestamp directly -- instead, decrement it by one.
This way, if the schema change command contains mutations that want to
modify the base table, these modifying mutations will win all conflicts
based on the timestamp alone (they are using the same provided
timestamp, but not decremented).
One could argue that the choice of this timestamp is anyway arbitrary.
The original purpose of including base mutations during view update was
to ensure that a node which somehow missed the base mutations, gets them
when applying the view. But in that case, the "most correct" solution
should have been to use the *original* base mutations -- i.e. the ones
that we have on disk -- instead of generating new mutations for the base
with a refreshed timestamp. The base mutations that we have on disk have
smaller timestamps already (since these mutations are from the past,
when the base was last modified or created), so the conflict would also
not happen in this case.
But that solution would require doing a disk read, and we can avoid the
read while still fixing the conflict by using an intermediate solution:
regenerating the mutations but with `timestamp - 1`.
Ref: scylladb/scylladb#15530Closesscylladb/scylladb#16139
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6ec6
However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d1ce).
Flushing all sstables in the database release
all references to commitlog segments and there
it maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since when flushing all
keyspaces using `nodetool flush` the `force_keyspace_compaction`
api is invoked for keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
Fixesscylladb/scylladb#15777
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done runtime, scylla re-reads this config upon HUP signal.
Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well.
fixes: #13745Closesscylladb/scylladb#15635
* github.com:scylladb/scylladb:
test: Add object_store test to validate config reloading works
test: Add config update facility to test cluster
test: Make S3_Server export config file as pathlib.Path
config: Make object storage config updateable_value_source
memtable: Extend list of checking codes
sstables/storage/s3: Fix missing TOC status check
s3/client: Map http exceptions into storage_io_error
exceptions: Extend storage_io_error construction options
- remove some code that is obsolete in newer Scylla versions,
- fix some minor bugs. These bugs appear to be benign, there are no known issues caused by them, but fixing them is a good idea nevertheless,
- refactor some code for better maintainability.
Parts of this PR were extracted from https://github.com/scylladb/scylladb/pull/15331 (which was merged but later reverted), parts of it are new.
Closesscylladb/scylladb#16162
* github.com:scylladb/scylladb:
test/pylib: log_browsing: fix type hint
migration_manager: take `abort_source&` in get_schema_for_read/write
migration_manager: inline merge_schema_in_background
migration_manager: remove unused merge_schema_from overload
migration_manager: assume `canonical_mutation` support
migration_manager: add `std::move` to avoid a copy
schema_tables: refactor `scylla_tables(schema_features)`
schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
system_keyspace: fix outdated comment
The `scylla_tables` function gives a different schema definition
for the `system_schema.scylla_tables` table, depending on whether
certain schema features are enabled or not.
The way it was implemented, we had to write `θ(2^n)` amount
of code and comments to handle `n` features.
Refactor it so that the amount of code we have to write to handle `n`
features is `θ(n)`.
When a topology coordinator rolls back from unsuccessful topology operation it
advances the fence (which is now in the raft state) after moving to normal
state. We do not want this to fail (only majority of nodes is needed for
it to not to), but currently it may fail in case the coordinator moves
to another node after changing the rollback node's state to normal, but
before updating the fence. To solve that the rollback operation needs to
go through a new rollback_to_normal state that will do the fencing
before moving to normal. This patch introduces that state, but does not use
it yet.
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
the operator<<() based formatter is only used in its test, so
let's move it to where it is used.
we can always bring it back later if it is required in other places.
but better off implementing it as a fmt::formatter<> then.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16142
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.
in this change, we define a formatter for db::seed_provider_type.
please note, we are still formatting vector<db::seed_provider_type>
with the helper provided by seastar/core/sstring.hh, which uses
operator<<() to print the elements in the vector being printed.
so we have to keep the operator<< formatter before disabling
the generic formatter for vector<T>.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16138
Since the commitlog size hard limit is a failsafe mechanism,
we don't expect to ever hit it. If we do hit the limit, it means
that we have an exceptional condition in the system. Hence, the
impact of enforcing the commitlog hard limit is irrelevant.
Here we enforce the limit by default.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Once we enable commitlog hard limit by default, we would like
to have some room in case flushing memtables takes some time
to catch up. This threshold is half the limit.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
This commit is only a cosmetic change. It is meant to
make the flush threshold assignment more readable and
comprehensible so future changes are easier to review.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Now its plain updateable_value, but without the ..._source object the
updateable_value is just a no-op value holder. In order for the
observers to operate there must be the value source, updating it would
update the attached updateable values _and_ notify the observers.
In order for the config to be the u.v._source, config entries should be
comparable to each other, thus the <=> operator for it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since CRC is already handled by disk blocks, we can remove some of the
entry CRC:ing, both simplifying code and making at least that part of
both write and read faster.
Breaks the file into individually tagged + crc:ed pages.
Each page (sized as disk write alignment) gets a trailing
12-byte metadata, including CRC of the first page-12 bytes,
and the ID of the segment being written.
When reading, each page read is CRC:ed and checked to be part
of the expected segment by comparing ID:s. If crc is broken,
we have broken data. If crc is ok, but ID does not match, we
have a prematurely terminated segment (truncated), which, depending
on whether we use batch mode or not, implied data loss.
Refs #11845
When replaying, differentiate between the two cases for failure we have:
- A broken actual entry - i.e. entry header/data does not hold up to
crc scrutiny
- Truncated file - i.e. a chunk header is broken or unreadable. This can
be due to either "corruption" (i.e. borked write, post-corruption, hw
whatever), or simply an unterminated segment.
The difference is that the former is recoverable, the latter is not.
We now signal and report the two separately. The end result for a user
is not much different, in either case they imply data loss and the
need for repair. But there is some value in differentiating which
of the two we encountered.
Modifies and adds test cases.
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.
This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.
fixes: #16001
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16091
Fixes#15269
If segment being replayed is corrupted/truncated we can attempt skipping
completely bogues byte amounts, which can cause assert (i.e. crash) in
file_data_source_impl. This is not a crash-level error, so ensure we
range check the distance in the reader.
v2: Add to corrupt_size if trying to skip more than available. The amount added is "wrong", but at least will
ensure we log the fact that things are broken
Closesscylladb/scylladb#15270
Currently when the coordinator decides to move the fence it issues an
RPC to each node and each node locally advances fence version. This is
fine if there are no failures or failures are handled by retrying
fencing, but if we want to allow topology changes to progress even in
the presence of barrier failures it is easier to store the fence version
in the raft state. The nodes that missed fence rpc may easily catch up
to the latest fence version by simply executing a raft barrier.
this change is a cleanup.
to mark a return value without value semantics has no effect. these
`const` specifier useless. so let's drop them.
and, if we compile the tree with `-Wignore-qualifiers`, the compiler
would warn like:
```
/home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers]
245 | const index_metadata_kind kind() const;
| ^~~~~
```
so this change also silences the above warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).
This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.
This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue #15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.
Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense to keep use system memory.
In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.
Fixes: scylladb/scylladb#15622Closesscylladb/scylladb#15972
The sstable currently can move between normal, staging and quarantine state runtime. For S3-backed sstables the state change means maintaining the state itself in the ownership table and updating it accordingly.
There's also the upload facility that's implemented as state change too, but this PR doesn't support this part.
fixes: #13017Closesscylladb/scylladb#15829
* github.com:scylladb/scylladb:
test: Make test_sstables_excluding_staging_correctness run over s3 too
sstables,s3: Support state change (without generation change)
system_keyspace: Add state field to system.sstables
sstable_directory: Tune up sstables entries processing comment
system_keyspace: Tune up status change trace message
sstables: Add state string to state enum class convert
This patch series adds error handling for streaming failure during
topology operations instead of an infinite retry. If streaming fails the
operation is rolled back: bootstrap/replace nodes move to left and
decommissioned/remove nodes move back to normal state.
* 'gleb/streaming-failure-rollback-v4' of github.com:scylladb/scylla-dev:
raft: make sure that all operation forwarded to a leader are completed before destroying raft server
storage_service: raft topology: remove code duplication from global_tablet_token_metadata_barrier
tests: add tests for streaming failure in bootstrap/replace/remove/decomission
test/pylib: do not stop node if decommission failed with an expected error
storage_service: raft topology: fix typo in "decommission" everywhere
storage_service: raft topology: add streaming error injection
storage_service: raft topology: do not increase topology version during CDC repair
storage_service: raft topology: rollback topology operation on streaming failure.
storage_service: raft topology: load request parameters in left_token_ring state as well
storage_service: raft topology: do not report term_changed_error during global_token_metadata_barrier as an error
storage_service: raft topology: change global_token_metadata_barrier error handling to try/catch
storage_service: raft topology: make global_token_metadata_barrier node independent
storage_service: raft topology: split get_excluded_nodes from exec_global_command
storage_service: raft topology: drop unused include_local and do_retake parameters from exec_global_command which are always true
storage_service: raft topology: simplify streaming RPC failure handling
There are some schema modifications performed automatically (during
bootstrap, upgrade etc.) by Scylla that are announced by multiple calls
to `migration_manager::announce` even though they are logically one
change. Precisely, they appear in:
- `system_distributed_keyspace::start`,
- `redis:create_keyspace_if_not_exists_impl`,
- `table_helper::setup_keyspace` (for the `system_traces` keyspace).
All these places contain a FIXME telling us to `announce` only once.
There are a few reasons for this:
- calling `migration_manager::announce` with Raft is quite expensive --
taking a `read_barrier` is necessary, and that requires contacting a
leader, which then must contact a quorum,
- we must implement a retrying mechanism for every automatic `announce`
if `group0_concurrent_modification` occurs to enable support for
concurrent bootstrap in Raft-based topology. Doing it before the FIXMEs
mentioned above would be harder, and fixing the FIXMEs later would also
be harder.
This PR fixes the first two FIXMEs and improves the situation with the
last one by reducing the number of the `announce` calls to two.
Unfortunately, reducing this number to one requires a big refactor. We
can do it as a follow-up to a new, more specific issue. Also, we leave a
new FIXME.
Fixing the first two FIXMEs required enabling the announcement of a
keyspace together with its tables. Until now, the code responsible for
preparing mutations for a new table could assume the existence of the
keyspace. This assumption wasn't necessary, but removing it required
some refactoring.
Fixesscylladb/scylladb#15437Closesscylladb/scylladb#15897
* github.com:scylladb/scylladb:
table_helper: announce twice in setup_keyspace
table_helper: refactor setup_table
redis: create_keyspace_if_not_exists_impl: fix indentation
redis: announce once in create_keyspace_if_not_exists_impl
db: system_distributed_keyspace: fix indentation
db: system_distributed_keyspace: announce once in start
tablet_allocator: update on_before_create_column_family
migration_listener: add parameter to on_before_create_column_family
alternator: executor: use new prepare_new_column_family_announcement
alternator: executor: introduce create_keyspace_metadata
migration_manager: add new prepare_new_column_family_announcement
Replacing `restrict_replication_simplestrategy` config option with
2 config options: `replication_strategy_{warn,fail}_list`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on replication strategy when creating/altering a keyspace.
The reason to rather replace than extend `restrict_replication_simplestrategy` config
option is that it was not used and we wanted to generalize it.
Only soft guardrail is enabled by default and it is set to SimpleStrategy,
which means that we'll generate a CQL warning whenever replication strategy
is set to SimpleStrategy. For new cloud deployments we'll move
SimpleStrategy from warn to the fail list.
Guardrails violations will be tracked by metrics.
Resolves#5224
Refs #8892 (the replication strategy part, not the RF part)
Closesscylladb/scylladb#15399
The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal
with legacy materialized view schemas used for secondary indexes,
schemas which were created before the notion of "computed columns" was
introduced. Back then, secondary index schemas would use a regular
"token" column. Later it became a computed column and old schemas would
be migrated during rolling upgrade.
The migration code was introduced in 2019
(db8d4a0cc6) and then fixed in 2020
(d473bc9b06).
The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming
that users don't try crazy things like upgrading from 2021.X to 2023.X
(which we do not support), all clusters will have already executed the
migration code once they upgrade to 2023.X, meaning we can get rid of
it.
The main motivation of this PR is to get rid of the
`db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft
mode this was the only call to `merge_schema` outside "group 0 code" and
in fact it is unsafe -- it uses locally generated mutations with locally
generated timestamp (`api::new_timestamp()`), so if we actually did it,
we would permanently diverge the group 0 state machine across nodes
(the schema pulling code is disabled in Raft mode). Fortunately, this
should be dead code by now, as explained in the previous paragraph.
The migration code is now turned into a sanity check, if the users
try something crazy, they will get an error instead of silent data
corruption.
Closesscylladb/scylladb#15695
* github.com:scylladb/scylladb:
view: remove unused `_backing_secondary_index`
schema_tables: turn view schema fixing code into a sanity check
schema_tables: make comment more precise
feature_service: make COMPUTED_COLUMNS feature unconditionally true
We refactor system_distributed_keyspace::start so that it takes at
most one group 0 guard and calls migration_manager::announce at
most once.
We remove a catch expression together with the FIXME from
get_updated_service_levels (add_new_columns_if_missing before the
patch) because we cannot treat the service_levels update
differently anymore.
This reverts commit 4b80130b0b, reversing
changes made to a5519c7c1f. It's suspected
of causing dtest failures due to a bug in coroutine::parallel_for_each.
Currently, when the topology coordinator accepts a node, it moves it to bootstrap state and assigns tokens to it (either new ones during bootstrap, or the replaced node's tokens). Only then it contacts the joining node to tell it about the decision and let it perform a read barrier.
However, this means that the tokens are inserted too early. After inserting the tokens the cluster is free to route write requests to it, but it might not have learned about all of the schema yet.
Fix the issue by inserting the tokens later, after completing the join node response RPC which forces the receiving node to perform a read barrier.
Refs: scylladb/scylladb#15686Fixes: scylladb/scylladb#15738Closesscylladb/scylladb#15724
* github.com:scylladb/scylladb:
test: test_topology_ops: continuously write during the test
raft topology: assign tokens after join node response rpc
storage_service: fix indentation after previous commit
raft topology: loosen assumptions about transition nodes having tokens
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.
This fix affects also alternator's secondary indexes.
Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:
./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000
Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:
p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0
With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.
Fixes https://github.com/scylladb/scylladb/issues/15844Closesscylladb/scylladb#15845
Currently, when the topology coordinator accepts a node, it moves it to
bootstrap state and assigns tokens to it (either new ones during
bootstrap, or the replaced node's tokens). Only then it contacts the
joining node to tell it about the decision and let it perform a read
barrier.
However, this means that the tokens are inserted too early. After
inserting the tokens the cluster is free to route write requests to it,
but it might not have learned about all of the schema yet.
Fix the issue by inserting the tokens later, after completing the join
node response RPC which forces the receiving node to perform a read
barrier.
In later commits, tokens for a joining/replacing node will not be
inserted when the node enters `bootstrapping`/`replacing` state but at
some later step of the procedure. Loosen some of the assumptions in
`storage_service::topology_state_load` and
`system_keyspace::load_topology_state` appropriately.
Now when the system.sstables has the state field, it can be changed
(UPDATEd). However, when changing the state AND generation, this still
won't work, because generation is the clustering key of the table in
question and cannot be just changed. This, nonetheless, is OK, as
generation changes with state only when moving an sstable from upload
dir into normal/staging and this is separate issue for S3 (#13018). For
now changing state only is OK.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The state is one of <empty>(normal)/staging/quarantine. Currently when
sstable is moved to non-normal state the s3 backend state_change() call
throws thus such sstables do not appear. Next patches are going to
change that and the new field in the system.sstables is needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>