During raft topology upgrade procedure data from
system_auth keyspace will be migrated to system_auth_v2.
Migration works mostly on top of CQL layer to minimize
amount of new code introduced, it mostly executes SELECTs
on old tables and then INSERTs on new tables. Writes are
not executed as usual but rather announced via raft.
Because keyspace is part of the query when we
migrate from v1 to v2 query should change otherwise
code would operate on old keyspace if those statics
were initialized.
Likewise keyspace name can no longer be class
field initialized in constructor as it can change
during class lifetime.
Alternator doesn't do any writes to auth
tables so it's simply change of keyspace
name.
Docs will be updated later, when auth-v2
is enabled as default.
All auth modifications will go now via group0.
This is achieved by acquiring group0 guard,
creating mutations without executing and
then announcing them.
Actually first guard is taken by query processor,
it serves as read barrier for query validations
(such as standard_role_manager::exists), otherwise
we could read older data. In principle this single
guard should be used for entire query but it's impossible
to achive with current code without major refactor.
For read before write cases it's good to do write with
the guard acquired before the read so that there
wouldn't be any modify operation allowed in between.
Alought not doing it doesn't make the implementation
worse than it currently is so the most complex cases
were left with FIXME.
To make table modifications go via raft we need to publish
mutations. Currently many system tables (especially auth) use
CQL to generate table modifications. Added function is a missing
link which will allow to do a seamless transition of certain
system tables to raft.
Because we'll be doing group0 operations we need to run on shard 0. Additional benefit
is that with needs_guard set query_processor will also do automatic retries in case of
concurrent group0 operations.
When bounce_to_shard happens we need to fill client_state with
sl_controller appropriate for destination shard.
Before the patch sl_controller was set to null after the bounce.
It was fine becauase looks like it was never used in such scenario.
With auth-v2 we need to bounce attach/detach service level statements
because they modify things via auth subsystem which needs to be called
on shard 0.
It's the same approach as done for default_authorizer in
earlier commit.
Note that only non-legacy paths were changed, in particular
legacy migrations and table creations won't be ever executed
in new keyspace as they will be managed by system_auth_keyspace
implementation.
For now we add keyspace name as class member because it's static
value anyway. But statics will be removed in future commits because
migration can occur and auth need to switch keyspace name in runtime.
Just follow the same pattern as in default_authorizer so
it's easy to track where system_auth keyspace is actually
used. It will also allow for easier parametrization.
When adding group0 replication for auth we will change only
write path and plan to reuse read path. To not copy the code
or make more complicated class hierarchy default_authorizer's
read code will remain unchanged except this parametrization,
it is needed as group0 implementation uses separate keyspace
(replication is defined on a keyspace level).
In subsequent commits legacy write path code will be separated
and new implementation placed in default_authorizer.
For now we add keyspace name as class member because it's static
value anyway. But statics will be removed in future commits because
migration can occur and auth need to switch keyspace name in runtime.
New keyspace is added similarly as system_schema keyspace,
it's being registred via system_keyspace::make which calls
all_tables to build its schema.
Dummy table 'roles' is added as keyspaces are being currently
registered by walking through their tables. Full table schemas
will be added in subsequent commits.
Change can be observed via cqlsh:
cassandra@cqlsh> describe keyspaces;
system_auth_v2 system_schema system system_distributed_everywhere
system_auth system_distributed system_traces
cassandra@cqlsh> describe keyspace system_auth_v2;
CREATE KEYSPACE system_auth_v2 WITH replication = {'class': 'LocalStrategy'} AND durable_writes = true;
CREATE TABLE system_auth_v2.roles (
role text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = 'comment'
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 604800
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
This test needed a lot of data to ensure multiple pages when doing the read repair. This change two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a6fa, reducing this to a too small value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.
The latter configuration is a new one, added by the first patches of this series. It allows configuring the page-size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any tests which wanted to check paging, with pages cut on size, to work with large datasets. This was especially pronounced in the tests fixed in this PR, because this test works with tombstones which are tiny and a lot of them were needed to trigger paging based on the size.
With this two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8
The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).
Fixes: https://github.com/scylladb/scylladb/issues/15425
Fixes: https://github.com/scylladb/scylladb/issues/16899Closesscylladb/scylladb#17529
* github.com:scylladb/scylladb:
test/topology_custom: test_read_repair.py: reduce run-time
replica/database: get_query_max_result_size(): use query_page_size_in_bytes
replica/database: use include page-size in max-result-size
query-request: max_result_size: add without_page_limit()
db/config: introduce query_page_size_in_bytes
This patch fixes a UBSAN-reported integer overflow during one of our
existing tests,
test_native_functions.py::test_mintimeuuid_extreme_from_totimestamp
when attempting to convert an extreme "date" value, millions of years
in the past, into a "timestamp" value. When UBSAN crashing is enabled,
this test crashes before this patch, and succeeds after this patch.
The "date" CQL type is 32-bit count of *days* from the epoch, which can
span 2^31 days (5 million years) before or after the epoch. Meanwhile,
the "timestamp" type measures the number of milliseconds from the same
epoch, in 64 bits. Luckily (or intentionally), every "date", however
extreme, can be converted into a "timestamp": This is because 2^31 days
is 1.85e17 milliseconds, well below timestamp's limit of 2^63 milliseconds
(9.2e18).
But it turns out that our conversion function, date_to_time_point(),
used some boost::gregorian library code, which carried out these
calculations in **microsecond** resolution. The extra conversion to
microseconds wasn't just wasteful, it also caused an integer overflow
in the extreme case: 2^31 days is 1.85e20 microseconds, which does NOT
fit in a 64-bit integer. UBSAN notices this overflow, and complains
(plus, the conversion is incorrect).
The fix is to do the trivial conversion on our own (a day is, by
convention, exactly 86400 seconds - no fancy library is needed),
without the grace of Boost. The result is simpler, faster, correct
for the Pliocene-age dates, and fixes the UBSAN crash in the test.
Fixes#17516
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17527
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. A code that reads or writes the state needs to hold the log. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.
Fixesscylladb/scylladb#16821
This test needed a lot of data to ensure multiple pages when doing the
read repair. This change two key configuration items, allowing
for a drastic reduction of the data size and consequently a large
reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a6fa,
reducing this to a too small value would start killing internal
queries. Now, after said commit, this is no longer a concern, as this
limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.
With this two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8
The runtime of the test consequently drops from 62 seconds to 13.5
seconds (dev mode, on my build machine).
This patch changes get_unlimited_query_max_result_size():
* Also set the page-size field, not just the soft/hard limits
* Renames it to get_query_max_result_size()
* Update callers, specifically storage_proxy::get_max_result_size(),
which now has a much simpler common return path and has to drop the
page size on one rare return path.
This is a purely mechanical change, no behaviour is changed.
Returns an instance with the page_limit reset to 0. This converts a
max_results_size which is usable only with the
"page_size_and_safety_limit" feature, to one which can be used before
this feature.
To be used in the next patch.
Regulates the page size in bytes via config, instead of the currently
used hard-coded constant. Allows tests to configure lower limits so they
can work with smaller data-sets when testing paging related
functionality.
Not wired yet.
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.
The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.
The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0.
There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:
1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.
Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:
1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.
The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.
Refs: #17493Closesscylladb/scylladb#17499
* github.com:scylladb/scylladb:
version_generator: check that get_next_version is called on shard 0
misc_services: fix data race from bad usage of get_next_version
Since commit f1bbf70, many compaction types can do cleanup work, but turns out
we forgot to invalidate cache on their completion.
So if a node regains ownership of token that had partition deleted in its previous
owner (and tombstone is already gone), data can be resurrected.
Tablet is not affected, as it explicitly invalidates cache during migration
cleanup stage.
Scylla 5.4 is affected.
Fixes#17501.
Fixes#17452.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17502
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.
in this change, we define formatters for
* canonical_mutation
* atomic_cell_view
* atomic_cell
* atomic_cell_or_collection::printer
Refs #13245Closesscylladb/scylladb#17506
* github.com:scylladb/scylladb:
mutation: add fmt::formatter for canonical_mutation
mutation: add fmt::formatter for atomic_cell_view and atomic_cell
mutation: add fmt::formatter for atomic_cell_or_collection::printer
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for canonical_mutation
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* atomic_cell_view
* atomic_cell
and drop their operator<<:s.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
`atomic_cell_or_collection::printer`, and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The get_next_version function can only be safely called from shard 0,
but this constraint is not enforced in any way. As evidenced in the
previous commit, it is easy to accidentally call it from a non-zero
shard.
Introduce a runtime check to get_next_version which calls
on_fatal_internal_error if it detects that the function was called form
the wrong shard. This will let us detect cross-shard use issues in
runtime.
The function `gms::version_generator::get_next_version()` can only be
called from shard 0 as it uses a global, unsynchronized counter to
issue versions. Notably, the function is used as a default argument for
the constructor of `gms::versioned_value` which is used from shorthand
constructors such as `versioned_value::cache_hitrates`,
`versioned_value::schema` etc.
The `cache_hitrate_calculator` service runs a periodic job which
updates the `CACHE_HITRATES` application state in the local gossiper
state. Each time the job is scheduled, it runs on the next shard (it
goes through shards in a round-robin fashion). The job uses the
`versioned_value::cache_hitrates` shorthand to create a
`versioned_value`, therefore risking a data race if it is not currently
executing on shard 0.
Fix the race by constructing the versioned value on shard 0.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* wrapping_interval
* interval
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17488
Unfortunately, fmt v10 dropped support for operator<< formatters,
forcing us to replace the huge number of operator<< implementations
in our code by uglier and templated fmt::formatter implementations
to get Scylla to compile on modern distros (such as Fedora 39) :-(
Kefu has already started doing this migration, here is my small
contribution - the formatter for mutation_fragment_v2::kind.
This patch is need to compile, for example,
build/dev/mutation/mutation_fragment_stream_validator.o.
I can't remove the old operator<< because it's still used by
the implementation of other operator<< functions. We can remove
all of them when we're done with this coversion. In the meantime,
I replaced the original implementation of operator<< by a trivial
implementation just passing the work to the new fmt::print support.
Refs #13245
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17432
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* bound_kind_m
* sstable_state
* indexable_element
* deletion_time
drop their operator<<:s
Refs #13245Closesscylladb/scylladb#17490
* github.com:scylladb/scylladb:
sstables: add fmt::formatter for deletion_time
sstable: add fmt::formatter for indexable_element
sstables: add fmt::foramtter for sstable_state
sstables: add fmt::formatter for sstables::bound_kind_m
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for some types used in testing.
Refs #13245Closesscylladb/scylladb#17485
* github.com:scylladb/scylladb:
test/unit: add fmt::formatter for tree_test_key_base
test: add printer for type for BOOST_REQUIRE_EQUAL
test: add fmt::formatters
test/perf: add fmt::formatters for scheduling_latency_measurer and perf_result
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* managed_bytes
* managed_bytes_view
* managed_bytes_opt
* occupancy_stats
and drop their operator<<:s
Refs https://github.com/scylladb/scylladb/issues/13245Closesscylladb/scylladb#17462
* github.com:scylladb/scylladb:
utils/managed_bytes: add fmt::formatters for managed_bytes and friends
utils/logalloc: add fmt::formatter for occupancy_stats
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
Fixes: #17232.
Closesscylladb/scylladb#17355
* github.com:scylladb/scylladb:
test: add test to check if reader is closed
sstables: close index_reader in has_partition_key
the "keyspace" argument of the "ring" command is optional. but before
this change, we considered it a mandatory option. it was wrong.
so, in this change, we make it optional, and print out the warning
message if the keyspace is not specified.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17472
* tighten the param check for toppartitions
* add an extra empty line inbetween reports
Closesscylladb/scylladb#17486
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add an extra empty line inbetween reports
tools/scylla-nodetool: tighten the param check for toppartitions