Stop using database (and including database.hh) for schema related
purposes and use data_dictionary instead.
data_dictionary::database::real_database() is called from several
places, for these reasons:
- calling yet-to-be-converted code
- callers with a legitimate need to access data (e.g. system_keyspace)
but with the ::database accessor removed from query_processor.
We'll need to find another way to supply system_keyspace with
data access.
- to gain access to the wasm engine for testing whether user-defined
functions compile. We'll have to find another way to
do this as well.
The change is a straightforward replacement. One case in
modification_statement had to change a capture, but everything else
was just a search-and-replace.
Some files that lost "database.hh" gained "mutation.hh", which they
previously had access to through "database.hh".
"
These two readers are crucial for writing tests for any composable
reader so we need v2 versions of them before we can convert and test the
combined reader (for example). As these two readers are often used in
situations where the payload they deliver is specially crafted for the
test at hand, we keep their v1 versions too to avoid conversion meddling
with the tests.
Tests: unit(dev)
"
* 'forwarding-and-fragment-reader-v2/v1' of https://github.com/denesb/scylla:
flat_mutation_reader_v2: add make_flat_mutation_reader_from_fragments()
test/lib/mutation_source_test: don't force v1 reader in reverse run
mutation_source: add native_version() getter
flat_mutation_reader_v2: add make_forwardable()
position_in_partition: add after_key(position_in_partition_view)
flat_mutation_reader: make_forwardable(): fix indentation
flat_mutation_reader: make_forwardable(): coroutinize reader
Currently in the reverse run we wrap the test-provided mutation-source
and create a v1 reader with it, forcing a conversion if the
mutation-source has a v2 factory. Worse still, if the test is v2 native,
there will be a double conversion. This patch fixes this by creating a
wrapper mutation-source appropriate to the version of the underlying
factory of the wrapped mutation-source.
"
Currently stateful (readers being saved and resumed on page boundaries)
multi-range scans are broken in multiple ways. Trying to use them can
result in anything from use-after-free (#6716) to corrupt data
(#9718). Luckily no one is doing such queries today, but this started to
change recently as code such as Alternator TTL and distributed
aggregate reads started using this.
This series fixes both problems and adds a unit test exercising this
previously completely unused code-path.
Fixes: #6716
Fixes: #9718
Tests: unit(dev, release, debug)
"
* 'fix-stateful-multi-range-scans/v1' of https://github.com/denesb/scylla:
test/boost/multishard_mutation_query_test: add multi-range test
test/boost/multishard_mutation_query_test: add multi-range support
multishard_mutation_query: don't drop data during stateful multi-range reads
multishard_combining_reader: reader_lifecycle_policy: allow saving read range on fast-forward
The reader_lifecycle_policy API was created around the idea of shard
readers (optionally) being saved and reused on the next page. To do
this, the lifecycle policy has to also be able to control the lifecycle
of by-reference parameters of readers: the slice and the range. This was
possible from day 1, as the readers are created through the lifecycle
policy, which can intercept and replace the said parameters with copies
that are created in stable storage. There was one hole in the design
though: fast-forwarding, which can change the range of the read, without
the lifecycle policy knowing about this. In practice this results in
fast-forwarded readers being saved together with the wrong range, their
range reference becoming stale. The only lifecycle implementation prone
to this is the one in `multishard_mutation_query.cc`, as it is the only
one actually saving readers. It will fast-forward its reader when the
query happens over multiple ranges. There were no problems related to
this so far because no one passes more than one range to said functions,
but this is incidental.
This patch solves this by adding an `update_read_range()` method to the
lifecycle policy, allowing the shard reader to update the read range
when being fast forwarded. To allow the shard reader to also have
control over the lifecycle of this range, a shared pointer is used. This
control is required because when an `evictable_reader` is the top-level
reader on the shard, it can invoke `create_reader()` with an edited
range after `update_read_range()`, replacing the fast-forwarded-to
range with a new one, yanking it out from under the feet of the
evictable reader itself. By using a shared pointer here, we can ensure
the range stays alive while it is the current one.
"
A couple of preparatory changes for coroutinization of the compaction manager.
"
* 'some_compaction_manager_cleanups_v5' of github.com:raphaelsc/scylla:
compaction_manager: move check_for_cleanup into perform_cleanup()
compaction_manager: replace get_total_size by one liner
compaction_manager: make consistent usage of type and name table
compaction_manager: simplify rewrite_sstables()
compaction_manager: restore indentation
New code in the manager adopted the name and type table, whereas
historical code still uses the name and type column family. Let's make
it consistent so newcomers don't get confused.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
To ensure consistency of schema and topology changes,
Scylla needs a linearizable storage for this data
available at every member of the database cluster.
The series introduces such storage as a service,
available to all Scylla subsystems. Using this service, any other
internal service such as gossip or migrations (schema) could
persist changes to cluster metadata and expect this to be done in
a consistent, linearizable way.
The series uses the built-in Raft library to implement a
dedicated Raft group, running on shard 0, which includes all
members of the cluster (group 0), adds hooks to topology change
events, such as adding or removing nodes of the cluster, to update
group 0 membership, ensures the group is started when the
server boots.
The state machine for the group, i.e. the actual storage
for cluster-wide information still remains a stub. Extending
it to actually persist changes of schema or token ring
is subject to a subsequent series.
Another Raft related service was implemented earlier: Raft Group
Registry. The purpose of the registry is to allow Scylla to have an
arbitrary number of groups, each with its own subset of cluster
members and a relevant state machine, sharing a common transport.
Group 0 is one (the first) group among many.
"
* 'raft-group-0-v12' of github.com:scylladb/scylla-dev:
raft: (server) improve tracing
raft: (metrics) fix spelling of waiters_awaken
raft: make forwarding optional
raft: (service) manage Raft configuration during topology changes
raft: (service) break a dependency loop
raft: (discovery) introduce leader discovery state machine
system_keyspace: mark scylla_local table as always-sync commitlog
system_keyspace: persistence for Raft Group 0 id and Raft Server Id
raft: add a test case for adding entries on follower
raft: (server) allow adding entries/modify config on a follower
raft: (test) replace virtual with override in derived class
raft: (server) fix a typo in exception message
raft: (server) implement id() helper
raft: (server) remove apply_dummy_entry()
raft: (test) fix missing initialization in generator.hh
In order to avoid the race condition introduced in 9dce1e4, the
index_reader should be closed prior to its destruction.
This only exposes 4.4 and earlier releases to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it, since developers who change the index
reader in the future will most likely assume this order.
Ref #9704 (because 4.4 and earlier releases are vulnerable).
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #9705
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.
As a preparation for the change, we extend the sstable reversing reader
random schema test with a fast-forwarding test and include some minor
fixes.
Fixes #9427.
Closes #9484
* github.com:scylladb/scylla:
query-request: add comment about clustering ranges with non-full prefix key bounds
sstables: mx: enable position fast-forwarding in reverse mode
test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding
test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call
test: mutation_source_test: extract `forwardable_reader_to_mutation` function
test: random_schema: fix clustering column printing in `random_schema::cql`
This series hardens compaction_manager::remove by:
- adding debug logging around task execution and stopping.
- accessing compaction_state as lw_shared_ptr rather than via a raw pointer.
- with that, detaching it from `_compaction_state` in `compaction_manager::remove` right away, to prevent further use of it while compactions are stopped.
- adding a write_lock in `remove` to make sure the lock is not held by any stray task.
Test: unit(dev), sstable_compaction_test(debug)
Dtest: alternator_tests.py:AlternatorTest.test_slow_query_logging (debug)
Closes #9636
* github.com:scylladb/scylla:
compaction_manager: add compaction_state when table is constructed
compaction_manager: remove: fixup indentation
compaction_manager: remove: detach compaction_state before stopping ongoing compactions
compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
compaction_manager: task: keep a reference on compaction_state
test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed.
test: sstable_utils: compact_sstables: deregister compaction also on error path
test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregister_compaction also on error path
test: compaction_manager_test: add debug logging to register/deregister compaction
test: compaction_manager_test: deregister_compaction: erase by iterator
test: compaction_manager_test: move methods out of line
compaction_manager: compaction_state: use counter for compaction_disabled
compaction_manager: task: delete move and copy constructors
compaction_manager: add per-task debug log messages
compaction_manager: stop_ongoing_compactions: log number of tasks to stop
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.
However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.
In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.
Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.
Until this is done, do our best to avoid a cluster state in which
a removed node, or a node whose addition failed, is stuck in the Raft
configuration while no longer present in gossip-managed
system tables. In other words, keep gossip the primary source of
truth. For this purpose, carefully choose the timing when we
join and leave Raft group 0:
Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.
Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.
Add tracing.
One of the tables needs the gossiper and uses the global one. This
patch prepares the fix by plumbing the gossiper reference through the
main -> register_virtual_tables call stack.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Query time must be fetched after populate. If compaction is executed
during populate it may be executed with timestamp later than query_time.
This would cause the compaction expected by the test and the compaction
during populate to be executed at different time points, producing
different results. The result would be sporadic test failures depending
on the relative timing of those operations. If no other mutations happen after populate,
and query_time is later than the compaction time during population, we're
guaranteed to have the same results.
Message-Id: <20211123134808.105068-1-mikolaj.sieluzycki@scylladb.com>
The manager is drained() on drain/decommission/isolate. Since now
it's storage_service who orchestrates all of the above, it needs
an explicit reference to the target.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And hold its gate to make sure the compaction_state outlives
the task and can be used to wait on all tasks and functions
using it.
With that, don't access _compaction_state[cf] directly to acquire
shared/exclusive locks, but rather get to the state via
task->compaction_state, so it can be detached from
_compaction_state while the task is running, if needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As a prerequisite to globalizing the batchlog_manager,
allow setting a global pointer to it and instantiate
the sharded<db::batchlog_manager> on the main/cql_test_env
stack.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
After this series, compaction will finally stop including database.hh.
tests: unit(debug).
"
* 'stop_including_database_hh_for_compaction' of github.com:raphaelsc/scylla:
compaction: stop including database.hh
compaction: switch to table_state in get_fully_expired_sstables()
compaction: switch to table_state
compaction: table_state: Add missing methods required by compaction
Make the compaction procedure switch to table_state. The only function in
compaction.cc still directly using table is
get_fully_expired_sstables(T,...), but subsequently we'll make it
switch to table_state too, and then we can finally stop including database.hh
in the compaction code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We can't have view updates happening after the database shuts down.
In particular, mutateMV depends on the keyspace effective_replication_map
and it is going to be released when all keyspaces shut down, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for creating effective_replication_map
when token_metadata changes, and update all
keyspaces with it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It will be used further to create shared copies
of effective_replication_map based on replication_strategy
type and config options.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The db::config reference is available on the database, which
can be obtained from the virtual_table itself. The problem is that
it's a const reference, while system.config will be updateable
and will need a non-const reference.
Adding a non-const get_config() on the database looks wrong. The
database shouldn't be used as a config provider, even a const
one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
table_state is being introduced for compaction subsystem, to remove table dependency
from compaction interface, fix layer violations, and also make unit testing
easier as table_state is an abstraction that can be implemented even with no
actual table backing it.
In this series, compaction strategy interfaces are switching to table_state,
and eventually, we'll make compact_sstables() switch to it too. The idea is
that no compaction code will directly reference a table object, but only work
with the abstraction instead. So compaction subdirectory can stop
including database.hh altogether, which is a great step forward.
"
* 'table_state_v5' of https://github.com/raphaelsc/scylla:
sstable_compaction_test: switch to table_state
compaction: stop including database.hh for compaction_strategy
compaction: switch to table_state in estimated_pending_compactions()
compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
DTCS: reduce table dependency for task estimation
LCS: reduce table dependency for task estimation
table: Implement table_state
compaction: make table param of get_fully_expired_sstables() const
compaction_manager: make table param of has_table_ongoing_compaction() const
Introduce table_state
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like the strategy using
the manager, and also make testing easier.
Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
flat_reader_assertions::produces_range_tombstone() does not actually
check range tombstones beyond the fact that they are in fact range
tombstones (unless non-empty ck_ranges is passed).
Fixing the immediate problem reveals that:
* The assertion logic is not flexible enough to deal with
creatively-split or creatively-overlapping range tombstones.
* Some existing tests involving range tombstones are in fact wrong:
some assertions may (at least with some readers) refer to wrong
tombstones entirely, while others assert wrong things about right
tombstones.
* Range tombstones in pre-made sstables (such as those read by
sstable_3_x_test) have deletion time drift, and that now has to be
somehow dealt with.
This patch (which is not split into smaller ones because that would
either generate an unreasonable amount of work towards ensuring
bisectability or entail "temporarily" disabling problematic tests,
which is cheating) contains the following changes:
* flat_reader_assertions check range tombstones more carefully, by
accumulating both expected and actually-read range tombstones into
lists and comparing those lists when a partition ends (or when the
assertion object is destroyed).
* flat_reader_assertions::may_produce_tombstones() can take
constraining ck_ranges.
* Both flat_reader_assertions and flat_reader_assertions_v2 can be
instructed to ignore tombstone deletion times, to help with tests that
read pre-made sstables.
* Affected tests are changed to reflect reality. Most changes to
tests make sense; the only one I am not completely sure about is in
test_uncompressed_filtering_and_forwarding_range_tombstones_read.
Fixes #9470
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
There's no need to wrap compaction_data in a shared_ptr. Also,
let's kill the unused params in create_compaction_data to simplify
its creation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakiness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.
This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.
There is still one vulnerability left: `row_marker& row_marker()`, which
allows overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction, so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.
Fixes: #9483
Tests: unit(dev)
Closes #9519
This mini series contains two fixes that are bundled together, since the
second one assumes that the first one exists (or it will not fix
anything, really). The two problems were:
1. When certain operations are called on a service level controller
   which doesn't have its data accessor set, it can lead to a crash
   since some operations will still try to dereference the accessor
   pointer.
2. The cql environment test initialized the accessor with a
   sharded<system_distributed_data>&. However, this sharded class
   itself is not initialized (sharded::start wasn't called), so the
   same calls that were unsafe due to null dereference will now crash
   for trying to access an uninitialized sharded instance.
Closes #9468
* github.com:scylladb/scylla:
CQL test environment: Fix bad initialization order
Service Level Controller: Fix possible dereference of a null pointer
Serialize the metadata changes with
keyspace create, update, or drop.
This will become necessary in the following patch
when we update the effective_replication_map
on all keyspaces and we want instances on all shards
to end up with the same replication map.
Note that storage_service::keyspace_changed is called
from the schema merge path so it already holds
the merge_lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The service level controller was initialized with a data
accessor that uses the system distributed keyspace before
the latter has been initialized. If there is a use of
this accessor (for example by calling
service_level_controller::get_distributed_service_levels()),
it will fail miserably and crash.
Not initializing the data accessor doesn't mean the same thing,
since we can deal with such a call when the accessor is not
initialized.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
There are 4 flavours of mutation source tests that are all run
sequentially -- plain, reversed, and upgrade/downgrade ones that
check v1<->v2 conversions.
This patch splits them all into individual calls so that some
tests may want to have dedicated cases for each. "By default" they
are all run as they were.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
The storage_service is involved in the cdc_generation_service guts
more than needed.
- the bool _for_testing bit is cdc-only
- there's an API-only cdc_generation_service getter
- cdc_generation_service startup code partially sits in storage_service's
This patch cleans most of the above leaving only the startup
_cdc_gen_id on board.
tests: unit(dev)
refs: #2795
"
* 'br-storage-service-vs-cdc-2' of https://github.com/xemul/scylla:
api: Use local sharded<cdc::generation_service> reference
main: Push cdc::generation_service via API
storage_service: Ditch for_testing boolean
cdc: Replace db::config with generation_service::config
cdc: Drop db::config from description_generator
cdc: Remove all arguments from maybe_rewrite_streams_descriptions
cdc: Move maybe_rewrite_streams_descriptions into after_join
cdc: Squash two methods into one
cdc: Turn make_new_cdc_generation a service method
cdc: Remove ring-delay arg from make_new_cdc_generation
cdc: Keep database reference on generation_service
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task