Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written.
This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too.
This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends depends entirely on how the consumer reacts to it, so no such bug is actually benign.
Fixes: https://github.com/scylladb/scylladb/issues/11174Closes#11532
* github.com:scylladb/scylladb:
mutation_compactor: add validator
mutation_fragment_stream_validator: add a 'none' validation level
test/boost/mutation_query_test: test_partition_limit: sort input data
querier: consume_page(): use partition_start as the sentinel value
treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
position_in_partition: add for_partition_{start,end}()
Adds unit tests for the function `expr::prepare_expression`.
Three minor bugs were found by these tests, both fixed in this PR.
1. When preparing a map, the type for tuple constructor was taken from an unprepared tuple, which has `nullptr` as its type.
2. Preparing an empty nonfrozen list or set resulted in `null`, but preparing a map didn't. Fixed this inconsistency.
3. Preparing a `bind_variable` with `nullptr` receiver was allowed. The `bind_variable` ended up with a `nullptr` type, which is incorrect. Changed it to throw an exception,
Closes#11941
* github.com:scylladb/scylladb:
test preparing expr::usertype_constructor
expr_test: test that prepare_expression checks style_type of collection_constructor
expr_test: test preparing expr::collection_constructor for map
prepare_expr: make preparing nonfrozen empty maps return null
prepare_expr: fix a bug in map_prepare_expression
expr_test: test preparing expr::collection_constructor for set
expr_test: test preparing expr::collection_constructor for list
expr_test: test preparing expr::tuple_constructor
expr_test: test preparing expr::untyped_constant
expr_test_utils: add make_bigint_raw/const
expr_test_utils: add make_tinyint_raw/const
expr_test: test preparing expr::bind_variable
cql3: prepare_expr: forbid preparing bind_variable without a receiver
expr_test: test preparing expr::null
expr_test: test preparing expr::cast
expr_test_utils: add make_receiver
expr_test_utils: add make_smallint_raw/const
expr_test: test preparing expr::token
expr_test: test preparing expr::subscript
expr_test: test preparing expr::column_value
expr_test: test preparing expr::unresolved_identifier
expr_test_utils: mock data_dictionary::database
BOOST_CHECK_EQUAL is a weaker form of assertion, it reports an error
and will cause the test case to fail but continues. This makes the
test harder to debug because there's no obvious way to catch the
failure in GDB and the test output is also flooded with things which
happen after the failed assertion.
Message-Id: <20221119171855.2240225-1-tgrabiec@scylladb.com>
Add a function which creates a mock instance
of data_dictionary::database.
prepare_expression requires a data_dictionary::database
as an argument, so unit tests for it need something
to pass there. make_data_dictionary_database can
be used to create an instance that is sufficient for tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().
But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.
On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
ompaction_group_sstable_set_updater.
Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface will be helpful for allowing replica::table, unit
tests and sstables::compaction to access the compaction group's tracker
which will be managed by the compaction manager, once we complete
the decoupling work.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).
Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.
To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).
This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.
To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
an IP address, because now the stored `_alive_set` already contains
`raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
`gms::inet_address`. To send the ping message, we need to translate
this to IP address; we do it by the `raft_address_map` pointer
introduced in an earlier commit.
Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
Closes#11759
* github.com:scylladb/scylladb:
direct_failure_detector: get rid of complex `endpoint_id` translations
service/raft: ping `raft::server_id`s, not `gms::inet_address`es
service/raft: store `raft_address_map` reference in `direct_fd_pinger`
gms: gossiper: move `direct_fd_pinger` out to a separate service
gms: gossiper: direct_fd_pinger: extract generation number caching to a separate class
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.
Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.
In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.
In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
Since the end bound is exclusive, the end position should be
before_key(), not after_key().
Affects only tests, as far as I know, only there we can get an end
bound which is a clustering row position.
Would cause failures once row cache is switched to v2 representation
because of violated assumptions about positions.
Introduced in 76ee3f029cCloses#11823
This PR adds some unit tests for the `expr::evaluate()` function.
At first I wanted to add the unit tests as part of #11658, but their size grew and grew, until I decided that they deserve their own pull request.
I found a few places where I think it would be better to behave in a different way, but nothing serious.
Closes#11815
* github.com:scylladb/scylladb:
test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
cql3: expr: Add unit tests for bind_variable validation of collections
cql3: expr: Add test for subscripted list and map
cql3: expr: Add test for usertype_constructor
cql3: expr: Add test for tuple_constructor
cql3: expr: Add tests for evaluation of collection constructors
cql3: expr: Add tests for evaluation of column_values and bind_variables
cql3: expr: Add constant evaluation tests
test/boost: Add expr_test_utils.hh
cql3: Add ostream operator for raw_value
cql3: add is_empty_value() to raw_value and raw_value_view
expr_test_utils.hh was a header file with helper methods for
expression tests. All functions were inline, because I didn't
know how to create and link a .cc file in test/boost.
Now the header is split into expr_test_utils.hh and expr_test_utils.cc
and moved to test/lib, which is designed to keep this kind of files.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All uses of snitch not have their own local referece. The global
instance can now be replaced with the one living in main (and tests)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection
At this point it's worth adding storage_service->snitch explicit
dependency and patch the above to use local reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places to patch: .start() and .setup() and both only need
snitch to get local dc/rack from, nothing more. Thus both can live with
the explicit argument for now
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto topology object and this set finalizes it. After it the
snitch service stops depending on gossiper and system keyspace and is
ready for de-globalization. As a nice side-effect the system keyspace no
longer needs to maintain the dc/rack info cache and its starting code gets
relaxed.
refs: #2737
refs: #2795
"
* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
system_keyspace: Dont maintain dc/rack cache
system_keyspace: Indentation fix after previous patch
system_keyspace: Coroutinuze build_dc_rack_info()
topology: Move all post-configuration to topology::config
snitch: Start early
gossiper: Do not export system keyspace
snitch: Remove gossiper reference
snitch: Mark get_datacenter/_rack methods const
snitch: Drop some dead dependency knots
snitch, code: Make get_datacenter() report local dc only
snitch, code: Make get_rack() report local rack only
storage_service: Populate pending endpoint in on_alive()
code: Populate pending locations
topology: Put local dc/rack on topology early
topology: Add pending locations collection
topology: Make get_location() errors more verbose
token_metadata: Add config, spread everywhere
token_metadata: Hide token_metadata_impl copy constructor
gosspier: Remove messaging service getter
snitch: Get local address to gossip via config
...
The test verifies that a row which participated in earlier merge, and
its cells lost on the timestamp check, behaves exactly like an empty
row and can accept any mutation.
This wasn't the case in versions prior to f006acc.
Closes#11787
Because of snitch ex-dependencies some bits on topology were initialized
with nasty post-start calls. Now it all can be removed and the initial
topology information can be provided by topology::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Snitch code doesn't need anything to start working, but it is needed by
the low-level token-metadata, so move the snitch to start early (and to
stop late)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It doesn't need gossiper any longer. This change will allow starting
snitch early by the next patch, and eventually improving the
token-metadata start-up sequence
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the code out there now calls snitch::get_rack() to get rack for the
local node. For other nodes the topology::get_rack(endpoint) is used.
Since now the topology is properly populated with endpoints, it can
finally be patched to stop using snitch and get rack from its internal
collections
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Startup code needs to know the dc/rack of the local node early, way
before nodes starts any communication with the ring. This information is
available when snitch activates, but it starts _after_ token-metadata,
so the only way to put local dc/rack in topology is via a startup-time
special API call. This new init_local_endpoint() is temporary and will
be removed later in this set
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need to provide some early-start data for topology.
The standard way of doing it is via service config, so this patch adds
one. The new config is empty in this patch, to be filled later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a virtual method on table_state to update the entry in system
keyspace. It's an overkill to facilitate tests that don't want this.
With new system_keyspace weak referencing it can be made simpled by
moving the updating call to the compaction_manager itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Many services out there have one (sometimes called .drain()) that's
called early on stop and that's responsible for prearing the service for
stop -- aborting pending/in-flight fibers and alike.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env
Closes#11738
* github.com:scylladb/scylladb:
system_keyspace: Make get_{local|saved}_tokens non static
size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
cql_test_env: Keep sharded<system_keyspace> reference
size_estimate_virtual_reader: Keep system_keyspace reference
system_keyspace: Pass sys_ks argument to install_virtual_readers()
system_keyspace: Make make() non-static
distributed_loader: Pass sys_ks argument to init_system_keyspace()
system_keyspace: Remove dangling forward declaration
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated respectively.
A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes#11449Closes#11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
What's contained in this series:
- Refactored compaction tests (and utilities) for integration with multiple groups
- The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group.
- Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue)
- Many changes in replica::table for allowing integration with multiple groups
Next:
- Introduce for_each_compaction_group() for iterating over groups wherever needed.
- Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc).
- Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups
- Introduce static option for defining number of compaction groups and implement function to map a token to its respective group.
- Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting).
Closes#11592
* github.com:scylladb/scylladb:
sstable_resharding_test: Switch to table_for_tests
replica: Move compacted_undeleted_sstables into compaction group
replica: Use correct compaction_group in try_flush_memtable_to_sstable()
replica: Make move_sstables_from_staging() robust and compaction group friendly
test: Rename column_family_for_tests to table_for_tests
sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
test: Don't expose compound set in column_family_for_tests
test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
sstable_compaction_test: Switch to table_state in compact_sstables()
sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
There's a test_get_local_ranges() call in size-estimate reader which
will need system keyspace reference. There's no other place for tests to
get it from but the cql_test_env thing
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's final destination is virtual tabls registration code called from
init_system_keyspace() eventually
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The raft_group0 code needs system_keyspace and now it gets one from gossiper. This gossiper->system_keyspace dependency is in fact artificial, gossiper doesn't need system ks, it's there only to let raft and snitch call gossiper.get_system_keyspace().
This makes raft use system ks directly, snitch is patched by another branch
Closes#11729
* github.com:scylladb/scylladb:
raft_group0: Use local reference
raft_group0: Add system keyspace reference
The compound set shouldn't be exposed in main_sstables() because
once we complete the switch to column_family_for_tests::table_state,
can happen compaction will try to remove or add elements to its
set snapshot, and compound set isn't allowed to either ops.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Needed once we switch to column_family_for_tests::table_state, so unit
tests relying on correct value will still work
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change will make table_state_for_test the table_state of
column_family_for_tests. Today, an unit test has to keep a reference
to them both and logically couple them, but that's error prone.
This change is also important when replica::table supports multiple
compaction groups, so unit tests won't have to directly reference
the table_state of table, but rather use the one managed by
column_family_for_tests.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The switch is important once we have multiple compaction groups,
as a single table may own several groups. There will no longer be
a replica::table::as_table_state().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Lots of boilerplate is reduced, and will also help to complete the
switch from replica::table to compaction::table_state in the unit
tests.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set.
Fixes https://github.com/scylladb/scylladb/issues/11681.
Closes#11682
* github.com:scylladb/scylladb:
replica: Return all sstables in table::get_sstable_set()
sstables: Fix cloning of compound_sstable_set