scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 15:03:06 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	7597663ef5	cql_test_env: Use table.find_row() shortcut The require_column_has_value() finds the cell in three steps -- finds partition, then row, then cell. The class table already has a method to facilitate row finding by partition and clustering key Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-08-29 15:37:27 +03:00
Benny Halevy	5afc242814	token_metadata: get_endpoint_to_host_id_map_for_reading: just inform that normal node has null host_id It is too early to require that all nodes in normal state have a non-null host_id. The assertion was added in `44c14f3e2b` but unfortunately there are several call sites where we add the node as normal, but without a host_id and we patch it in later on. In the future we should be able to require that once we identify nodes by host_id over gossiper and in token_metadata. Fixes scylladb/scylladb#15181 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #15184	2023-08-28 21:40:55 +03:00
Botond Dénes	47ce69e9bf	Merge 'paxos_response_handler: carry effective replication map' from Benny Halevy On this path, `create_write_response_handler` accepts an `inet_address_vector_replica_set` that corresponds to the effective_replication_map_ptr in the paxos_response_handler, but the function currently retrieves a new effective_replication_map_ptr that may not hold all of those endpoints. Fixes scylladb/scylladb#15138 Closes #15141 * github.com:scylladb/scylladb: storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found	2023-08-28 11:42:38 +03:00
Kefu Chai	86e8be2dcd	replica:database: log if endpoint not found if the endpoint specified when creating a KEYSPACE is not found, when flushing a memtable, we would throw an `std::out_of_range` exception when looking up the client in `storage_manager::_s3_endpoints` by the name of endpoint. and scylla would crash because of it. so far, we don't have a good way to error out early. since the storage option for keyspace is still experimental, we can live with this, but would be better if we can spot this error in logging messages when testing this feature. also, in this change, `std::invalid_argument` is thrown instead of `std::out_of_range`. it's more appropriate in this circumstance. Refs #15074 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15075	2023-08-28 10:51:19 +03:00
Avi Kivity	fb8375e1e7	Merge 'storage_proxy: mutate_atomically_result: carry effective replication map down to create_write_response_handler' from Benny Halevy The effective_replication_map_ptr passed to `create_write_response_handler` by `send_batchlog_mutation` must be synchronized with the one used to calculate _batchlog_endpoints to ensure they use the same topology. Fixes scylladb/scylladb#15147 Closes #15149 * github.com:scylladb/scylladb: storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context	2023-08-27 16:34:34 +03:00
Benny Halevy	a5d5b6ded1	gossiper: remove_endpoint: call on_dead notifications is endpoint was alive Since `75d1dd3a76` gossiper::convict will no longer call `mark_dead` (e.g. when called from the failure detection loop after a node is stopped following decommission) and therefore the on_dead notification won't get called. To make that explicit, if the node was alive before remove_endpoint erased it from _live_endpoint, and it has an endpoint_state, call the on_dead notifications. These are important to clean up after the node is dead e.g. in storage_proxy::on_down which cancels all respective write handlers. This is preferred over going through `mark_dead` as the latter marks the endpoint as unreachable, which is wrong in this case as the node left the cluster. Fixes #15178 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #15179	2023-08-27 16:18:27 +03:00
Takuya ASADA	ae25a216bc	scylla_fstrim_setup: stop disabling fstrim.timer Disabling fstrim.timer was meant to avoid running fstrim on /var/lib/scylla from both scylla-fstrim.timer and fstrim.timer, but fstrim.timer actually never does that, since it only looks at fstab entries, not our systemd unit. To run fstrim correctly on rootfs and other filesystems not related to scylla, we should stop disabling fstrim.timer. Fixes #15176 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes #15177	2023-08-27 14:56:37 +03:00
Kefu Chai	83ceedb18b	storage_service: do not cast a string to string_view before formatting seastar::format() just forwards the parameters to be formatted to `fmt::format_to()`, which is able to format `std::string`, so there is no need to cast the `std::string` instance to `std::string_view` for formatting it. in this change, the cast is dropped. simpler this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15143	2023-08-25 16:43:38 +03:00
Mikołaj Grzebieluch	a031a14249	tests: add asynchronous log browsing functionality Add a class that handles log file browsing with the following features: * mark: returns "a mark" to the current position of the log. * wait_for: asynchronously checks if the log contains the given message. * grep: returns a list of lines matching the regular expression in the log. Add a new endpoint in `ManagerClient` to obtain the scylla logfile path. Fixes #14782 Closes #14834	2023-08-25 14:19:09 +02:00
Raphael S. Carvalho	a22f74df00	table: Introduce storage snapshot for upcoming tablet streaming New file streaming for tablets will require integration with compaction groups. So this patch introduces a way for streaming to take a storage snapshot of a given tablet using its token range. Memtable is flushed first, so all data of a tablet can be streamed through its sstables. The interface is compaction group / tablet agnostic, but user can easily pick data from a single tablet by using the range in tablet metadata for a given tablet. E.g.: auto erm = table.get_effective_replication_map(); auto& tm = erm->get_token_metadata(); auto tablet_map = tm.tablets().get_tablet_map(table.schema()->id()); for (auto tid : tablet_map.tablet_ids()) { auto tr = tablet_map.get_token_range(tid); auto ssts = co_await table.take_storage_snapshot(tr); ... } Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #15128	2023-08-25 13:06:02 +02:00
Patryk Jędrzejczak	9806bddf75	test: fix a test case in raft_address_map_test The test didn't test what it was supposed to test. It would pass even if set_nonexpiring() didn't insert a new entry. Closes #15157	2023-08-25 12:11:33 +02:00
Kefu Chai	d2d1141188	sstables: writer: delegate flush() in checksummed_file_data_sink_impl before this change, `checksummed_file_data_sink_impl` just inherits the `data_sink_impl::flush()` from its parent class. but as a wrapper around the underlying `_out` data_sink, this is not only an unusual design decision in a layered design of an I/O system, but also could be problematic. to be more specific, the typical user of `data_sink_impl` is a `data_sink`, whose `flush()` member function is called when the user of `data_sink` wants to ensure that the data sent to the sink is pushed to the underlying storage / channel. this in general works, as the typical user of `data_sink` is in turn `output_stream`, which calls `data_sink.flush()` before closing the `data_sink` with `data_sink.close()`. and the operating system will eventually flush the data after the application closes the corresponding fd. to be more specific, almost none of the popular local filesystems implement the file_operations.op, hence, it's safe even if the `output_stream` does not flush the underlying data_sink after writing to it. this is the use case when we write to sstables stored on a local filesystem. but as explained above, if the data_sink is backed by a network filesystem, a layered filesystem or a storage connected via a buffered network device, then it is crucial to flush in a timely manner, otherwise we could risk data loss if the application / machine / network breaks when the data is considered persisted but is _not_! but the `data_sink` returned by `client::make_upload_jumbo_sink` is a little bit different. multipart upload is used under the hood, and we have to finalize the upload once all the parts are uploaded by calling `close()`. but if the caller fails / chooses to close the sink before flushing it, the upload is aborted, and the partially uploaded parts are deleted. 
the default-implemented `checksummed_file_data_sink_impl::flush()` breaks `upload_jumbo_sink` which is the `_out` data_sink being wrapped by `checksummed_file_data_sink_impl`. as the `flush()` calls are short-circuited by the wrapper, the `close()` call always aborts the upload. that's why the data and index components just fail to upload with the S3 backend. in this change, we just delegate the `flush()` call to the wrapped class. Fixes #15079 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15134	2023-08-24 18:03:10 +03:00
Raphael S. Carvalho	d6cc752718	test: Fix flakiness in sstable_compaction_test.autocompaction_control_test It's possible that compaction task is preempted after completion and before reevaluation, causing pending_tasks to be > 1. Let's only exit the loop if there are no pending tasks, and also reduce 100ms sleep which is an eternity for this test. Fixes #14809. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #15059	2023-08-24 13:37:06 +03:00
Benny Halevy	4a2e367e92	storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler On this path, `create_write_response_handler` accepts an `inet_address_vector_replica_set` that corresponds to the effective_replication_map_ptr in the paxos_response_handler, but the function currently retrieves a new effective_replication_map_ptr that may not hold all of those endpoints. Fixes scylladb/scylladb#15138 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-24 11:45:13 +03:00
Benny Halevy	6af0b281a6	storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler The effective_replication_map_ptr passed to `create_write_response_handler` by `send_batchlog_mutation` must be synchronized with the one used to calculate _batchlog_endpoints to ensure they use the same topology. Fixes scylladb/scylladb#15147 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-24 10:43:40 +03:00
Benny Halevy	098dd5021a	storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context The batchlog mutation is for system.batchlog. Rather than looking the schema up in multiple places do that once and keep it in the context object. It will be used in the next patch to get a respective effective_replication_map_ptr. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-24 10:43:23 +03:00
Benny Halevy	27c33015a5	storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found Return error in production rather than crashing as in https://github.com/scylladb/scylladb/issues/15138 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-24 08:59:38 +03:00
Kefu Chai	2f17b76df7	docs/operating-scylla/admin-tools: add note on deprecating sstabledump sstabledump is deprecated in favor of `scylla sstable` commands. so let's reflect this in the document. Fixes #15020 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #15021	2023-08-24 08:31:29 +03:00
Botond Dénes	1609c76d62	tools/scylla-sstable: scrub: don't qurantine sstables after validate Scylla sstable promises to never mutate its input sstables. This promise was broken by `scylla sstable scrub --scrub-mode=validate`, because validate moves invalid input sstables into quarantine. This is unexpected and caused occasional failures in the scrub tests in test_tools.py. Fix by propagating a flag down to `scrub_sstables_validate_mode()` in `compaction.cc`, specifying whether validate should quarantine invalid sstables, then set this flag to false in `scylla-sstable.cc`. The existing test for validate-mode scrub is amended to check that the sstable is not mutated. The test now fails before the fix and passes afterwards. Fixes: #14309 Closes #15139	2023-08-23 21:53:12 +03:00
Kamil Braun	93be4c0cb0	Merge 'Base node liveliness consistently on gossiper::is_alive' from Benny Halevy Currently the gossiper marks endpoint_state objects as alive/dead. In some cases the endpoint_state::is_alive function is checked but in many other cases gossiper::is_alive(endpoint) is used to determine if the endpoint is alive. This series removes the endpoint_state::is_alive state and moves all the logic to gossiper::is_alive that bases its decision on the endpoint having an endpoint_state and being in the _live_endpoints set. For that, the _live_endpoints is made sure to be replicated to all shards when changed and the endpoint_state changes are serialized under lock_endpoint, and also making sure that the endpoint_state in the _endpoint_states_map is never updated in place, but rather a temporary copy is changed and then safely replicated using gossiper::replicate Refs https://github.com/scylladb/scylladb/issues/14794 Closes #14801 * github.com:scylladb/scylladb: gossiper: mark_alive: remove local_state param endpoint_state: get rid of _is_alive member and methods gossiper: is_alive: use _live_endpoints gossiper: evict_from_membership: erase endpoint from _live_endpoints gossiper: replicate_live_endpoints_on_change: use _live_endpoints_version to detect change gossiper: run: no need to replicate live_endpoints gossiper: fold update_live_endpoints_version into replicate_live_endpoints_on_change gossiper: add mutate_live_and_unreachable_endpoints gossiper: reset_endpoint_state_map: clear also shadow endpoint sets gossiper: reset_endpoint_state_map: clear live/unreachable endpoints on all shards gossiper: functions that change _live_endpoints must be called on shard 0 gossiper: add lock_endpoint_update_semaphore gossiper: make _live_endpoints an unordered_set endpoint_state: use gossiper::is_alive externally	2023-08-23 17:18:05 +02:00
Gleb Natapov	d1654ccdda	storage_service: register schema version observer before joining group0 and starting gossiper The schema version is updated by group0, so if group0 starts before schema version observer is registered some updates may be missed. Since the observer is used to update node's gossiper state the gossiper may contain wrong schema version. Fix by registering the observer before starting group0 and even before starting gossiper to avoid a theoretical case that something may pull schema after start of gossiping and before the observer is registered. Fixes: #15078 Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>	2023-08-23 17:11:51 +02:00
Patryk Jędrzejczak	ef2eac9941	raft topology: make every type in request_param a named struct We make every alternative type in the request_param variant a named struct to make the code more readable. Additionally, this change will make extending request parameters easier if we decide to do so in the future. Closes #15132	2023-08-23 16:56:00 +02:00
Patryk Jędrzejczak	7eab9f8a02	raft_removenode: remove "raft topology" from errors Some runtime errors thrown in storage_service::raft_removenode start with the "raft topology " prefix. Since "raft topology" is an implementation detail, we don't want to throw this information through the user API. Only logs should contain it. Closes #15136	2023-08-23 16:20:14 +02:00
Nadav Har'El	5530c529c2	test/cql-pytest: regression test for old bug with CAST(f AS TEXT) precision When casting a float or double column to a string with `CAST(f AS TEXT)`, Scylla is expected to print the number with enough digits so that reading that string back to a float or double restores the original number exactly. This expectation isn't documented anywhere, but makes sense, and is what Cassandra does. Before commit `71bbd7475c`, this wasn't the case in Scylla: `CAST(f AS TEXT)` always printed 6 digits of precision, which was a bit under enough for a float (which can have 7 decimal digits of precision), but very much not enough for a double (which can need 15 digits). The origin of this magic "6 digits" number was that Scylla uses seastar::to_sstring() to print the float and double values, and before the aforementioned commit those functions used sprintf with the "%g" format - which always prints 6 decimal digits of precision! After that commit, to_sstring() now uses a different approach (based on fmt) to print the float and double values, that prints all significant digits. This patch adds a regression test for this bug: We write float and double values to the database, cast them to text, and then recover the float or double number from that text - and check that we get back exactly the same float or double object. The test fails before the aforementioned commit, and passes after it. It also passes on Cassandra. Refs #15127 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #15131	2023-08-23 16:06:52 +03:00
Botond Dénes	e7af2a7de8	Merge 'token_metadata::get_endpoint_to_host_id_map_for_reading: restrict to token owners' from Benny Halevy And verify that the returned host_id isn't null. Call on_internal_error_noexcept in that case since all token owners are expected to have their host_id set. Aborting in testing would help fix issues in this area. Fixes scylladb/scylladb#14843 Refs scylladb/scylladb#14793 Closes #14844 * github.com:scylladb/scylladb: api: storage_service: improve description of /storage_service/host_id token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners	2023-08-23 13:55:14 +03:00
Botond Dénes	139ba553b8	Merge 'sstable, test: log sstable name and pk when capping local_deletion_time ' from Kefu Chai in this series, we also print the sstable name and pk when writing a tombstone whose local_deletion_time (ldt for short) is greater than INT32_MAX which cannot be represented by an uint32_t. Fixes #15015 Closes #15107 * github.com:scylladb/scylladb: sstable/writer: log sstable name and pk when capping ldt test: sstable_compaction_test: add a test for capped tombstone ldt	2023-08-23 09:29:54 +03:00
Botond Dénes	f7505405f0	scylla-gdb.py: use for_each_table() everywhere scylla-gdb.py has two methods for iterating over all tables: * all_tables() * for_each_table() Despite this, many places in the code iterate over the column family map directly. This patch leaves just a single method (for_each_table()) and migrates all the codebase to use it, instead of iterating over the raw map. While at it, the access to the map is made backward compatible with pre `52afd9d42d` code, said commit wrapped database::_column_families in tables_metadata object. This broke scylla-gdb.py for older versions. Closes #15121	2023-08-22 20:39:31 +03:00
Kamil Braun	169d19e5b0	Merge 'raft topology: support --ignore-dead-nodes in removenode and replace' from Patryk Jędrzejczak We add support for `--ignore-dead-nodes` in `raft_removenode` and `--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow passing only host ids of the ignored nodes. Supporting IPs is currently impossible because `raft_address_map` doesn't provide a mapping from IP to a host id. The main steps of the implementation are as follows: - add the `ignore_nodes` column to `system.topology`, - set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`, - extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes, - load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`, - add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`, - pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`. Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes. Fixes #15025 Closes #15113 * github.com:scylladb/scylladb: test: add test_raft_ignore_nodes test: ManagerClient.remove_node: allow List[HostId] for ignore_dead raft topology: pass ignore_nodes to {replace, remove}_with_repair raft topology: exec_global_command: add ignore_nodes to exclude_nodes raft topology: exec_global_command: change type of exclude_nodes topology_state_machine: extend request_param with a set of raft ids raft topology: set ignore_nodes in raft_removenode and raft_replace utils: introduce split_comma_separated_list raft topology: add the ignore_nodes column to system.topology	2023-08-22 18:04:59 +02:00
Kamil Braun	cdc3cd2b79	Merge 'raft: add fencing tests' from Petr Gusev In this PR a simple test for fencing is added. It exercises the data plane, meaning if it somehow happens that the node has a stale topology version, then requests from this node will get an error 'stale topology'. The test just decrements the node version manually through CQL, so it's quite artificial. To test a more real-world scenario we need to allow the topology change fiber to sometimes skip unavailable nodes. Now the algorithm fails and retries indefinitely in this case. The PR also adds some logs, and removes one seemingly redundant topology version increment, see the commit messages for details. Closes #14901 * github.com:scylladb/scylladb: test_fencing: add test_fence_hints test.py: output the skipped tests test.py: add skip_mode decorator and fixture test.py: add mode fixture hints: add debug log for dropped hints hints: send_one_hint: extend the scope of file_send_gate holder pylib: add ScyllaMetrics hints manager: add send_errors counter token_metadata: add debug logs fencing: add simple data plane test random_tables.py: add counter column type raft topology: don't increment version when transitioning to node_state::normal	2023-08-22 16:28:21 +02:00
Piotr Grabowski	17e3e367ca	test: use more frequent reconnection policy The default reconnection policy in Python Driver is an exponential backoff (with jitter) policy, which starts at 1 second reconnection interval and ramps up to 600 seconds. This is a problem in tests (refs #15104), especially in tests that restart or replace nodes. In such a scenario, a node can be unavailable for an extended period of time and the driver will try to reconnect to it multiple times, eventually reaching very long reconnection interval values, exceeding the timeout of a test. Fix the issue by using an exponential reconnection policy with a maximum interval of 4 seconds. A smaller value was not chosen, as each retry clutters the logs with reconnection exception stack trace. Fixes #15104 Closes #15112	2023-08-22 15:40:39 +02:00
Avi Kivity	d944872d19	Merge 'Prevent reactor stalls in to_repair_rows_list' from Benny Halevy This short series deals with two stall sources in row-level repair `to_repair_rows_list`: 1. Freeing the input `repair_rows_on_wire` in one shot on return (as seen in https://github.com/scylladb/scylladb/issues/14537) 2. Freeing the result `row_list` in one shot on error. This hasn't been seen in testing but I have no reason to believe it is not susceptible to stalls exactly like repair_rows_on_wire with the same number of rows and mutations. Fixes https://github.com/scylladb/scylladb/issues/14537 Closes #15102 * github.com:scylladb/scylladb: repair: reindent to_repair_rows_list repair: to_repair_rows_list: clear_gently on error repair: to_repair_rows_list: consume frozen rows gently	2023-08-22 15:29:37 +03:00
Patryk Jędrzejczak	b044ee535f	test: add test_raft_ignore_nodes We add two tests verifying that --ignore-dead-nodes in raft_removenode and --ignore-dead-nodes-for-replace in raft_replace are handled correctly. We need a 7-cluster to have a Raft majority. Therefore, these tests are quite slow, and we want to run them only in the dev mode.	2023-08-22 14:19:21 +02:00
Patryk Jędrzejczak	6818d13f7d	test: ManagerClient.remove_node: allow List[HostId] for ignore_dead ManagerClient.remove_node allows passing ignore_dead only as List[IPAddress]. However, raft_removenode currently supports only host ids. To write a test that passes ignore_dead to ManagerClient.remove_node in the Raft topology mode, we allow passing ignore_dead as List[HostId]. Note that we don't want to use List[IPAddress \| HostId] because mixing IP addresses and host ids fails anyway. See ss::remove_node.set(...) in api::set_storage_service.	2023-08-22 14:19:09 +02:00
Patryk Jędrzejczak	26ad527666	raft topology: pass ignore_nodes to {replace, remove}_with_repair To properly stream ranges during the removenode or replace operation in the Raft topology mode, we pass IPs of the ignored nodes to replace_with_repair and remove_with_repair in storage_service::raft_topology_cmd_handler.	2023-08-22 14:18:39 +02:00
Patryk Jędrzejczak	e685182290	raft topology: exec_global_command: add ignore_nodes to exclude_nodes We add ignore_nodes to exclude_nodes in exec_global_command to ignore nodes marked as dead by --ignore-dead-nodes for raft_removenode and --ignore-dead-nodes-for-replace for raft_replace.	2023-08-22 14:18:37 +02:00
Patryk Jędrzejczak	5ebee35f99	raft topology: exec_global_command: change type of exclude_nodes We extend exclude_nodes in exec_global_command with ignore_nodes in the next commit. Since we already use std::unordered_set to store ids of the ignored nodes and their number is unknown, we change the type of exclude_nodes from utils::small_vector to std::unordered_set.	2023-08-22 14:17:55 +02:00
Patryk Jędrzejczak	1f57d80ba1	topology_state_machine: extend request_param with a set of raft ids We add two new alternative types to service::request_param: removenode_param and replace_param. They allow storing the list of ignored nodes loaded from the ignore_nodes column of system.topology. We also remove the raft::server_id type because it has been only used by the replace operation.	2023-08-22 14:17:37 +02:00
Patryk Jędrzejczak	7d3dc306eb	raft topology: set ignore_nodes in raft_removenode and raft_replace To handle --ignore-dead-nodes in raft_removenode and --ignore-dead-nodes-for-replace in raft_replace, we set the ignore_nodes value of the topology mutation in these functions. In the following commits, we ensure that the topology coordinator properly makes use of it.	2023-08-22 14:13:51 +02:00
Petr Gusev	1ddc76ffd1	test_fencing: add test_fence_hints The test makes a write through the first node with the third node down, this causes a hint to be stored on the first node for the second. We increment the version and fence_version on the third node, restart it, and expect to see a hint delivery failure because of the version mismatch. Then we update the versions of the first node and expect the hint to be successfully delivered.	2023-08-22 15:48:40 +04:00
Petr Gusev	3ccd2abad4	test.py: output the skipped tests The pytest option -rs forces it to print all the skipped tests along with the reasons. Without this option we can't tell why certain tests were skipped; maybe some of them shouldn't be skipped anymore.	2023-08-22 15:48:40 +04:00
Petr Gusev	c434d26b36	test.py: add skip_mode decorator and fixture Syntactic sugar for marking tests to be skipped in a particular mode. There is skip_in_debug/skip_in_release in suite.yaml, but they can be applied only on the entire file, which is unnatural and inconvenient. Also, they don't allow specifying a reason why the test is skipped. Separate dictionary skipped_funcs is needed since we can't use pytest fixtures in decorators.	2023-08-22 15:48:40 +04:00
Petr Gusev	a639d161e6	test.py: add mode fixture Sometimes a test wants to know what mode it is running in so that e.g. it can skip itself in some of them.	2023-08-22 15:48:40 +04:00
Petr Gusev	439c91851f	hints: add debug log for dropped hints Dropping data is a rather important event, let's log it at least at the debug level. It'll help in debugging tests.	2023-08-22 15:48:40 +04:00
Petr Gusev	9fd3df13a2	hints: send_one_hint: extend the scope of file_send_gate holder The problem was that the holder in with_gate call was released too early. This happened before the possible call to on_hint_send_failure in then_wrapped. As a result, the effects of on_hint_send_failure (segment_replay_failed flag) were not visible in send_one_file after ctx_ptr->file_send_gate.close(), so we could decide that the segment was sent in full and delete it even if sending of some hints led to errors. Fixes #15110	2023-08-22 15:48:40 +04:00
Petr Gusev	0b7a90dff6	pylib: add ScyllaMetrics This patch adds facilities to work with Scylla metrics from test.py tests. The new metrics property was added to ManagerClient, its query method sends a request to Scylla metrics endpoint and returns an object to conveniently access the result. ScyllaMetrics is copy-pasted from test_shedding.py. It's difficult to reuse code between 'new' and 'old' styles of tests, we can't just import pylib in 'old' tests because of some problems with python search directories. A past commit of mine that attempted to solve this problem was rejected on review.	2023-08-22 14:31:04 +04:00
Petr Gusev	1b7603af23	hints manager: add send_errors counter There was no indication of problems in the hints manager metrics before. We need this counter for fencing tests in the later commit, but it seems to be useful on its own.	2023-08-22 14:31:04 +04:00
Petr Gusev	fa25e6d63e	token_metadata: add debug logs We log the new version when the new token metadata is set. Also, the log for fence_version is moved in shared_token_metadata from storage_service for uniformity.	2023-08-22 14:31:04 +04:00
Petr Gusev	360453fd87	fencing: add simple data plane test The test starts a three node cluster and manually decrements the version on the last node. It then tries to write some data through the last node and expects to get 'stale topology' exception.	2023-08-22 14:31:01 +04:00
Benny Halevy	801987ab19	gossiper: mark_alive: remove local_state param It is not used anymore. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-22 12:06:45 +03:00
Benny Halevy	75d1dd3a76	endpoint_state: get rid of _is_alive member and methods Now that gossiper bases its is_alive status on _live_endpoints. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-22 12:06:45 +03:00
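
Commit 5530c529c2 in the list above explains that printing a float or double with the "%g" format keeps only 6 significant digits, so the text produced by `CAST(f AS TEXT)` could not be parsed back into the original number. The following is a minimal sketch of that effect in plain Python, not Scylla's or Seastar's code. The helper names are hypothetical, and the "%.9g" and "%.17g" formats are only stand-ins for "enough digits" (9 and 17 significant digits always suffice for IEEE 754 single and double precision); they are not the fmt-based approach the commit mentions.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (an IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_trips_double(fmt: str, x: float) -> bool:
    """True if formatting x with fmt and parsing the text restores x exactly."""
    return float(fmt % x) == x


def round_trips_float32(fmt: str, x: float) -> bool:
    """Same check for a value that is stored as a single-precision float."""
    return to_float32(float(fmt % x)) == x


# Illustrative values only: a double and a float that need more than 6 digits.
double_value = 0.1 + 0.2
float_value = to_float32(1.0 / 3.0)

for fmt in ("%g", "%.9g", "%.17g"):
    print(fmt, fmt % double_value, round_trips_double(fmt, double_value))

for fmt in ("%g", "%.9g"):
    print(fmt, fmt % float_value, round_trips_float32(fmt, float_value))
```

The regression test described in the commit performs the same kind of check end to end: write the value, cast it to text, parse the text, and compare with the original value.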

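Commit d2d1141188 in the list above describes a wrapper sink whose inherited `flush()` never reached the wrapped sink, so a multipart upload was always aborted on `close()`. The sketch below is a toy Python model of that situation under stated assumptions, not the Seastar or Scylla classes: `UploadSink`, `ChecksummedSink`, `write_and_close` and the `delegate_flush` switch are all hypothetical names introduced for illustration, and CRC32 is an arbitrary choice of checksum.

```python
import zlib


class UploadSink:
    """Hypothetical stand-in for a multipart-upload sink.

    Closing it without a preceding flush() aborts the upload,
    mirroring the behaviour described in the commit message.
    """

    def __init__(self) -> None:
        self.parts = []
        self.flushed = False
        self.state = "open"

    def put(self, data: bytes) -> None:
        self.parts.append(data)
        self.flushed = False  # new data has not been flushed yet

    def flush(self) -> None:
        self.flushed = True

    def close(self) -> None:
        self.state = "committed" if self.flushed else "aborted"


class ChecksummedSink:
    """Hypothetical wrapper that checksums data on its way to the wrapped sink."""

    def __init__(self, out: UploadSink, delegate_flush: bool) -> None:
        self._out = out
        self._delegate_flush = delegate_flush
        self.checksum = 0

    def put(self, data: bytes) -> None:
        self.checksum = zlib.crc32(data, self.checksum)
        self._out.put(data)

    def flush(self) -> None:
        # Before the fix the wrapper's flush() did nothing; after it, the call is delegated.
        if self._delegate_flush:
            self._out.flush()

    def close(self) -> None:
        self._out.close()


def write_and_close(delegate_flush: bool) -> str:
    """Write, flush and close through the wrapper, as an output stream would."""
    inner = UploadSink()
    sink = ChecksummedSink(inner, delegate_flush)
    sink.put(b"data component")
    sink.flush()
    sink.close()
    return inner.state


print(write_and_close(delegate_flush=False))
print(write_and_close(delegate_flush=True))
```

In this model the caller behaves identically in both runs; only the wrapper's `flush()` differs, which is why the fix is confined to the wrapper.
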
38529 Commits