scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 18:40:38 +00:00

Author	SHA1	Message	Date
Botond Dénes	2f8d77e97b	replica/table: add optional compacting to make_multishard_streaming_reader() Doing to make_multishard_streaming_reader() what the previous commit did to make_streaming_reader(). In fact, the new compaction_time parameter is simply forwarded to the make_streaming_reader() on the shard readers. Call sites are updated, but none opt in just yet.	2023-07-27 03:22:11 -04:00
Botond Dénes	42b0dd5558	replica/table: add optional compacting to make_streaming_reader() Opt-in is possible by passing an engaged `compaction_time` (gc_clock::time_point) to the method. When this new parameter is disengaged, no compaction happens. Note that there is a global override, via the enable_compacting_data_for_streaming_and_repair config item, which can force-disable this compaction. Compaction done on the output of the streaming reader does not garbage-collect tombstones! All call-sites are adjusted (the new parameter is not defaulted), but none opt in yet. This will be done in separate commit per user.	2023-07-27 03:22:11 -04:00
Botond Dénes	9e3987fc96	db/config: add config item for enabling compaction for streaming and repair Compacting can greatly reduce the amount of data to be processed by streaming and repair, but with certain data shapes, its effectiveness can be reduced and its CPU overhead might outweigh the benefits. This should very rarely be the case, but leave an off switch in case this becomes a problem in a deployment. Not wired yet.	2023-07-27 03:22:11 -04:00
Botond Dénes	a22446afe0	repair: log the error which caused the repair to fail Instead of just a boolean _failed flag, persist the error message of the exception which caused the repair to fail, and include it in the log message announcing the failure.	2023-07-27 03:22:11 -04:00
Botond Dénes	ac44efea11	readers: compacting_reader: use compact_mutation_state::abandon_current_partition() When next_partition() or fast_forward_to() is called. Instead of trying to simulate a properly closed partition by injecting synthetic mutation fragments to properly close it.	2023-07-27 02:50:44 -04:00
Botond Dénes	326c3b92e5	mutation/mutation_compactor: allow user to abandon current partition Currently, the compactor requires a valid stream and thus abandoning a partition in the middle was not possible. This causes some complications for the compacting reader, which implements methods such as `next_partition()` which is possibly called in the middle of a partition. In this case the compacting reader attempts to close the partition properly by inserting a synthetic partition-end fragment into the stream. This is not enough however as it doesn't close any range tombstone changes that might be active. Instead of piling on more complexity, add an API to the compactor which allows abandoning the current partition.	2023-07-27 02:50:44 -04:00
Raphael S. Carvalho	050ce9ef1d	cached_file: Evict unused pages that aren't linked to LRU yet It was found that cached_file dtor can hit the following assert after OOM: `cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion '_cache.empty()' failed.` cached_file's dtor iterates through all entries and evicts those that are linked to LRU, under the assumption that all unused entries were linked to LRU. That's partially correct. get_page_ptr() may fetch more than 1 page due to read ahead, but it will only call cached_page::share() on the first page, the one that will be consumed now. share() is responsible for automatically placing the page into LRU once refcount drops to zero. If the read is aborted midway, before cached_file has a chance to hit the 2nd page (read ahead) in cache, it will remain there with refcount 0 and unlinked to LRU, in hope that a subsequent read will bring it out of that state. Our main user of cached_file is per-sstable index caching. If the scenario above happens, and the sstable and its associated cached_file is destroyed, before the 2nd page is hit, cached_file will not be able to clear all the cache because some of the pages are unused and not linked. A page read ahead will now be linked into LRU so it doesn't sit in memory indefinitely. This also allows the cached_file dtor to clear all cache if some of those pages brought in advance aren't fetched later. A reproducer was added. Fixes #14814. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14818	2023-07-27 00:01:46 +02:00
Anna Stuchlik	3ed6754afc	doc: update info about cassandra superuser Fixes https://github.com/scylladb/scylla-docs/issues/4028 The goal of this update is to discourage the use of the default cassandra superuser in favor of a custom super user - and explain why it's a good practice. The scope of this commit: - Adding a new page on creating a custom superuser. The page collects and clarifies the information about the cassandra superuser from other pages. - Remove the (incomplete) information about superuser from the Authorization and Authentication pages, and add the link to the new page instead. Additionally, this update will result in better searchability and ensures language clarity. Closes #14829	2023-07-26 23:15:31 +03:00
Avi Kivity	615544a09a	Merge 'Init messaging service preferred IP cache via config' from Pavel Emelyanov This is to make m.s. initialization more solid and simplify sys.ks.::setup() Closes #14832 * github.com:scylladb/scylladb: system_keyspace: Remove unused snitch arg from setup() messaging_service: Setup preferred IPs from config	2023-07-26 22:12:28 +03:00
Nadav Har'El	59c1498338	test/alternator: don't forget to delete tables on test failures Most of the Alternator tests are careful to unconditionally remove the test tables, even if the test fails. This is important when testing on a shared database (e.g., DynamoDB) but also useful to make clean shutdown faster as there should be no user table to flush. We missed a few such cases in test_gsi.py, and this patch corrects them. We do this by using the context manager new_test_table() - which automatically deletes the table when done - instead of the function create_test_table() which needs an explicit delete at the end. There are no functional changes in this patch - most of the lines changed are just reindents. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14835	2023-07-26 21:51:22 +03:00
Benny Halevy	1e7e2eeaee	gossiper: mark_alive: use deferred_action to unmark pending Make sure _pending_mark_alive_endpoints is unmarked in any case, including exceptions. Fixes #14839 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14840	2023-07-26 21:24:56 +03:00
Nadav Har'El	056d04954c	Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions. This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed. Fixes: #14819 Closes #14821 * github.com:scylladb/scylladb: test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations db/view/view_updating_consumer: account for the size of mutations mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() mutation/mutation: add memory_usage()	2023-07-26 20:04:28 +03:00
Pavel Emelyanov	6b82071064	system_keyspace: Remove unused snitch arg from setup() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-26 16:05:26 +03:00
Pavel Emelyanov	0fba57a3e8	messaging_service: Setup preferred IPs from config Population of messaging service preferred IPs cache happens inside system keyspace setup() call and it needs m.s. per se and additionally snitch. Moving preferred ip cache to initial configuration keeps m.s. start more self-contained and keeps system_keyspace::setup() simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-07-26 16:03:23 +03:00
Nadav Har'El	d2ca600eec	test/*/run: kill Scylla with SIGTERM Today, test/*/run always kills Scylla at the end of the test with SIGKILL (kill -9), so the Scylla shutdown code doesn't run. It was believed that a clean shutdown would take a long time, but in fact, it turns out that 99% of the shutdown time was a silly sleep in the gossip code, which this patch disables with the "--shutdown-announce-in-ms" option. After enabling this option, clean shutdown takes (in a dev build on my laptop) just 0.02 seconds. It's worth noting that this shutdown has no real work to do - no tables to flush, and so on, because the pytest framework removes all the tables in its own fixture cleanup phase. So in this patch, to kill Scylla we use SIGTERM (15) instead of SIGKILL. We then wait until a timeout of 10 seconds (much much more than 0.02 seconds!) for Scylla to exit. If for some reason it didn't exit (e.g., it hung during the shutdown), it is killed again with SIGKILL, which is guaranteed to succeed. This change gives us two advantages: 1. Every test run with test/*/run exercises the shutdown path. It is perhaps excessive, but since the shutdown is so quick, there is no big downside. 2. In a test-coverage run, a clean shutdown allows flushing the counter files, which wasn't possible when Scylla was killed with KILL -9. Fixes #8543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14825	2023-07-26 14:06:24 +03:00
Avi Kivity	ff1f461a42	Merge 'Introduce tablet load balancer' from Tomasz Grabiec After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details. The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally. The load balancer will also be kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced. Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration. Closes #14601 * github.com:scylladb/scylladb: tests: test_tablets: Add test for bootstrapping a node storage_service: topology_coordinator: Implement tablet migration state machine tablets: Introduce tablet_mutation_builder service: tablet_allocator: Introduce tablet load balancer tablets: Introduce tablet_map::for_each_tablet() topology: Introduce get_node() token_metadata: Add non-const getter of tablet_metadata storage_service: Notify topology state machine after applying schema change storage_service: Implement stream_tablet RPC tablets: Introduce global_tablet_id stream_transfer_task, multishard_writer: Work with table sharder tablets: Turn tablet_id into a struct db: Do not create per-keyspace erm for tablet-based tables tablets: effective_replication_map: Take transition stage into account when computing replicas tablets: Store "stage" in transition info doc: Document tablet migration state machine and load balancer locator: erm: Make get_endpoints_for_reading() always return read replicas storage_service: topology_coordinator: Sleep on failure between retries storage_service: topology_coordinator: Simplify coordinator loop main: Require experimental raft to enable tablets	2023-07-26 12:30:29 +03:00
Botond Dénes	d0f725c1b9	test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations A test reproducing #14819, that is, the view update builder not flushing the buffer when only empty partitions are consumed (with only a tombstone in them).	2023-07-26 03:09:53 -04:00
Botond Dénes	d66b07823b	db/view/view_updating_consumer: account for the size of mutations All partitions will have a corresponding mutation object in the buffer. These objects have non-negligible sizes, yet the consumer did not bump the _buffer_size when a new partition was consumed. This resulted in empty partitions not moving the _buffer_size at all, and thus they could accumulate without bounds in the buffer, never triggering a flush just by themselves. We have recently seen this causing OOM. This patch fixes that by bumping the _buffer_size with the size of the freshly created mutation object.	2023-07-26 03:07:25 -04:00
Botond Dénes	ad2ddffb22	Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from - proxy::remote::handle_truncate() - schema_tables::merge_schema() - legacy_schema_migrator - tests All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and use query_processor reference from system_keyspace object instead of global qctx Closes #14778 * github.com:scylladb/scylladb: system_keyspace: Make save_truncation_record() non-static code: Pass sharded<db::system_keyspace>& to database::truncate() db: Add sharded<system_keyspace>& to legacy_schema_migrator	2023-07-26 08:48:49 +03:00
Benny Halevy	90b2e6515c	gossiper: mark_alive: enter background_msg gate The function dispatches a background operation that must be waited on in stop(). Fixes scylladb/scylladb#14791 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14797	2023-07-26 00:51:22 +02:00
Tomasz Grabiec	ae8ffe23fc	tests: test_tablets: Add test for bootstrapping a node	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	f0b9dcee04	storage_service: topology_coordinator: Implement tablet migration state machine See the documentation in topology-over-raft.md for description of the mechanism.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	5c681a1d63	tablets: Introduce tablet_mutation_builder	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	6f4a35f9ae	service: tablet_allocator: Introduce tablet load balancer Will be invoked by the topology coordinator later to decide which tablets to migrate.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	d59b8d316c	tablets: Introduce tablet_map::for_each_tablet()	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	0e3eac29d0	topology: Introduce get_node()	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	f2fdf37415	token_metadata: Add non-const getter of tablet_metadata Needed for tests.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	1885f94474	storage_service: Notify topology state machine after applying schema change Table construction may allocate tablets which may need rebalancing. Notify topology change coordinator to invoke the load balancer.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	6d545b2f9e	storage_service: Implement stream_tablet RPC Performs streaming of data for a single tablet between two tablet replicas. The node which gets the RPC is the receiving replica.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	e3a8bb7ec9	tablets: Introduce global_tablet_id Identifies tablet in the scope of the whole cluster. Not to be confused with tablet replicas, which all share global_tablet_id. Will be needed by load balancer and tablet migration algorithm to identify tablets globally.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	f88220aeee	stream_transfer_task, multishard_writer: Work with table sharder So that we can use it on tablet-based tables.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	8cf92d4c86	tablets: Turn tablet_id into a struct The IDL compiler cannot deal with enum classes like this.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	c2b18ae483	db: Do not create per-keyspace erm for tablet-based tables This erm is not updated when replicating token metadata in storage_service::replicate_to_all_cores() so will pin token metadata version and prevent token metadata barrier from finishing. It is not necessary to have per-keyspace erm for tablet-based tables, so just don't create it.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	91dee5c872	tablets: effective_replication_map: Take transition stage into account when computing replicas	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	dc2ec3f81c	tablets: Store "stage" in transition info It's needed to implement tablet migration. It stores the current step of tablet migration state machine. The state machine will be advanced by the topology change coordinator. See the "Tablet migration" section of topology-over-raft.md	2023-07-25 21:08:02 +02:00
Tomasz Grabiec	05519bd5e5	doc: Document tablet migration state machine and load balancer	2023-07-25 21:08:02 +02:00
Tomasz Grabiec	7851694eaa	locator: erm: Make get_endpoints_for_reading() always return read replicas Just a simplification. Drop the test case from token_metadata which creates pending endpoints without normal tokens. It fails after this change with exception: "sorted_tokens is empty in first_token_index!" thrown from token_metadata::first_token_index(), which is used when calculating normal endpoints. This test case is not valid, first node inserts its tokens as normal without going through bootstrap procedure.	2023-07-25 21:08:01 +02:00
Tomasz Grabiec	b642e69eb3	storage_service: topology_coordinator: Sleep on failure between retries Avoid failing in a tight loop. Can happen if some node is down, for example.	2023-07-25 21:08:01 +02:00
Tomasz Grabiec	f0e9dbf911	storage_service: topology_coordinator: Simplify coordinator loop This refactoring removes a boolean and branching which makes it easier to reason about the flow, and easier to extend it with more steps.	2023-07-25 21:08:01 +02:00
Tomasz Grabiec	b294932cf1	main: Require experimental raft to enable tablets Tablets depend on the topology changes on raft feature. Drop "tablets" from suite.yaml of the topology/ suite, which doesn't use tablets anymore.	2023-07-25 21:08:01 +02:00
Botond Dénes	fda4168300	mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() To allow const access to the mutation under construction, e.g. so the user can query its size.	2023-07-25 10:34:31 -04:00
Botond Dénes	e6fa21d1b3	mutation/mutation: add memory_usage()	2023-07-25 10:34:30 -04:00
Pavel Emelyanov	c46c57d535	messaging_service: Clear list of clients on shutdown When messaging_service shuts down it first sets _shutting_down to true and proceeds with stopping clients and servers. Stopping clients, in turn, is calling client.stop() on each. Setting _shutting_down is used in two places. First, when a client is stopped it may happen that it's in the middle of some operation, which may result in call to remove_error_rpc_client() and not to call .stop() for the second time it just does nothing if the shutdown flag is set (see `357c91a076`). Second, get_rpc_client() asserts that this flag is not set, so once shutdown started it can make sure that it will call .stop() on _all_ clients and no new ones would appear in parallel. However, after shutdown() is complete the _clients vector of maps remains intact even though all clients from it are stopped. This is not very debugging-friendly; it is better to remove the clients on shutdown. fixes: #14624 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14632	2023-07-25 13:08:20 +03:00
Botond Dénes	ed025890e5	scripts/coverage.py: --run: swallow KeyboardInterrupt It is quite common to stop a tested scylla process with ^C, which will raise KeyboardInterrupt from subprocess.run(). Catch and swallow this exception, allowing the post-processing to continue. The interrupted process has to handle the interrupt correctly too -- flush the coverage data even on premature exit -- but this is for another patch. Closes #14815	2023-07-25 12:29:22 +03:00
Kefu Chai	2943d3c1b0	tools/scylla-sstable: s/foo.find(bar) != foo.end()/foo.count(bar) != 0/ just for better readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14816	2023-07-25 11:38:44 +03:00
Raphael S. Carvalho	0ac43ea877	Fix stack-use-after-return in mutation source excluding staging The new test detected a stack-use-after-return when using table's as_mutation_source_excluding_staging() for range reads. This doesn't really affect view updates that generate single key reads only. So the problem was only stressed in the recently added test. Otherwise, we'd have seen it when running dtests (in debug mode) that stress the view update path from staging. The problem happens because the closure was fed into a noncopyable_function that was taken by reference. For range reads, we defer before subsequent usage of the predicate. For single key reads, we only defer after we have finished using the predicate. The fix is to use the sstable_predicate type, so there won't be a need to construct a temporary object on stack. Fixes #14812. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14813	2023-07-25 10:38:20 +03:00
Botond Dénes	3eec990e4e	Merge 'test: use different table names in simple_backlog_controller_test ' from Kefu Chai in this series, we use different table names in simple_backlog_controller_test. this test is a test exercising sstables compaction strategies. and it creates and keeps multiple tables in a single test session. but we are going to add metrics on per-table basis, and will use the table's ks and cf as the counter's labels. as the metrics subsystem does not allow multiple counters to share the same label. the test will fail when the metrics are being added. to address this problem, in this change 1. a new ctor is added for `simple_schema`, so we can create `simple_schema` with different names 2. use the new ctor in simple_backlog_controller_test Fixes #14767 Closes #14783 * github.com:scylladb/scylladb: test: use different table names in simple_backlog_controller_test test/lib/simple_schema: add ctor for customizing ks.cf test/lib/simple_schema: do not hardwire ks.cf	2023-07-25 10:26:33 +03:00
Anna Stuchlik	f6732865b9	doc: doc: move unified installer from web to docs This commit adds the information on how to install ScyllaDB without root privileges (with "unified installer", but we've decided to drop that name - see the page title). The content taken from the website https://www.scylladb.com/download/?platform=tar&version=scylla-5.2#open-source is divided into two sections: "Download and Install" and "Configure and Run ScyllaDB". In addition, the "Next Steps" section is also copied from the website, and adjusted to be in sync with other installation pages in the docs. Refs https://github.com/scylladb/scylla-docs/issues/4091 Closes #14781	2023-07-25 10:23:02 +03:00
Benny Halevy	a07440173f	storage_service: node_ops_ctl: send_to_all: fix "Node is down for" log message args order The node and op_desc args are reversed. Fixes #14807 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14808	2023-07-24 21:13:06 +03:00
Petr Gusev	5fb8da4181	hints: add fencing In this commit we just pass a fencing_token through hint_mutation RPC verb. The hints manager uses either storage_proxy::send_hint_to_all_replicas or storage_proxy::send_hint_to_endpoint to send a hint. Both methods capture the current erm and use the corresponding fencing token from it in the mutation or hint_mutation RPC verb. If these verbs are fenced out, the server stale_topology_exception is translated to a mutation_write_failure_exception on the client with an appropriate error message. The hint manager will attempt to resend the failed hint from the commitlog segment after a delay. However, if delivery is unsuccessful, the hint will be discarded after gc_grace_seconds. Closes #14580	2023-07-24 18:12:48 +02:00
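
The shutdown sequence described in commit d2ca600eec above (send SIGTERM, wait up to 10 seconds, fall back to SIGKILL) can be sketched roughly as follows. This is only an illustration using Python's standard `subprocess` module, not the code from test/*/run; the helper name `stop_process` and its return values are hypothetical.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, timeout: float = 10.0) -> str:
    """Ask proc to exit cleanly, then force-kill it if it does not.

    Hypothetical helper: returns "terminated" if the process exited
    within `timeout` seconds after SIGTERM, or "killed" if SIGKILL
    had to be used.
    """
    proc.terminate()  # SIGTERM on POSIX: lets the shutdown path run
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX: cannot be caught or ignored
        proc.wait()   # reap the child so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # Stand-in child process; a real run would start the server under test.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_process(child, timeout=10.0))
```

The generous timeout costs nothing in the common case, because `wait()` returns as soon as the child exits; it only matters when the shutdown hangs.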


38103 Commits