scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Avi Kivity	cccd2e7fa7	Merge 'Generalize sstables TOC file reading' from Pavel Emelyanov TOC file is read and parsed in several places in the code. All do it differently, and it's worth generalizing this place. To make it happen also fix the S3 readable_file so that it could be used inside file_input_stream. Closes scylladb/scylladb#16175 * github.com:scylladb/scylladb: sstable: Generalize toc file read and parse s3/client: Don't GET object contents on out-of-bound reads s3/client: Cache stats on readable_file	2023-11-29 19:18:31 +02:00
Nadav Har'El	62f89d49e5	tablets, mv: fix on_internal_error on write to base table The situation before this patch is that when tablets are enabled for a keyspace, we can create a materialized view but later any write to the base table fails with an on_internal_error(), saying that: "Tried to obtain per-keyspace effective replication map of test but it's per-table." Indeed, with tablets, the replication is different for each table - it's not the same for the entire keyspace. So this patch changes the view update code to take the replication map from the specific base table, not the keyspace. This is good enough to get materialized-views reads and writes working in a simple single-node case, as the included test demonstrates (the test fails with on_internal_error() before this patch, and passes afterwards). But this fix is not perfect - the base-view pairing code really needs to consider not only the base table's replication map, but also the view table's replication map - as those can be different. We'll fix this remaining problem as a followup in a separate patch - it will require a substantially more elaborate test to reproduce the need for the different mapping and to verify that fix. Fixes #16209. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16211	2023-11-29 15:29:17 +01:00
Avi Kivity	cd732b1364	Update seastar submodule * seastar 830ce8673...55a821524 (34): > Revert "reactor/scheduling_group: Handle at_destroy queue special in init_new_scheduling_group_key etc" > epoll: Avoid spinning on aborted connections Fixes #12774 Fixes #7753 Fixes #13337 > Merge 'Sanitize test-only reactor facilities' from Pavel Emelyanov > test/unit: fix fmt version check > reactor/scheduling_group: Handle at_destroy queue special in init_new_scheduling_group_key etc > build: add spaces before () and after commands > reactor: use zero-initialization to initialize io_uring_params > Merge 'build: do not return a non-false condition if the option is off ' from Kefu Chai > memory: do not use variable length array > build: use tri_state_option() to link against Sanitizers > build: do not define SEASTAR_TYPE_ERASE_MORE on all builds > Revert "shared_future: make available() immediate after set_value()" > test_runner: do not throw when seastar.app fails to start > Merge 'Address issue where Seastar faults in toeplitz hash when reassembling fragment' from John Hester > defer, closeable: do not use [[nodiscard(str)]] > Merge 'build: generate config-specific rules using generator expressions' from Kefu Chai > treewide: use _v and _t for better readability > build: use different names for .pc files for each build mode > perftune.py: skip discovering IRQs for iSCSI disks > io-tester: explicit use uint64_t for boost::irange(...) 
> gate: correct the typo in doxygen comment > shared_future: make available() immediate after set_value() > smp: drop unused templates > include fmt/ostream.h to make headers self-sufficient > Support ccache in ./configure.py > rpc_tester: Disable -Wuninitialized when including boost.accumulators > file: construct directory_entry with aggregated ctor > file: s/ino64_t/ino_t/, s/off64_t/off_t/ > sstring_test: include fmt/std.h only if fmtlib >= 10.0.0 > file: do not include coroutine headers if coroutine is disabled > fair_queue::unregister_priority_class:fix assertion > Merge 'Generalize `net::udp_channel` into `net::datagram_channel`' from Michał Sala > Merge 'Add file::list_directory() that co_yields entries' from Pavel Emelyanov > http/file_handler: remove unnecessary cast Closes scylladb/scylladb#16201	2023-11-29 14:34:30 +02:00
Kefu Chai	c40da20092	utils/pretty_printers: stop using undocumented fmt api format_parse_context::on_error() is an undocumented API in fmt v9 and in fmt v10, see - https://fmt.dev/9.1.0/api.html#_CPPv4I0EN3fmt16basic_format_argE - https://fmt.dev/10.0.0/api.html#_CPPv4I0EN3fmt26basic_format_parse_contextE despite that this API was once used in its document for fmt v10.0.0, see https://fmt.dev/10.0.0/api.html#formatting-user-defined-types. it's still, well, undocumented. so, to have better compatibility, let's use the documented API in place of undocumented one. please note, `throw_format_error()` was still not a public API before 10.1.0, so before that release we have to throw `fmt::format_error` explicitly. so we cannot use it yet during the transitional period. because the class of `fmt::format_error` is defined in `fmt/format.h`, we need to include this header for using it. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16212	2023-11-29 12:49:04 +02:00
Pavel Emelyanov	0da37d5fa6	sstable: Generalize toc file read and parse There are several places where TOC file is parsed into a vector of components -- sstable::read_toc(), remove_by_toc_name() and remove_by_registry_entry(). All three deserve some generalization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-11-29 12:09:52 +03:00
Pavel Emelyanov	c5d85bdf79	s3/client: Don't GET object contents on out-of-bound reads If S3 readable file is used inside file input stream, the latter may call its read methods with position that is above file size. In that case server replies with generic http error and the fact that the range was invalid is encoded into reply body's xml. That's not great to catch this via wrong reply status exception and xml parsing all the more so we can know that the read is out-of-bound in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-11-29 12:09:52 +03:00
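The idea in commit c5d85bdf79 (answer out-of-bound reads locally once the object size is known) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Scylla S3 client: `CachedSizeFile`, its `remote_reads` counter and the in-memory `bytes` standing in for the remote object are all hypothetical.

```python
class CachedSizeFile:
    """Toy read-only file over an immutable object whose size is cached."""

    def __init__(self, data: bytes):
        self._data = data          # stands in for the remote object (assumption)
        self._size = len(data)     # cached once; the object never changes
        self.remote_reads = 0      # counts simulated requests to the server

    def read(self, pos: int, length: int) -> bytes:
        # The read starts at or past the end of the object: answer locally
        # with an empty buffer instead of issuing a request that would fail.
        if pos >= self._size:
            return b""
        self.remote_reads += 1
        end = min(pos + length, self._size)
        return self._data[pos:end]


f = CachedSizeFile(b"Data.db\nIndex.db\n")
in_bounds = f.read(0, 7)
past_end = f.read(1000, 16)
```

Only the in-bounds read touches the simulated server; the read past the end returns an empty buffer without a request.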
Pavel Emelyanov	339182287f	s3/client: Cache stats on readable_file S3-based sstables components are immutable, so every time stat is called there's no need to ping server again. But the main intention of this patch is to provide stats for read calls in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-11-29 12:06:54 +03:00
Calle Wilund	3b70fde3cd	commitlog: Make named_files in delete_segments have updated size Fixes #16207 commitlog::delete_segments deletes (or recycles) segments replayed. The actual file size here is added to footprint so actual delete then can determine iff things should be recycled or removed. However, we build a pending delete list of named_files, and the files we added did not have size set. Bad. Actual deletion then treated files as zero-byte sized, i.e. footprint calculations borked. Simple fix is just filling in the size of the objects when adding. Added unit test for the problem. Closes scylladb/scylladb#16210	2023-11-29 09:58:47 +02:00
Yaron Kaikov	c3ee53f3be	test.py: enable xml validation Following https://github.com/scylladb/scylladb/issues/4774#issuecomment-1752089862 Adding back xml validation Closes: https://github.com/scylladb/scylla-pkg/issues/3441 Closes scylladb/scylladb#16198	2023-11-29 09:02:36 +02:00
Botond Dénes	3ed6925673	Merge 'Major compaction: flush commitlog by forcing new active segment and flushing all tables' from Benny Halevy Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database releases all references to commitlog segments and thereby maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Closes scylladb/scylladb#15820 * github.com:scylladb/scylladb: docs: nodetool: flush: enrich examples docs: nodetool: compact: fix example api: add /storage_service/compact api: add /storage_service/flush compaction_manager: flush_all_tables before major compaction database: add flush_all_tables api: compaction: add flush_memtables option test/nodetool: jmx: fix path to scripts/scylla-jmx scylla-nodetool, docs: improve optional params documentation	2023-11-29 08:48:40 +02:00
Nadav Har'El	88a5ddabce	tablets, mv: create tablets for a new materialized view Before this patch, trying to create a materialized view when tablets are enabled for a keyspace results in a failure: "Tablet map not found for table <uuid>", with uuid referring to the new view. When a table schema is created, the handler on_before_create_column_family() is called - and this function creates the tablet map for the new table. The bug was that we forgot to do the same when creating a materialized view - which is also a bona-fide table. In this patch we call on_before_create_column_family() also when creating the materialized view. I decided not to create a new callback (e.g., on_before_create_view()) and rather call the existing on_before_create_column_family() callback - after all, a view is a column family too. This patch also includes a test for this issue, which fails to create the view before this patch, and passes with the patch. The test is in the test/topology_experimental_raft suite, which runs Scylla with the tablets experimental feature, and will also allow me to create tests that need multiple nodes. However, the first test added here only needs a single node to reproduce the bug and validate its fix. Fixes #16194. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16205	2023-11-28 21:54:32 +01:00
Kamil Braun	3582095b79	schema_tables: use smaller timestamp for base mutations included with view update When a view schema is changed, the schema change command also includes mutations for the corresponding base table; these mutations don't modify the base schema but are included in case if the receiver of view mutations somehow didn't receive base mutations yet (this may in theory happen outside Raft mode). There are situations where the schema change command contains both mutations that describe the current state of the base table -- included by a view update, as explained above -- and mutations that want to modify the base table. Such situation arises, for example, when we update a user-defined type which is referenced by both a view and its corresponding base table. This triggers a schema change of the view, which generates mutations to modify the view and includes mutations of the current base schema, and at the same time it triggers a schema change of the base, which generates mutations to modify the base. These two sets of mutations are conflicting with each other. One set wants to preserve the current state of the base table while the other wants to modify it. And the two sets of mutations are generated using the same timestamp, which means that conflict resolution between them is made on a per-mutation-cell basis, comparing the values in each cell and taking the "larger" one (meaning of "larger" depends on the type of each cell). Fortunately, this conflict is currently benign -- or at least there is no known situation where it causes problems. Unfortunately, it started causing problems when I attempted to implement group 0 schema versioning (PR scylladb/scylladb#15331), where instead of calculating table versions as hashes of schema mutations, we would send versions as part of schema change command. These versions would be stored inside the `system_schema.scylla_tables` table, `version` column, and sent as part of schema change mutations. 
And then the conflict showed. One set of mutations wanted to preserve the old value of `version` column while the other wanted to update it. It turned out that sometimes the old `version` prevailed, because the `version` column in `system_schema.scylla_tables` uses UUID-based comparison (not timeuuid-based comparison). This manifested as issue scylladb/scylladb#15530. To prevent this, the idea in this commit is simple: when generating mutations for the base table as part of corresponding view update, do not use the provided timestamp directly -- instead, decrement it by one. This way, if the schema change command contains mutations that want to modify the base table, these modifying mutations will win all conflicts based on the timestamp alone (they are using the same provided timestamp, but not decremented). One could argue that the choice of this timestamp is anyway arbitrary. The original purpose of including base mutations during view update was to ensure that a node which somehow missed the base mutations, gets them when applying the view. But in that case, the "most correct" solution should have been to use the original base mutations -- i.e. the ones that we have on disk -- instead of generating new mutations for the base with a refreshed timestamp. The base mutations that we have on disk have smaller timestamps already (since these mutations are from the past, when the base was last modified or created), so the conflict would also not happen in this case. But that solution would require doing a disk read, and we can avoid the read while still fixing the conflict by using an intermediate solution: regenerating the mutations but with `timestamp - 1`. Ref: scylladb/scylladb#15530 Closes scylladb/scylladb#16139	2023-11-28 21:51:18 +01:00
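The conflict described in commit 3582095b79, and why `timestamp - 1` resolves it, can be illustrated with a toy last-write-wins merge. This is a minimal sketch, not Scylla's reconciliation code: the `Cell` type and `merge` function are hypothetical, and plain string comparison stands in for the type-specific cell comparison mentioned in the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    timestamp: int
    value: str


def merge(a: Cell, b: Cell) -> Cell:
    """Toy last-write-wins: the higher timestamp wins; ties compare values."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value > b.value else b


ts = 1000
# Base mutation included with the view update (preserves the old version).
old_version = Cell(ts, "f0")
# Mutation that actually modifies the base table (carries the new version).
new_version = Cell(ts, "0a")

# Same timestamp: the outcome depends on the values, so the old one can win.
tie_winner = merge(old_version, new_version)

# With the included base mutation at ts - 1, the modifying mutation always wins.
fixed_winner = merge(Cell(ts - 1, "f0"), new_version)
```

In the tie case the old value prevails only because it happens to compare larger, which mirrors how the old `version` sometimes won; after the decrement the timestamp alone decides.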
Benny Halevy	310ff20e1e	docs: nodetool: flush: enrich examples Provide 3 examples, like in the nodetool/compact page: global, per-keyspace, per-table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:48:22 +02:00
Benny Halevy	d32b90155a	docs: nodetool: compact: fix example It looks like `nodetool compact standard1` is meant to show how to compact a specified table, not a keyspace. Note that the previous example line is for a keyspace. So fix the table compaction example to: `nodetool compact keyspace1 standard1` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:45:20 +02:00
Benny Halevy	b12b142232	api: add /storage_service/compact For major compacting all tables in the database. The advantage of this api is that `commitlog->force_new_active_segment` happens only once in `database::flush_all_tables` rather than once per keyspace (when `nodetool compact` translates to a sequence of `/storage_service/keyspace_compaction` calls). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Benny Halevy	1b576f358b	api: add /storage_service/flush For flushing all tables in the database. The advantage of this api is that `commitlog->force_new_active_segment` happens only once in `database::flush_all_tables` rather than once per keyspace (when `nodetool flush` translates to a sequence of `/storage_service/keyspace_flush` calls). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Benny Halevy	66ba983fe0	compaction_manager: flush_all_tables before major compaction Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See `64ec1c6ec6` However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See `f42eb4d1ce`). Flushing all sstables in the database releases all references to commitlog segments and thereby maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb/scylladb#15777 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Benny Halevy	be763bea34	database: add flush_all_tables Flushes all tables after forcing force_new_active_segment of the commitlog to make sure all commitlog segments can get recycled. Otherwise, due to "false sharing", rarely-written tables might inhibit recycling of the commitlog segments they reference. After `f42eb4d1ce`, that won't allow compaction to purge some tombstones based on the min_gc_time. To be used in the next patch by major compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Benny Halevy	1fd85bd37b	api: compaction: add flush_memtables option When flushing is done externally, e.g. by running `nodetool flush` prior to `nodetool compact`, flush_memtables=false can be passed to skip flushing of tables right before they are major-compacted. This is useful to prevent creation of small sstables due to excessive memtable flushing. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Benny Halevy	7f860d612a	test/nodetool: jmx: fix path to scripts/scylla-jmx The current implementation makes no sense. Like `nodetool_path`, base the default `jmx_path` on the assumption that the test is run using, e.g. ``` (cd test/nodetool; pytest --nodetool=cassandra test_compact.py) ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Benny Halevy	9324363e55	scylla-nodetool, docs: improve optional params documentation Document the behavior if no keyspace is specified or no table(s) are specified for a given keyspace. Fixes scylladb/scylladb#16032 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-28 16:37:42 +02:00
Anna Stuchlik	bfe19c0ed2	doc: add experimental support for object storage This commit adds information on how to enable object storage for a keyspace. The "Keyspace storage options" section already existed in the doc, but it was not valid as the support was only added in version 5.4 The scope of this commit: - Update the "Keyspace storage options" section. - Add the information about object storage support to the Data Definition> CREATE KEYSPACE section * Marked as "Experimental". * Excluded from the Enterprise docs with the .. only:: opensource directive. This commit must be backported to branch-5.4, as support for object storage was added in version 5.4. Closes scylladb/scylladb#16081	2023-11-28 14:27:01 +02:00
Botond Dénes	f46cdce9d3	Merge 'Make memtable flush tolerate misconfigured S3 storage' from Pavel Emelyanov Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done at runtime, scylla re-reads this config upon HUP signal. Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well. fixes: #13745 Closes scylladb/scylladb#15635 * github.com:scylladb/scylladb: test: Add object_store test to validate config reloading works test: Add config update facility to test cluster test: Make S3_Server export config file as pathlib.Path config: Make object storage config updateable_value_source memtable: Extend list of checking codes sstables/storage/s3: Fix missing TOC status check s3/client: Map http exceptions into storage_io_error exceptions: Extend storage_io_error construction options	2023-11-28 09:33:37 +02:00
Botond Dénes	3ccf1e020b	Merge ' compaction: abort compaction tasks' from Aleksandra Martyniuk Compaction tasks which do not have a parent are abortable through task manager. Their children are aborted recursively. Compaction tasks of the lowest level are aborted using existing compaction task executors stopping mechanism. Closes scylladb/scylladb#16177 * github.com:scylladb/scylladb: test: test abort of compaction task that isn't started yet test: test running compaction task abort tasks: fail if a task was aborted compaction: abort task manager compaction tasks	2023-11-28 09:08:04 +02:00
Pavel Emelyanov	1efddc228d	sstable: Do not nest io-check wrappers into each other When sealing an sstable on local storage the storage driver performs several flushes on a file that is directory open via checked-file. Flush calls are wrapped with sstable_write_io_check, but that's excessive, the checked file will wrap flushes with io-checks on its own Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#16173	2023-11-27 15:53:02 +02:00
Kefu Chai	724a6e26f3	cql3: define format_as() for formatting cql3::cql3_type::raw before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. to define a formatter which can be used by raw class and its derived classes, we have to put the full template specialization before the call sites. also, please note, the forward declaration is not enough, as the compile-time formatter check of fmt requires the definition of formatter. since fmt v10 also enables us to use `format_as()` to format a certain type with the return value of `format_as()`. this fulfills our needs. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16125	2023-11-27 15:28:19 +02:00
Kefu Chai	0b69a1badc	transport: cast unaligned<T> to T for formatting it in fmt v10, it does not cast unaligned<T> to T when formatting it, instead it insists on finding a matched fmt::formatter<> specialization for it. that's why we have FTBFS with fmt v10 when printing these packed<T> variables with fmtlib v10. in this change, we just cast them to the underlying types before formatting them. because seastar::unaligned<T> does not provide a method for accessing the raw value, neither does it provide a type alias of the type of the underlying raw value, we have to cast to the type without deducing it from the printed value. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16167	2023-11-27 15:26:13 +02:00
Petr Gusev	dca28417b2	storage_service: drop unused method handle_state_replacing_update_pending_ranges	2023-11-27 12:37:26 +01:00
Tomasz Grabiec	ae5220478c	tablets: Release group0 guard when waiting for streaming to finish This bug manifested as delays in DDL statement execution, which had to wait until streaming is finished so that the topology change coordinator releases the guard. The reason is that topology change coordinator didn't release the group0 guard if there is no work to do with active migrations, and awaits the condition variable without leaving the scope. Fixes #16182 Closes scylladb/scylladb#16183	2023-11-27 12:24:27 +01:00
Nadav Har'El	8d040325ab	cql: fix SELECT toJson() or SELECT JSON of time column The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column of type "time" forgot to put the time string in quotes. The result was invalid JSON. This patch is a one-liner fixing this bug. This patch also removes the "xfail" marker from one xfailing test for this issue which now starts to pass. We also add a second test for this issue - the existing test was for "SELECT TOJSON(t)", and the second test shows that "SELECT JSON t" had exactly the same bug - and both are fixed by the same patch. We also had a test translated from Cassandra which exposed this bug, but that test continues to fail because of other bugs, so we just need to update the xfail string. The patch also fixes one C++ test, test/boost/json_cql_query_test.cc, which enshrined the wrong behavior - JSON output that isn't even valid JSON - and had to be fixed. Unlike the Python tests, the C++ test can't be run against Cassandra, and doesn't even run a JSON parser on the output, which explains how it came to enshrine wrong output instead of helping to discover the bug. Fixes #7988 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16121	2023-11-27 10:03:04 +02:00
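Why an unquoted time value is invalid JSON (commit 8d040325ab) can be checked with any JSON parser. The snippet below is a small illustration using Python's standard `json` module; the sample time value and the `is_valid_json` helper are made up for the example and are not taken from the Scylla tests.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A time value emitted without quotes is neither a JSON number nor a string.
unquoted = '{"t": 08:12:54.123456789}'
# The same value as a JSON string parses fine.
quoted = '{"t": "08:12:54.123456789"}'

unquoted_ok = is_valid_json(unquoted)
quoted_ok = is_valid_json(quoted)
```

This is also the kind of check the message notes was missing from the C++ test: running the output through a parser would have exposed the bug.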
Anna Stuchlik	24d5dbd66f	doc: replace the OSS-only link on the Raft page This commit replaces the link to the OSS-only page (the 5.2-to-5.4 upgrade guide not present in the Enterprise docs) on the Raft page. While providing the link to the specific upgrade guide is more user-friendly, it causes build failures of the Enterprise documentation. I've replaced it with the link to the general Upgrade section. The ".. only:: opensource" directive used to wrap the OSS-only content correctly excludes the content from the Enterprise docs - but it doesn't prevent build warnings. This commit must be backported to branch-5.4 to prevent errors in all versions. Closes scylladb/scylladb#16176	2023-11-27 08:52:58 +02:00
Kefu Chai	c937827308	mutation_query: add formatter for reconcilable_result::printer before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for reconcilable_result::printer, and remove its operator<<(). Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16186	2023-11-26 20:20:50 +02:00
Konstantin Osipov	f0aa325187	test: provide overview of the contents of test/ directory Fixes #16080 Closes scylladb/scylladb#16088	2023-11-26 15:51:07 +02:00
Marcin Maliszkiewicz	81be3e0935	test/alternator/run: port -h and --omit-scylla-output options from cql-pytest Closes scylladb/scylladb#16171	2023-11-26 13:52:01 +02:00
Botond Dénes	fe7c81ea30	Update ./tools/jmx and ./tools/java submodules * ./tools/jmx 05bb7b68...80ce5996 (4): > StorageService: Normalize endpoint inetaddress strings to java form Fixes #16039 > ColumnFamilyStore: only quote table names if necessary > APIBuilder: allow quoted scope names > ColumnFamilyStore: don't fail if there is a table with ":" in its name Fixes #16153 * ./tools/java 10480342...26f5f71c (1): > NodeProbe: allow addressing table name with colon in it Also needed for #16153 Closes scylladb/scylladb#16146	2023-11-26 13:35:38 +02:00
Kefu Chai	ba3dce3815	build: do escape "\" in regular string in Python, a raw string is created using 'r' or 'R' prefix. when creating the regex using Python string, sometimes, we have to use "\" to escape the parenthesis so the tools like "sed" can consider the parenthesis as a capture group. but "\" is also used to escape strings in Python, in order to put "\" as it is, we use "\" instead of escaping "\" with "\\" which is obscure. when generating rules, we use multiple-lines string and do not want to have an empty line at the beginning of the string so added "\" continuation mark. but we fail to escape some of the "\" in the string, and just put "\(", despite that Python accepts it after failing to find a matched escaped char for it, and interprets it as "\\(". this should still be considered a misuse or oversight. with python's warning enabled, one is able to see its complaints. in this change, we escape the "\" properly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16179	2023-11-26 13:34:10 +02:00
Kefu Chai	3053d63c7f	main: notify systemd that the service is ready this change addresses a regression introduced by `f4626f6b8e`, which stopped notifying systemd with the status that scylla is READY. without the notification, systemd would wait in vain for the readiness of scylla. Refs `f4626f6b8e` Fixes #16159 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16166	2023-11-26 10:38:53 +02:00
Aleksandra Martyniuk	9c2c964b8e	test: test abort of compaction task that isn't started yet Test whether a task whose parent was aborted has a proper status.	2023-11-24 19:25:27 +01:00
Aleksandra Martyniuk	8639eae0ce	test: test running compaction task abort Test whether a task which is aborted while running has a proper status.	2023-11-24 19:25:20 +01:00
Botond Dénes	a472700309	Merge 'Minor fixes and refactors' from Kamil Braun - remove some code that is obsolete in newer Scylla versions, - fix some minor bugs. These bugs appear to be benign, there are no known issues caused by them, but fixing them is a good idea nevertheless, - refactor some code for better maintainability. Parts of this PR were extracted from https://github.com/scylladb/scylladb/pull/15331 (which was merged but later reverted), parts of it are new. Closes scylladb/scylladb#16162 * github.com:scylladb/scylladb: test/pylib: log_browsing: fix type hint migration_manager: take `abort_source&` in get_schema_for_read/write migration_manager: inline merge_schema_in_background migration_manager: remove unused merge_schema_from overload migration_manager: assume `canonical_mutation` support migration_manager: add `std::move` to avoid a copy schema_tables: refactor `scylla_tables(schema_features)` schema_tables: pass `reload` flag when calling `merge_schema` cross-shard system_keyspace: fix outdated comment	2023-11-24 17:34:21 +02:00
Patryk Jędrzejczak	15d3ed4357	test: topology: update run_first lists `run_first` lists in `suite.yaml` files provide a simple way to shorten the tests' average running time by running the slowest tests at first. We update these lists, since they got outdated over time: - `test_topology_ip` was renamed to `test_replace` and changed suite, - `test_tablets` changed suite, - new slow tests were added: - `test_cluster_features`, - `test_raft_cluster_features`, - `test_raft_ignore_nodes`, - `test_read_repair`. Closes scylladb/scylladb#16104	2023-11-24 16:18:30 +01:00
Aleksandra Martyniuk	c74b3ec596	tasks: fail if a task was aborted run() method of task_manager::task::impl does not have to throw when a task is aborted with task manager api. Thus, a user will see that the task finished successfully which makes it inconsistent. Finish a task with a failure if it was aborted with task manager api.	2023-11-24 15:45:00 +01:00
Aleksandra Martyniuk	aa7bba2d8b	compaction: abort task manager compaction tasks Set top level compaction tasks as abortable. Compaction tasks which have no children, i.e. compaction task executors, have abort method overridden to stop compaction data.	2023-11-24 15:44:34 +01:00
Kefu Chai	ca31dab9d2	sstable: drop repaired_at related code before we support incremental repair, there is no point in having the code path setting / getting it. and even worse, it incurs confusion. so, in this change, we * just set the field to 0, * drop the corresponding field in metadata_collector, as we never update it. * add a comment to explain why this variable is initialized to 0 Fixes #16098 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16169	2023-11-24 15:12:25 +02:00
Botond Dénes	697cf41b9b	Merge 'repair: Introduce small table optimization' from Asias He repair: Introduce small table optimization ) Problem: We have seen in the field it takes longer than expected to repair system tables like system_auth which has a tiny amount of data but is replicated to all nodes in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The main reason for the slowness is that even if the amount of data is small, repair has to walk through all the token ranges, that is num_tokens * number_of_nodes_in_the_cluster. The overhead of the repair protocol for each token range dominates due to the small amount of data per token range. Another reason is the high network latency between DCs makes the RPC calls used to repair consume more time. ) Solution: To solve this problem, a small table optimization for repair is introduced in this patch. A new repair option is added to turn on this optimization. - No token range to repair is needed by the user. It will repair all token ranges automatically. - Users only need to send the repair rest api to one of the nodes in the cluster. It can be any of the nodes in the cluster. - It does not require the RF to be configured to replicate to all nodes in the cluster. This means it can work with any tables as long as the amount of data is low, e.g., less than 100MiB per node. ) Performance: 1) 3 DCs, each DC has 2 nodes, 6 nodes in the cluster. 
RF = {dc1: 2, dc2: 2, dc3: 2} Before: ``` repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes, role_members}, ranges_nr=1537, round_nr=4612, round_nr_fast_path_already_synced=4611, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1, rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds, tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0, row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}}, row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}}, row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2, 0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848}, {127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1, 0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s, tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={} ``` After: ``` repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats: repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes, role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178}, {127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}}, row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1}, {127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}}, row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286}, {127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286}, {127.0.14.6, 0.116286}} 
MiB/s, row_from_disk_rows_per_sec={{127.0.14.1, 685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026}, {127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={} ``` The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X 2) 3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2} Same test as above except 5ms delay is added to simulate multiple dc network latency: The time to repair is reduced from 333s to 0.2s. 333.26758 s / 0.22625381s = 1472.98 3) 3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3} , 10 ms network latency Before: ``` repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats: repair_reason=repair, keyspace=system_auth, tables={role_attributes, role_members, roles}, ranges_nr=2305, round_nr=6916, round_nr_fast_path_already_synced=6915, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1, rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015 seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0, row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2, 1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}} MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385}, {127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1}, {127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1}, {127.0.57.8, 
1}, {127.0.57.9, 1}}, rx_row_nr_peer={} ``` After: ``` repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats: repair_reason=repair, keyspace=system_auth, tables={role_attributes, role_members, roles}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4, round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915 seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3, 178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178}, {127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}}, row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1}, {127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1}, {127.0.57.8, 1}, {127.0.57.9, 1}}, row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2, 0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793}, {127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7, 0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}} MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634}, {127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4, 2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634}, {127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9, 2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={} ``` The time to repair is reduced from 986s (16 minutes) to 1.6s ) Summary So, a more than 1000X difference is observed for this common usage of system table repair procedure. Fixes #16011 Refs #15159 Closes scylladb/scylladb#15974 github.com:scylladb/scylladb: repair: Introduce small table optimization repair: Convert put_row_diff_with_rpc_stream to use coroutine	2023-11-24 15:11:42 +02:00
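The range counts and speedups quoted in the repair commit above can be re-derived with a few lines of arithmetic. The following is a minimal sketch, not part of the patch. It assumes `num_tokens = 256` per node and one extra wrap-around range; neither is stated in the message, but together they reproduce the logged `ranges_nr` values of 1537 and 2305. The function names are hypothetical.

```python
def expected_ranges(num_tokens: int, nodes: int) -> int:
    """Token ranges a regular repair walks.

    Assumption (not from the commit message): one range per token across
    all nodes, plus one extra range for the wrap-around of the token ring.
    """
    return num_tokens * nodes + 1


def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of repair duration before and after the optimization."""
    return before_seconds / after_seconds


# Durations copied from the three scenarios in the commit message.
scenarios = {
    "6 nodes, no added latency": (1.5648403, 0.001459798),
    "6 nodes, 5 ms latency": (333.26758, 0.22625381),
    "9 nodes, 10 ms latency": (986.34015, 1.6052915),
}

if __name__ == "__main__":
    print(expected_ranges(256, 6), expected_ranges(256, 9))
    for name, (before, after) in scenarios.items():
        print(f"{name}: {speedup(before, after):.2f}X")
```

With these inputs the first two scenarios work out to roughly 1072X and 1473X, matching the figures in the message, and the third to roughly 614X.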
Kamil Braun	1f56962591	Merge 'test: topology: test concurrent bootstrap' from Patryk Jędrzejczak We add a test for concurrent bootstrap in the raft-based topology. Additionally, we extend the testing framework with a new function - `ManagerClient.servers_add`. It allows adding multiple servers concurrently to a cluster. This PR is the first step to fix #15423. After merging it, if the new test doesn't fail for some time in CI, we can: - use `ManagerClient.servers_add` in other tests wherever possible, - start initial servers concurrently in all suites with `initial_size > 0`. Closes scylladb/scylladb#16102 * github.com:scylladb/scylladb: test: topology: add test_concurrent_bootstrap test: ManagerClient: introduce servers_add test: ManagerClient: introduce _create_server_add_data	2023-11-24 12:41:05 +01:00
Kefu Chai	f99223919a	compaction: add formatter for map<timestamp_type, vector<shared_sstable>> before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for map<timestamp_type, vector<shared_sstable>>. since the operator<< for this type is only used in the .cc file, and its only use case is to provide the formatter for fmt, the operator<< based formatter is removed in this change. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16163	2023-11-24 11:56:28 +02:00
Kamil Braun	5acfcd8ef5	Merge 'raft: send group0 RPCs only if the destination group0 server is seen as alive' from Piotr Dulikowski In topology on raft mode, the events "new node starts its group0 server" and "new node is added to group0 configuration" are not synchronized with each other. Therefore it might happen that the cluster starts sending commands to the new node before the node starts its server. This might lead to harmless, but ugly messages like: INFO 2023-09-27 15:42:42,611 [shard 0:stat] rpc - client 127.0.0.1:56352 msg_id 2: exception "Raft group b8542540-5d3b-11ee-99b8-1052801f2975 not found" in no_wait handler ignored In order to solve this, the failure detector verb is extended to report information about whether group0 is alive. The raft rpc layer will drop messages to nodes whose group0 is not seen as alive. Tested by adding a delay before group0 is started on the joining node, running all topology tests and grepping for the aforementioned log messages. Fixes: scylladb/scylladb#15853 Fixes: scylladb/scylladb#15167 Closes scylladb/scylladb#16071 * github.com:scylladb/scylladb: raft: rpc: introduce destination_not_alive_error raft: rpc: drop RPCs if the destination is not alive raft: pass raft::failure_detector to raft_rpc raft: transfer information about group0 liveness in direct_fd_ping raft: add server::is_alive	2023-11-24 10:34:05 +01:00
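The idea in the raft commit above, dropping RPCs to a destination whose group0 server is not yet seen as alive, can be illustrated with a small model. This is a hedged Python sketch, not Scylla's C++ implementation: the class and method names are hypothetical analogues of the `destination_not_alive_error`, `raft::failure_detector` and `server::is_alive` mentioned in the message, and the transport is a plain callable.

```python
from typing import Callable, Set


class DestinationNotAliveError(Exception):
    """Raised instead of sending when the destination's group0 is not alive."""


class FailureDetector:
    """Toy failure detector: tracks which servers reported a live group0."""

    def __init__(self) -> None:
        self._alive: Set[str] = set()

    def on_ping_reply(self, server_id: str, group0_alive: bool) -> None:
        # Assumption: the ping reply carries a flag for group0 liveness.
        if group0_alive:
            self._alive.add(server_id)
        else:
            self._alive.discard(server_id)

    def is_alive(self, server_id: str) -> bool:
        return server_id in self._alive


class RaftRpc:
    """Sends messages only to destinations the failure detector sees as alive."""

    def __init__(self, fd: FailureDetector, transport: Callable[[str, str], None]) -> None:
        self._fd = fd
        self._transport = transport

    def send(self, dest: str, msg: str) -> None:
        if not self._fd.is_alive(dest):
            # Drop the message locally rather than provoke a
            # "Raft group ... not found" error on the receiver.
            raise DestinationNotAliveError(dest)
        self._transport(dest, msg)
```

In this model a joining node that has not yet started its group0 server never receives the message, so the receiver-side error from the log excerpt cannot occur; the sender sees a local error it can handle or ignore.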
Patryk Jędrzejczak	a8d06aa9fd	test: topology: add test_concurrent_bootstrap We add a test for concurrent bootstrap support in the raft-based topology. The plan is to make this test temporary. In the future, we will: - use ManagerClient.servers_add in other tests wherever possible, - start initial servers concurrently in all suites with initial_size > 0. So, this test will not test anything unique. We could make the changes proposed above now instead of adding this small test. However, if we did that and it turned out that concurrent bootstrap is flaky in CI, we would make almost every CI run fail with many failures. We want to avoid such a situation. Running only this test for some time in CI will reduce the risk and make investigating any potential failures easier.	2023-11-24 09:39:01 +01:00
Patryk Jędrzejczak	cd7b282db6	test: ManagerClient: introduce servers_add We add a new function - servers_add - that allows adding multiple servers concurrently to a cluster. It makes use of a concurrent bootstrap now supported in the raft-based topology. servers_add doesn't have the replace_cfg parameter. The reason is that we don't support concurrent replace operations, at least for now. There is an implementation detail in ScyllaCluster.add_servers. We cannot simply do multiple calls to add_server concurrently. If we did that in an empty cluster, every node would take itself as the only seed and start a new cluster. To solve this, we introduce a new field - initial_seed. It is used to choose one of the servers as a seed for all servers added concurrently to an empty cluster. Note that the add_server calls in asyncio.gather in add_servers cannot race with each other when setting initial_seed because there is only one thread. In the future, we will also start all initial servers concurrently in ScyllaCluster.install_and_start. The changes in this commit were designed in a way that will make changing install_and_start easy.	2023-11-24 09:39:01 +01:00
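The `initial_seed` mechanism described in the commit above can be sketched with plain `asyncio`. This is a simplified illustration, not the test framework's code: the `Cluster` class, the address scheme and the simulated boot are hypothetical. It shows why concurrent `add_server` calls in an empty cluster all agree on one seed, and why no lock is needed when the check-and-set of `initial_seed` contains no `await`.

```python
import asyncio
from typing import List, Optional, Tuple


class Cluster:
    """Toy stand-in for a test cluster that hands out seeds to new servers."""

    def __init__(self) -> None:
        self.running: List[str] = []
        self.initial_seed: Optional[str] = None
        self._next_id = 1

    async def add_server(self) -> Tuple[str, str]:
        addr = f"127.0.0.{self._next_id}"
        self._next_id += 1
        if self.running:
            seed = self.running[0]
        else:
            # Empty cluster: the first caller becomes the seed for everyone.
            # There is no await between the check and the assignment, so
            # tasks on the single event-loop thread cannot race here.
            if self.initial_seed is None:
                self.initial_seed = addr
            seed = self.initial_seed
        await asyncio.sleep(0)  # stands in for the actual server boot
        self.running.append(addr)
        return addr, seed

    async def add_servers(self, count: int) -> List[Tuple[str, str]]:
        # Start all servers concurrently and wait for every one of them.
        return list(await asyncio.gather(*(self.add_server() for _ in range(count))))


if __name__ == "__main__":
    for addr, seed in asyncio.run(Cluster().add_servers(3)):
        print(f"{addr} bootstraps with seed {seed}")
```

Without the shared `initial_seed`, each of the three tasks would see an empty `running` list and pick itself as the seed, which is the split into separate clusters that the commit message warns about.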

39974 Commits