This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14528
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already-expired item,
it will be visible for a short time until it is deleted by the TTL thread.
But this isn't necessarily true - if the test is slow enough, it may go
look for the item after it has already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and verify that it eventually appears in the
GSI, LSI, and base table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI, and base table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes#14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14496
Due to the wrong order of stopping compaction services, shutdown has to
wait until all compactions are complete, which may take a very long time.
Moreover, the test version of the compaction manager does not abort the
task manager that is tightly bound to it, but only stops its compaction
module. This results in tests waiting for the compaction task manager's
tasks to be unregistered, which never happens.
Stopping and aborting of the compaction manager and the task manager's
compaction module are now performed in the proper order.
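A minimal sketch of the intended ordering, using hypothetical names (compaction_manager_stub, compaction_module_stub) rather than the actual ScyllaDB interfaces:
```c++
// Hypothetical sketch, not the real ScyllaDB code: abort ongoing compactions
// first, so the task-manager module's stop() has nothing left to wait for.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

struct compaction_module_stub {
    seastar::future<> stop();     // waits for all registered tasks to unregister
};

struct compaction_manager_stub {
    compaction_module_stub* _compaction_module;

    seastar::future<> do_stop();  // aborts ongoing compactions

    seastar::future<> shutdown() {
        co_await do_stop();                    // abort compactions first...
        co_await _compaction_module->stop();   // ...then the module's stop() returns promptly
    }
};
```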
Closes#14461
* github.com:scylladb/scylladb:
tasks: test: abort task manager when wrapped_compaction_manager is destructed
compaction: swap compaction manager stopping order
compaction: modify compaction_manager::stop()
The 'scylla netw' command prints clients from index [0] only, but the
messaging service has more of them. Print them all.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14633
The command prints interesting and/or hard-to-get-by-hand information about individual tables.
Closes#14635
* github.com:scylladb/scylladb:
test: Add 'scylla table' cmd test
scylla-gdb: Print table phased barriers
scylla-gdb: Add 'table' command
Performance tests such as `perf-fast-forward` are executed in our CI
environments in two steps (two invocations of the `scylla` process):
first by populating data directories (with the `--populate` option), then by
running the actual test.
These tests use `cql_test_env`, which did not load the node's Host ID
previously saved in the populate step, but randomly generated a new
one instead.
In b39ca97919 we enabled
`consistent_cluster_management` by default. This caused the perf tests
to hang in `setup_group0` at the `read_barrier` step. That's because Raft
group 0 was initialized with the old configuration -- the one created during
the populate step -- but the Raft server was started with a newly
generated Host ID (which is used as the server's Raft ID), so the server
considered itself to be outside the configuration.
Fix this by reloading the Host ID from disk, simulating more closely the
behavior of main.cc initialization.
Fixes#14599
Closes#14640
Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in their owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317.
To solve both problems, cleanup now uses a multi-range reader to guarantee that it only processes owned data and, as a result, skips unowned data. Cleanup scans an owned range and then fast-forwards to the next one, until it's done with them all. This significantly reduces the amount of data in the index cache, as the index is only consulted at each range boundary.
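As a rough illustration, a minimal sketch of the idea under hypothetical names (reader_handle, write_to_output) -- not the actual cleanup code:
```c++
// Hypothetical sketch, not the real cleanup implementation: visit only the
// owned ranges and fast-forward between them, so unowned tokens (and their
// index pages) are never touched.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <utility>
#include <vector>

seastar::future<> cleanup_sketch(reader_handle reader,
                                 const std::vector<dht::partition_range>& owned_ranges) {
    bool first = true;
    for (const auto& range : owned_ranges) {
        if (!std::exchange(first, false)) {
            co_await reader.fast_forward_to(range);   // jump over unowned tokens
        }
        while (auto frag = co_await reader()) {        // consume one owned range
            co_await write_to_output(*frag);           // rewrite owned data only
        }
    }
    co_await reader.close();
}
```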
Without further ado,
before:
`INFO 2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.`
after:
`INFO 2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.`
Fixes#12998.
Fixes#14317.
Closes#14469
* github.com:scylladb/scylladb:
test: Extend cleanup correctness test to cover more cases
compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
sstables: Close SSTable reader if index exhaustion is detected in fast forward call
sstables: Simplify sstable reader initialization
compaction: Extend make_sstable_reader() interface to work with mutation_source
test: Extend sstable partition skipping test to cover fast forward using token
Today, SSTable cleanup skips to the next partition, one at a time, when it finds
that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing
nodes lose multiple sequential tokens in their owned ranges. Another inefficiency
comes from fetching index pages spanning all unowned tokens, which was described
in #14317.
To solve both problems, cleanup now uses a multi-range reader to guarantee
that it only processes owned data and, as a result, skips unowned data.
Cleanup scans an owned range and then fast-forwards to the next one, until
it's done with them all. This significantly reduces the amount of data in
the index cache, as the index is only consulted at each range boundary.
Without further ado,
before:
... 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.
after:
... 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.
Fixes#12998.
Fixes#14317.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When wiring the multi-range reader with cleanup, I found that cleanup
wouldn't be able to release the disk space of input SSTables earlier.
The reason is that the multi-range reader fast-forwards to the next range,
therefore it enables mutation_reader::forwarding, and as a result, the
combined reader cannot release readers proactively, as it cannot tell
for sure that the underlying reader is exhausted. It may have reached
EOS for the current range, but it may have data for the next one.
The concept of EOS actually only applies to the current range being
read. A reader that returned EOS will actually get out of this
state once the combined reader fast-forwards to the next range.
Therefore, only the underlying reader, i.e. the sstable reader,
can know for certain that the data source is completely exhausted,
given that tokens are read in monotonically increasing order.
For reversed reads that's not true, but fast-forward-to-range
is not actually supported for them yet.
Today, the SSTable reader already knows that the underlying SSTable
was exhausted in fast_forward_to(), after it calls the index_reader's
advance_to(partition_range), so it disables subsequent
reads. We can take a step further and also check whether the index
was exhausted, i.e. reached EOF.
So if the index is exhausted, and there's no partition to read
after the fast_forward_to() call, we know that there's nothing
left to do in this reader, and therefore the reader can be
closed proactively, allowing the SSTable's disk space to be
reclaimed if it was already deleted.
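A minimal sketch of that check, with hypothetical reader and index types (not the actual ScyllaDB reader code):
```c++
// Hypothetical sketch: after a fast-forward, an exhausted index plus an empty
// new range means this reader can never produce more data, so it may be
// closed proactively.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

struct sstable_reader_stub {
    index_reader_stub _index_reader;   // hypothetical index reader
    bool _read_enabled = true;

    seastar::future<> fast_forward_to(const dht::partition_range& pr) {
        co_await _index_reader.advance_to(pr);
        _read_enabled = _index_reader.partition_in_range();
        if (!_read_enabled && _index_reader.eof()) {
            // tokens are read in increasing order, so an exhausted index means
            // no later range can contain data from this SSTable
            co_await close();   // lets the deleted SSTable's disk space be reclaimed
        }
    }

    seastar::future<> close();
};
```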
We can see that the combined reader, under the multi-range reader,
will incrementally find a set of disjoint SSTables exhausted
as it fast-forwards to owned ranges:
1:
INFO 2023-07-05 10:51:09,570 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-4525396453480898112, start},{-4525396453480898112, end}]
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db, start == *end, eof ? true
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - closing reader 0x60100029d800 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-3-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false
2:
INFO 2023-07-05 10:51:09,572 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-2253424581619911583, start},{-2253424581619911583, end}]
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db, start == *end, eof ? true
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - closing reader 0x60100029d400 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false
And so on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It's odd that we see things like:
if (!is_initialized()) {
    return initialize().then([this] {
        if (!is_initialized()) {
and
return ensure_initialized().then([this, &pr] {
    if (!is_initialized()) {
One might think initialize() will actually initialize the reader by
setting up the context, and ensure_initialized() will have even stronger
guarantees, meaning that the reader must be initialized by it.
But neither is true.
In the context of a single-partition read, it can happen that initialize()
will not set up the context, meaning is_initialized() returns false,
which is why initialization must be checked even after we call
ensure_initialized().
Let's merge ensure_initialized() and initialize() into a
maybe_initialize() which returns a boolean saying whether the reader
is initialized.
This makes the code that initializes the reader easier to understand.
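A minimal sketch of the resulting shape, with a hypothetical reader struct rather than the actual ScyllaDB code:
```c++
// Hypothetical sketch: one entry point that reports, via its return value,
// whether the reader ended up initialized.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

struct reader_stub {
    bool _end_of_stream = false;

    bool is_initialized() const;        // a context was created
    seastar::future<> setup_context();  // may legitimately create nothing, e.g.
                                        // a single-partition read with no data
    seastar::future<> do_fill_buffer();

    seastar::future<bool> maybe_initialize() {
        if (!is_initialized()) {
            co_await setup_context();
        }
        co_return is_initialized();
    }

    // callers branch once, without re-checking after two different helpers
    seastar::future<> fill_buffer() {
        if (!co_await maybe_initialize()) {
            _end_of_stream = true;
            co_return;
        }
        co_await do_fill_buffer();
    }
};
```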
This reverts commit 2a58b4a39a, reversing
changes made to dd63169077.
After patch 87c8d63b7a,
table_resharding_compaction_task_impl::run() performs the forbidden
action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard,
which is a data race that can cause a use-after-free, typically manifesting
as allocator corruption.
Note: before the bad patch, this was avoided by copying the _contents_ of the
lw_shared_ptr into a new, local lw_shared_ptr.
Fixes#14475
Fixes#14618
Closes#14641
Fixes#14299
failure_detector can try sending messages to TLS endpoints before start_listen
has been called (why?). TLS needs to be initialized before this, so do it on service creation.
Closes#14493
to inspect the sstable generation after the uuid-based generation
change. in this change:
* a pretty printer for sstable::generation_type is added
* now that the pretty printer for the generation_type is registered,
we can just leverage it when printing the sstable name, so
instead of checking if the `_generation` member variable contains
`_value`, we just delegate it to `str()`, which is used by
`str.format()`. the behavior of `str()` is similar to that of
the gdb `print` command: it calls `value.format_string()`, which
in turn calls into `to_string()` if the "value" in question has
a pretty printer.
after this change, the printer is able to print both the generations
before the uuid change and the ones after the change.
a typical gdb session looks like:
```
(gdb) p generation._value
$5 = f0770b40-1c7c-11ee-b136-bf28f8d18b88
(gdb) p generation
$10 = 3g7g_0bu7_0jpvk2p0mmtlsb8lu0
(gdb) p/x generation._value.least_sig_bits
$7 = 0xb136bf28f8d18b88
(gdb) p/x generation._value.most_sig_bits
$8 = 0xf0770b401c7c11ee
```
if we use `scripts/base36-uuid.py` to encode
the msb and lsb, we'd need to:
```console
scripts/base36-uuid.py -e 0xf0770b401c7c11ee 0xb136bf28f8d18b88
3g7g_0bu7_0jpvk2p0mmtlsb8lu0
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14561
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expires 99% of the data: today,
throughput would be calculated on the 1% written, which misleads the
reader into thinking that compaction was terribly slow.
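A back-of-the-envelope illustration (made-up numbers, not ScyllaDB code) of how the two metrics diverge:
```c++
// Made-up numbers: a compaction reads 2 GiB of input in 20 s and, because
// almost everything is expired, writes only 20 MiB of output.
#include <cstdio>

int main() {
    const double input_mb = 2048.0, output_mb = 20.0, seconds = 20.0;
    std::printf("output-based: %6.1f MB/s (looks terribly slow)\n", output_mb / seconds); //   1.0 MB/s
    std::printf("input-based:  %6.1f MB/s (work actually done)\n", input_mb / seconds);   // 102.4 MB/s
    return 0;
}
```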
Fixes#14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14615
instead of concatenating strings, let's format using the builtin
support of `log::debug()`, for two reasons:
1. better performance. after this change, we don't need to
materialize the concatenated string if the "debug" level logging
is not enabled. seastar::log only formats when the corresponding log
level is enabled.
2. better readability. with the format string, it is clear what
the fixed part is, and which arguments are to be formatted.
this also helps us move to compile-time format-string checking,
as fmtlib requires the caller to be explicit when it wants
to use a runtime format string.
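a minimal sketch of the difference, with a hypothetical logger and call sites (not an actual scylladb call site):
```c++
// hypothetical logger and call sites
#include <seastar/util/log.hh>
#include <string>

static seastar::logger mylog("example");

void before(const std::string& name, int count) {
    // the concatenated string is materialized even when debug logging is disabled
    mylog.debug("{}", "processed " + name + ": " + std::to_string(count) + " rows");
}

void after(const std::string& name, int count) {
    // arguments are only formatted if the debug level is enabled
    mylog.debug("processed {}: {} rows", name, count);
}
```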
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14627
These barriers show if there's any operation in progress (read, write,
flush or stream). They are crucial to know about if stopping fails, e.g. see
issue #13100
These barriers are summarized in the 'scylla memory' command, but they are
also good to know on a per-table basis
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's the 'scylla tables' command that lists tables on the given/current
shard, but the list cannot show much information. It prints the
table address so it can be explored by hand, but some data is handier
to parse and print with the script
The syntax is
$ scylla table ks.cf
For now it just prints the schema version. To be extended in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refs: https://github.com/scylladb/scylla-docs/issues/4091
Fixes https://github.com/scylladb/scylla-docs/issues/3419
This PR moves the installation instructions from the [website](https://www.scylladb.com/download/) to the documentation. Key changes:
- The instructions are mostly identical, so they were squeezed into one page with different tabs.
- I've merged the info for Ubuntu and Debian, as well as CentOS and RHEL.
- The page uses variables that should be updated each release (at least for now).
- The Java requirement was updated from Java 8 to Java 11 following [this issue](https://github.com/scylladb/scylla-docs/issues/3419).
- In addition, the title of the Unified Installer page has been updated to better communicate its contents.
Closes#14504
* github.com:scylladb/scylladb:
doc: update the prerequisites section
doc: improve the tile of Unified Installer page
doc: move package install instructions to the docs
* seastar 2b7a341210...bac344d584 (3):
> tls: Export error_category instance used by tls + some common error codes
> reactor: cast enum to int when formatting it
> cooking: bump up zlib to 1.2.13
with tagging ops, we will be able to attach kv pairs to an object.
this will allow us to mark sstable components with taggings, and
filter them based on the tags.
* test/pylib/minio_server.py: enable the anonymous user to perform
more actions. because the tagging-related ops are not enabled by
"mc anonymous set public", we have to enable them using the "set-json"
subcommand.
* utils/s3/client: add methods to manipulate taggings.
* test/boost/s3_test: add a simple test accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14486
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to assign a value
of `T` directly to the shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as a shared_ptr should behave like
a pointer, not like the value it points to. and this
could be error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, expecting that the value pointed to by `p` is cleared
and that all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, but the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so, before deprecating and removing
this operator, let's stop using it.
in this change, we update two call sites of
`lw_shared_ptr::operator=(T&&)`. instead of assigning a new instance
of T directly to the pointer, a new instance of lw_shared_ptr is
created and assigned to the existing shared_ptr.
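a minimal sketch of the resulting pattern, on a generic example rather than the actual call sites:
```c++
// generic example: make the rebinding explicit instead of relying on the
// deprecated operator=(T&&)
#include <seastar/core/shared_ptr.hh>
#include <string>

void rebind(seastar::lw_shared_ptr<std::string>& p) {
    // before: p = std::string("new value");               // looked like *p = ..., but rebinds
    p = seastar::make_lw_shared<std::string>("new value"); // explicitly point p at a new object
}

void update_in_place(seastar::lw_shared_ptr<std::string>& p) {
    *p = "new value";  // updates the pointee; all copies of p observe the change
}
```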
Closes#14470
* github.com:scylladb/scylladb:
sstables: use try_emplace() when appropriate
replica,sstable: do not assign a value to a shared_ptr
we added pretty_printers.cc back in
83c70ac04f, in which configure.py was
updated. so let's sync the CMake build system accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14442
As the goal is to make compaction fast-forward to the next owned range,
make_sstable_reader() should be extended to create a reader with
parameters forwarded from the mutation_source interface, which will
be used when wiring cleanup with the multi-range reader.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Our usage of inodes is dual:
- the Index.db and Data.db components are pinned in memory as
the files are open
- all other components are read once and never looked at again
As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.
Ref https://github.com/scylladb/scylladb/issues/14506
Closes#14509
Prevent switch case statements from falling through without an annotation
([[fallthrough]]) proving that this was intended.
Existing intended cases were annotated.
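A minimal, generic example of the annotation this enforces (not a specific ScyllaDB call site):
```c++
#include <cstdio>

void handle(int level) {
    switch (level) {
    case 2:
        std::puts("doing the level-2 work");
        [[fallthrough]];   // intentional: level 2 also needs the level-1 work
    case 1:
        std::puts("doing the level-1 work");
        break;
    default:
        break;
    }
}
```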
Closes#14607
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.
* seastar 99d28ff057...2b7a341210 (23):
> Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes#8828.
> reactor: use structured binding when appropriate
> Simplify payload length and mask parsing.
> memcached: do not used deprecated API
> build: serialize calls to openssl certificate generation
> reactor: epoll backend: initialize _highres_timer_pending
> shared_ptr: deprecate lw_shared_ptr operator=(T&&)
> tests: fail spawn_test if output is empty
> Support specifying the "build root" in configure
> Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
> build: correct the syntax error in comment
> util: print_safe: fix hex print functions
> Add code examples for handling exceptions
> smp: warn if --memory parameter is not supported
> Merge 'gate: track holders' from Benny Halevy
> file: call lambda with std::invoke()
> deleter: Delete move and copy constructors
> file: fix the indent
> file: call close() without the syscall thread
> reactor: use s/::free()/::io_uring_free_probe()/
> Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
> reactor: Don't re-evaliate local reactor for thread_pool
> Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
Closes#14602
The script gets the build id on its own by eu-unstrip-ing the core file
and searching for the necessary value in the output. This can be a
somewhat lengthy operation, especially on huge core files. Sometimes
(e.g. in tests) the build id is known and can just be provided as an
argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14574
Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.
Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).
Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
replaced or changed their IPs. These entries are still using status
`NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
`handle_state_normal` is also called for it). The
`_gossiper.wait_alive` call would wait for those entries too and
eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
may not be populated yet, which is required to do the IP<->Host ID
translations -- and populating `token_metadata` happens inside
`handle_state_normal`, so we have a chicken-and-egg problem here.
It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs. `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as a parameter and is modified appropriately; it now calculates
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).
Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.
This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.
We need a better way to calculate `sync_nodes` which detects and ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.
In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.
We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.
We also introduce regression tests with necessary extensions
to the test framework.
Fixes#14468
Fixes#14487
Closes#14507
* github.com:scylladb/scylladb:
test: rename `test_topology_ip.py` to `test_replace.py`
test: test bootstrap after IP change
test: scylla_cluster: return the new IP from `change_ip` API
test: node replace with `ignore_dead_nodes` test
test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
storage_service: remove `get_nodes_to_sync_with`
storage_service: use `token_metadata` to calculate nodes waited for to be UP
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.
so, let's make the percentage a float before formatting it.
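a minimal illustration with made-up values (not the actual call site); `{:f}` is only valid for floating-point arguments, so convert first:
```c++
#include <fmt/core.h>

int main() {
    long done = 48, total = 64;
    // fmt::format("{:f}", done);   // rejected: {:f} requires a floating-point argument
    fmt::print("{:.2f}%\n", 100.0 * static_cast<double>(done) / total);  // prints 75.00%
    return 0;
}
```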
Fixes#14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14588
Michał Chojnowski noted that this is not true. -O0 almost doubles
the run time of `./test.py --mode=debug`, but it does not fail
any of the tests.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14456
Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair we iterate over all (shard, table, range), read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for small num_tokens (16 in the tests) the number of affected ranges (those we need to consider) amounts to the total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row-level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard; there are `parallel_for_each` calls in the code, but they are throttled by the chosen memory constraints and in fact execute sequentially.
When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If sharder parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard, which is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contains data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly because of `smp::submit_to`.
In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. We do this if the estimated number of partitions is small (<=100).
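A very rough sketch of that decision, using hypothetical helper names (filter_by_shard, make_full_range_reader, make_multishard_forwarding_reader) that do not exist in the codebase; only the <=100 threshold and the filter-vs-fast-forward split come from the description above:
```c++
// Hypothetical helpers; the real code lives in repair/.
reader_handle make_repair_reader(const dht::partition_range& master_range,
                                 const dht::sharder& master_sharder,
                                 unsigned master_shard,
                                 uint64_t estimated_partitions) {
    constexpr uint64_t small_range_threshold = 100;
    if (estimated_partitions <= small_range_threshold) {
        // read the whole range once and drop partitions whose token maps to
        // another master shard; no per-subrange fast_forward_to / submit_to
        return filter_by_shard(make_full_range_reader(master_range),
                               master_sharder, master_shard);
    }
    // large ranges keep the old behavior: split into per-shard subranges and
    // fast-forward the multishard reader between them
    return make_multishard_forwarding_reader(master_range, master_sharder, master_shard);
}
```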
This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`:
```
with this patch
20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2...
20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859
without this patch
21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2...
21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140
```
Fixes: #14093
Closes#14178
* github.com:scylladb/scylladb:
repair_test: add test_reader_with_different_strategies
repair: extract repair_reader declaration into reader.hh
repair_meta: get_estimated_partitions fix
repair_meta: use multishard_filter reader if the number of partitions is small
repair_meta: delay _repair_reader creation
database.hh: make_multishard_streaming_reader with range parameter
database.cc: extract streaming_reader_lifecycle_policy
In get_sstables_for_key in api/column_family.cc a set of lw_shared_ptrs
to sstables is passed to the reducer of map_reduce0. The reducer then accesses
these shared pointers. As the reducer is invoked on the same shard
map_reduce0 is called on, we have an illegal access to a shared pointer
on a non-owner cpu.
The set of shared pointers to sstables is now transformed in the map function,
which is guaranteed to be invoked on the shard associated with the service.
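A minimal sketch of the fix, with a hypothetical service and return type (not the actual api/column_family.cc code): convert to plain values inside the map function, which runs on the owning shard, so the reducer never touches foreign lw_shared_ptrs.
```c++
// Hypothetical service; the point is that the reducer only sees plain values.
#include <seastar/core/sharded.hh>
#include <set>
#include <string>

struct table_service {
    // runs on the owning shard, so it may dereference its lw_shared_ptrs and
    // return plain strings instead of the pointers themselves
    std::set<std::string> sstable_names_for_key(const std::string& key);
};

seastar::future<std::set<std::string>>
sstables_for_key(seastar::sharded<table_service>& svc, std::string key) {
    return svc.map_reduce0(
        [key] (table_service& local) { return local.sstable_names_for_key(key); },
        std::set<std::string>{},
        [] (std::set<std::string> acc, std::set<std::string> names) {
            acc.merge(names);   // safe on the calling shard: plain values only
            return acc;
        });
}
```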
Fixes: #14515.
Closes#14532
fmtlib uses `{}` as the placeholder for the formatted argument, not
`{}}`.
so let's correct it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14586
when formatting the error message for `api_error::validation`, we
always include the caller in the error message, but in this case we
forgot to pass the `caller` to `seastar::format()`. if fmtlib
actually formatted the message, it would throw.
so let's pass `caller` to `seastar::format()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14589
When the task manager is not aborted, the tasks are kept in memory,
preventing the tasks' gate from being closed.
When wrapped_compaction_manager is destructed, the task manager gets
aborted, so that the system can shut down.
task_manager::module::stop() waits until all compactions are complete.
Thus, ongoing compactions should be aborted before stop() is called,
so as not to prolong the shutdown process.
The task manager's compaction module is now stopped after
compaction_manager::do_stop(), which aborts ongoing compactions,
is called.
The test has about 1/2500000 chance to fail due to a conflict of random
values. And it recently did, just to spite us.
Fight back.
Fixes#14563
Closes#14576
before this change, we formatted an sstring with "{:d}"; fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.
so, in this change, we just print the int8_t using `seastar::format()`
in a single pass, with the format specifier `#02x` instead of
adding the "0x" prefix manually.
Fixes#14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14578
fmtlib allows us to specify the field width dynamically, so specifying
the field width in the same statement that formats the argument improves
readability. and using a constexpr fmt string allows us to switch
to the compile-time format checking supported by fmtlib v8.
this change also uses `fmt::print()` to format the argument right into
the output ostream, instead of creating a temporary sstring and
copying it to the output ostream.
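a minimal illustration of both points with made-up values (not the actual call site):
```c++
#include <fmt/ostream.h>   // fmt::print(std::ostream&, ...)
#include <iostream>

int main() {
    const int width = 12;
    const double ratio = 0.8731;
    // the field width is passed as an argument ({} inside the spec) instead of
    // being baked into the format string, and the result goes straight to the
    // ostream without a temporary string
    fmt::print(std::cout, "{:>{}.2f}%\n", ratio * 100, width);
    return 0;
}
```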
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14579
Fixes https://github.com/scylladb/scylladb/issues/14565
This commit improves the description of ScyllaDB configuration
via User Data on AWS.
- The info about experimental features and developer mode is removed.
- The description of User Data is fixed.
- The example in User Data is updated.
- The broken link is fixed.
Closes#14569
The CDC generation data can be large and may not fit in a single command.
This PR splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.
Commands are sent sequentially to avoid concurrency problems.
Topology snapshots contain only the mutation of the current CDC generation data
but don't contain any previous or future generations. If a new
generation of data is being broadcast but hasn't been entirely applied
yet, the applied part won't be sent in a snapshot. New or delayed nodes
can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.
A mechanism to remove old CDC generations will be introduced as a
follow-up.
Closes#13962
* github.com:scylladb/scylladb:
test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
raft: add release_guard
raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
raft topology: make `mutation_size_threshold` depends on `max_command_size`
raft: reduce max batch size of raft commands and raft entries
raft: add description argument to add_entry_unguarded
raft: introduce `write_mutations` command
raft: refactor `topology_change` applying
Avoid pinging self in the direct failure detector; it adds confusing noise and constant overhead.
Fixes#14388
Closes#14558
* github.com:scylladb/scylladb:
direct_fd: do not ping self
raft: initialize raft_group_registry with host id early
raft: code cleanup