scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-07 07:23:15 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	a44bc233f5	compaction: refactor mapping of compaction type to string This will make it easier to introduce new type and also to map type to string and vice-versa, using reverse lookup. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-11 09:29:53 -03:00
Raphael S. Carvalho	503a0ea928	compaction: move compaction_name() out of line Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-11 09:29:46 -03:00
Calle Wilund	f44420f2c9	snapshot: Add filter to check for existing snapshot Fixes #8212 Some snapshotting operations call in on a single table at a time. When checking for existing snapshots in this case, we should not bother with snapshots in other tables. Add an optional "filter" to check routine, which if non-empty includes tables to check. Use case is "scrub" which calls with a limited set of tables to snapshot. Closes #8240	2021-03-10 20:21:38 +02:00
Benny Halevy	ff5b42a0fa	bytes_ostream: max_chunk_size: account for chunk header Currently, if the data_size is greater than max_chunk_size - sizeof(chunk), we end up allocating up to max_chunk_size + sizeof(chunk) bytes, exceeding buf.max_chunk_size(). This may lead to allocation failures, as seen in https://github.com/scylladb/scylla/issues/7950, where we couldn't allocate 131088 (= 128K + 16) bytes. This change adjusts the exposed max_chunk_size() to be max_alloc_size (128KB) - sizeof(chunk) so that the allocated chunks would normally be allocated in 128KB chunks in the write() path. Added a unit test - test_large_placeholder that stresses the chunk allocation path from the write_place_holder(size) entry point to make sure it handles large chunk allocations correctly. Refs #7950 Refs #8081 Test: unit(release), bytes_ostream_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>	2021-03-10 19:54:12 +02:00
Asias He	268fa9d9fe	main: Lower shares for main scheduling group The main scheduling group has the shares of 1000, which is as high as the statement group. From time to time, we see unexpected scheduling group leaking to the main group, which causes a drop in query performance. This patch reduces the main scheduling shares to 200, which is the same as the maintenance scheduling group. It is a safer default in case code leaks to the main scheduling group. Refs: #7720 Closes #8243	2021-03-10 19:34:45 +02:00
Takuya ASADA	af8eae317b	scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero On the environment hard limit of coredump is set to zero, coredump test script will fail since the system does not generate coredump. To avoid such issue, set ulimit -c 0 before generating SEGV on the script. Note that scylla-server.service can generate coredump even ulimit -c 0 because we set LimitCORE=infinity on its systemd unit file. Fixes #8238 Closes #8245	2021-03-10 19:28:10 +02:00
Avi Kivity	5342d79461	Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael * 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla: sstable_set: move all() implementation into sstable_set_impl sstable_set: preparatory work to change sstable_set::all() api sstables: remove bag_sstable_set	2021-03-10 19:19:26 +02:00
Botond Dénes	cf28552357	mutation_test: test_mutation_diff_with_random_generator: compact input mutations This test checks that `mutation_partition::difference()` works correctly. One of the checks it does is: m1 + m2 == m1 + (m2 - m1). If the two mutations are identical but have compactable data, e.g. a shadowable tombstone shadowed by a row marker, the apply will collapse these, causing the above equality check to fail (as m2 - m1 is null). To prevent this, compact the two input mutations. Fixes: #8221 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>	2021-03-10 16:28:14 +01:00
Raphael S. Carvalho	c3b8757fa1	sstable_set: move all() implementation into sstable_set_impl The main motivation behind this is that by moving all() impl into sstable_set_impl, sstable_set no longer needs to maintain a list with all sstables, which in turn may disagree with the respective sstable_set_impl. This will be very important for compound_sstable_set_impl which will be built from existing sets, and will implement all() by combining the all() of its managed sets. Without this patch, we'd have to insert the same sstable at both compound set and also the set managed by it, to guarantee all() of compound set would return the correct data, which would be expensive and error prone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-10 12:02:13 -03:00
Raphael S. Carvalho	05b07c7161	sstable_set: preparatory work to change sstable_set::all() api users of sstable_set::all() rely on the set itself keeping a reference to the returned list, so user can iterate through the list assuming that it is alive all the way through. this will change in the future though, because there will be a compound set impl which will have to merge the all() of multiple managed sets, and the result is a temporary value. so even range-based loops on all() have to keep a ref to the returned list, to avoid the list from being prematurely destroyed. so the following code for (auto& sst : sstable_set.all()) { ...} becomes for (auto sstables = sstable_set.all(); auto& sst : sstables) { ... } Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-10 12:02:12 -03:00
Avi Kivity	746798fd56	Merge "sstables: get rid of data_consume_context" from Botond " This class is basically a wrapper around a unique pointer and a few short convenience methods, but is otherwise a distraction in trying to untangle the maze that is the sstable reader class hierarchy. So this patchset folds it into its only real user: the sstable reader. " * 'data_consume_context_bye' of https://github.com/denesb/scylla: sstable: move data_consume_* factory methods to row.hh sstables: fold data_consume_context: into its users sstables: partition.cc: remove data_consume_* forward declarations	2021-03-10 16:45:32 +02:00
Nadav Har'El	a1725217e1	Merge 'alternator: coroutinize handle_api_request' from Piotr Sarna The indentation level is significantly reduced, and so is the number of allocations. The function signature is changed from taking an rvalue ref to taking the unique_ptr by value, because otherwise the coroutine captures the request as a reference, which results in use-after-free. Tests: unit(dev) Closes #8249 * github.com:scylladb/scylla: alternator: drop read_content_and_verify_signature alternator: coroutinize handle_api_request	2021-03-10 16:08:08 +02:00
Piotr Sarna	ba264e7199	alternator: drop read_content_and_verify_signature The only use of this helper function was inlined in a bigger coroutine, so it's no longer needed.	2021-03-10 14:42:53 +01:00
Piotr Sarna	35da51879f	alternator: coroutinize handle_api_request The indentation level is significantly reduced, and so is the number of allocations. The function signature is changed from taking an rvalue ref to taking the unique_ptr by value, because otherwise the coroutine captures the request as a reference, which results in use-after-free.	2021-03-10 14:42:52 +01:00
Botond Dénes	1aa2424dcf	sstable: move data_consume_* factory methods to row.hh	2021-03-10 15:40:50 +02:00
Botond Dénes	a06465a8f3	sstables: fold data_consume_context: into its users `data_consume_context` is a thin wrapper over the real context object and it does little more than forward method calls to it. The few methods doing more than mere forwarding can be folded into its single real user: `sstable_reader`.	2021-03-10 15:38:58 +02:00
Botond Dénes	37eb547224	sstables: partition.cc: remove data_consume_* forward declarations They don't seem to serve any purpose, everything builds fine without them.	2021-03-10 15:23:54 +02:00
Raphael S. Carvalho	f7cc431477	compaction_manager: Fix use-after-free in rewrite_sstables() Use-after-free introduced by `2cf0c4bbf1`. That's because compacting is moved into then_wrapped() lambda, so it's potentially freed on the next iteration of repeat(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>	2021-03-10 13:18:38 +02:00
Nadav Har'El	f41dac2a3a	alternator: avoid large contiguous allocation for request body Alternator request sizes can be up to 16 MB, but the current implementation had the Seastar HTTP server read the entire request as a contiguous string, and then processed it. We can't avoid reading the entire request up-front - we want to verify its integrity before doing any additional processing on it. But there is no reason why the entire request needs to be stored in one big contiguous allocation. This is always a bad idea. We should use a non-contiguous buffer, and that's the goal of this patch. We use a new Seastar HTTPD feature where we can ask for an input stream, instead of a string, for the request's body. We then begin the request handling by reading the content of this stream into a vector<temporary_buffer<char>> (which we alias "chunked_content"). We then use this non-contiguous buffer to verify the request's signature and if successful - parse the request JSON and finally execute it. Beyond avoiding contiguous allocations, another benefit of this patch is that while parsing a long request composed of chunks, we free each chunk as soon as its parsing completed. This reduces the peak amount of memory used by the query - we no longer need to store both unparsed and parsed versions of the request at the same time. Although we already had tests with requests of different lengths, most of them were short enough to only have one chunk, and only a few had 2 or 3 chunks. So we also add a test which makes a much longer request (a BatchWriteItem with large items), which in my experiment had 17 chunks. The goal of this test is to verify that the new signature and JSON parsing code which needs to cross chunk boundaries work as expected. Fixes #7213. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>	2021-03-10 09:22:34 +01:00
Juliusz Stasiewicz	382545a614	docs: explain SSL/non-SSL and shard-aware CQL ports I added short description of shard-aware ports + updated the rules for disabling ports and enabling SSL introduced by #7992. Fixes #8146 Closes #8152	2021-03-09 22:48:30 +02:00
Tomasz Grabiec	c9c2beabc0	Merge "raft: replication tests as individual boost tests" from Alejo * alejo/raft-tests-replication-boost-5: raft: replication test: use Seastar random generator raft: replication test: rename drop_replication raft: replication test: change to Boost test raft: replication test: id helper functions raft: replication test: improve handling connectivity raft: replication test: parametrize snapshots raft: replication test: parametrize drop_replication raft: replication test: remove unused configuration raft: replication test: add license	2021-03-09 17:58:59 +01:00
Pavel Emelyanov	096e452db9	test: Fix exit condition of row_cache_test::test_eviction_from_invalidated The test populates the cache, then invalidates it, then tries to push huge (10x times the segment size) chunks into seastar memory hoping that the invalid entries will be evicted. The exit condition on the last stage is -- total memory of the region (sum of both -- used and free) becomes less than the size of one chunk. However, the condition is wrong, because cache usually contains a dummy entry that's not necessarily on lru and on some test iteration it may happen that evictable size < chunk size < evictable size + dummy size In this case test fails with bad_alloc being unable to evict the memory from under the dummy. fixes: #7959 tests: unit(row_cache_test), unit(the failing case with the triggering seed from the issue + 200 times more with random seeds) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210309134138.28099-1-xemul@scylladb.com>	2021-03-09 17:57:52 +01:00
Alejo Sanchez	f67b85e2b3	raft: replication test: use Seastar random generator Use the random generator provided by Seastar test suite. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:52:07 -04:00
Alejo Sanchez	1bf10a87c6	raft: replication test: rename drop_replication Rename drop_replication to packet_drops for readability. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:52:07 -04:00
Alejo Sanchez	6e193ee3bf	raft: replication test: change to Boost test Change test/raft directory to Boost test type. Run replication_test cases with their own test. RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet loss named name_drops. The directory test/raft is changed to host Boost tests instead of unit. While there improve the documentation. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:52:07 -04:00
Alejo Sanchez	8d9c797954	raft: replication test: id helper functions In raft the UUID 0 is a special case so server ids start at 1. Add two helper functions. Convert local 0-based id to raft 1-based UUID. And from UUID to raft_id. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:50:12 -04:00
Alejo Sanchez	0ffa450222	raft: replication test: improve handling connectivity Change global map of disconnected servers to a more intuitive class connected. The class is callable for the most common case connected(id). Methods connect(), disconnect(), and all() are provided for readability instead of directly calling map methods (insert, erase, clear). They also support both numerical (0 based) and server_id (UUID, 1 based) ids. The actual shared map is kept in a lw_shared_ptr. The class is passed around to be copy-constructed which is practically just creating a new lw_shared_ptr. Internally it tracks disconnected servers but externally it's more intuitive to use connect instead of disconnect. So it reads "connected id" and "not disconnected id", without double negatives. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 12:39:29 -04:00
Alejo Sanchez	7a644f37d3	raft: replication test: parametrize snapshots Snapshots and persisted snapshots created per test instead of globals. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Alejo Sanchez	f72e89fcfe	raft: replication test: parametrize drop_replication Pass drop_replication down instead of keeping it global. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Alejo Sanchez	5a03670f91	raft: replication test: remove unused configuration Remove test case configuration as it's not implemented yet. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Alejo Sanchez	efc6681cd6	raft: replication test: add license Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-09 11:58:20 -04:00
Piotr Sarna	d473bc9b06	Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani This is a reworked submission of #7686 which has been reverted. This series fixes some race conditions in MV/SI schema creation and load, we spotted some places where a schema without a base table reference can sneak into the registry. This can cause to an unrecoverable error since write commands with those schemas can't be issued from other nodes. Most of those cases can occur on 2 main and uncommon cases, in a mixed cluster (during an upgrade) and in a small window after a view or base table altering. Fixes #7709 Closes #8091 * github.com:scylladb/scylla: database: Fix view schemas in place when loading global_schema_ptr: add support for view's base table materialized views: create view schemas with proper base table reference. materialized views: Extract fix legacy schema into its own logic	2021-03-09 16:27:34 +01:00
Asias He	61ac8d03b9	repair: Add ignore_nodes option In some cases, user may want to repair the cluster, ignoring the node that is down. For example, run repair before run removenode operation to remove a dead node. Currently, repair will ignore the dead node and keep running repair without the dead node but report the repair is partial and report the repair is failed. It is hard to tell if the repair is failed only due to the dead node is not present or some other errors. In order to exclude the dead node, one can use the hosts option. But it is hard to understand and use, because one needs to list all the "good" hosts including the node itself. It will be much simpler, if one can just specify the node to exclude explicitly. In addition, we support ignore nodes option in other node operations like removenode. This change makes the interface to ignore a node explicitly more consistent. Refs: #7806 Closes #8233	2021-03-09 16:03:13 +01:00
Gleb Natapov	2a41ad0b57	raft: add testing for non-voting members Add tests to check if quorum (for leader election and commit index purposes) is calculated correctly in the presence of non-voting members. Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>	2021-03-09 13:51:09 +01:00
Gleb Natapov	dd6ba3d507	raft: add non-voting member support This patch adds a support for non-voting members. Non voting member is a member which vote is not counted for leader election purposes and commit index calculation purposes and it cannot become a leader. But otherwise it is a normal raft node. The state is needed to let new nodes to catch up their log without disturbing a cluster. All kind of transitions are allowed. A node may be added as a voting member directly or it may be added as non-voting and then changed to be voting one through additional configuration change. A node can be demoted from voting to non-voting member through a configuration change as well. Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>	2021-03-09 13:47:48 +01:00
Raphael S. Carvalho	863b95aa34	sstables: remove bag_sstable_set bag_sstable_set can be replaced with partitioned_sstable_set, which will provide the same functionality, given that L0 sstables go to a "bag" rather than interval map. STCS, for example, will only have L0 sstables, so it will get exactly the same behavior with partitioned_sstable_set. it also gives us the benefit of keeping the leveled sstables in the interval map if user has switched from LCS to STCS, until they're all compacted into size-tiered ssts. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-03-09 08:39:48 -03:00
Avi Kivity	9038a81317	treewide: drop SEASTAR_CONCEPT Since Scylla requires C++20, there is no need to protect concept definitions or usages with SEASTAR_CONCEPT; it just clutters the code. This patch therefore removes all uses. Closes #8236	2021-03-08 16:04:20 +01:00
Asias He	dc40184faa	gossip: Handle timeout error in gossiper::do_shadow_round Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb is not handled in gossiper::do_shadow_round. If the GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes goes timeout, gossiper::do_shadow_round will throw an exception and fail the whole boot up process. It is fine that some of the remote nodes timeout in shadow round. It is not a must to talk to all nodes. This patch fixes an issue we saw recently in our sct tests: ``` INFO | scylla[1579]: [shard 0] init - Shutting down gossiping INFO | scylla[1579]: [shard 0] gossip - gossip is already stopped INFO | scylla[1579]: [shard 0] init - Shutting down gossiping was successful ... ERR | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out) ``` Fixes #8187 Closes #8213	2021-03-08 13:03:41 +01:00
Nadav Har'El	28804a50f7	alternator-test: test that index can't be a name reference (#xyz) We already have a test which verifies that DynamoDB and Alternator do not allow an index in an attribute path - like a[0].b - to be a value reference - a[:xyz].b. We forgot to verify that the index also can't be a name reference - a[#xyz].b is a syntax error. So here we add a test which confirms that this is indeed the case - DynamoDB doesn't allow it, and neither does Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>	2021-03-08 10:17:19 +01:00
Avi Kivity	938761f49f	types.cc: drop unused #include "compaction_garbage_collector.hh" Garbage-collect unused #includes. Closes #8232	2021-03-08 06:44:03 +01:00
Takuya ASADA	2d9feaacea	scylla_raid_setup: don't abort using raiddev when array_state is 'clear' On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes '/dev/md0 is already using' (issue #7627). So we merged the patch to find free mdX (`587b909`). However, looking into /proc/mdstat of the AMI, it actually says no active md device available: ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat Personalities : unused devices: <none> We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True, but according to kernel doc, the file may be available even if the array is STOPPED: clear No devices, no size, no level Writing is equivalent to STOP_ARRAY ioctl https://www.kernel.org/doc/html/v4.15/admin-guide/md.html So we should also check array_state != 'clear', not just array_state existence. Fixes #8219 Closes #8220	2021-03-07 18:30:11 +02:00
Avi Kivity	1287a5e1d0	test: index_reader_assertions: fix misuse of trichotomic comparator in has_monotonic_positions has_monotonic_positions() wants to check for a greater-than-or-equal-to relation, but actually tests for not-equal, since it treats a trichotomic comparator as a less-than comparator. This is clearly seen in the BOOST_FAIL message just below. Fix by aligning the test with the intended invariant. Luckily, the tests still pass. Ref #1449. Closes #8222	2021-03-07 13:44:37 +02:00
Eliran Sinvani	0220786710	database: Fix view schemas in place when loading On restart the view schemas are loaded and might contain old views with an unmarked computed column. We already have code to update the schema, but before we do it we load the view as is. This is not desired since once registered, this view version can be used for writes which is forbidden since we will spot a non-computed column which is in the view's primary key but not in the base table at all. To solve this, in addition to altering the persistent schema, we fix the view's loaded schema in place. This is safe since computed column is just involved in generating a value for this column when creating a view update so the effect of this manipulation stays internal. The second stage of the in place fixing is to persist the changes made in the in place fixing so the view is ready for the next node restart in particular the `computed_columns` table.	2021-03-07 12:57:16 +02:00
Eliran Sinvani	04de770566	global_schema_ptr: add support for view's base table Up until now, the global_schema_ptr object was a crack through which a view schema with an uninitialized base reference could sneak. Even if the schema itself contained a base reference, the base schema didn't carry over to shards different than the shard on which the global_schema_ptr was created. Since once the schema is in the registry it might be used for everything (reads and writes), we also need to make sure that global schemas for an incomplete view schemas will not be created.	2021-03-07 12:50:42 +02:00
Eliran Sinvani	9162748b18	materialized views: create view schemas with proper base table reference. Newly created view schemas don't always have their base info, this is bad since such schemas don't support read nor write. This leaves us vulnerable to a race condition where there is an attempt to use this schema for read or write. Here we initialize the base reference and also reconfigure the view to conform to the new computed column type, which makes it usable for write and not only reads. We do it for views created in the migration manager following announcements and also for copied schemas.	2021-03-07 12:50:42 +02:00
Eliran Sinvani	39cd9dae4e	materialized views: Extract fix legacy schema into its own logic We extract the logic for fixing the view schema into its own logic as we will need to use it in more places in the code. This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since it becomes a two liner wrapper for this logic. We also remove it here and replace the call to it with the equivalent code.	2021-03-07 12:50:42 +02:00
Takuya ASADA	53c7600da8	dist: increase fs.aio-max-nr value for other apps Current fs.aio-max-nr value cpu_count() * 11026 is exact size of scylla uses, if other apps on the environment also try to use aio, aio slot will be run out. So increase value +65536 for other apps. Related #8133 Closes #8228	2021-03-07 12:11:36 +02:00
Piotr Sarna	7106ca27e6	service: reduce continuation length for paxos pruning A pair of (finally, handle_exception) is reduced to a single use of then_wrapped(), which saves an allocation. Message-Id: <01949e286db93397209435a85fcc46a8beef6d24.1614937462.git.sarna@scylladb.com>	2021-03-07 11:59:10 +02:00
Nadav Har'El	ad563c6279	Update tools/java submodule Fixes an sstableloader bug where we quoted twice column names that had to be quoted, and therefore failed on such tables - and in particular Alternator tables which always have a column called ":attrs". Fixes #8229 * tools/java 142f517a23...c5d9e8513e (1): > sstableloader: Only escape column names once	2021-03-07 10:33:49 +02:00
Botond Dénes	debaae41f9	mutation_partition: operator<<(mutation_partition::printer) Include row tombstones in the row printout. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210305094106.210249-1-bdenes@scylladb.com>	2021-03-05 14:39:39 +02:00
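
The loop rewrite described in commit 05b07c7161 can be illustrated with a minimal sketch. It assumes that all() hands back a shared pointer to a freshly merged list, which the commit message does not show; `compound_set`, `sstable_list` and `count_sstables` are hypothetical stand-ins, not Scylla's types, and std::shared_ptr is used in place of whatever pointer type the real code uses. Under that assumption, writing `for (auto& sst : *set.all())` would let the temporary pointer (the list's only owner) be destroyed before the loop body runs, while the C++20 init-statement form keeps the owner alive for the whole loop.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a list of sstables (names only).
using sstable_list = std::vector<std::string>;

// Hypothetical compound set: all() merges the managed lists into a
// brand-new list, so the caller holds the only reference to it.
struct compound_set {
    std::vector<sstable_list> managed;

    std::shared_ptr<sstable_list> all() const {
        auto merged = std::make_shared<sstable_list>();
        for (const auto& part : managed) {
            merged->insert(merged->end(), part.begin(), part.end());
        }
        return merged;
    }
};

// Counts the sstables by iterating over all().
std::size_t count_sstables(const compound_set& set) {
    std::size_t n = 0;
    // The init-statement names the owner, so the merged list stays
    // alive until the loop finishes (requires C++20).
    for (auto sstables = set.all(); auto& sst : *sstables) {
        assert(sstables);  // the owner is still valid inside the body
        (void)sst;
        ++n;
    }
    return n;
}
```

The same effect could be had by declaring the owner on the line before the loop; the init-statement form only narrows its scope to the loop itself.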

25452 Commits