scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 15:03:06 +00:00

Author	SHA1	Message	Date
Nadav Har'El	cb9e2ee00a	cql-pytest: tests for fromJson() setting a map<ascii, int> The fromJson() function can take a map JSON and use it to set a map column. However, the specific example of a map<ascii, int> doesn't work in Scylla (it does work in Cassandra). The xfailing tests in this patch demonstrate this. Although the tests use perfectly legal ASCII, scylla fails the fromJson() function, with a misleading error. Refs #7949. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210121233855.100640-1-nyh@scylladb.com>	2021-01-22 14:29:25 +01:00
Pavel Emelyanov	90d445464b	compaction: Remove compaction_manager::enabled() This method was marked with 'FIXME -- should not be public' when it was introduced. Since then it has stopped being used and can even be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210122083146.5886-1-xemul@scylladb.com>	2021-01-22 14:07:38 +02:00
Kamil Braun	570d15c7bc	multishard_combining_reader: do not use `smp::count` `multishard_combining_reader` currently only works under the assumption that every table uses the same sharder configured using the node's number of shards. But we could potentially specify a different sharder for a chosen table, e.g. one that puts everything on shard 0. Then this assumption will be broken and the reader causes a segfault. Fixes #7945.	2021-01-21 18:28:18 +02:00
Nadav Har'El	328be1ca7c	cql-pytest: tests for fromJson() not accepting empty string as integer When writing to an integer column, Cassandra's fromJson() function allows not just JSON number constants, it also allows a string containing a number. Strings which do not hold a number fail with a FunctionFailure. In particular, the empty string "" is an invalid number, and should fail. The tests in this patch check this for two integer types: int and varint. Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails to recognize the error for varint, while Cassandra fails to recognize the error for int. The tests in this patch reproduce these bugs. The tests demonstrating Scylla's bug are marked xfail, and the tests demonstrating Cassandra's bug is marked "cassandra_bug" (which means it is marked xfail only when running against Cassandra, but expected to succeed on Scylla. Refs #7944. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210121133833.66075-1-nyh@scylladb.com>	2021-01-21 15:24:48 +01:00
Nadav Har'El	702b1b97bf	cql: fix error return from execution of fromJson() and other functions As reproduced in cql-pytest/test_json.py and reported in issue #7911, failing fromJson() calls should return a FUNCTION_FAILURE error, but currently produce a generic SERVER_ERROR, which can lead the client to think the server experienced some unknown internal error and the query can be retried on another server. This patch adds a new cassandra_exception subclass that we were missing - function_execution_exception - properly formats this error message (as described in the CQL protocol documentation), and uses this exception in two cases: 1. Parse errors in fromJson()'s parameters are converted into a function_execution_exception. 2. Any exceptions during the execute() of a native_scalar_function_for function is converted into a function_execution_exception. In particular, fromJson() uses a native_scalar_function_for. Note, however, that functions which already took care to produce a specific Cassandra error, this error is passed through and not converted to a function_execution_exception. An example is the blobAsText() which can return an invalid_request error, so it is left as such and not converted. This also happens in Cassandra. All relevant tests in cql-pytest/test_json.py now pass, and are no longer marked xfail. This patch also includes a few more improvements to test_json.py. Fixes #7911 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>	2021-01-21 15:21:13 +01:00
Nadav Har'El	49440d67ad	Merge: Fix multiple issues with timeuuid type Merged patch series by Konstantin Osipov: "These series improve uniqueness of generated timeuuids and change list append/prepend logic to use client/LWT timestamp in timeuuids generated for list keys. Timeuuid compare functions are optimized. The test coverage is extended for all of the above." uuid: add a comment warning against UUID::operator< uuid: replace slow versions of timeuiid compare with optimized/tested versions. test: add tests for legacy uuid compare & msb monotonicity test: add a test case for append/prepend limit test: add a test case for monotonicity of timeuuid least significant bits uuid: implement optimized timeuuid compare test: add a test case for list prepend/append with custom timestamp lists: rewrite list prepend to use append machinery lists: use query timestamp for list cell values during append uuid: fill in UUID node identifier part of UUID test: add a CQL test for list append/prepend operations	2021-01-21 13:20:07 +02:00
Konstantin Osipov	e18e2cb9f2	uuid: add a comment warning against UUID::operator<	2021-01-21 13:03:59 +03:00
Konstantin Osipov	845f6c667b	uuid: replace slow versions of timeuiid compare with optimized/tested versions.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	56d8d166cb	test: add tests for legacy uuid compare & msb monotonicity	2021-01-21 13:03:59 +03:00
Konstantin Osipov	257c5b0879	test: add a test case for append/prepend limit	2021-01-21 13:03:59 +03:00
Konstantin Osipov	d6e65a3735	test: add a test case for monotonicity of timeuuid least significant bits Ensure that timeuuid least significant bits are compared correctly.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	0af3758aff	uuid: implement optimized timeuuid compare Introduce uint64_t based comparator for serialized timeuuids. Respect Cassandra legacy for timeuuid compare order. Scylla uses two versions of timeuuid compare: - one for timeuuid values stored in uuid columns - a different one for timeuuid values stored in timeuuid columns. This commit re-implements the implementations of these comparators in types.cc and deprecates the respective implementations types.cc. They will be removed in a following patch. A micro-benchmark at https://github.com/alecco/timeuuid-bench/ shows 2-4x speed up of the new comparators.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	b4500a55c7	test: add a test case for list prepend/append with custom timestamp Scylla now takes a custom timestamp into account when executing list append/prepend operations. Test the new semantics.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	232ce6f611	lists: rewrite list prepend to use append machinery Rewrite list prepend to use the same machinery as append, and thus produce correct results when used in LWT. After this patch, list prepend begins to honor user supplied timestamps. If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00 an exception is thrown. Fixes #7611	2021-01-21 13:03:59 +03:00
Konstantin Osipov	2b8ce83eea	lists: use query timestamp for list cell values during append Scylla list cells are represented internally as a map of timeuuid => value. To append a new value to a list the coordinator generates a timeuuid reflecting the current time as key and adds a value to the map using this key. Before this patch, Scylla always generated a timeuuid for a new value, even if the query had a user supplied or LWT timestamp. This could break LWT linearizability. User supplied timestamps were ignored. This is reported as https://github.com/scylladb/scylla/issues/7611 A statement which appended multiple values to a list or a BATCH generated an own microsecond-resolution timeuuid for each value: BEGIN BATCH UPDATE ... SET a = a + [3] UPDATE ... SET a = a + [4] APPLY BATCH UPDATE ... SET a = a + [3, 4] To fix the bug, it's necessary to preserve monotonicity of timeuuids within a batch or multi-value append, but make sure they all use the microsecond time, as is set by LWT or user. To explain the fix, it's first necessary to recall the structure of time-based UUIDs: 60 bits: time since start of GMT epoch, year 1582, represented in 100-nanosecond units 4 bits: version 14 bits: clock sequence, a random number to avoid duplicates in case system clock is adjusted 2 bits: type 48 bits: MAC address (or other hardware address) The purpose of clockseq bits is as defined in https://tools.ietf.org/html/rfc4122#section-4.1.5 is to reduce the probability of UUID collision in case clock goes back in time or node id changes. The implementation should reset it whenever one of these events may occur. Since LWT microsecond time is guaranteed to be unique by Paxos, the RFC provisioning for clockseq and MAC slots becomes excessive. The fix thus changes timeuuid slot content in the following way: - time component now contains the same microsecond time for all values of a statement or a batch. The time is unique and monotonic in case of LWT. Otherwise it's most always monotonic, but may not be unique if two timestamps are created on different coordinators. - clockseq component is used to store a sequence number which is unique and monotonic for all values within the statement/batch. - to protect against time back-adjustments and duplicates if time is auto-generated, MAC component contains a random (spoof) MAC address, re-created on each restart. The address is different at each shard. The change is made for all sources of time: user, generated, LWT. Conditioning the list key generation algorithm on the source of time would unnecessarily complicate the code while not increase quality (uniqueness) of created list keys. Since 14 bits of clockseq provide us with only 16383 distinct slots per statement or batch, 3 extra bits in nanosecond part of the time are used to extend the range to 131071 values per statement/batch. If the rang is exceeded beyond the limit, an exception is produced. A twist on the use of clockseq to extend timeuuid uniqueness is that Scylla, like Cassandra, uses int8 compare to compare lower bits of timeuuid for ordering. The patch takes this into account and sign-complements the clockseq value to make it monotonic according to the legacy compare function. Fixes #7611 test: unit (dev)	2021-01-21 13:03:59 +03:00
Konstantin Osipov	6d1781be36	uuid: fill in UUID node identifier part of UUID Before this patch, UUID generation code was not creating sufficiently unique IDs: the 6 byte node identifier was mostly empty, i.e. only containing shard id. This could lead to collisions between queries executed concurrently at different coordinators, and, since timeuuid is used as key in list append and prepend operations, lead to lost updates. To generate a unique node id, the patch uses a combination of hardware MAC address (or a random number if no hardware address is available) and the current shard id. The shard id is mixed into higher bits of MAC, to reduce the chances on NIC collision within the same network. With sufficiently unique timeuuids as list cell keys, such updates are no longer lost, but multi-value update can still be "merged" with another multi-value update. E.g. if node A executes SET l = l + [4, 5] and node B executes SET l = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4, 6, 5, 7], [6, 4, 5, 7] and so on. At least we are now less likely to get any value lost. Fixes #6208. @todo: initialize UUID subsystem explicitly in main() and switch to using seastar::engine().net().network_interfaces() test: unit (dev)	2021-01-21 13:03:53 +03:00
Avi Kivity	4cfaab208e	allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations Now that managed_bytes and its users do not assume that a managed_bytes instance allocated using standard_allocation_strategy is non-fragmented, we can set the preferred max contiguous allocation to 128k. This causes managed_bytes to fragment instances that are larger than this size. Note that managed_bytes is the only user. Closes #7943	2021-01-21 11:15:13 +02:00
Tomasz Grabiec	f08a3e3fd8	Merge "raft: test fixes, etcd tests, simplification" from Alejo This patch set adds etcd unit tests for raft. It also includes a fix for replication test in debug mode and a simplification for append_request. Tests: unit ({dev}), unit ({debug}), unit ({release}) * https://github.com/alecco/scylla/tree/raft-ale-tests-09b: raft: etcd unit tests: test log replication raft: boost test etcd: test fsm can vote from any state raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs raft: replication test: add etcd test for cycling leaders raft: testing: provide primitives to wait for log propagation raft: etcd unit tests: initial boost tests raft: combine append_request _receive and _send	2021-01-21 10:41:33 +02:00
Pekka Enberg	7d98e05923	Update tools/python3 submodule * tools/python3 1763a1a...c579207 (1): > dist/debian: handle rc version correctly	2021-01-21 10:41:33 +02:00
Avi Kivity	daa0e964fc	dbuild: avoid --pids-limit with podman and cgroupsv1 Podman doesn't correctly support --pids-limit with cgroupsv1. Some versions ignore it, and some versions reject the option. To avoid the error, don't supply --pids-limit if cgroupsv2 is not available (detected by its presence in /proc/filesystems). The user is required to configure the pids limit in /etc/containers/containers.conf. Fixes #7938. Closes #7939	2021-01-21 10:41:33 +02:00
Botond Dénes	4d581f1bb3	docs/README.md: guides: also mention running and debugging Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210120083304.36447-1-bdenes@scylladb.com>	2021-01-20 16:07:29 +02:00
Avi Kivity	f11a0700a8	Merge "mutation_writer: explicitly close writers" from Benny " _consumer_fut is expected to return an exception on the abort path. Wait for it and drop any exception so it won't be abandoned as seen in #7904. A future<> close() method was added to return _consumer_fut. It is called both after abort() in the error path, and after consume_end_of_stream, on the success path. With that, consume_end_of_stream was made void as it doesn't return a future<> anymore. Fixes #7904 Test: unit(release) " * tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla: mutation_writer: bucket_writer: add close mutation_writer/feed_writers: refactor bucket/shard writers mutation_writer: update bucket/shard writers consume_end_of_stream	2021-01-20 16:07:29 +02:00
Pekka Enberg	6cc981d089	scylla: Add "--build-mode" command line option This adds a "--build-mode" command line option to "scylla" executable: $ ./build/dev/scylla --build-mode dev This allows you to discover the build mode of a "scylla" executable without resorting to "readelf", for example, to verify that you are looking at the correct executable while debugging packaging issues. Closes #7865	2021-01-20 16:07:29 +02:00
Botond Dénes	7eb8c71342	tools/scylla-types: add link to cql3-type-mapping.md Just like scylla-sstable-index, scylla-types accepts types in (short) cassandra class name notation. The mapping from the clq3 type names to the class names is not straight-forward in all cases, so provide a link to a table which lists the cassandra class name of all supported types (and more). Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>	2021-01-20 10:50:33 +02:00
Botond Dénes	882ade7c6a	types/scylla-sstable-index: update URL to cql3-type-mapping.md Said document was recently moved but the URL was not updated. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210120083816.37774-1-bdenes@scylladb.com>	2021-01-20 10:50:33 +02:00
Avi Kivity	114da51d73	Revert "commitlog: fix size of a write used to zero a segment" This reverts commit `df2f67626b`. The fix is correct, but has an unfortunate side effect with O_DSYNC: each 128k write also needs to flush the XFS log. This translates to 32MB/128k = 256 flushes, compared to one flush with the original code. A better fix would be to prezero without O_DSYNC, then reopen the file with O_DSYNC, but we can do that later. Reopens #5857.	2021-01-20 10:23:43 +02:00
Avi Kivity	586f16bf79	Merge "Cut snitch -> storage service dependency" from Pavel E " Currently storage service and snitch implicitly depend on each other. Storage service gossips snitch data on start, snitch kicks the storage service when its configuration changes. This interdependency is relaxed: - snitch gossips all its state itself without using the storage service as a mediator - storage service listens for snitch updates with the help of self-breaking subscription Both changes make snitch independent from storage service, remove yet another call for global storage service from the codebase and make the storage service -> snitch reference robust against dagling pointers/references tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev) " * 'br-snitch-gossip-2' of https://github.com/xemul/scylla: storage-service: Subscribe to snitch to update topology snitch: Introduce reconfiguration signal snitch: Always gossip snitch info itself snitch: Do gossip DC and RACK itself snitch: Add generic gossiping helper	2021-01-20 10:23:43 +02:00
Pavel Solodovnikov	041072b59f	raft: rename `storage` to `persistence` The new naming scheme more clearly communicates to the client of the raft library that the `persistence` interface implements persistency layer of the fsm that is powering the raft protocol itself rather than the client-side workflow and user-provided `state_machine`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>	2021-01-20 10:23:43 +02:00
Gleb Natapov	248449816b	raft: fix snapshot transfer with existing log prefix Current code that checks when snapshot has to be transferred does not take in account the case where there can be log entries preceding the snapshot. Fix the code to correctly test for snapshot transfer condition. Message-Id: <20210117095801.GB733394@scylladb.com>	2021-01-20 10:23:43 +02:00
Gleb Natapov	1ab262e86b	raft: test: change replication_test to submit one entry at a time replication_test's state machine is not commutative, so if commands are applied in different order the states will be different as well. Since the preemption check was added into co_await in seastar even waiting for a ready future can preempt which will cause reordering of simultaneously submitted entries in debug mode. For a long time we tried to keep entries submission parallel in the test, but with the above seastar change it is no longer possible to maintain it without changing the state machine to be commutative. The patch changes the test to submit entries one by one. Message-Id: <20210117095147.GA733394@scylladb.com>	2021-01-20 10:23:43 +02:00
Benny Halevy	f29732573a	mutation_writer: bucket_writer: add close bucket_writer::close waits for the _consumer_fut. It is called both after consume_end_of_stream() and after abort(). _consumer_fut is expected to return an exception on the abort path. Wait for it and drop any exception so it won't be abandoned as seen in #7904. With that moved to close() time, consume_end_of_stream doesn't need to return a future and is made void all the way in the stack. This is ok since queue_reader_handle::push_end_of_stream is synchronous too. Added a unit test that aborts the reader consumer during `segregate_by_timestamp`, reproducing the Exceptional future ignored issue without the fix. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-19 19:03:58 +02:00
Benny Halevy	fc3f9a57ff	mutation_writer/feed_writers: refactor bucket/shard writers Consolidate shard_based_splitting_writer::shard_writer and timestamp_based_splitting_writer::bucket_writer common code into mutation_writer::bucket_writer. This provides a common place to handle consume_end_of_stream() and abort(), and in particular the handling of the underlying _conmsumer_fut. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-19 18:48:01 +02:00
Benny Halevy	a9d91a2d09	mutation_writer: update bucket/shard writers consume_end_of_stream After `61520a33d6` feed_writers doesn't call consume_end_of_stream after abort() so no need to test if (!_handle.is_terminated()) { and consume_end_of_stream is now called in then_wrapped rather than `finally` so it's ok if it throws. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-01-19 18:44:40 +02:00
Kamil Braun	1a8630e6a7	transport: silence "broken pipe" and "connection reset by peer" errors The code would already silence broken pipe exceptions since it's expected when the other side closes the connection or when we shutdown the socket during Scylla shutdown, but the code wouldn't handle the following: 1. "Connection reset by peer" errors: these can also happen in the aforementioned two scenarios; the conditions that determine which of the two types of errors occur are unclear. 2. The scenarios would sometimes result in a `seastar::nested_exception`, mainly during shutdown. The errors could happen once when trying to send a response to a request (`_write_buf.write(...)/flush(...)`) and then again when trying to close the connection in a `finally` block. These nested exceptions were not silenced. The commit handles each of these cases. Closes #7907. Closes #7931	2021-01-19 10:30:17 +02:00
Tomasz Grabiec	94749b01eb	Merge "futurize flat_mutation_reader::next_partition" from Benny The main motivation for this patchset is to prepare for adding a async close() method to flat_mutation_reader. In order to close the reader before destroying it in all paths we need to make next_partition asynchronous so it can asynchronously close a current reader before destoring it, e.g. by reassignment of flat_mutation_reader_opt, as done in scanning_reader::next_partition. Test: unit(release, debug) * git@github.com:bhalevy/scylla.git futurize-next-partition-v1: flat_mutation_reader: return future from next_partition multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard mutation_reader: filtering_reader: fill_buffer: futurize inner loop flat_mutation_reader::impl: consumer_adapter: futurize handle_result flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer flat_mutation_reader:impl: get rid of _consume_done member	2021-01-19 10:19:03 +02:00
Alejo Sanchez	8a61e7defc	raft: etcd unit tests: test log replication etcd TestLogReplication Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	417b18aaad	raft: boost test etcd: test fsm can vote from any state etcd TestVoteFromAnyState Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	5a75c0e06a	raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs Log truncation of follower when node re-gains leadership. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	f14c44c686	raft: replication test: add etcd test for cycling leaders This test cycles 3 nodes as leaders without adding entries. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	f627972186	raft: testing: provide primitives to wait for log propagation For tests to be able to transition in a consistent state, in some cases it's needed to allow the followers to catch up with the leader. This prevents occasional hangs in debug mode for incoming tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	948ae813e4	raft: etcd unit tests: initial boost tests First batch of ported etcd raft unit tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:12 -04:00
Gleb Natapov	6d47a535b9	raft: combine append_request _receive and _send Combine structs for append request send and receive into a single struct. Author: Gleb Natapov <gleb@scylladb.com> Date: Mon Nov 23 14:33:14 2020 +0200	2021-01-18 12:24:13 -04:00
Konstantin Osipov	bf1a031bd6	test: add a CQL test for list append/prepend operations Test single- and multi- value list append, prepend, append and prepend in a batch, conditional statements. This covers the parts of Cassandra which are working as documented and which we intend to preserve compatibility with.	2021-01-18 17:32:00 +03:00
Jenkins	faf71c6f75	release: prepare for 4.5.dev	2021-01-18 16:05:25 +02:00
Avi Kivity	df3ef800c2	Merge 'Introduce load and stream feature' from Asias He storage_service: Introduce load_and_stream === Introduction === This feature extends the nodetool refresh to allow loading arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from disk and calculates the owning nodes of the data and streams to the owners automatically. From example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process. This can make restores and migrations much easier. === Performance === I managed to get 40MB/s per shard on my build machine. CPU: AMD Ryzen 7 1800X Eight-Core Processor DISK: Samsung SSD 970 PRO 512GB Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32 shards, we can finish the load and stream 1TB of data in 13 mins on each node. 1TB / 40 MB per shard * 32 shard / 60 s = 13 mins === Tests === backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test which creates a cluster with 4 nodes and inserts data, then use load_and_stream to restore to a 2 nodes cluster. === Usage === curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true === Notes === Btw, with the old nodetool refresh, the node will not pick up the data that does not belong to this node but it will not delete it either. One has to run nodetool cleanup to remove those data manually which is a surprise to me and probably to users as well. With load and stream, the process will delete the sstables once it finishes stream, so no nodetool cleanup is needed. The name of this feature load and stream follows load and store in CPU world. Fixes #7831 Closes #7846 * github.com:scylladb/scylla: storage_service: Introduce load_and_stream distributed_loader: Add get_sstables_from_upload_dir table: Add make_streaming_reader for given sstables set	2021-01-18 15:08:19 +02:00
Avi Kivity	60f5ec3644	Merge 'managed_bytes: switch to explicit linearization' from Michał Chojnowski This is a revival of #7490. Quoting #7490: The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope. We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes. As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized(). Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view). By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure. This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator. Closes #7820 * github.com:scylladb/scylla: test: add hashers_test memtable: fix accounting of managed_bytes in partition_snapshot_accounter test: add managed_bytes_test utils: fragment_range: add a fragment iterator for FragmentedView keys: update comments after changes and remove an unused method mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator row_cache: more indentation fixes utils: remove unused linearization facilities in `managed_bytes` class misc: fix indentation treewide: remove remaining `with_linearized_managed_bytes` uses memtable, row_cache: remove `with_linearized_managed_bytes` uses utils: managed_bytes: remove linearizing accessors keys, compound: switch from bytes_view to managed_bytes_view sstables: writer: add write_* helpers for managed_bytes_view compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view types: change equal() to accept managed_bytes_view types: add parallel interfaces for managed_bytes_view types: add to_managed_bytes(const sstring&) serializer_impl: handle managed_bytes without linearizing utils: managed_bytes: add managed_bytes_view::operator[] utils: managed_bytes: introduce managed_bytes_view utils: fragment_range: add serialization helpers for FragmentedMutableView bytes: implement std::hash using appending_hash utils: mutable_view: add substr() utils: fragment_range: add compare_unsigned utils: managed_bytes: make the constructors from bytes and bytes_view explicit utils: managed_bytes: introduce with_linearized() utils: managed_bytes: constrain with_linearized_managed_bytes() utils: managed_bytes: avoid internal uses of managed_bytes::data() utils: managed_bytes: extract do_linearize_pure() thrift: do not depend on implicit conversion of keys to bytes_view clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view cql3: expression: linearize get_value_from_mutation() eariler bytes: add to_bytes(bytes) cql3: expression: mark do_get_value() as static	2021-01-18 11:01:28 +02:00
Asias He	4d32d03172	storage_service: Introduce load_and_stream === Introduction === This feature extends the nodetool refresh to allow loading arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from disk and calculates the owning nodes of the data and streams to the owners automatically. From example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process. This can make restores and migrations much easier. === Performance === I managed to get 40MB/s per shard on my build machine. CPU: AMD Ryzen 7 1800X Eight-Core Processor DISK: Samsung SSD 970 PRO 512GB Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32 shards, we can finish the load and stream 1TB of data in 13 mins on each node. 1TB / 40 MB per shard * 32 shard / 60 s = 13 mins === Tests === backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test which creates a cluster with 4 nodes and inserts data, then use load_and_stream to restore to a 2 nodes cluster. === Usage === curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true === Notes === Btw, with the old nodetool refresh, the node will not pick up the data that does not belong to this node but it will not delete it either. One has to run nodetool cleanup to remove those data manually which is a surprise to me and probably to users as well. With load and stream, the process will delete the sstables once it finishes stream, so no nodetool cleanup is needed. The name of this feature load and stream follows load and store in CPU world. Fixes #7831	2021-01-18 16:32:33 +08:00
Avi Kivity	ab44464911	Revert "docker: remove sshd from the image" This reverts commit `32fd38f349`. Some tests (in scylla-cluster-tests) depend on it.	2021-01-17 14:34:40 +02:00
Raphael S. Carvalho	00c29e1e24	table: Move notify_bootstrap_or_replace_*() out of line Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210117045747.69891-9-raphaelsc@scylladb.com>	2021-01-17 10:36:13 +02:00
Asias He	28007f13f8	distributed_loader: Add get_sstables_from_upload_dir This function scans sstables under the upload directory and return a list of sstables for each shard. Refs #7831	2021-01-16 20:03:17 +08:00

1 2 3 4 5 ...

24908 Commits