scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 06:53:12 +00:00

Author	SHA1	Message	Date
Gleb Natapov	8018eb4612	storage_proxy: use small_vector in storage_proxy::query_singular to store executors Having only one pk to query is the common case, so avoid allocating executor vector for that case.	2021-06-27 14:48:15 +03:00
Gleb Natapov	d908912dbd	storage_proxy: co-routinize storage_proxy::query_singular() The code becomes much easier to understand and allocation for used_replicas can be dropped.	2021-06-27 14:47:19 +03:00
Tomasz Grabiec	a60e73fe14	Merge "raft: allow to initiate leader stepdown process explicitly" from Gleb Sometimes an ability to force a leader change is needed. For instance if a node that is currently serving as a leader needs to be brought down for maintenance. If it will be shutdown without leadership transfer the cluster will be unavailable for leader election timeout at least. * scylla-dev/raft-stepdown-v4: raft: test: test leadership transfer timeout raft: allow to initiate leader stepdown process	2021-06-23 00:14:46 +02:00
Benny Halevy	4ab4f63efe	sstables: mx/writer: flush_tmp_bufs: maybe_yield in loop This loop may cause pretty long reactor stalls as seen in https://github.com/scylladb/scylla/issues/8900 Apparently output_stream<CharType>::slow_write returns a ready future and no yielding is considered, so add a check in the top level loop (that must already be called from a seastar thread). Fixes #8900 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210622152206.156302-1-bhalevy@scylladb.com>	2021-06-22 18:56:12 +03:00
Avi Kivity	d27e88e785	Merge "compaction: prevent broken_promise or dangling reader errors" from Benny " This series prevents broken_promise or dangling reader errors when (resharding) compaction is stopped, e.g. during shutdown. At the moment compaction just closes the reader unilaterally and this yanks the reader from under the queue_reader_handle feet, causing dangling queue reader and broken_promise errors as seen in #8755. Instead, fix queue_reader::close to set value on the _full/_not_full promises and detach from the handle, and return _consume_fut from bucket_writer::consume if handle is terminated. Fixes #8755 Test: unit(dev) DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug) " * tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla: mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated queue_reader_handle: add get_exception method queue_reader: close: set value on promises on detach from handle	2021-06-22 18:52:11 +03:00
Calle Wilund	373fa3fa07	table: ensure memtable is actually in memtable list before erasing Fixes #8749 if a table::clear() was issued while we were flushing a memtable, the memtable is already gone from list. We need to check this before erase. Otherwise we get random memory corruption via std::vector::erase v2: * Make interface more set-like (tolerate non-existance in erase). Closes #8904	2021-06-22 15:58:56 +02:00
Asias He	ffa211a8c7	repair: Avoid copy rows in apply_rows_on_master_in_thread The rows are not used after the call to to_repair_rows_list. Use std::move() to avoid copying. Fixes #8902 Closes #8903	2021-06-22 15:58:56 +02:00
Benny Halevy	02917c79b6	logalloc: get rid of unused _descendant_blocked_requests Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>	2021-06-22 15:58:56 +02:00
Piotr Dulikowski	de1679b1b9	hints: make hints concurrency configurable and reduce the default Previously, hinted handoff had a hardcoded concurrency limit - at most 128 hints could be sent from a single shard at once. This commit makes this limit configurable by adding a new configuration option: `max_hinted_handoff_concurrency_per_shard`. This option can be updated in runtime. Additionally, the default concurrency per shard is made lower and is now 8. The motivation for reducing the concurrency was to mitigate the negative impact hints may have on performance of the receiving node due to them not being properly isolated with respect to I/O. Tests: - unit(dev) - dtest(hintedhandoff_additional_test.py) Refs: #8624 Closes #8646	2021-06-22 15:58:56 +02:00
Gleb Natapov	09528b8671	raft: test: test leadership transfer timeout Test that if leadership transfer cannot be done in configured time frame fsm cancels the leadership transfer process. Also check that timeout_now message is resent on each tick while leadership transfer is in progress.	2021-06-22 14:42:50 +03:00
Gleb Natapov	ed49d29473	raft: allow to initiate leader stepdown process Sometimes an ability to force a leader change is needed. For instance if a node that is currently serving as a leader needs to be brought down for maintenance. If it will be shutdown without leadership transfer the cluster will be unavailable for leader election timeout at least. We already have a mechanism to transfer the leadership in case an active leader is removed. The patch exposes it as an external interface with a timeout period. If a node is still a leader after the timeout the operation will fail.	2021-06-22 14:36:42 +03:00
Konstantin Osipov	bd410da77a	raft: (service) rename raft_services service to raft_group_registry This is a more informative name. Helps see that, say, group0 is a separate service and not bundle all raft services together. Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>	2021-06-21 14:53:54 +03:00
Konstantin Osipov	025f18325e	raft: (service) move raft service to namespace service Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>	2021-06-21 14:53:54 +03:00
Calle Wilund	fdb5801704	table: Always use explicit commitlog discard + clear out rp_set Fixes #8733 If a memtable flush is still pending when we call table::clear(), we can end up doing a "discard-all" call to commitlog, followed by a per-segment-count (using rp_set) _later_. This will foobar our internal usage counts and quite probably cause assertion failures. Fixed by always doing per-memtable explicit discard call. But to ensure this works, since a memtable being flushed remains on memtable list for a while (why?), we must also ensure we clear out the rp_set on discard. v3: * Fix table::clear to discard rp_sets before memtables Closes #8894	2021-06-21 14:53:54 +03:00
Takuya ASADA	a677c46672	dist: stop removing /etc/systemd/system/.mount on package uninstall Listing /etc/systemd/system/.mount as ghost file seems incorrect, since user may want to keep using RAID volume / coredump directory after uninstalling Scylla, or user may want to upgrade enterprise version. Also, we mixed two types of files as ghost file, it should handle differently: 1. automatically generated by postinst scriptlet 2. generated by user invoked scylla_setup The package should remove only 1, since 2 is generated by user decision. See scylladb/scylla-enterprise#1780 Closes #8810	2021-06-21 14:53:54 +03:00
Benny Halevy	499357fb43	row_cache: autoupdating_underlying_reader: fast_forward_to: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210613104232.634621-2-bhalevy@scylladb.com>	2021-06-20 14:46:35 +03:00
Benny Halevy	3db7db5743	row_cache: autoupdating_underlying_reader: fast_forward_to: capture snapshot by value when updating reader Currently we capture the snapshot mutation_source by reference for calling create_underlying_reader after closing the reader. However, if close_reader yields, the snapshot reference passed may be gone, so capture it by value instead. Fixes #8848 Test: unit(dev) DTest: restore_snapshot_using_old_token_ownership_test(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210613104232.634621-1-bhalevy@scylladb.com>	2021-06-20 14:46:35 +03:00
Avi Kivity	5b3fb83ebe	Merge "Remove unused code here and there" from Pavel E " Few randomly spotted dead code locations over past time. Compile-test only. " * 'br-remove-unused-stuff' of https://github.com/xemul/scylla: database: Remove unused forward declarations feature: Remove unused friendship with gossiper schema_tables: Remove unused sharded<proxy> argument database: Remove few unused sharded<proxy> captures view_update_generator: Remove unused struct sstable_with_table storage_service: Remove write-only _force_remove_completion distributed_loader: Remove unused load-prio manipulations	2021-06-20 12:01:40 +03:00
Pavel Emelyanov	ab4fc41f25	database: Remove unused forward declarations Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	d606321575	feature: Remove unused friendship with gossiper Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	96131349e8	schema_tables: Remove unused sharded<proxy> argument Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	0f36f00682	database: Remove few unused sharded<proxy> captures Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	64bb16af8a	view_update_generator: Remove unused struct sstable_with_table Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	cbcbf648b6	storage_service: Remove write-only _force_remove_completion This boolean became effectively unused after `829b4c14` (repair: Make removenode safe by default) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pavel Emelyanov	7396de72b1	distributed_loader: Remove unused load-prio manipulations Mostly this was removed by `6dfeb107` (distributed_loader: remove unused code). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-18 20:19:35 +03:00
Pekka Enberg	055bc33f0f	Update tools/java submodule * tools/java 599b2368d6...5013321823 (4): > cassandra-stress: fix failure due to the assert exception on disconnect when test is completed > node_probe: toppartitions: Fix wrong class in getMethod > Fix NullPointerException in SettingsMode > cassandra-stress: Remove maxPendingPerConnection default	2021-06-18 14:19:34 +03:00
Pekka Enberg	2a9443a753	Update tools/jmx submodule * tools/jmx a7c4c39...5311e9b (2): > storage_service: takeSnapshot: support the skipFlush option > build(deps): bump snakeyaml from 1.16 to 1.26 in /scylla-apiclient	2021-06-18 14:19:29 +03:00
Avi Kivity	b099e7c254	Merge "Untie hints managers and storage service" from Pavel E " The storage service is carried along storage proxy, hints resource manager and hints managers (two of them) just to subscribe the hints managers on lifecycle events (and stop the subscription on shutdown) emitted from storage service. This dependency chain can be greatly simplified, since the storage proxy is already subscribed on lifecycle events and can kick managers directly from its hooks. tests: unit(dev), dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev) " * 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla: hints: Drop storage service from managers hints: Do not subscribe managers on lifecycle events directly	2021-06-17 17:12:31 +03:00
Nadav Har'El	a9b383f423	cql-pytest: improve test for SSL/TLS versions The existing test_ssl.py which tests for Scylla's support of various TLS and SSL versions, used a deprecated and misleading Python API for choosing the protocol version. In particular, the protocol version ssl.PROTOCOL_SSLv23 is not, despite it's name, SSL versions 2 or 3, or SSL at all - it is in fact an alias for the latest TLS version :-( This misunderstanding led us to open the incorrect issue #8837. So in this patch, we avoid the old Python APIs for choosing protocols, which were gradually deprecated, and switch to the new API introduced in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum desired protocol version. With this new API, we can correctly connect with various versions of the SSL and TLS protocol - between SSLv3 through TLSv1.3. With the fixed test, we confirm that Scylla does not allow SSLv3 - as desired - so issue #8837 is a non-issue. Moreover, after issue #8827 was already fixed, this test now passes, so the "xfail" mark is removed. Refs #8837. Refs #8827. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210617134305.173034-1-nyh@scylladb.com>	2021-06-17 17:06:31 +03:00
Nadav Har'El	8f107ece9f	Update seastar submodule * seastar 813eee3e...0e48ba88 (5): > net/tls: on TLS handshake failure, send error to client > net/dns: fix build on gcc 11 > core: fix docstring for max_concurrent_for_each > test: alien_test: replace deprecated call to alien::submit_to() with new variant > alien: prepare for multi-instance use The fix "net/tls: on TLS handshake failure, send error to client" fixes #8827. The test test/cql-pytest/run --ssl test_ssl.py now xpasses, so I'll remove the "xfail" mark in a followup patch.	2021-06-17 16:24:57 +03:00
Pavel Emelyanov	92a4278cd1	hints: Drop storage service from managers The storage service pointer is only used so (un)subscribe to (from) lifecycle events. Now the subscription is gone, so can the storage service pointer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-17 15:09:36 +03:00
Pavel Emelyanov	acdc568ecf	hints: Do not subscribe managers on lifecycle events directly Managers sit on storage proxy which is already subscribed on lifecycle events, so it can "notify" hints managers directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-17 15:06:26 +03:00
Tomasz Grabiec	6d8440fe70	Merge "raft: (testing) leadership transfer tests" from Pavel Solodovnikov The patch set introduces a few leadership transfer tests, some of them are adaptations of corresponding etcd tests (e.g. `test_leader_transfer_ignore_proposal` and `test_transfer_non_member`). Others test different scenarios ensuring that pending leadership transfer doesn't disrupt the rest of the cluster from progressing: Lost `timeout_now` messages` (`test_leader_transfer_lost_timeout_now` and `test_leader_transferee_dies_upon_receiving_timeout_now`) as well as lost `vote_request(force)` from the new candidate (test_leader_transfer_lost_force_vote_request) don't impact the election process following that and the leader is elected as normal. * manmanson/leadership_transfer_tests_v3: raft: etcd_test: test_transfer_non_member raft: etcd_test: test_leader_transfer_ignore_proposal raft: fsm_test: test_leader_transfer_lost_force_vote_request raft: fsm_test: test_leader_transfer_lost_timeout_now raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now	2021-06-17 13:58:31 +02:00
Piotr Sarna	8cca68de75	cql3: add USING TIMEOUT support for deletes Turns out the DELETE statement already supports attributes like timestamp, so it's ridiculously easy to add USING TIMEOUT support - it's just the matter of accepting it in the grammar. Fixes #8855 Closes #8876	2021-06-17 14:21:01 +03:00
Nadav Har'El	45c2442f49	Merge 'Avoid large allocs in mv update code' from Piotr Sarna This series addresses #8852 by: * migrating to chunked_vector in view update generation code to avoid large allocations * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent Combined with #8853 I was able to only observe large partition warnings in the logs for the reproducing code, without crashes, large allocation or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust. Tests: unit(release) Refs #8852 Closes #8856 * github.com:scylladb/scylla: db,view: limit the number of simultaneous view update futures db,view: use chunked_vector for view updates	2021-06-17 14:01:38 +03:00
Avi Kivity	4d70f3baee	storage_proxy: change unordered_set<inet_address> to small_vector in write path The write paths in storage_proxy pass replica sets as std::unordered_set<gms::inet_address>. This is a complex type, with N+1 allocations for N members, so we change it to a small_vector (via inet_address_vector_replica_set) which requires just one allocation, and even zero when up to three replicas are used. This change is more nuanced than the corresponding change to the read path `abe3d7d7` ("Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity"), for two reasons: - there is a quadratic algorithm in abstract_write_response_handler::response(): it searches for a replica and erases it. Since this happens for every replica, it happens N^2/2 times. - replica sets for writes always include all datacenters, while reads usually involve just one datacenter. So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2 =105 compares. We could remove this by sending the index of the replica in the replica set to the replica and ask it to include the index in the response, but I think that this is unnecessary. Those 105 compares need to be only 105/15 = 7 times cheaper than the corresponding unordered_set operation, which they surely will. Handling a response after a cross-datacenter round trip surely involves L3 cache misses, and a small_vector reduces these to a minimum compared to an unordered_set with its bucket table, linked list walking and managent, and table rehashing. Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000 --task-quota-ms show two allocations removed (as expected) and a nice reduction in instructions executed. before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op) after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op) Closes #8847	2021-06-17 13:46:40 +03:00
Avi Kivity	98cdeaf0f2	schema_tables: make the_merge_lock thread_local the_merge_lock is global, which is fine now because it is only used in shard 0. However, if we run multiple nodes in a single process, there will be multiple shard 0:s, and the_merge_lock will be accessed from multiple threads. This won't work. To fix, make it thread_local. It would be better to make it a member of some controlling object, but there isn't one. Closes #8858	2021-06-17 13:41:11 +03:00
Avi Kivity	00ff3c1366	Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy The option is provided by nodetool snapshot https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/ ``` nodetool [(-h <host> \| --host <host>)] [(-p <port> \| --port <port>)] [(-pp \| --print-port)] [(-pw <password> \| --password <password>)] [(-pwf <passwordFilePath> \| --password-file <passwordFilePath>)] [(-u <username> \| --username <username>)] snapshot [(-cf <table> \| --column-family <table> \| --table <table>)] [(-kc <kclist> \| --kc.list <kclist>)] [(-sf \| --skip-flush)] [(-t <tag> \| --tag <tag>)] [--] [<keyspaces...>] -sf / –skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data) ``` But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167) and not supported at the api level. This patch adds support for the option in advance from the api service level down via snapshot_ctl to the table class and snapshot implementation. In addition, a corresponding unit test was added to verify that taking a snapshot with `skip_flush` does not flush the memtable (at the table::snapshot level). Refs #8725 Closes #8726 * github.com:scylladb/scylla: test: database_test: add snapshot_skip_flush_works api: storage_service/snapshots: support skip-flush option snapshot: support skip_flush option table: snapshot: add skip_flush option api: storage_service/snapshots: add sf (skip_flush) option	2021-06-17 13:32:23 +03:00
Nadav Har'El	7fd7e90213	cql-pytest: translate Cassandra's tests for static columns This is a translation of Cassandra's CQL unit test source file validation/entities/StaticColumnsTest.java into our our cql-pytest framework. This test file checks various features of static columns. All these tests pass on Cassandra, and all but one pass on Scylla. The xfailing test, testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra allows but we don't. The new issue about that is: Refs #8869. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210616141633.114325-1-nyh@scylladb.com>	2021-06-17 11:08:28 +02:00
Nadav Har'El	b6b4df9a47	heat-weighted load balancing: improve handling of near-perfect cache Consider two nodes with almost-100% cache hit ratio, but not exactly 100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we want to equalize the miss rate in both nodes. So we send the first node twice the number of requests we send to the second. But unless the disks are extremely limited, this doesn't make sense: As a numeric example, consider that we send 2000 requests to the first node and 1000 to the second, just so the number of misses will be the same - 2 (0.1% and 0.2% misses, respectively). At such low miss numbers, the assumption that the disk reads are the slowest part of the operation is wrong, so trying to equalize only this part is wrong. So above some threshold hit rate, we should treat all hit rates as equivalent. In the code we already had such a threshold - max_hit_rate, but it was set to the incredibly high 0.999. We saw in actual user runs (see issue #8815) that this threshold was too high - one node received twice the amount of requests that another did - although both had near-100% cache hit rates. So in this patch we lower the max_hit_rate to 0.95. This will have two consequences: 1. Two nodes with hit rates above 0.95 will be considered to have the same hit rate, so they will get equal amount of work - even if one has hit rate 0.98 and the other 0.99. 2. A cold node with it rate 0.0 will get 5% of the work of a node with the perfect hit rate limited to 0.95. This will allow the cold node to slowly warm up its cache. Before this patch, if the hot node happened to have a hit rate of 0.999 (the previous maximum), the cold node would get just 0.1% of the work and remain almost idle and fill its cache extremely slowly - which is a waste. Fixes #8815. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210616180732.125295-1-nyh@scylladb.com>	2021-06-17 11:02:08 +02:00
Piotr Sarna	1fb831c8c1	db,view: limit the number of simultaneous view update futures Previously the view update code generated a continuation for each view update and stored them all in a vector. In certain cases the number of updates can grow really large (to millions and beyond), so it's better to only store a limited amount of these futures at a time.	2021-06-17 10:20:52 +02:00
Piotr Sarna	a7f7716ecf	db,view: use chunked_vector for view updates The number of view updates can grow large, especially in corner cases like removing large base partitions. Chunked vector prevents large allocations.	2021-06-17 10:15:17 +02:00
Avi Kivity	3c21833aac	cql3: expr: make column_value (and similar) a first-class expression Currently, column names can only appear in a boolean binary expression, but not on their own. This means that in the statement SELECT a FROM tab WHERE a > 3; We can represent the WHERE clause as an expression, but not the selector. To pave the way for using expressions in selector contexts, we promote the elements of binary_operator::lhs (column_value, column_value_tuple, token) to be expressions in their own right. binary_operator::lhs becomes an expression (wrapped in unique_ptr, because variants can't contain themselves). Note that all three new possibilities make sense in a selector: SELECT column FROM tab SELECT token(pk) FROM tab SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab There is some fallout from this: - because binary_operator contains a unique_ptr, it is no longer copyable. We add a copy constructor and assignment operator to compensate. - often, the new elements don't make sense when evaluating a boolean expression, which is the only context we had before. We call on_internal_error in these cases. The parser right now prevents such cases from being constructed in the first place (this is equivalent to if (some_struct_value) in C). - in statement_restrictions.cc, we need to evalute the lhs in the context of the full binary operator. I introduced with_current_binary_operator() for this; an alternative approach is to create a new sub-visitor. Closes #8797	2021-06-17 10:08:58 +03:00
Tomasz Grabiec	6bdf8c4c46	Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja Miscellaneous preparatory patches for group 0 discovery. * scylla-dev/raft-group-0-part-2-v4: raft: (service) servers map is gid -> server, not sid -> server system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID raft: (server) wait for configuration transition to complete raft: (server) implement raft::server::get_configuration() raft: (service) don't throw from schema state machine raft: (service) permit some scylla.raft cells to be empty raft: (service) properly handle failure to add a server raft: implement is_transient_error()	2021-06-17 00:15:40 +02:00
Asias He	7a32cab524	gossip: Fix use-after-free in real_mark_alive and mark_dead In commit `11a8912093` (gossiper: get_gossip_status: return string_view and make noexcept) get_gossip_status returns a pointer to an endpoint_state in endpoint_state_map. After commit `425e3b1182` (gossip: Introduce direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive can yield in the middle of the function. It is possible that endpoint_state can be removed, causing use-after-free to access it. To fix, make a copy before we yield. Fixes #8859 Closes #8862	2021-06-16 21:16:26 +02:00
Konstantin Osipov	18e3fcdbf1	raft: (service) servers map is gid -> server, not sid -> server Raft Group registry should map Raft Group Id to Raft Server, not Raft Server ID (which is identical for all groups) to Raft server. Raft Group 0 ID works as a cluster identifier, so is generated when a new cluster is created and is shared by all nodes of the same cluster. Implement a helper to get raft::server by group id. Consistently throw a new raft_group_not_found exception if there is no server or rpc for the specified group id.	2021-06-16 19:05:50 +03:00
Avi Kivity	f05ddf0967	Merge "Improve LSA descriptor encoding" from Pavel " The LSA small objects allocation latency is greatly affected by the way this allocator encodes the object descriptor in front of each allocated slot. Nowadays it's one of VLE variants implemented with the help of a loop. Re-implementing this piece with less instructions and without a loop allows greatly reducing the allocation latency. The speed-up mostly comes from loop-less code that doesn't confuse branch predictor. Also the express encoder seems to benefit from writing 8 bytes of the encoded value in one go, rather than byte- -by-byte. Perf measurements: 1. (new) logallog test shows ~40% smaller times 2. perf_mutation in release mode shows ~2% increase in tps 3. the encoder itself is 2 - 4 times faster on x86_64 and 1.05 - 3 times faster on aarch64. The speed-up depends on the 'encoded length', old encoder has linear time, the new one is constant tests: unit(dev), perf(release), just encoder on Aarch64 " * 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla: lsa: Use express encoder uleb64: Add express encoding lsa: Extract uleb64 code into header test: LSA allocation perf test	2021-06-16 18:07:13 +03:00
Pavel Emelyanov	8d0780fb92	lsa: Use express encoder To make it possible to use the express encoder, lsa needs to make sure that the value is below express supreme value and provide the size of the gap after the encoded value. Both requirements can be satisfied when encoding the migrator index on object allocation. On free the encoded value can be larger, so the extended express encoder will need more instructions and will not be that efficient, so the old encoder is used there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 17:47:12 +03:00
Pavel Emelyanov	1782b0c6b9	uleb64: Add express encoding Standard encoding is compiled into a loop that puts values into memory byte-by-byte. This works slowly, but reliably. When allocating an object LSA uses ubel64 encoder with 2 features that allow to optimize the encoder: 1. the value is migrator.index() which is small enough to fit 2 bytes when encoded 2. After the descriptor there usually comes an object which is of 8+ bytes in size Feature #1 makes it possible to encode the value with just a few instructions. In O3 level clang makes it like mov %esi,%ecx and $0xfc0,%ecx and $0x3f,%esi lea (%rsi,%rcx,4),%ecx add $0x40,%ecx Next, the encoder needs to put the value into a gap whose size depends on the alignment of previous and current objects, so the classical algo loops through this size. Feature #2 makes it possible to put the encoded value and the needed amount of zeros by using 2 64-bit movs. In this case the encoded value gets off the needed size and overwrites some memory after. That's OK, as this overwritten memory is where the allocated object _will_ be, so the contents there is not of any interest. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 17:47:10 +03:00
Pavel Emelyanov	d8dea48248	lsa: Extract uleb64 code into header The LSA code encodes an object descriptor before the object itself. The descriptor is 32-bit value and to put it in an efficient manner it's encoded into unsigned little-endian base-64 sequence. The encoding code is going to be optimized, so put it into a dedicated header in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-16 17:46:44 +03:00

1 2 3 4 5 ...

27055 Commits