scylladb

Author	SHA1	Message	Date
Pavel Emelyanov	11c99fc41b	table: Don't use global gossiper The table::get_hit_rate needs gossiper to get hitrates state from. There's no way to carry gossiper reference on the table itself, so it's up to the callers of that method to provide it. Fortunately, there's only one caller -- the proxy -- but the call chain to carry the reference is not very short ... oh, well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-05-03 10:33:08 +03:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Avi Kivity	bbad8f4677	replica: move ::database, ::keyspace, and ::table to replica namespace Move replica-oriented classes to the replica namespace. The main classes moved are ::database, ::keyspace, and ::table, but a few ancillary classes are also moved. There are certainly classes that should be moved but aren't (like distributed_loader) but we have to start somewhere. References are adjusted treewide. In many cases, it is obvious that a call site should not access the replica (but the data_dictionary instead), but that is left for separate work. scylla-gdb.py is adjusted to look for both the new and old names.	2022-01-07 12:04:38 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Benny Halevy	4d2561ff75	abstract_replication_strategy: precalculate get_replication_factor for effective_replication_map Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 16:10:06 +03:00
Avi Kivity	222ef17305	build, treewide: enable -Wredundant-move Returning a function parameter guarantees copy elision and does not require a std::move(). Enable -Wredundant-move to warn us that the move is unneeded, and gain slightly more readable code. A few violations are trivially adjusted. Closes #9004	2021-07-11 12:53:02 +03:00
Avi Kivity	4d70f3baee	storage_proxy: change unordered_set<inet_address> to small_vector in write path The write paths in storage_proxy pass replica sets as std::unordered_set<gms::inet_address>. This is a complex type, with N+1 allocations for N members, so we change it to a small_vector (via inet_address_vector_replica_set) which requires just one allocation, and even zero when up to three replicas are used. This change is more nuanced than the corresponding change to the read path `abe3d7d7` ("Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity"), for two reasons: - there is a quadratic algorithm in abstract_write_response_handler::response(): it searches for a replica and erases it. Since this happens for every replica, it happens N^2/2 times. - replica sets for writes always include all datacenters, while reads usually involve just one datacenter. So, a write to a keyspace that has 5 datacenters will invoke 15*(15-1)/2 = 105 compares. We could remove this by sending the index of the replica in the replica set to the replica and ask it to include the index in the response, but I think that this is unnecessary. Those 105 compares need to be only 105/15 = 7 times cheaper than the corresponding unordered_set operation, which they surely will. Handling a response after a cross-datacenter round trip surely involves L3 cache misses, and a small_vector reduces these to a minimum compared to an unordered_set with its bucket table, linked list walking and management, and table rehashing. Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000 --task-quota-ms show two allocations removed (as expected) and a nice reduction in instructions executed. before: median 204842.54 tps ( 54.2 allocs/op, 13.2 tasks/op, 49890 insns/op) after: median 206077.65 tps ( 52.2 allocs/op, 13.2 tasks/op, 49138 insns/op) Closes #8847	2021-06-17 13:46:40 +03:00
Nadav Har'El	b6b4df9a47	heat-weighted load balancing: improve handling of near-perfect cache Consider two nodes with almost-100% cache hit ratio, but not exactly 100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we want to equalize the miss rate in both nodes. So we send the first node twice the number of requests we send to the second. But unless the disks are extremely limited, this doesn't make sense: As a numeric example, consider that we send 2000 requests to the first node and 1000 to the second, just so the number of misses will be the same - 2 (0.1% and 0.2% misses, respectively). At such low miss numbers, the assumption that the disk reads are the slowest part of the operation is wrong, so trying to equalize only this part is wrong. So above some threshold hit rate, we should treat all hit rates as equivalent. In the code we already had such a threshold - max_hit_rate, but it was set to the incredibly high 0.999. We saw in actual user runs (see issue #8815) that this threshold was too high - one node received twice the amount of requests that another did - although both had near-100% cache hit rates. So in this patch we lower the max_hit_rate to 0.95. This will have two consequences: 1. Two nodes with hit rates above 0.95 will be considered to have the same hit rate, so they will get an equal amount of work - even if one has hit rate 0.98 and the other 0.99. 2. A cold node with hit rate 0.0 will get 5% of the work of a node with the perfect hit rate limited to 0.95. This will allow the cold node to slowly warm up its cache. Before this patch, if the hot node happened to have a hit rate of 0.999 (the previous maximum), the cold node would get just 0.1% of the work and remain almost idle and fill its cache extremely slowly - which is a waste. Fixes #8815. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210616180732.125295-1-nyh@scylladb.com>	2021-06-17 11:02:08 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Avi Kivity	c71d007797	consistency_level: deinline assure_sufficient_live_nodes() assure_sufficient_live_nodes() is a huge template calling other huge templates, and requires "network_topology_strategy.hh". It is inlined in consistency_level.hh. This increases compile time and recompiles. Move the template out-of-line and use "extern template" to instantiate it. This is not ideal as new callers would require updates to the instantiated signatures, but I think our goal should be to de-template it completely instead. Meanwhile, this reduces some pain. Ref #1. Closes #8637	2021-05-19 15:03:51 +03:00
Avi Kivity	e9802348b5	storage_proxy, treewide: use utils::small_vector inet_address_vector:s Replace std::vector<inet_address> with a small_vector of size 3 for replica sets (reflecting the common case of local reads, and the somewhat less common case of single-datacenter writes). Vectors used to describe topology changes are of size 1, reflecting that up to one node is usually involved with topology changes. At those counts and below we save an allocation; above those counts everything still works, but small_vector allocates like std::vector. In a few places we need to convert between std::vector and the new types, but these are all out of the hot paths (or are in a hot path, but behind a cache).	2021-05-05 18:36:54 +03:00
Avi Kivity	cea5493cb7	storage_proxy, treewide: introduce names for vectors of inet_address storage_proxy works with vectors of inet_addresses for replica sets and for topology changes (pending endpoints, dead nodes). This patch introduces new names for these (without changing the underlying type - it's still std::vector<gms::inet_address>). This is so that the following patch, that changes those types to utils::small_vector, will be less noisy and highlight the real changes that take place.	2021-05-05 18:36:48 +03:00
Eliran Sinvani	925cdc9ae1	consistency level: fix wrong quorum calculation when RF = 0 We used to calculate the number of endpoints for quorum and local_quorum unconditionally as ((rf / 2) + 1). This formula doesn't take into account the corner case where RF = 0; in this situation quorum should also be 0. This commit adds the missing corner case. Tests: Unit Tests (dev) Fixes #6905 Closes #7296	2020-09-29 13:25:41 +03:00
Konstantin Osipov	18b9bb57ac	lwt: rename metrics to match accepted terminology Rename inherited metrics cas_propose and cas_commit to cas_accept and cas_learn respectively. A while ago we made a decision to stick to widely accepted terms for Paxos rounds: prepare, accept, learn. The rest of the code is using these terms, so rename the metrics to avoid confusion/technical debt. While at it, rename a few internal methods and functions. Fixes #6169 Message-Id: <20200414213537.129547-1-kostja@scylladb.com>	2020-04-15 12:20:30 +02:00
Vladimir Davydov	25aeefd6f3	cql: fix CAS consistency level validation This patch resurrects Cassandra's code validating a consistency level for CAS requests. Basically, it makes CAS requests use a special function instead of validate_for_write to make error messages more coherent. Note, we don't need to resurrect requireNetworkTopologyStrategy as EACH_QUORUM should work just fine for both CAS and non-CAS writes. Looks like it is just an artefact of a rebase in the Cassandra repository.	2019-11-14 12:15:39 +01:00
Konstantin Osipov	e555dc502e	lwt: implement basic lightweight transactions support Support single-statement conditional updates as well as batches. This patch almost fully rewrites column_condition.cc, implementing is_satisfied_by(). Most of the remaining complications in column_condition implementation come from the need to properly handle frozen and multi-cell collections in predicates - up until now it was not possible to compare entire collection values between each other. This is further complicated since multi-cell lists and sets are returned as maps. We can no longer assume that the columns fetched by prefetch operation are non-frozen collections. IF EXISTS/IF NOT EXISTS condition fetches all columns, besides, a column may be needed to check other condition. When fetching the old row for LWT or to apply updates on list/columns, we now calculate precisely the list of columns to fetch. The primary key columns are also included in CAS batch result set, and are thus also prefetched (the user needs them to figure out which statements failed to apply). The patch is cross-checked for compatibility with cassandra-3.11.4-1545-g86812fa502 but does deviate from the origin in handling of conditions on static row cells. This is addressed in future series.	2019-10-27 23:42:49 +03:00
Avi Kivity	4676e07400	consistency_level: simplify validation API Remove unused parameters, replace refcounted pointers by references.	2018-11-27 13:41:49 +02:00
Avi Kivity	2c08bff8d5	Split consistency_level.hh header It has two unrelated users: cql for validation, and storage_proxy for complicated calculations. Split the simple stuff into a new header to reduce dependencies.	2018-11-27 13:32:10 +02:00
Avi Kivity	775b7e41f4	Update seastar submodule * seastar d59fcef...b924495 (2): > build: Fix protobuf generation rules > Merge "Restructure files" from Jesse Includes fixup patch from Jesse: " Update Seastar `#include`s to reflect restructure All Seastar header files are now prefixed with "seastar" and the configure script reflects the new locations of files. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com> "	2018-11-21 00:01:44 +02:00
Avi Kivity	d77e044cde	db: convert sprint() to format() sprint() recently became more strict, throwing on sprint("%s", 5). Replace with the more modern format(). Mechanically converted with https://github.com/avikivity/unsprint.	2018-11-01 13:16:17 +00:00
Botond Dénes	6e59cee244	db::consistency_level::filter_for_query() add preferred_endpoints To the second overload (the one without read-repair related params) too.	2018-09-03 10:31:44 +03:00
Botond Dénes	aaf67bcbaa	Consider preferred replicas when choosing endpoints for query_singular() Propagate the preferred_replicas to db::filter_for_query() and consider them when selecting the endpoints. The algorithm for selecting the endpoints is as follows: * Compute the intersection of the endpoint candidates and the preferred endpoints. * If this yields a set of endpoints that already satisfies the CL requirements use this set. * Otherwise select the remaining endpoints according to the load-balancing strategy, just like before.	2018-03-13 10:34:34 +02:00
Gleb Natapov	357c77a333	consistency_level: constify quorum_for() and local_quorum_for()	2017-12-05 13:01:20 +02:00
Gleb Natapov	87094849fa	storage_proxy: load balance read requests according to cache hit rates This patch makes storage proxy choose replicas to read from based on their cache hit rates. Replicas with higher cache hit rates will see more requests while replicas with lower hit rates will see fewer. Local node has a special bonus and will get more requests even if another node has slightly higher cache hit rate (same goes for local vs remote DC), but after the patch it is no longer guaranteed that a coordinator node will be chosen as a replica for the read (if the feature is enabled).	2017-06-13 09:57:14 +03:00
Gleb Natapov	bc8aa1b4ee	choose extra replica for speculation in filter_for_query() Currently storage proxy has to loop over remaining replicas to search for suitable extra replica, but doing it in filter_for_query() is extremely easy, so do it there instead.	2017-06-13 09:57:14 +03:00
Gleb Natapov	8437ea3b99	consistency_level: drop filter_for_query_dc_local function Merge filter_for_query_dc_local() functionality into filter_for_query(). This is more efficient since filter_for_query_dc_local() partitions endpoints into 'local' and 'remote' set but filter_for_query() already does it for CL=LOCAL so for such queries we needlessly do it twice.	2017-06-13 09:57:14 +03:00
Gleb Natapov	9cc076c9f3	storage_proxy: preserve endpoint's order while filtering local nodes for query filter_for_query() gets a list of endpoints sorted by preference and should preserve that order after filtering out non-local endpoints for a local query. partition() does not guarantee this while stable_partition() does, so use it instead. Fixes #1450. Message-Id: <20160713100909.GM10767@scylladb.com>	2016-07-13 13:17:28 +03:00
Gleb Natapov	dfdbb1e703	storage_proxy: move hack to make coordinator most preferable node for read into sorting function This is a kind of sorting, so it belongs there, but it also fixes a bug in storage_proxy::get_read_executor() that assumes filter_for_query() does not change the order of nodes in all_nodes when an extra replica is chosen. Otherwise if the coordinator ip happens to be last in all_nodes then it will be chosen as the extra replica and will be queried twice. Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>	2016-04-14 14:56:21 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Gleb Natapov	f59415b3c6	Take pending endpoints into account while checking for sufficient live nodes During bootstrapping additional copies of data have to be made to ensure that the CL is met (see CASSANDRA-833 for details). Our code does that, but it does not take into account that a bootstrapping node can be dead, which may cause a request to proceed even though there are not enough live nodes for it to be completed. In such a case the request neither completes nor times out, so it appears to be stuck from the CQL layer's POV. The patch fixes this by taking pending nodes into account while checking that there are sufficient live nodes for the operation to proceed. Fixes #965 Message-Id: <20160303165250.GG2253@scylladb.com>	2016-03-07 13:30:13 +01:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Gleb Natapov	17e54d0604	add logger for consistency level calculation	2015-09-13 11:59:17 +03:00
Gleb Natapov	04d2bef55b	give preference to local data during query Until dynamic snitch is implemented this is better than nothing. Fixes #322	2015-09-10 15:45:20 +03:00
Pekka Enberg	7fc1311d4a	db/consistency_level: Move implementation to .cc file Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-07-28 10:06:18 +03:00
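
Commit 925cdc9ae1 above describes the quorum formula ((rf / 2) + 1) and its missing RF = 0 corner case. The following is only a minimal sketch of that rule, written in Python for brevity; it is not the ScyllaDB source. The name quorum_for comes from commit 357c77a333, but the signature and the negative-input check are assumptions made for illustration.

```python
def quorum_for(replication_factor: int) -> int:
    """Replicas needed for a quorum, following the rule in commit 925cdc9ae1.

    Illustrative sketch only: the signature and the error handling are
    assumptions, not the actual ScyllaDB function.
    """
    if replication_factor < 0:
        raise ValueError("replication factor must be non-negative")
    # Corner case from the commit message: with RF = 0 there are no
    # replicas, so the quorum must be 0 rather than (0 / 2) + 1 = 1.
    if replication_factor == 0:
        return 0
    # General case: integer division, matching (rf / 2) + 1.
    return replication_factor // 2 + 1
```

Without the special case, a keyspace with RF = 0 would demand one live replica that cannot exist, so a quorum request could never be satisfied.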

34 Commits
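
The arithmetic in commit b6b4df9a47 (capping hit rates at max_hit_rate before equalizing misses) can be illustrated with a simplified model. This sketch assumes each node's share of requests is proportional to 1 / (1 - hit_rate), which reproduces the ratios quoted in the commit message; the function name request_weights is hypothetical, and the real implementation (which also favors the local node, per commit 87094849fa) is not reproduced here.

```python
MAX_HIT_RATE = 0.95  # threshold chosen in commit b6b4df9a47 (was 0.999)


def request_weights(hit_rates, max_hit_rate=MAX_HIT_RATE):
    """Return each node's fraction of requests in a simplified HWLB model.

    Hypothetical helper: hit rates are clamped to max_hit_rate, then each
    node is weighted by the inverse of its miss rate so that the expected
    number of cache misses is equal across nodes.
    """
    if not 0.0 <= max_hit_rate < 1.0:
        raise ValueError("max_hit_rate must be in [0, 1)")
    # Treat every hit rate above the threshold as equivalent.
    clamped = [min(rate, max_hit_rate) for rate in hit_rates]
    # Equal misses per node means load proportional to 1 / miss_rate.
    inverse_miss = [1.0 / (1.0 - rate) for rate in clamped]
    total = sum(inverse_miss)
    return [weight / total for weight in inverse_miss]
```

Under this model, two nodes at 0.98 and 0.99 receive equal shares, and a cold node at 0.0 receives 5% of the work given to a fully warm node, as the commit message states.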