scylladb

Author	SHA1	Message	Date
Avi Kivity	7129ddfa67	build: disable warnings that cause false-positive errors with gcc 12 gcc 12 generates some incorrect warnings (that we treat as errors). Silence them so we can build.	2022-04-18 12:27:18 +03:00
Mikołaj Sielużycki	b16e12f3a1	repair: Add unit test for flushing repair_rows_on_wire to disk. The unit test executes a simplified repair scenario by: - producing a random stream of mutation_fragments, - converting them to repair_rows_on_wire, - converting them to a list of repair_rows using the conversion logic extracted in previous commits from repair_meta, - flushing the rows to an sstable using the logic extracted in previous commits from repair_meta, - comparing the sstable contents with the originally produced mutation fragments. The test checks only the flushing part and is not concerned with any other piece of the repair pipeline.	2022-04-12 09:22:10 +02:00
Michael Livshin	da7c7fd3dc	delete code of the unused normalizing_reader class Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20220406161107.2376568-3-michael.livshin@scylladb.com>	2022-04-07 09:29:41 +03:00
Botond Dénes	c9e30b9a6c	tree: remove now empty mutation_reader.{hh,cc}	2022-03-30 15:42:51 +03:00
Botond Dénes	d0ea895671	readers: move multishard reader & friends to reader/multishard.cc Since the multishard reader family weighs more than 1K SLOC, it gets its own .cc file.	2022-03-30 15:42:51 +03:00
Botond Dénes	f8015d9c26	readers: move combined reader into readers/ Since the combined reader family weighs more than 1K SLOC, it gets its own .cc file.	2022-03-30 15:42:51 +03:00
Calle Wilund	56c383ba8e	test/perf/perf_commitlog: Add a small commitlog throughput test Based on perf_simple_query, just bashes data into CL using normal distribution min/max data chunk size, allowing direct freeing of segments, _but_ delayed by a normal dist as well, to "simulate" secondary delay in data persistence. Needs more stuff. Some baseline measurements on master: --min-flush-delay-in-ms 10 --max-flush-delay-in-ms 200 --commitlog-use-hard-size-limit true --commitlog-total-space-in-mb 10000 --min-data-size 160 --max-data-size 1024 --smp1 median 2065648.59 tps ( 1.1 allocs/op, 0.0 tasks/op, 1482 insns/op) median absolute deviation: 48752.44 maximum: 2161987.06 minimum: 1984267.90 --min-data-size 256 --max-data-size 16384 median 269385.25 tps ( 2.2 allocs/op, 0.7 tasks/op, 3244 insns/op) median absolute deviation: 15719.13 maximum: 323574.43 minimum: 228206.28 --min-data-size 4096 --max-data-size 61440 median 67734.22 tps ( 6.4 allocs/op, 2.9 tasks/op, 9153 insns/op) median absolute deviation: 2070.93 maximum: 82833.17 minimum: 61473.57 --min-data-size 61440 --max-data-size 1843200 median 2281.37 tps ( 79.7 allocs/op, 43.5 tasks/op, 202963 insns/op) median absolute deviation: 128.87 maximum: 3143.84 minimum: 2140.80 --min-data-size 368640 --max-data-size 6144000 median 679.76 tps (225.5 allocs/op, 116.3 tasks/op, 662700 insns/op) median absolute deviation: 39.30 maximum: 1148.95 minimum: 586.86 Actual throughput is obviously meaningless, as it was run on my slow machine, but IPS might be relevant. Note that transaction throughput plummets as we increase median data sizes above ~200k, since we then more or less always end up replacing buffers in every call. Closes #10230	2022-03-22 15:18:25 +02:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which was not strictly necessary. Splitting improves iterative compilation times, as touching rarely used readers no longer recompiles large chunks of the codebase. Total compilation times are also improved, as the sizes of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes real 29m14.051s user 168m39.071s sys 5m13.443s Without changes real 30m36.203s user 175m43.354s sys 5m26.376s Closes #10194	2022-03-14 13:20:25 +02:00
Benny Halevy	ebbbf1e687	lister: move to utils There's nothing specific to scylla in the lister classes, they could (and maybe should) be part of the seastar library. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 12:36:03 +02:00
Avi Kivity	cbba80914d	memtable: move to replica module and namespace Memtables are a replica-side entity, and so are moved to the replica module and namespace. Memtables are also used outside the replica, in two places: - in some virtual tables; this is also in some way inside the replica, (virtual readers are installed at the replica level, not the coordinator), so I don't consider it a layering violation - in many sstable unit tests, as a convenient way to create sstables with known input. This is a layering violation. We could make memtables their own module, but I think this is wrong. Memtables are deeply tied into replica memory management, and trying to make them a low-level primitive (at a lower level than sstables) will be difficult. Not least because memtables use sstables. Instead, we should have a memtable-like thing that doesn't support merging and doesn't have all other funky memtable stuff, and instead replace the uses of memtables in sstable tests with some kind of make_flat_mutation_reader_from_unsorted_mutations() that does the sorting that is the reason for the use of memtables in tests (and live with the layering violation meanwhile). Test: unit (dev) Closes #10120	2022-02-23 09:05:16 +02:00
Botond Dénes	3aa05f7f03	Merge "Make system.clients table virtual" from Pavel Emelyanov " The table lists connected clients. For this the clients are stored in real table when they connect, update their statuses when needed and remove^w tombstone themselves when they disconnect. On start the whole table is cleared. This looks weird. Here's another approach (inspired by the hackathon project) that makes this table a pure virtual one. The schema is preserved so is the data returned. The benefits of doing it virtual are - no on-disk updates while processing clients - no potentially failing updates on non-failing disconnect - less usage of the global qctx thing - less calls to global storage_proxy - simpler support for thrift and alternator clients (today's table implementation doesn't track them) - the need to make virtual tables reg/unreg dynamic branch: https://github.com/xemul/scylla/tree/br-clients-virtual-table-4 tests: manual(dev), unit(dev) The manual test used 80-shards node and 1M connections from 1k different IP addresses. " * 'br-clients-virtual-table-4' of https://github.com/xemul/scylla: test: Add cql-pytest sanity test for system.clients table client_data: Sanitize connection_notifier transport: Indentation fix after previous patch code: Remove old on-disk version of system.clients table system_keyspace: Add clients_v virtual table protocol_server: Add get_client_data call transport: Track client state for real transport: Add stringifiers to client_data class generic_server: Gentle iterator generic_server: Type alias docs: Add system.clients description	2022-02-22 20:58:25 +03:00
Jan Ciolek	46367eec55	cql3: expr: Add tests for expr::visit Add tests for new expr::visit to ensure that it is working correctly. expr::visit had a hidden bug where trying to return a reference actually returned a reference to a freed location on the stack, so now there are tests to ensure that everything works. Sadly the test `expr_visit_const_ref` also passes before the fix, but at least expr_visit_ref doesn't compile before the fix. It would be better to test this by taking references returned by std::visit and expr::visit and checking that they point to the same address in memory, but I can't do this because I would have to access a private field of expression. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2022-02-18 14:16:55 +01:00
Pavel Emelyanov	de6c60c1c9	client_data: Sanitize connection_notifier Now the connection_notifier is all gone, only the client_data bits are left. To keep it consistent -- rename the files. Also, while at it, brush up the header dependencies and remove the not really used constexprs for client states. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 15:02:26 +03:00
Piotr Dulikowski	11cb670881	utils: add result utils Adds a number of utilities for working with boost::outcome::result combined with exception_container. The utilities are meant to help with migration of the existing code to use the boost::outcome::result: - `exception_container_throw_policy` - a NoValuePolicy meant to be used as a template parameter for the boost::outcome::result. It protects the caller of `result::value()` and `result::error()` methods - if the caller wishes to get a value but the result has an error (exception_container in our case), the exception in the container will be thrown instead. In case it's the other way around, boost::outcome::bad_result_access is thrown. - `result_parallel_for_each` - a version of `parallel_for_each` which is aware of results and returns a failed result in case any of the parallel invocations return a failed result. - `result_into_future` - converts a result into a future. If the result holds a value, converts it into make_ready_future; if it holds an exception, the exception is returned as make_exception_future. - `then_ok_result` takes a `future<T>` and converts it into a `future<result<T>>`. - `result_wrap` adapts a callable of type `T -> future<result<T>>` and returns a callable of type `result<T> -> future<result<T>>`.	2022-02-08 11:08:42 +01:00
Piotr Dulikowski	80f6224959	utils: add exception_container Adds `exception_container` - a helper type used to hold exceptions as a value, without involving the std::exception_ptr. The motivation behind this type is that it allows inspecting exception's type and value without having to rethrow that exception and catch it, unlike std::exception_ptr. In our current codebase, some exception handling paths need to rethrow the exception multiple times in order to account it into metrics or encode it as an error response to the CQL client. Some types of exceptions can be thrown very frequently in case of overload (e.g. timeouts) and inspecting those exceptions with rethrows can make the overload even worse. For those kinds of exceptions it is important to handle them as cheaply as possible, and exception_container used in conjunction with boost::outcome::result can help achieve that.	2022-02-04 20:18:00 +01:00
Avi Kivity	fe65122ccd	Merge 'Distribute `select count()` queries' from Michał Sala This pull request speeds up execution of `count()` queries. It does so by splitting given query into sub-queries and distributing them across some group of nodes for parallel execution. New level of coordination was added. Node called super-coordinator splits aggregation query into sub-queries and distributes them across some group of coordinators. Super-coordinator is also responsible for merging results. To develop a mechanism for speeding up `count()` queries, there was a need to detect which queries have a `count()` selector. Due to this pull request being a proof of concept, detection was realized rather poorly. It only allows catching the simplest cases of `count()` queries (with only one selector and no column name specified). After detecting that a query is a `count()` it should be split into sub-queries and sent to other coordinators. Splitting part wasn't that difficult, it has been achieved by limiting original query's partition ranges. Sending modified query to another node was much harder. The easiest scenario would be to send whole `cql3::statements::select_statement`. Unfortunately `cql3::statements::select_statement` can't be [de]serialized, so sending it was out of the question. Even more unfortunately, some non-[de]serializable members of `cql3::statements::select_statement` are required to start the execution process of this statement. Finally, I have decided to send a `query::read_command` paired with required [de]serializable members. Objects that cannot be [de]serialized (such as query's selector) are mocked on the receiving end. When a super-coordinator receives a `count()` query, it splits it into sub-queries. It does so by splitting original query's partition ranges into a list of vnodes, grouping them by their owner and creating sub-queries with partition ranges set to successive results of such grouping. 
After creation, each sub-query is sent to the owner of its partition ranges. Owner dispatches received sub-query to all of its shards. Shards slice partition ranges of the received sub-query, so that they will only query data that is owned by them. Each shard becomes a coordinator and executes so prepared sub-query. 3 node cluster set up on powerful desktops located in the office (3x32 cores) Filled the cluster with ~2 10^8 rows using scylla-bench and run: ``` time cqlsh <ip> <port> --request-timeout=3600 -e "select count() from scylla_bench.test using timeout 1h;" ``` master: 68s * this branch: 2s 3 node cluster (each node had 2 shards, `murmur3_ignore_msb_bits` was set to 1, `num_tokens` was set to 3) ``` > cqlsh -e 'tracing on; select count() from ks.t; Now Tracing is enabled count ------- 1000 (1 rows) Tracing session: e5852020-7fc3-11ec-8600-4c4c210dd657 activity \| timestamp \| source \| source_elapsed \| client ---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2022-01-27 22:53:08.770000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 1] \| 2022-01-27 22:53:08.770451 \| 127.0.0.1 \| -- \| 127.0.0.1 Processing a statement [shard 1] \| 2022-01-27 22:53:08.770487 \| 127.0.0.1 \| 36 \| 127.0.0.1 Dispatching forward_request to 3 endpoints [shard 1] \| 2022-01-27 22:53:08.770509 \| 127.0.0.1 \| 58 \| 127.0.0.1 Sending forward_request to 127.0.0.1:0 [shard 1] \| 2022-01-27 22:53:08.770516 \| 127.0.0.1 \| 64 \| 127.0.0.1 Executing forward_request [shard 1] \| 2022-01-27 22:53:08.770519 \| 127.0.0.1 \| -- \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770528 \| 127.0.0.1 \| 9 \| 127.0.0.1 Start querying token range ({-4242912715832118944, end}, {-4075408479358018994, end}] [shard 1] \| 2022-01-27 22:53:08.770531 \| 127.0.0.1 \| 12 \| 127.0.0.1 
Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770537 \| 127.0.0.1 \| 18 \| 127.0.0.1 Scanning cache for range ({-4242912715832118944, end}, {-4075408479358018994, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770541 \| 127.0.0.1 \| 22 \| 127.0.0.1 Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.770589 \| 127.0.0.1 \| 70 \| 127.0.0.1 Sending forward_request to 127.0.0.2:0 [shard 1] \| 2022-01-27 22:53:08.770600 \| 127.0.0.1 \| 149 \| 127.0.0.1 Sending forward_request to 127.0.0.3:0 [shard 1] \| 2022-01-27 22:53:08.770608 \| 127.0.0.1 \| 157 \| 127.0.0.1 Executing forward_request [shard 0] \| 2022-01-27 22:53:08.770627 \| 127.0.0.1 \| -- \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.770639 \| 127.0.0.1 \| 11 \| 127.0.0.1 Start querying token range ({2507462623645193091, end}, {3897266736829642805, end}] [shard 0] \| 2022-01-27 22:53:08.770643 \| 127.0.0.1 \| 15 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.770646 \| 127.0.0.1 \| 19 \| 127.0.0.1 Scanning cache for range ({2507462623645193091, end}, {3897266736829642805, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.770649 \| 127.0.0.1 \| 22 \| 127.0.0.1 Executing forward_request [shard 1] \| 2022-01-27 22:53:08.770658 \| 127.0.0.2 \| -- \| 127.0.0.1 Executing forward_request [shard 1] \| 2022-01-27 22:53:08.770674 \| 127.0.0.3 \| 5 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770698 \| 127.0.0.2 \| 40 \| 127.0.0.1 Start querying token range [{4611686018427387904, start}, {5592106830937975806, end}] [shard 1] \| 2022-01-27 22:53:08.770704 \| 127.0.0.2 \| 46 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770710 \| 127.0.0.2 \| 52 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770712 \| 127.0.0.3 \| 43 \| 
127.0.0.1 Scanning cache for range [{4611686018427387904, start}, {5592106830937975806, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770714 \| 127.0.0.2 \| 56 \| 127.0.0.1 Start querying token range [{-4611686018427387904, start}, {-4242912715832118944, end}] [shard 1] \| 2022-01-27 22:53:08.770718 \| 127.0.0.3 \| 49 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770739 \| 127.0.0.3 \| 70 \| 127.0.0.1 Scanning cache for range [{-4611686018427387904, start}, {-4242912715832118944, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770743 \| 127.0.0.3 \| 73 \| 127.0.0.1 Page stats: 17 partition(s), 0 static row(s) (0 live, 0 dead), 17 clustering row(s) (17 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.770814 \| 127.0.0.3 \| 145 \| 127.0.0.1 Executing forward_request [shard 0] \| 2022-01-27 22:53:08.770846 \| 127.0.0.3 \| -- \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.770862 \| 127.0.0.3 \| 16 \| 127.0.0.1 Page stats: 71 partition(s), 0 static row(s) (0 live, 0 dead), 71 clustering row(s) (71 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.770865 \| 127.0.0.1 \| 238 \| 127.0.0.1 Start querying token range ({-6683686776653114062, end}, {-6473446911791631266, end}] [shard 0] \| 2022-01-27 22:53:08.770867 \| 127.0.0.3 \| 21 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.770874 \| 127.0.0.3 \| 28 \| 127.0.0.1 Scanning cache for range ({-6683686776653114062, end}, {-6473446911791631266, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.770879 \| 127.0.0.3 \| 33 \| 127.0.0.1 Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.770880 \| 127.0.0.2 \| 222 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.770888 \| 127.0.0.1 \| 369 \| 127.0.0.1 read_data: querying 
locally [shard 1] \| 2022-01-27 22:53:08.770909 \| 127.0.0.1 \| 390 \| 127.0.0.1 Start querying token range ({-4075408479358018994, end}, {-3391415989210253693, end}] [shard 1] \| 2022-01-27 22:53:08.770911 \| 127.0.0.1 \| 392 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.770914 \| 127.0.0.1 \| 395 \| 127.0.0.1 Scanning cache for range ({-4075408479358018994, end}, {-3391415989210253693, end}] and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.770936 \| 127.0.0.1 \| 418 \| 127.0.0.1 Executing forward_request [shard 0] \| 2022-01-27 22:53:08.770951 \| 127.0.0.2 \| -- \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.770966 \| 127.0.0.2 \| 15 \| 127.0.0.1 Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.770969 \| 127.0.0.3 \| 123 \| 127.0.0.1 Start querying token range (-inf, {-6683686776653114062, end}] [shard 0] \| 2022-01-27 22:53:08.770969 \| 127.0.0.2 \| 18 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.770974 \| 127.0.0.2 \| 23 \| 127.0.0.1 Scanning cache for range (-inf, {-6683686776653114062, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.770977 \| 127.0.0.2 \| 26 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.770993 \| 127.0.0.3 \| 324 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.770998 \| 127.0.0.3 \| 329 \| 127.0.0.1 Start querying token range ({-3391415989210253693, end}, {0, start}) [shard 1] \| 2022-01-27 22:53:08.771001 \| 127.0.0.3 \| 332 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.771004 \| 127.0.0.3 \| 335 \| 127.0.0.1 Scanning cache for range ({-3391415989210253693, end}, {0, start}) and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.771007 \| 127.0.0.3 \| 338 \| 127.0.0.1 Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 
clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.771044 \| 127.0.0.1 \| 525 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.771069 \| 127.0.0.1 \| 442 \| 127.0.0.1 On shard execution result is [71] [shard 0] \| 2022-01-27 22:53:08.771145 \| 127.0.0.1 \| 518 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.771308 \| 127.0.0.1 \| 789 \| 127.0.0.1 On shard execution result is [60] [shard 1] \| 2022-01-27 22:53:08.771351 \| 127.0.0.1 \| 832 \| 127.0.0.1 Page stats: 127 partition(s), 0 static row(s) (0 live, 0 dead), 127 clustering row(s) (127 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.771379 \| 127.0.0.2 \| 427 \| 127.0.0.1 Page stats: 183 partition(s), 0 static row(s) (0 live, 0 dead), 183 clustering row(s) (183 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.771385 \| 127.0.0.3 \| 716 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.771402 \| 127.0.0.3 \| 556 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.771403 \| 127.0.0.2 \| 745 \| 127.0.0.1 read_data: querying locally [shard 1] \| 2022-01-27 22:53:08.771408 \| 127.0.0.2 \| 750 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.771409 \| 127.0.0.3 \| 563 \| 127.0.0.1 Start querying token range ({5592106830937975806, end}, +inf) [shard 1] \| 2022-01-27 22:53:08.771411 \| 127.0.0.2 \| 754 \| 127.0.0.1 Start querying token range ({-6272011798787969456, end}, {-4611686018427387904, start}) [shard 0] \| 2022-01-27 22:53:08.771412 \| 127.0.0.3 \| 566 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.771415 \| 127.0.0.3 \| 569 \| 127.0.0.1 Creating shard reader on shard: 1 [shard 1] \| 2022-01-27 22:53:08.771415 \| 127.0.0.2 \| 757 \| 127.0.0.1 Scanning cache for range ({5592106830937975806, end}, +inf) and slice {(-inf, +inf)} [shard 1] \| 2022-01-27 22:53:08.771419 \| 127.0.0.2 \| 761 \| 127.0.0.1 Scanning cache 
for range ({-6272011798787969456, end}, {-4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.771419 \| 127.0.0.3 \| 573 \| 127.0.0.1 Received forward_result=[131] from 127.0.0.1:0 [shard 1] \| 2022-01-27 22:53:08.771454 \| 127.0.0.1 \| 1003 \| 127.0.0.1 Page stats: 74 partition(s), 0 static row(s) (0 live, 0 dead), 74 clustering row(s) (74 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.771764 \| 127.0.0.3 \| 918 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.771768 \| 127.0.0.3 \| 922 \| 127.0.0.1 Start querying token range [{0, start}, {2507462623645193091, end}] [shard 0] \| 2022-01-27 22:53:08.771771 \| 127.0.0.3 \| 925 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.771775 \| 127.0.0.3 \| 929 \| 127.0.0.1 Scanning cache for range [{0, start}, {2507462623645193091, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.771779 \| 127.0.0.3 \| 933 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.771935 \| 127.0.0.3 \| 1265 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.771950 \| 127.0.0.2 \| 998 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 22:53:08.771956 \| 127.0.0.2 \| 1004 \| 127.0.0.1 Start querying token range ({-6473446911791631266, end}, {-6272011798787969456, end}] [shard 0] \| 2022-01-27 22:53:08.771959 \| 127.0.0.2 \| 1008 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.771963 \| 127.0.0.2 \| 1011 \| 127.0.0.1 Scanning cache for range ({-6473446911791631266, end}, {-6272011798787969456, end}] and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.771966 \| 127.0.0.2 \| 1014 \| 127.0.0.1 Page stats: 13 partition(s), 0 static row(s) (0 live, 0 dead), 13 clustering row(s) (13 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.772008 \| 127.0.0.2 \| 1057 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2022-01-27 
22:53:08.772012 \| 127.0.0.2 \| 1061 \| 127.0.0.1 Start querying token range ({3897266736829642805, end}, {4611686018427387904, start}) [shard 0] \| 2022-01-27 22:53:08.772014 \| 127.0.0.2 \| 1063 \| 127.0.0.1 Creating shard reader on shard: 0 [shard 0] \| 2022-01-27 22:53:08.772016 \| 127.0.0.2 \| 1065 \| 127.0.0.1 Scanning cache for range ({3897266736829642805, end}, {4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] \| 2022-01-27 22:53:08.772019 \| 127.0.0.2 \| 1067 \| 127.0.0.1 On shard execution result is [200] [shard 1] \| 2022-01-27 22:53:08.772053 \| 127.0.0.3 \| 1384 \| 127.0.0.1 Page stats: 56 partition(s), 0 static row(s) (0 live, 0 dead), 56 clustering row(s) (56 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.772138 \| 127.0.0.2 \| 1186 \| 127.0.0.1 Page stats: 190 partition(s), 0 static row(s) (0 live, 0 dead), 190 clustering row(s) (190 live, 0 dead) and 0 range tombstone(s) [shard 1] \| 2022-01-27 22:53:08.772364 \| 127.0.0.2 \| 1706 \| 127.0.0.1 Page stats: 149 partition(s), 0 static row(s) (0 live, 0 dead), 149 clustering row(s) (149 live, 0 dead) and 0 range tombstone(s) [shard 0] \| 2022-01-27 22:53:08.772407 \| 127.0.0.3 \| 1561 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772417 \| 127.0.0.3 \| 1571 \| 127.0.0.1 Querying is done [shard 1] \| 2022-01-27 22:53:08.772418 \| 127.0.0.2 \| 1760 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772426 \| 127.0.0.2 \| 1475 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772428 \| 127.0.0.2 \| 1476 \| 127.0.0.1 Querying is done [shard 0] \| 2022-01-27 22:53:08.772449 \| 127.0.0.3 \| 1604 \| 127.0.0.1 On shard execution result is [196] [shard 0] \| 2022-01-27 22:53:08.772555 \| 127.0.0.2 \| 1603 \| 127.0.0.1 On shard execution result is [238] [shard 1] \| 2022-01-27 22:53:08.772674 \| 127.0.0.2 \| 2016 \| 127.0.0.1 On shard execution result is [235] [shard 0] \| 2022-01-27 22:53:08.772770 \| 127.0.0.3 \| 1924 \| 
127.0.0.1 Received forward_result=[435] from 127.0.0.3:0 [shard 1] \| 2022-01-27 22:53:08.772933 \| 127.0.0.1 \| 2482 \| 127.0.0.1 Received forward_result=[434] from 127.0.0.2:0 [shard 1] \| 2022-01-27 22:53:08.773110 \| 127.0.0.1 \| 2658 \| 127.0.0.1 Merged result is [1000] [shard 1] \| 2022-01-27 22:53:08.773111 \| 127.0.0.1 \| 2660 \| 127.0.0.1 Done processing - preparing a result [shard 1] \| 2022-01-27 22:53:08.773114 \| 127.0.0.1 \| 2663 \| 127.0.0.1 Request complete \| 2022-01-27 22:53:08.772666 \| 127.0.0.1 \| 2666 \| 127.0.0.1 ``` Fixes #1385 Closes #9209 github.com:scylladb/scylla: docs: add parallel aggregations design doc db: config: add a flag to disable new parallelized aggregation algorithm test: add parallelized select count test forward_service: add metrics forward_service: parallelize execution across shards forward_service: add tracing cql3: statements: introduce parallelized_select_statement cql3: query_processor: add forward_service reference to query_processor gms: add PARALLELIZED_AGGREGATION feature service: introduce forward_service storage_proxy: extract query_ranges_to_vnodes_generator to a separate file messaging_service: add verb for count() request forwarding cql3: selection: detect if a selection represents count()	2022-02-04 12:34:19 +02:00
Nadav Har'El	87e48d61a7	build: rebuild relocatable packages if version changed In commit `d72465531e` we fixed the building of relocatable packages of submodules (tools/java, etc.) to use the top-level Scylla's version. However, if on an active working directory Scylla's version changes - as we just did from 4.7 to 5.0 - these relocatable packages are not rebuilt with the new version number, and as a result some of our scripts (such as the docker build) can't find them. Because the build-submodule-reloc rule depends on the files build/SCYLLA-{PRODUCT,VERSION,RELEASE}-FILE (which is what the aforementioned commit did), in this patch we add those files as a dependency whenever build-submodule-reloc is used. This means that if any of these files change, we rebuild the relocatable packages and anything depending on them (e.g., Debian packages). Fixes #10018. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220202131248.1610678-1-nyh@scylladb.com>	2022-02-03 10:19:15 +02:00
Michał Sala	a6cf3f52bd	service: introduce forward_service The new service is responsible for: * spreading forward_request execution across multiple nodes in cluster * collecting forward_request execution results and merging them `forward_service::dispatch` method takes forward_request as an argument, and forwards its execution to a group of other nodes (using rpc verb added in previous commits). Each node (in the group chosen by dispatch method) is provided with forward_request, which is no different from the original argument except for changed partition ranges. They are changed so that vnodes contained in them are owned by recipient node. Executing forward_request is realized in `forward_service::execute` method, that is registered to be called on FORWARD_REQUEST verb receipt. Process of executing forward_request consists of mocking a few non-serializable objects (such as `cql3::selection`) in order to create `service::pager::query_pagers::pager` and `cql3::selection::result_set_builder`. After pager and result_set_builder creation, execution process resembles what might be seen in select_statement's execution path.	2022-02-01 21:14:41 +01:00
Michał Sala	0fe59082ec	storage_proxy: extract query_ranges_to_vnodes_generator to a separate file Such separation allows using query_ranges_to_vnodes_generator by other services without needing a storage_proxy dependency.	2022-02-01 21:14:41 +01:00
Michał Sala	fff454761a	messaging_service: add verb for count() request forwarding Except for the verb addition, this commit also defines forward_request and forward_result structures, used as an argument and result of the new rpc. forward_request is used to forward information about select statement that does count() (or other aggregating functions such as max, min, avg in the future). Due to the inability to serialize cql3::statements::select_statement, I chose to include query::read_command, dht::partition_range_vector and some configuration options in forward_request. They can be serialized and are sufficient enough to allow creation of service::pager::query_pagers::pager.	2022-02-01 21:14:41 +01:00
Kamil Braun	b863a63b08	test: unit test for clearing old entries in group0 history We perform a bunch of schema changes with different values of `migration_manager::_group0_history_gc_duration` and check if entries are cleared according to this setting.	2022-01-25 13:13:35 +01:00
Kamil Braun	509ac2130f	service: raft: group0_state_machine: introduce `group0_command` Objects of this type will be serialized and sent as commands to the group 0 state machine. They contain a set of mutations which modify group 0 tables (at this point: schema tables and group 0 history table), the 'previous state ID' which is the last state ID present in the history table when the operation described by this command has started, and the 'new state ID' which will be appended to the history table if this change is successful (successful = the previous state ID is still equal to the last state ID in the history table at the moment of application). It also contains the address of the node which constructed this command. The state ID mechanism will be described in more detail in a later commit.	2022-01-24 15:20:37 +01:00
Kamil Braun	538cc6ecb9	service: raft: rename `schema_raft_state_machine` to `group0_state_machine` Generalize the name so it doesn't suggest that group 0 contains only schema state.	2022-01-24 15:12:50 +01:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Gleb Natapov	8a25b740df	raft: split idl to rpc and storage Storage uses only a small part of the IDL, so it can include only the part that is relevant to it.	2022-01-13 13:14:46 +02:00
Avi Kivity	0e5d196499	Merge "move storage proxy verbs to the IDL" from Gleb * 'gleb/sp-idl-v1' of github.com:scylladb/scylla-dev: storage_proxy: move all verbs to the IDL idl-compiler: allow const references in send() parameter list idl-compiler: support smart pointers in verb's return value idl-compiler: support multiple return value and optional in a return value idl-compiler: handle :: at the beginning of a type idl-compiler: sending one way message without timeout does not require ret value specialization as well storage_proxy: convert more address vectors to inet_address_vector_replica_set	2022-01-12 12:34:18 +02:00
Nadav Har'El	7a9f69ec38	Merge 'lister cleanup and test' from Benny Halevy Split off of #9835. The series removes extraneous includes of lister.hh from header files and adds a unit test for lister::scan_dir to test throwing an exception from the walker function passed to `scan_dir`. Test: unit(dev) Closes #9885 * github.com:scylladb/scylla: test: add lister_list lister: add more overloads of fs::path operator/ for std::string and string_view resource_manager: remove unnecessary include of lister.hh from header file sstables: sstable_directory: remove unnecessary include of lister.hh from header file	2022-01-12 08:20:07 +01:00
Nadav Har'El	c5f29fe3ea	configure.py: don't use deprecated mktemp() configure.py uses the deprecated Python function tempfile.mktemp(). Because this function is labeled a "security risk" it is also a magnet for automated security scanners... So let's replace it with the recommended tempfile.mkstemp() and avoid future complaints. The actual security implications of this mktemp() call are negligible to non-existent: First, it's just the build process (configure.py), not the build product itself. Second, the worst that an attacker (who needs to run on the build machine!) can do is to cause a compilation test in configure.py to fail because it can't write to its output file. Reported by @srikanthprathi Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220111121924.615173-1-nyh@scylladb.com>	2022-01-11 17:06:14 +02:00
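The replacement this commit describes can be illustrated with a short sketch. It does not reproduce the actual configure.py change; the helper name `write_probe_source` and the `.cc` suffix are assumptions made for the example. Unlike `tempfile.mktemp()`, which only returns a name, `tempfile.mkstemp()` creates the file and returns an open descriptor together with the path, so no other process can claim the name in between.

```python
import os
import tempfile


def write_probe_source(source: str) -> str:
    """Write `source` to a freshly created temporary file and return its path.

    Hypothetical helper for illustration; the caller is responsible for
    deleting the file when done.
    """
    # mkstemp() creates the file atomically and returns (fd, path).
    fd, path = tempfile.mkstemp(suffix=".cc")
    # Wrap the descriptor so it is closed when the block exits.
    with os.fdopen(fd, "w") as f:
        f.write(source)
    return path
```

The caller owns the returned file and should remove it with `os.remove(path)` once the compilation test has finished.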
Benny Halevy	1e6829e9f1	test: add lister_list Test the lister class. In particular the ability to abort the lister when the walker function throws an exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-01-11 17:04:16 +02:00
Avi Kivity	4392c20bd3	replica: move distributed_loader into replica module distributed_loader is a replica-side thing, so it belongs in the replica module ("distributed" refers to its ability to load sstables in their correct shards). So move it to the replica module.	2022-01-10 15:25:28 +02:00
Gleb Natapov	1db151bd75	storage_proxy: move all verbs to the IDL Define all verbs in the IDL instead of manually coding them.	2022-01-10 14:58:28 +02:00
Botond Dénes	0f60cc84f4	Merge 'replica: create a replica module' from Avi Kivity Move the ::database, ::keyspace, and ::table classes to a new replica namespace and replica/ directory. This designates objects that only have meaning on a replica and should not be used on a coordinator (but note that not all replica-only classes should be in this module, for example compaction and sstables are lower-level objects that deserve their own modules). The module is imperfect - some additional classes like distributed_loader should also be moved, but there is only one way to untie Gordian knots. Closes #9872 * github.com:scylladb/scylla: replica: move ::database, ::keyspace, and ::table to replica namespace database: Move database, keyspace, table classes to replica/ directory	2022-01-07 13:37:40 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Avi Kivity	b850b34bcc	build: reduce inline threshold on aarch64 to 300 We see coroutine miscompiles with 600. Fixes #9881. Closes #9883	2022-01-06 15:13:27 +02:00
Botond Dénes	015d09a926	tools: utils: add configure_tool_mode() This configures seastar to act more appropriately for a tool app, i.e. not act as if it owns the place, taking over all system resources. These tools are often run on a developer machine, or even next to a running scylla instance, so we want them to be as unintrusive as possible. Also use the new tool mode in the existing tools. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20211220143104.132327-1-bdenes@scylladb.com>	2022-01-05 15:33:57 +02:00
Nadav Har'El	dcc42d3815	configure.py: re-run configure.py if the build/ directory is gone When you run "configure.py", the result is not only the creation of ./build.ninja - it also creates build/<mode>/seastar/build.ninja and build/<mode>/abseil/build.ninja. After a "rm -r build" (or "ninja clean"), "ninja" will no longer work because those files are missing when Scylla's ninja tries to run ninja in those internal projects. So we need to add a dependency, e.g., that running ninja in Seastar requires build/<mode>/seastar/build.ninja to exist, and also say that the rule that (re)runs "configure.py" generates those files. After this patch, the sequence "configure.py --with-some-parameters --of-your-choice", "rm -r build", "ninja" works - "ninja" will re-run configure.py with the same parameters when it needs Seastar's or Abseil's build.ninja. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211230133702.869177-1-nyh@scylladb.com>	2022-01-05 10:15:19 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system and can be slowed down by higher priority workloads, so there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode: if no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly be delayed before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
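The four `tombstone_gc` modes listed in this commit can be summarised as a decision function. The sketch below is only a reading of the commit message, not Scylla's code: the function name `can_gc_tombstone` and its parameters are hypothetical, times are plain seconds, the exact boundary comparisons are assumptions, and `propagation_delay_in_seconds` is deliberately left out because the message does not say how it enters the check.

```python
from typing import Optional


def can_gc_tombstone(mode: str,
                     deletion_time: int,
                     now: int,
                     gc_grace_seconds: int,
                     last_repair_time: Optional[int] = None) -> bool:
    """Return True if a tombstone may be garbage-collected (illustrative only)."""
    if mode == "disabled":
        # Never GC a tombstone.
        return False
    if mode == "immediate":
        # GC a tombstone immediately.
        return True
    if mode == "timeout":
        # Default mode: GC once gc_grace_seconds have elapsed.
        return deletion_time + gc_grace_seconds <= now
    if mode == "repair":
        # GC only after the range covering the tombstone has been repaired.
        # Assumption: "repaired" means a repair finished after the deletion.
        return last_repair_time is not None and deletion_time < last_repair_time
    raise ValueError(f"unknown tombstone_gc mode: {mode!r}")
```

In `repair` mode the elapsed time alone never makes a tombstone collectable, which is the property the commit relies on to prevent resurrection of deleted data.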
Avi Kivity	5eccb42846	Merge "Host tool executables in the scylla main executable" from Botond " A big problem with scylla tool executables is that they include the entire scylla codebase and thus they are just as big as the scylla executable itself, making them impractical to deploy on production machines. We could try to combat this by selectively including only the actually needed dependencies but even ignoring the huge churn of sorting out our dependency hell (which we should do at one point anyway), some tools may genuinely depend on most of the scylla codebase. A better solution is to host the tool executables in the scylla executable itself, selecting the actual main function to run in some way. The tools themselves don't contain a lot of code so this won't cause any considerable bloat in the size of the scylla executable itself. This series does exactly this, folds all the tool executables into the scylla one, with main() selecting the actual main it will delegate to based on an argv[1] command line argument. If this is a known tool name, the respective tool's main will be invoked. If it is "server", missing or unrecognized, the scylla main is invoked. Originally this series used argv[0] as the means to switch between the main to run. This approach was abandoned for the approach mentioned above for the following reasons: * No launcher script, hard link, soft link or similar games are needed to launch a specific tool. * No packaging needed, all tools are automatically deployed. * Explicit tool selection, no surprises after renaming scylla to something else. * Tools are discoverable via scylla's description. * Follows the trend set by modern command line multi-command or multi-app programs, like git. 
Fixes: #7801 Tests: unit(dev) " * 'tools-in-scylla-exec-v5' of https://github.com/denesb/scylla: main,tools,configure.py: fold tools into scylla exec tools: prepare for inclusion in scylla's main main: add skeleton switching code on argv[1] main: extract scylla specific code into scylla_main()	2022-01-04 17:55:07 +02:00
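The argv[1] dispatch rule stated in this merge can be sketched as follows. The real switch lives in C++ in main.cc; this Python version only illustrates the selection rule. The function name `select_main` and the tool name used in the usage note are hypothetical, and whether argv[1] is stripped before delegating is an assumption.

```python
from typing import Callable, Dict, List, Tuple

MainFn = Callable[[List[str]], int]


def select_main(argv: List[str],
                tools: Dict[str, MainFn],
                server_main: MainFn) -> Tuple[MainFn, List[str]]:
    """Pick the main function to delegate to, based on argv[1]."""
    if len(argv) > 1:
        name = argv[1]
        if name in tools:
            # Known tool name: run that tool's main.
            # Assumption: the selector argument is dropped before delegating.
            return tools[name], argv[:1] + argv[2:]
        if name == "server":
            # Explicit request for the server.
            return server_main, argv[:1] + argv[2:]
    # Missing or unrecognized argv[1]: fall back to the scylla main,
    # leaving the arguments untouched.
    return server_main, argv
```

With a table such as `{"some-tool": some_tool_main}`, running `scylla some-tool --help` would select `some_tool_main`, while `scylla --smp 1` would fall through to the server main with its arguments unchanged.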
Eliran Sinvani	6d9d00ec9c	configure.py: Set seastar scheduling groups count explicitly In order to have stability and also regression control, we set the scheduling groups parameter explicitly. Closes #9847	2021-12-27 15:48:45 +02:00
Botond Dénes	bb0874b28b	main,tools,configure.py: fold tools into scylla exec The infrastructure is now in place. Remove the proxy main of the tools, and add appropriate `else if` statements to the executable switch in main.cc. Also remove the tool applications from the `apps` list and add their respective sources as dependencies to the main scylla executable. With this, we now have all tool executables living inside the scylla main one.	2021-12-20 18:27:25 +02:00
Avi Kivity	021c7593b8	data_dictionary: move user_types_metadata to new module data_dictionary The new module will contain all schema related metadata, detached from actual data access (provided by the database class). User types is the first contents to be moved to the new module.	2021-12-15 13:52:10 +02:00
Avi Kivity	c519857beb	build: rearrange -O3 and -f<optimization-option> options It turns out that -O3 enables -fslp-vectorize even if it is disabled before -O3 on the command line. Rearrange the code so that -O3 is before the more specific optimization options.	2021-12-07 17:52:32 +02:00
Avi Kivity	04ad07b072	build: disable superword-level parallelism (slp) on clang Clang (and gcc) can combine loads and stores of independent variables into wider operations, often using vector registers. This reduces instruction count and execution unit occupancy. However, clang is too aggressive and generates loads that break the store-to-load forwarding rules: a load must be the same size or smaller than the corresponding store, or it will execute with a large penalty. Disabling slp results in larger but faster code. Comparing before and after on Zen 3: slp: 226766.49 tps ( 75.1 allocs/op, 12.1 tasks/op, 45073 insns/op) 226679.57 tps ( 75.1 allocs/op, 12.1 tasks/op, 45074 insns/op) 226168.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 45061 insns/op) 225884.34 tps ( 75.1 allocs/op, 12.1 tasks/op, 45068 insns/op) 225998.16 tps ( 75.1 allocs/op, 12.1 tasks/op, 45056 insns/op) median 226168.79 tps ( 75.1 allocs/op, 12.1 tasks/op, 45061 insns/op) median absolute deviation: 284.45 maximum: 226766.49 minimum: 225884.34 no slp: 228195.33 tps ( 75.1 allocs/op, 12.1 tasks/op, 45109 insns/op) 227773.76 tps ( 75.1 allocs/op, 12.1 tasks/op, 45123 insns/op) 228088.98 tps ( 75.1 allocs/op, 12.1 tasks/op, 45117 insns/op) 228157.43 tps ( 75.1 allocs/op, 12.1 tasks/op, 45129 insns/op) 228072.29 tps ( 75.1 allocs/op, 12.1 tasks/op, 45128 insns/op) median 228088.98 tps ( 75.1 allocs/op, 12.1 tasks/op, 45117 insns/op) median absolute deviation: 68.45 maximum: 228195.33 minimum: 227773.76 Disabling slp increases the instruction count by ~60 instructions per op (0.13%) but increases throughput by 0.85%. This shows the impact of the violation is quite high. 
It can also be observed by the effect on stalled cycles: slp: 44,932.70 msec task-clock # 0.993 CPUs utilized 13,618 context-switches # 303.075 /sec 33 cpu-migrations # 0.734 /sec 1,695 page-faults # 37.723 /sec 211,997,160,633 cycles # 4.718 GHz (71.67%) 1,118,855,786 stalled-cycles-frontend # 0.53% frontend cycles idle (71.67%) 1,258,837,025 stalled-cycles-backend # 0.59% backend cycles idle (71.66%) 454,445,559,376 instructions # 2.14 insn per cycle # 0.00 stalled cycles per insn (71.66%) 83,557,588,477 branches # 1.860 G/sec (71.67%) 174,313,252 branch-misses # 0.21% of all branches (71.67%) no-slp: 44,579.83 msec task-clock # 0.986 CPUs utilized 13,435 context-switches # 301.369 /sec 33 cpu-migrations # 0.740 /sec 1,691 page-faults # 37.932 /sec 210,070,080,283 cycles # 4.712 GHz (71.68%) 1,066,774,628 stalled-cycles-frontend # 0.51% frontend cycles idle (71.68%) 1,082,255,966 stalled-cycles-backend # 0.52% backend cycles idle (71.66%) 455,067,924,891 instructions # 2.17 insn per cycle # 0.00 stalled cycles per insn (71.68%) 83,597,450,748 branches # 1.875 G/sec (71.65%) 151,897,866 branch-misses # 0.18% of all branches (71.68%) Note the differences in "backend cycles idle" and "stalled cycles per insn". I also observed the same pattern on a much older generation Intel (although the baseline instructions per clock there are around 0.56). 
slp: 42232.64 tps ( 75.1 allocs/op, 12.1 tasks/op, 44818 insns/op) 42318.87 tps ( 75.1 allocs/op, 12.1 tasks/op, 44849 insns/op) 42331.33 tps ( 75.1 allocs/op, 12.1 tasks/op, 44857 insns/op) 42315.89 tps ( 75.1 allocs/op, 12.1 tasks/op, 44875 insns/op) 42410.19 tps ( 75.1 allocs/op, 12.1 tasks/op, 44818 insns/op) median 42318.87 tps ( 75.1 allocs/op, 12.1 tasks/op, 44849 insns/op) median absolute deviation: 12.46 maximum: 42410.19 minimum: 42232.64 no-slp: 42464.18 tps ( 75.1 allocs/op, 12.1 tasks/op, 44886 insns/op) 42631.88 tps ( 75.1 allocs/op, 12.1 tasks/op, 44939 insns/op) 42783.95 tps ( 75.1 allocs/op, 12.1 tasks/op, 44961 insns/op) 42671.23 tps ( 75.1 allocs/op, 12.1 tasks/op, 44947 insns/op) 42487.82 tps ( 75.1 allocs/op, 12.1 tasks/op, 44875 insns/op) median 42631.88 tps ( 75.1 allocs/op, 12.1 tasks/op, 44939 insns/op) median absolute deviation: 144.06 maximum: 42783.95 minimum: 42464.18 slp: 26,877.01 msec task-clock # 0.989 CPUs utilized 15,621 context-switches # 0.581 K/sec 9 cpu-migrations # 0.000 K/sec 55,322 page-faults # 0.002 M/sec 96,084,360,190 cycles # 3.575 GHz (72.55%) 71,435,545,235 stalled-cycles-frontend # 74.35% frontend cycles idle (72.57%) 59,531,573,539 stalled-cycles-backend # 61.96% backend cycles idle (70.96%) 53,273,420,083 instructions # 0.55 insn per cycle # 1.34 stalled cycles per insn (72.55%) 10,240,844,987 branches # 381.026 M/sec (72.57%) 94,348,150 branch-misses # 0.92% of all branches (72.57%) no-slp: 26,381.66 msec task-clock # 0.971 CPUs utilized 15,586 context-switches # 0.591 K/sec 9 cpu-migrations # 0.000 K/sec 55,318 page-faults # 0.002 M/sec 94,317,505,691 cycles # 3.575 GHz (72.59%) 69,693,601,709 stalled-cycles-frontend # 73.89% frontend cycles idle (72.59%) 57,579,078,046 stalled-cycles-backend # 61.05% backend cycles idle (58.08%) 53,260,417,953 instructions # 0.56 insn per cycle # 1.31 stalled cycles per insn (72.60%) 10,235,123,948 branches # 387.964 M/sec (72.60%) 96,002,988 branch-misses # 0.94% of all 
branches (72.62%)	2021-12-07 17:08:38 +02:00
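The percentages quoted for the Zen 3 run can be rechecked from the two median lines in the commit message. The snippet below only redoes that arithmetic; the variable names are made up for the example, and the four input numbers are copied from the "median" rows.

```python
# Median rows from the Zen 3 measurements quoted in the commit message.
slp_tps, slp_insns = 226168.79, 45061
no_slp_tps, no_slp_insns = 228088.98, 45117

# Relative throughput gain from disabling slp, in percent.
throughput_gain_pct = (no_slp_tps / slp_tps - 1.0) * 100.0

# Extra instructions per operation, absolute and in percent.
extra_insns = no_slp_insns - slp_insns
extra_insns_pct = extra_insns / slp_insns * 100.0

print(f"throughput gain: {throughput_gain_pct:.2f}%")
print(f"extra instructions per op: {extra_insns} ({extra_insns_pct:.2f}%)")
```

On the medians the throughput gain comes out at about 0.85%, matching the message; the instruction delta is 56 per op, roughly 0.12%, consistent with the rounded "~60 instructions (0.13%)" figure.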
Avi Kivity	595cc328b1	Merge 'cql3: Remove term, replace with expression' from Jan Ciołek This PR finally removes the `term` class and replaces it with `expression`. * There was some trouble with `lwt_cache_id` in `expr::function_call`. The current code works the following way: * for each `function_call` inside a `term` that describes a pk restriction, `prepare_context::add_pk_function_call` is called. * `add_pk_function_call` takes a `::shared_ptr<cql3::functions::function_call>`, sets its `cache_id` and pushes this shared pointer onto a vector of all collected function calls * Later when some condition is met we want to clear cache ids of all those collected function calls. To do this we iterate through shared pointers collected in `prepare_context` and clear cache id for each of them. This doesn't work with `expr::function_call` because it isn't kept inside a shared pointer. To solve this I put the `lwt_cache_id` inside a shared pointer and then `prepare_context` collects these shared pointers to cache ids. I also experimented with doing this without any shared pointers, maybe we could just walk through the expression and clear the cache ids ourselves. But the problem is that expressions are copied all the time, we could clear the cache in one place, but forget about a copy. Doing it using shared pointers more closely matches the original behaviour. The experiment is on the [term2-pr3-backup-altcache](https://github.com/cvybhu/scylla/tree/term2-pr3-backup-altcache) branch * `shared_ptr<term>` being `nullptr` could mean: * It represents a cql value `null` * That there is no value, like `std::nullopt` (for example in `attributes.hh`) * That it's a mistake, it shouldn't be possible A good way to distinguish between optional and mistake is to look for `my_term->bind_and_get()`, we then know that it's not an optional value. 
* On the other hand `raw_value` cast to bool means: * `false` - null or unset * `true` - some value, maybe empty I ran a simple benchmark on my laptop to see how performance is affected: ``` build/release/test/perf/perf_simple_query --smp 1 -m 1G --operations-per-shard 1000000 --task-quota-ms 10 ``` * On master (`a21b1fbb2f`) I get: ``` 176506.60 tps ( 77.0 allocs/op, 12.0 tasks/op, 45831 insns/op) median 176506.60 tps ( 77.0 allocs/op, 12.0 tasks/op, 45831 insns/op) median absolute deviation: 0.00 maximum: 176506.60 minimum: 176506.60 ``` * On this branch I get: ``` 172225.30 tps ( 75.1 allocs/op, 12.1 tasks/op, 46106 insns/op) median 172225.30 tps ( 75.1 allocs/op, 12.1 tasks/op, 46106 insns/op) median absolute deviation: 0.00 maximum: 172225.30 minimum: 172225.30 ``` Closes #9481 * github.com:scylladb/scylla: cql3: Remove remaining mentions of term cql3: Remove term cql3: Rename prepare_term to prepare_expression cql3: Make prepare_term return an expression instead of term cql3: expr: Add size check to evaluate_set cql3: expr: Add expr::contains_bind_marker cql3: expr: Rename find_atom to find_binop cql3: expr: Add find_in_expression cql3: Remove term in operations cql3: Remove term in relations cql3: Remove term in multi_column_restrictions cql3: Remove term in term_slice, rename to bounds_slice cql3: expr: Remove term in expression cql3: expr: Add evaluate_IN_list(expression, options) cql3: Remove term in column_condition cql3: Remove term in select_statement cql3: Remove term in update_statement cql3: Use internal cql format in insert_prepared_json_statement cache types: Add map_type_impl::serialize(range of <bytes, bytes>) cql3: Remove term in cql3/attributes cql3: expr: Add constant::view() method cql3: expr: Implement fill_prepare_context(expression) cql3: expr: add expr::visit that takes a mutable expression cql3: expr: Add receiver to expr::bind_variable	2021-11-30 16:39:39 +02:00
Konstantin Osipov	c22f945f11	raft: (service) manage Raft configuration during topology changes Operations of adding or removing a node to Raft configuration are made idempotent: they do nothing if already done, and they are safe to resume after a failure. However, since topology changes are not transactional, if a bootstrap or removal procedure fails midway, Raft group 0 configuration may go out of sync with topology state as seen by gossip. In future we must change gossip to avoid making any persistent changes to the cluster: all changes to persistent topology state will be done exclusively through Raft Group 0. Specifically, instead of persisting the tokens by advertising them through gossip, the bootstrap will commit a change to a system table using Raft group 0. nodetool will switch from looking at gossip-managed tables to consulting with Raft Group 0 configuration or Raft-managed tables. Once this transformation is done, naturally, adding a node to Raft configuration (perhaps as a non-voting member at first) will become the first persistent change to ring state applied when a node joins; removing a node from the Raft Group 0 configuration will become the last action when removing a node. Until this is done, do our best to avoid a cluster state when a removed node or a node whose addition failed is stuck in Raft configuration, but the node is no longer present in gossip-managed system tables. In other words, keep gossip the primary source of truth. For this purpose, carefully choose the timing when we join and leave Raft group 0: Join the Raft group 0 only after we've advertised our tokens, so the cluster is aware of this node, it's visible in nodetool status, but before node state jumps to "normal", i.e. before it accepts queries. Since the operation is idempotent, invoke it on each restart. Remove the node from Group 0 before its tokens are removed from gossip-managed system tables. 
This guarantees that if removal from Raft group 0 fails for whatever reason, the node stays in the ring, so nodetool removenode and friends are re-tried. Add tracing.	2021-11-25 12:35:42 +03:00
Konstantin Osipov	8ee88a9d8a	raft: (discovery) introduce leader discovery state machine Introduce a special state machine used to find a leader of an existing Raft cluster or create a new cluster. This state machine should be used when a new Scylla node has no persisted Raft Group 0 configuration. The algorithm is initialized with a list of seed IP addresses, the IP address of this server, and this server's Raft server id. The IP addresses are used to construct an initial list of peers. Then, the algorithm tries to contact each peer (excluding self) from its peer list and share the peer list with this peer, as well as get the peer's peer list. If this peer is already part of some Raft cluster, this information is also shared. On a response from a peer, the current peer's peer list is updated. The algorithm stops when all peers have exchanged peer information or one of the peers responds with id of a Raft group and Raft server address of the group leader. (If any of the peers fails to respond, the algorithm re-tries ad infinitum with a timeout). More formally, the algorithm stops when one of the following is true: - it finds an instance with initialized Raft Group 0, with a leader - all the peers have been contacted, and this server's Raft server id is the smallest among all contacted peers.	2021-11-25 11:50:38 +03:00
Benny Halevy	d2703eace7	test: remove gossip_test First, it doesn't test the gossiper so it's unclear why we have it at all. And it doesn't test anything more than what we test using the cql_test_env either. For testing gossip there is test/manual/gossip. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211122081305.789375-2-bhalevy@scylladb.com>	2021-11-22 16:15:41 +02:00
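The discovery procedure described in this commit can be approximated with a small synchronous simulation. This is a sketch under stated assumptions rather than the state machine from the commit: the function `discover`, the `contact` callback and its reply dictionary (`id`, `peers`, `leader`), and the `max_rounds` bound (standing in for "re-tries ad infinitum with a timeout") are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Tuple

# Hypothetical reply shape: {"id": raft_server_id, "peers": set of addresses,
# "leader": leader address, or None if the peer has no group 0 yet}.
Reply = Optional[dict]
Contact = Callable[[str, FrozenSet[str]], Reply]


def discover(self_addr: str, self_id: int, seeds: Iterable[str],
             contact: Contact, max_rounds: int = 10) -> Tuple[str, Optional[str]]:
    peers = set(seeds) | {self_addr}   # initial peer list built from the seeds
    contacted: Dict[str, int] = {}     # address -> Raft server id
    rounds = 0
    while True:
        pending = [p for p in sorted(peers)
                   if p != self_addr and p not in contacted]
        if not pending:
            break                      # every known peer has answered
        if rounds == max_rounds:
            return ("retry", None)     # give up for now; the caller tries again
        rounds += 1
        for addr in pending:
            reply = contact(addr, frozenset(peers))  # share our peer list
            if reply is None:
                continue               # no response: retried in the next round
            if reply.get("leader") is not None:
                # Found an instance with an initialized group 0 and a leader.
                return ("join", reply["leader"])
            contacted[addr] = reply["id"]
            peers |= set(reply["peers"])             # merge the peer's list
    # All peers contacted: only the smallest Raft server id creates a cluster.
    if all(self_id < other for other in contacted.values()):
        return ("create", None)
    return ("retry", None)
```

Returning `"retry"` for a node that does not hold the smallest id is a simplification: such a node would keep polling until the cluster creator reports a leader.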
Botond Dénes	d4d4c0ace7	redis: mv service.* -> controller.*	2021-11-17 13:58:49 +02:00
Avi Kivity	7a3930f7cf	Merge 'More nodetool-replacing virtual tables' from Botond Dénes This PR introduces 4 new virtual tables aimed at replacing nodetool commands, working towards the long-term goal of replacing nodetool completely at least for cluster information retrieval purposes. As you may have noticed, most of these replacements are not exact matches. This is on purpose. I feel that the nodetool commands are somewhat chaotic: they might have had a clear plan on what command prints what but after years of organic development they are a mess of fields that feel like they don't belong. In addition to this, they are centered on C* terminology which often sounds strange or doesn't make any sense for scylla (off-heap memory, counter cache, etc.). So in this PR I tried to do a few things: * Drop all fields that don't make sense for scylla; * Rename/reformat/rephrase fields that have a corresponding concept in scylla, so that it uses the scylla terminology; * Group information in tables based on some common theme; With these guidelines in mind let's look at the virtual tables introduced in this PR: * `system.snapshots` - replacement for `nodetool listsnapshots`; * `system.protocol_servers`- replacement for `nodetool statusbinary` as well as `Thrift active` and `Native Transport active` from `nodetool info`; * `system.runtime_info` - replacement for `nodetool info`, not an exact match: some fields were removed, some were refactored to make sense for scylla; * `system.versions` - replacement for `nodetool version`, prints all versions, including build-id; Closes #9517 * github.com:scylladb/scylla: test/cql-pytest: add virtual_tables.py test/cql-pytest: nodetool.py: add take_snapshot() db/system_keyspace: add versions table configure.py: move release.cc and build_id.cc to scylla_core db/system_keyspace: add runtime_info table db/system_keyspace: add protocol_servers table service: storage_service: s/client_shutdown_hooks/protocol_servers/ service: storage_service: remove 
unused unregister_client_shutdown_hook redis: redis_service: implement the protocol_server interface alternator: controller: implement the protocol_server interface transport: controller: implement the protocol_server interface thrift: controller: implement the protocol_server interface Add protocol_server interface db/system_keyspace: add snapshots virtual table db/virtual_table: remove _db member db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables() docs/design-notes/system_keyspace.md: add listing of existing virtual tables docs/guides: add virtual-tables.md	2021-11-07 16:55:31 +02:00
Botond Dénes	5c87263ff8	configure.py: move release.cc and build_id.cc to scylla_core These two files were only added to the scylla executable and some specific unit tests. As we are about to use the symbols defined in these files in some scylla_core code, move them there.	2021-11-05 15:42:42 +02:00

1522 Commits