Commit Graph

2714 Commits

Author SHA1 Message Date
Nadav Har'El
b3cfd4ce07 cql-pytest: translate Cassandra's tests for ALTER operations
This is a translation of Cassandra's CQL unit test source file
validation/operations/AlterTest.java into our cql-pytest framework.

This test file includes 24 tests for various types of ALTER operations
(of keyspaces, tables and types). Two additional tests which required
multiple data centers to test were dropped with a comment explaining why.

All 24 tests pass on Cassandra; 8 fail on Scylla, reproducing one
already-known Scylla issue and 5 previously unknown ones:

  Refs #8948:  Cassandra 3.11.10 uses "class" instead of
               "sstable_compression" for compression settings by default
  Refs #9929:  Cassandra added "USING TIMESTAMP" to "ALTER TABLE",
               we didn't.
  Refs #9930:  Forbid re-adding static columns as regular and vice versa
  Refs #9935:  Scylla stores un-expanded compaction class name in system
               tables.
  Refs #10036: Reject empty options while altering a keyspace
  Refs #10037: If there are multiple values for a key, CQL silently
               chooses last value

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220206163820.1875410-2-nyh@scylladb.com>
2022-02-07 10:57:43 +02:00
Nadav Har'El
b61876f4ff test/cql-pytest: implement nodetool.compact()
Implement the nodetool.compact() function, requesting a major compaction
of the given table. As usual for the nodetool.* functions, this is
implemented with the REST API if available (i.e., testing Scylla), or
with the external "nodetool" command if not (for testing Cassandra).
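
The REST-or-CLI fallback described above can be sketched as follows (a hedged illustration, not the actual test-framework code: the function names and the REST endpoint path are assumptions and the real path may differ):

```python
import subprocess
import urllib.request

def compact_request(keyspace, table, rest_base=None):
    """Describe how to trigger a major compaction of keyspace.table.

    Returns ("rest", url) when a REST base URL is given (i.e. testing
    Scylla), or ("cli", argv) for the external "nodetool" command (for
    testing Cassandra).  The endpoint path below is an assumption.
    """
    if rest_base is not None:
        # Hypothetical REST path -- the real Scylla API path may differ.
        url = f"{rest_base}/storage_service/keyspace_compaction/{keyspace}?cf={table}"
        return ("rest", url)
    return ("cli", ["nodetool", "compact", keyspace, table])

def compact(keyspace, table, rest_base=None):
    kind, target = compact_request(keyspace, table, rest_base)
    if kind == "rest":
        urllib.request.urlopen(urllib.request.Request(target, method="POST")).close()
    else:
        subprocess.check_call(target)
```

Splitting the decision ("which mechanism") from the side effect ("perform it") keeps the fallback logic testable without a running cluster.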

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220206163820.1875410-1-nyh@scylladb.com>
2022-02-07 10:57:42 +02:00
Konstantin Osipov
caeaba60f9 cql_repl: use POSIX primitives to reset input/output
Seastar uses POSIX IO for output in addition to C++ iostreams,
e.g. in print_safe(), where it write()s directly to stdout.

Instead of manipulating C++ output streams to reset
stdout/log files, reopen the underlying file descriptors
to output/log files.
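
The idea can be sketched with POSIX primitives (shown here in Python for brevity; the actual fix is in cql_repl's C++ code): instead of redirecting a stream object, the underlying file descriptor itself is reopened, so even direct write() calls to fd 1 land in the output file:

```python
import os
import tempfile

def redirect_stdout_fd(path):
    """Reopen fd 1 onto `path`: raw write() calls to fd 1 (as done by
    Seastar's print_safe()) then land in the file -- something that
    manipulating a C++/Python stream object alone cannot achieve."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    saved = os.dup(1)      # keep the old stdout so it can be restored
    os.dup2(fd, 1)         # fd 1 now refers to the output file
    os.close(fd)
    return saved

def restore_stdout_fd(saved):
    os.dup2(saved, 1)
    os.close(saved)

# Demo: a direct write to fd 1, bypassing all stream buffering.
out = tempfile.NamedTemporaryFile(delete=False).name
saved = redirect_stdout_fd(out)
os.write(1, b"direct write\n")
restore_stdout_fd(saved)
```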

Fixes #9962 "cql_repl prints junk into the log"
Message-Id: <20220204205032.1313150-1-kostja@scylladb.com>
2022-02-07 10:53:20 +02:00
Avi Kivity
fe65122ccd Merge 'Distribute select count(*) queries' from Michał Sala
This pull request speeds up execution of `count(*)` queries. It does so by splitting a given query into sub-queries and distributing them across a group of nodes for parallel execution.

A new level of coordination was added. A node called the super-coordinator splits an aggregation query into sub-queries and distributes them across a group of coordinators. The super-coordinator is also responsible for merging the results.

To develop a mechanism for speeding up `count(*)` queries, there was a need to detect which queries have a `count(*)` selector. Since this pull request is a proof of concept, the detection is rather crude: it only catches the simplest cases of `count(*)` queries (a single selector with no column name specified).

After detecting that a query is a `count(*)`, it should be split into sub-queries and sent to other coordinators. The splitting part wasn't that difficult; it has been achieved by limiting the original query's partition ranges. Sending the modified query to another node was much harder. The easiest scenario would be to send the whole `cql3::statements::select_statement`. Unfortunately `cql3::statements::select_statement` can't be [de]serialized, so sending it was out of the question. Even more unfortunately, some non-[de]serializable members of `cql3::statements::select_statement` are required to start the execution process of this statement. Finally, I decided to send a `query::read_command` paired with the required [de]serializable members. Objects that cannot be [de]serialized (such as the query's selector) are mocked on the receiving end.

When a super-coordinator receives a `count(*)` query, it splits it into sub-queries. It does so by splitting the original query's partition ranges into a list of vnodes, grouping them by their owner, and creating sub-queries with partition ranges set to the successive results of this grouping. After creation, each sub-query is sent to the owner of its partition ranges. The owner dispatches the received sub-query to all of its shards. Shards slice the partition ranges of the received sub-query so that they will only query data that is owned by them. Each shard becomes a coordinator and executes the sub-query prepared this way.
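
The split-by-owner and merge steps can be sketched as a toy model (range IDs, owner assignment, and the "send" step are stand-ins for the real storage_proxy/forward_service machinery):

```python
from collections import defaultdict

def split_by_owner(vnode_ranges):
    """Super-coordinator: group the query's vnode ranges by owning node,
    producing one sub-query (a list of ranges) per owner."""
    subqueries = defaultdict(list)
    for rng, owner in vnode_ranges:
        subqueries[owner].append(rng)
    return dict(subqueries)

def execute_on_node(ranges, rows_by_range):
    """Stand-in for a node counting rows over its assigned ranges."""
    return sum(rows_by_range[r] for r in ranges)

def distributed_count(vnode_ranges, rows_by_range):
    """Split, 'send' each sub-query to its owner, merge partial counts."""
    groups = split_by_owner(vnode_ranges)
    return sum(execute_on_node(ranges, rows_by_range)
               for ranges in groups.values())

# Toy data: (token-range id, owner node).  Each owner returns one
# partial count; the super-coordinator sums them into the final result.
ranges = [("r1", "n1"), ("r2", "n3"), ("r3", "n3"), ("r4", "n2")]
rows = {"r1": 131, "r2": 200, "r3": 235, "r4": 434}
print(distributed_count(ranges, rows))
```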

A 3-node cluster set up on powerful desktops located in the office (3x32 cores).
Filled the cluster with ~2 * 10^8 rows using scylla-bench and ran:
```
time cqlsh <ip> <port> --request-timeout=3600 -e "select count(*) from scylla_bench.test using timeout 1h;"
```

* master: 68s
* this branch: 2s

A 3-node cluster (each node had 2 shards, `murmur3_ignore_msb_bits` was set to 1, `num_tokens` was set to 3)

```
>  cqlsh -e 'tracing on; select count(*) from ks.t;'
Now Tracing is enabled

 count
-------
  1000

(1 rows)

Tracing session: e5852020-7fc3-11ec-8600-4c4c210dd657

 activity                                                                                                                                    | timestamp                  | source    | source_elapsed | client
---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                          Execute CQL3 query | 2022-01-27 22:53:08.770000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                               Parsing a statement [shard 1] | 2022-01-27 22:53:08.770451 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                            Processing a statement [shard 1] | 2022-01-27 22:53:08.770487 | 127.0.0.1 |             36 | 127.0.0.1
                                                                                        Dispatching forward_request to 3 endpoints [shard 1] | 2022-01-27 22:53:08.770509 | 127.0.0.1 |             58 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.1:0 [shard 1] | 2022-01-27 22:53:08.770516 | 127.0.0.1 |             64 | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770519 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770528 | 127.0.0.1 |              9 | 127.0.0.1
                                             Start querying token range ({-4242912715832118944, end}, {-4075408479358018994, end}] [shard 1] | 2022-01-27 22:53:08.770531 | 127.0.0.1 |             12 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770537 | 127.0.0.1 |             18 | 127.0.0.1
                      Scanning cache for range ({-4242912715832118944, end}, {-4075408479358018994, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770541 | 127.0.0.1 |             22 | 127.0.0.1
    Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770589 | 127.0.0.1 |             70 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.2:0 [shard 1] | 2022-01-27 22:53:08.770600 | 127.0.0.1 |            149 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.3:0 [shard 1] | 2022-01-27 22:53:08.770608 | 127.0.0.1 |            157 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770627 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770639 | 127.0.0.1 |             11 | 127.0.0.1
                                               Start querying token range ({2507462623645193091, end}, {3897266736829642805, end}] [shard 0] | 2022-01-27 22:53:08.770643 | 127.0.0.1 |             15 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770646 | 127.0.0.1 |             19 | 127.0.0.1
                        Scanning cache for range ({2507462623645193091, end}, {3897266736829642805, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770649 | 127.0.0.1 |             22 | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770658 | 127.0.0.2 |             -- | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770674 | 127.0.0.3 |              5 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770698 | 127.0.0.2 |             40 | 127.0.0.1
                                             Start querying token range [{4611686018427387904, start}, {5592106830937975806, end}] [shard 1] | 2022-01-27 22:53:08.770704 | 127.0.0.2 |             46 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770710 | 127.0.0.2 |             52 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770712 | 127.0.0.3 |             43 | 127.0.0.1
                      Scanning cache for range [{4611686018427387904, start}, {5592106830937975806, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770714 | 127.0.0.2 |             56 | 127.0.0.1
                                           Start querying token range [{-4611686018427387904, start}, {-4242912715832118944, end}] [shard 1] | 2022-01-27 22:53:08.770718 | 127.0.0.3 |             49 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770739 | 127.0.0.3 |             70 | 127.0.0.1
                    Scanning cache for range [{-4611686018427387904, start}, {-4242912715832118944, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770743 | 127.0.0.3 |             73 | 127.0.0.1
    Page stats: 17 partition(s), 0 static row(s) (0 live, 0 dead), 17 clustering row(s) (17 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770814 | 127.0.0.3 |            145 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770846 | 127.0.0.3 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770862 | 127.0.0.3 |             16 | 127.0.0.1
    Page stats: 71 partition(s), 0 static row(s) (0 live, 0 dead), 71 clustering row(s) (71 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.770865 | 127.0.0.1 |            238 | 127.0.0.1
                                             Start querying token range ({-6683686776653114062, end}, {-6473446911791631266, end}] [shard 0] | 2022-01-27 22:53:08.770867 | 127.0.0.3 |             21 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770874 | 127.0.0.3 |             28 | 127.0.0.1
                      Scanning cache for range ({-6683686776653114062, end}, {-6473446911791631266, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770879 | 127.0.0.3 |             33 | 127.0.0.1
    Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770880 | 127.0.0.2 |            222 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.770888 | 127.0.0.1 |            369 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770909 | 127.0.0.1 |            390 | 127.0.0.1
                                             Start querying token range ({-4075408479358018994, end}, {-3391415989210253693, end}] [shard 1] | 2022-01-27 22:53:08.770911 | 127.0.0.1 |            392 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770914 | 127.0.0.1 |            395 | 127.0.0.1
                      Scanning cache for range ({-4075408479358018994, end}, {-3391415989210253693, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770936 | 127.0.0.1 |            418 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770951 | 127.0.0.2 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770966 | 127.0.0.2 |             15 | 127.0.0.1
    Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.770969 | 127.0.0.3 |            123 | 127.0.0.1
                                                                    Start querying token range (-inf, {-6683686776653114062, end}] [shard 0] | 2022-01-27 22:53:08.770969 | 127.0.0.2 |             18 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770974 | 127.0.0.2 |             23 | 127.0.0.1
                                             Scanning cache for range (-inf, {-6683686776653114062, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770977 | 127.0.0.2 |             26 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.770993 | 127.0.0.3 |            324 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770998 | 127.0.0.3 |            329 | 127.0.0.1
                                                              Start querying token range ({-3391415989210253693, end}, {0, start}) [shard 1] | 2022-01-27 22:53:08.771001 | 127.0.0.3 |            332 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.771004 | 127.0.0.3 |            335 | 127.0.0.1
                                       Scanning cache for range ({-3391415989210253693, end}, {0, start}) and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.771007 | 127.0.0.3 |            338 | 127.0.0.1
    Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.771044 | 127.0.0.1 |            525 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771069 | 127.0.0.1 |            442 | 127.0.0.1
                                                                                                 On shard execution result is [71] [shard 0] | 2022-01-27 22:53:08.771145 | 127.0.0.1 |            518 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771308 | 127.0.0.1 |            789 | 127.0.0.1
                                                                                                 On shard execution result is [60] [shard 1] | 2022-01-27 22:53:08.771351 | 127.0.0.1 |            832 | 127.0.0.1
 Page stats: 127 partition(s), 0 static row(s) (0 live, 0 dead), 127 clustering row(s) (127 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.771379 | 127.0.0.2 |            427 | 127.0.0.1
 Page stats: 183 partition(s), 0 static row(s) (0 live, 0 dead), 183 clustering row(s) (183 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.771385 | 127.0.0.3 |            716 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771402 | 127.0.0.3 |            556 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771403 | 127.0.0.2 |            745 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.771408 | 127.0.0.2 |            750 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771409 | 127.0.0.3 |            563 | 127.0.0.1
                                                                     Start querying token range ({5592106830937975806, end}, +inf) [shard 1] | 2022-01-27 22:53:08.771411 | 127.0.0.2 |            754 | 127.0.0.1
                                           Start querying token range ({-6272011798787969456, end}, {-4611686018427387904, start}) [shard 0] | 2022-01-27 22:53:08.771412 | 127.0.0.3 |            566 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771415 | 127.0.0.3 |            569 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.771415 | 127.0.0.2 |            757 | 127.0.0.1
                                              Scanning cache for range ({5592106830937975806, end}, +inf) and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.771419 | 127.0.0.2 |            761 | 127.0.0.1
                    Scanning cache for range ({-6272011798787969456, end}, {-4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771419 | 127.0.0.3 |            573 | 127.0.0.1
                                                                                    Received forward_result=[131] from 127.0.0.1:0 [shard 1] | 2022-01-27 22:53:08.771454 | 127.0.0.1 |           1003 | 127.0.0.1
    Page stats: 74 partition(s), 0 static row(s) (0 live, 0 dead), 74 clustering row(s) (74 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.771764 | 127.0.0.3 |            918 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771768 | 127.0.0.3 |            922 | 127.0.0.1
                                                               Start querying token range [{0, start}, {2507462623645193091, end}] [shard 0] | 2022-01-27 22:53:08.771771 | 127.0.0.3 |            925 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771775 | 127.0.0.3 |            929 | 127.0.0.1
                                        Scanning cache for range [{0, start}, {2507462623645193091, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771779 | 127.0.0.3 |            933 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771935 | 127.0.0.3 |           1265 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771950 | 127.0.0.2 |            998 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771956 | 127.0.0.2 |           1004 | 127.0.0.1
                                             Start querying token range ({-6473446911791631266, end}, {-6272011798787969456, end}] [shard 0] | 2022-01-27 22:53:08.771959 | 127.0.0.2 |           1008 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771963 | 127.0.0.2 |           1011 | 127.0.0.1
                      Scanning cache for range ({-6473446911791631266, end}, {-6272011798787969456, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771966 | 127.0.0.2 |           1014 | 127.0.0.1
    Page stats: 13 partition(s), 0 static row(s) (0 live, 0 dead), 13 clustering row(s) (13 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772008 | 127.0.0.2 |           1057 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.772012 | 127.0.0.2 |           1061 | 127.0.0.1
                                             Start querying token range ({3897266736829642805, end}, {4611686018427387904, start}) [shard 0] | 2022-01-27 22:53:08.772014 | 127.0.0.2 |           1063 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.772016 | 127.0.0.2 |           1065 | 127.0.0.1
                      Scanning cache for range ({3897266736829642805, end}, {4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.772019 | 127.0.0.2 |           1067 | 127.0.0.1
                                                                                                On shard execution result is [200] [shard 1] | 2022-01-27 22:53:08.772053 | 127.0.0.3 |           1384 | 127.0.0.1
    Page stats: 56 partition(s), 0 static row(s) (0 live, 0 dead), 56 clustering row(s) (56 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772138 | 127.0.0.2 |           1186 | 127.0.0.1
 Page stats: 190 partition(s), 0 static row(s) (0 live, 0 dead), 190 clustering row(s) (190 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.772364 | 127.0.0.2 |           1706 | 127.0.0.1
 Page stats: 149 partition(s), 0 static row(s) (0 live, 0 dead), 149 clustering row(s) (149 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772407 | 127.0.0.3 |           1561 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772417 | 127.0.0.3 |           1571 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.772418 | 127.0.0.2 |           1760 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772426 | 127.0.0.2 |           1475 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772428 | 127.0.0.2 |           1476 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772449 | 127.0.0.3 |           1604 | 127.0.0.1
                                                                                                On shard execution result is [196] [shard 0] | 2022-01-27 22:53:08.772555 | 127.0.0.2 |           1603 | 127.0.0.1
                                                                                                On shard execution result is [238] [shard 1] | 2022-01-27 22:53:08.772674 | 127.0.0.2 |           2016 | 127.0.0.1
                                                                                                On shard execution result is [235] [shard 0] | 2022-01-27 22:53:08.772770 | 127.0.0.3 |           1924 | 127.0.0.1
                                                                                    Received forward_result=[435] from 127.0.0.3:0 [shard 1] | 2022-01-27 22:53:08.772933 | 127.0.0.1 |           2482 | 127.0.0.1
                                                                                    Received forward_result=[434] from 127.0.0.2:0 [shard 1] | 2022-01-27 22:53:08.773110 | 127.0.0.1 |           2658 | 127.0.0.1
                                                                                                           Merged result is [1000] [shard 1] | 2022-01-27 22:53:08.773111 | 127.0.0.1 |           2660 | 127.0.0.1
                                                                                              Done processing - preparing a result [shard 1] | 2022-01-27 22:53:08.773114 | 127.0.0.1 |           2663 | 127.0.0.1
                                                                                                                            Request complete | 2022-01-27 22:53:08.772666 | 127.0.0.1 |           2666 | 127.0.0.1
```

Fixes #1385

Closes #9209

* github.com:scylladb/scylla:
  docs: add parallel aggregations design doc
  db: config: add a flag to disable new parallelized aggregation algorithm
  test: add parallelized select count test
  forward_service: add metrics
  forward_service: parallelize execution across shards
  forward_service: add tracing
  cql3: statements: introduce parallelized_select_statement
  cql3: query_processor: add forward_service reference to query_processor
  gms: add PARALLELIZED_AGGREGATION feature
  service: introduce forward_service
  storage_proxy: extract query_ranges_to_vnodes_generator to a separate file
  messaging_service: add verb for count(*) request forwarding
  cql3: selection: detect if a selection represents count(*)
2022-02-04 12:34:19 +02:00
Nadav Har'El
b54e85088d Merge 'snapshots: Fix snapshot-ctl to include snapshots of dropped tables' from Benny Halevy
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on the column family interface to scanning every datadir
that the database knows of in search of "snapshots" folders.
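
The new approach can be sketched like so (a simplified stand-in for the actual C++ implementation; the `<datadir>/<keyspace>/<table-dir>/snapshots/<tag>` layout assumed here follows the usual Scylla/Cassandra on-disk convention):

```python
import os
import tempfile

def find_snapshots(datadirs):
    """Scan every known datadir for "snapshots" folders, so snapshots of
    dropped tables (whose column family objects no longer exist) are
    still reported."""
    found = []
    for datadir in datadirs:
        for dirpath, dirnames, _files in os.walk(datadir):
            if "snapshots" in dirnames:
                snap_dir = os.path.join(dirpath, "snapshots")
                for tag in sorted(os.listdir(snap_dir)):
                    found.append((dirpath, tag))
    return found

# Demo: the table's directory (and its snapshot) survive on disk even
# after the in-memory column family object is gone.
datadir = tempfile.mkdtemp()
os.makedirs(os.path.join(datadir, "ks", "dropped_table-abc", "snapshots", "pre-drop"))
print(find_snapshots([datadir]))
```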

This PR is a rebased (and slightly, cosmetically, cleaned-up) version of #9539,
and so it replaces the previous PR.

Fixes #3463
Closes #7122

Closes #9884

* github.com:scylladb/scylla:
  snapshots: Fix snapshot-ctl to include snapshots of dropped tables
  table: snapshot: add debug messages
2022-02-04 12:34:19 +02:00
Botond Dénes
d309a86708 Merge 'Add keyspace_offstrategy_compaction api' from Benny Halevy
This series adds methods to perform offstrategy compaction, if needed, returning a future<bool>
so the caller can wait on it until compaction completes.
The returned value is true iff offstrategy compaction was needed.

The added keyspace_offstrategy_compaction calls perform_offstrategy_compaction on the specified keyspace and tables, returning the number of tables that required offstrategy compaction.
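
The contract can be sketched as follows (a hedged toy model: the real code is asynchronous C++ returning `future<bool>`, and the dict-based "table" here is purely illustrative):

```python
def perform_offstrategy_compaction(table):
    """Per-table step: returns True iff offstrategy compaction was needed
    (i.e. there were maintenance sstables to integrate)."""
    needed = bool(table.get("maintenance_sstables"))
    if needed:
        table["sstables"] = table.get("sstables", []) + table.pop("maintenance_sstables")
    return needed

def keyspace_offstrategy_compaction(tables):
    """Run offstrategy compaction on every table of the keyspace and
    return how many tables actually required it."""
    return sum(1 for t in tables if perform_offstrategy_compaction(t))

tables = [
    {"sstables": ["a"], "maintenance_sstables": ["m1", "m2"]},
    {"sstables": ["b"]},                    # nothing pending: not counted
    {"maintenance_sstables": ["m3"]},
]
needed_count = keyspace_offstrategy_compaction(tables)
print(needed_count)
```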

A respective unit test was added to the rest_api pytest.

This PR replaces https://github.com/scylladb/scylla/pull/9095, which suggested adding an option to `keyspace_compaction`,
since the offstrategy compaction triggering logic differs enough from major compaction to merit a new API.

Test: unit (dev)

Closes #9980

* github.com:scylladb/scylla:
  test: rest_api: add unit tests for keyspace_offstrategy_compaction api
  api: add keyspace_offstrategy_compaction
  compaction_manager: get rid of submit_offstrategy
  table: add perform_offstrategy_compaction
  compaction_manager: perform_offstrategy: print ks.cf in log messages
  compaction_manager: allow waiting on offstrategy compaction
2022-02-02 13:15:31 +02:00
Piotr Wojtczak
0dd7739716 snapshots: Fix snapshot-ctl to include snapshots of dropped tables
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on the column family interface to scanning every datadir
that the database knows of in search of "snapshots" folders.

Fixes #3463
Closes #7122

Closes #9884

Signed-off-by: Piotr Wojtczak <piotr.m.wojtczak@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-01 22:31:43 +02:00
Michał Sala
140bab279c test: add parallelized select count test
Added a test that checks whether a SELECT COUNT(*) query was transformed
and processed in parallel. The check reads the CQL statistics and compares
subsequent counts of parallelized aggregation SELECT query executions.
2022-02-01 21:14:41 +01:00
Michał Sala
66a93d3000 cql3: query_processor: add forward_service reference to query_processor 2022-02-01 21:14:41 +01:00
Michał Sala
0fe59082ec storage_proxy: extract query_ranges_to_vnodes_generator to a separate file
Such separation allows using query_ranges_to_vnodes_generator by other
services without needing a storage_proxy dependency.
2022-02-01 21:14:41 +01:00
Tomasz Grabiec
00a9326ae7 Merge "raft: let modify_config finish on a follower that removes itself" from Kamil
When forwarding a reconfiguration request from a follower to the leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.

There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.

`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.

We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.

Fixes #9981.

A regression test was also added in randomized_nemesis_test.

* kbr/modify-config-finishes-v1:
  test: raft: randomized_nemesis_test: regression test for #9981
  raft: server: don't create local waiter in `modify_config`
2022-01-31 20:14:50 +01:00
Nadav Har'El
8a745593a2 Merge 'alternator: fill UnprocessedKeys for failed batch reads' from Piotr Sarna
The DynamoDB protocol specifies that when getting items in a batch
fails only partially, the unprocessed keys can be returned so that
the user can perform a retry.
Alternator used to fail the whole request if any of the reads failed,
but now it instead produces the list of unprocessed keys
and returns them to the user, as long as at least 1 read was
successful.

This series comes with a test based on Scylla's error injection mechanism, and thus is only useful in modes compiled with error injection. In release mode, expect to see the following message:
SKIPPED (Error injection not enabled in Scylla - try compiling in dev/debug/sanitize mode)

Fixes #9984

Closes #9986

* github.com:scylladb/scylla:
  test: add total failure case for GetBatchItem
  test: add error injection case for GetBatchItem
  test: add a context manager for error injection to alternator
  alternator: add error injection to BatchGetItem
  alternator: fill UnprocessedKeys for failed batch reads
2022-01-31 15:28:24 +02:00
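The partial-failure behavior this merge describes can be sketched as follows (a hypothetical Python model for illustration; `read_item`, the result shape, and the error type are assumptions, not Alternator's actual code):

```python
def batch_get(keys, read_item):
    """Read each key; collect failed keys as "unprocessed".

    Returns (responses, unprocessed). Raises only if *every* read
    failed, mirroring the behavior described above: a partial failure
    is reported via the unprocessed-keys list, not as an error.
    """
    responses = []
    unprocessed = []
    for key in keys:
        try:
            responses.append(read_item(key))
        except Exception:
            unprocessed.append(key)
    if keys and not responses:
        # All reads failed: surface an error instead of returning a
        # "success" response with every key listed as unprocessed.
        raise RuntimeError("all reads in the batch failed")
    return responses, unprocessed
```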
Piotr Sarna
c87126198d test: add total failure case for GetBatchItem
The test verifies that if all reads from a batch operation
failed, the result is an error, and not a success response
with UnprocessedKeys parameter set to all keys.
2022-01-31 14:21:55 +01:00
Piotr Sarna
e79c2943fc test: add error injection case for GetBatchItem
The new test case is based on Scylla's error injection mechanism
and forces a partial read by failing some requests from the batch.
2022-01-31 14:21:55 +01:00
Piotr Sarna
99c5bec0e2 test: add a context manager for error injection to alternator
With the new context manager it's now easier to request an error
to be injected via the REST API. Note that error injection is only
enabled in certain build modes (dev, debug, sanitize)
and the test case will be skipped if it's not possible to use
this mechanism.
2022-01-31 14:21:55 +01:00
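A minimal sketch of such a context manager, assuming an injected REST client object; the endpoint path and parameter name are illustrative assumptions, not necessarily Scylla's actual REST API:

```python
from contextlib import contextmanager

@contextmanager
def scylla_inject_error(rest_api, err, one_shot=False):
    # Enable the named injection point over the REST API, and make
    # sure it is disabled again when the test body exits, even if
    # the test fails. Path and params are illustrative assumptions.
    rest_api.put(f"/v2/error_injection/injection/{err}",
                 params={"one_shot": str(one_shot)})
    try:
        yield
    finally:
        rest_api.delete(f"/v2/error_injection/injection/{err}")
```

Usage would look like `with scylla_inject_error(api, "fail_read"): ...`; the `finally` clause guarantees cleanup so one test's injection cannot leak into the next.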
Tomasz Grabiec
8297ae531d Merge "Automatically retry CQL DDL statements in presence of concurrent changes" from Kamil
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets
a `group0_concurrent_modification` exception.

Catch the exception in CQL DDL statement execution function and retry.

In addition, improve the description of CQL DDL statements
in group 0 history table.

Add a test which checks that group 0 history grows iff a schema change does
not throw `group0_concurrent_modification`. Also check that the retry
mechanism works as expected.

* kbr/ddl-retry-v1:
  test: unit test for group 0 concurrent change protection and CQL DDL retries
  cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes
2022-01-31 14:12:35 +01:00
Tomasz Grabiec
b78bab7286 Merge "raft: fixes and improvements to the library and nemesis test" from Kamil
Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, server configuration,
ticking speed of servers.

This allowed us to catch a serious bug, which is fixed in the first patch.

The patchset also fixes bugs in the test itself and adds quality of life
improvements such as better diagnostics when inconsistency is detected.

* kbr/nemesis-random-v1:
  test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
  test: raft: randomized_nemesis_test: print details when detecting inconsistency
  test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
  test: raft: randomized_nemesis_test: keep server id in impure_state_machine
  test: raft: randomized_nemesis_test: frequent snapshotting configuration
  test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
  test: raft: randomized_nemesis_test: simplify ticker
  test: raft: randomized_nemesis_test: randomize network delay
  test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
  test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
  test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
  test: raft: randomized_nemesis_test: fix obsolete comment
  raft: fsm: print configuration entries appearing in the log
  raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
  raft: server: abort snapshot applications before waiting for rpc abort
  raft: server: logging fix
  raft: fsm: don't advance commit index beyond matched entries
2022-01-31 13:25:27 +01:00
Mikołaj Sielużycki
93d6eb6d51 compacting_reader: Support fast_forward_to position range.
Fast forwarding is delegated to the underlying reader, and it is assumed
to be supported. The only corner case requiring special handling that has
shown up in the tests is producing a partition-start mutation in the
forwarding case if there are no other fragments.

The compacting state keeps track of the uncompacted partition start, but
doesn't emit it by default. If the end of the stream is reached without
producing a mutation fragment, the partition start is not emitted. This is
invalid behaviour in the forwarding case, so I've added a public method to
the compacting state that forces the partition to be marked as non-empty.
I don't like this solution, as it feels like breaking an abstraction, but
I didn't come across a better idea.

Tests: unit(dev, debug, release)

Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>
2022-01-31 13:37:36 +02:00
Nadav Har'El
a25e265373 test/alternator: improve comment on why we need "global_random"
Improve the comment that explains why we needed to use an explicitly
shared random sequence instead of the usual "random". We now understand
that we need this workaround to undo what the pytest-randomly plugin does.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220130155557.1181345-1-nyh@scylladb.com>
2022-01-31 10:07:56 +01:00
Nadav Har'El
59fe6a402c test/cql-pytest: use unique keys instead of random keys
Some of the tests in test/cql-pytest share the same table but use
different keys to ensure they don't collide. Before this patch we used a
random key, which was usually fine, but we recently noticed that the
pytest-randomly plugin may cause different tests to run through the *same*
sequence of random numbers and ruin our intent that different tests use
different keys.

So instead of using a *random* key, let's use a *unique* key. We can
achieve this uniqueness trivially - using a counter variable - because
the uniqueness is only needed inside a single temporary table, which
is different in every run.

Another benefit is that it will now be clearer that the tests are
deterministic and not random - the intent of a random_string() key
was never to randomly walk the entire key space (random_string()
anyway had a pretty narrow idea of what a random string looks like) -
it was just to get a unique key.

Refs #9988 (fixes it for cql-pytest, but not for test/alternator)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-01-31 09:01:23 +02:00
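The counter-based approach described above can be sketched in a few lines (a hypothetical helper; the actual cql-pytest helper may differ in naming and prefix):

```python
import itertools

# A process-wide counter is enough: keys only need to be unique within
# one run's temporary table, not globally random.
_unique_counter = itertools.count(1)

def unique_key(prefix="key"):
    """Return a deterministic, unique key: 'key-1', 'key-2', ..."""
    return f"{prefix}-{next(_unique_counter)}"
```

Unlike a random key, this stays unique even if two tests happen to be seeded with the same random state by a plugin such as pytest-randomly.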
Benny Halevy
1c25934399 test: rest_api: add unit tests for keyspace_offstrategy_compaction api
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-30 20:40:40 +02:00
Tomasz Grabiec
b734615f51 util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under allocating section, which disables
memory reclamation.

The bug manifests as assertion failures, e.g.:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
2022-01-30 19:57:35 +02:00
Piotr Sarna
471205bdcf test/alternator: use a global random generator for all test cases
It was observed (perhaps it depends on the Python implementation)
that an identical seed was used for multiple test cases,
which violated the assumption that generated values are in fact
unique. Using a global generator instead makes sure that it is
only seeded once.

Tests: unit(dev) # alternator tests used to fail for me locally
  before this patch was applied
Message-Id: <315d372b4363f449d04b57f7a7d701dcb9a6160a.1643365856.git.sarna@scylladb.com>
2022-01-30 16:40:20 +02:00
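The failure mode and the fix can be illustrated with a small sketch (hypothetical names; the point is one shared, once-seeded generator instead of per-test generators that may receive identical seeds):

```python
import random
import string

# One module-level generator, seeded once at import time. If each
# test instead constructed its own random.Random(SEED) with the same
# seed, every test would walk the same sequence and supposedly-random
# values would collide across tests.
_global_rng = random.Random()

def random_string(length=10, rng=_global_rng):
    """Generate a string from the shared generator, which keeps
    advancing across calls and across test cases."""
    return "".join(rng.choice(string.ascii_lowercase)
                   for _ in range(length))
```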
Kamil Braun
d10b508380 test: raft: randomized_nemesis_test: regression test for #9981 2022-01-27 17:50:40 +01:00
Kamil Braun
4a52b802ac test: unit test for group 0 concurrent change protection and CQL DDL retries
Check that group 0 history grows iff a schema change does not throw
`group0_concurrent_modification`. Check that the CQL DDL statement retry
mechanism works as expected.
2022-01-27 11:26:15 +01:00
Tomasz Grabiec
ba6c02b38a Merge "Clear old entries from group 0 history when performing schema changes" from Kamil
When performing a change through group 0 (which right now means schema
changes), clear entries from group 0 history table which are older
than one week.

This is done by including an appropriate range tombstone in the group 0
history table mutation.

* kbr/g0-history-gc-v2:
  idl: group0_state_machine: fix license blurb
  test: unit test for clearing old entries in group0 history
  service: migration_manager: clear old entries from group 0 history when announcing
2022-01-26 16:12:40 +01:00
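The cutoff logic can be sketched as follows (a toy Python model; in Scylla the deletion is expressed as a range tombstone included in the same group 0 history mutation, not an in-memory filter):

```python
from datetime import datetime, timedelta

# The GC window described above: entries older than one week are cleared.
HISTORY_GC_DURATION = timedelta(weeks=1)

def prune_history(entries, now):
    """Keep only history entries newer than the GC window.

    `entries` maps a state id to its timestamp; anything at or before
    `now - HISTORY_GC_DURATION` is dropped.
    """
    cutoff = now - HISTORY_GC_DURATION
    return {sid: ts for sid, ts in entries.items() if ts > cutoff}
```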
Kamil Braun
95ac8ead4f test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency 2022-01-26 16:09:41 +01:00
Kamil Braun
e249ea5aef test: raft: randomized_nemesis_test: print details when detecting inconsistency
If the returned result is inconsistent with the constructed model, print
the differences in detail instead of just failing an assertion.
2022-01-26 16:09:41 +01:00
Kamil Braun
1170e47af4 test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in impure_state_machine
Useful for debugging.
2022-01-26 16:09:41 +01:00
Kamil Braun
b8158e0b43 test: raft: randomized_nemesis_test: keep server id in impure_state_machine
Will be used for logging.
2022-01-26 16:09:41 +01:00
Kamil Braun
3c01449472 test: raft: randomized_nemesis_test: frequent snapshotting configuration
With probability 1/2, run the test with a configuration that causes
servers to take snapshots frequently.
2022-01-26 16:09:41 +01:00
Kamil Braun
7546a9ebb5 test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
Previously all servers were ticked at the same moment, every 10
network/timer ticks.

Now we tick each server with probability 1/10 on each network/timer
tick. Thus, on average, every server is ticked once per 10 ticks.
But now we're able to obtain more interesting behaviors.
E.g. we can now observe servers which stall for as long as 10 ticks
and servers which temporarily speed up to tick once per network tick.
2022-01-26 16:09:41 +01:00
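The per-server probabilistic ticking can be sketched like this (illustrative names; the real test drives C++ Raft servers, not Python callables):

```python
import random

def run_ticks(servers, n_ticks, rng, p=0.1):
    """On each network/timer tick, tick each server independently with
    probability p (1/10), instead of ticking all servers in lockstep
    every 10 ticks. Returns per-server tick counts."""
    counts = [0] * len(servers)
    for _ in range(n_ticks):
        for i, tick in enumerate(servers):
            if rng.random() < p:
                tick()            # stand-in for server.tick()
                counts[i] += 1
    return counts
```

On average each server still ticks once per 10 network ticks, but runs of good and bad luck produce the stalls and speed-ups described above.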
Kamil Braun
5d986b2682 test: raft: randomized_nemesis_test: simplify ticker
Instead of taking a set of functions with different periods, take a
single function that is called on every tick. The periodicity can be
implemented easily on the user side.
2022-01-26 16:09:41 +01:00
Kamil Braun
173fb2bf36 test: raft: randomized_nemesis_test: randomize network delay
As a side effect, this causes messages to be delivered in a different
order than they were sent, adding even more chaos.
2022-01-26 16:09:41 +01:00
Kamil Braun
00c18adbb0 test: raft: randomized_nemesis_test: fix use-after-free in environment::crash()
The lambda attached to `_crash_fiber` was a coroutine. The coroutine
would use `this` captured by the lambda after the `co_await`, where the
lambda object (hence its captures) was already destroyed.

No idea why it worked before and sanitizers did not complain in debug
mode.
2022-01-26 16:09:41 +01:00
Kamil Braun
4c68e6a04c test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
Two-way RPC functions such as `send_snapshot` had a guard object which
was captured in a lambda passed to `with_gate`. The guard object, on
destruction, accessed the `rpc` object. Unfortunately, the guard object
could outlive the `rpc` object. That's because the lambda, and hence the
guard object, was destroyed after `with_gate` finished (it lived in the
frame of the caller of `with_gate`, i.e. `send_snapshot` and others),
so it could be destroyed after `rpc` (the gate prevents `rpc` from being
destroyed).

Make sure that the guard object is destroyed before `with_gate` finishes
by creating it inside the lambda body instead of capturing it in the lambda.
2022-01-26 16:09:41 +01:00
Kamil Braun
871f0d00ce test: raft: randomized_nemesis_test: rpc: don't propagate gate_closed_exception outside
The `raft::rpc` interface functions are called by `raft::server_impl`
and the exceptions may be propagated outside the server, e.g. through
the `add_entry` API.

Translate the internal `gate_closed_exception` to an external
`raft::stopped_error`.
2022-01-26 16:09:41 +01:00
Kamil Braun
9da4ffc1c7 test: raft: randomized_nemesis_test: fix obsolete comment 2022-01-26 16:09:41 +01:00
Kamil Braun
44a1a8a8b0 raft: operator<<(ostream&, ...) implementation for server_address and configuration
Useful for debugging.

Had to make the `configuration` constructor explicit. Otherwise the
`operator<<` implementation for `configuration` would implicitly convert
the `server_address` to `configuration` when trying to output it, causing
infinite recursion.

Removed implicit uses of the constructor.
2022-01-26 16:09:41 +01:00
Gleb Natapov
579dcf187a raft: allow an option to persist commit index
Raft does not need to persist the commit index since a restarted node will
either learn it from an append message from a leader or (if the entire
cluster is restarted and hence there is no leader) the new leader will
figure it out after contacting a quorum. But some users may want to be able
to bring their local state machine to a state as up-to-date as it was
before restart as soon as possible, without any external communication.

For them this patch introduces a new persistence API that allows saving
and restoring the last seen committed index.

Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
2022-01-26 14:06:39 +01:00
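The shape of such an optional persistence hook can be sketched as follows (method names and the opt-in flag are illustrative assumptions, not Scylla's actual C++ API):

```python
class Persistence:
    """Toy persistence layer that can optionally store the last
    observed commit index."""

    def __init__(self, persist_commit_index=False):
        self._persist = persist_commit_index
        self._commit_idx = 0

    def store_commit_idx(self, idx):
        # A no-op unless the user opted in: persisting is purely a
        # startup optimization, never needed for correctness.
        if self._persist:
            self._commit_idx = idx

    def load_commit_idx(self):
        # Falling back to 0 is always safe: Raft will re-learn the
        # commit index from the leader or a quorum.
        return self._commit_idx
```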
Kamil Braun
b863a63b08 test: unit test for clearing old entries in group0 history
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
2022-01-25 13:13:35 +01:00
Botond Dénes
eb42213db4 compact_mutation: close active range tombstone on page end
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses for
converting the v2 input into its v1 output enforces this with a check on
`consume_end_of_partition()`. Normally the producer of the stream the
compactor is consuming takes care of closing the active tombstone before
the stream ends. The compactor however (or its consumer) can decide to
end the consume early, e.g. to cut the current page. When this happens
the compactor must take care of closing the tombstone itself.
Furthermore it has to keep this tombstone around to re-open it on the
next page.
This patch implements this mechanism which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.

Fixes: #9907

Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
2022-01-25 09:52:30 +02:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
2022-01-24 15:20:35 +01:00
Kamil Braun
283ac7fefe treewide: pass mutation timestamp from call sites into migration_manager::prepare_* functions
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
2022-01-24 15:12:50 +01:00
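The monotonicity requirement on caller-generated timestamps can be sketched like this (a hypothetical helper; the commit itself only changes the `prepare_*` APIs to accept caller-supplied timestamps):

```python
def make_timestamp_generator(clock):
    """Return a source of strictly increasing timestamps.

    Mutations applied through Raft must carry monotonic timestamps
    even if the wall clock stalls or jumps backwards, so we clamp
    each value to at least last + 1.
    """
    last = 0
    def next_timestamp():
        nonlocal last
        last = max(clock(), last + 1)
        return last
    return next_timestamp
```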
Benny Halevy
188cedd533 test: lister_test: test_lister_abort: generate at least one entry
Without this fix, generate_random_content could generate 0 entries
and the expected exception would never be injected.

With it, we generate at least 1 entry and the test passes
with the offending random-seed:

```
random-seed=1898914316
Generated 1 dir entries
Aborting lister after 1 dir entries
test/boost/lister_test.cc(96): info: check 'exception "expected_exception" raised as expected' has passed
```

Fixes #9953

Test: lister_test.test_lister_abort --random-seed=1898914316(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123122921.14017-1-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Benny Halevy
f439edca35 test: sstable_compaction_test: twcs_reshape_with_disjoint_set_test: take min_threshold into consideration
Take into account that get_reshaping_job selects only
buckets that have more than min_threshold sstables in them.

Therefore, with 256 disjoint sstables in different windows,
allow the first or last windows to not be selected by get_reshaping_job,
which will return at least disjoint_sstable_count - min_threshold + 1
sstables, and not more than disjoint_sstable_count.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Piotr Jastrzebski
09d4438a0d cdc: Handle compact storage correctly in preimage
Base tables that use compact storage may have a special artificial
column that has an empty type.

c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.

This patch makes sure that the generation of the preimage ignores such
an empty column as well.

Fixes #9876
Closes #9910

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2022-01-20 13:23:38 +01:00
Nadav Har'El
7cb6250c40 Merge 'snapshot_ctl: true_snapshots_size: fix space accounting' from Benny Halevy
This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size

https://github.com/scylladb/scylla/issues/9897
https://github.com/scylladb/scylla/issues/9898

And adds a couple of unit tests to test the snapshot_ctl functionality.

Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug)

Closes #9899

* github.com:scylladb/scylla:
  table: get_snapshot_details: count allocated_size
  snapshot_ctl: cleanup true_snapshots_size
  snapshot_ctl: true_snapshots_size: do not map_reduce across all shards
2022-01-19 11:57:15 +02:00
Benny Halevy
5db3cbe1e4 snapshot_ctl: true_snapshots_size: do not map_reduce across all shards
snapshot_ctl uses map_reduce over all database shards,
each counting the size of the snapshots directory,
which is shared, not per-shard.

So the total live size it returns is multiplied by the number of shards.

Add a unit test to test that.

Fixes #9897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-19 07:50:53 +02:00
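The accounting bug can be modeled in a few lines (a toy sketch; the real code uses Seastar's `map_reduce` over database shards, not Python):

```python
def buggy_total_snapshot_size(shards, dir_size):
    # map_reduce over all shards: each shard stats the *shared*
    # snapshots directory, so the sum is dir_size times shard count.
    return sum(dir_size for _ in shards)

def fixed_total_snapshot_size(shards, dir_size):
    # The snapshots directory is shared, not per-shard: stat it once
    # instead of reducing over every shard.
    return dir_size
```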
Nadav Har'El
1ce73c2ab3 Merge 'utils::is_timeout_exception: Ensure we handle nested exception types' from Calle Wilund
Fixes #9922

The storage proxy uses is_timeout_exception to select different code paths.
a6202ae079 broke this (through bit rot and intermixing) by wrapping the
exception for informational purposes.

This adds a check for nested exception types in exception handling, as
well as a test for the routine itself.

Closes #9932

* github.com:scylladb/scylla:
  database/storage_proxy: Use "is_timeout_exception" instead of catch match
  utils::is_timeout_exception: Ensure we handle nested exception types
2022-01-18 23:49:41 +02:00
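The idea of unwrapping nested exceptions translates directly to Python, where wrapping is modeled with `__cause__`/`__context__` (the Scylla fix walks `std::nested_exception` in C++ the same way; this is an illustrative analogue, not the actual code):

```python
def is_timeout_exception(exc):
    """Return True if exc is a timeout, or transitively wraps one."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))      # guard against exception cycles
        if isinstance(exc, TimeoutError):
            return True
        # Follow the wrapping chain: explicit "raise ... from e"
        # sets __cause__, implicit chaining sets __context__.
        exc = exc.__cause__ or exc.__context__
    return False
```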