Recently, the expression-rewrite effort changed the way that GROUP BY is
implemented. Usually GROUP BY involves an aggregation function (e.g., if
you want a separate SUM per partition). But there's also a query like
SELECT p, c1, c2, v FROM tbl GROUP BY p
This query is supposed to return one row - the *first* row in clustering
order - per group (in this case, partition). The expression rewrite
re-implemented this feature by introducing a new internal aggregator,
first(), which returns the first aggregated value. The above query is
rewritten into:
SELECT first(p), first(c1), first(c2), first(v) FROM tbl GROUP BY p
This case works correctly, and we even have a regression test for it.
But unfortunately the rewrite broke the following query:
SELECT * FROM tbl GROUP BY p
Note the "*" instead of the explicit list of columns.
In our implementation, a selection of "*" is looks like an empty
selection, and it didn't get the "first()" treatment and it remained
a "SELECT *" - and wrongly returned all rows instead of just the first
one in each partition. This was a regression - it worked correctly in
Scylla 5.2 (and also in Cassandra) - see the next patch for a
regression test.
In this patch we fix this regression. When there is a GROUP BY, the "*"
is rewritten to the appropriate list of all visible columns and then
gets the first() treatment, so it will return only the first row as
expected. The next patch will be a test that confirms the bug and its
fix.
Fixes#16531
Signed-off-by: Nadav Har'El <nyh@scylladb.com>