For every ANN search, the vector store returns, in addition to the keys of the matching items, two additional vectors: "distances" and "similarity_scores". The "distances" are raw distance metrics, where lower scores are better matches, while "similarity_scores" are modified so that higher scores are better matches.

Traditionally, search scores in systems like Cassandra and OpenSearch use the "similarity scores" approach (higher is better, results are returned in decreasing similarity order), so this is the more interesting vector of the two. But before this patch, our vector_store_client::ann() inspected only "distances", and then didn't return even that to the caller :-)

So in this patch, we:

1. Ignore "distances" and instead look at "similarity_scores", which is what users really want based on their experience with other vector and non-vector search engines.

2. Return the similarity score of each match together with the match. We already have this score (the vector store returns it) and we can add it to the existing primary_key structure of each result. So each result is a "struct primary_key" which has the fields partition, clustering, and after this patch, similarity.

Existing callers in CQL and Alternator vector search will ignore this "similarity" field in each result, and will not notice it was added. But in the next patch, we'll allow Alternator's vector search to return this similarity in each result.

The existing unit tests for vector_store_client.cc mocked vector-store responses with "distances" and without "similarity_scores", so they no longer represent what we actually expect the vector store to do. So this patch also modifies these tests to mock and to test "similarity_scores", not "distances". The more interesting tests, in the next patch, use the real vector store and check that we really do get a "similarity_scores" response from it.
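To illustrate item 2, here is a minimal sketch of attaching each similarity score to its match. It is not the code from this patch: the string-typed partition and clustering fields and the attach_similarity() helper are assumptions made for illustration, and the real struct primary_key holds ScyllaDB key types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the result struct described above. The real
// partition and clustering fields are ScyllaDB key types; plain strings
// are an assumption used only to keep this sketch self-contained.
struct primary_key {
    std::string partition;
    std::string clustering;
    float similarity;  // the field added by this patch
};

// Hypothetical helper: pair the i-th key of a response with the i-th
// entry of "similarity_scores", preserving the order of the response.
std::vector<primary_key> attach_similarity(
        const std::vector<std::pair<std::string, std::string>>& keys,
        const std::vector<float>& similarity_scores) {
    // The two vectors are parallel, so they must have the same length.
    assert(keys.size() == similarity_scores.size());
    std::vector<primary_key> results;
    results.reserve(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        results.push_back({keys[i].first, keys[i].second, similarity_scores[i]});
    }
    return results;
}
```

A caller that does not care about the score can keep reading only partition and clustering, which is why existing callers are unaffected.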
This patch also handles a small corner case for DOT_PRODUCT, which is the only unbounded similarity function. If the similarity overflows the 32-bit float, the vector store returns a JSON "null" instead of a JSON number (since JSON doesn't support infinite numbers). Our existing vector-store client code errored out when it saw this "null", which is wrong: the request should be allowed to proceed. So in this patch, when we see a JSON "null" for similarity, we return +Inf. This is usually correct because the top results really have +Inf, not -Inf. But if we ask for all items, we can reach those with similarity -Inf and incorrectly assign +Inf to them (we have a test for this case in the next patch). This problem won't happen when Limit is low, and in any case it's better than aborting the request after it had already succeeded.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
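The "null" handling described above can be sketched as follows. This is an illustration, not the patch's code: similarity_from_json() is a hypothetical name, and std::optional<float> is an assumed stand-in for a parsed JSON value that is either a number or null.

```cpp
#include <limits>
#include <optional>

// Hypothetical helper. An empty optional models a JSON "null" in
// "similarity_scores"; a non-empty one models a JSON number.
float similarity_from_json(std::optional<float> json_value) {
    // A "null" means the score overflowed the 32-bit float. Map it to
    // +Inf instead of failing the request. As noted above, this is wrong
    // for scores that overflowed toward -Inf.
    return json_value.value_or(std::numeric_limits<float>::infinity());
}
```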