scylladb/alternator/serialization.hh
Nadav Har'El 85c6cafb1d alternator: add optimized vector type for vector search
Today in Alternator vector search, vectors are presented to the API as
lists of numbers. I.e., in JSON a vector is sent in requests and responses
as:

     {"L": [{"N": "3.14159"}, {"N": "6.7"}]}

This format is verbose and inefficient for long vectors. Even worse,
because the "N" number format has precision guarantees in DynamoDB,
we cannot optimize the storage of such vectors by, for example, storing
the numbers as 32-bit floats. We actually store these vectors as JSON,
exactly as shown above.

So in this patch we introduce a new DynamoDB type, "FLOAT32VECTOR", for
vectors. The above vector will look like this in JSON:

     {"FLOAT32VECTOR": [3.14159, 6.7]}

Note that each number is an unquoted JSON number, not a JSON string.
Importantly, the definition of the "FLOAT32VECTOR" type specifies that
components of the vector only have 32-bit precision. This means that
Scylla may store these vectors internally as lists of 32-bit floats -
not as JSON. And indeed, this patch includes this optimization:
Top-level vector attributes are now encoded in an optimized way,
as a byte 5 (alternator_type::FLOAT32VECTOR) followed by the elements
of the vector, just 4 bytes each (the 4-byte big-endian IEEE 754
representation of each floating-point component).
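
The layout described above can be sketched in a few lines of C++. This is
only an illustration of the byte format, not the code in this patch: the
helper name encode_float32vector and the use of std::string as a byte
buffer are assumptions made for the example, and it assumes the platform's
float is a 32-bit IEEE 754 value.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit IEEE 754 float");

// Hypothetical helper (not Alternator's serialize_item): encodes a vector
// as one type byte followed by 4 big-endian bytes per component.
std::string encode_float32vector(const std::vector<float>& v) {
    // 5 is the position of FLOAT32VECTOR in enum class alternator_type.
    constexpr unsigned char float32vector_tag = 5;
    std::string out;
    out.reserve(1 + 4 * v.size());
    out.push_back(static_cast<char>(float32vector_tag));
    for (float f : v) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);  // reinterpret the float's bits
        // Emit the most significant byte first (big-endian).
        for (int shift = 24; shift >= 0; shift -= 8) {
            out.push_back(static_cast<char>((bits >> shift) & 0xff));
        }
    }
    return out;
}
```

With this layout a vector of n components takes exactly 1 + 4n bytes; for
example, the one-element vector [1.0] becomes the five bytes
05 3f 80 00 00.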

This patch also includes documentation, and extensive tests that the
new "FLOAT32VECTOR" type works (which also serve as an example of how to
use it in the boto3 SDK), that it is indeed encoded internally as 32-bit
floats and not wasteful JSON strings, and that vector search on such items
works. The last thing requires cooperation from the vector store, of
course - it needs to be able to understand the new optimized encoding
of vector attributes in addition to the old unoptimized one.
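
As a rough sketch of what a reader of the stored value has to do, the
following C++ function recognizes the optimized encoding by its leading
type byte and returns nothing for any other value, so that a caller could
fall back to the old "list of numbers" path. The function name
try_decode_float32vector and the choice to reject payloads whose length is
not a multiple of 4 are assumptions of this example, not a description of
the vector store's or Alternator's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit IEEE 754 float");

// Hypothetical helper: returns the components if `bytes` uses the optimized
// encoding (type byte 5, then 4 big-endian bytes per component), or an
// unset optional if it is some other encoding or is malformed.
std::optional<std::vector<float>> try_decode_float32vector(std::string_view bytes) {
    constexpr unsigned char float32vector_tag = 5;  // alternator_type::FLOAT32VECTOR
    if (bytes.empty() || static_cast<unsigned char>(bytes[0]) != float32vector_tag) {
        return std::nullopt;  // not the optimized type; caller may try the old format
    }
    bytes.remove_prefix(1);
    if (bytes.size() % 4 != 0) {
        return std::nullopt;  // truncated payload (assumed to be an error here)
    }
    std::vector<float> out;
    out.reserve(bytes.size() / 4);
    for (std::size_t i = 0; i < bytes.size(); i += 4) {
        std::uint32_t bits = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            // Most significant byte comes first in the stored form.
            bits = (bits << 8) | static_cast<unsigned char>(bytes[i + j]);
        }
        float f;
        std::memcpy(&f, &bits, sizeof f);
        out.push_back(f);
    }
    return out;
}
```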

Note that the old unoptimized ("list of numbers") vectors are still
supported. Although not recommended for general use, some users might
still want to use the unoptimized type if they have pre-existing data
created on DynamoDB or Alternator without vector search in mind, and
the vectors already exist as lists of numbers.

Although this is less important, the new vector type "FLOAT32VECTOR"
is also allowed in a Query's QueryVector.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-13 11:57:45 +03:00

/*
 * Copyright 2019-present ScyllaDB
 */
/*
 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
 */
#pragma once
#include <string>
#include <string_view>
#include <optional>
#include "types/types.hh"
#include "schema/schema_fwd.hh"
#include "keys/keys.hh"
#include "utils/rjson.hh"
#include "utils/big_decimal.hh"
class position_in_partition;
namespace alternator {
enum class alternator_type : int8_t {
    // Do not reorder or delete entries in this enum, because these values are
    // written to disk as part of the item encoding.
    S, B, BOOL, N, NOT_SUPPORTED_YET, FLOAT32VECTOR
};
// FLOAT32VECTOR is an Alternator-only extension to DynamoDB's JSON type
// system. It takes a JSON array of numbers (not quoted strings as the N type
// does) and states that for these numbers, only the precision and range of
// 32-bit floats is guaranteed. This allows Alternator to store these vectors
// much more efficiently than if they were a JSON array of N's - we store
// them as big-endian 32-bit IEEE 754 floats, exactly 4 bytes each.
inline constexpr std::string_view float32vector_type_name = "FLOAT32VECTOR";
struct type_info {
    alternator_type atype;
    data_type dtype;
};
struct type_representation {
    std::string ident;
    data_type dtype;
};
inline constexpr std::string_view scylla_paging_region(":scylla:paging:region");
inline constexpr std::string_view scylla_paging_weight(":scylla:paging:weight");
type_info type_info_from_string(std::string_view type);
type_representation represent_type(alternator_type atype);
bytes serialize_item(const rjson::value& item);
rjson::value deserialize_item(bytes_view bv);
std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type);
std::string type_to_string(data_type type);
bytes get_key_column_value(const rjson::value& item, const column_definition& column);
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column);
rjson::value json_key_column_value(bytes_view cell, const column_definition& column);
partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": "..."}), returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);
// try_unwrap_number is like unwrap_number, but returns an unset optional
// when the given v does not encode a number.
std::optional<big_decimal> try_unwrap_number(const rjson::value& v);
// unwrap_bytes decodes a byte value. On decoding failure it raises
// api_error::serialization if from_query is true, and returns an unset
// optional if from_query is false. It is therefore safe to dereference the
// returned optional when called with from_query set to true.
std::optional<bytes> unwrap_bytes(const rjson::value& value, bool from_query);
// Check if a given JSON object encodes a set (i.e., it is a {"SS": [...]},
// {"NS": [...]} or {"BS": [...]}) and return the set's type and a pointer to
// that set. If the object does not encode a set, the returned value is
// {"", nullptr}.
const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v);
// Check if a given JSON object encodes a list (i.e., it is a {"L": [...]})
// and return a pointer to that list.
const rjson::value* unwrap_list(const rjson::value& v);
// Take two JSON-encoded numeric values ({"N": "thenumber"}) and return the
// sum, again as a JSON-encoded number.
rjson::value number_add(const rjson::value& v1, const rjson::value& v2);
rjson::value number_subtract(const rjson::value& v1, const rjson::value& v2);
// Take two JSON-encoded set values (e.g. {"SS": [...the actual set]}) and
// return the sum of both sets, again as a set value.
rjson::value set_sum(const rjson::value& v1, const rjson::value& v2);
// Take two JSON-encoded set values (e.g. {"SS": [...the actual set]}) and
// return the difference v1 - v2, again as a set value.
// DynamoDB does not allow empty sets, so if resulting set is empty, return
// an unset optional instead.
std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value& v2);
// Take two JSON-encoded list values (remember that a list value is
// {"L": [...the actual list]}) and return the concatenation, again as
// a list value.
// Returns a null value if one of the arguments is not actually a list.
rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2);
namespace internal {
struct magnitude_and_precision {
    int magnitude;
    int precision;
};
magnitude_and_precision get_magnitude_and_precision(std::string_view);
}
}