Today in Alternator vector search, vectors are presented to the API as
lists of numbers. I.e., in JSON a vector is sent in requests and responses
as:
{"L": [{"N": "3.14159"}, {"N": "6.7"}]}
This format is verbose and inefficient for long vectors. Even worse,
because the "N" number format has precision guarantees in DynamoDB,
we cannot optimize the storage of such vectors by, for example, storing
the numbers as 32-bit floats. We actually store these vectors as JSON,
exactly as shown above.
So in this patch we introduce a new DynamoDB type, "FLOAT32VECTOR", for
vectors. The above vector will look like this in JSON:
{"FLOAT32VECTOR": [3.14159, 6.7]}
Note that each number is an unquoted JSON number, not a JSON string.
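
As a rough illustration of the difference between the two JSON shapes, the
following Python sketch converts the old list-of-numbers form into the new
form and compares their sizes. The helper name to_float32vector_json is
hypothetical: it is not part of Alternator or boto3, and only mirrors the
two JSON shapes shown above.

```python
import json

def to_float32vector_json(old_value):
    """Convert {"L": [{"N": "..."}, ...]} into {"FLOAT32VECTOR": [...]}.

    Hypothetical helper, for illustration only.
    """
    # Each component arrives as a JSON string under "N"; parse it to a float
    # so that it is emitted as an unquoted JSON number.
    return {"FLOAT32VECTOR": [float(item["N"]) for item in old_value["L"]]}

old = {"L": [{"N": "3.14159"}, {"N": "6.7"}]}
new = to_float32vector_json(old)

old_text = json.dumps(old)
new_text = json.dumps(new)
print(new_text)
print(len(old_text), "bytes as a list of numbers,",
      len(new_text), "bytes as FLOAT32VECTOR")
```

The saving per component grows with the vector length, since every
component in the old form carries its own {"N": "..."} wrapper.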
Importantly, the definition of the "FLOAT32VECTOR" type specifies that
components of the vector only have 32-bit precision. This means that
Scylla may store these vectors internally as lists of 32-bit floats,
not as JSON. Indeed, this patch includes this optimization:
Top-level vector attributes are now encoded in an optimized way,
as a byte 5 (alternator_type::FLOAT32VECTOR) followed by the elements
of the vector, just 4 bytes each (the 4-byte big-endian IEEE 754
representation of each floating-point component).
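
The layout described above can be sketched in Python as follows. This is a
minimal sketch based only on this description (a tag byte of 5 followed by
4-byte big-endian IEEE 754 floats), not Scylla's actual serialization code.
The function names are hypothetical.

```python
import struct

# Value of alternator_type::FLOAT32VECTOR, per the description above.
FLOAT32VECTOR_TAG = 5

def encode_float32vector(components):
    """Tag byte followed by each component as a 4-byte big-endian float."""
    packed = struct.pack(f">{len(components)}f", *components)
    return bytes([FLOAT32VECTOR_TAG]) + packed

def decode_float32vector(blob):
    """Inverse of encode_float32vector; rejects blobs with another tag."""
    if not blob or blob[0] != FLOAT32VECTOR_TAG:
        raise ValueError("not a FLOAT32VECTOR encoding")
    payload = blob[1:]
    if len(payload) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4")
    return list(struct.unpack(f">{len(payload) // 4}f", payload))

encoded = encode_float32vector([3.14159, 6.7])
decoded = decode_float32vector(encoded)
print(len(encoded))  # 1 tag byte + 2 components * 4 bytes = 9
print(decoded)       # components rounded to 32-bit precision
```

Note that the decoded values are only approximately equal to the inputs:
a value such as 6.7 has no exact 32-bit representation, which is exactly
the precision the type definition allows Scylla to drop.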
This patch also includes documentation and extensive tests. The tests
check that the new "FLOAT32VECTOR" type works (they also serve as an
example of how to use it with the boto3 SDK), that it is indeed encoded
internally as 32-bit floats and not as wasteful JSON strings, and that
vector search on such items works. The last point requires cooperation
from the vector store, of course: it needs to understand the new
optimized encoding of vector attributes in addition to the old
unoptimized one.
Note that the old unoptimized ("list of numbers") vectors are still
supported. Although not recommended for general use, some users might
still want to use the unoptimized type if they have pre-existing data
created on DynamoDB or Alternator without vector search in mind, and
the vectors already exist as lists of numbers.
Although this is less important, the new vector type "FLOAT32VECTOR"
is also allowed in a Query's QueryVector.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
ScyllaDB Documentation
This repository contains the source files for ScyllaDB documentation.
- The dev folder contains developer-oriented documentation related to the ScyllaDB code base. It is not published and is only available via GitHub.
- All other folders and files contain user-oriented documentation related to ScyllaDB and are sources for docs.scylladb.com/manual.
To report a documentation bug or suggest an improvement, open an issue in GitHub issues for this project.
To contribute to the documentation, open a GitHub pull request.
Key Guidelines for Contributors
- The user documentation is written in reStructuredText (RST) - a plaintext markup language similar to Markdown. If you're not familiar with RST, see ScyllaDB RST Examples.
- The developer documentation is written in Markdown. See Basic Markdown Syntax for reference.
- Follow the ScyllaDB Style Guide.
To prevent the build from failing:
- If you add a new file, ensure it's added to an appropriate toctree, for example:

    .. toctree::
       :maxdepth: 2
       :hidden:

       Page X </folder1/article1>
       Page Y </folder1/article2>
       Your New Page </folder1/your-new-article>

- Make sure the link syntax is correct. See the guidelines on creating links.
- Make sure the section headings are correct. See the guidelines on creating headings. Note that the markup must be at least as long as the text in the heading. For example:

    ----------------------
    Prerequisites
    ----------------------
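
The heading length rule above can be checked mechanically before running a build. The following is a minimal Python sketch, not part of the ScyllaDB docs tooling; the function name heading_markup_ok is hypothetical.

```python
def heading_markup_ok(title, markup):
    """Return True if markup is long enough to mark title as an RST heading."""
    stripped = markup.rstrip()
    # The markup must be a non-empty run of one repeated punctuation character.
    if not stripped or len(set(stripped)) != 1 or stripped[0].isalnum():
        return False
    # The markup must be at least as long as the heading text.
    return len(stripped) >= len(title.rstrip())

print(heading_markup_ok("Prerequisites", "----------------------"))  # True
print(heading_markup_ok("Prerequisites", "-----"))                   # False
```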
Building User Documentation
Prerequisites
- Python
- poetry
- make
See the ScyllaDB Sphinx Theme prerequisites to check which versions of the above are currently required.
Mac OS X
You must have a working Homebrew in order to install the needed tools.
You also need the standard utility make.
Check if you have these two items with the following commands:
brew help
make -h
Linux Distributions
Building the user docs should work out of the box on most Linux distributions.
Windows
Use "Bash on Ubuntu on Windows" for the same tools and capabilities as on Linux distributions.
Building the Docs
- Run make preview in the docs/ directory to build the documentation.
- Preview the built documentation locally at http://127.0.0.1:5500/.
Cleanup
You can clean up all the build products and auto-installed Python stuff with:
make pristine
Information for Contributors
If you are interested in contributing to Scylla docs, please read the Scylla open source page at http://www.scylladb.com/opensource/ and complete a Scylla contributor agreement if needed. We can only accept documentation pull requests if we have a contributor agreement on file for you.
Third-party Documentation
- Do any copying as a separate commit. Always commit an unmodified version first and then do any editing in a separate commit.
- We already have a copy of the Apache license in our tree, so you do not need to commit a copy of the license.
- Include the copyright header from the source file in the edited version. If you are copying an Apache Cassandra document with no copyright header, use:
This document includes material from Apache Cassandra.
Apache Cassandra is Copyright 2009-2014 The Apache Software Foundation.