mirror of
https://github.com/scylladb/scylladb.git
synced 2026-05-22 15:52:13 +00:00
Users ought to have possibility to create the local index for Vector Search based only on a part of the partition key. This commits provides this by removing requirements of 'full partition key only' for custom local index. The commit updates docs to explain that local vector index can use only a part of the partition key. The commit implements cqlpy test to check fixed functionality. Fixes: SCYLLADB-953 Needs to be backported to 2026.1 as it is a fix for local vector indexes. Closes scylladb/scylladb#28931
348 lines
20 KiB
ReStructuredText
348 lines
20 KiB
ReStructuredText
|
|
.. Licensed to the Apache Software Foundation (ASF) under one
|
|
.. or more contributor license agreements. See the NOTICE file
|
|
.. distributed with this work for additional information
|
|
.. regarding copyright ownership. The ASF licenses this file
|
|
.. to you under the Apache License, Version 2.0 (the
|
|
.. "License"); you may not use this file except in compliance
|
|
.. with the License. You may obtain a copy of the License at
|
|
..
|
|
.. http://www.apache.org/licenses/LICENSE-2.0
|
|
..
|
|
.. Unless required by applicable law or agreed to in writing, software
|
|
.. distributed under the License is distributed on an "AS IS" BASIS,
|
|
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
.. See the License for the specific language governing permissions and
|
|
.. limitations under the License.
|
|
|
|
.. highlight:: cql
|
|
|
|
.. _secondary-indexes:
|
|
|
|
Global Secondary Indexes
|
|
------------------------
|
|
|
|
CQL supports creating secondary indexes on tables, allowing queries on the table to use those indexes. A secondary index
|
|
is identified by a name defined by:
|
|
|
|
.. code-block::
|
|
|
|
index_name: re('[a-zA-Z_0-9]+')
|
|
|
|
|
|
|
|
.. _create-index-statement:
|
|
|
|
CREATE INDEX
|
|
^^^^^^^^^^^^
|
|
|
|
Creating a secondary index on a table uses the ``CREATE INDEX`` statement:
|
|
|
|
.. code-block::
|
|
|
|
create_index_statement: CREATE [ CUSTOM ] INDEX [ IF NOT EXISTS ] [ `index_name` ]
|
|
: ON `table_name` '(' `index_identifier` ')'
|
|
: [ USING `string` [ WITH `index_properties` ] ]
|
|
index_identifier: `column_name`
|
|
:| ( FULL ) '(' `column_name` ')'
|
|
index_properties: index_property (AND index_property)*
|
|
index_property: OPTIONS = `map_literal`
|
|
:| view_property
|
|
|
|
where `view_property` is any :ref:`property <mv-options>` that can be used when creating
|
|
a :doc:`materialized view </features/materialized-views>`. The only exception is `CLUSTERING ORDER BY`,
|
|
which is not supported by secondary indexes.
|
|
|
|
If the statement is provided with a materialized view property, it will not be applied to the index itself.
|
|
Instead, it will be applied to the underlying materialized view of it.
|
|
|
|
For instance::
|
|
|
|
CREATE INDEX userIndex ON NerdMovies (user);
|
|
CREATE INDEX ON Mutants (abilityId);
|
|
|
|
-- Create a secondary index called `catsIndex` on the table `Animals`.
|
|
-- The indexed column is `cats`. Both properties, `comment` and
|
|
-- `synchronous_updates`, are view properties, so the underlying materialized
|
|
-- view will be configured with: `comment = 'everyone likes cats'` and
|
|
-- `synchronous_updates = true`.
|
|
CREATE INDEX catsIndex ON Animals (cats) WITH comment = 'everyone likes cats' AND synchronous_updates = true;
|
|
|
|
-- Create a secondary index called `dogsIndex` on the same table, `Animals`.
|
|
-- This time, the indexed column is `dogs`. The property `gc_grace_seconds` is
|
|
-- a view property, so the underlying materialized view will be configured with
|
|
-- `gc_grace_seconds = 13`.
|
|
CREATE INDEX dogsIndex ON Animals (dogs) WITH gc_grace_seconds = 13;
|
|
|
|
-- The view property `CLUSTERING ORDER BY` is not supported by secondary indexes,
|
|
-- so this statement will be rejected by Scylla.
|
|
CREATE INDEX bearsIndex ON Animals (bears) WITH CLUSTERING ORDER BY (bears ASC);
|
|
|
|
View properties of a secondary index have the same limitations as those imposed by materialized views.
|
|
For instance, a materialized view cannot be created specifying ``gc_grace_seconds = 0``, so creating
|
|
a secondary index with the same property will not be possible either.
|
|
|
|
Example::
|
|
|
|
-- This statement will be rejected by Scylla because creating
|
|
-- a materialized view with `gc_grace_seconds = 0` is not possible.
|
|
CREATE INDEX names ON clients (name) WITH gc_grace_seconds = 0;
|
|
|
|
-- This statement will also be rejected by Scylla.
|
|
-- It's not possible to use `COMPACT STORAGE` with a materialized view.
|
|
CREATE INDEX names ON clients (name) WITH COMPACT STORAGE;
|
|
|
|
The ``CREATE INDEX`` statement is used to create a new (automatic) secondary index for a given (existing) column in a
|
|
given table. A name for the index itself can be specified before the ``ON`` keyword, if desired. If data already exists
|
|
for the column, it will be indexed asynchronously. After the index is created, new data for the column is indexed
|
|
automatically at insertion time.
|
|
|
|
Local Secondary Index
|
|
^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
:doc:`Local Secondary Indexes </features/local-secondary-indexes>` is an enhancement of :doc:`Global Secondary Indexes </features/secondary-indexes>`, which allows ScyllaDB to optimize the use case in which the partition key of the base table is also the partition key of the index. Local Secondary Index syntax is the same as above, with extra parentheses on the partition key.
|
|
|
|
.. code-block::
|
|
|
|
index_identifier: `column_name`
|
|
:| ( PK ) | KEYS | VALUES | FULL ) '(' `column_name` ')'
|
|
|
|
Example:
|
|
|
|
.. code-block:: cql
|
|
|
|
CREATE TABLE menus (location text, name text, price float, dish_type text, PRIMARY KEY(location, name));
|
|
CREATE INDEX ON menus((location),dish_type);
|
|
|
|
More on :doc:`Local Secondary Indexes </features/local-secondary-indexes>`
|
|
|
|
.. Attempting to create an already existing index will return an error unless the ``IF NOT EXISTS`` option is used. If it
|
|
.. is used, the statement will be a no-op if the index already exists.
|
|
|
|
.. Indexes on Map Keys (supported in ScyllaDB 2.2)
|
|
.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. When creating an index on a :ref:`maps <maps>`, you may index either the keys or the values. If the column identifier is
|
|
.. placed within the ``keys()`` function, the index will be on the map keys, allowing you to use ``CONTAINS KEY`` in
|
|
.. ``WHERE`` clauses. Otherwise, the index will be on the map values.
|
|
|
|
|
|
.. _create-vector-index-statement:
|
|
|
|
Vector Index :label-note:`ScyllaDB Cloud`
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
.. note::
|
|
|
|
Vector indexes are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
|
|
Vector indexes do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
|
|
about Vector Search is available in the
|
|
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
|
|
|
|
ScyllaDB supports creating vector indexes on tables, allowing queries on the table to use those indexes for efficient
|
|
similarity search on vector data. Vector indexes can be a global index for indexing vectors per table or a local
|
|
index for indexing vectors per partition.
|
|
|
|
The vector index is the only custom type index supported in ScyllaDB. It is created using
|
|
the ``CUSTOM`` keyword and specifying the index type as ``vector_index``. It is also possible to
|
|
add additional columns to the index for filtering the search results. The partition column
|
|
specified in the global vector index definition must be the vector column, and any subsequent
|
|
columns are treated as filtering columns. The local vector index requires that the partition key
|
|
of the index is a subset of the table's partition key columns and the vector column is the first
|
|
one from the following columns.
|
|
|
|
ScyllaDB allows creating multiple **named** vector indexes on the same vector column.
|
|
This can be used to create a replacement index before dropping an older one.
|
|
Unnamed duplicate vector index definitions are still rejected, and index names
|
|
must remain unique within a keyspace.
|
|
|
|
Example of a simple index:
|
|
|
|
.. code-block:: cql
|
|
|
|
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding)
|
|
USING 'vector_index'
|
|
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
|
|
|
|
The vector column (``embedding``) is indexed to enable similarity search using
|
|
a global vector index. Additional filtering can be performed on the primary key
|
|
columns of the base table.
|
|
|
|
Example of a global vector index with additional filtering:
|
|
|
|
.. code-block:: cql
|
|
|
|
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding, category, info)
|
|
USING 'vector_index'
|
|
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
|
|
|
|
The vector column (``embedding``) is indexed to enable similarity search using
|
|
a global index. Additional columns are added for filtering the search results.
|
|
The filtering is possible on ``category``, ``info`` and all primary key columns
|
|
of the base table.
|
|
|
|
Example of a local vector index:
|
|
|
|
.. code-block:: cql
|
|
|
|
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings ((id, created_at), embedding, category, info)
|
|
USING 'vector_index'
|
|
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
|
|
|
|
The vector column (``embedding``) is indexed for similarity search (a local
|
|
index) and additional columns are added for filtering the search results. The
|
|
filtering is possible on ``category``, ``info`` and all primary key columns of
|
|
the base table. The partition key of the base table must contains both columns
|
|
``id`` and ``created_at``. It is allowed to create a local vector index using
|
|
only a part of the partition key columns of the base table.
|
|
|
|
Vector indexes support additional filtering columns of native data types
|
|
(excluding counter and duration). The indexed column itself must be a vector
|
|
column, while the extra columns can be used to filter search results.
|
|
|
|
The supported types are:
|
|
|
|
* ``ascii``
|
|
* ``bigint``
|
|
* ``blob``
|
|
* ``boolean``
|
|
* ``date``
|
|
* ``decimal``
|
|
* ``double``
|
|
* ``float``
|
|
* ``inet``
|
|
* ``int``
|
|
* ``smallint``
|
|
* ``text``
|
|
* ``varchar``
|
|
* ``time``
|
|
* ``timestamp``
|
|
* ``timeuuid``
|
|
* ``tinyint``
|
|
* ``uuid``
|
|
* ``varint``
|
|
|
|
|
|
The following options are supported for vector indexes. All of them are optional.
|
|
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| Option Name | Description | Default Value |
|
|
+==============================+==========================================================================================================+===============+
|
|
| ``similarity_function`` | The similarity function to use for vector comparisons. Supported values are: | ``COSINE`` |
|
|
| | ``COSINE``, ``EUCLIDEAN``, and ``DOT_PRODUCT``. ``DOT_PRODUCT`` requires vectors to be | |
|
|
| | normalized, meaning each vector should have unit length (L2 norm = 1). For more information, see | |
|
|
| | `Vector normalization <https://en.wikipedia.org/wiki/Normalization_(statistics)#Vector_normalization>`_. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``maximum_node_connections`` | The maximum number of connections per node in the HNSW graph. In other HNSW implementations | ``16`` |
|
|
| | it is often denoted as ``m``. Higher values lead to better recall (i.e., more relevant | |
|
|
| | results are found) but increase memory usage and index size. Supported values are integers | |
|
|
| | between 1 and 512. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``construction_beam_width`` | The beam width to use during index **construction**. In other HNSW implementations it is often | ``128`` |
|
|
| | denoted as ``efConstruction``. Higher values lead to better recall (i.e., more relevant | |
|
|
| | results are found) but increase index creation time and memory usage. Supported values are | |
|
|
| | integers between 1 and 4096. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``search_beam_width`` | The beam width to use during index **search**. In other HNSW implementations it is often denoted | ``64`` |
|
|
| | as ``efSearch``. Higher values lead to better recall (i.e., more relevant results are found) | |
|
|
| | but increase query latency. Supported values are integers between 1 and 4096. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``quantization`` | The quantization method to use for compressing vectors in Vector Index. Vectors in base table | ``f32`` |
|
|
| | are never compressed. Supported values (case-insensitive) are: | |
|
|
| | | |
|
|
| | * ``f32``: 32-bit single-precision IEEE 754 floating-point. | |
|
|
| | * ``f16``: 16-bit standard half-precision floating-point (IEEE 754). | |
|
|
| | * ``bf16``: 16-bit "Brain" floating-point (optimized for ML workloads). | |
|
|
| | * ``i8``: 8-bit signed integer. | |
|
|
| | * ``b1``: 1-bit binary value (packed 8 per byte). | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``oversampling`` | A multiplier for the candidate set size during the search phase. For example, if a query asks for 10 | ``1.0`` |
|
|
| | similar vectors (``LIMIT 10``) and ``oversampling`` is 2.0, the search will initially retrieve 20 | |
|
|
| | candidates. This can improve accuracy at the cost of latency. Supported values are | |
|
|
| | floating-point numbers between 1.0 (no oversampling) and 100.0. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``rescoring`` | Flag enabling recalculation of similarity scores with full precision and re-ranking of the candidate set.| ``false`` |
|
|
| | Valid only for quantization below ``f32``. Supported values are: | |
|
|
| | | |
|
|
| | * ``true``: Enable rescoring. | |
|
|
| | * ``false``: Disable rescoring. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
| ``source_model`` | The name of the embedding model that produced the vectors (e.g., ``"ada002"``). Cassandra client | *(none)* |
|
|
| | libraries such as CassIO send this option to tag the index with the model. Cassandra SAI rejects it as | |
|
|
| | an unrecognized property; ScyllaDB accepts and preserves it in ``DESCRIBE`` output for compatibility | |
|
|
| | with those libraries, but does not act on it. | |
|
|
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
|
|
|
|
|
|
.. _cassandra-sai-compatibility:
|
|
|
|
Cassandra SAI Compatibility for Vector Search
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
ScyllaDB accepts the Cassandra ``StorageAttachedIndex`` (SAI) class name in ``CREATE CUSTOM INDEX``
|
|
statements **for vector columns**. Cassandra libraries such as
|
|
`CassIO <https://cassio.org/>`_ and `LangChain <https://www.langchain.com/>`_ use SAI to create
|
|
vector indexes; ScyllaDB recognizes these statements for compatibility.
|
|
|
|
When ScyllaDB encounters an SAI class name on a **vector column**, the index is automatically
|
|
created as a native ``vector_index``. The following class names are recognized:
|
|
|
|
* ``org.apache.cassandra.index.sai.StorageAttachedIndex`` (exact case required)
|
|
* ``StorageAttachedIndex`` (case-insensitive)
|
|
* ``SAI`` (case-insensitive)
|
|
|
|
Example::
|
|
|
|
-- Cassandra SAI statement accepted by ScyllaDB:
|
|
CREATE CUSTOM INDEX ON my_table (embedding)
|
|
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
|
|
WITH OPTIONS = {'similarity_function': 'COSINE'};
|
|
|
|
-- Equivalent to:
|
|
CREATE CUSTOM INDEX ON my_table (embedding)
|
|
USING 'vector_index'
|
|
WITH OPTIONS = {'similarity_function': 'COSINE'};
|
|
|
|
The ``similarity_function`` option is supported by both Cassandra SAI and ScyllaDB.
|
|
|
|
.. note::
|
|
|
|
SAI class names are only supported on **vector columns**. Using an SAI class name on a
|
|
non-vector column (e.g., ``text`` or ``int``) will result in an error. General SAI
|
|
indexing of non-vector columns is not supported by ScyllaDB; use a
|
|
:doc:`secondary index </cql/secondary-indexes>` instead.
|
|
|
|
.. _drop-index-statement:
|
|
|
|
DROP INDEX
|
|
^^^^^^^^^^
|
|
|
|
Dropping a secondary index uses the ``DROP INDEX`` statement:
|
|
|
|
.. code-block::
|
|
|
|
drop_index_statement: DROP INDEX [ IF EXISTS ] `index_name`
|
|
|
|
The ``DROP INDEX`` statement is used to drop an existing secondary index. The argument of the statement is the index
|
|
name, which may optionally specify the keyspace of the index.
|
|
|
|
If the index is currently being built, the ``DROP INDEX`` can still be executed. Once the ``DROP INDEX`` command is issued,
|
|
the system stops the build process and cleans up any partially built data associated with the index.
|
|
|
|
.. If the index does not exists, the statement will return an error, unless ``IF EXISTS`` is used in which case the
|
|
.. operation is a no-op.
|
|
|
|
Additional Information
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
* :doc:`Global Secondary Indexes </features/secondary-indexes/>`
|
|
* :doc:`Local Secondary Indexes </features/local-secondary-indexes/>`
|
|
|
|
The following courses are available from ScyllaDB University:
|
|
|
|
* `Materialized Views and Secondary Indexes <https://university.scylladb.com/courses/data-modeling/lessons/materialized-views-secondary-indexes-and-filtering/>`_
|
|
* `Global Secondary Indexes <https://university.scylladb.com/courses/data-modeling/lessons/materialized-views-secondary-indexes-and-filtering/topic/global-secondary-indexes/>`_
|
|
* `Local Secondary Indexes <https://university.scylladb.com/courses/data-modeling/lessons/materialized-views-secondary-indexes-and-filtering/topic/local-secondary-indexes-and-combining-both-types-of-indexes/>`_
|
|
|
|
.. include:: /rst_include/apache-copyrights.rst
|