Commit Graph

17 Commits

Author SHA1 Message Date
Karol Nowacki
5474cc6cc2 vector_search: fix race condition on connection timeout
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832
2026-03-13 16:28:22 +01:00
Karol Nowacki
ca7f9a8baf vector_search: fix TLS server name with IP
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

Fixes: VECTOR-528
2026-02-19 13:00:03 +01:00
Karol Nowacki
6205aad601 vector_search: add warn log for failed ann requests
In order to simplify troubleshooting connection problems, this patch
adds an extra warn log that prints the error for the vector search
request whenever it fails.
2026-02-19 13:00:03 +01:00
Karol Nowacki
079fe17e8b vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
2026-02-12 10:08:37 +01:00
Marcin Maliszkiewicz
62962f33bb fix rjson::value to string_view conversion with missing GetStringLength call
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
2025-12-09 19:27:21 +01:00
Karol Nowacki
a54bf50290 vector_search: Fix requests hanging on unreachable nodes
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.

This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.

This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.

Closes scylladb/scylladb#27388
2025-12-03 21:01:43 +02:00
Karol Nowacki
086c6992f5 vector_search: Fix ANN query abort on CQL timeout
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out
preventing unnecessary resource consumption.
2025-12-02 01:17:01 +01:00
Karol Nowacki
b6afacfc1e vector_search: Reduce connection and keep-alive timeouts
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.

This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
2025-12-02 01:17:01 +01:00
Karol Nowacki
c40b3ba4b3 vector_search: Add HTTPS support for vector store connections
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file

To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.

Fixes: VECTOR-327
2025-11-22 08:18:45 +01:00
Karol Nowacki
9563d87f74 vector_search: Don't mark nodes as down on 5xx server errors
For an `/ann` search request, a 5xx server response does not
indicate that the node is down. It can signify a transient state, such
as the index full scan being in progress.

Previously, treating a 503 error as a node fault would cause the node
to be incorrectly marked as down, for example, when a new index was
being created. This commit ensures that such errors are treated as
transient search failures, not node failures.
2025-11-20 08:10:20 +01:00
Karol Nowacki
05b9cafb57 vector_search: Fix status response parsing
The response was incorrectly parsed as a plain string and compared
directly with C++ string. However, the body contains a JSON string,
which includes escaped quotes that caused comparison failures.
2025-11-19 10:02:05 +01:00
Karol Nowacki
7f45f15237 vector_search: Improve vector-store health checking
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.

The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
2025-11-17 06:21:31 +01:00
Karol Nowacki
1972fb315b vector_search: Set max backoff delay to 2x read request timeout
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
2025-11-14 08:05:21 +01:00
Karol Nowacki
097c0f9592 vector_search: Report status check exception via on_internal_error_noexcept
This exception should only occur due to internal errors, not client or external issues.
If triggered, it indicates an internal problem. Therefore, we notify about this exception
using on_internal_error_noexcept.
2025-11-14 08:05:21 +01:00
Karol Nowacki
009d3ea278 vector_search: Add backoff for failed clients
Introduces logic to mark clients that fail to answer an ANN request as
"down". Down clients are omitted from further requests until they
successfully respond to a health check.

Health checks for down clients are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
2025-11-14 07:38:01 +01:00
Karol Nowacki
49a177b51e vector_search: Use std::expected for low-level client errors
To unify error handling, the low-level client methods now return
`std::expected` instead of throwing exceptions. This allows for
consistent and explicit error propagation from the client up to the
caller.

The relevant error types have been moved to a new `vector_search/error.hh`
header to centralize their definitions.
2025-11-14 07:23:40 +01:00
Karol Nowacki
62f8b26bd7 vector_search: Extract client class
This refactoring extracts low-level client logic into a new, dedicated
`client` class. The new class is responsible for connecting to the
server and serializing requests.

This change prepares for extending the `vector_store_client` to check
node status via the `api/v1/status` endpoint.

`/ann` Response deserialization remains in the `vector_store_client` as it
is schema-dependent.
2025-11-14 07:23:40 +01:00