scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Nadav Har'El	85c6cafb1d	alternator: add optimized vector type for vector search Today in Alternator vector search, vectors are presented to the API as lists of numbers. I.e., in JSON a vector is sent in requests and responses as: {"L": [{"N": "3.14159"}, {"N": "6.7"}]} This format is verbose and inefficient for long vectors. Even worse, because the "N" number format has precision guarantees in DynamoDB, we cannot optimize the storage of such vectors by, for example, storing the numbers as 32-bit floats. We actually store these vectors as JSON, exactly as shown above. So in this patch we introduce a new DynamoDB type, "FLOAT32VECTOR", for vectors. The above vector will look like this in JSON: {"FLOAT32VECTOR": [3.14159, 6.7]} Note that each number is an unquoted JSON number, not a JSON string. Importantly, the definition of the "FLOAT32VECTOR" type specifies that components of the vector only have 32-bit precision. This means that Scylla may store these vectors internally as lists of 32-bit floats - not as JSON. And indeed, this patch includes this optimization: Top-level vector attributes are now encoded in an optimized way, as a byte 5 (alternator_type::FLOAT32VECTOR) followed by the elements of the vector, just 4 bytes each (the 4-byte big-endian IEEE 754 representation of each floating-point component). This patch also includes documentation, and extensive tests that the new "FLOAT32VECTOR" type works (which also serve as an example of how to use it in the boto3 SDK), that it is indeed encoded internally as 32-bit floats and not wasteful JSON strings, and that vector search on such items works. The last thing requires cooperation from the vector store, of course - it needs to be able to understand the new optimized encoding of vector attributes in addition to the old unoptimized one. Note that the old unoptimized ("list of numbers") vectors are still supported. 
Although not recommended for general use, some users might still want to use the unoptimized type if they have pre-existing data created on DynamoDB or Alternator without vector search in mind, and the vectors already exist as lists of numbers. Although this is less important, the new vector type "FLOAT32VECTOR" is also allowed in a Query's QueryVector. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-13 11:57:45 +03:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Nadav Har'El	f23e796e76	alternator: fix typos in comments and variable names Copilot found these typos in comments and variable name in alternator/, so might as well fix them. There are no functional changes in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28447	2026-02-02 19:16:43 +03:00
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Nadav Har'El	828cc98e4c	alternator: add function serialized_value_if_type() This patch introduces a function serialized_value_if_type() which takes a serialized value stored in the ":attrs" map, and converts it into a serialized CQL type if it matches a particular type (S, B or N) - or returns null if the value has the wrong type. We will use this function in the following patch for deserializing values stored in the ":attrs" map to use them as a materialized view key. If the value has the right type, it will be converted to the CQL type and used as the key - but if it has the wrong type the key will be null and it will not appear in the view. This is exactly how GSI is supposed to behave. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Nadav Har'El	3c0603558c	alternator: add validation of numbers' magnitude and precision DynamoDB limits the allowed magnitude and precision of numbers - valid decimal exponents are between -130 and 125 and up to 38 significant decimal digits are allowed. In contrast, Scylla uses the CQL "decimal" type which offers unlimited precision. This can cause two problems: 1. Users might get used to this "unofficial" feature and start relying on it, not allowing us to switch to a more efficient limited-precision implementation later. 2. If huge exponents are allowed, e.g., 1e-1000000, summing such a number with 1.0 will result in a huge number, huge allocations and stalls. This is highly undesirable. After this patch, all tests in test/alternator/test_number.py now pass. The various failing tests which verify magnitude and precision limitations in different places (key attributes, non-key attributes, and arithmetic expressions) now pass - so their "xfail" tags are removed. Fixes #6794 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2023-05-02 11:04:05 +03:00
Kefu Chai	df63e2ba27	types: move types.{cc,hh} into types they are part of the CQL type system, and are "closer" to types. let's move them into "types" directory. the building systems are updated accordingly. the source files referencing `types.hh` were updated using following command: ``` find . -name "*.{cc,hh}" -exec sed -i 's/\"types.hh\"/\"types\/types.hh\"/' {} + ``` the source files under sstables include "types.hh", which is indeed the one located under "sstables", so include "sstables/types.hh" instead, so it's more explicit. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12926	2023-02-19 21:05:45 +02:00
Avi Kivity	69a385fd9d	Introduce schema/ module Schema related files are moved there. This excludes schema files that also interact with mutations, because the mutation module depends on the schema. Those files will have to go into a separate module. Closes #12858	2023-02-15 11:01:50 +02:00
Marcin Maliszkiewicz	6f055ca5f9	alternator: evaluate expressions as false for stored malformed binary data We'll try to distinguish the case when data comes from the storage rather than user request. Such an attribute can be used in expressions and when it can't be decoded it should make the expression evaluate as false to simply exclude the row during filter query or scan. Note that this change focuses on binary type, for other types we may have some inconsistencies in the implementation.	2023-01-16 15:15:27 +01:00
Botond Dénes	2b0bc11f2e	service/paging: use position_in_partition instead of clustering_key for last row The former allows for expressing more positions, like a position before/after a clustering key. This practically enables the coordinator side paging logic, for a query to be stopped at a tombstone (which can have said positions).	2022-06-23 13:36:20 +03:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Nadav Har'El	f7e984110d	alternator: add another unwrap_number() variant We have an unwrap_number() function which in case of data errors (such as the value not being a number) throws an exception with a given string used in the message. In this patch we add a variant of unwrap_number() - try_unwrap_number() - which doesn't take a message, and doesn't throw exceptions - instead it returns an empty std::optional if the given value is not a number. This function is useful in places where we need to know if we got a number or not, and both outcomes are fine rather than errors. We'll use it in a following patch to parse expiration times for the TTL feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-25 22:01:37 +02:00
Nadav Har'El	7e6c5394f3	alternator: move list_concatenate() function The list_concatenate() function was only used for UpdateExpression's ADD operation, so we made it a static function in the source file where it was used. In the next patch, we'll want to use it in another place (AttributeUpdates' ADD operation), so let's move it to the same file where similar functions for sets exist. This patch is almost entirely a code move, but also makes one small change: list_concatenate() used to throw an exception if one of the arguments wasn't a list, but the text of this exception was specific to UpdateExpression. So in the new version, we return a null value in this case - and the caller checks for it and throws the right exception. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-11-03 10:19:26 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Piotr Sarna	4de23d256e	alternator,utils: move rjson.hh to utils/ rjson is going to replace libjsoncpp, so it's moved from alternator to the common utils/ directory.	2020-07-03 08:30:01 +02:00
Nadav Har'El	493d7e6716	alternator: avoid unnecessary conversion to string In a couple of places, where we already have a std::string_view, there is no need to convert it to a std::string (which requires allocation). One cool observation (by Piotr Sarna) is that map over std::string_view is fine, when the strings in the map are always string constants. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 12:16:26 +03:00
Nadav Har'El	8c026b9f10	alternator: move some code out of executor.cc The source file alternator/executor.cc has grown too much, reaching almost 4,000 lines. In this patch I move about 400 lines out of executor.cc: 1. Some functions related to serialization of sets and lists were moved to serialization.cc, 2. Functions related to evaluating parsed expressions were moved to expressions.cc. The header file expressions_eval.hh was also removed - the calculate_value() functions now live in expressions.cc, so we can just define them in expressions.hh, no need for a separate header file. This patch just moves code around. It doesn't make any functional changes. Refs #5783. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 12:16:26 +03:00
Pavel Emelyanov	4fa12f2fb8	header: De-bloat schema.hh The header sits in many other headers, but there's a handy schema_fwd.hh that's tiny and contains needed declarations for other headers. So replace schema.hh with schema_fwd.hh in most of the headers (and remove completely from some). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200303102050.18462-1-xemul@scylladb.com>	2020-03-03 11:34:00 +01:00
Nadav Har'El	15515b2cc1	alternator: more useful get_key_from_typed_value() utility function We had a get_key_from_typed_value() utility function to decode a JSON-encoded value with a known type (the JSON encoding is a map whose key is the type, the value always a string because all possible key types - string, bytes and number, are encoded as strings). However, the function was less useful than it could have been - it was missing one check for a malformed object (a check which only appeared in one of its callers), it unnecessarily received the column's expected type (all the callers passed it the given key column's type). The cleaned up function will be more useful for the following patch to support KeyConditionExpression, which wants to reuse it. While at it, this patch also uses rjson::to_string_view(it->value) instead of the less correct it->value.GetString() (the latter relies on null-termination, which is actually true for JSON strings, but there is no reason to rely on it). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200213192509.32685-3-nyh@scylladb.com>	2020-02-16 11:22:30 +02:00
Piotr Sarna	9504bbf5a4	alternator: move unwrap_set to serialization header The utility function for unwrapping a set is going to be useful across source files, so it's moved to serialization.hh/serialization.cc.	2019-12-10 15:08:47 +01:00
Dejan Mircevski	9955f0342f	alternator: Make unwrap_number() visible unwrap_number() is now a public function in serialization.hh instead of a static function visible only in executor.cc. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2019-10-31 10:46:30 -04:00
Nadav Har'El	c9eb9d9c76	alternator: update license blurbs Update all the license blurbs to the one we use in the open-source Scylla project, licensed under the AGPL. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190825160321.10016-1-nyh@scylladb.com>	2019-09-11 18:01:05 +03:00
Piotr Sarna	9c05051b59	alternator: extract getting key value subfunction Currently the only utility function for getting key bytes from JSON was to parse a document with the following format: "key_column_name" : { "key_column_type" : VALUE }. However, it's also useful to parse only the inner document, i.e.: { "key_column_type" : VALUE }.	2019-09-11 18:01:05 +03:00
Piotr Sarna	cb29d6485e	alternator: migrate to rapidjson library Profiling alternator implied that JSON parsing takes up a fair amount of CPU, and as such should be optimized. libjsoncpp is a standard library for handling JSON objects, but it also proves slower than rapidjson, which is hereby used instead. The results indicated that libjsoncpp used roughly 30% of CPU for a single-shard alternator instance under stress, while rapidjson dropped that usage to 18% without optimizations. Future optimizations should include eliding object copying, string copying and perhaps experimenting with different JSON allocators.	2019-09-11 18:01:04 +03:00
Piotr Sarna	b67f22bfc6	alternator: move related functions to serialization.cc Existing functions related to serialization and deserialization are moved to serialization.cc source file. Message-Id: <fb49a08b05fdfcf7473e6a7f0ac53f6eaedc0144.1559646761.git.sarna@scylladb.com>	2019-09-11 15:06:05 +03:00
Piotr Sarna	b3fd4b5660	alternator: add simple attribute serialization routines Attributes used to be written into the database in raw JSON format, which is far from optimal. This patch introduces more robust serialization routines for simple alternator types: S, B, BOOL, N. Serialization uses the first byte to encode attribute type and follows with serializing data in binary form. More complex types (sets, lists, etc.) are currently still serialized in raw JSON and will be optimized in follow-up patches. Message-Id: <10955606455bbe9165affb8ac8fba4d9e7c3705f.1559646761.git.sarna@scylladb.com>	2019-09-11 15:01:07 +03:00

27 Commits
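
The FLOAT32VECTOR encoding described in commit 85c6cafb1d (a type byte 5 followed by each component as a 4-byte big-endian IEEE 754 float) can be illustrated with a minimal Python sketch. This is not Scylla's code: the function names and the constant name below are hypothetical, and only the byte layout is taken from the commit message.

```python
import struct

# Type tag from the commit message (alternator_type::FLOAT32VECTOR == 5).
# The constant name is an assumption made for this sketch.
FLOAT32VECTOR_TAG = 5


def encode_float32_vector(components):
    """Return the tag byte followed by each component as a 4-byte
    big-endian IEEE 754 single-precision float."""
    return bytes([FLOAT32VECTOR_TAG]) + struct.pack(
        f">{len(components)}f", *components
    )


def decode_float32_vector(blob):
    """Inverse of encode_float32_vector(); raises ValueError on bad input."""
    if not blob or blob[0] != FLOAT32VECTOR_TAG:
        raise ValueError("not a FLOAT32VECTOR encoding")
    payload = blob[1:]
    if len(payload) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(payload) // 4
    return list(struct.unpack(f">{count}f", payload))


if __name__ == "__main__":
    blob = encode_float32_vector([3.14159, 6.7])
    # 1 tag byte + 2 components * 4 bytes each
    print(len(blob))
    # Values come back rounded to 32-bit precision.
    print(decode_float32_vector(blob))
```

A two-component vector takes 9 bytes in this layout, and values that are not exactly representable as 32-bit floats are rounded on the way in, which matches the commit's statement that FLOAT32VECTOR components only have 32-bit precision.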