scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 03:20:37 +00:00

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	705ec24977	db/config.cc: increment components_memory_reclaim_threshold config default Incremented the components_memory_reclaim_threshold config's default value to 0.2 as the previous value was too strict and caused unnecessary eviction in otherwise healthy clusters. Fixes #18607 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `3d7d1fa72a`) Closes #19011	2024-06-04 07:13:28 +03:00
Botond Dénes	e89eb41e70	Merge '[Backport 5.2] : Reload reclaimed bloom filters when memory is available ' from Lakshmi Narayanan Sreethar PR https://github.com/scylladb/scylladb/pull/17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components). The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory. Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.2. Closes #18666 * github.com:scylladb/scylladb: sstable_datafile_test: add testcase to test reclaim during reload sstable_datafile_test: add test to verify auto reload of reclaimed components sstables_manager: reload previously reclaimed components when memory is available sstables_manager: start a fiber to reload components sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables sstable_datafile_test: add test to verify reclaimed components reload sstables: support reloading reclaimed components sstables_manager: add new intrusive set to track the reclaimed sstables sstable: add link and comparator class to support new instrusive set sstable: renamed intrusive list link type sstable: track memory reclaimed from components per sstable sstable: rename local variable in sstable::total_reclaimable_memory_size	2024-05-30 11:11:39 +03:00
Botond Dénes	331e0c4ca7	Merge '[Backport 5.2] mutation_fragment_stream_validating_filter: respect validating_level::none' from ScyllaDB Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to. Fixes: #18662 (cherry picked from commit `f6511ca1b0`) (cherry picked from commit `e7b07692b6`) (cherry picked from commit `78afb3644c`) Refs #18667 Closes #18723 * github.com:scylladb/scylladb: test/boost/mutation_fragment_test.cc: add test for validator validation levels mutation: mutation_fragment_stream_validating_filter: fix validation_level::none mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter	2024-05-27 08:52:06 +03:00
Alexey Novikov	32be38dae5	make timestamp string format cassandra compatible when we convert timestamp into string it must look like: '2017-12-27T11:57:42.500Z' it concerns any conversion except JSON timestamp format JSON string has space as time separator and must look like: '2017-12-27 11:57:42.500Z' both formats always contain milliseconds and timezone specification Fixes #14518 Fixes #7997 Closes #14726 Fixes #16575 (cherry picked from commit `ff721ec3e3`) Closes #18852	2024-05-26 16:30:06 +03:00
Botond Dénes	3dacf6a4b1	test/boost/mutation_fragment_test.cc: add test for validator validation levels To make sure that the validator doesn't validate what the validation level doesn't include. (cherry picked from commit `78afb3644c`)	2024-05-24 03:36:28 -04:00
Benny Halevy	d947f1e275	chunked_vector_test: add more exception safety tests For insertion, with and without reservation, and for fill and copy constructors. Reproduces https://github.com/scylladb/scylladb/issues/18635 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:33:42 +03:00
Benny Halevy	d727382cc1	chunked_vector_test: exception_safe_class: count also moved objects We have to account for moved objects as well as copied objects so they will be balanced with the respective `del_live_object` calls called by the destructor. However, since chunked_vector requires the value_type to be nothrow_move_constructible, just count the additional live object, but do not modify _countdown or, respectively, throw an exception, as this should be considered only for the default and copy constructors. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:33:42 +03:00
Lakshmi Narayanan Sreethar	d4c523e9ef	sstable_datafile_test: add testcase to test reclaim during reload Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `4d22c4b68b`)	2024-05-14 19:20:06 +05:30
Lakshmi Narayanan Sreethar	b505ce4897	sstable_datafile_test: add test to verify auto reload of reclaimed components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a080daaa94`)	2024-05-14 19:20:06 +05:30
Lakshmi Narayanan Sreethar	72494af137	sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables The testcase uses an sstable whose mutation key and the generation are owned by different shards. Due to this, when process_sstable_dir is called, the sstable gets loaded into a different shard than the one that was intended. This also means that the sstable and the sstable manager end up in different shards. The following patch will introduce a condition variable in sstables manager which will be signalled from the sstables. If the sstable and the sstable manager are in different shards, the signalling will cause the testcase to fail in debug mode with this error : "Promise task was set on shard x but made ready on shard y". So, fix it by supplying appropriate generation number owned by the same shard which owns the mutation key as well. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `24064064e9`)	2024-05-14 19:19:22 +05:30
Lakshmi Narayanan Sreethar	c15e72695d	sstable_datafile_test: add test to verify reclaimed components reload Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `69b2a127b0`)	2024-05-14 19:19:18 +05:30
Kamil Braun	b68c06cc3a	direct_failure_detector: increase ping timeout and make it tunable The direct failure detector design is simplistic. It sends pings sequentially and times out listeners that reached the threshold (i.e. didn't hear from a given endpoint for too long) in-between pings. Given the sequential nature, the previous ping must finish so the next ping can start. We timeout pings that take too long. The timeout was hardcoded and set to 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. 3 subsequent timed out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s). Increase the ping timeout to 600ms which should be enough even for pinging the opposite side of Earth, and make it tunable. Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, one timed out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed out pings which makes it more robust in presence of transient network hiccups. In the future we'll most likely want to decrease the Raft listener threshold again, if we use Raft for data path -- so leader elections start quickly after leader failures. (Faster than 2s). To do that we'll have to improve the design of the direct failure detector. Ref: scylladb/scylladb#16410 Fixes: scylladb/scylladb#16607 --- I tested the change manually using `tc qdisc ... netem delay`, setting network delay on local setup to ~300ms with jitter. Without the change, the result is as observed in scylladb/scylladb#16410: interleaving ``` raft_group_registry - marking Raft server ... as dead for Raft groups raft_group_registry - marking Raft server ... as alive for Raft groups ``` happening once every few seconds. The "marking as dead" happens whenever we get 3 subsequent failed pings, which is happens with certain (high) probability depending on the latency jitter. Then as soon as we get a successful ping, we mark server back as alive. With the change, the phenomenon no longer appears. (cherry picked from commit `8df6d10e88`) Closes #18558	2024-05-08 15:46:59 +02:00
Lakshmi Narayanan Sreethar	0fc0474ccc	sstables: reclaim_memory_from_components: do not update _recognised_components When reclaiming memory from bloom filters, do not remove them from _recognised_components, as that leads to the on-disk filter component being left back on disk when the SSTable is deleted. Fixes #18398 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#18400 (cherry picked from commit `6af2659b57`) Closes #18437	2024-04-29 10:02:52 +03:00
Lakshmi Narayanan Sreethar	dd9ab15bb5	test_bloom_filter.py: disable reclaiming memory from components Disabled reclaiming memory from sstable components in the testcase as it interferes with the false positive calculation. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d86505e399`)	2024-04-16 15:50:22 +05:30
Lakshmi Narayanan Sreethar	96db5ae5e3	sstable_datafile_test: add tests to verify auto reclamation of components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d261f0fbea`)	2024-04-16 15:49:58 +05:30
Lakshmi Narayanan Sreethar	beea229deb	test/lib: allow overriding available memory via test_env_config Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `169629dd40`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	31251b37dd	sstable_datafile_test: add testcase to verify reclamation from sstables Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e0b6186d16`)	2024-04-16 15:30:30 +05:30
Aleksandra Martyniuk	97671eb935	test: add test for repair_row::size() Add test which checs whether repair_row::size() considers external memory. (cherry picked from commit `51c09a84cc`)	2024-04-09 13:29:33 +02:00
Kefu Chai	6209f5d6d4	tests: utils: error injection: print time duration instead of count instead of casting / comparing the count of duration unit, let's just compare the durations, so that boost.test is able to print the duration in a more informative and user friendly way (line wrapped) test/boost/error_injection_test.cc(167): fatal error: in "test_inject_future_disabled": critical check wait_time > sleep_msec has failed [23839ns <= 10ms] Refs #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1d33a68dd7`)	2024-03-20 09:40:16 +00:00
Kamil Braun	0bb338c521	test: remove test_writes_to_recent_previous_cdc_generations The test in its original form relies on the `error_injections_at_startup` feature, which 5.2 doesn't have, so I adapted the test to enable error injections after bootstrapping nodes in the backport (`9c44bbce67`). That is however incorrect, it's important for the injection to be enabled while the nodes are booting, otherwise the test will be flaky, as we observed. Details in scylladb/scylladb#17749. Remove the test from 5.2 branch. Fixes scylladb/scylladb#17749 Closes #17750	2024-03-15 10:22:22 +02:00
Nadav Har'El	08077ff3e8	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172 (cherry picked from commit `21e7deafeb`)	2024-03-13 15:08:32 +02:00
Kamil Braun	405387c663	test: unflake test_topology_remove_garbage_group0 The test is booting nodes, and then immediately starts shutting down nodes and removing them from the cluster. The shutting down and removing may happen before driver manages to connect to all nodes in the cluster. In particular, the driver didn't yet connect to the last bootstrapped node. Or it can even happen that the driver has connected, but the control connection is established to the first node, and the driver fetched topology from the first node when the first node didn't yet consider the last node to be normal. So the driver decides to close connection to the last node like this: ``` 22:34:03.159 DEBUG> [control connection] Removing host not found in peers metadata: <Host: 127.42.90.14:9042 datacenter1> ``` Eventually, at the end of the test, only the last node remains, all other nodes have been removed or stopped. But the driver does not have a connection to that last node. Fix this problem by ensuring that: - all nodes see each other as NORMAL, - the driver has connected to all nodes at the beginning of the test, before we start shutting down and removing nodes. Fixes scylladb/scylladb#16373 (cherry picked from commit `a68701ed4f`) Closes #17703	2024-03-12 13:43:21 +01:00
Michał Chojnowski	54048e5613	sstables: fix a use-after-free in key_view::explode() key_view::explode() contains a blatant use-after-free: unless the input is already linearized, it returns a view to a local temporary buffer. This is rare, because partition keys are usually not large enough to be fragmented. But for a sufficiently large key, this bug causes a corrupted partition_key down the line. Fixes #17625 (cherry picked from commit `7a7b8972e5`) Closes #17725	2024-03-11 16:17:32 +02:00
Lakshmi Narayanan Sreethar	e8736ae431	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658 (cherry picked from commit `76f0d5e35b`) Closes #17694	2024-03-08 10:56:12 +02:00
Botond Dénes	ba373f83e4	Merge '[Backport 5.2] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk RPC calls lose information about the type of returned exception. Thus, if a table is dropped on receiver node, but it still exists on a sender node and sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks, if the exception was caused by table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime. Fixes: https://github.com/scylladb/scylladb/issues/17028. Fixes: https://github.com/scylladb/scylladb/issues/15370. Fixes: https://github.com/scylladb/scylladb/issues/15598. Closes #17528 * github.com:scylladb/scylladb: repair: handle no_such_column_family from remote node gracefully test: test drop table on receiver side during streaming streaming: fix indentation streaming: handle no_such_column_family from remote node gracefully repair: add methods to skip dropped table	2024-02-28 16:33:01 +02:00
Aleksandra Martyniuk	23493bb342	test: test drop table on receiver side during streaming (cherry picked from commit `2ea5d9b623`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	296be93714	test: add test to check if reader is closed Add test to check if reader is closed in sstable::has_partition_key. (cherry picked from commit `4530be9e5b`)	2024-02-26 16:17:12 +01:00
Avi Kivity	9c44bbce67	Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak Before this PR, writes to the previous CDC generations would always be rejected. After this PR, they will be accepted if the write's timestamp is greater than `now - generation_leeway`. This change was proposed around 3 years ago. The motivation was to improve user experience. If a client generates timestamps by itself and its clock is desynchronized with the clock of the node the client is connected to, there could be a period during generation switching when writes fail. We didn't consider this problem critical because the client could simply retry a failed write with a higher timestamp. Eventually, it would succeed. This approach is safe because these failed writes cannot have any side effects. However, it can be inconvenient. Writing to previous generations was proposed to improve it. The idea was rejected 3 years ago. Recently, it turned out that there is a case when the client cannot retry a write with the increased timestamp. It happens when a table uses CDC and LWT, which makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves writing to the CDC log table. If it fails, we get stuck. It's a major bug with an unknown perfect solution. Allowing writes to previous generations for `generation_leeway` is a probabilistic fix that should solve the problem in practice. Apart from this change, this PR adds tests for it and updates the documentation. This PR is sufficient to enable writes to the previous generations only in the gossiper-based topology. The Raft-based topology needs some adjustments in loading and cleaning CDC generations. These changes won't interfere with the changes introduced in this PR, so they are left for a follow-up. Fixes scylladb/scylladb#7251 Fixes scylladb/scylladb#15260 Closes scylladb/scylladb#17134 * github.com:scylladb/scylladb: docs: using-scylla: cdc: remove info about failing writes to old generations docs: dev: cdc: document writing to previous CDC generations test: add test_writes_to_previous_cdc_generations cdc: generation: allow increasing generation_leeway through error injection cdc: metadata: allow sending writes to the previous generations (cherry picked from commit `9bb4482ad0`) Backport note: replaced `servers_add` with `server_add` loop in tests replaced `error_injections_at_startup` (not implemented in 5.2) with `enable_injection` post-boot	2024-02-22 15:05:19 +01:00
Nadav Har'El	6a6115cd86	mv: fix missing view deletions in some cases of range tombstones For efficiency, if a base-table update generates many view updates that go the same partition, they are collected as one mutation. If this mutation grows too big it can lead to memory exhaustion, so since commit `7d214800d0` we split the output mutation to mutations no longer than 100 rows (max_rows_for_view_updates) each. This patch fixes a bug where this split was done incorrectly when the update involved range tombstones, a bug which was discovered by a user in a real use case (#17117). Range tombstones are read in two parts, a beginning and an end, and the code could split the processing between these two parts and the result that some of the range tombstones in update could be missed - and the view could miss some deletions that happened in the base table. This patch fixes the code in two places to avoid breaking up the processing between range tombstones: 1. The counter "_op_count" that decides where to break the output mutation should only be incremented when adding rows to this output mutation. The existing code strangely incrmented it on every read (!?) which resulted in the counter being incremented on every input fragment, and in particular could reach the limit 100 between two range tombstone pieces. 2. Moreover, the length of output was checked in the wrong place... The existing code could get to 100 rows, not check at that point, read the next input - half a range tombstone - and only then check that we reached 100 rows and stop. The fix is to calculate the number of rows in the right place - exactly when it's needed, not before the step. The first change needs more justification: The old code, that incremented _op_count on every input fragment and not just output fragments did not fit the stated goal of its introduction - to avoid large allocations. In one test it resulted in breaking up the output mutation to chunks of 25 rows instead of the intended 100 rows. But, maybe there was another goal, to stop the iteration after 100 input rows and avoid the possibility of stalls if there are no output rows? It turns out the answer is no - we don't need this _op_count increment to avoid stalls: The function build_some() uses `co_await on_results()` to run one step of processing one input fragment - and `co_await` always checks for preemption. I verfied that indeed no stalls happen by using the existing test test_long_skipped_view_update_delete_with_timestamp. It generates a very long base update where all the view updates go to the same partition, but all but the last few updates don't generate any view updates. I confirmed that the fixed code loops over all these input rows without increasing _op_count and without generating any view update yet, but it does NOT stall. This patch also includes two tests reproducing this bug and confirming its fixed, and also two additional tests for breaking up long deletions that I wanted to make sure doesn't fail after this patch (it doesn't). By the way, this fix would have also fixed issue #12297 - which we fixed a year ago in a different way. That issue happend when the code went through 100 input rows without generating any output rows, and incorrectly concluding that there's no view update to send. With this fix, the code no longer stops generating the view update just because it saw 100 input rows - it would have waited until it generated 100 output rows in the view update (or the input is really done). Fixes #17117 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17164 (cherry picked from commit `14315fcbc3`)	2024-02-22 15:36:58 +02:00
Michał Jadwiszczak	0d22471222	schema::describe: print 'synchronous_updates' only if it was specified While describing materialized view, print `synchronous_updates` option only if the tag is present in schema's extensions map. Previously if the key wasn't present, the default (false) value was printed. Fixes: #14924 Closes #14928 (cherry picked from commit `b92d47362f`)	2024-02-19 09:10:34 +02:00
Botond Dénes	422a731e85	query: do not kill unpaged queries when they reach the tombstone-limit The reason we introduced the tombstone-limit (query_tombstone_page_limit), was to allow paged queries to return incomplete/empty pages in the face of large tombstone spans. This works by cutting the page after the tombstone-limit amount of tombstones were processed. If the read is unpaged, it is killed instead. This was a mistake. First, it doesn't really make sense, the reason we introduced the tombstone limit, was to allow paged queries to process large tombstone-spans without timing out. It does not help unpaged queries. Furthermore, the tombstone-limit can kill internal queries done on behalf of user queries, because all our internal queries are unpaged. This can cause denial of service. So in this patch we disable the tombstone-limit for unpaged queries altogether, they are allowed to continue even after having processed the configured limit of tombstones. Fixes: #17241 Closes scylladb/scylladb#17242 (cherry picked from commit `f068d1a6fa`)	2024-02-15 12:50:30 +02:00
Botond Dénes	94af1df2cf	Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs. The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function. Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw. Fixes #17035 Closes scylladb/scylladb#17073 * github.com:scylladb/scylladb: utils: replace assert() by on_internal_error() utils: add on_internal_error with common logger utils: add a timeuuid minimum, like we had maximum test/cql-pytest: tests for "date" type (cherry picked from commit `2a4b991772`)	2024-02-07 14:19:32 +02:00
Kamil Braun	4e257c5c74	test_raft_snapshot_request: fix flakiness (again) At the end of the test, we wait until a restarted node receives a snapshot from the leader, and then verify that the log has been truncated. To check the snapshot, the test used the `system.raft_snapshots` table, while the log is stored in `system.raft`. Unfortunately, the two tables are not updated atomically when Raft persists a snapshot (scylladb/scylladb#9603). We first update `system.raft_snapshots`, then `system.raft` (see `raft_sys_table_storage::store_snapshot_descriptor`). So after the wait finishes, there's no guarantee the log has been truncated yet -- there's a race between the test's last check and Scylla doing that last delete. But we can check the snapshot using `system.raft` instead of `system.raft_snapshots`, as `system.raft` has the latest ID. And since `1640f83fdc`, storing that ID and truncating the log in `system.raft` happens atomically. Closes scylladb/scylladb#17106 (cherry picked from commit `c911bf1a33`)	2024-02-02 11:31:19 +01:00
Kamil Braun	08021dc906	test_raft_snapshot_request: fix flakiness Add workaround for scylladb/python-driver#295. Also an assert made at the end of the test was false, it is fixed with appropriate comment added. (cherry picked from commit `74bf60a8ca`)	2024-02-02 11:31:19 +01:00
Botond Dénes	ce0ed29ad6	Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb/scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (`ff386e7a44`) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) Closes scylladb/scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call (cherry picked from commit `d202d32f81`) Backport notes: - `has_output()` has a smaller condition in the backported version (because the condition was smaller in `poll_output()`) - `process_fsm_output` has a smaller body (because `io_fiber` had a smaller body) in the backported version - the HTTP API is only started if `raft_group_registry` is started	2024-02-01 15:38:51 +01:00
Michael Huang	84004ab83c	raft: Store snapshot update and truncate log atomically In case the snapshot update fails, we don't truncate commit log. Fixes scylladb/scylladb#9603 Closes scylladb/scylladb#15540 (cherry picked from commit `1640f83fdc`)	2024-02-01 13:10:05 +01:00
Kamil Braun	753e2d3c57	service: raft: force initial snapshot transfer in new cluster When we upgrade a cluster to use Raft, or perform manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In `91f609d065` we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state. To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering snapshot transfer and forcing them to pull group 0 state. It's unnecessary to do this for cluster which bootstraps with Raft enabled right away but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that. Extend Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - node would never join, waiting for schema). Fixes: #14066 Closes #14336 (cherry picked from commit `ff386e7a44`) Backport note: contrary to the claims above, it turns out that it is actually necessary to create snapshots in clusters which bootstrap with Raft, because of tombstones in current schema state expire hence applying schema mutations from old Raft log entries is not really idempotent. Snapshot transfer, which transfers group 0 history and state_ids, prevents old entries from applying schema mutations over latest schema state. Ref: scylladb/scylladb#16683	2024-01-31 17:00:10 +01:00
Avi Kivity	351d6d6531	Merge 'Invalidate prepared statements for views when their schema changes.' from Eliran Sinvani When a base table changes and altered, so does the views that might refer to the added column (which includes "SELECT " views and also views that might need to use this column for rows lifetime (virtual columns). However the query processor implementation for views change notification was an empty function. Since views are tables, the query processor needs to at least treat them as such (and maybe in the future, do also some MV specific stuff). This commit adds a call to `on_update_column_family` from within `on_update_view`. The side effect true to this date is that prepared statements for views which changed due to a base table change will be invalidated. Fixes https://github.com/scylladb/scylladb/issues/16392 This series also adds a test which fails without this fix and passes when the fix is applied. Closes scylladb/scylladb#16897 github.com:scylladb/scylladb: Add test for mv prepared statements invalidation on base alter query processor: treat view changes at least as table changes (cherry picked from commit `5810396ba1`)	2024-01-23 21:31:47 +02:00
Michał Jadwiszczak	29da20b9e0	schema: add scylla specific options to schema description Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates` options to schema description. Fixes: #12389 Fixes: scylladb/scylla-enterprise#2979 Closes #16786	2024-01-16 09:56:08 +02:00
Aleksandra Martyniuk	8b77fbc904	cql3: statements: call check_restricted_table_properties in prepare Table properties validation is performed on statement execution. Thus, when one attempts to create a table with invalid options, an incorrect command gets committed in Raft. But then its application fails, leading to a raft machine being stopped. Check table properties when create and alter statements are prepared. The error is no longer returned as an exceptional future, but it is thrown. Adjust the tests accordingly. (cherry picked from commit `60fdc44bce`)	2024-01-11 16:10:26 +01:00
Nadav Har'El	ac0056f4bc	Merge 'Fix partition estimation with TWCS tables during streaming' from Raphael "Raph" Carvalho TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows. Turns out we had two problems in this area that leads to suboptimal bloom filters. 1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed. 2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count. For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong. Fixes https://github.com/scylladb/scylladb/issues/15704. Closes scylladb/scylladb#15938 * github.com:scylladb/scylladb: streaming: Improve partition estimation with TWCS streaming: Don't adjust partition estimate if segregation is postponed (cherry picked from commit `64d1d5cf62`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #16672	2024-01-08 09:06:43 +02:00
Kamil Braun	287546923e	Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak Fixes #9405 `sync_point` API provided with incorrect sync point id might allocate crazy amount of memory and fail with `std::bad_alloc`. To fix this, we can check if the encoded sync point has been modified before decoding. We can achieve this by calculating a checksum before encoding, appending it to the encoded sync point, and compering it with a checksum calculated in `db::hints::decode` before decoding. Closes #14534 * github.com:scylladb/scylladb: db: hints: add checksum to sync point encoding db: hints: add the version_size constant (cherry picked from commit `eb6202ef9c`) The only difference from the original merge commit is the include path of `xx_hasher.hh`. On branch 5.2, this file is in the root directory, not `utils`. Closes #16458	2023-12-19 17:39:50 +02:00
Michael Huang	5499f7b5a8	cdc: use chunked_vector for topology_description entries Lists can grow very big. Let's use a chunked vector to prevent large contiguous allocations. Fixes: #15302. Closes scylladb/scylladb#15428 (cherry picked from commit `62a8a31be7`)	2023-12-19 13:43:23 +01:00
Piotr Grabowski	7055ac45d1	test: use more frequent reconnection policy The default reconnection policy in Python Driver is an exponential backoff (with jitter) policy, which starts at 1 second reconnection interval and ramps up to 600 seconds. This is a problem in tests (refs #15104), especially in tests that restart or replace nodes. In such a scenario, a node can be unavailable for an extended period of time and the driver will try to reconnect to it multiple times, eventually reaching very long reconnection interval values, exceeding the timeout of a test. Fix the issue by using a exponential reconnection policy with a maximum interval of 4 seconds. A smaller value was not chosen, as each retry clutters the logs with reconnection exception stack trace. Fixes #15104 Closes #15112 (cherry picked from commit `17e3e367ca`)	2023-12-19 13:43:23 +01:00
Alexey Novikov	6bcf9e6631	When add duration field to UDT check whether this UDT is used in some clustering key Having values of the duration type is not allowed for clustering columns, because duration can't be ordered. This is correctly validated when creating a table but do not validated when we alter the type. Fixes #12913 Closes scylladb/scylladb#16022 (cherry picked from commit `bd73536b33`)	2023-12-19 06:58:41 -05:00
Botond Dénes	46a29e9a02	Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed. The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected. Fixes #6389. Closes #13150 * github.com:scylladb/scylladb: test/alternator: test concurrent TagResource / UntagResource db/tags: drop unsafe update_tags() utility function alternator: isolate concurrent modification to tags db/tags: add safe modify_tags() utility functions migration_manager: expose access to storage_proxy (cherry picked from commit `dba1d36aa6`) Closes #16453	2023-12-19 10:19:31 +02:00
Nadav Har'El	91e05dc646	cql: fix SELECT toJson() or SELECT JSON of time column The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column of type "time" forgot to put the time string in quotes. The result was invalid JSON. This is patch is a one-liner fixing this bug. This patch also removes the "xfail" marker from one xfailing test for this issue which now starts to pass. We also add a second test for this issue - the existing test was for "SELECT TOJSON(t)", and the second test shows that "SELECT JSON t" had exactly the same bug - and both are fixed by the same patch. We also had a test translated from Cassandra which exposed this bug, but that test continues to fail because of other bugs, so we just need to update the xfail string. The patch also fixes one C++ test, test/boost/json_cql_query_test.cc, which enshrined the wrong behavior - JSON output that isn't even valid JSON - and had to be fixed. Unlike the Python tests, the C++ test can't be run against Cassandra, and doesn't even run a JSON parser on the output, which explains how it came to enshrine wrong output instead of helping to discover the bug. Fixes #7988 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16121 (cherry picked from commit `8d040325ab`)	2023-12-18 18:19:20 +02:00
Michael Huang	af38b255c8	cql3: Fix invalid JSON parsing for JSON objects with ASCII keys For JSON objects represented as map<ascii, int>, don't treat ASCII keys as a nested JSON string. We were doing that prior to the patch, which led to parsing errors. Included the error offset where JSON parsing failed for rjson::parse related functions to help identify parsing errors better. Fixes: #7949 Signed-off-by: Michael Huang <michaelhly@gmail.com> Closes scylladb/scylladb#15499 (cherry picked from commit `75109e9519`)	2023-12-18 13:45:57 +02:00
Kamil Braun	9aaaa66981	Merge 'cql3: fix a few misformatted printouts of column names in error messages' from Nadav Har'El Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name. Fixes #13657 Closes #13658 * github.com:scylladb/scylladb: cql3: fix printing of column_specification::name in some error messages cql3: fix printing of column_definition::name in some error messages (cherry picked from commit `a29b8cd02b`)	2023-12-18 09:55:37 +02:00
Botond Dénes	64503a7137	Merge 'mutation_query: properly send range tombstones in reverse queries' from Michał Chojnowski reconcilable_result_builder passes range tombstone changes to _rt_assembler using table schema, not query schema. This means that a tombstone with bounds (a; b), where a < b in query schema but a > b in table schema, will not be emitted from mutation_query. This is a very serious bug, because it means that such tombstones in reverse queries are not reconciled with data from other replicas. If any queried replica has a row, but not the range tombstone which deleted the row, the reconciled result will contain the deleted row. In particular, range deletes performed while a replica is down will not later be visible to reverse queries which select this replica, regardless of the consistency level. As far as I can see, this doesn't result in any persistent data loss. Only in that some data might appear resurrected to reverse queries, until the relevant range tombstone is fully repaired. This series fixes the bug and adds a minimal reproducer test. Fixes #10598 Closes scylladb/scylladb#16003 * github.com:scylladb/scylladb: mutation_query_test: test that range tombstones are sent in reverse queries mutation_query: properly send range tombstones in reverse queries (cherry picked from commit `65e42e4166`)	2023-12-14 12:53:07 +02:00

1 2 3 4 5 ...

4251 Commits