scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 04:37:00 +00:00

Author	SHA1	Message	Date
Botond Dénes	ba373f83e4	Merge '[Backport 5.2] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk RPC calls lose information about the type of returned exception. Thus, if a table is dropped on receiver node, but it still exists on a sender node and sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks, if the exception was caused by table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime. Fixes: https://github.com/scylladb/scylladb/issues/17028. Fixes: https://github.com/scylladb/scylladb/issues/15370. Fixes: https://github.com/scylladb/scylladb/issues/15598. Closes #17528 * github.com:scylladb/scylladb: repair: handle no_such_column_family from remote node gracefully test: test drop table on receiver side during streaming streaming: fix indentation streaming: handle no_such_column_family from remote node gracefully repair: add methods to skip dropped table	2024-02-28 16:33:01 +02:00
Aleksandra Martyniuk	23493bb342	test: test drop table on receiver side during streaming (cherry picked from commit `2ea5d9b623`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	296be93714	test: add test to check if reader is closed Add test to check if reader is closed in sstable::has_partition_key. (cherry picked from commit `4530be9e5b`)	2024-02-26 16:17:12 +01:00
Avi Kivity	9c44bbce67	Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak Before this PR, writes to the previous CDC generations would always be rejected. After this PR, they will be accepted if the write's timestamp is greater than `now - generation_leeway`. This change was proposed around 3 years ago. The motivation was to improve user experience. If a client generates timestamps by itself and its clock is desynchronized with the clock of the node the client is connected to, there could be a period during generation switching when writes fail. We didn't consider this problem critical because the client could simply retry a failed write with a higher timestamp. Eventually, it would succeed. This approach is safe because these failed writes cannot have any side effects. However, it can be inconvenient. Writing to previous generations was proposed to improve it. The idea was rejected 3 years ago. Recently, it turned out that there is a case when the client cannot retry a write with the increased timestamp. It happens when a table uses CDC and LWT, which makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves writing to the CDC log table. If it fails, we get stuck. It's a major bug with an unknown perfect solution. Allowing writes to previous generations for `generation_leeway` is a probabilistic fix that should solve the problem in practice. Apart from this change, this PR adds tests for it and updates the documentation. This PR is sufficient to enable writes to the previous generations only in the gossiper-based topology. The Raft-based topology needs some adjustments in loading and cleaning CDC generations. These changes won't interfere with the changes introduced in this PR, so they are left for a follow-up. Fixes scylladb/scylladb#7251 Fixes scylladb/scylladb#15260 Closes scylladb/scylladb#17134 * github.com:scylladb/scylladb: docs: using-scylla: cdc: remove info about failing writes to old generations docs: dev: cdc: document writing to previous CDC generations test: add test_writes_to_previous_cdc_generations cdc: generation: allow increasing generation_leeway through error injection cdc: metadata: allow sending writes to the previous generations (cherry picked from commit `9bb4482ad0`) Backport note: replaced `servers_add` with `server_add` loop in tests replaced `error_injections_at_startup` (not implemented in 5.2) with `enable_injection` post-boot	2024-02-22 15:05:19 +01:00
Nadav Har'El	6a6115cd86	mv: fix missing view deletions in some cases of range tombstones For efficiency, if a base-table update generates many view updates that go the same partition, they are collected as one mutation. If this mutation grows too big it can lead to memory exhaustion, so since commit `7d214800d0` we split the output mutation to mutations no longer than 100 rows (max_rows_for_view_updates) each. This patch fixes a bug where this split was done incorrectly when the update involved range tombstones, a bug which was discovered by a user in a real use case (#17117). Range tombstones are read in two parts, a beginning and an end, and the code could split the processing between these two parts and the result that some of the range tombstones in update could be missed - and the view could miss some deletions that happened in the base table. This patch fixes the code in two places to avoid breaking up the processing between range tombstones: 1. The counter "_op_count" that decides where to break the output mutation should only be incremented when adding rows to this output mutation. The existing code strangely incrmented it on every read (!?) which resulted in the counter being incremented on every input fragment, and in particular could reach the limit 100 between two range tombstone pieces. 2. Moreover, the length of output was checked in the wrong place... The existing code could get to 100 rows, not check at that point, read the next input - half a range tombstone - and only then check that we reached 100 rows and stop. The fix is to calculate the number of rows in the right place - exactly when it's needed, not before the step. The first change needs more justification: The old code, that incremented _op_count on every input fragment and not just output fragments did not fit the stated goal of its introduction - to avoid large allocations. In one test it resulted in breaking up the output mutation to chunks of 25 rows instead of the intended 100 rows. But, maybe there was another goal, to stop the iteration after 100 input rows and avoid the possibility of stalls if there are no output rows? It turns out the answer is no - we don't need this _op_count increment to avoid stalls: The function build_some() uses `co_await on_results()` to run one step of processing one input fragment - and `co_await` always checks for preemption. I verfied that indeed no stalls happen by using the existing test test_long_skipped_view_update_delete_with_timestamp. It generates a very long base update where all the view updates go to the same partition, but all but the last few updates don't generate any view updates. I confirmed that the fixed code loops over all these input rows without increasing _op_count and without generating any view update yet, but it does NOT stall. This patch also includes two tests reproducing this bug and confirming its fixed, and also two additional tests for breaking up long deletions that I wanted to make sure doesn't fail after this patch (it doesn't). By the way, this fix would have also fixed issue #12297 - which we fixed a year ago in a different way. That issue happend when the code went through 100 input rows without generating any output rows, and incorrectly concluding that there's no view update to send. With this fix, the code no longer stops generating the view update just because it saw 100 input rows - it would have waited until it generated 100 output rows in the view update (or the input is really done). Fixes #17117 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17164 (cherry picked from commit `14315fcbc3`)	2024-02-22 15:36:58 +02:00
Michał Jadwiszczak	0d22471222	schema::describe: print 'synchronous_updates' only if it was specified While describing materialized view, print `synchronous_updates` option only if the tag is present in schema's extensions map. Previously if the key wasn't present, the default (false) value was printed. Fixes: #14924 Closes #14928 (cherry picked from commit `b92d47362f`)	2024-02-19 09:10:34 +02:00
Botond Dénes	422a731e85	query: do not kill unpaged queries when they reach the tombstone-limit The reason we introduced the tombstone-limit (query_tombstone_page_limit), was to allow paged queries to return incomplete/empty pages in the face of large tombstone spans. This works by cutting the page after the tombstone-limit amount of tombstones were processed. If the read is unpaged, it is killed instead. This was a mistake. First, it doesn't really make sense, the reason we introduced the tombstone limit, was to allow paged queries to process large tombstone-spans without timing out. It does not help unpaged queries. Furthermore, the tombstone-limit can kill internal queries done on behalf of user queries, because all our internal queries are unpaged. This can cause denial of service. So in this patch we disable the tombstone-limit for unpaged queries altogether, they are allowed to continue even after having processed the configured limit of tombstones. Fixes: #17241 Closes scylladb/scylladb#17242 (cherry picked from commit `f068d1a6fa`)	2024-02-15 12:50:30 +02:00
Botond Dénes	94af1df2cf	Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El This PR fixes the bug of certain calls to the `mintimeuuid()` CQL function which large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs. The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function. Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw. Fixes #17035 Closes scylladb/scylladb#17073 * github.com:scylladb/scylladb: utils: replace assert() by on_internal_error() utils: add on_internal_error with common logger utils: add a timeuuid minimum, like we had maximum test/cql-pytest: tests for "date" type (cherry picked from commit `2a4b991772`)	2024-02-07 14:19:32 +02:00
Kamil Braun	4e257c5c74	test_raft_snapshot_request: fix flakiness (again) At the end of the test, we wait until a restarted node receives a snapshot from the leader, and then verify that the log has been truncated. To check the snapshot, the test used the `system.raft_snapshots` table, while the log is stored in `system.raft`. Unfortunately, the two tables are not updated atomically when Raft persists a snapshot (scylladb/scylladb#9603). We first update `system.raft_snapshots`, then `system.raft` (see `raft_sys_table_storage::store_snapshot_descriptor`). So after the wait finishes, there's no guarantee the log has been truncated yet -- there's a race between the test's last check and Scylla doing that last delete. But we can check the snapshot using `system.raft` instead of `system.raft_snapshots`, as `system.raft` has the latest ID. And since `1640f83fdc`, storing that ID and truncating the log in `system.raft` happens atomically. Closes scylladb/scylladb#17106 (cherry picked from commit `c911bf1a33`)	2024-02-02 11:31:19 +01:00
Kamil Braun	08021dc906	test_raft_snapshot_request: fix flakiness Add workaround for scylladb/python-driver#295. Also an assert made at the end of the test was false, it is fixed with appropriate comment added. (cherry picked from commit `74bf60a8ca`)	2024-02-02 11:31:19 +01:00
Botond Dénes	ce0ed29ad6	Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify number of trailing entries left if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots. One use case for this API is to create group 0 snapshots in Scylla deployments which upgraded to Raft in version 5.2 and started with an empty Raft log with no snapshot at the beginning. This causes problems, e.g. when a new node bootstraps to the cluster, it will not receive a snapshot that would contain both schema and group 0 history, which would then lead to inconsistent schema state and trigger assertion failures as observed in scylladb/scylladb#16683. In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (`ff386e7a44`) but a problem remains with these existing deployments coming from 5.2, we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes). Another potential use case in the future would be to trigger snapshots based on external memory pressure in tablet Raft groups (for strongly consistent tables). The PR adds the API to `raft::server` and a HTTP endpoint that uses it. In a follow-up PR, we plan to modify group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade.) Closes scylladb/scylladb#16816 * github.com:scylladb/scylladb: raft: remove `empty()` from `fsm_output` test: add test for manual triggering of Raft snapshots api: add HTTP endpoint to trigger Raft snapshots raft: server: add `trigger_snapshot` API raft: server: track last persisted snapshot descriptor index raft: server: framework for handling server requests raft: server: inline `poll_fsm_output` raft: server: fix indentation raft: server: move `io_fiber`'s processing of `batch` to a separate function raft: move `poll_output()` from `fsm` to `server` raft: move `_sm_events` from `fsm` to `server` raft: fsm: remove constructor used only in tests raft: fsm: move trace message from `poll_output` to `has_output` raft: fsm: extract `has_output()` raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor` raft: server: pass `*_aborted` to `set_exception` call (cherry picked from commit `d202d32f81`) Backport notes: - `has_output()` has a smaller condition in the backported version (because the condition was smaller in `poll_output()`) - `process_fsm_output` has a smaller body (because `io_fiber` had a smaller body) in the backported version - the HTTP API is only started if `raft_group_registry` is started	2024-02-01 15:38:51 +01:00
Michael Huang	84004ab83c	raft: Store snapshot update and truncate log atomically In case the snapshot update fails, we don't truncate commit log. Fixes scylladb/scylladb#9603 Closes scylladb/scylladb#15540 (cherry picked from commit `1640f83fdc`)	2024-02-01 13:10:05 +01:00
Kamil Braun	753e2d3c57	service: raft: force initial snapshot transfer in new cluster When we upgrade a cluster to use Raft, or perform manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In `91f609d065` we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state. To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering snapshot transfer and forcing them to pull group 0 state. It's unnecessary to do this for cluster which bootstraps with Raft enabled right away but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that. Extend Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - node would never join, waiting for schema). Fixes: #14066 Closes #14336 (cherry picked from commit `ff386e7a44`) Backport note: contrary to the claims above, it turns out that it is actually necessary to create snapshots in clusters which bootstrap with Raft, because of tombstones in current schema state expire hence applying schema mutations from old Raft log entries is not really idempotent. Snapshot transfer, which transfers group 0 history and state_ids, prevents old entries from applying schema mutations over latest schema state. Ref: scylladb/scylladb#16683	2024-01-31 17:00:10 +01:00
Avi Kivity	351d6d6531	Merge 'Invalidate prepared statements for views when their schema changes.' from Eliran Sinvani When a base table changes and altered, so does the views that might refer to the added column (which includes "SELECT " views and also views that might need to use this column for rows lifetime (virtual columns). However the query processor implementation for views change notification was an empty function. Since views are tables, the query processor needs to at least treat them as such (and maybe in the future, do also some MV specific stuff). This commit adds a call to `on_update_column_family` from within `on_update_view`. The side effect true to this date is that prepared statements for views which changed due to a base table change will be invalidated. Fixes https://github.com/scylladb/scylladb/issues/16392 This series also adds a test which fails without this fix and passes when the fix is applied. Closes scylladb/scylladb#16897 github.com:scylladb/scylladb: Add test for mv prepared statements invalidation on base alter query processor: treat view changes at least as table changes (cherry picked from commit `5810396ba1`)	2024-01-23 21:31:47 +02:00
Michał Jadwiszczak	29da20b9e0	schema: add scylla specific options to schema description Add `paxos_grace_seconds`, `tombstone_gc`, `cdc` and `synchronous_updates` options to schema description. Fixes: #12389 Fixes: scylladb/scylla-enterprise#2979 Closes #16786	2024-01-16 09:56:08 +02:00
Aleksandra Martyniuk	8b77fbc904	cql3: statements: call check_restricted_table_properties in prepare Table properties validation is performed on statement execution. Thus, when one attempts to create a table with invalid options, an incorrect command gets committed in Raft. But then its application fails, leading to a raft machine being stopped. Check table properties when create and alter statements are prepared. The error is no longer returned as an exceptional future, but it is thrown. Adjust the tests accordingly. (cherry picked from commit `60fdc44bce`)	2024-01-11 16:10:26 +01:00
Nadav Har'El	ac0056f4bc	Merge 'Fix partition estimation with TWCS tables during streaming' from Raphael "Raph" Carvalho TWCS tables require partition estimation adjustment as incoming streaming data can be segregated into the time windows. Turns out we had two problems in this area that leads to suboptimal bloom filters. 1) With off-strategy enabled, data segregation is postponed, but partition estimation was adjusted as if segregation wasn't postponed. Solved by not adjusting estimation if segregation is postponed. 2) With off-strategy disabled, data segregation is not postponed, but streaming didn't feed any metadata into partition estimation procedure, meaning it had to assume the max windows input data can be segregated into (100). Solved by using schema's default TTL for a precise estimation of window count. For the future, we want to dynamically size filters (see https://github.com/scylladb/scylladb/issues/2024), especially for TWCS that might have SSTables that are left uncompacted until they're fully expired, meaning that the system won't heal itself in a timely manner through compaction on a SSTable that had partition estimation really wrong. Fixes https://github.com/scylladb/scylladb/issues/15704. Closes scylladb/scylladb#15938 * github.com:scylladb/scylladb: streaming: Improve partition estimation with TWCS streaming: Don't adjust partition estimate if segregation is postponed (cherry picked from commit `64d1d5cf62`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #16672	2024-01-08 09:06:43 +02:00
Kamil Braun	287546923e	Merge 'db: hints: add checksum to sync_point encoding' from Patryk Jędrzejczak Fixes #9405 `sync_point` API provided with incorrect sync point id might allocate crazy amount of memory and fail with `std::bad_alloc`. To fix this, we can check if the encoded sync point has been modified before decoding. We can achieve this by calculating a checksum before encoding, appending it to the encoded sync point, and compering it with a checksum calculated in `db::hints::decode` before decoding. Closes #14534 * github.com:scylladb/scylladb: db: hints: add checksum to sync point encoding db: hints: add the version_size constant (cherry picked from commit `eb6202ef9c`) The only difference from the original merge commit is the include path of `xx_hasher.hh`. On branch 5.2, this file is in the root directory, not `utils`. Closes #16458	2023-12-19 17:39:50 +02:00
Michael Huang	5499f7b5a8	cdc: use chunked_vector for topology_description entries Lists can grow very big. Let's use a chunked vector to prevent large contiguous allocations. Fixes: #15302. Closes scylladb/scylladb#15428 (cherry picked from commit `62a8a31be7`)	2023-12-19 13:43:23 +01:00
Piotr Grabowski	7055ac45d1	test: use more frequent reconnection policy The default reconnection policy in Python Driver is an exponential backoff (with jitter) policy, which starts at 1 second reconnection interval and ramps up to 600 seconds. This is a problem in tests (refs #15104), especially in tests that restart or replace nodes. In such a scenario, a node can be unavailable for an extended period of time and the driver will try to reconnect to it multiple times, eventually reaching very long reconnection interval values, exceeding the timeout of a test. Fix the issue by using a exponential reconnection policy with a maximum interval of 4 seconds. A smaller value was not chosen, as each retry clutters the logs with reconnection exception stack trace. Fixes #15104 Closes #15112 (cherry picked from commit `17e3e367ca`)	2023-12-19 13:43:23 +01:00
Alexey Novikov	6bcf9e6631	When add duration field to UDT check whether this UDT is used in some clustering key Having values of the duration type is not allowed for clustering columns, because duration can't be ordered. This is correctly validated when creating a table but do not validated when we alter the type. Fixes #12913 Closes scylladb/scylladb#16022 (cherry picked from commit `bd73536b33`)	2023-12-19 06:58:41 -05:00
Botond Dénes	46a29e9a02	Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed. The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected. Fixes #6389. Closes #13150 * github.com:scylladb/scylladb: test/alternator: test concurrent TagResource / UntagResource db/tags: drop unsafe update_tags() utility function alternator: isolate concurrent modification to tags db/tags: add safe modify_tags() utility functions migration_manager: expose access to storage_proxy (cherry picked from commit `dba1d36aa6`) Closes #16453	2023-12-19 10:19:31 +02:00
Nadav Har'El	91e05dc646	cql: fix SELECT toJson() or SELECT JSON of time column The implementation of "SELECT TOJSON(t)" or "SELECT JSON t" for a column of type "time" forgot to put the time string in quotes. The result was invalid JSON. This is patch is a one-liner fixing this bug. This patch also removes the "xfail" marker from one xfailing test for this issue which now starts to pass. We also add a second test for this issue - the existing test was for "SELECT TOJSON(t)", and the second test shows that "SELECT JSON t" had exactly the same bug - and both are fixed by the same patch. We also had a test translated from Cassandra which exposed this bug, but that test continues to fail because of other bugs, so we just need to update the xfail string. The patch also fixes one C++ test, test/boost/json_cql_query_test.cc, which enshrined the wrong behavior - JSON output that isn't even valid JSON - and had to be fixed. Unlike the Python tests, the C++ test can't be run against Cassandra, and doesn't even run a JSON parser on the output, which explains how it came to enshrine wrong output instead of helping to discover the bug. Fixes #7988 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#16121 (cherry picked from commit `8d040325ab`)	2023-12-18 18:19:20 +02:00
Michael Huang	af38b255c8	cql3: Fix invalid JSON parsing for JSON objects with ASCII keys For JSON objects represented as map<ascii, int>, don't treat ASCII keys as a nested JSON string. We were doing that prior to the patch, which led to parsing errors. Included the error offset where JSON parsing failed for rjson::parse related functions to help identify parsing errors better. Fixes: #7949 Signed-off-by: Michael Huang <michaelhly@gmail.com> Closes scylladb/scylladb#15499 (cherry picked from commit `75109e9519`)	2023-12-18 13:45:57 +02:00
Kamil Braun	9aaaa66981	Merge 'cql3: fix a few misformatted printouts of column names in error messages' from Nadav Har'El Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name. Fixes #13657 Closes #13658 * github.com:scylladb/scylladb: cql3: fix printing of column_specification::name in some error messages cql3: fix printing of column_definition::name in some error messages (cherry picked from commit `a29b8cd02b`)	2023-12-18 09:55:37 +02:00
Botond Dénes	64503a7137	Merge 'mutation_query: properly send range tombstones in reverse queries' from Michał Chojnowski reconcilable_result_builder passes range tombstone changes to _rt_assembler using table schema, not query schema. This means that a tombstone with bounds (a; b), where a < b in query schema but a > b in table schema, will not be emitted from mutation_query. This is a very serious bug, because it means that such tombstones in reverse queries are not reconciled with data from other replicas. If any queried replica has a row, but not the range tombstone which deleted the row, the reconciled result will contain the deleted row. In particular, range deletes performed while a replica is down will not later be visible to reverse queries which select this replica, regardless of the consistency level. As far as I can see, this doesn't result in any persistent data loss. Only in that some data might appear resurrected to reverse queries, until the relevant range tombstone is fully repaired. This series fixes the bug and adds a minimal reproducer test. Fixes #10598 Closes scylladb/scylladb#16003 * github.com:scylladb/scylladb: mutation_query_test: test that range tombstones are sent in reverse queries mutation_query: properly send range tombstones in reverse queries (cherry picked from commit `65e42e4166`)	2023-12-14 12:53:07 +02:00
Botond Dénes	33d2da94ab	reader_concurrency_semaphore: execution_loop(): trigger admission check when _ready_list is empty The execution loop consumes permits from the _ready_list and executes them. The _ready_list usually contains a single permit. When the _ready_list is not empty, new permits are queued until it becomes empty. The execution loops relies on admission checks triggered by the read releasing resouces, to bring in any queued read into the _ready_list, while it is executing the current read. But in some cases the current read might not free any resorces and thus fail to trigger an admission check and the currently queued permits will sit in the queue until another source triggers an admission check. I don't yet know how this situation can occur, if at all, but it is reproducible with a simple unit test, so it is best to cover this corner-case in the off-chance it happens in the wild. Add an explicit admission check to the execution loop, after the _ready_list is exhausted, to make sure any waiters that can be admitted with an empty _ready_list are admitted immediately and execution continues. Fixes: #13540 Closes #13541 (cherry picked from commit `b790f14456`)	2023-12-07 16:04:55 +02:00
Raphael S. Carvalho	1b8c078cab	test: Fix sporadic failures of database_test database_test is failing sporadically and the cause was traced back to commit `e3e7c3c7e5`. The commit forces a subset of tests in database_test, to run once for each of predefined x_log2_compaction_group settings. That causes two problems: 1) test becomes 240% slower in dev mode. 2) queries on system.auth is timing out, and the reason is a small table being spread across hundreds of compaction groups in each shard. so to satisfy a range scan, there will be multiple hops, making the overhead huge. additionally, the compaction group aware sstable set is not merged yet. so even point queries will unnecessarily scan through all the groups. Fixes #13660. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #13851 (cherry picked from commit `a7ceb987f5`)	2023-11-30 17:31:07 +02:00
Avi Kivity	40eed1f1c5	Merge 'schema_mutations, migration_manager: Ignore empty partitions in per-table digest' from Tomasz Grabiec Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs schema_ptr of exact schema version request by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if request flow has low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d` (5.2.0), it is more liekly to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change inroduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485. Manually tested using ccm on cluster upgrade scenarios and node restarts. Closes #14441 * github.com:scylladb/scylladb: test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled schema_mutations, migration_manager: Ignore empty partitions in per-table digest migration_manager, schema_tables: Implement migration_manager::reload_schema() schema_tables: Avoid crashing when table selector has only one kind of tables (cherry picked from commit `cf81eef370`)	2023-11-21 01:29:28 +01:00
Pavel Emelyanov	f76ba217e7	Merge 'api: failure_detector: invoke on shard 0' from Kamil Braun These APIs may return stale or simply incorrect data on shards other than 0. Newer versions of Scylla are better at maintaining cross-shard consistency, but we need a simple fix that can be easily and without risk be backported to older versions; this is the fix. Add a simple test to check that the `failure_detector/endpoints` API returns nonzero generation. Fixes: scylladb/scylladb#15816 Closes scylladb/scylladb#15970 * github.com:scylladb/scylladb: test: rest_api: test that generation is nonzero in `failure_detector/endpoints` api: failure_detector: fix indentation api: failure_detector: invoke on shard 0 (cherry picked from commit `9443253f3d`)	2023-11-07 15:12:12 +01:00
Botond Dénes	17e4d535db	test/cql-pytest/nodetool.py: no_autocompaction_context: use the correct API This `with` context is supposed to disable, then re-enable autocompaction for the given keyspaces, but it used the wrong API for it, it used the column_family/autocompaction API, which operates on column families, not keyspaces. This oversight led to a silent failure because the code didn't check the result of the request. Both are fixed in this patch: * switch to use `storage_service/auto_compaction/{keyspace}` endpoint * check the result of the API calls and report errors as exceptions Fixes: #13553 Closes #13568 (cherry picked from commit `66ee73641e`)	2023-11-07 13:59:01 +02:00
Tomasz Grabiec	573ef87245	Merge ' tool/scylla-sstable: more flexibility in obtaining the schema' from Botond Dénes scylla-sstable currently has two ways to obtain the schema: * via a `schema.cql` file. * load schema definition from memory (only works for system tables). This meant that for most cases it was necessary to export the schema into a CQL format and write it to a file. This is very flexible. The sstable can be inspected anywhere, it doesn't have to be on the same host where it originates form. Yet in many cases the sstable is inspected on the same host where it originates from. In this cases, the schema is readily available in the schema tables on disk and it is plain annoying to have to export it into a file, just to quickly inspect an sstable file. This series solves this annoyance by providing a mechanism to load schemas from the on-disk schema tables. Furthermore, an auto-detect mechanism is provided to detect the location of these schema tables based on the path of the sstable, but if that fails, the tool check the usual locations of the scylla data dir, the scylla confguration file and even looks for environment variables that tell the location of these. The old methods are still supported. In fact, if a schema.cql is present in the working directory of the tool, it is preferred over any other method, allowing for an easy force-override. If the auto-detection magic fails, an error is printed to the console, advising the user to turn on debug level logging to see what went wrong. A comprehensive test is added which checks all the different schema loading mechanisms. The documentation is also updated to reflect the changes. This change breaks the backward-compatibility of the command-line API of the tool, as `--system-schema` is now just a flag, the keyspace and table names are supplied separately via the new `--keyspace` and `--table` options. I don't think this will break anybody's workflow as this tools is still lightly used, exactly because of the annoying way the schema has to be provided. Hopefully after this series, this will change. Example: ``` $ ./build/dev/scylla sstable dump-data /var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine/me-1-big-Data.db {"sstables":{"/var/lib/scylla/data/ks/tbl2-d55ba230b9a811ed9ae8495671e9e4f8/quarantine//me-1-big-Data.db":[{"key":{"token":"-3485513579396041028","raw":"000400000000","value":"0"},"clustering_elements":[{"type":"clustering-row","key":{"raw":"","value":""},"marker":{"timestamp":1677837047297728},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1677837047297728,"value":"0"}}}]}]}} ``` As seen above, subdirectories like qurantine, staging etc are also supported. Fixes: https://github.com/scylladb/scylladb/issues/10126 Closes #13448 * github.com:scylladb/scylladb: test/cql-pytest: test_tools.py: add tests for schema loading test/cql-pytest: add no_autocompaction_context docs: scylla-sstable.rst: remove accidentally added copy-pasta docs: scylla-sstable.rst: remove paragraph with schema limitations docs: scylla-sstable.rst: update schema section test/cql-pytest: nodetool.py: add flush_keyspace() tools/scylla-sstable: reform schema loading mechanism tools/schema_loader: add load_schema_from_schema_tables() db/schema_tables: expose types schema (cherry picked from commit `952b455310`) Closes #15386	2023-11-02 17:25:18 +02:00
Botond Dénes	d606e9bfa2	Merge '[branch-5.2] Enable incremental compaction on off-strategy' from Raphael "Raph" Carvalho Off-strategy suffers with a 100% space overhead, as it adopted a sort of all or nothing approach. Meaning all input sstables, living in maintenance set, are kept alive until they're all reshaped according to the strategy criteria. Input sstables in off-strategy are very likely to be mostly disjoint, so it can greatly benefit from incremental compaction. The incremental compaction approach is not only good for decreasing disk usage, but also memory usage (as metadata of input and output live in memory), and file desc count, which takes memory away from OS. Turns out that this approach also greatly simplifies the off-strategy impl in compaction manager, as it no longer have to maintain new unused sstables and mark them for deletion on failure, and also unlink intermediary sstables used between reshape rounds. Fixes https://github.com/scylladb/scylladb/issues/14992. Backport notes: relatively easy to backport, had to include replica: Make compaction_group responsible for deleting off-strategy compaction input and compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0 Closes #15793 * github.com:scylladb/scylladb: test: Verify that off-strategy can do incremental compaction compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0 compaction: Clear pending_replacement list when tombstone GC is disabled compaction: Enable incremental compaction on off-strategy compaction: Extend reshape type to allow for incremental compaction compaction: Move reshape_compaction in the source compaction: Enable incremental compaction only if replacer callback is engaged replica: Make compaction_group responsible for deleting off-strategy compaction input	2023-10-30 12:00:54 +02:00
Avi Kivity	ea198d884d	cql3: grammar: reject intValue with no contents The grammar mistakenly allows nothing to be parsed as an intValue (itself accepted in LIMIT and similar clauses). Easily fixed by removing the empty alternative. A unit test is added. Fixes #14705. Closes #14707 (cherry picked from commit `e00811caac`)	2023-10-25 19:15:28 +03:00
Raphael S. Carvalho	b8c8794e14	test: Verify that off-strategy can do incremental compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-10-22 17:05:33 -03:00
Raphael S. Carvalho	6798f9676f	Resurrect optimization to avoid bloom filter checks during compaction Commit `8c4b5e4` introduced an optimization which only calculates max purgeable timestamp when a tombstone satisfy the grace period. Commit 'repair: Get rid of the gc_grace_seconds' inverted the order, probably under the assumption that getting grace period can be more expensive than calculating max purgeable, as repair-mode GC will look up into history data in order to calculate gc_before. This caused a significant regression on tombstone heavy compactions, where most of tombstones are still newer than grace period. A compaction which used to take 5s, now takes 35s. 7x slower. The reason is simple, now calculation of max purgeable happens for every single tombstone (once for each key), even the ones that cannot be GC'ed yet. And each calculation has to iterate through (i.e. check the bloom filter of) every single sstable that doesn't participate in compaction. Flame graph makes it very clear that bloom filter is a heavy path without the optimization: 45.64% 45.64% sstable_compact sstable_compaction_test_g [.] utils::filter::bloom_filter::is_present With its resurrection, the problem is gone. This scenario can easily happen, e.g. after a deletion burst, and tombstones becoming only GC'able after they reach upper tiers in the LSM tree. Before this patch, a compaction can be estimated to have this # of filter checks: (# of keys containing any tombstone) * (# of uncompacting sstable runs[1]) [1] It's # of runs, as each key tend to overlap with only one fragment of each run. After this patch, the estimation becomes: (# of keys containing a GC'able tombstone) * (# of uncompacting runs). With repair mode for tombstone GC, the assumption, that retrieval of gc_before is more expensive than calculating max purgeable, is kept. We can revisit it later. But the default mode, which is the "timeout" (i.e. gc_grace_seconds) one, we still benefit from the optimization of deferring the calculation until needed. Cherry picked from commit `38b226f997` Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes #14091. Closes #13908 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #15744	2023-10-20 09:34:53 +03:00
Botond Dénes	ca8723a6fd	Merge 'gossiper: add get_unreachable_members_synchronized and use over api' from Benny Halevy Modeled after get_live_members_synchronized, get_unreachable_members_synchronized calls replicate_live_endpoints_on_change to synchronize the state of unreachable_members on all shards. Fixes #12261 Fixes #15088 Also, add rest_api unit test for those apis Closes #15093 * github.com:scylladb/scylladb: test: rest_api: add test_gossiper gossiper: add get_unreachable_members_synchronized (cherry picked from commit `57deeb5d39`) Backport note: `gossiper::lock_endpoint_update_semaphore` helper function was missing, replaced with `get_units(g._endpoint_update_semaphore, 1)`	2023-09-27 15:09:32 +02:00
Avi Kivity	34e0afb18a	Merge "auth: do not grant permissions to creator without actually creating" from Wojciech Mitros Currently, when creating the table, permissions may be mistakenly granted to the user even if the table is already existing. This can happen in two cases: The query has a IF NOT EXISTS clause - as a result no exception is thrown after encountering the existing table, and the permission granting is not prevented. The query is handled by a non-zero shard - as a result we accept the query with a bounce_to_shard result_message, again without preventing the granting of permissions. These two cases are now avoided by checking the result_message generated when handling the query - now we only grant permissions when the query resulted in a schema_change message. Additionally, a test is added that reproduces both of the mentioned cases. CVE-2023-33972 Fixes #15467. * 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq: auth: do not grant permissions to creator without actually creating transport: add is_schema_change() method to result_message (cherry picked from commit `ab6988c52f`)	2023-09-19 01:47:27 +03:00
Jan Ciolek	cd9458eeb1	cql.g: make the parser reject INSERT JSON without a JSON value We allow inserting column values using a JSON value, eg: ```cql INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}'; ``` When no JSON value is specified, the query should be rejected. Scylla used to crash in such cases. A recent change fixed the crash (https://github.com/scylladb/scylladb/pull/14706), it now fails on unwrapping an uninitialized value, but really it should be rejected at the parsing stage, so let's fix the grammar so that it doesn't allow JSON queries without JSON values. A unit test is added to prevent regressions. Refs: https://github.com/scylladb/scylladb/pull/14707 Fixes: https://github.com/scylladb/scylladb/issues/14709 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> \Closes #14785 (cherry picked from commit `cbc97b41d4`)	2023-09-14 21:07:21 +03:00
Nadav Har'El	e917b874f9	test/alternator: fix flaky test test_ttl_expiration_gsi_lsi The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky. The test incorrectly assumes that when we write an already expired item, it will be visible for a short time until being deleted by the TTL thread. But this doesn't need to be true - if the test is slow enough, it may go look or the item after it was already expired! So we fix this test by splitting it into two parts - in the first part we write a non-expiring item, and notice it eventually appears in the GSI, LSI, and base-table. Then we write the same item again, with an expiration time - and now it should eventually disappear from the GSI, LSI and base-table. This patch also fixes a small bug which prevented this test from running on DynamoDB. Fixes #14495 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14496 (cherry picked from commit `599636b307`)	2023-09-14 20:43:51 +03:00
Petr Gusev	a83c0a8bbc	test_secondary_index_collections: change insert/create index order Secondary index creation is asynchronous, meaning it takes time for existing data to be reflected within the index. However, new data added after the index is created should appear in it immediately. The test consisted of two parts. The first created a series of indexes for one table, added test data to the table, and then ran a series of checks. In the second part, several new indexes were added to the same table, and checks were made to make sure that already existing data would appear in them. This last part was flaky. The patch just moves the index creation statements from the second part to the first. Fixes: #14076 Closes #14090 (cherry picked from commit `0415ac3d5f`) Closes #15101	2023-08-24 14:09:08 +03:00
Botond Dénes	098baaef48	Merge 'cql: add missing functions for the COUNTER column type' from Nadav Har'El We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742). As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass. Fixes #14501 Fixes #14742 Closes #14745 * github.com:scylladb/scylladb: test/cql-pytest: test confirming that casting to counter doesn't work cql: support casting of counter to other types cql: implement missing counterasblob() and blobascounter() functions cql: implement missing type functions for "counters" type (cherry picked from commit `a637ddd09c`)	2023-08-13 14:53:48 +03:00
Nadav Har'El	e11561ef65	cql-pytest: translate Cassandra's tests for compact tables This is a translation of Cassandra's CQL unit test source file validation/operations/CompactStorageTest.java into our cql-pytest framework. This very large test file includes 86 tests for various types of operations and corner cases of WITH COMPACT STORAGE tables. All 86 tests pass on Cassandra (except one using a deprecated feature that needs to be specially enabled). 30 of the tests fail on Scylla reproducing 7 already-known Scylla issues and 7 previously-unknown issues: Already known issues: Refs #3882: Support "ALTER TABLE DROP COMPACT STORAGE" Refs #4244: Add support for mixing token, multi- and single-column restrictions Refs #5361: LIMIT doesn't work when using GROUP BY Refs #5362: LIMIT is not doing it right when using GROUP BY Refs #5363: PER PARTITION LIMIT doesn't work right when using GROUP BY Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax Refs #8627: Cleanly reject updates with indexed values where value > 64k New issues: Refs #12471: Range deletions on COMPACT STORAGE is not supported Refs #12474: DELETE prints misleading error message suggesting ALLOW FILTERING would work Refs #12477: Combination of COUNT with GROUP BY is different from Cassandra in case of no matches Refs #12479: SELECT DISTINCT should refuse GROUP BY with clustering column Refs #12526: Support filtering on COMPACT tables Refs #12749: Unsupported empty clustering key in COMPACT table Refs #12815: Hidden column "value" in compact table isn't completely hidden Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12816 (cherry picked from commit `328cdb2124`)	2023-08-13 14:44:19 +03:00
Nadav Har'El	e03c21a83b	cql-pytest: translate Cassandra's tests for CAST operations This is a translation of Cassandra's CQL unit test source file functions/CastFctsTest.java into our cql-pytest framework. There are 13 tests, 9 of them currently xfail. The failures are caused by one recently-discovered issue: Refs #14501: Cannot Cast Counter To Double and by three previously unknown or undocumented issues: Refs #14508: SELECT CAST column names should match Cassandra's Refs #14518: CAST from timestamp to string not same as Cassandra on zero milliseconds Refs #14522: Support CAST function not only in SELECT Curiously, the careful translation of this test also caused me to find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647 which the test in Java missed because it made the same mistake as the implementation. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14528 (cherry picked from commit `f08bc83cb2`)	2023-08-13 14:41:36 +03:00
Nadav Har'El	79b5befe65	test/cql-pytest: add tests for data casts and inf in sums This patch adds tests to reproduce issue #13551. The issue, discovered by a dtest (cql_cast_test.py), claimed that either cast() or sum(cast()) from varint type broke. So we add two tests in cql-pytest: 1. A new test file, test_cast_data.py, for testing data casts (a CAST (...) as ... in a SELECT), starting with testing casts from varint to other types. The test uncovers a lot of interesting cases (it is heavily commented to explain these cases) but nothing there is wrong and all tests pass on Scylla. 2. An xfailing test for sum() aggregate of +Inf and -Inf. It turns out that this caused #13551. In Cassandra and older Scylla, the sum returned a NaN. In Scylla today, it generates a misleading error message. As usual, the tests were run on both Cassandra (4.1.1) and Scylla. Refs #13551. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `78555ba7f1`)	2023-08-13 14:40:08 +03:00
Petr Gusev	aca9e41a44	topology.cc: remove_endpoint: _dc_racks removal fix The eps reference was reused to manipulate the racks dictionary. This resulted in assigning a set of nodes from the racks dictionary to an element of the _dc_endpoints dictionary. This is a backport of `bcb1d7c` to branch-5.2. Refs: #14184 Closes #14893	2023-08-11 14:29:37 +03:00
Botond Dénes	bcb8f6a8dd	Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak If semaphore mismatch occurs, check whether both semaphores belong to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error. Until now, semaphore mismatch was only checked in multi-partition queries. The PR pushes the check to `querier_cache` and perform it on all `lookup__querier` methods. The mismatch can happen if user's scheduling group changed during a query. We don't want to throw an error then, but drop and reset cached reader. This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it. Refers: https://github.com/scylladb/scylla-enterprise/issues/3182 Refers: https://github.com/scylladb/scylla-enterprise/issues/3050 Closes: #14770 Closes #14736 github.com:scylladb/scylladb: querier_cache: add stats of scheduling group mismatches querier_cache: check semaphore mismatch during querier lookup querier_cache: add reference to `replica::database::is_user_semaphore()` replica:database: add method to determine if semaphore is user one (cherry picked from commit `a8feb7428d`)	2023-08-09 10:20:53 +03:00
Nadav Har'El	e34c62c567	Merge 'view_updating_consumer: account empty partitions memory usage' from Botond Dénes Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions. This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed. Fixes: #14819 Closes #14821 * github.com:scylladb/scylladb: test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations db/view/view_updating_consumer: account for the size of mutations mutation/mutation_rebuilder*: return const mutation& from consume_new_partition() mutation/mutation: add memory_usage() (cherry picked from commit `056d04954c`)	2023-07-31 03:43:44 -04:00
Nadav Har'El	992c50173a	Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek LWT queries with empty clustering range used to cause a crash. For example in: ```cql UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3 ``` The range of `c` is empty - there are no valid values. This caused a segfault when accessing the `first` range: ```c++ op.ranges.front() ``` Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restriction on the same clustering column when an IF is involved. We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different. We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change. A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash. Fixes: https://github.com/scylladb/scylladb/issues/13129 Closes #14429 * github.com:scylladb/scylladb: modification_statement: reject conditional statements with empty clustering key statements/cas_request: fix crash on empty clustering range in LWT (cherry picked from commit `49c8c06b1b`)	2023-07-31 09:14:55 +03:00
Raphael S. Carvalho	d2369fc546	cached_file: Evict unused pages that aren't linked to LRU yet It was found that cached_file dtor can hit the following assert after OOM cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.` cached_file's dtor iterates through all entries and evict those that are linked to LRU, under the assumption that all unused entries were linked to LRU. That's partially correct. get_page_ptr() may fetch more than 1 page due to read ahead, but it will only call cached_page::share() on the first page, the one that will be consumed now. share() is responsible for automatically placing the page into LRU once refcount drops to zero. If the read is aborted midway, before cached_file has a chance to hit the 2nd page (read ahead) in cache, it will remain there with refcount 0 and unlinked to LRU, in hope that a subsequent read will bring it out of that state. Our main user of cached_file is per-sstable index caching. If the scenario above happens, and the sstable and its associated cached_file is destroyed, before the 2nd page is hit, cached_file will not be able to clear all the cache because some of the pages are unused and not linked. A page read ahead will be linked into LRU so it doesn't sit in memory indefinitely. Also allowing for cached_file dtor to clear all cache if some of those pages brought in advance aren't fetched later. A reproducer was added. Fixes #14814. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14818 (cherry picked from commit `050ce9ef1d`)	2023-07-28 13:56:28 +02:00

1 2 3 4 5 ...

4227 Commits