This series introduces workload prioritization: an extension of the service levels feature that allows specifying "shares" per service level. The number of shares determines the priority of the user to whom the service level is attached (if multiple service levels are attached, the one with the lowest shares wins).
Different service levels will be isolated in the following way:
- Each service level gets its own scheduling group whose shares match the service level's configured shares; this controls the CPU and I/O priority of user operations running under that service level.
- Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates.
- Each service level gets its own TCP connections for RPC to prevent priority inversion issues.
Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user-created service levels plus 1 created by default that cannot be removed.
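For illustration, here is a minimal standalone sketch (plain C++, not the actual ScyllaDB types) of the selection rule above: each service level carries a number of shares, and when a role has several levels attached, the level with the lowest shares wins.
```
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Illustrative model only: in ScyllaDB each service level is backed by its
// own scheduling group and reader concurrency semaphores; here we only model
// how the effective level is chosen for a user.
struct service_level {
    std::string name;
    uint32_t shares;   // higher shares -> higher CPU/I/O priority
};

// If multiple service levels are attached to a role, the effective one is the
// level with the lowest number of shares.
std::optional<service_level> effective_level(const std::vector<service_level>& attached) {
    std::optional<service_level> winner;
    for (const auto& sl : attached) {
        if (!winner || sl.shares < winner->shares) {
            winner = sl;
        }
    }
    return winner;
}

int main() {
    std::vector<service_level> attached = {{"interactive", 800}, {"batch", 200}};
    if (auto sl = effective_level(attached)) {
        std::cout << sl->name << " wins with " << sl->shares << " shares\n"; // "batch wins with 200 shares"
    }
}
```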
This feature was previously available only in ScyllaDB Enterprise and has now been made available in source-available ScyllaDB. The series was created by comparing the master branch with the source-available-workbranch / enterprise branch, taking the workload-prioritization-related parts from the diff, and then molding the resulting diff into a proper series. Some very minor changes were made, such as fixing whitespace, removing unused or unnecessary code, and adding some missing boilerplate (in api/); otherwise no major changes have been made.
No backport is required.
Closes scylladb/scylladb#22031
* github.com:scylladb/scylladb:
tracing: record scheduling group in trace event record
qos: un-shared-from-this standard_service_level_distributed_data_accessor
alternator: execute under scheduling group for service level
test.py: support multiple commands in prepare_cql in suite.yml
docs: add documentation for workload prioritization
docs/dev: describe workload prioritization features in service_levels
test/auth_cluster: test workload prioritization in service level tests
cqlpy/test_service_levels: add workload prioritization tests
api: introduce service levels specific API
api/cql_server_test: add information about scheduling group
db/virtual_tables: add scheduling group column to system.clients
test/boost: update service_level_controller_test for workload prio
qos: include number of shares in DESCRIBE
cql3/statements: update SL statements for workload prioritization
transport/server: use scheduling group assigned to current user
messaging_service: use separate set of connections per service levels
replica/database: add reader concurrency semaphore groups
qos: manage and assign scheduling groups to service levels
qos: use the shares field in service level reads/writes
qos: add shares to service_level_options
qos: explicitly specify columns when querying service level tables
db/system_distributed_keyspace: add shares column and upgrade code
db/system_keyspace: adjust SL schema for workload prioritization
gms: introduce WORKLOAD_PRIORITIZATION cluster feature
build: increase the max number of scheduling groups
qos: return correct error code when SL does not exist
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.
`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.
`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.
This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
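For illustration, below is a minimal sketch of the kind of sampling described above (a bounded, uniformly random sample of a stream); the class name and chunk granularity are assumptions made for the example, not the actual `dict_sampler` API.
```
#include <cstddef>
#include <random>
#include <vector>

// Illustrative sketch only: keep a bounded, uniformly random sample of the
// chunks seen in a byte stream (classic reservoir sampling), so a
// multi-megabyte training sample can be maintained without retaining all
// RPC traffic.
class chunk_reservoir {
    std::vector<std::vector<std::byte>> _sample;
    size_t _capacity;            // number of chunks to keep
    size_t _seen = 0;            // chunks observed so far
    std::mt19937_64 _rng{42};
public:
    explicit chunk_reservoir(size_t capacity) : _capacity(capacity) {}

    // Feed one chunk of the merged RPC stream.
    void ingest(std::vector<std::byte> chunk) {
        ++_seen;
        if (_sample.size() < _capacity) {
            _sample.push_back(std::move(chunk));
            return;
        }
        // Keep the new chunk with probability capacity/seen, replacing a
        // random resident chunk; every chunk stays equally likely to survive.
        std::uniform_int_distribution<size_t> dist(0, _seen - 1);
        if (auto idx = dist(_rng); idx < _capacity) {
            _sample[idx] = std::move(chunk);
        }
    }

    // Copy of the current sample, e.g. to hand to the dictionary trainer.
    std::vector<std::vector<std::byte>> snapshot() const { return _sample; }
};
```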
Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries
as the network traffic compressors for Seastar's RPC servers.
The main jobs of this glue are:
1. Implementing the API expected by Seastar from RPC compressors.
2. Exposing metrics about the effectiveness of the compression.
3. Allowing algorithms and dictionaries to be switched dynamically on a running
connection, without any extra waits.
The biggest design decision here is that the choice of algorithm and dictionary
is negotiated by both sides of the connection, not dictated unilaterally by the
sender.
The negotiation algorithm is fairly complicated (a TLA+ model validating
it is included in the commit). Unilateral compression choice would be much simpler.
However, negotiation avoids re-sending the same dictionary over every
connection in the cluster after dictionary updates (with one-way communication,
it's the only reliable way to ensure that our receiver possesses the dictionary
we are about to start using), lets receivers ask for a cheaper compression mode
if they want, and lets them refuse to update a dictionary if they don't think
they have enough free memory for that.
In hindsight, those properties probably weren't worth the extra complexity and
extra development effort.
Zstd can be quite expensive, so this patch also includes a mechanism which
temporarily downgrades the compressor from zstd to lz4 if zstd has been
using too much CPU in a given slice of time. But it should be noted that
this can't be treated as a reliable "protection" from negative performance
effects of zstd, since a downgrade can happen on the sender side,
and receivers are at the mercy of senders.
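For illustration, a minimal sketch of such a downgrade mechanism: account the CPU time spent in zstd within a fixed time slice and fall back to lz4 for the rest of the slice once a budget is exceeded. The slice length and budget fraction are made-up numbers, not the actual values used by the tracker.
```
#include <chrono>

// Illustrative sketch only: temporarily prefer lz4 over zstd once zstd has
// used more than its CPU budget within the current time slice.
class compression_cpu_limiter {
    using clock = std::chrono::steady_clock;
    std::chrono::milliseconds _slice{100};   // accounting slice (assumed value)
    double _budget_fraction = 0.05;          // CPU share allowed for zstd (assumed value)
    clock::time_point _slice_start = clock::now();
    std::chrono::microseconds _spent{0};     // zstd CPU time spent in this slice
public:
    // Record CPU time spent compressing one message with zstd.
    void account(std::chrono::microseconds zstd_cpu_time) {
        _spent += zstd_cpu_time;
    }
    // Decide which algorithm the next message should use.
    bool use_zstd() {
        auto now = clock::now();
        if (now - _slice_start >= _slice) {  // new slice: reset the budget
            _slice_start = now;
            _spent = std::chrono::microseconds{0};
        }
        auto slice_us = std::chrono::duration_cast<std::chrono::microseconds>(_slice).count();
        return _spent.count() < static_cast<long long>(slice_us * _budget_fraction);
    }
};
```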
Currently, truncating a table works by issuing an RPC to all the nodes, each of which calls `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.
This works with tablets, but is not safe: a concurrent replication process may bring back old data.
This change makes TRUNCATE TABLE a topology operation, so that it is mutually exclusive with other processes in the system that could interfere with it. More specifically, it makes TRUNCATE a global topology request.
Backporting is not needed.
Fixes #16411
Closes scylladb/scylladb#19789
* github.com:scylladb/scylladb:
docs: topology-over-raft: Document truncate_table request
storage_proxy: fix indentation and remove empty catch/rethrow
test: add tests for truncate with tablets
storage_proxy: use new TRUNCATE for tablets
truncate: make TRUNCATE a global topology operation
storage_service: move logic of wait_for_topology_request_completion()
RPC: add truncate_with_tablets RPC with frozen_topology_guard
feature_service: added cluster feature for system.topology schema change
system.topology_requests: change schema
storage_proxy: propagate group0 client and TSM dependency
The goal of merge is to reduce the tablet count for a shrinking table, similar to how split increases the count while the table is growing. The load balancer's decision to merge is already implemented (it came with the infrastructure introduced for split), but it wasn't acted on until now.
The initial tablet count is respected while the table is in "growing mode"; the table leaves this mode once it needs to split beyond the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. A merge decision is emitted if the average tablet size falls to 50% of the target. Hysteresis is applied to avoid oscillations between splits and merges.
As with split, the decision to merge is recorded in the tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so the new coordinator continues from where the old one left off.
Unlike split, the preparation phase for merge is not done by the replica (with split compactions) but by the coordinator, by co-locating sibling tablets on the same node's shard. Sibling tablets are tablets that cover contiguous token ranges and will become one tablet after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets 0 and 1 are siblings, and so are tablets 2 and 3.
The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it and emits migrations so that the "odd" tablet follows the "even" one; for example, tablet 1 is migrated to where tablet 0 lives. Co-location is low priority: delaying a merge is not the end of the world, whereas delaying e.g. decommission or even regular load balancing can translate into temporary imbalance that impacts user activity. So co-location migrations happen only when there is no more important work to do.
While regular balancing has higher priority, it will not undo the co-location work done so far. It achieves this by treating co-located tablets as if they were already merged. The load-inversion convergence check was adjusted so the balancer understands when two tablets are being migrated instead of one, to avoid oscillations.
When the balancer completes co-location work for a table undergoing merge, it puts the id of the table into the resize_plan, which communicates to the topology coordinator that the table is ready for merge. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map in group0. All the replicas react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.
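For illustration, a minimal sketch of the sibling relationship implied by the power-of-two scheme above (the helper names are hypothetical, not the actual `tablet_map` API):
```
#include <cassert>
#include <cstdint>
#include <utility>

// Illustrative sketch only: with a power-of-two tablet count, tablets pair up
// as (0,1), (2,3), ... Each pair covers a contiguous token range and collapses
// into a single tablet when the map is halved on merge.
using tablet_id = uint64_t;

tablet_id sibling_of(tablet_id id) {
    return id ^ 1;                           // flip the lowest bit: 0<->1, 2<->3, ...
}

std::pair<tablet_id, tablet_id> sibling_pair(tablet_id id) {
    return {id & ~tablet_id(1), id | 1};     // (even, odd) members of the pair
}

tablet_id merged_id(tablet_id id) {
    return id / 2;                           // id of the pair in the halved map
}

int main() {
    assert(sibling_of(0) == 1 && sibling_of(3) == 2);
    assert(merged_id(0) == 0 && merged_id(1) == 0 && merged_id(3) == 1);
}
```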
Fixes#18181.
system test details:
test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml
instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.
latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode: write
99.9th: 3.145727ms
99th: 1.998847ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: read
99.9th: 3.145727ms
99th: 2.031615ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: write
99.9th: 3.047423ms
99th: 1.933311ms
99.9th: 3.047423ms
99th: 1.933311ms
Mode: read
99.9th: 3.145727ms
99th: 1.900543ms
99.9th: 3.145727ms
99th: 1.900543ms
Mode: write
99.9th: 5.079039ms
99th: 3.604479ms
99.9th: 35.389439ms
99th: 25.624575ms
Mode: write
99.9th: 3.047423ms
99th: 1.998847ms
99.9th: 3.047423ms
99th: 1.998847ms
Mode: read
99.9th: 3.080191ms
99th: 2.031615ms
99.9th: 3.112959ms
99th: 2.031615ms
```
Closes scylladb/scylladb#20572
* github.com:scylladb/scylladb:
docs: Document tablet merging
tests/boost: Add test to verify correctness of balancer decisions during merge
tests/topology_experimental_raft: Add tablet merge test
service: Handle exception when retrying split
service: Co-locate sibling tablets for a table undergoing merge
gms: Add cluster feature for tablet merge
service: Make merge of resize plan commutative
replica: Implement merging of compaction groups on merge completion
replica: Handle tablet merge completion
service: Implement tablet map resize for merge
locator: Introduce merge_tablet_info()
service: Rename topology::transition_state::tablet_split_finalization
service: Respect initial_tablet_count if table is in growing mode
service: Wire migration_tablet_set into the load balancer
locator: Add tablet_map::sibling_tablets()
service: Introduce sorted_replicas_for_tablet_load()
locator/tablets: Extend tablet_replica equality comparator to three-way
service: Introduce alias to per-table candidate map type
service: Add replication constraint check variant for migration_tablet_set
service: Add convergence check variant for migration_tablet_set
service: Add migration helpers for migration_tablet_set
service/tablet_allocator: Introduce migration_tablet_set
service: Introduce migration_plan::add(migrations_vector)
locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
locator/tablets: Introduce tablet_map::needs_merge()
locator/tablets: Introduce resize_decision::initial_decision()
locator/tablets: Fix return type of three-way comparison operators
service: Extract update of node load on migrations
service: Extract converge check for intra-node migration
service: Extract erase of tablet replicas from candidate list
scripts/tablet-mon: Allow visualization of tablet id
Small adjustments and improvements to the documentation in the raft
section.
Fixing Markdown lint warnings:
- MD004/ul-style: Unordered list style [Expected: dash; Actual: asterisk]
- MD007/ul-indent: Unordered list indentation [Expected: 0; Actual: 2]
- MD032/blanks-around-lists: Lists should be surrounded by blank lines
- MD036/no-emphasis-as-heading: Emphasis used instead of a heading
- MD046/code-block-style: Code block style [Expected: fenced; Actual: indented]
Closes scylladb/scylladb#21780
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.
The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating the topology system table
whose last state was split finalization.
NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The schema of system tables is defined statically, and table_schema_version needs to be explicitly set in code like this:
```
builder.with_version(system_keyspace::generate_schema_version(table_id, version_offset));
```
Whenever the schema is changed, the schema version needs to change; otherwise we hit undefined behavior when trying to interpret mutation data created with the old schema using the new schema.
It's not obvious that one needs to do that, and developers often forget. There were several instances of such mistakes of omission, some caught during review, some not, e.g. 31ea74b96e.
This patch changes the definitions to call the new `schema_builder::with_hash_version()`, which makes the schema builder compute the version from the schema definition, so that changes to the schema automatically change the version. This way we no longer rely on the developer remembering to bump the version offset.
All nodes should arrive at the same version, which is verified by existing `test_group0_schema_versioning` and a new unit test: `test_system_schema_version_is_stable`.
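For illustration, a minimal sketch of the idea behind deriving the version from the definition (a toy hash over columns; not the actual `with_hash_version()` implementation):
```
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Illustrative sketch only: derive a version deterministically from the
// schema definition so that any change to the definition changes the version
// without a manual bump of a version offset.
// NOTE: the real code must use a hash that is stable across nodes and builds;
// std::hash is used here only to keep the toy example self-contained.
struct column_def {
    std::string name;
    std::string type;
};

uint64_t hash_schema_definition(const std::string& ks, const std::string& cf,
                                const std::vector<column_def>& columns) {
    uint64_t h = std::hash<std::string>{}(ks + "/" + cf);
    for (const auto& c : columns) {
        // Fold each column into the running hash (hash_combine style).
        uint64_t ch = std::hash<std::string>{}(c.name + ":" + c.type);
        h ^= ch + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    }
    return h;
}

int main() {
    std::vector<column_def> cols = {{"key", "text"}, {"value", "bigint"}};
    std::cout << std::hex << hash_schema_definition("system", "example", cols) << "\n";
    cols.push_back({"added_later", "int"});   // changing the definition changes the hash
    std::cout << std::hex << hash_schema_definition("system", "example", cols) << "\n";
}
```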
Closes scylladb/scylladb#21602
* github.com:scylladb/scylladb:
system_tables: Compute schema version automatically
schema_builder: Introduce with_hash_version()
schema: Store raw_view_info in schema::raw_schema
schema: Remove dead comment
hashing: Add hasher for unordered_map
hashing: Add hasher for unique_ptr
hashing: Add hasher for double
[avi: add missing include <memory> to hashing.hh]
This depends on the previous change to the schema_builder, which makes
version computation depend on the definition only instead of being a
new time UUID.
This way we avoid a common mistake: the schema of a system table is
extended but we forget to bump the version passed to .with_version().
Python and Python developers don't like directory names to include a
minus sign, like "cql-pytest". In this patch we rename test/cql-pytest
to test/cqlpy, and also change a few references in other code (e.g., code
that used test/cql-pytest/run.py) and also references to this test suite
in documentation and comments.
Arguably, the word "test" was always redundant in test/cql-pytest, and
I want to leave the "py" in test/cqlpy to emphasize that it's Python-based
tests, contrasting with test/cql which are CQL-request-only approval
tests.
Fixes#20846
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Cassandra 4.1 introduced a new option for creating a role:
`HASHED PASSWORD`. Example:
```
CREATE ROLE bob WITH HASHED PASSWORD = 'hashed_password';
```
We've already introduced another option following the same
semantics: `SALTED HASH`; example:
```
CREATE ROLE bob WITH SALTED HASH = 'salted_hash';
```
The change hasn't made it to any release yet, so in this commit
we rename it to `HASHED PASSWORD` to be compatible with Cassandra.
Additionally, we adjust existing tests to work against Cassandra too.
Fixes scylladb/scylladb#21350
Closes scylladb/scylladb#21352
Single-row reads from a large partition issue 64 KiB reads to the data file,
which is equal to the default span of a promoted index block in the data file.
If users want to increase the selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses the promoted index
to look up the start position of the read in the data file, but the end position
will in practice extend to the next partition, and the amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least two I/Os of 32 KiB each.
There is already infrastructure to look up the end position based on the upper
bound of the read, added in anticipation of sharing the promoted index cache,
but it's not effective because it's a non-populating lookup and the upper-bound
cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file I/O to read the upper bound. If the upper bound is far enough
from the lower bound, such a lookup would only increase the cost of the read.
The solution employed here is to warm up the lower-bound cursor's
cache before positions are computed, and to use that cursor for a
non-populating lookup of the upper bound.
We use the lower-bound cursor and the slice's lower bound so that we
read the same blocks as the later lower-bound slicing would, and thus
don't incur extra I/O in cases where looking up the upper bound is not
worth it, that is, when the upper bound is far from the lower bound. If
the upper bound is near the lower bound, then warming up using the lower bound
will populate cached_promoted_index with blocks which allow us to
locate the upper-bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same key: both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if the upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
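For illustration, a minimal model of how a warmed promoted-index cache bounds the data file read (the struct and function below are hypothetical, not the actual cached_promoted_index types): with the relevant blocks cached, both bounds of a single-row read typically resolve to the same block, so the read covers one block's span instead of running to the next partition.
```
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Illustrative model only: each promoted index block covers a clustering-key
// range and a span of the data file. Assumes `blocks` is non-empty and sorted
// by first_key.
struct pi_block {
    std::string first_key;      // first clustering key covered by the block
    uint64_t data_file_start;   // start offset of the block in the data file
    uint64_t data_file_end;     // end offset of the block in the data file
};

// Data file range to read for clustering keys in [lower, upper].
std::pair<uint64_t, uint64_t> data_range(const std::vector<pi_block>& blocks,
                                         const std::string& lower,
                                         const std::string& upper) {
    auto key_less = [](const std::string& k, const pi_block& b) { return k < b.first_key; };
    // Last block whose first_key <= lower (or the very first block if none is).
    auto past_lower = std::upper_bound(blocks.begin(), blocks.end(), lower, key_less);
    auto first = past_lower == blocks.begin() ? blocks.begin() : std::prev(past_lower);
    // Last block whose first_key <= upper; for single-row reads this is
    // usually the same block as `first`.
    auto past_upper = std::upper_bound(first, blocks.end(), upper, key_less);
    auto last = past_upper == first ? first : std::prev(past_upper);
    return {first->data_file_start, last->data_file_end};
}
```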
Fixes#10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 AIO requests worth of 64 KiB in total.
After the change, the second read issued 1 AIO request worth of 2 KiB; that's because the promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closes scylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
Commit 3a12ad96c7 added an sstable_identifier UUID to the SSTable
scylla_metadata component; however, it was under-documented.
This patch adds the missing documentation for the sstable component
format and to the scylla sstable tool documentation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#21221
this commit moves the object storage configuration guide from the developer
documentation to the user-facing admin documentation. the change reflects
the increasing importance of object storage integration in user-facing
features.
in this change:
- move relevant content from `docs/dev/object_storage.md` to
`docs/operating-scylla/admin.rst`
- reformat the content from Markdown to reStructuredText (RST)
- reword and restructure the content to be more user-friendly
- add explanations and context suitable for a broader audience
this change makes the object storage configuration information more
accessible to Scylla administrators and end-users, supporting the adoption
of new features built on top of object storage integration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, it may happen that the last promoted index block includes
the partition_end marker. That's because we first write the partition
end marker and then emit the unclosed block. This behavior matches
Cassandra (checked in 3.x and 5.0.1).
This is problematic for ruling out data file reads based on index.
The width field is currently unused, but it will be used later where
the width of the last block is used to compute the skip position past
the last block for lookups which land after all keys in the
partition. If the width includes the marker then such a skip would land in
the next partition, which is incorrect, as the reader context expects
a cell element. Even if that were recognized, it would still be wrong: if this is
not a single-partition read (so the upper bound is not at the next
partition either), we would read from the wrong (next) partition.
We want to be able to make such skips in order to avoid unnecessary
data file IO for reads of missing rows. Currently, we would always
read the last block even if the key is past its "end" position.
Another way to solve this would be to propagate the "past the last
block" condition from the index cursor to the reader and let it deal
with it, but the logic for that would be complicated. With this fix,
there is no special logic required.
with this parameter, the "backup" API can back up a given table; this
enables it to be a drop-in replacement for the existing rclone API used by
scylla manager.
Fixes https://github.com/scylladb/scylladb/issues/20636
---
this change is part of the effort to bring native backup/restore to scylla; no need to backport.
Closes scylladb/scylladb#20661
* github.com:scylladb/scylladb:
backup_task: fix the indent
treewide: add "table" parameter to "backup" API
with this parameter, the "backup" API can back up a given table; this
enables it to be a drop-in replacement for the existing rclone API used by
scylla manager.
in this change:
* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
  `snapshot_ctl::start_backup`. since we have all the information
  for composing the snapshot directory, and all that `backup_task_impl`
  is interested in is the snapshot directory, we just pass the path
  to it instead of the individual components of the directory.
* backup_task_impl: instead of scanning the whole keyspace recursively,
  only scan the specified snapshot directory.
Fixes scylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Auth has been managed via Raft since Scylla 6.0. Restoring data
by following the usual procedure (1) is error-prone, so a safer
method needed to be designed and implemented. That's what
happens in this PR.
We want to extend `DESC SCHEMA` with auth and service levels
to provide a safe way to back up and restore those two components.
To realize that, we change the meaning of `DESC SCHEMA WITH INTERNALS`
and add a new "tier": `DESC SCHEMA WITH INTERNALS AND PASSWORDS`.
* `DESC SCHEMA` -- no change, i.e. the statement describes the current
schema items such as keyspaces, tables, views, UDTs, etc.
* `DESC SCHEMA WITH INTERNALS` -- does the same as the previous tier
and also describes auth and service levels. No information about
passwords is returned.
* `DESC SCHEMA WITH INTERNALS AND PASSWORDS` -- does the same
as the previous tier and also includes information about the salted
hashes corresponding to the passwords of roles.
To restore existing roles, we extend the `CREATE ROLE` statement
by allowing the option `WITH SALTED HASH = '[...]'` to be used.
---
Implementation strategy:
* Add missing things/adjust existing ones that will be used later.
* Implement creating a role with salted hash.
* Add tests for creating a role with salted hash.
* Prepare for implementing describe functionality of auth and service levels.
* Implement describe functionality for elements of auth and service levels.
* Extend the grammar.
* Add tests for describe auth and service levels.
* Add/update documentation.
---
(1): https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/backup-restore/restore.html
In case the link stops working: the referenced procedure restores a schema
by managing raw files on disk.
Fixes scylladb/scylladb#18750
Fixes scylladb/scylladb#18751
Fixes scylladb/scylladb#20711
Closes scylladb/scylladb#20168
* github.com:scylladb/scylladb:
docs: Update user documentation for backup and restore
docs/dev: Add documentation for DESC SCHEMA
test: Add tests for describing auth and service levels
cql3/functions/user_function: Remove newline character before and after UDF body
cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS
auth: Implement describing auth
auth/authenticator: Add member functions for querying password hash
service/qos/service_level_controller: Describe service levels
data_dictionary: Remove keyspace_element.hh
treewide: Start using new overloads of describe
treewide: Fix indentation in describe functions
treewide: Return create statement optionally in describe functions
treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element
treewide: Start using schema::ks_name() instead of schema::keyspace_name()
cql3: Refactor `description`
cql3: Move description to dedicated files
test: Add tests for `CREATE ROLE WITH SALTED HASH`
cql3/statements: Restrict CREATE ROLE WITH SALTED HASH
auth: Allow for creating roles with SALTED HASH
types: Introduce a function `cql3_type_name_without_frozen()`
cql3/util: Accept std::string_view rather than const sstring&
We add documentation for developers addressing
`DESCRIBE SCHEMA`. It covers the following aspects
of it:
* motivation,
* synopsis of the solution,
* implementation of the solution,
as well as a few subsections explaining the details:
* restoring process and its side effects,
* restoring roles with passwords,
* list of statements generated by `DESC SCHEMA`
with examples,
* implementation details.
The `unspecified` workload type is an internal value and is not exposed to
the user via CQL.
From the user's perspective, the default workload type is `NULL`.
Fixes scylladb/scylladb#20780
* Also dump diagnostics when a read times out while active (not queued).
* Add the "Trigger permit" line, containing the details of the permit which caused the diagnostics dump (by e.g. timing out).
* Add the "Identified bottleneck(s)" line, containing the identified bottlenecks which lead to permits being queued. This line is missing if no such bottleneck can be identified.
* Document the new features, as well as the stat dump, which was added some time ago.
Example of the new dump format:
```
INFO 2024-09-12 08:09:48,046 [shard 0:main] reader_concurrency_semaphore - Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/10 count and 106192275/32768 memory resources: timed out, dumping permit diagnostics:
Trigger permit: count=0, memory=0, table=ks.tbl0, operation=mutation-query, state=waiting_for_admission
Identified bottleneck(s): memory
permits count memory table/operation/state
3 2 26M *.*/push-view-updates-2/active
3 2 16M ks.tbl1/push-view-updates-1/active
1 1 15M ks.tbl2/push-view-updates-1/active
1 0 13M ks.tbl1/multishard-mutation-query/active
1 0 12M ks.tbl0/push-view-updates-1/active
1 1 10M ks.tbl3/push-view-updates-2/active
1 1 6060K ks.tbl3/multishard-mutation-query/active
2 1 1930K ks.tbl0/push-view-updates-2/active
1 0 1216K ks.tbl0/multishard-mutation-query/active
6 0 0B ks.tbl1/shard-reader/waiting_for_admission
3 0 0B *.*/data-query/waiting_for_admission
9 0 0B ks.tbl0/mutation-query/waiting_for_admission
2 0 0B ks.tbl2/shard-reader/waiting_for_admission
4 0 0B ks.tbl0/shard-reader/waiting_for_admission
9 0 0B ks.tbl0/data-query/waiting_for_admission
7 0 0B ks.tbl3/mutation-query/waiting_for_admission
5 0 0B ks.tbl1/mutation-query/waiting_for_admission
2 0 0B ks.tbl2/mutation-query/waiting_for_admission
8 0 0B ks.tbl1/data-query/waiting_for_admission
1 0 0B *.*/mutation-query/waiting_for_admission
26 0 0B permits omitted for brevity
96 8 101M total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 1
reads_enqueued_for_admission: 82
reads_enqueued_for_memory: 0
reads_admitted_immediately: 1
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 82
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 97
current_permits: 96
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
```
Fixes: https://github.com/scylladb/scylladb/issues/19535
Improvement, no backport needed.
Closes scylladb/scylladb#20545
* github.com:scylladb/scylladb:
docs/dev/reader-concurrency-semaphore.md: update the documentation on diagnostics dumps
test/boost/reader_concurrency_semaphore_test: test the new diagnostics functionality
reader_concurrency_semaphore: add bottleneck self-diagnosis to diagnosis dump
reader_concurrency_semaphore: include trigger permit in diagnostic dump
reader_concurrency_semaphore: propagate permit to do_dump_reader_permit_diagnostics()
reader_concurrency_semaphore: use consistent exception type for timeout
reader_concurrency_semaphore: dump diagnostics when non-waiting reader times out
When purging a regular tombstone, consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.
For shadowable tombstones, consult the min_memtable_live_row_marker_timestamp, if available; otherwise fall back to the min_live_timestamp.
If we see in a view table a shadowable tombstone with time T, then in any row whose row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk of resurrecting any deleted data.
In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older than or equal to T. So to know whether a whole sstable can cause problems for a shadowable tombstone with time T, we need to check whether the sstable's oldest row marker (and not oldest column) is older than or equal to T. The same check applies to the memtable.
If both extended timestamp statistics are missing, fall back to the legacy (and inaccurate) min_timestamp.
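For illustration, a minimal sketch of the resulting purgeability rule (the names are hypothetical, not the actual compaction code): a shadowable tombstone with timestamp T is purgeable only if T is below the oldest live row marker of every other source, with fallbacks to the coarser stats when the extended ones are absent.
```
#include <algorithm>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Illustrative sketch only: compute the timestamp below which a shadowable
// tombstone can be purged, given per-memtable/per-sstable stats.
using api_timestamp = int64_t;

struct source_stats {
    api_timestamp min_timestamp;                                  // legacy, coarse stat
    std::optional<api_timestamp> min_live_timestamp;              // extended stat
    std::optional<api_timestamp> min_live_row_marker_timestamp;   // extended stat
};

api_timestamp max_purgeable_shadowable(const std::vector<source_stats>& sources) {
    api_timestamp result = std::numeric_limits<api_timestamp>::max();
    for (const auto& s : sources) {
        // Prefer the row-marker stat, then the live-data stat, then the legacy one.
        api_timestamp floor = s.min_live_row_marker_timestamp
                .value_or(s.min_live_timestamp.value_or(s.min_timestamp));
        result = std::min(result, floor);
    }
    // A shadowable tombstone with timestamp T is purgeable iff T < result:
    // every row marker in the other sources is newer, so the tombstone hides nothing.
    return result;
}
```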
Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424
> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads
Closes scylladb/scylladb#20446
* github.com:scylladb/scylladb:
cql-pytest: add test_compaction_tombstone_gc
sstable_compaction_test: add mv_tombstone_purge_test
sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
sstable_compaction_test: tombstone_purge_test: add testlog debugging
sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
sstable, compaction: add debug logging for extended min timestamp stats
compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
compaction: define max_purgeable_fn
tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
sstables: scylla_metadata: add ext_timestamp_stats
compaction_group, storage_group, table_state: add extended timestamp stats getters
sstables, memtable: track live timestamps
memtable_encoding_stats_collector: update row_marker: do nothing if missing
The part of the document which explains diagnostics dumps was due for an
update. It was missing an explanation of the dumped stats, and it
also needed to explain the "Problematic permit" and "Identified
bottleneck(s)" lines.
Store and retrieve the optional extended timestamp statistics
(min_live_timestamp and min_live_row_marker_timestamp)
in the scylla_metadata component.
Note that there is no need for a cluster feature to
store those attributes, since the scylla_metadata
on-disk format is extensible: old sstables
can be read by new versions, which see that the extra stats
are missing, and new sstables can be read by old
versions, which ignore unknown scylla metadata section types.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle an oversized single mutation by adding yet another entry
type: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
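For illustration, a minimal model of the replay-state reassembly described above (the types are hypothetical, not the actual replayer): fragments are keyed by an entry id, buffered until the chain is complete, and only then handed to the replay callback.
```
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Illustrative model only: collect fragments of an oversized commitlog entry
// (possibly spread over several segments) and reassemble the full entry.
struct fragment {
    uint64_t entry_id;   // correlates fragments of the same chain
    uint32_t index;      // position of this fragment within the chain
    uint32_t total;      // number of fragments in the chain
    std::vector<std::byte> data;
};

class replay_state {
    struct pending {
        std::map<uint32_t, std::vector<std::byte>> parts;   // index -> data
        uint32_t total = 0;
    };
    std::map<uint64_t, pending> _pending;
public:
    // Returns the reassembled entry once the last missing fragment arrives.
    std::optional<std::vector<std::byte>> add(fragment f) {
        auto& p = _pending[f.entry_id];
        p.total = f.total;
        p.parts.emplace(f.index, std::move(f.data));
        if (p.parts.size() < p.total) {
            return std::nullopt;     // chain not complete yet
        }
        std::vector<std::byte> whole;
        for (auto& [idx, part] : p.parts) {   // map iteration is ordered by index
            whole.insert(whole.end(), part.begin(), part.end());
        }
        _pending.erase(f.entry_id);
        return whole;                // caller issues the replay callback as usual
    }
};
```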
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
Closes scylladb/scylladb#19472
* github.com:scylladb/scylladb:
commitlog/database: Make some commitlog options updatable + add feature listener
features/config: Add feature for fragmented commitlog entries
docs: Add entry on commitlog file format v4
commitlog_test: Add more oversized cases
commitlog_replayer: Replay segments in order created
commitlog_replayer: Use replay state to support fragmented entries
commitlog_replayer: coroutinize partly
commitlog: Handle oversized entries
~~~
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:
- The "65K" claim as default value on non-production systems is wrong;
"fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.
- The section in question uses equal signs (=) incorrectly. The intent was
probably to say "which means the same as", but that's not what equality
means.
- In the same section, the relational operator "<" is bogus. The available
AIO count must be at least as high (>=) as the requested AIO count.
- Clearer names should be used;
adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
sets a great example:
- "reactor::max_aio" should be called "storage_iocbs",
- "detect_aio_poll" should be called "preempt_iocbs",
- "reactor_backend_aio::max_polls" should be called "network_iocbs".
- The specific value 10000 for the last one ("network_iocbs") is not
correct in scylla's context. It is correct as the Seastar default, but
scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
number of networking iocbs", 2021-07-18).
Rewrite the section to address these problems.
See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~
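For illustration, a back-of-the-envelope version of the calculation described in the quoted commit message above; the network_iocbs value matches the ScyllaDB figure mentioned there, while the storage and preemption per-shard slot counts are assumptions made for this sketch, not authoritative values.
```
#include <cstdint>
#include <iostream>

// Illustrative sketch only: estimate the fs.aio-max-nr needed for a given
// shard count. network_iocbs=50000 matches the scylla value cited above;
// storage_iocbs=1024 and preempt_iocbs=2 are assumed defaults.
int main() {
    const uint64_t shards        = 8;       // --smp value
    const uint64_t storage_iocbs = 1024;    // per-shard storage AIO slots (assumption)
    const uint64_t preempt_iocbs = 2;       // per-shard preemption AIO slots (assumption)
    const uint64_t network_iocbs = 50000;   // per-shard networking AIO slots

    uint64_t needed = shards * (storage_iocbs + preempt_iocbs + network_iocbs);
    std::cout << "fs.aio-max-nr must be at least " << needed << "\n";   // 408208 for 8 shards
}
```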
No need for backporting; the documentation being refreshed targets developers as audience, not end-users.
Closes scylladb/scylladb#20398
* github.com:scylladb/scylladb:
docs/dev/docker-hub.md: refresh aio-max-nr calculation
docs/dev/docker-hub.md: strip trailing whitespace
Before the introduction of "scripts/refresh-submodules.sh", there was
indeed some manual work for the maintainer to do, hence "publish your
work" must have sounded correct. Today, the phrase "publish your work"
sounds confusing.
Commit 71da4e6e79 ("docs: Document sync-submodules.sh script in
maintainer.md", 2020-06-18) should have arguably reworded the last step of
the submodule refresh procedure; let's do it now.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closes scylladb/scylladb#20333
* in the "Backporting Seastar commits" section, there's a single quote
instead of a backtick in this line, so fix it.
* add backticks around `refresh-submodules.sh`, which is a filename.
* correct the command line setting a git config option, because `git-config`
does not support this command line syntax,
```console
$ git config --global diff.conflictstyle = diff3
$ git config --global get diff.conflictstyle
=
$ git config --global diff.conflictstyle diff3
$ git config --global get diff.conflictstyle
diff3
```
quote from git-config(1)
> ```
> git config set [<file-option>] [--type=<type>] [--all] [--value=<value>] [--fixed-value] <name> <value>
> ```
* stop using the deprecated mode of the `git-config` command, and use the
subcommand instead. as git-config(1) puts it:
> git config <name> <value> [<value-pattern>]
> Replaced by git config set [--value=<pattern>] <name> <value>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#20328
Currently the doc assumes that object storage can only be used to keep
sstables on it. That's going to change; restructure the doc to allow for
more usage scenarios.