Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to lookup end position based on upper
bound of the read, in anticipation for sharing the promoted index cache,
but it's not effective becasue it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes#10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closesscylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
This commit improves the README file so that it's more helpful
to documentation contributors. Especially, it:
- Adds the link to the prerequisites.
- Add information on troubleshooting (checking the links, headings, etc.)
- Removes the section on creating a knowledge base article, as we no longer
promote adding KBs in favor of creating a coherent documentation set.
Fixes https://github.com/scylladb/scylladb/issues/21257Closesscylladb/scylladb#21262
This commit updates the configuration for ScyllaDB documentation so that:
- 6.2 is the latest version.
- 6.2 is removed from the list of unstable versions.
It must be merged when ScyllaDB 6.2 is released.
In addition, this commit uncomments the redirections that should be applied
when version 6.2 is the latest stable version (which will happen when this commit
is merged).
No backport is required.
Closesscylladb/scylladb#21133
Commit 3a12ad96c7
added an sstable_identifier uuid to the SSTable
scylla_metadata component, however it was
under-documented and this patch adds the missing
documentation for the sstable component format,
and to the scylla sstable tool documentation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#21221
this series:
- promote object storage configuration to user-facing documentation
- reference object storage config doc from nodetool commands
---
the nodetool backup/restore commands are not included by any LTS branches yet, hence no need to backport.
Closesscylladb/scylladb#21071
* github.com:scylladb/scylladb:
docs: move keyspace-storage-option from cql-extensions to admin
docs: reference admin.rst for object storage config
docs: reference object storage config doc from nodetool commands
docs: promote object storage configuration to user-facing documentation
instead of repeating it in cql-extensions.md, let's reference
the object storage related settings in admin.rst
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Enhance the documentation for nodetool commands that use the `--endpoint`
option by linking to the object storage configuration guide. This change
provides users with essential context and detailed setup instructions
for S3-compatible storage endpoints.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this commit moves the object storage configuration guide from the developer
documentation to the user-facing admin documentation. the change reflects
the increasing importance of object storage integration in user-facing
features.
in this change:
- move relevant content from `docs/dev/object_storage.md` to
`docs/operating-scylla/admin.rst`
- reformat the content from Markdown to reStructuredText (RST)
- reword and restructure the content to be more user-friendly
- add explanations and context suitable for a broader audience
this change makes the object storage configuration information more
accessible to Scylla administrators and end-users, supporting the adoption
of new features built on top of object storage integration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in 787ea4b1d4, we added "sstables" argument to the "nodetool restore"
command. but we failed to update the document to reflect the change.
in this change, we update the document for "restore" command to reflect
the latest implementation changes introduced in commit 787ea4b1d4:
* Add information about the new "sstables" argument
* Update command line usage of "--table" argument -- it is now madatory
* Update the example accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21135
Fixes#18903
Adds a "transitional" internode encryption mode, under which all _outgoing_ RPC connections will use TLS, but we will still accept any incoming non-tls connection.
This allows an operator to perform a move to TLS RPC without cluster downtime:
1. For each server, add certificate etc options to server_encryption_options + internode_encryption=none + set ssl_storage_port + restart (rolling)
2. For each server, set internode_encryption=transitional + RR
3. For each server, set internode_encryption=all + RR
Closesscylladb/scylladb#18939
* github.com:scylladb/scylladb:
test::topology: Add test for TLS upgrade and downgrade of internode encryption
docs: Add internode_encryption=transitional documentation
messaging_service: Add "transitional" internode encryptipn mode
messaging_service: Create TLS connector even if internode_enc=none when certs set
This commit removes the raw:: html directive (with the exception of an embedded animation) because:
- It is not supported by the dark theme and looks bad.
- It's a legacy directive, and we no longer need it on index pages.
Fixes https://github.com/scylladb/scylladb/issues/20881Closesscylladb/scylladb#21062
- `docs/Makefile`: work around python-poetry issue https://github.com/python-poetry/poetry/issues/8761
- `docs/README.md`: fix minimum poetry version
No backporting needed (docs development).
Closes scylladb/scylladb#21118
* github.com:scylladb/scylladb:
docs/README.md: fix minimum poetry version
docs/Makefile: work around python-poetry issue #8761
In this patch we add to docs/new-apis.md (Alternator-specific API)
a description of the service discovery HTTP requests - `/` and
`/localnodes` that was previously not documented except in a design
document that is unfortunately no longer available publically.
The description also includes the recently added `dc` and `rack`
parameters for the `/localnodes` request.
Fixes#20989
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, the documentation of Alternator-specific APIs (APIs
which are unique to Alternator and don't exist in DynamoDB) appear as
a section of the main document alternator.md. In the next patch we
want to describe yet another Alternator feature and make this section
even longer. But there is growing sentiment that the Alternator
documentation should be split into more, shorter, pages (Refs #19822)
so this patch splits the Alternator-specific API documentation into a
new file, new-apis.md.
There is no new content in the patch - just movement of existing content
plus a reference to the new page.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Commit 2a3012db7f ("docs/README.md: expand prerequisites list",
2022-08-31) referenced poetry release 1.12, which does not exist even
today (as of this writing, the latest release is 1.8.4). The intent was
probably 1.1.12.
Copy the minimum version from "sphinx-scylladb-theme": 1.8.1 (see
"docs/source/getting-started/installation.rst" and
"docs/source/getting-started/quickstart.rst" at commit f7c26b422572).
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Python-poetry is affected by bug
<https://github.com/python-poetry/poetry/issues/8761>. Namely, if you have
"keyring" <https://pypi.org/project/keyring/> installed, poetry will try
to gain access to the Default collection in the (ex. GNOME) keyring, even
if poetry only needs read-only access to package repositories, and even if
those repos are public.
Consequently, you either unlock your Default collection for poetry
(unjustifiedly), or your GUI session gets effectively locked up, because
any time you hit Cancel on the keyring unlock dialog, poetry immediately
pops up another, and this dialog grabs the keyboard -- you cannot even
switch to a character VT, for killing poetry; you have to log in via ssh
for that.
This issue is not visible to users who don't use "keyring" (GNOME or
otherwise). For those who do, work around the problem by selecting the
"null" keyring back-end, in the environment of every poetry invocation.
Note: I have not regression-tested the workaround in a desktop environment
where "keyring" is unavailable to begin with.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
It was not possible to link to configuration parameters groups in docs/reference/configuration-parameters.rst if they contained a space.
Closesscylladb/scylladb#21018
Currently, it may happen that the last promoted index block includes
the partition_end marker. That's because we first write the partition
end marker and then emit the unclosed block. This behavior matches
Cassandra (checked in 3.x and 5.0.1).
This is problematic for ruling out data file reads based on index.
The width field is currently unused, but it will be used later where
the width of the last block is used to compute the skip position past
the last block for lookups which land after all keys in the
partition. If width includes the marker then such a skip would land in
the next partition, which is incorrect, as the reader context expects
a cell element. Even if that was recognized, it's wrong - if this is
not a single partition read (so upper bound is not at the next
partition too), then we would read from the wrong (next) partition.
We want to be able to make such skips in order to avoid unnecessary
data file IO for reads of missing rows. Currently, we would always
read the last block even if the key is past its "end" position.
Another way to solve this would be to propagate the "past the last
block" condition from the index cursor to the reader and let it deal
with it, but the logic for that would be complicated. With this fix,
there is no special logic required.
Removes the update command from the setup command.
This is required because versions now are not strictly pinned in the poetry.lock file since Sphinx ScyllaDB Theme 1.8.
Closesscylladb/scylladb#20876
When users create a table using the Alternator API, they can decide if the billing is PROVISIONED of PAY_PER_REQUEST.
If the billing is set to PROVISIONED, they need to set the ProvisionedThroughput ReadCapacityUnits (RCU) and WriteCapacityUnits (WCU).
This series adds support for getting and setting the ProvisionedThroughput. The values will be stored as table extension tags.
Following how TTL is stored within the Alternator, we will use ```system:rcu_attribute``` and ```system:wcu_attribute``` for the labels.
The series adds a test that sets ProvisionedThroughput and validates that it gets the value back. It was tested with both Alternator and AWS.
This series is part of the effort to monitor, limit, and bill Alternator operations.
New code, no need to backport.
Closesscylladb/scylladb#20056
* github.com:scylladb/scylladb:
docs/alternator/compatibility.md: explain the consumed capacity provisioned
Add test/alternator/test_provisioned_throughput.py
test/alternator/util.py: Allow override BillingMode
alternator/executor.cc: Store ProvisionedThroughput
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.
Fixes https://github.com/scylladb/scylladb/issues/20636
---
this change is a part of the efforts to bring the native backup/restore to scylla, no need to backprt.
Closesscylladb/scylladb#20661
* github.com:scylladb/scylladb:
backup_task: fix the indent
treewide: add "table" parameter to "backup" API
This commit modifies the Features page in the following way:
- It adds a short introduction and descriptions to each listed feature.
- It hides the ToC (required to control and modify the information on the page,
e.g., to add descriptions, have full control over what is displayed, etc.)
- Removes the info about Enterprise features (following the request not to include
Enterprise info in the OSS docs)
Fixes https://github.com/scylladb/scylladb/issues/20617
Blocks https://github.com/scylladb/scylla-enterprise/pull/4711Closesscylladb/scylladb#20635
with this parameter, "backup" API can backup the given table, this
enables it to be a drop-in replacement of existing rclone API used by
scylla manager.
in this change:
* api/storage_service: add "table" parameter to "backup" API.
* snapshot_ctl: compose the full path of the snapshot directory in
`snapshot_ctl::start_backup`. since we have all the information
for composing the snapshot directory, and what the `backup_task_impl`
class is interested is but the snapshot directory, we just pass
the path to it instead the individual components of the directory.
* backup_task_impl: instead of scan the whole keyspace recursively,
only scan the specified snapshot directory.
Fixesscylladb/scylladb#20636
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Auth has been managed via Raft since Scylla 6.0. Restoring data
following the usual procedure (1) is error-prone and so a safer
method must have been designed and implemented. That's what
happens in this PR.
We want to extend `DESC SCHEMA` by auth and service levels
to provide a safe way to backup and restore those two components.
To realize that, we change the meaning of `DESC SCHEMA WITH INTERNALS`
and add a new "tier": `DESC SCHEMA WITH INTERNALS AND PASSWORDS`.
* `DESC SCHEMA` -- no change, i.e. the statement describes the current
schema items such as keyspaces, tables, views, UDTs, etc.
* `DESC SCHEMA WITH INTERNALS` -- does the same as the previous tier
and also describes auth and service levels. No information about
passwords is returned.
* `DESC SCHEMA WITH INTERNALS AND PASSWORDS` -- does the same
as the previous tier and also includes information about the salted
hashes corresponding to the passwords of roles.
To restore existing roles, we extend the `CREATE ROLE` statement
by allowing to use the option `WITH SALTED HASH = '[...]'`.
---
Implementation strategy:
* Add missing things/adjust existing ones that will be used later.
* Implement creating a role with salted hash.
* Add tests for creating a role with salted hash.
* Prepare for implementing describe functionality of auth and service levels.
* Implement describe functionality for elements of auth and service levels.
* Extend the grammar.
* Add tests for describe auth and service levels.
* Add/update documentation.
---
(1): https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/backup-restore/restore.html
In case the link stops working, restoring a schema was realised
by managing raw files on disk.
Fixesscylladb/scylladb#18750Fixesscylladb/scylladb#18751Fixesscylladb/scylladb#20711Closesscylladb/scylladb#20168
* github.com:scylladb/scylladb:
docs: Update user documentation for backup and restore
docs/dev: Add documentation for DESC SCHEMA
test: Add tests for describing auth and service levels
cql3/functions/user_function: Remove newline character before and after UDF body
cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS
auth: Implement describing auth
auth/authenticator: Add member functions for querying password hash
service/qos/service_level_controller: Describe service levels
data_dictionary: Remove keyspace_element.hh
treewide: Start using new overloads of describe
treewide: Fix indentation in describe functions
treewide: Return create statement optionally in describe functions
treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element
treewide: Start using schema::ks_name() instead of schema::keyspace_name()
cql3: Refactor `description`
cql3: Move description to dedicated files
test: Add tests for `CREATE ROLE WITH SALTED HASH`
cql3/statements: Restrict CREATE ROLE WITH SALTED HASH
auth: Allow for creating roles with SALTED HASH
types: Introduce a function `cql3_type_name_without_frozen()`
cql3/util: Accept std::string_view rather than const sstring&
We update the relevant articles addressing backing-up
and restoring the schema by specifying that the user
performing it must be a superuser. We also update
the required version of cqlsh.
Additionally, we add an article covering the fundamental
information on `DESCRIBE SCHEMA`.
We add documentation for developers addressing
`DESCRIBE SCHEMA`. It covers the following aspects
of it:
* motivation,
* synopsis of the solution,
* implementation of the solution,
as well as a few subsections explaining the details:
* restoring process and its side effects,
* restoring roles with passwords,
* list of statements generated by `DESC SCHEMA`
with examples,
* implementation details.
When executing `DESC SCHEMA WITH INTERNALS`, Scylla now also returns
statements that can be used to recreate service levels and restore
the state of auth. That encompasses granting roles and permissions
as well as attaching service levels to roles.
If the additional parameter `WITH PASSWORDS` is provided,
the statements corresponding to recreating roles in the system
will also contain the stored salted hashes.
`unspecified` workload type is an internal value and it's not exposed to
user via CQL.
Default value for workload type from user's perspective is `NULL`.
Fixesscylladb/scylladb#20780
to explain for instance which setting takes effect if both
command line options and `scylla.yaml` configures the same parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20696
We introduce a way to create a role with explictly
provided salted hash.
The algorithm for creating a role with a password works
like this:
1. The user issues a statement `CREATE ROLE <role> WITH
PASSWORD = '<password>' <...>`.
2. Scylla produces a hash based on the value of
`<password>`.
3. Scylla puts the produced hash in `system.roles`,
in the column `salted_hash`.
The newly introduced way to create a role is based
on a new form of the create statement:
`CREATE ROLE <role> WITH SALTED HASH = '<salted_hash>`
The difference in the algorithm used for processing
this statement is that we insert `<salted_hash>`
into `system.roles` directly, without hashing it.
The rationale for introducing this new statement is that
we want to be able to restore roles. The original password
isn't stored anywhere in the database (as intended),
so we need to rely on the column `salted_hash`.
* Also dump diagnostics when a read times out while active (not queued).
* Add the "Trigger permit" line, containing the details of the permit which caused the diagnostics dump (by e.g. timing out).
* Add the "Identified bottleneck(s)" line, containing the identified bottlenecks which lead to permits being queued. This line is missing if no such bottleneck can be identified.
* Document the new features, as well as the stat dump, which was added some time ago.
Example of the new dump format:
```
INFO 2024-09-12 08:09:48,046 [shard 0:main] reader_concurrency_semaphore - Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/10 count and 106192275/32768 memory resources: timed out, dumping permit diagnostics:
Trigger permit: count=0, memory=0, table=ks.tbl0, operation=mutation-query, state=waiting_for_admission
Identified bottleneck(s): memory
permits count memory table/operation/state
3 2 26M *.*/push-view-updates-2/active
3 2 16M ks.tbl1/push-view-updates-1/active
1 1 15M ks.tbl2/push-view-updates-1/active
1 0 13M ks.tbl1/multishard-mutation-query/active
1 0 12M ks.tbl0/push-view-updates-1/active
1 1 10M ks.tbl3/push-view-updates-2/active
1 1 6060K ks.tbl3/multishard-mutation-query/active
2 1 1930K ks.tbl0/push-view-updates-2/active
1 0 1216K ks.tbl0/multishard-mutation-query/active
6 0 0B ks.tbl1/shard-reader/waiting_for_admission
3 0 0B *.*/data-query/waiting_for_admission
9 0 0B ks.tbl0/mutation-query/waiting_for_admission
2 0 0B ks.tbl2/shard-reader/waiting_for_admission
4 0 0B ks.tbl0/shard-reader/waiting_for_admission
9 0 0B ks.tbl0/data-query/waiting_for_admission
7 0 0B ks.tbl3/mutation-query/waiting_for_admission
5 0 0B ks.tbl1/mutation-query/waiting_for_admission
2 0 0B ks.tbl2/mutation-query/waiting_for_admission
8 0 0B ks.tbl1/data-query/waiting_for_admission
1 0 0B *.*/mutation-query/waiting_for_admission
26 0 0B permits omitted for brevity
96 8 101M total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 0
total_failed_reads: 0
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 1
reads_enqueued_for_admission: 82
reads_enqueued_for_memory: 0
reads_admitted_immediately: 1
reads_queued_because_ready_list: 0
reads_queued_because_need_cpu_permits: 82
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 0
total_permits: 97
current_permits: 96
need_cpu_permits: 0
awaits_permits: 0
disk_reads: 0
sstables_read: 0
```
Fixes: https://github.com/scylladb/scylladb/issues/19535
Improvement, no backport needed.
Closesscylladb/scylladb#20545
* github.com:scylladb/scylladb:
docs/dev/reader-concurrency-semaphore.md: update the documentation on diagnostics dumps
test/boost/reader_concurrency_semaphore_test: test the new diagnostics functionality
reader_concurrency_semaphore: add bottleneck self-diagnosis to diagnosis dump
reader_concurrency_semaphore: include trigger permit in diagnostic dump
reader_concurrency_semaphore: propagate permit to do_dump_reader_permit_diagnostics()
reader_concurrency_semaphore: use consistent exception type for timeout
reader_concurrency_semaphore: dump diagnostics when non-waiting reader times out