# System keyspace layout

This section describes layouts and usage of system.* tables.

## The system.large\_* tables

Scylla performs better if partitions, rows, or cells are not too
large. To help diagnose cases where these grow too large, Scylla keeps
three tables that record large partitions (including those with too many
rows), rows, and cells, respectively.

The meaning of an entry in each of these tables is similar: it means
that there is a particular sstable with a large partition, row, cell,
or a partition with too many rows. In particular, this implies that:

* There is no entry until compaction aggregates enough data in a
  single sstable.
* The entry stays around until the sstable is deleted.

In addition, the entries also have a TTL of 30 days.

## system.large\_partitions

The large partitions table can be used to trace the largest partitions in a
cluster. Partitions with too many rows are also recorded there.

Schema:

~~~
CREATE TABLE system.large_partitions (
    keyspace_name text,
    table_name text,
    sstable_name text,
    partition_size bigint,
    partition_key text,
    range_tombstones bigint,
    dead_rows bigint,
    rows bigint,
    compaction_time timestamp,
    PRIMARY KEY ((keyspace_name, table_name), sstable_name, partition_size, partition_key)
) WITH CLUSTERING ORDER BY (sstable_name ASC, partition_size DESC, partition_key ASC);
~~~

### Example usage

#### Extracting large partitions info

~~~
SELECT * FROM system.large_partitions;
~~~

#### Extracting large partitions info for a single table

~~~
SELECT * FROM system.large_partitions WHERE keyspace_name = 'ks1' AND table_name = 'standard1';
~~~

## system.large\_rows

The large rows table can be used to trace large clustering and static rows in a cluster.

This table is currently only used with the MC format (issue #4868).

Schema:

~~~
CREATE TABLE system.large_rows (
    keyspace_name text,
    table_name text,
    sstable_name text,
    row_size bigint,
    partition_key text,
    clustering_key text,
    compaction_time timestamp,
    PRIMARY KEY ((keyspace_name, table_name), sstable_name, row_size, partition_key, clustering_key)
) WITH CLUSTERING ORDER BY (sstable_name ASC, row_size DESC, partition_key ASC, clustering_key ASC);
~~~

### Example usage

#### Extracting large rows info

~~~
SELECT * FROM system.large_rows;
~~~

#### Extracting large rows info for a single table

~~~
SELECT * FROM system.large_rows WHERE keyspace_name = 'ks1' AND table_name = 'standard1';
~~~

## system.large\_cells

The large cells table can be used to trace large cells in a cluster.

This table is currently only used with the MC format (issue #4868).

Schema:

~~~
CREATE TABLE system.large_cells (
    keyspace_name text,
    table_name text,
    sstable_name text,
    cell_size bigint,
    partition_key text,
    clustering_key text,
    column_name text,
    compaction_time timestamp,
    collection_elements bigint,
    PRIMARY KEY ((keyspace_name, table_name), sstable_name, cell_size, partition_key, clustering_key, column_name)
) WITH CLUSTERING ORDER BY (sstable_name ASC, cell_size DESC, partition_key ASC, clustering_key ASC, column_name ASC);
~~~

Note that a collection is just one cell. There is no information about
the size of each collection element.

### Example usage

#### Extracting large cells info

~~~
SELECT * FROM system.large_cells;
~~~

#### Extracting large cells info for a single table

~~~
SELECT * FROM system.large_cells WHERE keyspace_name = 'ks1' AND table_name = 'standard1';
~~~

## system.raft

Holds persisted Raft state: the log entries of each Raft group, plus per-group
static metadata (vote term, vote, snapshot id, and commit index).

Schema:

~~~
CREATE TABLE system.raft (
    group_id timeuuid,
    index bigint,
    term bigint,
    data blob,
    vote_term bigint static,
    vote uuid static,
    snapshot_id uuid static,
    commit_idx bigint static,
    PRIMARY KEY (group_id, index)
) WITH CLUSTERING ORDER BY (index ASC)
~~~

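Since the per-group metadata lives in static columns, selecting just those columns for a single group shows that group's current vote and commit state. A sketch (the `group_id` below is a placeholder; real group ids can be found by scanning the table):

~~~
SELECT vote_term, vote, snapshot_id, commit_idx
FROM system.raft
WHERE group_id = 277d0b00-5a63-11ee-8c99-0242ac120002;
~~~
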
## system.truncated

Holds truncation replay positions per table and shard.

Schema:

~~~
CREATE TABLE system.truncated (
    table_uuid uuid,                -- id of truncated table
    shard int,                      -- shard
    position int,                   -- replay position
    segment_id bigint,              -- replay segment
    truncated_at timestamp static,  -- truncation time
    PRIMARY KEY (table_uuid, shard)
) WITH CLUSTERING ORDER BY (shard ASC)
~~~

When a table is truncated, sstables are removed and the current replay position for each
shard (the last mutation to be committed to either sstable or memtable) is collected.
These are then inserted into the above table, using shard as the clustering key.

When doing commitlog replay (in case of a crash), the data is read from the above
table and mutations are filtered based on the replay positions to ensure
truncated data is not resurrected.

Note that until the above table was added, truncation records were kept in the
`truncated_at` map column in the `system.local` table. When booting up, Scylla will
merge the data in the legacy store with the data in the `truncated` table. Until the whole
cluster agrees on the `TRUNCATION_TABLE` feature, truncation will write both new and
legacy records. Once the feature is agreed upon, the legacy map is removed.

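As a sketch, the replay positions recorded for one table can be inspected per shard (the `table_uuid` is a placeholder; a table's id can be looked up in `system_schema.tables`):

~~~
SELECT shard, position, segment_id, truncated_at
FROM system.truncated
WHERE table_uuid = 277d0b00-5a63-11ee-8c99-0242ac120002;
~~~
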
## system.sstables

The "ownership" table for non-local sstables.

Schema:

~~~
CREATE TABLE system.sstables (
    location text,
    generation timeuuid,
    format text,
    status text,
    uuid uuid,
    version text,
    PRIMARY KEY (location, generation)
)
~~~

When a user keyspace is created with S3 storage options, sstables are put on the
remote object storage and the information about them is kept in this table. The
"uuid" field is used to point to the "folder" in which all of the sstable's files are.

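Since `location` is the partition key, the sstables registered under one location can be listed together. A sketch (the location string below is a placeholder; its actual format depends on the configured storage options):

~~~
SELECT generation, format, version, status, uuid
FROM system.sstables
WHERE location = 's3://my-bucket/my-prefix';
~~~
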
## system.tablets

Holds information about all tablets in the cluster.

Schema:

~~~
CREATE TABLE system.tablets (
    keyspace_name text,
    table_id uuid,
    last_token bigint,
    new_replicas frozen<list<frozen<tuple<uuid, int>>>>,
    replicas frozen<list<frozen<tuple<uuid, int>>>>,
    stage text,
    transition text,
    table_name text static,
    tablet_count int static,
    resize_type text static,
    resize_seq_number bigint static,
    PRIMARY KEY ((keyspace_name, table_id), last_token)
)
~~~

Each partition (keyspace_name, table_id) represents the tablet map of a given table.

Only tables which use the tablet-based replication strategy have an entry here.

`tablet_count` is the number of tablets in the map.
`table_name` is the name of the table, provided for convenience.

`resize_type` is the resize decision type that spans all tablets of a given table; it can be one of `merge`, `split` or `none`.

`resize_seq_number` is the sequence number (>= 0) of the resize decision that globally identifies it. It is monotonically increasing, incremented by one for every new decision, so a higher value means the decision came later in time.

`last_token` is the last token owned by the tablet. The i-th tablet, where i = 0, 1, ..., `tablet_count`-1,
owns the token range:
```
(-inf, last_token(0)]              for i = 0
(last_token(i-1), last_token(i)]   for i > 0
```

Each tablet is represented by a single row. `replicas` holds the set of shard-replicas of the tablet.
It is a list of tuples where the first element is the `host_id` of the replica and the second element is the `shard_id` of the replica.

During tablet migration, the columns `new_replicas`, `stage` and `transition` are set to represent the transition. The
`new_replicas` column holds what will be put in `replicas` after the transition is done.

During tablet splitting, the load balancer sets the `resize_type` column to `split`, and sets `resize_seq_number` to the next sequence number, which is the previous value incremented by one.

The `transition` column can have the following values:
* `migration` - one tablet replica is moving from one shard to another.
* `rebuild` - a new tablet replica is created from the remaining replicas.

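A consequence of the token-range layout above: because `last_token` is the clustering key, the tablet owning a given token T is the first row with `last_token >= T`. A hedged sketch (the `table_id` and the token value are placeholders):

~~~
SELECT last_token, replicas
FROM system.tablets
WHERE keyspace_name = 'ks1'
  AND table_id = 277d0b00-5a63-11ee-8c99-0242ac120002
  AND last_token >= -4611686018427387904
LIMIT 1;
~~~
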
# Virtual tables in the system keyspace

Virtual tables behave just like regular tables from the user's point of view.
The difference between them and regular tables comes down to how they are implemented.
While regular tables have memtables/commitlog/sstables and all you would expect from CQL tables, virtual tables translate some in-memory structure to the CQL result format.
For more details see [virtual-tables.md](virtual-tables.md).

Below you can find a list of virtual tables, sorted in alphabetical order (please keep it so when modifying!).

## system.cluster_status

Contains information about the status of each endpoint in the cluster.
Equivalent of the `nodetool status` command.

Schema:

```cql
CREATE TABLE system.cluster_status (
    peer inet PRIMARY KEY,
    dc text,
    host_id uuid,
    load text,
    owns float,
    status text,
    tokens int,
    up boolean
)
```

Implemented by `cluster_status_table` in `db/system_keyspace.cc`.

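As with any virtual table, it can be queried with plain CQL. For example, to list all nodes, or (a sketch; filtering on the non-key `up` column requires `ALLOW FILTERING`) only the nodes currently down:

```cql
SELECT peer, dc, status, up FROM system.cluster_status;
SELECT peer, dc, status FROM system.cluster_status WHERE up = false ALLOW FILTERING;
```
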
## system.protocol_servers

The list of all the client-facing data-plane protocol servers and their listen addresses (if running).
Equivalent of the `nodetool statusbinary` plus the `Native Transport active` fields from `nodetool info`.

TODO: include control-plane and diagnostics-plane protocols here too.

Schema:

```cql
CREATE TABLE system.protocol_servers (
    name text PRIMARY KEY,
    is_running boolean,
    listen_addresses frozen<list<text>>,
    protocol text,
    protocol_version text
)
```

Columns:
* `name` - the name/alias of the server; this is sometimes different from the protocol the server serves, e.g. the CQL server is often called "native";
* `listen_addresses` - the addresses this server listens on, empty if the server is not running;
* `protocol` - the name of the protocol this server serves;
* `protocol_version` - the version of the protocol this server understands.

Implemented by `protocol_servers_table` in `db/system_keyspace.cc`.

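For example, to check whether the CQL server is up and where it listens (assuming its alias is "native", as noted above):

```cql
SELECT is_running, listen_addresses, protocol_version
FROM system.protocol_servers
WHERE name = 'native';
```
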
## system.size_estimates

Size estimates for individual token-ranges of each keyspace/table.

Schema:

```cql
CREATE TABLE system.size_estimates (
    keyspace_name text,
    table_name text,
    range_start text,
    range_end text,
    mean_partition_size bigint,
    partitions_count bigint,
    PRIMARY KEY (keyspace_name, table_name, range_start, range_end)
)
```

Implemented by `size_estimates_mutation_reader` in `db/size_estimates_virtual_reader.{hh,cc}`.

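The per-range estimates for a single table can be fetched with the key columns (the keyspace and table names below are placeholders):

```cql
SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'ks1' AND table_name = 'standard1';
```
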
## system.snapshots

The list of snapshots on the node.
Equivalent to the `nodetool listsnapshots` command.

Schema:

```cql
CREATE TABLE system.snapshots (
    keyspace_name text,
    table_name text,
    snapshot_name text,
    live bigint,
    total bigint,
    PRIMARY KEY (keyspace_name, table_name, snapshot_name)
)
```

Implemented by `snapshots_table` in `db/system_keyspace.cc`.

## system.runtime_info

Runtime-specific information, like memory stats, memtable stats, cache stats and more.
Data is grouped so that related items stay together and are easily queried.
Roughly equivalent of the `nodetool info`, `nodetool gettraceprobability` and `nodetool statusgossip` commands.

Schema:

```cql
CREATE TABLE system.runtime_info (
    group text,
    item text,
    value text,
    PRIMARY KEY (group, item)
)
```

Implemented by `runtime_info_table` in `db/system_keyspace.cc`.

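Since `group` is the partition key, the items of one group can be read together. A sketch (the group name below is hypothetical; the available group names can be discovered with `SELECT DISTINCT "group" FROM system.runtime_info;`, quoting the column because `GROUP` is a CQL keyword):

```cql
SELECT item, value FROM system.runtime_info WHERE "group" = 'memory';
```
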
## system.token_ring

The ring description for each keyspace.
Equivalent of the `nodetool describe_ring $KEYSPACE` command (when filtered with `WHERE keyspace_name = $KEYSPACE`).
Overlaps with the output of `nodetool ring`.

Schema:

```cql
CREATE TABLE system.token_ring (
    keyspace_name text,
    start_token text,
    endpoint inet,
    dc text,
    end_token text,
    rack text,
    PRIMARY KEY (keyspace_name, start_token, endpoint)
)
```

Implemented by `token_ring_table` in `db/system_keyspace.cc`.

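For example, the ring of a single keyspace (mirroring `nodetool describe_ring ks1`; the keyspace name is a placeholder):

```cql
SELECT start_token, end_token, endpoint, dc, rack
FROM system.token_ring
WHERE keyspace_name = 'ks1';
```
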
## system.versions

All version-related information.
Equivalent of the `nodetool version` command, but contains more versions.

Schema:

```cql
CREATE TABLE system.versions (
    key text PRIMARY KEY,
    build_id text,
    build_mode text,
    compatible_version text,
    version text
)
```

Implemented by `versions_table` in `db/system_keyspace.cc`.

## system.config

Holds all configuration variables in use.

Schema:

~~~
CREATE TABLE system.config (
    name text PRIMARY KEY,
    source text,
    type text,
    value text
)
~~~

The source of the option is one of 'default', 'config', 'cli', 'cql' or 'internal',
meaning respectively that the value wasn't changed from its default, was configured via the config
file, was set by a command-line option, was set via updating this table, or was deliberately
configured by Scylla internals. Whichever way the option was updated last overrides the
previous value, so the value shown here is the latest one in use.

The type denotes the variable type, like 'string', 'bool', 'integer', etc., including
some Scylla-internal configuration types.

The value is shown as it would appear in the JSON config file.

The table can be updated with the UPDATE statement. The accepted value parameter
must (of course) be a text; it is converted to the target configuration value as
needed.

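For example, a live update of an option followed by a read-back (the option name is illustrative; any live-updatable option present in the table works the same way, and after the update its `source` should read 'cql'):

~~~
UPDATE system.config SET value = '256' WHERE name = 'compaction_static_shares';
SELECT value, source FROM system.config WHERE name = 'compaction_static_shares';
~~~
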
## system.clients

Holds information about client connections.

Schema:

~~~
CREATE TABLE system.clients (
    address inet,
    port int,
    client_type text,
    connection_stage text,
    driver_name text,
    driver_version text,
    hostname text,
    protocol_version int,
    shard_id int,
    ssl_cipher_suite text,
    ssl_enabled boolean,
    ssl_protocol text,
    username text,
    PRIMARY KEY (address, port, client_type)
) WITH CLUSTERING ORDER BY (port ASC, client_type ASC)
~~~

Currently only CQL clients are tracked. The table used to be present on disk (in the data
directory) before and including version 4.5.

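For example, listing who is connected and with which driver, or narrowing to connections from a single address (the partition key; the address below is a placeholder):

~~~
SELECT address, port, username, driver_name, driver_version FROM system.clients;
SELECT * FROM system.clients WHERE address = '10.0.0.1';
~~~
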
## TODO: the rest
|