Protocol extensions to the Cassandra Native Protocol
====================================================

This document specifies extensions to the protocol defined
by Cassandra's native_protocol_v4.spec and native_protocol_v5.spec.
The extensions are designed so that a driver supporting them can
continue to interoperate with Cassandra and other compatible servers
with no configuration needed; the driver can discover the extensions
and enable them conditionally.

An extension can be discovered by using the OPTIONS request; the
returned SUPPORTED response will have zero or more options beginning
with SCYLLA indicating extensions defined in this document, in
addition to options documented by Cassandra. How to use each extension
is further explained in this document.

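The discovery step can be sketched as a simple prefix filter over the decoded SUPPORTED options. This is a minimal illustration, not driver code: it assumes the driver has already decoded the SUPPORTED body (a string multimap in the native protocol) into a `std::map`, and the names `SupportedOptions` and `scylla_extension_keys` are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Assumed in-memory form of a decoded SUPPORTED body:
// option name -> list of values (hypothetical type alias).
using SupportedOptions = std::map<std::string, std::vector<std::string>>;

// Returns the names of all options that begin with "SCYLLA".
// An empty result means the server advertises none of the extensions
// described in this document (for example, a plain Cassandra node).
std::vector<std::string> scylla_extension_keys(const SupportedOptions& supported) {
    static const std::string prefix = "SCYLLA";
    std::vector<std::string> keys;
    for (const auto& entry : supported) {
        const std::string& name = entry.first;
        if (name.compare(0, prefix.size(), prefix) == 0) {
            keys.push_back(name);
        }
    }
    return keys;
}
```

A driver could enable extension-specific behavior only when this list is non-empty, and otherwise fall back to standard behavior.
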
# Intranode sharding

This extension allows the driver to discover how Scylla internally
partitions data among logical cores. The driver can then create at least
one connection per logical core, and send queries directly to the
logical core that will serve them, greatly improving load balancing
and efficiency.

To use the extension, send the OPTIONS message. The data is returned
in the SUPPORTED message, as a set of key/value options. Numeric values
are returned as their base-10 ASCII representation.

The keys and values are:
  - `SCYLLA_SHARD` is an integer, the zero-based shard number this connection
    is connected to (for example, `3`).
  - `SCYLLA_NR_SHARDS` is an integer containing the number of shards on this
    node (for example, `12`). All shard numbers are smaller than this number.
  - `SCYLLA_PARTITIONER` is the fully-qualified name of the partitioner in use (for example,
    `org.apache.cassandra.partitioners.Murmur3Partitioner`).
  - `SCYLLA_SHARDING_ALGORITHM` is the name of the algorithm used to map
    partitions to shards (described below).
  - `SCYLLA_SHARDING_IGNORE_MSB` is an integer parameter to the algorithm (also
    described below).

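As an illustration of reading these options, the sketch below converts the five keys into a typed structure. It assumes the SUPPORTED body has already been decoded into a `std::map` and that the first value of each option is the one of interest. The names `ShardingInfo`, `first_value`, `parse_decimal` and `parse_sharding_info` are hypothetical, and treating a missing or malformed key as "extension unavailable" is a design assumption rather than something this document mandates.

```cpp
#include <charconv>
#include <map>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

using SupportedOptions = std::map<std::string, std::vector<std::string>>;

// Hypothetical container for the sharding parameters of one connection.
struct ShardingInfo {
    int shard;                // SCYLLA_SHARD
    int nr_shards;            // SCYLLA_NR_SHARDS
    std::string partitioner;  // SCYLLA_PARTITIONER
    std::string algorithm;    // SCYLLA_SHARDING_ALGORITHM
    int ignore_msb;           // SCYLLA_SHARDING_IGNORE_MSB
};

// First value of an option, or nullopt if the option is absent or empty.
static std::optional<std::string> first_value(const SupportedOptions& s,
                                              const std::string& key) {
    auto it = s.find(key);
    if (it == s.end() || it->second.empty()) return std::nullopt;
    return it->second.front();
}

// Parses a non-negative base-10 ASCII integer; nullopt if malformed.
static std::optional<int> parse_decimal(const std::string& text) {
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto result = std::from_chars(first, last, value);
    if (result.ec != std::errc() || result.ptr != last || value < 0) {
        return std::nullopt;
    }
    return value;
}

// Returns the sharding parameters, or nullopt if any key is missing,
// malformed, or inconsistent (shard must be smaller than nr_shards).
std::optional<ShardingInfo> parse_sharding_info(const SupportedOptions& supported) {
    auto shard = first_value(supported, "SCYLLA_SHARD");
    auto nr_shards = first_value(supported, "SCYLLA_NR_SHARDS");
    auto partitioner = first_value(supported, "SCYLLA_PARTITIONER");
    auto algorithm = first_value(supported, "SCYLLA_SHARDING_ALGORITHM");
    auto ignore_msb = first_value(supported, "SCYLLA_SHARDING_IGNORE_MSB");
    if (!shard || !nr_shards || !partitioner || !algorithm || !ignore_msb) {
        return std::nullopt;
    }
    auto shard_n = parse_decimal(*shard);
    auto nr_n = parse_decimal(*nr_shards);
    auto msb_n = parse_decimal(*ignore_msb);
    if (!shard_n || !nr_n || !msb_n) return std::nullopt;
    if (*nr_n == 0 || *shard_n >= *nr_n) return std::nullopt;
    return ShardingInfo{*shard_n, *nr_n, *partitioner, *algorithm, *msb_n};
}
```

Returning `std::nullopt` lets the caller fall back to shard-unaware behavior when talking to a server that does not offer the extension.
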
Currently, one `SCYLLA_SHARDING_ALGORITHM` is defined,
`biased-token-round-robin`. To apply the algorithm,
perform the following steps (assuming infinite-precision arithmetic):

  - subtract the minimum token value from the partition's token
    in order to bias it: `biased_token = token - (-2**63)`
  - shift `biased_token` left by `SCYLLA_SHARDING_IGNORE_MSB` bits, discarding any
    bits beyond bit 63:
    `biased_token = (biased_token << SCYLLA_SHARDING_IGNORE_MSB) % (2**64)`
  - multiply by `SCYLLA_NR_SHARDS` and perform a truncating division by 2**64:
    `shard = (biased_token * SCYLLA_NR_SHARDS) / 2**64`

(This apparently convoluted algorithm replaces a slow division instruction with
a fast multiply instruction.)

In C or C++ with 128-bit arithmetic support (for example, the `unsigned __int128`
type in GCC and Clang), these operations can be efficiently
performed in three steps:

```c++
// token is the signed 64-bit partition token
uint64_t biased_token = token + ((uint64_t)1 << 63);
biased_token <<= ignore_msb;
int shard = ((unsigned __int128)biased_token * nr_shards) >> 64;
```

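To make the arithmetic concrete, the three steps can be wrapped in a function and checked against a few hand-computed tokens. This is a sketch under stated assumptions: the function name `shard_of` and its parameter names are hypothetical, `unsigned __int128` is a compiler extension (GCC and Clang), and `ignore_msb` is assumed to be smaller than 64.

```cpp
#include <cstdint>

// Maps a partition token to a shard using biased-token-round-robin.
// nr_shards and ignore_msb correspond to SCYLLA_NR_SHARDS and
// SCYLLA_SHARDING_IGNORE_MSB. Precondition (assumed): ignore_msb < 64.
inline int shard_of(int64_t token, uint32_t nr_shards, unsigned ignore_msb) {
    // Step 1: bias the token into the unsigned range [0, 2**64).
    uint64_t biased_token = static_cast<uint64_t>(token) + (uint64_t{1} << 63);
    // Step 2: shift left; bits above bit 63 fall off (the "% 2**64" step).
    biased_token <<= ignore_msb;
    // Step 3: multiply by the shard count and keep the high 64 bits,
    // which is the truncating division by 2**64.
    return static_cast<int>(
        (static_cast<unsigned __int128>(biased_token) * nr_shards) >> 64);
}
```

With 12 shards and `ignore_msb` of 0, the minimum token maps to shard 0, token 0 maps to shard 6, and the maximum token maps to shard 11. With `ignore_msb` of 12, token 0 maps to shard 0 instead, because the top bit of the biased token is shifted out before the multiplication.
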
In languages without 128-bit arithmetic support, use the following (this example
is for Java):

```Java
private int scyllaShardOf(long token) {
    // Bias the token into the unsigned range [0, 2**64).
    token += Long.MIN_VALUE;
    // Shift out the ignoreMsb most significant bits.
    token <<= ignoreMsb;
    // Split into 32-bit halves so that each product fits in a long.
    long tokLo = token & 0xffffffffL;
    long tokHi = (token >>> 32) & 0xffffffffL;
    long mul1 = tokLo * nrShards;
    long mul2 = tokHi * nrShards;
    // Add the carry from the low product; the shard is in bits 32 and up.
    long sum = (mul1 >>> 32) + mul2;
    return (int)(sum >>> 32);
}
```

It is recommended that drivers open connections until they have at
least one connection per shard (as reported by `SCYLLA_SHARD` on each
connection), then close excess connections.

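The recommendation above can be sketched as a loop that keeps opening connections until every shard is covered. Everything here is illustrative: `Connection` is a hypothetical handle that only records the `SCYLLA_SHARD` value of the connection, `open_connection` is a hypothetical callback standing in for the driver's connect-and-OPTIONS sequence, and the `max_attempts` bound is an added assumption so that the loop terminates if some shard is never reached.

```cpp
#include <functional>
#include <optional>
#include <vector>

// Hypothetical connection handle; only the shard number matters here.
struct Connection {
    int shard;  // value of SCYLLA_SHARD reported for this connection
};

// Opens connections until each of nr_shards shards has one, or until
// max_attempts connections have been opened. Returns a vector indexed by
// shard number; an entry is empty if that shard was never reached.
std::vector<std::optional<Connection>> connect_to_all_shards(
        int nr_shards,
        const std::function<Connection()>& open_connection,
        int max_attempts) {
    std::vector<std::optional<Connection>> by_shard(nr_shards);
    std::vector<Connection> excess;  // kept open until all shards are covered
    int missing = nr_shards;
    for (int attempt = 0; attempt < max_attempts && missing > 0; ++attempt) {
        Connection c = open_connection();
        bool valid = c.shard >= 0 && c.shard < nr_shards;
        if (valid && !by_shard[c.shard].has_value()) {
            by_shard[c.shard] = c;
            --missing;
        } else {
            excess.push_back(c);
        }
    }
    // A real driver would close the excess connections at this point;
    // in this sketch they are simply dropped.
    excess.clear();
    return by_shard;
}
```

Excess connections are held until the loop finishes rather than closed immediately, matching the order given in the recommendation (open first, then close).
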