Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required because there is now a separate landing
page for the ScyllaDB documentation, so the page was duplicated,
creating a bad user experience.
Closes #14343
The chunk size used in sstable compression can be set when creating a
table, using the "chunk_length_in_kb" parameter. It can be any power-of-two
multiple of 1 KB. Very large compression chunks are not useful - they
offer diminishing returns on compression ratio, and require very large
memory buffers and reading a very large amount of disk data just to
read a small row. In fact, small chunks are recommended - Scylla
defaults to 4 KB chunks, and Cassandra lowered their default from 64 KB
(in Cassandra 3) to 16 KB (in Cassandra 4).
Therefore, allowing arbitrarily large chunk sizes is just asking for
trouble. Today, a user can ask for a 1 GB chunk size, and crash or hang
Scylla when it runs out of memory. So in this patch we add a hard limit
of 128 KB for the chunk size - anything larger is refused.
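A minimal sketch of the kind of check this implies (hypothetical helper name, standalone code, not the actual ScyllaDB validation path):
```
// Hypothetical sketch: "chunk_length_in_kb" must be a power-of-two multiple of
// 1 KB and, after this patch, must not exceed 128 KB.
#include <stdexcept>
#include <string>

constexpr unsigned max_chunk_length_in_kb = 128;

void validate_chunk_length_in_kb(unsigned kb) {
    // a power of two has exactly one bit set
    bool power_of_two = kb != 0 && (kb & (kb - 1)) == 0;
    if (!power_of_two) {
        throw std::invalid_argument("chunk_length_in_kb must be a power of two");
    }
    if (kb > max_chunk_length_in_kb) {
        throw std::invalid_argument("chunk_length_in_kb must not exceed "
                + std::to_string(max_chunk_length_in_kb) + " KB");
    }
}
```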
Fixes #9933
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14267
This reverts commit 562087beff.
The regressions introduced by the change that 562087beff reverted have
since been fixed, so let's revert that revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes #10459
This PR changes the system to respect the tablet-to-shard assignment recorded in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards, taking into account currently allocated tablets in the system. Each tablet has equal weight; vnode load is ignored.
2. The CDC subsystem was not adjusted (not supported yet).
3. sstable sharding metadata reflects tablet boundaries.
4. Resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables).
5. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
6. sstable cleanup is not wired up properly yet.
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make life easier, support was added to obtain the table pointer from the schema pointer:
```
schema_ptr s;
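// table() is the table attached to this schema; its shard_of() uses that
// table's sharder, so it is correct for tablet-based tables as well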
s->table().shard_of(...)
```
Closes #13939
* github.com:scylladb/scylladb:
    locator: network_topology_strategy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes #11017
When doing writes, the storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write-inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally, all code
using (and especially modifying) these metrics is executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
call get_ep_stat and register_metrics.
This code (before this patch) uses the _active_ scheduling group to eventually add
metrics, using a local dict as a guard against double registrations. If, as described above,
we're called in a different scheduling group than the original one, this
can cause double registrations.
Fixed here by keeping a reference to the creating scheduling group and using it, not the
active one, when/if creating new metrics.
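A minimal sketch of the idea, assuming Seastar's current_scheduling_group() and with_scheduling_group() helpers (hypothetical struct and method names, not the actual storage_proxy code):
```
// Hypothetical sketch: remember the scheduling group the stats object was created
// in, and register metrics under it rather than under whatever group happens to be
// active (e.g. the one running a gossip "node down" notification).
// Header locations are approximate.
#include <seastar/core/scheduling.hh>
#include <seastar/core/future-util.hh>

struct write_stats_sketch {
    // captured once, in the group that owns these metrics
    seastar::scheduling_group _owner = seastar::current_scheduling_group();

    seastar::future<> register_metrics_for_endpoint() {
        // switch back to the owning group before touching the metric registry,
        // so the "already registered" bookkeeping is consulted consistently
        return seastar::with_scheduling_group(_owner, [] {
            // ... create/register the metrics here ...
        });
    }
};
```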
Closes #14294
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
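A minimal illustration of the idea (hypothetical class, standard C++ only; the real load_sketch is more involved):
```
// Hypothetical sketch of the load_sketch idea: count tablet replicas already
// allocated to each shard of a node and always place the next one on the
// least-loaded shard. Every tablet is assumed to have equal weight.
#include <algorithm>
#include <cstddef>
#include <vector>

class load_sketch_sketch {
    std::vector<std::size_t> _tablets_per_shard; // index = shard id
public:
    explicit load_sketch_sketch(std::size_t shard_count)
        : _tablets_per_shard(shard_count, 0) {}

    // Seed with tablets already allocated on this node, for any table.
    void account_existing(std::size_t shard) { ++_tablets_per_shard[shard]; }

    // Pick the least-loaded shard for a new tablet and record the allocation.
    std::size_t allocate() {
        auto it = std::min_element(_tablets_per_shard.begin(), _tablets_per_shard.end());
        ++*it;
        return static_cast<std::size_t>(it - _tablets_per_shard.begin());
    }
};
```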
We still use it in many places in unit tests, which is ok because
those tables are vnode-based.
We want to catch incorrect uses in production, as they may lead to
hard-to-debug consistency problems.
This is in order to prevent new incorrect uses of dht::shard_of() from
being accidentally added. It also makes sure that all current uses are
caught by the compiler and require an explicit rename.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
This is not strictly necessary, as the multishard reader will later be
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use erm->get_sharder() instead of
schema::get_sharder().
schema::get_sharder() does not use the correct sharder for
tablet-based tables. Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
The function currently assumes that shard assignment for subsequent
tokens is round-robin, which will not be the case for tablets. This
can lead to an incorrect split calculation or an infinite loop.
Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:
start_token = sharder.token_for_next_shard(end_token, shard);
If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
Soon, compute_shards_for_this_sstable() will need to take a sharder object.
open_data() is called indirectly from sstable::load() and directly
after writing an sstable from various paths. The latter don't really
need to compute shards, since the field is already set by the writer. In
order to reduce code churn, move compute_shards_for_this_sstable() to
the load() path only so that only load() needs to take the sharder.
We need those functions to work with the tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather than
from the schema object. The sharder will come from the
effective_replication_map attached to the table object.
Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
For tablets, sharding depends on the replication map, so the scope of the
sharder should be effective_replication_map rather than the schema
object.
Existing users will be transitioned incrementally in later patches.
The logic was extracted from ring_position_range_sharder::next(), and
the latter was changed to rely on sharder::next_shard().
The tablet sharder will have a different implementation for
next_shard(). This way, ring_position_range_sharder can work with both
the current sharder and the tablet sharder.
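Purely illustrative sketch of the contract (placeholder types and names, not the actual dht::sharder interface):
```
// Illustration only: next_shard() reports the next point on the ring, after the
// given token, at which the owning shard changes, together with the shard that
// owns the range starting there. Both a round-robin (vnode) sharder and a tablet
// sharder can implement it, even though for tablets adjacent ranges may map to
// the same shard.
#include <cstdint>
#include <optional>

using token_t = std::int64_t;   // stand-in for dht::token
using shard_id = unsigned;

struct shard_boundary {
    token_t first_token;  // first token of the next shard-owned range
    shard_id shard;       // shard owning that range
};

struct sharder_like {
    virtual ~sharder_like() = default;
    virtual shard_id shard_of(token_t t) const = 0;
    // std::nullopt if the shard does not change again before the end of the ring
    virtual std::optional<shard_boundary> next_shard(token_t t) const = 0;
};
```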
Querying the virtual table system.token_ring fails if there is a
tablet-based table, due to an attempt to obtain a per-keyspace erm.
Fix by not showing such keyspaces.
This will make it easier to access table properties in places which
only have a schema_ptr. This is particularly useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.
Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.
It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
The entry may exist, but its schema may not yet be loaded. learn()
didn't take that into account. This problem is not reachable in
production code, which currently always calls get_or_load() before
learn(), except during boot, but there's no concurrency at that point.
Exposed by a unit test added later.
System tables have static schemas and code uses those static schemas
instead of looking them up in the database. We want those schemas to
have a valid table() once the table is created, so we need to attach
the registry entry to the target schema rather than to a schema duplicate.
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.
Two new columns are added to `system.topology`:
- `supported_features set<text>` is a new clustering column which holds the features that a given node advertises as supported. It is first initialized when the node joins the cluster, and then updated every time the node reboots and its supported feature set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation, the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.
These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.
During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.
Closes #14232
* github.com:scylladb/scylladb:
storage_service: update supported cluster features in group0 on start
storage_service: add methods for features to topology mutation builder
storage_service: use explicit ::set overload instead of a template
storage_service: reimplement mutation builder setters
storage_service: introduce topology_mutation_builder_base
topology_state_machine: include information about features
system_keyspace: introduce deserialize_set_column
db/system_keyspace: add storage for cluster features managed in group 0
Now, when a node starts, it will update its `supported_features` row in
`system.topology` via `update_topology_with_local_metadata`.
At this point, the functionality behind cluster features on raft is
mostly incomplete and the state of the `supported_features` column does
not influence anything, so it's safe to update this column
unconditionally. In the future, the node will only join / start its group0
server if it is sure that it supports all enabled features and it can
safely update the `supported_features` parameter.
The newly added `supported_features` and `enabled_features` columns can
now be modified via topology mutation builders:
- `supported_features` can now be overwritten via a new overload of
`topology_node_mutation_builder::set`.
- `enabled_features` can now be extended (i.e. more elements can be
added to it) via `topology_mutation_builder::add_enabled_features`. As
the set of enabled features only grows, this should be sufficient.
The `topology_node_mutation_builder::set` function has an overload which
accepts any type which can be converted to string via `::format`. Its
presence can lead to easy mistakes which can only be detected at runtime
rather than at compile time. A concrete example: I wrote a function that
accepts an std::set<S> where S is convertible to sstring; it turns out
that std::string_view does not satisfy std::convertible_to<sstring>, and overload
resolution fell back to the catch-all overload.
This commit gets rid of the catch-all overload and replaces it with
explicit ones. Fortunately, it was used for only two enums, so it wasn't
much work.
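A self-contained illustration of the pitfall in plain C++ (std::string stands in for sstring; this is not the ScyllaDB code):
```
// A catch-all overload silently absorbs argument types the author did not intend,
// turning what should be a compile-time error into surprising runtime behavior.
#include <iostream>
#include <set>
#include <string>
#include <string_view>

struct builder {
    // intended overload: a set of strings is written as a set column
    void set(const char* cell, const std::set<std::string>& value) {
        std::cout << cell << ": set column with " << value.size() << " elements\n";
    }
    // catch-all overload: anything "formattable" is written as a single text value
    template <typename T>
    void set(const char* cell, const T& value) {
        std::cout << cell << ": formatted as a single value\n";
    }
};

int main() {
    builder b;
    std::set<std::string_view> views = {"a", "b"};
    // std::set<std::string_view> does not convert to std::set<std::string>,
    // so overload resolution silently picks the catch-all template.
    b.set("supported_features", views);
    // With the catch-all removed, this call would fail to compile instead.
}
```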
As promised in the previous commit which introduced
topology_mutation_builder_base, this commit adjusts existing setters of
topology mutation builder and topology node mutation builder to use
helper methods defined in the base class.
Note that the `::set` method for the unordered set of tokens now does
not delete the column when an empty value is set; instead, it just
writes an empty set. This semantic is arguably clearer given that we
have an explicit `::del` method, and it shouldn't affect the existing
implementation - we never intentionally insert an empty set of tokens.
Introduces `topology_mutation_builder_base` which will be a base class
for both topology mutation builder and topology node mutation builder.
Its purpose is to abstract away some of the details of setting/deleting/etc.
a column in the mutation; the actual topology (node) mutation builders will
only have to care about converting types and/or allowing only particular
columns to be set. The class uses CRTP: derived classes provide
access to the row being modified, the schema, and the timestamp.
For the sake of commit diff readability, this commit only introduces this
class and changes the builders to derive from it, but no setter
implementations are modified - this will be done in the next commit.
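A minimal CRTP sketch of the shape this describes (hypothetical names and a trivial "cell", not the actual ScyllaDB classes):
```
// The CRTP base implements the generic "write a cell" plumbing; each derived
// builder supplies the row it modifies and the timestamp, and only exposes
// typed setters for the columns it allows.
#include <cstdint>
#include <iostream>
#include <string>

template <typename Derived>
class mutation_builder_base {
protected:
    // Generic helper shared by all builders.
    void set_cell(const std::string& column, const std::string& value) {
        auto& self = static_cast<Derived&>(*this);
        std::cout << "row=" << self.row_id() << " ts=" << self.timestamp()
                  << " " << column << "=" << value << "\n";
    }
};

class node_mutation_builder : public mutation_builder_base<node_mutation_builder> {
    std::string _host_id;
    std::int64_t _ts;
public:
    node_mutation_builder(std::string host_id, std::int64_t ts)
        : _host_id(std::move(host_id)), _ts(ts) {}
    // Accessors the CRTP base relies on.
    const std::string& row_id() const { return _host_id; }
    std::int64_t timestamp() const { return _ts; }
    // Only specific, typed setters are exposed; type conversion happens here.
    node_mutation_builder& set_datacenter(const std::string& dc) {
        set_cell("datacenter", dc);
        return *this;
    }
};

int main() {
    node_mutation_builder(/*host_id=*/"node-1", /*ts=*/42).set_datacenter("dc1");
}
```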
There are three places in system_keyspace.cc which deserialize a column
holding a set of tokens and convert it to an unordered set of
dht::token. The deserialization process involves a small number of steps
that are the same in all of those places, so they can be
abstracted away.
This commit adds a `deserialize_set_column` function which takes care of
deserializing the column to `set_type_impl::native_type`, which can
then be passed to `decode_tokens`. The new function will also be useful for
decoding set columns with cluster features, which will be handled in the
next commit.
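A generic, standard-C++-only illustration of the refactoring idea (the types and helper names below are stand-ins, not the ScyllaDB API):
```
// One shared helper performs the "serialized cell -> native set elements" steps;
// callers only convert the elements to their own domain type (tokens, features, ...).
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Stand-ins for a serialized set cell and for set_type_impl::native_type.
using serialized_cell = std::string;          // comma-separated for this sketch
using native_set = std::vector<std::string>;  // deserialized elements

// The shared helper: the only place that knows how to split the cell into elements.
native_set deserialize_set_column(const serialized_cell& cell) {
    native_set out;
    std::size_t start = 0;
    while (start <= cell.size()) {
        auto end = cell.find(',', start);
        if (end == std::string::npos) { end = cell.size(); }
        if (end > start) { out.push_back(cell.substr(start, end - start)); }
        start = end + 1;
    }
    return out;
}

// A caller-specific conversion, e.g. tokens (int64_t stands in for dht::token).
std::unordered_set<std::int64_t> decode_tokens(const native_set& elements) {
    std::unordered_set<std::int64_t> tokens;
    for (const auto& e : elements) { tokens.insert(std::stoll(e)); }
    return tokens;
}

int main() {
    auto tokens = decode_tokens(deserialize_set_column("42,-7,1000"));
    return tokens.size() == 3 ? 0 : 1;
}
```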