The CQL binary protocol version 3 was introduced in 2014. All Scylla
version support it, and Cassandra versions 2.1 and newer.
Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.
Since protocols 1 and 2 are obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.
Fixes#10607Closes#12432
* github.com:scylladb/scylladb:
treewide: drop cql_serialization_format
cql: modification_statement: drop protocol check for LWT
transport: drop cql protocol versions 1 and 2
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable codes by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory do the WASM programs use and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually acheved by eviction of all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their cache of wasm,
the maximum size of indivitual instances stored in the cache, the
time after which the instances are evicted, the fuel that all wasm
UDFs are allowed to consume before yielding (for the control of
latency), the fuel that wasm UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop) and the hard limit of the size of UDFs
that are executed (to avoid large allocations)
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization format everywhere, except when in the IDL
(since it's part of the inter-node protocol).
A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.
Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization\
format when receiving. The IDL is unchanged.
One test checking the 16-bit serialization format is removed.
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
This reverts commit ac2e2f8883. It causes
a regression ("std::bad_variant_access in load_view_build_progress").
Commit 2978052113 (a reindent) is also reverted as part of
the process.
Fixes#12395
This new option allows user to control the number of compaction groups
per table per shard. It's 0 by default which implies a single compaction
group, as is today.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
active_memtable() was fine to a single group, but with multiple groups,
there will be one active memtable per group. Let's change the
interface to reflect that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now, with a44ca06906, is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12294
* github.com:scylladb/scylladb:
topology: get rid of pending state
topology: debug log update and remove endpoint
Refactor the existing stats tracking and updating
code into struct latency_stats_tracker and while at it,
count lock_acquisitions only on success.
Decrement operations_currently_waiting_for_lock in the destructor
so it's always balanced with the uncoditional increment
in the ctor.
As for updating estimated_waiting_for_lock, it is always
updated in the dtor, both on success and failure since
the wait for the lock happened, whether waiting
timed out or not.
Fixes#12190
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12225
Sometimes a single modification to a base partition requires updates to
a large number of view rows. A common example is deletion of a base
partition containing many rows. A large BATCH is also possible.
To avoid large allocations, we split the large amount of work into
batch of 100 (max_rows_for_view_updates) rows each. The existing code
assumed an empty result from one of these batches meant that we are
done. But this assumption was incorrect: There are several cases when
a base-table update may not need a view update to be generated (see
can_skip_view_updates()) so if all 100 rows in a batch were skipped,
the view update stopped prematurely. This patch includes two tests
showing when this bug can happen - one test using a partition deletion
with a USING TIMESTAMP causing the deletion to not affect the first
100 rows, and a second test using a specially-crafed large BATCH.
These use cases are fairly esoteric, but in fact hit a user in the
wild, which led to the discovery of this bug.
The fix is fairly simple: To detect when build_some() is done it is no
longer enough to check if it returned zero view-update rows; Rather,
it explicitly returns whether or not it is done as an std::optional.
The patch includes several tests for this bug, which pass on Cassandra,
failed on Scylla before this patch, and pass with this patch.
Fixes#12297.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12305
Now, with a44ca06906,
is_normal_token_owner that replaced is_member
does not rely anymore on the pending status
of endpoints in topology.
With that we can get rid of this state and just keep
all endpoints we know about in the topology.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Thanks to #12250, Host IDs uniquely identify nodes. We can use them as Raft IDs which simplifies the code and makes reasoning about it easier, because Host IDs are always guaranteed to be present (while Raft IDs may be missing during upgrade).
Fixes: https://github.com/scylladb/scylladb/issues/12204Closes#12275
* github.com:scylladb/scylladb:
service/raft: raft_group0: take `raft::server_id` parameter in `remove_from_group0`
gms, service: stop gossiping and storing RAFT_SERVER_ID
Revert "gms/gossiper: fetch RAFT_SERVER_ID during shadow round"
service: use HOST_ID instead of RAFT_SERVER_ID during replace
service/raft: use gossiped HOST_ID instead of RAFT_SERVER_ID to update Raft address map
main: use Host ID as Raft ID
It is equal to (if present) HOST_ID and no longer used for anything.
The application state was only gossiped if `experimental-features`
contained `raft`, so we can free this slot.
Similarly, `raft_server_id`s were only persisted in `system.peers` if
the `SUPPORTS_RAFT` cluster feature was enabled, which happened only
when `experimental-features` contained `raft`. The `raft_server_id`
field in the schema was also introduced recently in `master` and didn't
get to be in a release yet. Given either of these reasons, we can remove
this field safely.
Fixes /scylladb/scylla-enterprise/issues#1262
Changes the somewhat ambiguous "none" into "not set" to clarify that "none" is not an
option to be written out, but an absense of a choice (in which case you also have made
a choice).
Closes#12270
The Host ID now uniquely identifies a node (we no longer steal it during
node replace) and Raft is still experimental. We can reuse the Host ID
of a node as its Raft ID. This will allow us to remove and simplify a
lot of code.
With this we can already remove some dead code in this commit.
Now that the local host_id is never changed externally
(by the storage_service upon replace-node),
the method can be made private and be used only for initializing the
local host_id to a random one.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This pull request introduces support for global secondary indexes based on static columns.
Local secondary indexes based on secondary columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require full partition key when querying, such indexes wouldn't be very useful and would only waste resources.
The index table for secondary indexes on static columns, unlike other secondary indexes, do not contain clustering keys from the base table. A static column's value determines a set of full partitions, so the clustering keys would only be unnecessary.
The already existing logic for querying using secondary indexes works after introducing minimal notifications. The view update generation path now works on a common representation of static and clustering rows, but the new representation allowed to keep most of the logic intact.
New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed.
Materialized view with static rows as a key are __not__ implemented in this PR.
Fixes: #2963Closes#11166
* github.com:scylladb/scylladb:
test_materialized_view: verify that static columns are not allowed
test_secondary_index: add (currently failing) test for static index paging
test_secondary_index: add more tests for secondary indexes on static columns
cassandra_tests: enable existing tests for static columns
create_index_statement: lift restriction on secondary indexes on static rows
db/view: fetch and process static rows when building indexes
gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature
create_index_statement: disallow creation of local indexes with static columns
select_statement: prepare paging for indexes on static columns
select_statement: do not attempt to fetch clustering columns from secondary index's table
secondary_index_manager: don't add clustering key columns to index table of static column index
replica/table: adjust the view read-before-write to return static rows when needed
db/view: process static rows in view_update_builder::on_results
db/view: adjust existing view update generation path to use clustering_or_static_row
column_computation: adjust to use clustering_or_static_row
db/view: add clustering_or_static_row
deletable_row: add column_kind parameter to is_live
view_info: adjust view_column to accept column_kind
db/view: base_dependent_view_info: split non-pk columns into regular and static
The problematic scenario this patch fixes might happen due to
unfortunate serialization of locks/unlocks between lock_pk and lock_ck,
as follows:
1. lock_pk acquires an exclusive lock on the partition.
2.a lock_ck attempts to acquire shared lock on the partition
and any lock on the row. both cases currently use a fiber
returning a future<rwlock::holder>.
2.b since the partition is locked, the lock_partition times out
returning an exceptional future. lock_row has no such problem
and succeeds, returning a future holding a rwlock::holder,
pointing to the row lock.
3.a the lock_holder previously returned by lock_pk is destroyed,
calling `row_locker::unlock`
3.b row_locker::unlock sees that the partition is not locked
and erases it, including the row locks it contains.
4.a when_all_succeeds continuation in lock_ck runs. Since
the lock_partition future failed, it destroyes both futures.
4.b the lock_row future is destroyed with the rwlock::holder value.
4.c ~holder attempts to return the semaphore units to the row rwlock,
but the latter was already destroyed in 3.b above.
Acquiring the partition lock and row lock in parallel
doesn't help anything, but it complicates error handling
as seen above,
This patch serializes acquiring the row lock in lock_ck
after locking the partition to prevent the above race.
This way, erasing the unlocked partition is never expected
to happen while any of its rows locks is held.
Fixes#12168
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12208
This commit modifies the view builder and its consumer so that static
rows are always fetched and properly processed during view build.
Currently, the view builder will always fetch both static and clustering
rows, regardless of the type of indexes being built. For indexes on
static columns this is wasteful and could be improved so that only the
types of rows relevant to indexes being built are fetched - however,
doing this sounds a bit complicated and I would rather start with
something simpler which has a better chance of working.
Adjusts the read-before-write query issued in
`table::do_push_view_replica_updates` so that, when needed, requests
static columns and makes sure that the static row is present.
The `view_update_builder::on_results()` function is changed to react to
static rows when comparing read-before-write results with the base table
mutation.
Adjusts the column_computation interface so that it is able to accept
both clustering and static rows through the common
db::view::clustering_or_static_row interface.
Adds a `clustering_or_static_row`, which is a common, immutable
representation of either a static or clustering row. It will allow to
handle view update generation based on static or clustering rows in a
uniform way.
The `view_info::view_column()` and `view_column` in view.cc allow to get
a view's column definition which corresponds to given base table's
column. They currently assume that the given column id corresponds to a
regular column. In preparation for secondary indexes based on static
columns, this commit adjusts those functions so that they accept other
kinds of columns, including static columns.
Currently, `base_dependent_view_info::_base_non_pk_columns_in_view_pk`
field keeps a list of non-primary-key columns from the base table which
are a part of the view's primary key. Because the current code does not
allow indexes on static columns yet, the columns kept in the
aforementioned field are always assumed to be regular columns of the
base table and are kept as `column_id`s which do not contain information
about the column kind.
This commit splits the `_base_non_pk_columns_in_view_pk` field into two,
one for regular columns and the other for static columns, so that it is
possible to keep both kinds of columns in `base_dependent_view_info` and
the structure can be used for secondary indexes on static columns.
run_snapshot_list_operation() takes a continuation, so passing it
a lambda coroutine without protection is dangerous.
Protect the coroutine with coroutine::lambda so it doesn't lost its
contents.
Fixes#12192.
Closes#12193
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
Closes#12107
* github.com:scylladb/scylladb:
service/raft: specialized verb for failure detector pinger
db: system_keyspace: de-staticize `{get,set}_raft_server_id`
service/raft: make this node's Raft ID available early in group registry
Mainly this PR removes global db::config and feature service that are used by sstables::test_env as dependencies for embedded sstables_manager. Other than that -- drop unused methods, remove nested test_env-s and relax few cases that use two temp dirs at a time for no gain.
Closes#12155
* github.com:scylladb/scylladb:
test, utils: Use only one tempdir
sstable_compaction_test: Dont create nested envs
mutation_reader_test: Remove unused create_sstable() helper
tests, lib: Move globals onto sstables::test_env
tests: Use sstables::test_env.db_config() to access config
features: Mark feature_config_from_db_config const
sstable_3_x_test: Use env method to create sst
sstable_3_x_test: Indentation fix after previous patch
sstable_3_x_test: Use sstable::test_env
test: Add config to sstable::test_env creation
config: Add constexpr value for default murmur ignore bits
... and use in some places of sstable_compaction_test. This will allow
getting rid of global test_db_config thing later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the core of dynamic IP address support in Raft, moving out the
IP address sourcing from Raft Group 0 configuration to gossip. At start
of Raft, the raft id <> IP address translation map is tuned into the
gossiper notifications and learns IP addresses of Raft hosts from them.
The series intentionally doesn't contain the part which speeds up the
initial cluster assembly by persisting the translation cache and using
more sources besides gossip (discovery, RPC) to show correctness of the
approach.
Closes#12035
* github.com:scylladb/scylladb:
raft: (rpc) do not throw in case of a missing IP address in RPC
raft: (address map) actively maintain ip <-> raft server id map
1) make address map API flexible
Before this patch:
- having a mapping without an actual IP address was an
internal error
- not having a mapping for an IP address was an internal
error
- re-mapping to a new IP address wasn't allowed
After this patch:
- the address map may contain a mapping
without an actual IP address, and the caller must be prepared for it:
find() will return a nullopt. This happens when we first add an entry
to Raft configuration and only later learn its IP address, e.g. via
gossip.
- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications
Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.
3) prompt address map state with app state
Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).
The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.
Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.
Thanks to the changes above, Raft configuration no longer
stores IP addresses.
We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
Without memory corruption it's not possible for the switch to
fall through, and the compiler will error if we forget to add
a case. The compiler however is obliged to consider that we might
store some other value in the variable.
Global index page caching, as introduced in 4.6
(078a6e422b and 9f957f1cf9) has proven to be misdesigned,
because it poses a risk of catastrophic performance regressions in
common workloads by flooding the cache with useless index entries.
Because of that risk, it should be disabled by default.
Refs #11202Fixes#11889Closes#11890
In system_keyspace::get_repair_history value of repair_uuid
is got from row as tasks::task_id.
tasks::task_id is represented by an abstract_type specific
for utils::UUID. Thus, since their typeids differ, bad_cast
is thrown.
repair_uuid is got from row as utils::UUID and then cast.
Since no longer needed, data_type_for<tasks::task_id> is deleted.
Fixes: #11966Closes#12062
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.
This patch removes this requirement and in essence GAs this feature.
Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.
This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.
Fixes#12037
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12049
Don't call get_datacenter(ep) without checking
first has_endpoint(ep) since the former may abort
on internal error if the endpoint is not listed
in topology.
Refs #11870
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12054
We plan to stop storing IP addresses in Raft configuration, and instead
use the information disseminated through gossip to locate Raft peers.
Implement patches that are building up to that:
* improve Raft API of configuration change notifications
* disseminate raft host id in Gossip
* avoid using Raft addresses from Raft configuraiton, and instead
consistently use the translation layer between raft server id <-> IP
address
Closes#11953
* github.com:scylladb/scylladb:
raft: persist the initial raft address map
raft: (upgrade) do not use IP addresses from Raft config
raft: (and gossip) begin gossiping raft server ids
raft: change the API of conf change notifications
The lister accepts sort of a filter -- what kind of entries to list, regular, directories or both. It currently uses unordered_set, but enum_set is shorter and better describes the intent.
Closes#12017
* github.com:scylladb/scylladb:
lister: Make lister::dir_entry_types an enum_set
database: Avoid useless local variable
This type is currently an unordered_set, but only consists of at most
two elements. Making it an enum_set renders it into a size_t variable
and better describes the intention.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>