The comment mentioned tables that no longer exist: their names have
changed some time ago. Update the comment to be name-agnostic.
Furthemore, the second part of the comment related to a case of "joining
a node without bootstrapping". Fortunately this operation is no longer
possible (after #6848 which became part of Scylla 4.3) so we can shorten
the comment.
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.
Ref #1.
Closes#8403
These functions were not used anywhere but had to be maintained anyway.
When (if) the expiration algorithm actually gets implemented (see issue #7300),
the functions can be added back (perhaps they will need to look differently
at that time, and it's likely that the `expire` column won't be used in the
expiration algorithm in the end anyway).
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.
Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes#8387
* github.com:scylladb/scylla:
hints: clarify docstring comment for can_send
hints: use token_metadata to tell if node is in the ring
hints: slightly reogranize "if" statement in can_send
storage_service: release token_metadata lock before notify_left
storage_service: notify_left after token_metadata is replicated
Now, the docstring comment next to can_send better represents the
condition that is checked inside that function. The statement about
returning true when destination left the NORMAL state is replaced with a
statement about returning true when the destination has left the ring.
Now, instead of looking at the gossiper state to check if the
destination node is still in the ring, we are using token_metadata as a
source of truth. This results in much simpler code in can_send() as
token_metadata has an is_member method which does exactly what we want.
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that has been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.
Test: unit(dev)
Refs #8352
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#8360
Fixes#8369
This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.
We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev:s of more or less
the whole file. Not intended.
Closes#8370
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.
Fixes#8336Closes#8338
* github.com:scylladb/scylla:
docs: mention disabling Thrift by default
db,config: disable Thrift by default
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.
We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.
Fixes#4520Closes#7864
This open option tells seastar that the file in question
will be truncated to the needed size right at once and all
the subsequent writes will happen within this size. This
hint turns off append optimization in seastar that's not
that cheap and helps so save few cpu cycles.
The option was introduced in seastar by 8bec57bc.
tests: unit(dev), dtest(commitlog:
test_batch_commitlog,
test_periodic_commitlog,
test_commitlog_replay_on_startup)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.
Fixes#8336
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
Closes#8250
* github.com:scylladb/scylla:
commitlog: Make pre-allocation drop O_DSYNC while pre-filling
commitlog: coroutinize allocate_segment_ex
Refs #7794
Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
Fixes#8212
Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to check routine, which if non-empty includes tables to check.
Use case is "scrub" which calls with a limited set of tables
to snapshot.
Closes#8240
reference.
Newly created view schemas don't always have their base info,
this is bad since such schemas don't support read nor write.
This leaves us vulnerable to a race condition where there is
an attempt to use this schema for read or write. Here we initialize
the base reference and also reconfigure the view to conform to the
new computed column type, which makes it usable for write and not only
reads. We do it for views created in the migration manager following
announcements and also for copied schemas.
We extract the logic for fixing the view schema into it's own
logic as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since
it becomes a two liner wrapper for this logic. We also
remove it here and replace the call to it with the equivalent code.
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.
Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hints flushing to
completely deadlock and no more hints be admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.
To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.
Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again
Fixes#8137Closes#8206
* github.com:scylladb/scylla:
hints_manager: don't use commitlog hard space limit
commitlog: add an option to allow going over size limit
With the new scheme for cdc generation management, one of the last
changes was to make the time ordering of the stream timestamps reversed.
However, cdc_get_versioned_streams forgot to take this into account
when sifting out timestamp ranges for stream retrieval (based on
low mark).
Fixed by doing reverse iteration.
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because they
use small-sized commitlogs to store hints (128MB, currently). Instead of
letting the commitlog delete the segments itself, it recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.
If the 128MB limit is reached, the hints manager will get stuck. A
future which puts hint into commitlog holds a shared lock, and commitlog
recreation needs to get an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.
By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
So it can be modified while walked to dispatch
subscribed event notifications.
In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.
Using an atomic vector itstead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.
Fixes#8143
Test: unit(release)
DTest:
update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until its merged, given that it also depends on other unmerged pull requests.
The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).
Closes#8140
* github.com:scylladb/scylla:
treewide: remove timeout config from query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
service: add timeout config to client state
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.
This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.
Fixes#2622
Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.comCloses#8111
* github.com:scylladb/scylla:
sstables: add test for checking the latency of updating the sstable_set in a table
sstables: move column_family_test class from test/boost to test/lib
sstables: use fast copying of the sstable_set instead of rebuilding it
sstables: replace the sstable_set with a versioned structure
sstables: remove potential ub
sstables: make sstable_set constructor less error-prone
The error now contains information about the view table that failed,
as well as base and view tokens.
Example:
view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index,
base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error)
Fixes#8177Closes#8178
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
Rewriting stream descriptions is a long, expensive, and prone-to-failure
operation. Due to #8061 it may consume a lot of memory. In general, it
may keep failing (and being retried) endlessly, straining the cluster.
As a backdoor we add this flag for potential future needs of admins or
field engineers.
I don't expect it will ever be used, but it won't hurt and may save us
some work in the worst case scenario.
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
The `query_processor::query` method allowed internal paged queries.
However, it was quite limited, hardcoding a number of parameters:
consistency level, timeout config, page size.
This commit does the following improvements:
1. Rename `query` to `query_internal` to make it obvious that this API
is supposed to be used for internal queries only
2. Extend the method to take consistency level, timeout config, and page
size as parameters
3. Remove unused overloads of `query_internal`
4. Fix a bunch of typos / grammar issues in the docstring
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
It could happen that system_distributed_keyspace was used by
storage_service before it was fully started (inside
`handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned
(on shard 0). It only checked whether `local_is_initialized()` returns
true, so it only ensured that the service is constructed.
Currently, sys_dist_ks' `start` only announces migrations, so this was
mostly harmless. More concretely: it could result in the node trying to
send CQL requests using a table that it didn't yet recognize by calling
sys_dist_ks' methods before the `announce_migration` call inside `start`
has returned. This would result in an exception; however, the exception
would be catched by the caller and the procedure would be retried,
succeeding eventually. See `handle_cdc_generation` for details.
Still, the initial intention of the code was to wait for the sys_dist_ks
service to be fully started before it was used. This commit fixes that.
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.
This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).
Fixes: #5578
mutations
Whenever an alter table occurs, the mutations for the just altered table
are sent over to all of the replicas from the coordinator.
In a mixed cluster the mutations should be adapted to a specific version
of the schema. However, the adaptation that happens today doesn't omit
mutations to newly added schema tables, to be more specific, mutations
to the `computed_columns` table which doesn't exist for example in
version 2019.1
This makes altering a table during a rolling upgrade from 2019.1 to
2020.1 dangerous.
nodes
In a mixed cluster there can be a situation where `scylla_tables` needs
to be sent over to another node because a schema sync or because the
node pulls it because it is referenced by a frozen_mutation. The former
is not a problem since the sending node chooses the version to send.
However, the former is problematic since `scylla_tables` versions are not
registered anywhere.
This registers every `scylla_tables` schema version which is used to adapted
mutations since after this happens a schema pull for this version might
follow.
Adding an non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.
Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The code that creates system keyspace open code a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
Closes#7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
The db::update_keyspace() needs sharded<storage_proxy>
reference, but the only caller of it already has it and
can pass one as argument.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
Fixes#7615
Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to segement, or none.
Returns rp_handle vector corresponding to the call vector.
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.
To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the entra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.
When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.