TLS certificate authenticator registers itself using a
`class_registrator`. that's why CMake is able to build without
compiling this source file. but for the sake of completeness, and
to be sync with configure.py, let's add it to CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14866
Debug mode is so slow that the work:poll ratio decreases, leading
to even more slowness as more polling is done for the same amount
of work.
Increase the task quota to recover some performance.
Ref #14752.
Closes#14820
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
This commit adds a requirement to upgrade ScyllaDB
drivers before upgrading ScyllaDB.
The requirement to upgrade the Monitoring Stack
has been moved to the new section so that
both prerequisites are documented together.
NOTE: The information is added to the 5.2-to-5.3
upgrade guide because all future upgrade guides
will be based on this one (as it's the latest one).
If 5.3 is released, this commit should be backported
to branch-5.3.
Refs https://github.com/scylladb/scylladb/issues/13958Closes#14771
If the cluster isn't empty and all servers are stopped, calling
ScyllaCluster.add_server can start a new cluster. That's because
ScyllaCluster._seeds uses the running servers to calculate the
seed node list, so if all nodes are down, the new node would
select only itself as a seed, starting a new cluster.
As a single ScyllaCluster should describe a single cluster, we
make ScyllaCluster.add_server fail when called on a non-empty
cluster with all its nodes stopped.
Closes#14804
when we convert timestamp into string it must look like: '2017-12-27T11:57:42.500Z'
it concerns any conversion except JSON timestamp format
JSON string has space as time separator and must look like: '2017-12-27 11:57:42.500Z'
both formats always contain milliseconds and timezone specification
Fixes#14518Fixes#7997Closes#14726
simpler than the "begin, end" iterator pair. and also tighten the
type constraints, now require the value type to be
sstables::shared_sstable. this matches what we are expecting in
the implementation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14678
The db.get_version() called that early returns value that database got
construction-time, i.e. -- empty_version thing. It makes little sense
committing it into the system k.s. all the more so the "real" version is
calculated and updated few steps after .setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14833
Fortunately, this is pretty simple -- the only caller is storage_service
that has sharded<system_keysace> dependency reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14824
It was found that cached_file dtor can hit the following assert
after OOM
cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.`
cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.
That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.
share() is responsible for automatically placing the page into
LRU once refcount drops to zero.
If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.
Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.
A page read ahead will be linked into LRU so it doesn't sit in
memory indefinitely. Also allowing for cached_file dtor to
clear all cache if some of those pages brought in advance
aren't fetched later.
A reproducer was added.
Fixes#14814.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14818
Fixes https://github.com/scylladb/scylla-docs/issues/4028
The goal of this update is to discourage the use of
the default cassandra superuser in favor of a custom
super user - and explain why it's a good practice.
The scope of this commit:
- Adding a new page on creating a custom superuser.
The page collects and clarifies the information
about the cassandra superuser from other pages.
- Remove the (incomplete) information about
superuser from the Authorization and Authentication
pages, and add the link to the new page instead.
Additionaly, this update will result in better
searchability and ensures language clarity.
Closes#14829
This is to make m.s. initialization more solid and simplify sys.ks.::setup()
Closes#14832
* github.com:scylladb/scylladb:
system_keyspace: Remove unused snitch arg from setup()
messaging_service: Setup preferred IPs from config
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.
We missed a few such cases in test_gsi.py, and this patch corrects them.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.
There are no functional changes in this patch - most of the lines
changed are just reindents.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14835
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.
Fixes#14839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14840
Te view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object, than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partition have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all for empty partitions, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer got to immense size, only containing empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
Fixes: #14819Closes#14821
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
Population of messageing service preferred IPs cache happens inside
system keyspace setup() call and it needs m.s. per ce and additionally
snitch. Moving preferred ip cache to initial configuration keeps m.s.
start more self-contained and keeps system_keyspace::setup() simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today, test/*/run always kills Scylla at the end of the test with
SIGKILL (kill -9), so the Scylla shutdown code doesn't run. It was
believed that a clean shutdown would take a long time, but in fact,
it turns out that 99% of the shutdown time was a silly sleep in the
gossip code, which this patch disables with the "--shutdown-announce-in-ms"
option.
After enabling this option, clean shutdown takes (in a dev build on
my laptop) just 0.02 seconds. It's worth noting that this shutdown
has no real work to do - no tables to flush, and so on, because the
pytest framework removes all the tables in its own fixture cleanup
phase.
So in this patch, to kill Scylla we use SIGTERM (15) instead of SIGKILL.
We then wait until a timeout of 10 seconds (much much more than 0.02
seconds!) for Scylla to exit. If for some reason it didn't exit (e.g.,
it hung during the shutdown), it is killed again with SIGKILL, which
is guaranteed to succed.
This change gives us two advantages
1. Every test run with test/*/run exercises the shutdown path. It is perhaps
excessive, but since the shutdown is so quick, there is no big downside.
2. In a test-coverage run, a clean shutdown allows flushing the counter
files, which wasn't possible when Scylla was killed with KILL -9.
Fixes#8543
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14825
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.
The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.
The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.
Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.
Closes#14601
* github.com:scylladb/scylladb:
tests: test_tablets: Add test for bootstraping a node
storage_service: topology_coordinator: Implement tablet migration state machine
tablets: Introduce tablet_mutation_builder
service: tablet_allocator: Introduce tablet load balancer
tablets: Introduce tablet_map::for_each_tablet()
topology: Introduce get_node()
token_metadata: Add non-const getter of tablet_metadata
storage_service: Notify topology state machine after applying schema change
storage_service: Implement stream_tablet RPC
tablets: Introduce global_tablet_id
stream_transfer_task, multishard_writer: Work with table sharder
tablets: Turn tablet_id into a struct
db: Do not create per-keyspace erm for tablet-based tables
tablets: effective_replication_map: Take transition stage into account when computing replicas
tablets: Store "stage" in transition info
doc: Document tablet migration state machine and load balancer
locator: erm: Make get_endpoints_for_reading() always return read replicas
storage_service: topology_coordinator: Sleep on failure between retries
storage_service: topology_coordinator: Simplify coordinator loop
main: Require experimental raft to enable tablets
A test reproducing #14819, that is, the view update builder not flushing
the buffer when only empty partitions are consumed (with only a
tombstone in them).
All partitions will have a corresponding mutation object in the buffer.
These objects have non-negligible sizes, yet the consumer did not bump
the _buffer_size when a new partition was consumer. This resulted in
empty partitions not moving the _buffer_size at all, and thus they could
accumulate without bounds in the buffer, never triggering a flush just
by themselves. We have recently seen this causing OOM.
This patch fixes that by bumping the _buffer_size with the size of the
freshly created mutation object.
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from
- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests
All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and use query_processor reference from system_keyspace object in stead of global qctx
Closes#14778
* github.com:scylladb/scylladb:
system_keyspace: Make save_truncation_record() non-static
code: Pass sharded<db::system_keyspace>& to database::truncate()
db: Add sharded<system_keyspace>& to legacy_schema_migrator
The function dispatch a background operation that must be
waited on in stop().
Fixes scylladb/scylladb#14791
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14797
Identifies tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share global_tablet_id.
Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.
It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
When messaging_service shuts down it first sets _shutting_down to true
and proceeds with stopping clients and servers. Stopping clients, in
turn, is calling client.stop() on each.
Setting _shutting_down is used in two places.
First, when a client is stopped it may happen that it's in the middle of
some operation, which may result in call to remove_error_rpc_client()
and not to call .stop() for the second time it just does nothing if the
shutdown flag is set (see 357c91a076).
Second, get_rpc_client() asserts that this flag is not set, so once
shutdown started it can make sure that it will call .stop() on _all_
clients and no new ones would appear in parallel.
However, after shutdown() is complete the _clients vector of maps
remains intact even though all clients from it are stopped. This is not
very debugging-friendly, the clients are better be removed on shutdown.
fixes: #14624
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14632
It is quite common to stop a tested scylla process with ^C, which will
raise KeyboardInterrupt from subprocess.run(). Catch and swallow this
exception, allowing the post-processing to continue.
The interrupted process has to handle the interrupt correctly too --
flush the coverage data even on premature exit -- but this is for
another patch.
Closes#14815
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.
Fixes#14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14813