To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.
This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.
Fixes#16752Closesscylladb/scylladb#16888
* github.com:scylladb/scylladb:
test: test_tablets: add a test for cleanup after migration
test: pylib: add ScyllaCluster.wipe_sstables
test: boost: add commitlog_cleanup_test
db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
replica: table: garbage-collect irrelevant system.commitlog_cleanups records
db: commitlog: add min_position()
replica: table: populate system.commitlog_cleanups on tablet cleanup
db: system_keyspace: add system.commitlog_cleanups
replica: table: refresh compound sstable set after tablet cleanup
db::schema_tables::all_table_names() returns std::vector<sstring>.
Usage of range-for loop without reference results in copying each
of the elements of the traversed container. Such copying is redundant.
This change introduces usage of const reference to avoid copying.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#16983
We didn't send the `barrier_and_drain` command to a
decommissioning node that could still be coordinating requests. It
could happen that a decommissioning node sent a request with an
old topology version after normal nodes received the new fence
version. Then, the request would fail on replicas with the stale
topology exception.
This PR fixes this problem by modifying `exec_global_command`.
From now on, it sends `barrier_and_drain` to a decommissioning
node.
We also stop filtering stale topology exceptions in
`test_topology_ops`. We added this filter after detecting the bug
fixed by this PR.
Fixesscylladb/scylladb#15804Fixesscylladb/scylladb#16579Fixesscylladb/scylladb#16642Closesscylladb/scylladb#16797
* github.com:scylladb/scylladb:
test: test_topology_ops: remove failed mutations filter
raft topology: send barrier_and_drain to a decommissioning node
raft topology: ensure at most one transitioning node
before this change, we use a random address when launching
rest_api_mock server, but there are chances that the randomly
picked address conflicts with an already-used address on the
host. and the subprocess fails right away with the returncode of
1 upon this failure, but we just continue on and check the readiness
of the already-dead server. actually, we've seen test failures
caused by the EADDRINUSE failure, and when we checked the readiness
of the rest_api_mock by sending HTTP request and reading the
response, what we had is not a JSON encoded response but a webpage,
which was likely the one returned by a minio server.
in this change, we
* specify the "launcher" option of nodetool
test suite to "unshare", so that all its tests are launched
in separated namespaces.
* do not use a random address for the mock server, as the network
namespaces are separated.
Fixes#16542Closesscylladb/scylladb#16773
* github.com:scylladb/scylladb:
test/nodetool: run nodetool tests using "unshare"
test.py: add "launcher" option support
While the cleanup is ongoing. Otherwise, a concurrent table drop might
trigger a use-after-free, as we have seen in dtests recently.
Fixes: #16770Closesscylladb/scylladb#16874
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.
Fixes#15902Closesscylladb/scylladb#16264
* github.com:scylladb/scylladb:
tests: utils: error injection: print time duration instead of count
error_injection: do not cast to milliseconds when injecting timeout
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.
Tablet scheduler is responsible for scheduling tablet rebuilding transition which
changes the replicas set. The infrastructure for handling decommission
in tablet scheduler is reused for this.
Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by load calculation to
affect all tablet replicas.
New kind of tablet transition is introduced called "rebuild" which
adds new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as regular
migration to ensure safe synchronization with request coordinators.
In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.
Fixes https://github.com/scylladb/scylladb/issues/16690.
Closesscylladb/scylladb#16894
* github.com:scylladb/scylladb:
tests: tablets: Add tests for removenode and replace
tablets: Add support for removenode and replace handling
topology_coordinator: tablets: Do not fail in a tight loop
topology_coordinator: tablets: Avoid warnings about ignored failured future
storage_service, topology: Track excluded state in locator::topology
raft topology: Introduce param-less topology::get_excluded_nodes()
raft topology: Move get_excluded_nodes() to topology
tablets: load_balancer: Generalize load tracking
tablets: Introduce get_migration_streaming_info() which works on migration request
tablets: Move migration_to_transition_info() to tablets.hh
tablets: Extract get_new_replicas() which works on migraiton request
tablets: Move tablet_migration_info to tablets.hh
tablets: Store transition kind per tablet
We added this filter after detecting a bug in the Raft-based
topology. We weren't sending `barrier_and_drain` commands to a
decommissioning node that could still be coordinating requests.
It could cause stale topology exceptions on replicas if the
decommissioning node sent a request with an old topology version
after normal nodes received the new fence version.
This bug has been fixed in the previous commit, so we remove the
filter.
Before this patch, we didn't send the `barrier_and_drain` command
to a decommissioning node that could still be coordinating
requests. It could happen that a decommissioning node sent
a request with an old topology version after normal nodes received
the new fence version. Then, the request would fail on replicas
with the stale topology exception.
We fix this problem by modifying `exec_global_command`. From now
on, it sends `barrier_and_drain` to a decommissioning node, which
can also be in the `left_token_ring` state.
We add a sanity check to ensure at most one transitioning node at
a time. If there is more, something must have gone wrong.
In the future, we might implement concurrent topology operations.
Then, we will remove this sanity check.
We also extend the comment describing `transition_nodes` so that
it better explains why we use a map and how it should be handled.
before this change, we use a random address when launching
rest_api_mock server, but there are chances that the randomly
picked address conflicts with an already-used address on the
host. and the subprocess fails right away with the returncode of
1 upon this failure, but we just continue on and check the readiness
of the already-dead server. actually, we've seen test failures
caused by the EADDRINUSE failure, and when we checked the readiness
of the rest_api_mock by sending HTTP request and reading the
response, what we had is not a JSON encoded response but a webpage,
which was likely the one returned by a minio server.
in this change, we
* specify the "launcher" option of nodetool
test suite to "unshare", so that all its tests are launched
in separated namespaces.
* use a random fixed address for the mock server, as the network
namespaces are not shared anymore
* add an option in `nodetool/conftest.py`, so that it can optionally
setup the lo network interface when it is launched in a separated
new network namespace.
Fixes#16542
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, all "tool" test suites use "pytest" to launch their
tests. but some of the tests might need a dedicated namespace so they
do not interfere with each other. fortunately, "unshare(1)" allows us
to run a progame in new namespaces.
in this change, we add a "launcher" option to "tool" test suites. so
that these tests can run with the specified "launcher" instead of using
"launcher". if "launcher" is not specified, its default value of
"pytest" is used.
Refs #16542
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Before this patch we received internal server error
"Attempted to create key component from empty optional" when used null in
multi-column relations.
This patch adds a null check for each element of each tuple in the
expression and generates an invalid request error if it finds such an element.
Modified cassandra test and added a new one that checks the occurrence of null values in tuples.
Added a test that checks whether the wrong number of items is entered in tuples.
Fixes#13217Closesscylladb/scylladb#16415
The test TestScyllaSsstableSchemaLoading.test_fail_schema_autodetect was
observed to be flaky. Sometimes failing on local setups, but not in CI.
As it turns out, this is because, when run via test.py, the test's
working directory is root directory of scylla.git. In this case,
scylla-sstable will find and read conf/scylla.yaml. After having done
so, it will try look in the default data directory
(/var/lib/scylla/data) for the schema tables. If the local machine
happens to have a scylla data-dir setup at the above mentioned location,
it will read the schema tables and will succeed to find the tested
table (which is system table, so it is always present). This will fail
the test, as the test expects the opposite -- the table not being found.
The solution is to change the test's working directory to the random
temporary work dir, so that the local environment doesn't interfere with
it.
Fixes: #16828Closesscylladb/scylladb#16837
This PR contains improvements related to usage of std::vector and looping over containers in the range-for loop.
It is advised to use `std::vector::reserve()` to avoid unneeded memory allocations when the total size is known beforehand.
When looping over a container that stores non-trivial types usage of const reference is advised to avoid redundant copies.
Closesscylladb/scylladb#16978
* github.com:scylladb/scylladb:
api/api.hh: use const reference when looping over container
api/api.hh: use std::vector::reserve() when the total size is known
The gossiper topology change code calls left/joined notifiers when a
node leave or joins the cluster. This code it not executed in topology coordinator
mode, so the coordinator needs to call those notifiers by itself. The
series add the calls.
Fixesscylladb/scylladb#15841
* 'gleb/raft-topo-notifications-v1' of github.com:scylladb/scylla-dev:
storage service: topology coordinator: call notify_joined() when a node joins a cluster
storage service: topology coordinator: call notify_left() when a node leaves a cluster
storage_service: drop redundant check from notify_joined()
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)
test/boost/error_injection_test.cc(167): fatal error:
in "test_inject_future_disabled":
critical check wait_time > sleep_msec has failed [23839ns <= 10ms]
Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.
Fixes#15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When the topology coordinator is used for topology changes the gossiper
based code that calls notify_joined() is not called. The coordinator needs
to call it itself. But it needs to call it only once when node becomes
normal. For that the patch changes state loading code to remember the
old set of nodes in normal state to check if a node that is normal after
new state is loaded was not in the normal state before.
The sstable writer held the effective_replication_map_ptr while writing
sstables, which is both a layering violation and slows down tablet load
balancing. It was needed in order to ensure the sharder was stable. But
it turns out that sharding metadata is unnecessary for tablets, so just
skip the whole thing when writing an sstable for tablets.
Closesscylladb/scylladb#16953
* github.com:scylladb/scylladb:
sstables: writer: don't require effective_replication_map for sharding metadata
schema: provide method to get sharder, iff it is static
This mini-series contains two bug fixes that were found as part of testing coverage reporting in CI:
ref: https://github.com/scylladb/scylladb/pull/16895
1. The html-fixup which is triggered when using:`test/pylib/coverage_utils.py lcov-tools genhtml...` rendered incorrect links for multiple links in the same line.
2. For files that contined `,` in their name the output was simply wrong and resulted in lcov not being able to find such files for the purpose of filtering or generating reports.
The aforementioned draft PR served as a testing bed for finding and fixing those bugs.
Closesscylladb/scylladb#16977
* github.com:scylladb/scylladb:
lcov_utils.py: support sourcefiles that contains commas in their name
coreage_utils.py: make regular expression lazy in html-fixup
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for db::schema_tables::table_kind,
and its operator<<() is still used by the homebrew generic formatter
for std::map<>, so it is preserved.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16972
When reference is not used in the range-for loop, then
each element of a container is copied. Such copying
is not a problem for scalar types. However, the in case
of non-trivial types it may cause unneeded overhead.
This change replaces copying with const references
to avoid copying of types like seastar::sstring etc.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
When growing via push_back(), std::vector may need to reallocate
its internal block of memory due to not enough space. It is advised
to allocate the required space before appending elements if the
size is known beforehand.
This change introduces usage of std::vector::reserve() in api.hh
to ensure that push_back() does not cause reallocations.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
As part of the parsing, every line of an lcov file was modeled as
INFO_TYPE:field[,field]...
However specifically for info type "SF" which represents the source file
there can only be one field.
This caused files that are using ',' in their names to be cut down up to
the first ',' and as a results not handled correctly by lcov_utils.py
especially when rewriting a file.
This patch adds a special handling for the "SF" INFO_TYPE.
ref : `man geninfo`
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
The html-fixup procedure was created because of a bug in genhtml (`man
genhtml` for details about what genhtml is). The bug is that genhtml
doesn't account for file names that contains illegal url characters (ref:
https://stackoverflow.com/a/1547940/2669716). html-fixup converts those
characters to the %<octet> notation (i.e space character becomes %20
etc..). However, the regular expression used to detect links was eager,
which didn't account for multiple links in the same line. This was
discovered during browsing one of the report and noticing that the links
that are meant to alternate between code view and function view of a
source got scrambled and unusable after html-fixup.
This change makes the regex that is used to detect links lazy so it can
handle multiple links in the same line in an html file correctly.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Loading schemas of views and indexes was not supported, with either `--schema-file`, or when loading schema from schema sstables.
This PR addresses both:
* When loading schema from CQL (file), `CREATE MATERIALIZED VIEW` and `CREATE INDEX` statements are now also processed correctly.
* When loading schema from schema tables, `system_schema.views` is also processed, when the table has no corresponding entry in `system_schema.tables`.
Tests are also added.
Fixes: #16492Closesscylladb/scylladb#16517
* github.com:scylladb/scylladb:
test/cql-pytest: test_tools.py: add schema-loading tests for MV/SI
test/cql-pytest: test_tools.py: extract some fixture logic to functions
test/cql-pytest: test_tools.py: extract common schema-loading facilities into base-class
tools/schema_loader: load_schema_from_schema_tables(): add support for MV/SI schemas
tools/schema_loader: load_one_schema_from_file(): add support for view/index schemas
test/boost/schema_loader_test: add test for mvs and indexes
tools/schema_loader: load_schemas(): implement parsing views/indexes from CQL
replica/database: extract existing_index_names and get_available_index_name
tools/schema_loader: make real_db.tables the only source of truth on existing tables
tools/schema_loader: table(): store const keyspace&
tools/schema_loader: make database,keyspace,table non-movable
cql3/statements/create_index_statement: build_index_schema(): include index metadata in returned value
cql3/statements/create_index_statement: make build_index_schema() public
cql3/statements/create_index_statement: relax some method's dependence on qp
cql3/statements/create_view_statement: make prepare_view() public
Native histograms (also known as sparse histograms) are an experimental Prometheus feature.
They use protobuf as the reporting layer.
Native histograms hold the benefits of high resolution at a lower resource cost.
This series allows sending histograms in a native histogram format over protobuf.
By default, protobuf support is disabled. To use protobuf with native histograms, the command line flag prometheus_allow_protobuf should be set to true, and the Prometheus server should send the accept header with protobuf.
Fixes#12931Closesscylladb/scylladb#16737
* github.com:scylladb/scylladb:
main.cc: Add prometheus_allow_protobuf command line
histogram_metrics_helper: support native histogram
config: Add prometheus_allow_protobuf flag
Add empty line before list of different checksums in
validate-checksums's description. Otherwise the list is not rendered.
Closesscylladb/scylladb#16401
we deduce the paths to other SSTable components from the one
specified from the command line, for instance, if
/a/b/c/me-really-big-Data.db is fed to `scylla sstable`, the tool
would try to read /a/b/c/me-really-big-TOC.txt for the list of
other components. this works fine if the full path is specified
in the command line.
but if a relative path is specified, like, "me-really-big-Data.db",
this does not work anymore. before this change, the tool
would be reading "/me-really-big-TOC.txt", which does not exist
under most circumstances. while $PWD/me-really-big-TOC.txt should
exist if the SSTable is sane.
after this change, we always convert the specified path to
its canonical representation, no matter it is relative or absolutate.
this enables us to get the correct parent path path when trying
to read, for instance, the TOC component.
Fixes#16955
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16964
To avoid data resurrection, mutations deleted by cleanup operations
have to be skipped during commitlog replay.
This patch implements this, based on the metadata recorded on cleanup
operations into system.commitlog_cleanups.
Currently, rows in system.commitlog_cleanups are only dropped on node restart,
so the table can accumulate an unbounded number of records.
This probably isn't a problem in practice, because tablet cleanups aren't that
frequent, but this patch adds a countermeasure anyway.
This patch makes the choice to delete the unneeded records right when new records
are added. This isn't ideal -- it would be more natural if the unneeded records
were deleted as soon as they become unneeded -- but it does the job with a
minimal amount of code.
Add a helper function which returns the minimum replay position
across all existing or future commitlog segments.
Only positions greater or equal to it can be replayed on the next reboot.
We will use this helper in a future patch to garbage collect some cleanup
metadata which refers to replay positions.
To avoid data resurrection after cleanup, we have to filter out the
cleaned mutations during commitlog replay.
In this patch, we get tablet cleanup to record the affected set of mutations
to system.commitlog_cleanups. In a later patch, we will use these records
for filtering during commitlog replay.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for rjson::value, and drop its
operator<<().
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16956
When the topology coordinator is used for topology changes the gossiper
based code that calls notify_left() is not called. The coordinator needs
to call it itself.
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.
However, this is both bad and unnecessary:
- bad: holding on to an effective_replication_map_ptr is a barrier
for topology operations, preventing tablet migrations (etc) while
an sstable is being written
- unnecessary: tablets don't require sharding metadata at all, since
two tablets cannot overlap (unlike two sstables from different shards in
the same node). So the first/last key is sufficient to determine the
shard/tablet ownership.
Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
The current get_sharder() method only allows getting a static sharder
(since a dynamic sharder needs additional protection). However, it
chooses to abort if someone attempt to get a dynamic sharder.
In one case, it's useful to get a sharder only if it's static, so
provide a method to do that. This is for providing sstable sharding
metadata, which isn't useful with tablets.