Commit Graph

40863 Commits

Author SHA1 Message Date
Avi Kivity
03313d359e Merge ' db: commitlog_replayer: ignore mutations affected by (tablet) cleanups ' from Michał Chojnowski
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.

This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.

Fixes #16752

Closes scylladb/scylladb#16888

* github.com:scylladb/scylladb:
  test: test_tablets: add a test for cleanup after migration
  test: pylib: add ScyllaCluster.wipe_sstables
  test: boost: add commitlog_cleanup_test
  db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
  replica: table: garbage-collect irrelevant system.commitlog_cleanups records
  db: commitlog: add min_position()
  replica: table: populate system.commitlog_cleanups on tablet cleanup
  db: system_keyspace: add system.commitlog_cleanups
  replica: table: refresh compound sstable set after tablet cleanup
2024-01-25 20:51:03 +02:00
Patryk Wrobel
a858daf038 service/client_state.cc: remove redundant copying
db::schema_tables::all_table_names() returns std::vector<sstring>.
Usage of range-for loop without reference results in copying each
of the elements of the traversed container. Such copying is redundant.

This change introduces usage of const reference to avoid copying.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#16983
2024-01-25 20:35:05 +02:00
Kamil Braun
543ad0987a Merge 'raft topology: send barrier_and_drain to a decommissioning node' from Patryk Jędrzejczak
We didn't send the `barrier_and_drain` command to a
decommissioning node that could still be coordinating requests. It
could happen that a decommissioning node sent a request with an
old topology version after normal nodes received the new fence
version. Then, the request would fail on replicas with the stale
topology exception.

This PR fixes this problem by modifying `exec_global_command`.
From now on, it sends `barrier_and_drain` to a decommissioning
node.

We also stop filtering stale topology exceptions in
`test_topology_ops`. We added this filter after detecting the bug
fixed by this PR.

Fixes scylladb/scylladb#15804
Fixes scylladb/scylladb#16579
Fixes scylladb/scylladb#16642

Closes scylladb/scylladb#16797

* github.com:scylladb/scylladb:
  test: test_topology_ops: remove failed mutations filter
  raft topology: send barrier_and_drain to a decommissioning node
  raft topology: ensure at most one transitioning node
2024-01-25 16:09:02 +01:00
Kefu Chai
ee28cf2285 test.py: s/defalt/default/
this typo was identified by codespell

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16980
2024-01-25 16:54:07 +02:00
Botond Dénes
6d5ee6d48a Merge 'test/nodetool: run nodetool tests using "unshare"' from Kefu Chai
before this change, we use a random address when launching
rest_api_mock server, but there are chances that the randomly
picked address conflicts with an already-used address on the
host. and the subprocess fails right away with the returncode of
1 upon this failure, but we just continue on and check the readiness
of the already-dead server. actually, we've seen test failures
caused by the EADDRINUSE failure, and when we checked the readiness
of the rest_api_mock by sending HTTP request and reading the
response, what we had is not a JSON encoded response but a webpage,
which was likely the one returned by a minio server.

in this change, we

* specify the "launcher" option of nodetool
  test suite to "unshare", so that all its tests are launched
  in separated namespaces.
* do not use a random address for the mock server, as the network
  namespaces are separated.

Fixes #16542

Closes scylladb/scylladb#16773

* github.com:scylladb/scylladb:
  test/nodetool: run nodetool tests using "unshare"
  test.py: add "launcher" option support
2024-01-25 16:53:49 +02:00
Botond Dénes
c67698ea06 compaction/compaction_manager: perform_cleanup(): hold the compaction gate
While the cleanup is ongoing. Otherwise, a concurrent table drop might
trigger a use-after-free, as we have seen in dtests recently.

Fixes: #16770

Closes scylladb/scylladb#16874
2024-01-25 14:52:50 +01:00
Pavel Emelyanov
bf3cae4992 Merge 'tests: utils: error injection: print time duration instead of count' from Kefu Chai
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.

Fixes #15902

Closes scylladb/scylladb#16264

* github.com:scylladb/scylladb:
  tests: utils: error injection: print time duration instead of count
  error_injection: do not cast to milliseconds when injecting timeout
2024-01-25 16:13:27 +03:00
Avi Kivity
69d597075a Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.

Tablet scheduler is responsible for scheduling tablet rebuilding transition which
changes the replicas set. The infrastructure for handling decommission
in tablet scheduler is reused for this.

Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by load calculation to
affect all tablet replicas.

New kind of tablet transition is introduced called "rebuild" which
adds new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as regular
migration to ensure safe synchronization with request coordinators.

In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.

Fixes https://github.com/scylladb/scylladb/issues/16690.

Closes scylladb/scylladb#16894

* github.com:scylladb/scylladb:
  tests: tablets: Add tests for removenode and replace
  tablets: Add support for removenode and replace handling
  topology_coordinator: tablets: Do not fail in a tight loop
  topology_coordinator: tablets: Avoid warnings about ignored failured future
  storage_service, topology: Track excluded state in locator::topology
  raft topology: Introduce param-less topology::get_excluded_nodes()
  raft topology: Move get_excluded_nodes() to topology
  tablets: load_balancer: Generalize load tracking
  tablets: Introduce get_migration_streaming_info() which works on migration request
  tablets: Move migration_to_transition_info() to tablets.hh
  tablets: Extract get_new_replicas() which works on migraiton request
  tablets: Move tablet_migration_info to tablets.hh
  tablets: Store transition kind per tablet
2024-01-25 14:49:43 +02:00
Patryk Jędrzejczak
b348014745 test: test_topology_ops: remove failed mutations filter
We added this filter after detecting a bug in the Raft-based
topology. We weren't sending `barrier_and_drain` commands to a
decommissioning node that could still be coordinating requests.
It could cause stale topology exceptions on replicas if the
decommissioning node sent a request with an old topology version
after normal nodes received the new fence version.

This bug has been fixed in the previous commit, so we remove the
filter.
2024-01-25 13:42:48 +01:00
Patryk Jędrzejczak
9aebd6dd96 raft topology: send barrier_and_drain to a decommissioning node
Before this patch, we didn't send the `barrier_and_drain` command
to a decommissioning node that could still be coordinating
requests. It could happen that a decommissioning node sent
a request with an old topology version after normal nodes received
the new fence version. Then, the request would fail on replicas
with the stale topology exception.

We fix this problem by modifying `exec_global_command`. From now
on, it sends `barrier_and_drain` to a decommissioning node, which
can also be in the `left_token_ring` state.
2024-01-25 13:42:48 +01:00
Patryk Jędrzejczak
378cbd0b70 raft topology: ensure at most one transitioning node
We add a sanity check to ensure at most one transitioning node at
a time. If there is more, something must have gone wrong.

In the future, we might implement concurrent topology operations.
Then, we will remove this sanity check.

We also extend the comment describing `transition_nodes` so that
it better explains why we use a map and how it should be handled.
2024-01-25 13:42:46 +01:00
Alexander Turetskiy
c1ae5425f7 DROP TYPE IF EXISTS should work on non-existent keyspace
DROP TYPE IF EXISTS should pass and do nothing  on non-existent keyspace

fixes #9082

Closes scylladb/scylladb#16504
2024-01-25 14:28:43 +02:00
Kefu Chai
b1431f08f7 test/nodetool: run nodetool tests using "unshare"
before this change, we use a random address when launching
rest_api_mock server, but there are chances that the randomly
picked address conflicts with an already-used address on the
host. and the subprocess fails right away with the returncode of
1 upon this failure, but we just continue on and check the readiness
of the already-dead server. actually, we've seen test failures
caused by the EADDRINUSE failure, and when we checked the readiness
of the rest_api_mock by sending HTTP request and reading the
response, what we had is not a JSON encoded response but a webpage,
which was likely the one returned by a minio server.

in this change, we

* specify the "launcher" option of nodetool
  test suite to "unshare", so that all its tests are launched
  in separated namespaces.
* use a random fixed address for the mock server, as the network
  namespaces are not shared anymore
* add an option in `nodetool/conftest.py`, so that it can optionally
  setup the lo network interface when it is launched in a separated
  new network namespace.

Fixes #16542
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-25 20:28:36 +08:00
Kefu Chai
35b3c51f40 test.py: add "launcher" option support
before this change, all "tool" test suites use "pytest" to launch their
tests. but some of the tests might need a dedicated namespace so they
do not interfere with each other. fortunately, "unshare(1)" allows us
to run a progame in new namespaces.

in this change, we add a "launcher" option to "tool" test suites. so
that these tests can run with the specified "launcher" instead of using
"launcher". if "launcher" is not specified, its default value of
"pytest" is used.

Refs #16542
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-25 20:28:01 +08:00
Kurashkin Nikita
d90eeb5c4f cql3:statement_restrictions.cc: multi-column relation null check
Before this patch we received internal server error
"Attempted to create key component from empty optional" when used null in
multi-column relations.
This patch adds a null check for each element of each tuple in the
expression and generates an invalid request error if it finds such an element.

Modified cassandra test and added a new one that checks the occurrence of null values in tuples.
Added a test that checks whether the wrong number of items is entered in tuples.

Fixes #13217

Closes scylladb/scylladb#16415
2024-01-25 14:17:43 +02:00
Botond Dénes
5df4ad2e48 test/cql-pytest: test_tools.py: fix flaky schema load failure test
The test TestScyllaSsstableSchemaLoading.test_fail_schema_autodetect was
observed to be flaky. Sometimes failing on local setups, but not in CI.
As it turns out, this is because, when run via test.py, the test's
working directory is root directory of scylla.git. In this case,
scylla-sstable will find and read conf/scylla.yaml. After having done
so, it will try look in the default data directory
(/var/lib/scylla/data) for the schema tables. If the local machine
happens to have a scylla data-dir setup at the above mentioned location,
it will read the schema tables and will succeed to find the tested
table (which is system table, so it is always present). This will fail
the test, as the test expects the opposite -- the table not being found.

The solution is to change the test's working directory to the random
temporary work dir, so that the local environment doesn't interfere with
it.

Fixes: #16828

Closes scylladb/scylladb#16837
2024-01-25 15:14:16 +03:00
Botond Dénes
b341aa8f6d Merge 'api/api.hh: improve usage of standard containers' from Patryk Wróbel
This PR contains improvements related to usage of std::vector and looping over containers in the range-for loop.

It is advised to use `std::vector::reserve()` to avoid unneeded memory allocations when the total size is known beforehand.

When looping over a container that stores non-trivial types usage of const reference is advised to avoid redundant copies.

Closes scylladb/scylladb#16978

* github.com:scylladb/scylladb:
  api/api.hh: use const reference when looping over container
  api/api.hh: use std::vector::reserve() when the total size is known
2024-01-25 13:22:48 +02:00
Kamil Braun
994a2ea5c3 Merge 'Call left/joined notifiers when topology coordinator is enabled' from Gleb
The gossiper topology change code calls left/joined notifiers when a
node leave or joins the cluster. This code it not executed in topology coordinator
mode, so the coordinator needs to call those notifiers by itself. The
series add the calls.

Fixes scylladb/scylladb#15841

* 'gleb/raft-topo-notifications-v1' of github.com:scylladb/scylla-dev:
  storage service: topology coordinator: call notify_joined() when a node joins a cluster
  storage service: topology coordinator: call notify_left() when a node leaves a cluster
  storage_service: drop redundant check from notify_joined()
2024-01-25 12:12:53 +01:00
Kefu Chai
1d33a68dd7 tests: utils: error injection: print time duration instead of count
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)

test/boost/error_injection_test.cc(167): fatal error:
    in "test_inject_future_disabled":
      critical check wait_time > sleep_msec has failed [23839ns <= 10ms]

Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-25 19:10:24 +08:00
Kefu Chai
8a5689e7a7 error_injection: do not cast to milliseconds when injecting timeout
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.

Fixes #15902

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-25 19:10:24 +08:00
Gleb Natapov
adf70aae15 storage service: topology coordinator: call notify_joined() when a node joins a cluster
When the topology coordinator is used for topology changes the gossiper
based code that calls notify_joined() is not called. The coordinator needs
to call it itself. But it needs to call it only once when node becomes
normal. For that the patch changes state loading code to remember the
old set of nodes in normal state to check if a node that is normal after
new state is loaded was not in the normal state before.
2024-01-25 12:28:08 +02:00
Botond Dénes
c9f247f3e8 Merge 'sstables: writer: don't block topology changes while writing sstables' from Avi Kivity
The sstable writer held the effective_replication_map_ptr while writing
sstables, which is both a layering violation and slows down tablet load
balancing. It was needed in order to ensure the sharder was stable. But
it turns out that sharding metadata is unnecessary for tablets, so just
skip the whole thing when writing an sstable for tablets.

Closes scylladb/scylladb#16953

* github.com:scylladb/scylladb:
  sstables: writer: don't require effective_replication_map for sharding metadata
  schema: provide method to get sharder, iff it is static
2024-01-25 12:12:01 +02:00
Botond Dénes
8e82df6fb6 Merge 'coverage libraries: bug fixes' from Eliran Sinvani
This mini-series contains two bug fixes that were found as part of testing coverage reporting in CI:
ref: https://github.com/scylladb/scylladb/pull/16895

1. The html-fixup which is triggered when using:`test/pylib/coverage_utils.py lcov-tools genhtml...` rendered incorrect links for multiple links in the same line.
2. For files that contined `,` in their name the output was simply wrong and resulted in lcov not being able to find such files for the purpose of filtering or generating reports.

The aforementioned draft PR served as a testing bed for finding and fixing those bugs.

Closes scylladb/scylladb#16977

* github.com:scylladb/scylladb:
  lcov_utils.py: support sourcefiles that contains commas in their name
  coreage_utils.py: make regular expression lazy in  html-fixup
2024-01-25 11:46:15 +02:00
Kefu Chai
0fbfc96619 db: add formatter for schema_tables::table_kind
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for db::schema_tables::table_kind,
and its operator<<() is still used by the homebrew generic formatter
for std::map<>, so it is preserved.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16972
2024-01-25 11:33:13 +03:00
Kefu Chai
ffb5ad494f api: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16973
2024-01-25 11:28:02 +03:00
Patryk Wrobel
cdfe0c1c35 api/api.hh: use const reference when looping over container
When reference is not used in the range-for loop, then
each element of a container is copied. Such copying
is not a problem for scalar types. However, the in case
of non-trivial types it may cause unneeded overhead.

This change replaces copying with const references
to avoid copying of types like seastar::sstring etc.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-25 09:20:35 +01:00
Patryk Wrobel
1ca71f2532 api/api.hh: use std::vector::reserve() when the total size is known
When growing via push_back(), std::vector may need to reallocate
its internal block of memory due to not enough space. It is advised
to allocate the required space before appending elements if the
size is known beforehand.

This change introduces usage of std::vector::reserve() in api.hh
to ensure that push_back() does not cause reallocations.

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-25 08:50:19 +01:00
Eliran Sinvani
d27283918f lcov_utils.py: support sourcefiles that contains commas in their name
As part of the parsing, every line of an lcov file was modeled as
INFO_TYPE:field[,field]...
However specifically for info type "SF" which represents the source file
there can only be one field.
This caused files that are using ',' in their names to be cut down up to
the first ',' and as a results not handled  correctly by lcov_utils.py
especially when rewriting a file.
This patch adds a special handling for the "SF" INFO_TYPE.
ref : `man geninfo`

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2024-01-25 09:30:52 +02:00
Eliran Sinvani
11eb9f5bb2 coreage_utils.py: make regular expression lazy in html-fixup
The html-fixup procedure was created because of a bug in genhtml (`man
genhtml` for details about what genhtml is). The bug is that genhtml
doesn't account for file names that contains illegal  url characters (ref:
https://stackoverflow.com/a/1547940/2669716). html-fixup converts those
characters to the %<octet> notation (i.e space character becomes %20
etc..). However, the regular expression used to detect links was eager,
which didn't account for multiple links in the same line. This was
discovered during browsing one of the report and noticing that the links
that are meant to alternate between code view and function view of a
source got scrambled and unusable after html-fixup.
This change makes the regex that is used to detect links lazy so it can
handle multiple links in the same line in an html file correctly.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2024-01-25 09:30:42 +02:00
Nadav Har'El
69a68e35dd Merge 'scylla-sstable: add support for loading schema of views and indexes' from Botond Dénes
Loading schemas of views and indexes was not supported, with either `--schema-file`, or when loading schema from schema sstables.
This PR addresses both:
* When loading schema from CQL (file), `CREATE MATERIALIZED VIEW` and `CREATE INDEX` statements are now also processed correctly.
* When loading schema from schema tables, `system_schema.views` is also processed, when the table has no corresponding entry in `system_schema.tables`.

Tests are also added.

Fixes: #16492

Closes scylladb/scylladb#16517

* github.com:scylladb/scylladb:
  test/cql-pytest: test_tools.py: add schema-loading tests for MV/SI
  test/cql-pytest: test_tools.py: extract some fixture logic to functions
  test/cql-pytest: test_tools.py: extract common schema-loading facilities into base-class
  tools/schema_loader: load_schema_from_schema_tables(): add support for MV/SI schemas
  tools/schema_loader: load_one_schema_from_file(): add support for view/index schemas
  test/boost/schema_loader_test: add test for mvs and indexes
  tools/schema_loader: load_schemas(): implement parsing views/indexes from CQL
  replica/database: extract existing_index_names and get_available_index_name
  tools/schema_loader: make real_db.tables the only source of truth on existing tables
  tools/schema_loader: table(): store const keyspace&
  tools/schema_loader: make database,keyspace,table non-movable
  cql3/statements/create_index_statement: build_index_schema(): include index metadata in returned value
  cql3/statements/create_index_statement: make build_index_schema() public
  cql3/statements/create_index_statement: relax some method's dependence on qp
  cql3/statements/create_view_statement: make prepare_view() public
2024-01-24 23:36:54 +02:00
Nadav Har'El
df6c9828ef Merge 'Add protobuf and Native histogram support' from Amnon Heiman
Native histograms (also known as sparse histograms) are an experimental Prometheus feature.
They use protobuf as the reporting layer.
Native histograms hold the benefits of high resolution at a lower resource cost.

This series allows sending histograms in a native histogram format over protobuf.
By default, protobuf support is disabled. To use protobuf with native histograms, the command line flag prometheus_allow_protobuf should be set to true, and the Prometheus server should send the accept header with protobuf.

Fixes #12931

Closes scylladb/scylladb#16737

* github.com:scylladb/scylladb:
  main.cc: Add prometheus_allow_protobuf command line
  histogram_metrics_helper: support native histogram
  config: Add prometheus_allow_protobuf flag
2024-01-24 21:24:50 +02:00
Michał Chojnowski
f0eadc734e test: test_tablets: add a test for cleanup after migration
Reproduces the problems fixed by earlier commits in the series.
2024-01-24 19:36:29 +01:00
Botond Dénes
7bb3ed7f23 docs/operating-scylla: scylla-sstable.rst: fix checksum list
Add empty line before list of different checksums in
validate-checksums's description. Otherwise the list is not rendered.

Closes scylladb/scylladb#16401
2024-01-24 16:34:13 +01:00
Kefu Chai
a9851cf834 test.py: replace "$foo is False" with "not $foo"
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16960
2024-01-24 15:21:53 +02:00
Kefu Chai
add74ec8ee mutation_writer: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16958
2024-01-24 15:20:02 +02:00
Kefu Chai
c978d1b3f8 config: s/re-use/reuse/
this misspelling is identified by codespell.
per m-w, reuse is a word per-se, and we don't need the hyphen for
addressing the ambiguity in the use cases, like, recover and re-cover.
see also https://www.merriam-webster.com/dictionary/reuse

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16962
2024-01-24 15:19:03 +02:00
Kefu Chai
8c39aba820 tools/scylla-sstable: use canonical path for sst_path
we deduce the paths to other SSTable components from the one
specified from the command line, for instance, if
/a/b/c/me-really-big-Data.db is fed to `scylla sstable`, the tool
would try to read /a/b/c/me-really-big-TOC.txt for the list of
other components. this works fine if the full path is specified
in the command line.

but if a relative path is specified, like, "me-really-big-Data.db",
this does not work anymore. before this change, the tool
would be reading "/me-really-big-TOC.txt", which does not exist
under most circumstances. while $PWD/me-really-big-TOC.txt should
exist if the SSTable is sane.

after this change, we always convert the specified path to
its canonical representation, no matter it is relative or absolutate.
this enables us to get the correct parent path path when trying
to read, for instance, the TOC component.

Fixes #16955
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16964
2024-01-24 13:28:40 +02:00
Michał Chojnowski
b88a0eb9ab test: pylib: add ScyllaCluster.wipe_sstables
Add a method which wipes sstables files for a particular table on a particular
stopped node.
2024-01-24 11:52:49 +01:00
Michał Chojnowski
94cdfcaa94 test: boost: add commitlog_cleanup_test
Adds a test for the commitlog cleanup functionality added
earlier in the series.
2024-01-24 10:37:39 +01:00
Michał Chojnowski
a246bb39ef db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
To avoid data resurrection, mutations deleted by cleanup operations
have to be skipped during commitlog replay.

This patch implements this, based on the metadata recorded on cleanup
operations into system.commitlog_cleanups.
2024-01-24 10:37:39 +01:00
Michał Chojnowski
f458a1bf3e replica: table: garbage-collect irrelevant system.commitlog_cleanups records
Currently, rows in system.commitlog_cleanups are only dropped on node restart,
so the table can accumulate an unbounded number of records.

This probably isn't a problem in practice, because tablet cleanups aren't that
frequent, but this patch adds a countermeasure anyway.

This patch makes the choice to delete the unneeded records right when new records
are added. This isn't ideal -- it would be more natural if the unneeded records
were deleted as soon as they become unneeded -- but it does the job with a
minimal amount of code.
2024-01-24 10:37:38 +01:00
Michał Chojnowski
05ff32ebf9 db: commitlog: add min_position()
Add a helper function which returns the minimum replay position
across all existing or future commitlog segments.
Only positions greater or equal to it can be replayed on the next reboot.

We will use this helper in a future patch to garbage collect some cleanup
metadata which refers to replay positions.
2024-01-24 10:37:38 +01:00
Michał Chojnowski
a10650959c replica: table: populate system.commitlog_cleanups on tablet cleanup
To avoid data resurrection after cleanup, we have to filter out the
cleaned mutations during commitlog replay.

In this patch, we get tablet cleanup to record the affected set of mutations
to system.commitlog_cleanups. In a later patch, we will use these records
for filtering during commitlog replay.
2024-01-24 10:37:38 +01:00
Michał Chojnowski
7c5a8894be db: system_keyspace: add system.commitlog_cleanups
Add a system table which will hold records of cleanup operations,
for the purpose of filtering commitlog replays to avoid data
resurrections.
2024-01-24 10:37:38 +01:00
Michał Chojnowski
8bfd078c54 replica: table: refresh compound sstable set after tablet cleanup
If the compound set isn't refreshed, readers will keep seeing
the dataset as it was before the cleanup, which is a bug.
2024-01-24 10:37:38 +01:00
Kefu Chai
207fe93b90 utils: add formatter for rjson::value
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for rjson::value, and drop its
operator<<().

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16956
2024-01-24 10:30:52 +02:00
Gleb Natapov
b97ff54a41 storage service: topology coordinator: call notify_left() when a node leaves a cluster
When the topology coordinator is used for topology changes the gossiper
based code that calls notify_left() is not called. The coordinator needs
to call it itself.
2024-01-24 10:21:01 +02:00
Gleb Natapov
5459a8b9a5 storage_service: drop redundant check from notify_joined()
notify_joined() is called from handle_state_normal only, so there is no
point checking that the state is normal inside the function as well.
2024-01-24 10:17:12 +02:00
Avi Kivity
8ee75ae8f4 sstables: writer: don't require effective_replication_map for sharding metadata
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.

However, this is both bad and unnecessary:
 - bad: holding on to an effective_replication_map_ptr is a barrier
   for topology operations, preventing tablet migrations (etc) while
   an sstable is being written
 - unnecessary: tablets don't require sharding metadata at all, since
   two tablets cannot overlap (unlike two sstables from different shards in
   the same node). So the first/last key is sufficient to determine the
   shard/tablet ownership.

Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
2024-01-23 22:23:08 +02:00
Avi Kivity
b88f422a53 schema: provide method to get sharder, iff it is static
The current get_sharder() method only allows getting a static sharder
(since a dynamic sharder needs additional protection). However, it
chooses to abort if someone attempt to get a dynamic sharder.

In one case, it's useful to get a sharder only if it's static, so
provide a method to do that. This is for providing sstable sharding
metadata, which isn't useful with tablets.
2024-01-23 22:20:59 +02:00