Commit Graph

36154 Commits

Author SHA1 Message Date
Benny Halevy
e18eb71fa3 test: locator_topology: add test_update_node
Reproduces issue fixed in PR #13502

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-14 17:51:07 +03:00
Benny Halevy
e29994b2aa topology: add_node, unindex_node: make exception safe
Current if index_node throws when trying to
add an already indexed node, pop_node might
unindex the existing node instead of the new one.

Instead, with this change, unindex_node looks up
the node by its pointer and removed it from the
index map only if it's found there so to clean up
safely after index_node throws (at any stage).

Add a unit test to verify that.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-14 17:51:05 +03:00
Botond Dénes
1da02706dd Merge 'Discard SSTable bloom filter on load-and-stream' from Raphael "Raph" Carvalho
Load-and-stream reads the entire content from SSTables, therefore it can
afford to discard the bloom filter that might otherwise consume a significant
amount of memory. Bloom filters are only needed by compaction and other
replica::table operations that might want to check the presence of keys
in the SSTable files, like single-partition reads.

It's not uncommon to see Data:Filter ratio of less than 100:1, meaning
that for ~300G of data, filters will take ~3G.

In addition to saving memory footprint, it also reduces operation time
as load-and-stream no longer have to read, parse and build the filters
from disk into memory.

Closes #13486

* github.com:scylladb/scylladb:
  sstable_loader: Discard SSTable bloom filter on load-and-stream
  sstables: Allow SSTable loading to discard bloom filter
  sstables: Allow sstable_directory user to feed custom sstable open config
  sstables: Move sstable_open_info into open_info.hh
2023-04-14 06:18:54 +03:00
Benny Halevy
b71f229fc2 topology: node: update_node: do not override internal changed flag by state option
Currently, opt_st overrides the internal `changed` flag
by setting it with the opt_st changed status.
Instead, it should use `|=` to keep it true if it is already so.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #13502
2023-04-13 17:46:59 +02:00
Raphael S. Carvalho
fe6df3d270 sstable_loader: Discard SSTable bloom filter on load-and-stream
Load-and-stream reads the entire content from SSTables, therefore it can
afford to discard the bloom filter that might otherwise consume a significant
amount of memory. Bloom filters are only needed by compaction and other
replica::table operations that might want to check the presence of keys
in the SSTable files, like single-partition reads.

It's not uncommon to see Data:Filter ratio of less than 100:1, meaning
that for ~300G of data, filters will take ~3G.

In addition to saving memory footprint, it also reduces operation time
as load-and-stream no longer have to read, parse and build the filters
from disk into memory.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-13 11:34:22 -03:00
Raphael S. Carvalho
17261369ea sstables: Allow SSTable loading to discard bloom filter
If bloom filter is not loaded, it means that an always-present filter
is used, which translates into the SSTable being opened on every
single read.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-13 11:34:22 -03:00
Raphael S. Carvalho
1427a5ce98 sstables: Allow sstable_directory user to feed custom sstable open config
This will be used by load-and-stream to load SSTables in its own
customized way.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-13 11:34:16 -03:00
Raphael S. Carvalho
86516f4cef sstables: Move sstable_open_info into open_info.hh
So sstable_directory can access its definition without having to
include sstables.hh.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-13 11:31:14 -03:00
Botond Dénes
bd57471e54 reader_concurrency_semaphore: don't evict inactive readers needlessly
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated end requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit-test is also added, reproducing the overly-agressive eviction and
checking that it doesn't happen anymore.

Fixes: #11803

Closes #13286
2023-04-13 15:20:18 +03:00
Pavel Emelyanov
b1501d4261 s3/client: Don't use designated initialization of sys stat struct
It makes compiler complan about mis-ordered initialization of st_nlink
vs st_mode on different arches. Current code (st_nlink before st_mode)
compiled fine on x86, but fails on ARM which wants st_mode to come
before st_nlink. Changing the order would, apparently, break x86 build
with similar message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13499
2023-04-13 15:13:56 +03:00
Kefu Chai
87170bf07a build: cmake: add more tests
this change should add the remaining tests under boost/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13494
2023-04-13 14:57:00 +03:00
Botond Dénes
e103ef3bcb Update seastar submodule
* seastar 1204efbc...ed7a0f54 (46):
  > gate: s/intgernal/internal/
  > reactor: set reactor::_stopping to true on all shards
  > condition-variable: replace the coroutine wakeup task with a promise
  > tutorial: explain the buffer_size_t param of generator coroutine
  > log: call log_level_map explicitly in constructor
  > future: de-variadicate make_ready_future() and similar helpers
  > timer-set,scollectd: remove unnecessary ";"
  > util/conversion: remove inclusion guards
  > foreign_ptr: destroy: use run_in_background
  > abort_source, abortable_fifo: use is_nothrow_invocable_r_v<>
  > alien: add type constraint for alien::run_on and alien::submit_to
  > alien: add noexcept specifier for lambda passed to run_on()
  > test: alien_test: test alien::run_on() also
  > test: alien_test: throw if unexpected things happens
  > future: make API level 6 mandatory
  > api-level: update IDE fallback
  > core/on_internal_error: always log error with backtrace
  > future: make API level 5 mandatory
  > websocket: fix frame parsing.
  > websocket: fix frame assembling.
  > when_all: drop code for API_LEVEL < 4
  > future: drop internal call_then_impl
  > future: when_all_succeed(): make API level 4 mandatory
  > reactor: trade comment for type constraints
  > sstring: s/is_invocable_r/is_invocable_r_v/
  > doc: compatibility: document API levels 5 and 6
  > demos: file_demo: pass a string_view to_open_file_dma()
  > TLS: Add issuer/subject info to verification error message
  > test: fstream_test: drop unnecessary API_LEVEL check
  > manual_clock: advance: use run_in_background to expire_timers
  > reactor: add run_in_backround and close
  > websocket: shutdown input first.
  > websocket: use gate to guard background tasks.
  > websocket: remove trailing spaces.
  > websocket_demo: ignore sleep_aborted exception.
  > websocket_demo: fix coredump.
  > fstream: drop API level 2 (make_file_output_stream() returning non-future)
  > core/sstring: do not use ostream_formatter
  > metrics: use fmt::to_string() when creating a label
  > backtrace: fix size calculation in dl_iterate_phdr
  > Downgrade expected stall detector warning to info
  > fix: Add missing inline code blocks
  > spawn_test: fix /bin/cat stuck in reading input.
  > reactor: pass fd opened in blocking mode to spawned process
  > reactor: skip sigaction if handler has been registered before.
  > reactor: allow registering handler multiple times for a signal.
2023-04-13 14:28:30 +03:00
Kefu Chai
29ca0009a2 dist/debian: do not Depend on ${shlibs:Depends}
the substvar of `${shlibs:Depends}` is set by dh_shlibdeps, which
inspects the ELF images being packaged to figure out the shared
library dependencies for packages. but since f3c3b9183c,
we just override the `override_dh_shlibdeps` target in debian/rules
with no-op. as we take care of the shared library dependencies by
vendoring the runtime dependencies by ourselves using the relocatable
package. so this variable is never set. that's why `dpkg-gencontrol`
complains when processing `debian/control` and trying to materialize
the substvars.

in this change, the occurances of `${shlibs:Depends}` are removed
to silence the warnings from `dpkg-gencontrol`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13457
2023-04-13 08:34:05 +03:00
Raphael S. Carvalho
9760149e8d compaction: Don't bump compaction shares during major execution
Commit 49892a0, back in 2018, bumps the compaction shares by 200 to
guarantee a minimum base line.

However, after commit e3f561d, major compaction runs in maintenance
group meaning that bumping shares became completely irrelevant and
only causes regular compaction to be unnecessarily more aggressive.

Fixes #13487.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13488
2023-04-13 08:20:25 +03:00
Botond Dénes
50ee4033a9 Update tools/jmx submodule
* tools/jmx 602329c9...57c16938 (1):
  > install.sh: replace tab with spaces
2023-04-12 13:28:23 +03:00
Botond Dénes
5d0c0ae0c4 Merge 'token_metadata: use topology nodes for endpoint_to_host_id map' from Benny Halevy
Currently, token_metadata_impl maintains a "shadow" endpoint to host_id map on top of the maps in topology.
This series first reimplements the functions that currently use this map to use topology instead.
Then the important users of `get_endpoint_to_host_id_map_for_reading`: node_ops_ctl and view_builder
and converted to use a new `topology::for_each_node` function to process all nodes in topology directly, without going through `get_endpoint_to_host_id_map_for_reading`.

Closes #13476

* github.com:scylladb/scylladb:
  view_builder: view_build_statuses: use topology::for_each_node
  storage_service: node_ops_ctl: refresh_sync_nodes: use topology::for_each_node
  topology: add for_each_node
  token_metadata: get endpoint to node map from topology
2023-04-12 10:33:02 +03:00
Botond Dénes
0c51f72ad6 Merge 'utils, mutation: replace operator<<(..) with fmt formatter' from Kefu Chai
this is a part of a series to migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `tombstone` and `shadowable_tombstone` without the help of fmt::ostream. and their `operator<<(ostream,..)` are dropped, as there are no users of them anymore.

Refs #13245

Closes #13474

* github.com:scylladb/scylladb:
  mutation: specialize fmt::formatter<tombstone> and fmt::formatter<shadowable_tombstone>
  utils: specialize fmt::formatter<optional<>>
2023-04-12 09:32:56 +03:00
Kefu Chai
ff202723c6 utils: big_decimal: specialize fmt::formatter<big_decimal>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `big_decimal` without the help of `operator<<`. this operator
is droppe in this change, as all its callers are now using fmtlib
for formatting now. we might need to use fmtlib to implement `big_decimal::to_string()`,
and use `fmt::to_string()` instead, but let's leave it for a follow-up
change.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13479
2023-04-12 09:20:50 +03:00
Botond Dénes
f82287a9af Update tools/jmx/ submodule
* tools/jmx/ b7ae52bc...602329c9 (1):
  > metrics: EstimatedHistogram::getValues() returns bucketOffsets
2023-04-12 09:17:57 +03:00
Botond Dénes
525b21042f Merge 'Rewrite sstables keyspace compaction task' from Aleksandra Martyniuk
Task manager task implementations of classes that cover
rewrite sstables keyspace compaction which can be start
through /storage_service/keyspace_compaction/ api.

Top level task covers the whole compaction and creates child
tasks on each shard.

Closes #12714

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test rewrite sstables compaction
  compaction: create task manager's task for rewrite sstables keyspace compaction on one shard
  compaction: create task manager's task for rewrite sstables keyspace compaction
  compaction: create rewrite_sstables_compaction_task_impl
2023-04-12 08:38:59 +03:00
Aleksandra Martyniuk
25cfffc3ae compaction: rename local_offstrategy_keyspace_compaction_task_impl to shard_offstrategy_keyspace_compaction_task_impl
Closes #13475
2023-04-12 08:38:25 +03:00
Kefu Chai
1cb95b8cff mutation: specialize fmt::formatter<tombstone> and fmt::formatter<shadowable_tombstone>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `tombstone` and `shadowable_tombstone` without the
help of `operator<<`.

in this change, only `operator<<(ostream&, const shadowable_tombstone&)`
is dropped, and all its callers are now using fmtlib for formatting the
instances of `shadowable_tombstone` now.
`operator<<(ostream&, const tombstone&)` is preserved. as it is still
used by Boost::test for printing the operands in case the comparing tests
fail.

please note, before this change we were using a concrete string
for indent. after this change, some of the places are changed to
using fmtlib for indent.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-12 10:57:03 +08:00
Kefu Chai
c980bd54ad utils: specialize fmt::formatter<optional<>>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `optional<T>` without the help of `operator<<()`.

this change also enables us to ditch more `operator<<()`s in future.
as we are relying on `operator<<(ostream&, const optional<T>&)` for
printing instances of `optional<T>`, and `operator<<(ostream&, const optional<T>&)`
in turn uses the `operator<<(ostream&, const T&)`. so, the new
specialization of `fmt::formatter<optional<>>` will remove yet
another caller of these operators.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-12 10:57:03 +08:00
Benny Halevy
535b71eba3 view_builder: view_build_statuses: use topology::for_each_node
Instead of tmptr->get_endpoint_to_host_id_map_for_reading.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-11 18:14:51 +03:00
Benny Halevy
d89fb02d24 storage_service: node_ops_ctl: refresh_sync_nodes: use topology::for_each_node
Instead of tmptr->get_endpoint_to_host_id_map_for_reading.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-11 18:14:47 +03:00
Kefu Chai
59579d5876 utils: fragment_range: specialize fmt::formatter<FragmentedView>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print classes fulfill the requirement of `FragmentedView` concept
without the help of template function of `to_hex()`, this function is
dropped in this change, as all its callers are now using fmtlib
for formatting now. the helper of `fragment_to_hex()` is dropped
as well, its only caller is `to_hex()`.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13471
2023-04-11 16:09:38 +03:00
Benny Halevy
7b76369ffc topology: add for_each_node
To eventually replace token_metadata::get_endpoint_to_host_id_map_for_reading

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-11 15:55:39 +03:00
Benny Halevy
e635aa30d6 token_metadata: get endpoint to node map from topology
Don't maintain a "shadow" endpoint_to_host_id_map in token_metadata_impl.

Instead, get the nodes_by_endpoint map from topology
and use it to build the endpoint_to_host_id_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-11 15:48:30 +03:00
Botond Dénes
f1bbf705f9 Merge 'Cleanup sstables in resharding and other compaction types' from Benny Halevy
This series extends sstable cleanup to resharding and other (offstrategy, major, and regular) compaction types so to:
* cleanup uploaded sstables (#11933)
* cleanup staging sstables after they are moved back to the main directory and become eligible for compaction (#9559)

When perform_cleanup is called, all sstables are scanned, and those that require cleanup are marked as such, and are added for tracking to table_state::cleanup_sstable_set.  They are removed from that set once released by compaction.
Along with that sstables set, we keep the owned_ranges_ptr used by cleanup in the table_state to allow other compaction types (offstrategy, major, or regular) to cleanup those sstables that are marked as require_cleanup and that were skipped by cleanup compaction for either being in the maintenance set (requiring offstrategy compaction) or in staging.

Resharding is using a more straightforward mechanism of passing the owned token ranges when resharding uploaded sstables and using it to detect sstable that require cleanup, now done as piggybacked on resharding compaction.

Closes #12422

* github.com:scylladb/scylladb:
  table: discard_sstables: update_sstable_cleanup_state when deleting sstables
  compaction_manager: compact_sstables: retrieve owned ranges if required
  sstables: add a printer for shared_sstable
  compaction_manager: keep owned_ranges_ptr in compaction_state
  compaction_manager: perform_cleanup: keep sstables in compaction_state::sstables_requiring_cleanup
  compaction: refactor compaction_state out of compaction_manager
  compaction: refactor compaction_fwd.hh out of compaction_descriptor.hh
  compaction_manager: compacting_sstable_registration: keep a ref to the compaction_state
  compaction_manager: refactor get_candidates
  compaction_manager: get_candidates: mark as const
  table, compaction_manager: add requires_cleanup
  sstable_set: add for_each_sstable_until
  distributed_loader: reshard: update sstable cleanup state
  table, compaction_manager: add update_sstable_cleanup_state
  compaction_manager: needs_cleanup: delete unused schema param
  compaction_manager: perform_cleanup: disallow empty sorted_owened_ranges
  distributed_loader: reshard: consider sstables for cleanup
  distributed_loader: process_upload_dir: pass owned_ranges_ptr to reshard
  distributed_loader: reshard: add optional owned_ranges_ptr param
  distributed_loader: reshard: get a ref to table_state
  distributed_loader: reshard: capture creator by ref
  distributed_loader: reshard: reserve num_jobs buckets
  compaction: move owned ranges filtering to base class
  compaction: move owned_ranges into descriptor
2023-04-11 14:52:29 +03:00
Botond Dénes
38c98b370f Update tools/jmx/ submodule
* tools/jmx/ 48e16998...b7ae52bc (1):
  > install.sh: do not fail if jre-11 is not installed
2023-04-11 14:51:31 +03:00
Kefu Chai
dcce0c96a9 create-relocatable-package.py: error out if pigz fails
before this change, we don't error out even if pigz fails. but
there is chance that pigz fails to create the gzip'ed relocatable
tarball either due to environmental issues or some other problems,
and we are not aware of this until packaging scripts like
`reloc/build_rpm.sh` tries to ungzip this corrupted gzip file.

in this change, if pigz's status code is not 0, the status code
is printed, and create-relocatable-package.py will return 1.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13459
2023-04-11 14:29:25 +03:00
Aleksandra Martyniuk
e170fa1c99 test: extend test_compaction_task.py to test rewrite sstables compaction 2023-04-11 13:07:22 +02:00
Aleksandra Martyniuk
a93f044efa compaction: create task manager's task for rewrite sstables keyspace compaction on one shard
Implementation of task_manager's task that covers rewrite sstables keyspace
compaction on one shard.
2023-04-11 13:07:17 +02:00
Botond Dénes
a8e59d9fb2 Merge 'Metrics relabel from file' from Amnon Heiman
This series adds an option to read the relabel config from file.

Most of Scylla's metrics are reported per-shard, some times they are also reported per scheduling
groups or per tables.  With modern hardware, this can quickly grow to a large number of metrics that
overload Scylla and the  collecting server.

One of the main issues around metrics reduction is that many of the metrics are only
helpful in certain situations.

For example, Scylla monitoring only looks at a subset of the metrics. So in large deployments
it would be helpful to scrap only those.

An option to do that, would be to mark all dashboards related metrics with a label value, and then Prometheus
will request only metrics with that label value.

There are two main limitations to scrap by label values:
1. some of the metrics we want to report are in seastar, so we'll need to label them somehow (we cannot just add random labels to seastar metrics)
2. things change, new metrics are introduce and we may want them, it's not practicall to re-compile and wait
for a new release whenever we want to change a label just for monitoring.

It will be best to have the option to add metrics freely and choose at runtime what to report.

This series make use of Seastar API to perform metrics manipulation dynamically. It includes adding, removing, and changing labels and also enable and disable metrics, and enable and disable the skip_when_empty option.

After this series the configuration could be used with:
```--relabel-config-file conf.yaml```

The general logic and format follows Prometheus metrics_relabel_config configuration.

Where the configuration file looks like:
```
$ cat conf.yaml
relabel_configs:
  - source_labels: [shard]
    action: drop
    target_label: shard
    regex: (2)
  - source_labels: [shard]
    action: replace
    target_label: level
    replacement: $1
    regex: (.*3)

```

Closes #12687

* github.com:scylladb/scylladb:
  main: Load metrics relabel config from a file if it exists
  Add relabel from file support.
2023-04-11 12:47:09 +03:00
Aleksandra Martyniuk
c4098df4ec compaction: create task manager's task for rewrite sstables keyspace compaction
Implementation of task_manager's task covering rewrite sstables keyspace
compaction that can be started through storage_service api.
2023-04-11 11:04:21 +02:00
Aleksandra Martyniuk
814254adfd compaction: create rewrite_sstables_compaction_task_impl
rewrite_sstables_compaction_task_impl serves as a base class of all
concrete rewrite sstables compaction task classes.
2023-04-11 11:03:09 +02:00
Botond Dénes
dba1d36aa6 Merge 'alternator: fix isolation of concurrent modifications to tags' from Nadav Har'El
Alternator's implementation of TagResource, UntagResource and UpdateTimeToLive (the latter uses tags to store the TTL configuration) was unsafe for concurrent modifications - some of these modifications may be lost. This short series fixes the bug, and also adds (in the last patch) a test that reproduces the bug and verifies that it's fixed.

The cause of the incorrect isolation was that we separately read the old tags and wrote the modified tags. In this series we introduce a new function, `modify_tags()` which can do both under one lock, so concurrent tag operations are serialized and therefore isolated as expected.

Fixes #6389.

Closes #13150

* github.com:scylladb/scylladb:
  test/alternator: test concurrent TagResource / UntagResource
  db/tags: drop unsafe update_tags() utility function
  alternator: isolate concurrent modification to tags
  db/tags: add safe modify_tags() utility functions
  migration_manager: expose access to storage_proxy
2023-04-11 11:17:23 +03:00
Anna Stuchlik
2921059ebb doc: add a disclaimer about unsupported upgrade
Fixes https://github.com/scylladb/scylla-enterprise/issues/2805

This commit adds the disclaimer that an upgrade by replacing
the cluster nodes with nodes with a different release
is not supported.

Closes #13445
2023-04-11 10:47:39 +03:00
Kefu Chai
86b66a9875 build: cmake: drop test_table.CC
this change mirrors the corresponding change in `configure.py` in
4b5b6a9010 .

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13461
2023-04-11 09:42:58 +03:00
Nadav Har'El
79114c5030 cql-pytest: translate Cassandra's tests for DELETE operations
This is a translation of Cassandra's CQL unit test source file
validation/operations/DeleteTest.java into our cql-pytest framework.

There are 51 tests, and they did not reproduce any previously-unknown
bug, but did provide additional reproducers for three known issues:

Refs  #4244 Add support for mixing token, multi- and single-column
            restrictions

Refs #12474 DELETE prints misleading error message suggesting ALLOW
            FILTERING would work

Refs #13250 one-element multi-column restriction should be handled like
            a single-column restriction

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #13436
2023-04-11 09:10:11 +03:00
Botond Dénes
355583066e Merge 'Reduce memory footprint of SSTable index summary' from Raphael "Raph" Carvalho
SSTable summary is one of the components fully loaded into memory that may have a significant footprint.

This series reduces the summary footprint by reducing the amount of token information that we need to keep
in memory for each summary entry.

Of course, the benefit of this size optimization is proportional to the amount of summary entries, which
in turn is proportional to the number of partitions in a SSTable.

Therefore we can say that this optimization will benefit the most tables which have tons of small-sized
partitions, which will result in big summaries.

Results:

```
BEFORE

[1000000  pkeys]		 data size: 	4035888890,  summary -> memory footprint: 	5843232,  entries: 88158
[10000000 pkeys]		 data size: 	40368888890, summary -> memory footprint: 	55787128, entries: 844925

AFTER

[1000000  pkeys]		 data size: 	4035888890,  summary -> memory footprint: 	4351536,  entries: 88158
[10000000 pkeys]		 data size: 	40368888890, summary -> memory footprint: 	42211984, entries: 844925
```

That shows a 25% reduction in footprint, for both 1 and 10 million pkeys.

Closes #13447

* github.com:scylladb/scylladb:
  sstables: Store raw token into summary entries
  sstables: Don't store token data into summary's memory pool
2023-04-11 08:29:11 +03:00
Botond Dénes
05b381bfa2 Merge 'Simple S3 storage for sstables' from Pavel Emelyanov
The PR adds sstables storage backend that keeps all component files as S3 objects and system.sstables_registry ownership table that keeps track of what sstables objects belong to local node and their names.

When a keyspace is configured with 'STORAGE = { 'type': 'S3' }' the respective class table object eventually gets the storage_options instance pointing to the target S3 endpoint and bucket. All the sstables created for that table attach the S3 storage implementation that maintains components' files as S3 objects. Writing to and reading from components is handled by the S3 client facilities from utils/. Changing the sstable state, which is -- moving between normal, staging and quarantine states -- is not yet implemented, but would eventually happen by updating entries in the sstables registry.

To keep track of which node owns which objects, to provide bucket-wide uniqueness of object names and to maintain sstable state the storage driver keeps records in the system.sstables_registry ownership table. The table maps sstable location and generation to the object format, version, status-state (*) and (!) unique identifier (some time soon this identifier is supposed to be replaced with UUID sstables generations). The component object name is thus s3://bucket/uuid/component_basename. The registry is also used on boot. The distributed loader picks up sstables from all the tables found in schema and for S3-backed keyspaces it lists entries in the registry to a) identify those and b) get their unique S3-side identifiers to open by name.

(*) About sstable's status and state.

The state field is the part of today's sstable path on disk -- staging, quarantine, normal (root table data dir), etc. Since S3 doesn't have the renaming facility, moving sstable between those states is only possible by updating the entry in the registry. This is not yet implemented in this set (#13017)

The status field tracks sstable' transition through its creation-deletion. It first starts with 'creating' status which corresponds to the today's TemporaryTOC file. After being created and written to the sstable moves into 'sealed' state which corresponds to the today's normal sstable being with the TOC file. To delete sstable atomically it first moves into 'removing' state which is equivalent to being in the deletion-log for the on-disk sstable. Once removed from the bucket, the entry is removed from the registry.

To play with:

1. Start minio (installed by install-dependencies.sh)
```
export MINIO_ROOT_USER=${root_user}
export MINIO_ROOT_PASSWORD=${root_pass}
mkdir -p ${root_directory}
minio server ${root_directory}
```

2. Configure minio CLI, create anonymous bucket
```
mc config host rm local
mc config host add local http://127.0.0.1:9000 ${root_user} ${root_pass}
mc mb local/sstables
mc anonymous set public local/sstables
```

3. Start Scylla with object-storage feature enabled
``` scylla ... --experimental-features=keyspace-storage-options --workdir ${as_usual}```

4. Create KS with S3 storage
``` create keyspace ... storage = { 'type': 'S3', 'endpoint': '127.0.0.1:9000', 'bucket': 'sstables' };```

The S3 client has a logger named "s3", it's useful to use on with `trace` verbosity.

Closes #12523

* github.com:scylladb/scylladb:
  test: Add object-storage test
  distributed_loader: Print storage type when populating
  sstable_directory: Add ownership table components lister
  sstable_directory: Make components_lister and API
  sstable_directory: Create components lister based on storage options
  sstables: Add S3 storage implementation
  system_keyspace: Add ownership table
  system_keyspace: Plug to user sstables manager too
  sstable: Make storage instance based on storage options
  sstable_directory: Keep storage_options aboard
  sstable: Virtualize the helper that gets on-disk stats for sstable
  sstable, storage: Virtualize data sink making for small components
  sstable, storage: Virtualize data sink making for Data and Index
  sstable/writer: Shuffle writer::init_file_writers()
  sstable: Make storage an API
  utils: Add S3 readable file impl for random reads
  utils: Add S3 data sink for multipart upload
  utils: Add S3 client with basic ops
  cql-pytest: Add option to run scylla over stable directory
  test.py: Equip it with minio server
  sstables: Detach write_toc() helper
2023-04-11 08:17:25 +03:00
Benny Halevy
96660b2ef7 table: discard_sstables: update_sstable_cleanup_state when deleting sstables
We need to remove the deleted sstables from
update_sstable_cleanup_state otherwise their data and index
files will remain opened and their storage space won't be reclaimed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:37:56 +03:00
Benny Halevy
4db961ecac compaction_manager: compact_sstables: retrieve owned ranges if required
If any of the sstables to-be-compacted requires cleanup,
retrive the owned_ranges_ptr from the table_state.

With that, staging sstables will eventually be cleaned up
via regular compaction.

Refs #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:36:10 +03:00
Benny Halevy
9105f9800c sstables: add a printer for shared_sstable
Refactor the printing logic in compaction::formatted_sstables_list
out to sstables::to_string(const shared_sstable&, bool include_origin)
and operator<<(const shared_sstable) on top of it.

So that we can easily print std::vector<shared_sstable>
from compaction_manager in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:31:35 +03:00
Benny Halevy
d87925d9fc compaction_manager: keep owned_ranges_ptr in compaction_state
When perform_cleanup adds sstables to sstables_requiring_cleanup,
also save the owned_ranges_ptr in the compaction_state so
it could be used by other compaction types like
regular, reshape, or major compaction.

When the exhausted sstables are released, check
if sstables_requiring_cleanup is empty, and if it is,
clear also the owned_ranges_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:30:53 +03:00
Benny Halevy
c2bf0e0b72 compaction_manager: perform_cleanup: keep sstables in compaction_state::sstables_requiring_cleanup
As a first step towards parallel cleanup by
(regular) compaction and cleanup compaction,
filter all sstables in perform_cleanup
and keep the set of sstables in the compaction_state.

Erase from that set when the sstables are unregistered
from compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:30:39 +03:00
Benny Halevy
b3192b9f16 compaction: refactor compaction_state out of compaction_manager
To use it both from compaction_manager and compaction_descriptor
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:28:16 +03:00
Benny Halevy
73280c0a15 compaction: refactor compaction_fwd.hh out of compaction_descriptor.hh
So it can be used in the next patch that will refactor
compaction_state out of class compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:19:04 +03:00
Benny Halevy
690697961c compaction_manager: compacting_sstable_registration: keep a ref to the compaction_state
To be used for managing sstables requiring cleanup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:18:02 +03:00