Commit Graph

22 Commits

Author SHA1 Message Date
Botond Dénes
5a73c3374e sstables_loader: opt-in for compacting the stream
No point in loading expired/covered data.
2023-07-27 03:22:11 -04:00
Botond Dénes
42b0dd5558 replica/table: add optional compacting to make_streaming_reader()
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!

All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in a separate commit per user.
2023-07-27 03:22:11 -04:00
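The opt-in rule in 42b0dd5558 can be sketched as a small predicate. This is a minimal sketch, not Scylla code: `should_compact_stream` is a hypothetical name, and `std::chrono::system_clock` stands in for `gc_clock`. Compaction happens only when the caller passes an engaged `compaction_time` and the `enable_compacting_data_for_streaming_and_repair` override has not force-disabled it.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Stand-in for gc_clock::time_point (assumption: the real type is Scylla's gc_clock).
using time_point = std::chrono::system_clock::time_point;

// Hypothetical helper deciding whether the streaming reader's output is compacted.
// - compaction_time disengaged -> the caller did not opt in, no compaction.
// - config_enabled == false    -> the global config override force-disables it.
// Per the commit message, such compaction never garbage-collects tombstones;
// that detail is not modeled here.
bool should_compact_stream(const std::optional<time_point>& compaction_time,
                           bool config_enabled) {
    return compaction_time.has_value() && config_enabled;
}
```

Making the parameter non-defaulted, as the commit does, forces every call site to state its choice explicitly.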
Kefu Chai
84683c3549 sstable_loader: update comment to reflect latest changes
We have had a dedicated facility for loading sstables since
68dfcf5256, and column_family (i.e. table)
is no longer responsible for loading new sstables. So update the comment
to reflect this change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14154
2023-06-06 14:31:15 +03:00
Tomasz Grabiec
9b17ad3771 locator: Introduce per-table replication strategy
Will be used by tablet-based replication strategies, for which
the effective replication map is different for each table.

Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.

For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.

Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
2023-04-24 10:49:36 +02:00
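The "every table has an effective replication map" design from 9b17ad3771 can be illustrated with a heavily simplified model. All types below are hypothetical stand-ins, not Scylla's `locator` classes: per-keyspace strategies hand every table the same shared map, while per-table strategies give each table its own, so callers always ask the table.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, heavily simplified stand-in for an effective replication map.
struct effective_replication_map {
    std::string description;
};
using erm_ptr = std::shared_ptr<const effective_replication_map>;

struct keyspace {
    bool per_table = false;  // true for tablet-style, per-table strategies
    erm_ptr keyspace_erm;    // used only when per_table == false
};

// Every table exposes an erm, so client code never branches on strategy type.
struct table {
    erm_ptr erm;
    const erm_ptr& get_effective_replication_map() const { return erm; }
};

table make_table(const keyspace& ks, const std::string& name) {
    if (ks.per_table) {
        // Per-table strategy: each table gets its own map.
        return table{std::make_shared<const effective_replication_map>(
            effective_replication_map{"per-table map for " + name})};
    }
    // Per-keyspace strategy: all tables share the keyspace's map.
    return table{ks.keyspace_erm};
}
```

Sharing one pointer keeps the per-keyspace case cheap while letting client code stay uniform.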
Raphael S. Carvalho
fe6df3d270 sstable_loader: Discard SSTable bloom filter on load-and-stream
Load-and-stream reads the entire content from SSTables, therefore it can
afford to discard the bloom filter that might otherwise consume a significant
amount of memory. Bloom filters are only needed by compaction and other
replica::table operations that might want to check the presence of keys
in the SSTable files, like single-partition reads.

It's not uncommon to see a Data:Filter ratio of less than 100:1, meaning
that for ~300G of data, filters will take ~3G or more.

In addition to reducing the memory footprint, it also reduces operation time,
as load-and-stream no longer has to read, parse and build the filters
from disk into memory.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-04-13 11:34:22 -03:00
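The filter-memory estimate in fe6df3d270 is simple division: with a Data:Filter ratio of r:1, filters take about data_size / r. The helper below is hypothetical, not part of Scylla, and only makes the ~300G to ~3G example concrete. A lower ratio means proportionally more filter memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: estimated bloom filter footprint for a Data:Filter
// size ratio of `ratio`:1 (e.g. ratio == 100 means Data:Filter = 100:1).
constexpr std::uint64_t estimated_filter_bytes(std::uint64_t data_bytes,
                                               std::uint64_t ratio) {
    return data_bytes / ratio;
}

constexpr std::uint64_t GiB = 1024ull * 1024 * 1024;

// 300 GiB of data at 100:1 -> 3 GiB of filters.
static_assert(estimated_filter_bytes(300 * GiB, 100) == 3 * GiB,
              "100:1 ratio gives ~1% of data size");
// A ratio below 100:1 (here 50:1) means the filters take more memory.
static_assert(estimated_filter_bytes(300 * GiB, 50) == 6 * GiB,
              "50:1 ratio gives ~2% of data size");
```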
Botond Dénes
156e5d346d reader_permit: keep trace_state pointer on permit
And propagate it down to where it is created. This will be used to add
trace points for semaphore related events, but this will come in the
next patches.
2023-03-22 04:58:01 -04:00
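The idea in 156e5d346d, keeping a trace-state pointer on the permit and passing it in from whoever creates the permit, can be sketched with hypothetical stand-ins. These are not Scylla's `tracing::trace_state_ptr` or `reader_permit`. The permit can then record semaphore-related events without extra parameters.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a trace state and its shared pointer.
struct trace_state {
    std::vector<std::string> events;
};
using trace_state_ptr = std::shared_ptr<trace_state>;

class reader_permit_sketch {
    trace_state_ptr _trace;  // may be null when tracing is disabled
public:
    explicit reader_permit_sketch(trace_state_ptr t) : _trace(std::move(t)) {}
    // Semaphore-related events can now be traced from the permit alone.
    void trace(const std::string& msg) const {
        if (_trace) {
            _trace->events.push_back(msg);
        }
    }
};

// The creator receives the trace state from its caller and stores it on the permit.
reader_permit_sketch make_permit(trace_state_ptr t) {
    return reader_permit_sketch(std::move(t));
}
```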
Raphael S. Carvalho
fbeee8b65d Optimize load-and-stream
load-and-stream implements no policy when deciding which SSTables will go in
each streaming round (batch of 16 SSTables), meaning the choice is random.

It can take advantage of the fact that the LSM-tree layout, with ICS and LCS,
is a set of SSTable runs, where each run is composed of SSTables that are
disjoint in their key range.

By sorting SSTables to be streamed by their first key, the effect is that
SSTable runs will be incrementally streamed (in token order).

SSTable runs in the same replica group (or in the same node) will have their
content deduplicated, significantly reducing the amount of data we need to
put on the wire. The improvement is proportional to the space amplification
in the table, which, in turn, depends on the compaction strategy used.

Another important benefit is that the destination nodes will receive SSTables
in token order, allowing off-strategy compaction to be more efficient.

This is how I tested it:

1) Generated a 5GB dataset into an ICS table.
2) Started a fresh 2-node cluster. RF=2.
3) Ran load-and-stream against one of the replicas.

BEFORE:

$ time curl -X POST "http://127.0.0.1:10000/storage_service/sstables/keyspace1?cf=standard1&load_and_stream=true"

real	4m40.613s
user	0m0.005s
sys	0m0.007s

AFTER:

$ time curl -X POST "http://127.0.0.1:10000/storage_service/sstables/keyspace1?cf=standard1&load_and_stream=true"

real	2m39.271s
user	0m0.005s
sys	0m0.004s

That's ~1.76x faster.

That's explained by deduplication:

BEFORE:

INFO  2023-02-17 22:59:01,100 [shard 0] stream_session - [Stream #79d3ce7a-ea47-4b6e-9214-930610a18ccd] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3445376, received_partitions=2755835
INFO  2023-02-17 22:59:41,491 [shard 0] stream_session - [Stream #bc6bad99-4438-4e1e-92db-b2cb394039c8] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3308288, received_partitions=2836491
INFO  2023-02-17 23:00:20,585 [shard 0] stream_session - [Stream #e95c4f49-0a2f-47ea-b41f-d900dd87ead5] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3129088, received_partitions=2734029
INFO  2023-02-17 23:00:49,297 [shard 0] stream_session - [Stream #255cba95-a099-4fec-a72c-f87d5cac2b1d] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2544128, received_partitions=1959370
INFO  2023-02-17 23:01:33,110 [shard 0] stream_session - [Stream #96b5737e-30c7-4af8-a8b8-96fecbcbcbd0] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3624576, received_partitions=3085681
INFO  2023-02-17 23:02:20,909 [shard 0] stream_session - [Stream #3185a48b-fb9e-4190-88f4-5c7a386bc9bd] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3505024, received_partitions=3079345
INFO  2023-02-17 23:03:02,039 [shard 0] stream_session - [Stream #0d2964dc-d5e3-4775-825c-97f736d14713] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2808192, received_partitions=2655811

AFTER:

INFO  2023-02-17 23:12:49,155 [shard 0] stream_session - [Stream #bf00963c-3334-4035-b1a9-4b3ceb7a188a] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2965376, received_partitions=1006535
INFO  2023-02-17 23:13:13,365 [shard 0] stream_session - [Stream #1cd2e3ac-a68b-4cb5-8a06-707e91cf59db] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3543936, received_partitions=1406157
INFO  2023-02-17 23:13:37,474 [shard 0] stream_session - [Stream #5a278230-6b4b-461f-8396-c15df7092d03] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3639936, received_partitions=1371298
INFO  2023-02-17 23:14:02,132 [shard 0] stream_session - [Stream #19f40dc3-e02a-4321-a917-a6590d99dd03] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3638912, received_partitions=1435386
INFO  2023-02-17 23:14:26,673 [shard 0] stream_session - [Stream #d47507eb-2067-4e8f-a4f7-c82d5fbd4228] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=3561600, received_partitions=1423024
INFO  2023-02-17 23:14:49,307 [shard 0] stream_session - [Stream #d42ee911-253a-4de6-ac89-6a3c05b88d66] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2382592, received_partitions=1452656
INFO  2023-02-17 23:15:10,067 [shard 0] stream_session - [Stream #1f78c1bf-8e20-41bd-95de-16de3fc5f86c] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2632320, received_partitions=1252298

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20230219191924.37070-1-raphaelsc@scylladb.com>
2023-02-20 12:46:14 +01:00
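The batching policy in fbeee8b65d (sort the SSTables to be streamed by first key, then stream them in rounds of 16) can be sketched as below. This is a minimal sketch under stated assumptions: `sstable_info` and `make_streaming_batches` are hypothetical names, and tokens are modeled as plain `int64_t` instead of decorated keys. Sorting by first key places overlapping SSTables from different runs near each other, so they tend to land in the same round. The commit credits this for the deduplication seen in the logs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for an SSTable: only its token range matters here.
struct sstable_info {
    std::int64_t first_token;
    std::int64_t last_token;
};

// Sort by first key, then split into fixed-size streaming rounds
// (16 SSTables per round, per the commit message).
std::vector<std::vector<sstable_info>>
make_streaming_batches(std::vector<sstable_info> sstables, std::size_t batch_size = 16) {
    std::sort(sstables.begin(), sstables.end(),
              [](const sstable_info& a, const sstable_info& b) {
                  return a.first_token < b.first_token;
              });
    std::vector<std::vector<sstable_info>> batches;
    for (std::size_t i = 0; i < sstables.size(); i += batch_size) {
        std::size_t end = std::min(i + batch_size, sstables.size());
        batches.emplace_back(sstables.begin() + static_cast<std::ptrdiff_t>(i),
                             sstables.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```

Without the sort, each round would be an arbitrary mix of SSTables, and a partition duplicated across runs could be streamed once per round that contains a copy.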
Benny Halevy
314e45d957 streaming: define plan_id as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:45:30 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
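The strong-type technique behind 257d74bb34 and 314e45d957 can be illustrated with a minimal tagged-UUID template. This sketch does not reproduce Scylla's `utils::tagged_uuid`. The `uuid` struct and the tag names are simplified, hypothetical stand-ins. Because each tag yields a distinct type, passing a `table_id` where a `table_schema_version` is expected fails to compile.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Simplified 128-bit UUID stand-in (assumption; not utils::UUID).
struct uuid {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    bool operator==(const uuid& o) const { return msb == o.msb && lsb == o.lsb; }
};

// Strong typedef: the Tag parameter makes each instantiation an unrelated type.
template <typename Tag>
class tagged_uuid {
    uuid _id;
public:
    explicit tagged_uuid(uuid id) : _id(id) {}
    const uuid& uuid_value() const { return _id; }
    bool operator==(const tagged_uuid& o) const { return _id == o._id; }
};

// Hypothetical tags mirroring the names in the commit messages.
struct table_id_tag {};
struct table_schema_version_tag {};
struct plan_id_tag {};
using table_id = tagged_uuid<table_id_tag>;
using table_schema_version = tagged_uuid<table_schema_version_tag>;
using plan_id = tagged_uuid<plan_id_tag>;

static_assert(!std::is_convertible<table_id, table_schema_version>::value,
              "distinct tags must yield unrelated types");
```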
Raphael S. Carvalho
aa667e590e sstable_set: Fix partitioned_sstable_set constructor
The sstable set parameter isn't used anywhere, and it's also buggy,
as the sstable run list isn't updated accordingly. So it could happen
that the set contains sstables but the run list is empty, introducing
an inconsistency.

We're fortunate that the bug wasn't activated, as it would've been
a hard one to catch. Found this while auditing the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
2022-06-21 11:58:13 +03:00
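The bug fixed in aa667e590e is a broken invariant between two structures that must always describe the same SSTables: the set itself and its per-run index. A minimal sketch with hypothetical types (not Scylla's `partitioned_sstable_set`) keeps them in sync by funneling every addition through a single insertion path. A constructor that accepted a pre-populated set without also filling the run index could break this invariant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical simplified SSTable: identified by generation, belongs to one run.
struct sstable {
    std::int64_t generation;
    std::int64_t run_id;
    bool operator<(const sstable& o) const { return generation < o.generation; }
};

class sstable_set_sketch {
    std::set<sstable> _all;
    std::map<std::int64_t, std::set<sstable>> _runs;  // run_id -> members
public:
    // Single insertion path: both structures are updated together.
    void insert(const sstable& sst) {
        _all.insert(sst);
        _runs[sst.run_id].insert(sst);
    }
    std::size_t size() const { return _all.size(); }
    std::size_t run_count() const { return _runs.size(); }
    // Invariant check: every run member is in _all under the same run_id,
    // and the runs together account for every SSTable in _all.
    bool consistent() const {
        std::size_t in_runs = 0;
        for (const auto& entry : _runs) {
            for (const auto& sst : entry.second) {
                auto it = _all.find(sst);
                if (it == _all.end() || it->run_id != entry.first) {
                    return false;
                }
                ++in_runs;
            }
        }
        return in_runs == _all.size();
    }
};
```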
Avi Kivity
afc06f0017 messaging: forward-declare types in messaging_service.hh
messaging_service.hh is a switchboard - it includes many things,
and many things include it. Therefore, changes in the things it
includes affect many translation units.

Reduce the dependencies by forward-declaring as much as possible.
This isn't pretty, but it reduces compile time and recompilations.

Other headers were adjusted as needed so everything (including
`ninja dev-headers`) still compiles.

Closes #10755
2022-06-09 15:52:12 +03:00
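The forward-declaration technique from afc06f0017 can be shown in a single file with hypothetical names. A header that mentions a type only by reference, by pointer, or in function declarations needs just `class X;`, not X's full header. Translation units that include it then stop recompiling when X's definition changes. In a real project the two halves below would live in a header and a .cc file.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// --- "header" part: forward-declare instead of #including the full definition ---
namespace example {
class heavy_payload;  // incomplete type here

// Declarations may use the incomplete type by reference or in return types.
std::size_t payload_size(const heavy_payload& p);
std::unique_ptr<heavy_payload> make_payload(std::size_t n);
}  // namespace example

// --- ".cc" part: only this side needs the complete definition ---
namespace example {
class heavy_payload {
public:
    explicit heavy_payload(std::size_t n) : _n(n) {}
    std::size_t size() const { return _n; }
private:
    std::size_t _n;
};

std::size_t payload_size(const heavy_payload& p) { return p.size(); }

std::unique_ptr<heavy_payload> make_payload(std::size_t n) {
    return std::make_unique<heavy_payload>(n);
}
}  // namespace example
```

The commit's "this isn't pretty" trade-off follows from this: every forward declaration must be kept in sync by hand with the real definition.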
Michael Livshin
00bee4e0b3 sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Avi Kivity
4b53af0bd5 treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurrences (i.e. where the caller is a coroutine).

One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.

Closes #10699
2022-05-31 09:06:24 +03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except for
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
97d74de8fc Merge "flat_mutation_reader: clone evictable_reader & convert some others" from Michael Livshin
"
The first patch introduces evictable_reader_v2, and the second one
further simplifies it.  We clone instead of converting because there
is at least one downstream (by way of multishard_combining_reader) use
that is not itself straightforward to convert at the moment
(multishard_mutation_query), and because evictable_reader instances
cannot be {up,down}graded (since users also access the underlying
buffers).  This also means that shard_reader, reader_lifecycle_policy
and multishard_combining_reader have to be cloned.
"

* tag 'clone-evictable-reader-to-v2/v3' of https://github.com/cmm/scylla:
  convert make_multishard_streaming_reader() to flat_mutation_reader_v2
  convert table::make_streaming_reader() to flat_mutation_reader_v2
  convert make_flat_multi_range_reader() to flat_mutation_reader_v2
  view_update_generator: remove unneeded call to downgrade_to_v1()
  introduce multishard_combining_reader_v2
  introduce shard_reader_v2
  introduce the reader_lifecycle_policy_v2 abstract base
  evictable_reader_v2: further code simplifications
  introduce evictable_reader_v2 & friends
2022-01-11 17:01:08 +02:00
Michael Livshin
be5118a7c9 convert table::make_streaming_reader() to flat_mutation_reader_v2
All changes are mechanical.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Avi Kivity
4392c20bd3 replica: move distributed_loader into replica module
distributed_loader is a replica-side thing, so it belongs in the
replica module ("distributed" refers to its ability to load
sstables in their correct shards). So move it to the replica
module.
2022-01-10 15:25:28 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
e51fcc22f3 sstable_loader: add missing include <cfloat>
Needed for FLT_EPSILON

Closes #9646
2021-11-17 09:01:49 +02:00
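`FLT_EPSILON` is a macro defined in `<cfloat>`. Code that uses it without including that header only compiles if some other header happens to pull it in transitively, which is fragile. How sstables_loader uses it is not shown here. The helper below is a hypothetical example of the kind of use it enables: a tolerance-based float comparison.

```cpp
#include <cassert>
#include <cfloat>  // FLT_EPSILON: gap between 1.0f and the next representable float
#include <cmath>

// Hypothetical helper: floats are "nearly equal" if they differ by at most
// ulps * FLT_EPSILON scaled by the larger magnitude (with a floor of 1.0,
// so near zero this acts as an absolute tolerance).
bool nearly_equal(float a, float b, float ulps = 4.0f) {
    float scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= ulps * FLT_EPSILON * std::fmax(scale, 1.0f);
}
```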
Benny Halevy
fdaa891332 storage_service, sstables_loader: use effective_replication_map to get_natural_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:50:27 +03:00
Pavel Emelyanov
68ecec0197 sstables_loader: Accept the sstables loading code
The code was moved into the relevant .cc file by the previous patch; now
make it sit in the relevant class. One "significant" change is that
the messaging service is available by local reference already, not
by the sharded one. Other dependencies are already satisfied by the
patch that introduced the sstables_loader class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:08:21 +03:00
Pavel Emelyanov
42f83f6669 storage_service: Move the sstables loading code
Just cut-n-paste the code into sstables_loader.cc. No other
changes except replacing the storage service logger with its own one.
For now the code stays in the storage_service class, but the next
patch will relocate the code into the sstables_loader one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:07:39 +03:00