as `_bucket` is an `unordered_map<bucket_id, timestamp_bucket_writer>`,
when writing to a given bucket, we try to create a writer with the
specified bucket id, so the returned iterator should point to a node
whose `first` element is always the bucket id.
so, there is no need to reference `it` for the bucket id, let's just
reference the parameter. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20598
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closesscylladb/scylladb#19356
This writer is used by streaming, on tablet migration and
load-and-stream.
The caller of distribute_reader_and_consume_on_shards(), which provides
a sharder, is supposed to ensure that effective_replication_map is kept
alive around it, in order for topology coordinator to wait for any writes
which may be in flight to reach their shards before tablet replica starts
another migration. This is already the case:
1) repair and load-and-stream keep the erm around writing.
2) tablet migration uses autorefreshing_sharder, so it does not, but
it keeps the topology_guard around the operation in the consumer,
which serves the same purpose.
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
Token group is an abstraction that allows us to easily segregate a
mutation stream into buckets. Groups share the same properties as
compaction groups. Groups follow the ring order and they don't
overlap each other. Groups are defined according to a classifier,
which return an id given a token. It's expected that classifier
return ids in monotonic increasing order.
The reasons for this abstraction are:
1) we don't want to make segregator aware of compaction groups
2) splitting happens before tablet metadata is changed, so the
the segregator will have to classify based on whether the token
belongs to left (group id 0) or right (group id 1) side of
the range to be split.
The reason for not extending sstable writer instead, is that
today, writer consumer can only tell producer to switch to a
new writer, when consuming the end of a partition, but that
would be too late for us, as we have to decide to move to
a new writer at partition start instead.
It will be wired into compaction when it happens in split mode.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To be used in the next patch to control whether the semaphore registers
and exports metrics or not. We want to move metric registration to the
semaphore but we don't want all semaphores to export metrics. The
decision on whether a semaphore should or shouldn't export metrics
should be made on a case-by-case basis so this new parameter has no
default value (except for the for_tests constructor).
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
And propagate it down to where it is created. This will be used to add
trace points for semaphore related events, but this will come in the
next patches.
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes#12858
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
This fixes a long standing bug related to handling of non-full
clustering keys, issue #1446.
after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.
This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.
It probably already causes problems for such keys.
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.
Closes#10433
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but its probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes#10194
Although its API was long converted to v2, its implementation stayed v1
because the memtable and mutation API were still v1. Now that the
memtable flush returns a v2 reader we can have a second look at
converting this. While the mutation API still uses v1, this can easily
be worked around by using going through `mutation_rebuilder_v2`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220302145945.189607-1-bdenes@scylladb.com>
"
Also convert the foreign_reader used by it in the process.
Tests: unit(dev)
"
* 'multishard-writer-v2/v1' of https://github.com/denesb/scylla:
mutation_writer/multishard_writer: remove now unused v1 factory overloads
test/boost/mutation_writer_test: test the v2 variant of distribute_reader_and_consume_on_shards()
flat_mutation_reader: add v2 variant of make_generating_reader()
mutation_reader: multishard_writer: migrate implementation to v2
mutation_reader: convert foreign_reader to v2
streaming/consumer: convert to v2
mutation_writer/multishard_writer: add v2 variant of distribute_reader_and_consume_on_shards()
Just the factory function itself. The underlying machinery stays v1 for
now. Behind the scenes the v2 variant still invokes the v1 one, with the
necessary conversions.
This allows migrating users to the v2 interface, migrating the machinery
later.
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.
Memtables are also used outside the replica, in two places:
- in some virtual tables; this is also in some way inside the replica,
(virtual readers are installed at the replica level, not the
cooordinator), so I don't consider it a layering violation
- in many sstable unit tests, as a convenient way to create sstables
with known input. This is a layering violation.
We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).
Test: unit (dev)
Closes#10120
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Just a facade using converters behind the scenes. The actual segregator
is not worth migrating to v2 while mutation and the flushing readers
don't have a v2 versions. Still, migrating all users to a v2 API allows
the conversion to happen at a single point where more work is necessary,
instead of scattered around all the users.
We leave the v1 version in place to aid incremental migration to the v2
one.
Since we have multiple writers using this that we don't want to migrate
all at once, we create a v2 version of said classes so we can migrate
them incrementally.
Due to an error in transforming the above routine, readers who have <= a
buffer worth of content are dropped without consuming them.
This is due to the outer consume loop being conditioned on
`is_end_of_stream()`, which will be set for readers that eagerly
pre-fill their buffer and also have no more data then what is in their
buffer.
Change the condition to also check for `is_buffer_empty()` and only drop
the reader if both of these are true.
Fixes: #9594
Tests: unit(mutation_writer_test --repeat=200, dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211108092923.104504-1-bdenes@scylladb.com>
The current method of segregating partitions doesn't work well for huge
number of small partitions. For especially bad input, it can produce
hundreds or even thousands of buckets. This patch adds a new segregator
specialized for this use-case. This segregator uses a memtable to sort
out-of-order partitions in-memory. When the memtable size reaches the
provided max-memory limit, it is flushed to disk and a new empty one is
created. In-order partitions bypass the sorting altogether and go to the
fast-path bucket.
The new method is not used yet, this will come in the next patch.
Recently we observed an OOM caused by the partition based splitting
writer going crazy, creating 1.7K buckets while scrubbing an especially
broken sstable. To avoid situations like that in the future, this patch
provides a max limit for the number of live buckets. When the number of
buckets reach this number, the largest bucket is closed and replaced by
a bucket. This will end up creating more output sstables during scrub
overall, but now they won't all be written at the same time causing
insane memory pressure and possibly OOM.
Scrub compaction sets this limit to 100, the same limit the TWCS's
timestamp based splitting writer uses (implemented through the
classifier -
time_window_compaction_strategy::max_data_segregation_window_count).
Fixes: #9400
Tests: unit(dev)
Closes#9401