Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
The reference is put on the snitch_ptr because this is the sharded<>
thing and because gossiper reference is the same for different snitch
drivers. Also, getting gossiper from snitch_ptr by driver will look
simpler than getting it from any base class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
Today, both operations are picking the highest level as the ideal level for
placing the output, but the size of input should be used instead.
The formula for calculating the ideal level is:
ceil(log base(fan_out) of (total_input_size / max_fragment_size))
where fan_out = 10 by default,
total_input_size = total size of input data and
max_fragment_size = maximum size for fragment (160M by default)
such that 20 fragments will be placed at level 2, as level 1
capacity is 10 fragments only.
By placing the output in the incorrect level, tons of backlog will be generated
for LCS because it will either have to promote or demote fragments until the
levels are properly balanced.
"
* 'optimize_lcs_major_and_reshape/v2' of https://github.com/raphaelsc/scylla:
compaction: LCS: avoid needless work post major compaction completion
compaction: LCS: avoid needless work post reshape completion
compaction: LCS: extract calculation of ideal level for input
compaction: LCS: Fix off-by-one in formula used to calculate ideal level
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.
When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.
This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.
New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.
Fixes#10440.
Closes#10456
* github.com:scylladb/scylla:
loading_cache_test: test prepared stmts cache
loading_cache: force minimum size of unprivileged
loading_cache: extract dropping entries to lambdas
loading_cache: separately track size of sections
loading_cache: fix typo in 'privileged'
When executing internal queries, it is important that the developer
will decide if to cache the query internally or not since internal
queries are cached indefinitely. Also important is that the programmer
will be aware if caching is going to happen or not.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.
Fixes: #10450Closes#10451
* github.com:scylladb/scylla:
replica/database: drop_column_family(): drop querier cache entries after waiting for ops
replica/database: finish coroutinizing drop_column_family()
replica/database: make remove(const column_family&) private
Add a new test that sets up a CQL environment with a very small prepared
statements cache. The test reproduces a scenario described in #10440,
where a privileged section of prepared statement cache gets large
and that could possibly starve the unprivileged section, making it
impossible to execute BATCH statements. Additionally, at the end of the
test, prepared statements/"simulated batches" with prepared statements
are executed a random number of times, stressing the cache.
To create a CQL environment with small prepared cache, cql_test_config
is extended to allow setting custom memory_config value.
This patch enforces a minimum size of unprivileged section when
performing shrink() operation.
When the cache is shrank, we still drop entries first from unprivileged
section (as before this commit), however if this section is already small
(smaller than max_size / 2), we will drop entries from the privileged
section.
For example if the cache could store at most 50 entries and there are 49
entries in privileged section, after adding 5 entries (that would go to
unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.
New tests are added to check this behavior and bookkeeping of section
sizes.
Fixes#10440.
That's done by picking the ideal level for the input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
After the recent conversion of the row-cache, two v1 mutation sources
remained: the memtable and the kl sstable reader.
This series converts both to a native v2 implementation. The conversion
is shallow: both continue to read and process the underlying (v1) data
in v1, the fragments are converted to v2 right before being pushed to
the reader's buffer. This conversion is simple, surgical and low-risk.
It is also better than the upgrade_to_v2() used previously.
Following this, the remaining v1 reader implementations are removed,
with the exception of the downgrade_to_v1(), which is the only one left
at this point. Removing this requires converting all mutation sinks to
accept a v2 stream.
upgrade_to_v2() is now not used in any production code. It is still
needed to properly test downgrade_to_v1() (which is till used), so we
can't remove it yet. Instead it hidden as a private method of
mutation_source. This still allows for the above mentioned testing to
continue, while preventing anyone from being tempted to introduce new
usage.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191
"
* 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla:
readers: make upgrade_to_v2() private
test/lib/mutation_source_test: remove upgrade_to_v2 tests
readers: remove v1 forwardable reader
readers: remove v1 empty_reader
readers: remove v1 delegating_reader
sstables/kl: make reader impl v2 native
sstables/kl: return v2 reader from factory methods
sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
partition_snapshot_reader: convert implementation to native v2
mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.
After commit e5fc4b6, partial sst cannot be deleted because it is
incorrectly assuming all files being deleted unconditionally has
TOC, but that's not true for partial files that need to be removed.
The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.
Let's fix this by taking into account temp TOC for partial files.
Fixes#10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#10411
* github.com:scylladb/scylla:
sstables: Fix deletion of partial SSTables
sstables: Fix fsync_directory()
sstables: Rename dirname() to a more descriptive name
We don't have any upgrade_to_v2() left in production code, so no need to
keep testing it. Removing it from this test paves the way for removing
it for good (not in this series).
The only user is row level repair: it is replaced with
downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has
lots of downgrade_to_v1() calls, we will deal with these later all at
once.
Another use is the empty mutation source, this is trivially converted to
use the v2 variant.
Reads (part of operations) running concurrent to `drop_column_family()`
can create querier cache entries while we wait for them to finish in
`await_pending_ops()`. Move the cache entry eviction to after this, to
ensure such entries are also cleaned up before destroying the table
object.
This moves the `_querier_cache.evict_all_for_table()` from
`database::remove()` to `database::drop_column_family()`. With that the
former doesn't have to return `future<>` anymore. While at it (changing
the signature) also rename `column_family` -> `table`.
Also add a regression unit test.
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respecive
definitions of mutation and frozen_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10430
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.
After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.
The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.
This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().
Fixes#10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.
Closes#10433
"
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
This means there is still conversion ingoing when reading from cache and
when populating, but when reading from underlying, the stream can now be
passed through as-is without conversions.
Also, any future v2 related changes to the in-memory storage will now be
limited to the cache reader implementation itself.
In the process, the non-forwarding reader, whose only user is the cache,
is also converted to v2.
"
Performance results reported by Botond:
"
build/release/test/perf/perf_simple_query -c1 -m2G --flush --
duration=20
BEFORE
median 130421.76 tps ( 71.1 allocs/op, 12.1 tasks/op, 47462
insns/op)
median absolute deviation: 319.64
maximum: 131028.33
minimum: 127502.55
AFTER
median 133297.41 tps ( 64.1 allocs/op, 12.2 tasks/op, 45406
insns/op)
median absolute deviation: 2964.24
maximum: 137581.56
minimum: 123739.4
Getting rid of those upgrade/downgrade was good for allocs and ops.
Curiously there is a 0.1 rise in number of tasks though.
"
* 'row-cache-readers-v2/v1' of https://github.com/denesb/scylla:
row_cache: update reader implementations to v2
range_tombstone_change_generator: flush(): add end_of_range
readers/nonforwardable: convert to v2
read_context: fix indentation
read_context: coroutinize move_to_next_partition()
row_cache: cache_entry::read(): return v2 reader
row_cache: return v2 readers from make_reader*()
readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
It makes mutation_fragment_queue copyable and makes the pointer to
pending mutation fragments in next commit stable. This allows moving the
mutation_fragment_queue without breaking the underlying
upgrading_consumer.
And adjust callers. The factory functions just sprinkle upgrade_to_v2()
on returned readers for now.
One test in row_cache_test.cc had to be disabled, because the upgrade to
v2 wrapper we now have over cache readers doesn't allow it to directly
control the reader's buffer size and so the test fails. There is a FIXME
left in the test code and the test will be re-enabled once a native v2
reader implementation allows us to get rid of the upgrade wrapper.
This patch series splits up parts of repair pipeline to allow unit testing
various bits of code without having to run full dtest suite. The reason why
repair pipeline has no unit tests is that by definition repair requires multiple
nodes, while unit test environment works only for a single node.
However, it is possible to explicitly define interfaces between various parts of the
pipeline, inject dependencies and test them individually. This patch series is focused
on taking repair_rows_on_wire (frozen mutation representation of changes coming from
another node) and flushing them to an sstable.
The commits are split into the following parts:
- pulling out classes to separate headers so that they can be included (potentially indirectly) from the test,
- pulling out repair_meta::to_repair_rows_list and part of repair_meta::flush_rows_in_working_row_buf so that they can be tested,
- refactoring repair_writer so that the actual writing logic can be injected as dependency,
- creating the unit test.
tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)
Closes#10345
* github.com:scylladb/scylla:
repair: Add unit test for flushing repair_rows_on_wire to disk.
repair: Extract mutation_fragment_queue and repair_writer::impl interfaces.
repair: Make parts of repair_writer interface private.
repair: Rename inputs to flush_rows.
repair: Make repair_meta::flush_rows a free function.
repair: Split flush_rows_in_working_row_buf to two functions and make one static.
repair: Rename inputs to to_repair_rows_list.
repair: Make to_repair_rows_list a free function.
repair: Make repair_meta::to_repair_rows_list a static function
repair: Fix indentation in repair_writer.
repair: Move repair_writer to separate header.
repair: Move repair_row to a separate header.
repair: Move repair_sync_boundary to a separate header.
repair: Move decorated_key_with_hash to separate header.
repair: Move row_repair hashing logic to separate class and file.
"
Optimize consuming from a single partition.
This gives us significant improvement with single, small mutations,
as shown with perf_mutation_readers, compared to the vector-based
flat_mutation_reader_from_mutations_v2.
These are expected to be common on the write path,
and can be optimized for view building.
results from: perf_mutation_readers -c1 --random-seed=840478750
(userspace cpu-frequency governer, 2.2GHz)
test iterations median mad min max
Before:
combined.one_row 720118 825.668ns 1.020ns 824.648ns 827.750ns
After:
combined.one_mutation 881482 751.157ns 0.397ns 750.211ns 751.912ns
combined.one_row 843270 756.553ns 0.303ns 755.889ns 757.911ns
The grand plan is to follow up
with make_flat_mutation_reader_from_frozen_mutation_v2
so that we can read directly from either a mutation
or frozen_mutation without having to unfreeze it e.g. in
table::push_view_replica_updates.
Test: unit(dev)
Perf: perf_mutation_readers(release)
"
* tag 'flat_mutation_reader_from_mutation-v3' of https://github.com/bhalevy/scylla:
perf: perf_mutation_readers: add one_mutation case
test: mutation_query_test: make make_source static
mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2
mutation readers: add make_flat_mutation_reader_from_mutation_v2
readers: delete slice_mutation.hh
test: flat_mutation_reader_test: mock_consumer: add debug logging
test: flat_mutation_reader_test: mock_consumer: make depth counter signed
"
There's a generic way to start-stop services in scylla, that includes
5 "actions" (some are optional and/or implicit though)
service_config cfg = ...
sharded<service>.start(cfg)
service.invoke_on_all(&service::start)
service.invoke_on_all(&service::shutdown)
service.invoke_on_all(&servuce::stop)
sharded<service>.stop()
and most of the service out there conforms to that scheme. Not snitch
(spoiler: and not tracing), for which there's a couple of helpers that
do all that magic behind the scenes, "configuring" snitch is done with
the help of overloaded constructors. The latter is extra complicated
with the need to register snitch drivers in class-registry for each
constructor overload. Also there's an external shards synchronization
on stop.
This set brings snitch start/stop code to the described standard: the
create/stop helpers are removed, creation acceps the config structure,
per-shard start/stop (snitch has no drain for now) happens in the
simple invoke-on-all manner.
The intended side effect of this change is the ability to add explicit
dependencies to snitch (in the future, not in this set).
tests: unit(dev)
"
* 'br-snitch-config' of https://github.com/xemul/scylla:
snitch: Remove create_snitch/stop_snitch
snitch: Simplify stop (and pause_io)
snitch: Move io_is_stopped to property-file driver
snitch: Remove init_snitch_obj()
snitch: Move instance creation into snitch_ptr constructor
snitch: Make config-based construction of all drivers
snitch: Declare snitch_ptr peering and rework container() method
snitch: Introduce container() method
We want to return stop_iteration::yes once we crossed
the initial depth threshold, with an unsigned depth counter,
it might wraparound and look > 1.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no user impact.
Fixes#10364.
Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no known user impact.
Fixes#10363.
Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
The unit test executes a simplified repair scenario by:
- producing a random stream of mutation mutation_fragments,
- convering them to repair_rows_on_wire,
- convering them to list of repair_rows using the conversion logic
extracted in previous commits from repair_meta,
- flushing the rows to an sstable using the logic extracted in previous
commits from repair_meta,
- comparing the sstable contents with the originally produced mutation
fragments.
The test checks only the flushing part and is not concerned with any
other piece of the repair pipeline.
After previous patches both, create_snitch() and stop_snitch() no look
like the classica sharded service start/stop sequence. Finally both
helpers can be removed and the rest of the user can just call start/stop
on locally obtained sharded references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently snitch drivers register themselves in class-registry with all
sorts of construction options possible. All those different constuctors
are in fact "config options".
When later snitch will declare its dependencies (gossiper and system
keyspace), it will require patching all this registrations, which's very
inconvenient.
This patch introduces the snitch_config struct and replaces all the
snitch constructors with the snitch_driver(snitch_config cfg) one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series is part of the shared storage project.
The STORAGE option is designed to hold a map of options
used for customizing storage for given keyspace.
The option is kept in a system_schema.scylla_keyspaces table.
This option is guarded with a schema feature, because it's kept in a new schema table: `system_schema.scylla_keyspaces`.
Example of the contents of the new table:
```cql
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;
keyspace_name | storage_options | storage_type
---------------+------------------------------------------------+--------------
ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} | S3
```
Native storage options are not kept in the table, as this format doesn't hold any extra options and it would therefore just be a waste of storage.
Closes#10144
* github.com:scylladb/scylla:
test: regenerate schema_change_test for storage options case
test: improve output of schema_change_test regeneration
docs: add a paragraph on keyspace storage options
test: add test cases for keyspace storage options
database,cql3: add STORAGE option to keyspaces
db: add keyspace-storage-options experimental feature
db,schema_tables: add scylla_keyspaces table
db,gms: add SCYLLA_KEYSPACE schema feature
db,gms: add KEYSPACE_STORAGE_OPTIONS feature
While reviewing "utils/chunked_managed_vector: Fix corruption in case there is more
than one chunk", I was worried that there could be a correctness issue
when pop_back() pops off the first element of the last chunk, but turns
out I made an off-by-one error in my theory. Anyway, I wrote a unit test
to verify my assumption and I found worth submitting it upstream.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408133555.12397-2-raphaelsc@scylladb.com>
Keyspace storage options series adds a new schema table:
system_schema.scylla_keyspaces. The regenerated cases ensure
that this new table is taken into account when the schema feature
is available.
Schema change test operates on pre-generated sstables, and sometimes
this set of sstables needs to be regenerated. In order to make the
regeneration process more ergonomic, the output is now directly
copyable as valid C++ representation of UUIDs.
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.
Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.
Currently, the container is only used in the sstable partition index
cache.
Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.
Introduced in 78e5b9fd85 (4.6.0)
Fixes#10290
Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
There's a public call on replica::table to get back the compaction
manager reference. It's not needed, actually. The users of the call are
distributed loader which already has database at hand, and a test that
creates itw own instance of compaction manager for its testing tables
and thus also has it available.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220406171351.3050-1-xemul@scylladb.com>
Allowing to consume the frozen_mutation directly
to a stream rather than unfreezing it first
and then consuming the unfrozen mutation.
Streaming directly from the frozen_mutation
saves both cpu and memory, and will make it
easier to be made async as a follow, to allow
yielding, e.g. between rows.
This is used today only in to_data_query_result
which is invoked on the read-repair path.
Refs #10038Fixes#10021
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>
Bucket awareness in cleanup was introduced in a69d98c3d0.
STCS and TWCS already support it, and now LCS will receive it.
The goal of bucket awareness is to reduce writeamp in cleanup,
therefore reducing operation time. Additionally, garbage collection
becomes more efficient as shadowed data can now be potentially
compacted with the data that shadows it, assuming they're on
the same level.
The implementation for LCS is simple. Will reuse the procedure
for STCS for returning jobs in level 0. And one job will be
returned for each non-empty level > 0. What allows us to do it
is our incremental selection approach used in compaction,
that sets a limit on memory usage and disk space requirement.
Fixes#10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220331173417.211257-1-raphaelsc@scylladb.com>