The update generation path must track and apply all tombstones,
both from the existing base row (if read-before-write was needed)
and for the new row. One such path contained an error, because
it assumed that if the existing row is empty, then the update
can be simply generated from the new row. However, lack of the
existing row can also be the result of a partition/range tombstone.
If that's the case, it needs to be applied, because it's entirely
possible that this partition row also hides the new row.
Without taking the partition tombstone into account, creating
a future tombstone and inserting an out-of-order write before it
in the base table can result in ghost rows in the view table.
This patch comes with a test which was proven to fail before the
changes.
Branches 3.1,3.2,3.3
Fixes#5793
Tests: unit(dev)
Message-Id: <8d3b2abad31572668693ab585f37f4af5bb7577a.1581525398.git.sarna@scylladb.com>
query_processor is a central class, so reducing its includes
can reduce dependencies treewite. This patch removes includes
for parsed_statement, cf_statement, and untyped_result_set and
fixes up the rest of the tree to include what it lacks as a result
of these removals.
The view_update_generator acceps (and keeps) database and storage_proxy,
the latter is only needed to initialize the view_updating_consumer which,
in turn, only needs it to get database from (to find column family).
This can be relaxed by providing the database from _generator to _consumer
directly, without using the storage_proxy in between.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112427.18419-1-xemul@scylladb.com>
The storage proxy statistics structure did not contain
a method for registering the statistics for metric
groups, instead, each user had to register some
of the metrics by itself. There is no real reason
for separating the metrics registration from
the statistics data. There is even less justification
for doing this only for part of the stats as is
the case for those statistics.
This commit internalize the metrics registration
in the storage_proxy stats structures.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
"
Grab the lowest hanging fruits.
This patch-set makes three important changes:
* Consume the memory for I/O operations on tracked files, *before* they
are forwarded to the underlying file.
* Track memory consumed by buffers created for parsing in
`continuous_data_consumer`. As this is the basis for the data, index
and promoted index parsers, all three are covered now in this regard.
* Track the index file.
The remaining, not-so-low handing fruits in order of
gain/cost(performance) ratio:
* Track in-memory index lists.
* Track in-memory promoted index blocks.
* Track reader buffer memory.
Note that this ordering might change based on the workload and other
environmental factors.
Also included in this series is an infrastructure refactoring to make
tracking memory easier and involve including lighter headers, as well as
a manual test designed to allow testing and experimenting with the
effects of changes to the accuracy of the tracking of reader memory
consumption.
Refs: #4176
Refs: #2778
Tests: unit(dev), manual(sstable_scan_footprint_test)
The latter was run as:
build/dev/test/manual/sstable_scan_footprint_test -c1 -m2G --reads=4000
--read-concurrency=1 --logger-log-level test=trace --collect-stats
--stats-period-ms=20
This will trickle reads until the semaphore blocks, then wait until the
wait queue drains before sending new reads. This way we are not testing
the effectiveness of the pre-admission estimation (which is terribly
optimistic) and instead check that with slowly ramping up read load the
semaphore will block on memory preventing OOM.
This now runs to completion without a single `std::bad_alloc`. The read
concurrency semaphore allows between 15-30 reads, and is always blocked
on memory.
"
* 'more-accurate-reader-resource-tracking/v1' of ssh://github.com/denesb/scylla:
test/manual/sstable_scan_footprint_test: improve memory consumption diagnostics
tests/manual/sstable_scan_footprint_test: use the semaphore to determine read rate
tests/manual: Add test measuring memory demand of concurrent sstable reads
index_reader: make the index file tracked
sstables/continuous_data_consumer: track buffers used for parsing
reader_concurrency_semaphore: tracking_file_impl: consume memory speculatively
reader_concurrency_semaphore: bye reader_resource_tracker
treewide: replace reader_resource_tracer with reader_permit
reader_permit: expose make_tracked_temporary_buffer()
reader_permit: introduce make_tracked_file()
reader_permit: introduce memory_units
reader_concurrency_semaphore: mv reader_resources and reader_permit to reader_permit.hh
reader_concurrency_semaphore: reader_permit: make it a value type
reader_concurrency_semaphore: s/resources/reader_resources/
reader_concurrency_semaphore::reader_permit: move methods out-of-line
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.
This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
The method view_info::partition_ranges() is unused.
Also drop the now-dead _partition_ranges data member.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
A mistake in handling legacy checks for special 'idx_token' column
resulted in not recognizing materialized views backing secondary
indexes properly. The mistake is really a typo, but with bad
consequences - instead of checking the view schema for being an index,
we asked for the base schema, which is definitely not an index of
itself.
Branches 3.1,3.2 (asap)
Fixes#5621Fixes#4744
Before this patch the iterations over migration_notifier::_listeners
could race with listeners being added and removed.
The addition side is not modified, since it is common to add a
listener during construction and it would require a fairly big
refactoring. Instead, the iteration is modified to use indexes instead
of iterators so that it is still valid if another listener is added
concurrently.
For removal we use a rw lock, since removing an element invalidates
indexes too. There are only a few places that needed refactoring to
handle unregister_listener returning a future<>, so this is probably
OK.
Fixes#5541.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200120192819.136305-1-espindola@scylladb.com>
The set make dependencies between mm and other services cleaner,
in particular, after the set:
- the query processor no longer needs migration manager
(which doesn't need query processor either)
- the database no longer needs migration manager, thus the mutual
dependency between these two is dropped, only migration manager
-> database is left
- the migration manager -> storage_service dependency is relaxed,
one more patchset will be needed to remove it, thus dropping one
more mutual dependency between them, only the storage_service
-> migration manager will be left
- the migration manager is stopped on drain, but several more
services need it on stop, thus causing use after free problems,
in particular there's a caught bug when view builder crashes
when unregistering from notifier list on stop. Fixed.
Tests: unit(dev)
Fixes: #5404
The migration manager itself is still needed on start to wait
for schema agreement, but there's no longer the need for the
life-time reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous assumption was that there can only be one regular base column
in the view key. The assumption is still correct for tables created
via CQL, but it's internally possible to create a view with multiple
such columns - the new assumption is that if there are multiple columns,
they share their liveness.
This patch is vital for indexing to work properly on alternator,
so it would be best to solve the issue upstream. I strived to leave
the existing semantics intact as long as only up to one regular
column is part of the materialized view primary key, which is the case
for Scylla's materialized views. For alternator it may not be true,
but all regular columns in alternator share liveness info (since
alternator does not support per-column TTL), which is sufficient
to compute view updates in a consistent way.
Fixes#5006
Tests: unit(dev), alternator(test_gsi_update_second_regular_base_column, tic-tac-toe demo)
Message-Id: <c9dec243ce903d3a922ce077dc274f988bcf5d57.1567604945.git.sarna@scylladb.com>
Previous implementation did not take into account that a column
in a partition key might exist in a mutation, but in a DEAD state
- if it's deleted. There are no regressions for CQL, while for
alternator and its capability of having two regular base columns
in a view key, this additional check must be performed.
Hold the _sstable_deletion_sem while moving sstables from the staging directory
so not to move them under the feet of table::snapshot.
Fixes#5340
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Consumer may throw, in this case, break from the loop and retry.
move_sstable_from_staging_in_thread may theoretically throw too,
ignore the error in this case since the sstable was already processed,
individual move failures are already ignored and moving from staging
will be retried upon restart.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:
auto sem = semaphore(0);
will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:
auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});
will present a message with at least some debugging context:
"Semaphore broken: io_concurrency_sem"
It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.
At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.
Refs #4999
Tests: unit(dev), manual: hardcoding a failure in view building code
`collection_type_impl::serialize_mutation_form`
became `collection_mutation(_view)_description::serialize`.
Previously callers had to cast their data_type down to collection_type
to use serialize_mutation_form. Now it's done inside `serialize`.
In the future `serialize` will be generalized to handle UDTs.
`collection_type_impl::deserialize_mutation_form`
became a free standing function `deserialize_collection_mutation`
with similiar benefits. Actually, noone needs to call this function
manually because of the next paragraph.
A common pattern consisting of linearizing data inside a `collection_mutation_view`
followed by calling `deserialize_mutation_form` has been abstracted out
as a `with_deserialized` method inside collection_mutation_view.
serialize_mutation_form_only_live was removed,
because it hadn't been used anywhere.
collection_type_impl::mutation became collection_mutation_description.
collection_type_impl::mutation_view became collection_mutation_view_description.
These classes now reside inside collection_mutation.hh.
Additional documentation has been written for these classes.
Related function implementations were moved to collection_mutation.cc.
This makes it easier to generalize these classes to non-frozen UDTs in future commits.
The new names (together with documentation) better describe their purpose.
Calculating the select statement for given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long term plan for removing select_statement from
view info altogether, but nonetheless the bug needs to be fixed first.
Currently, if updating bookkeeping operations for view building fails,
we log the error message and continue. However, during shutdown,
some errors are more likely to happen due to existing issues
like #4384. To differentiate actual errors from semi-expected
errors during shutdown, the latter are now logged with a warning
level instead of error.
Fixes#4954
Stopping services which occurs in a destructor of deferred_action
should not throw, or it will end the program with
terminate(). View builder breaks a semaphore during its shutdown,
which results in propagating a broken_semaphore exception,
which in turn results in throwing an exception during stop().get().
In order to fix that issue, semaphore exceptions are explicitly
ignored, since they're expected to appear during shutdown.
Fixes#4875
This patches silences the remaining discarded future warnings, those
where it cannot be determined with reasonable confidence that this was
indeed the actual intent of the author, or that the discarding of the
future could lead to problems. For all those places a FIXME is added,
with the intent that these will be soon followed-up with an actual fix.
I deliberately haven't fixed any of these, even if the fix seems
trivial. It is too easy to overlook a bad fix mixed in with so many
mechanical changes.
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
Build progress virtual reader uses Scylla-specific
scylla_views_builds_in_progress table in order to represent
legacy views_builds_in_progress rows. The Scylla-specific table contains
additional cpu_id clustering key part, which is trimmed before returning
it to the user. That may cause duplicated clustering row fragments to be
emitted by the reader, which may cause undefined behaviour in consumers.
The solution is to keep track of previous clustering keys for each
partition and drop fragments that would cause duplication. That way if
any shard is still building a view, its progress will be returned,
and if many shards are still building, the returned value will indicate
the progress of a single arbitrary shard.
Fixes#4524
Tests:
unit(dev) + custom monotonicity checks from <tgrabiec@scylladb.com>
When generating view updates for base mutations when no pre-existing
data exists, we were forgetting to apply the tracked tombstones.
Fixes#4321
Tests: unit(dev)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When looking for optimization paths, columns selected in a view
are checked against multiple conditions - unfortunately virtual
columns were erroneously skipped from that check, which resulted
in ignoring their TTLs. That can lead to overoptimizing
and not including vital liveness info into view rows,
which can then result in row disappearing too early.
In some cases generating view updates for columns that were not
selected in CREATE VIEW statement is redundant - it is the case
when the update will not influence row liveness in anyway.
Currently, these cases are optimized out:
- row marker is live and only unselected columns were updated;
- row marked is not live and only unselected columns were updated,
and in the process nothing was created or deleted and there was
no TTL involved;
It's detrimental to keep querying index manager whether a view
is backing a secondary index every time, so this value is cached
at construct time.
At the same time, this value is not simply passed to view_info
when being created in secondary index manager, in order to
decouple materialized view logic from secondary indexes as much as
possible (the sole existence of is_index() is bad enough).
Give the constant 1024*1024 introduced in an earlier commit a name,
"batch_memory_max", and move it from view.cc to view_builder.hh.
It now resides next to the pre-existing constant that controlled how
many rows were read in each build step, "batch_size".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217100222.15673-1-nyh@scylladb.com>
The bulk materialized-view building processes (when adding a materialized
view to a table with existing data) currently reads the base table in
batches of 128 (view_builder::batch_size) rows. This is clearly better
than reading entire partitions (which may be huge), but still, 128 rows
may grow pretty large when we have rows with large strings or blobs,
and there is no real reason to buffer 128 rows when they are large.
Instead, when the rows we read so far exceed some size threshold (in this
patch, 1MB), we can operate on them immediately instead of waiting for
128.
As a side-effect, this patch also solves another bug: At worst case, all
the base rows of one batch may be written into one output view partition,
in one mutation. But there is a hard limit on the size of one mutation
(commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the
batch size to exceed this limit. By not batching further after 1MB,
we avoid reaching this limit when individual rows do not reach it but
128 of them did.
Fixes#4213.
This patch also includes a unit test reproducing #4213, and demonstrating
that it is now solved.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190214093424.7172-1-nyh@scylladb.com>
"
This series adds generating view updates from sstables added through
/upload directory if their tables have accompanying materialized views.
Said sstables are left in /upload directory until updates are generated
from them and are treated just like staging sstables from /staging dir.
If there are no views for a given tables, sstables are simply moved
from /upload dir to datadir without any changes.
Tests: unit (release)
"
* 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla:
all: rename view_update_from_staging_generator
distributed_loader: fix indentation
service: add generating view updates from uploaded sstables
init: pass view update generator to storage service
sstables: treat sstables in upload dir as needing view build
sstables,table: rename is_staging to requires_view_building
distributed_loader: use proper directory for opening SSTable
db,view: make throttling optional for view_update_generator
Currently registering new view updates is throttled by a semaphore,
which makes sense during stream sessions in order to avoid overloading
the queue. Still, registration also occurs during initialization,
where it makes little sense to wait on a semaphore, since view update
generator might not have started at all yet.
During streaming, there's a race between streamed sstables
and view creation, which might result in some tables not being
used to generate view updates, even though they should.
That happens when the decision about view update path for a table
is done before view creation, but after already receiving some sstables
via streaming. These will not be used in view building even though
they should.
Hence, a phaser is used to make the view builder wait for all ongoing
stream sessions for a table to finish before proceeding with build steps.
Refs #4032