Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.
The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.
Comments from Glauber:
This patch is based on a patch from Avi with the same subject, but the
differences are significant enough that I reset authorship. In
particular:
1) A bug/regression is fixed with the boundary calculations for the
memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
go back to the main group before going to the cache group again
3) As per Tomek's suggestion, the submission of compactions is now
itself run in the compaction scheduling group. Having that
working is what changes this patch the most: we now store the
scheduling group in the compaction manager and let the compaction
manager itself enforce the scheduling group.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.
The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.
Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher range.
Even if a shared_ptr is const, that doesn't mean its internal state is
immutable, and it still cannot be freely shared across shards.
Fixes assertion failure in build/debug/tests/cql_roles_query_test.
Message-Id: <20180201125221.30531-1-pdziepak@scylladb.com>
Shared pointers don't like being shared across shards.
Fixes assertion failure in build/debug/tests/mutation_reader_test.
Message-Id: <20180201125017.30259-1-pdziepak@scylladb.com>
"This series adds the base for the V2 Swagger definition file.
After the series, the definition file will be at:
http://localhost:10000/v2
It can be used with the swagger ui, by replacing the url in the search
path."
* 'amnon/swagger_20' of github.com:scylladb/seastar-dev:
Register the API V2 swagger file
Adding the header part of the swagger2.0 API
"Before this patch set, our Materialized Views implementation could produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows in the view table, instead of just one with the latest
data. In this series we add locking which serializes the two conflicting
updates, and solves this problem.
I explain in more detail why such locking is needed, and what kinds of
locks are needed, in the third patch."
* 'master' of https://github.com/nyh/scylla:
Materialized views: serialize read-modify-update of base table
Materialized views: test row_locker class
Materialized views: implement row and partition locking mechanism
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
problems are fixed, but also in a way that, hopefully, makes it
easier to reason about the error handling and avoid future bugs in that
area.
The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section, that code is run again
with an increased memory reserve. If the retryable code has side effects,
it is very easy to get incorrect behaviour.
In addition to that, entering an allocating section is not exactly cheap,
which encourages doing so rarely and having large sections.
The approach taken by this series is to first make entering allocating
sections cheaper, and then reduce the amount of logic that runs inside
them to a minimum.
This means that instead of entering a section once per call to
flat_mutation_reader::fill_buffer(), the allocating section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.
The optimisations to the allocating sections and managed_bytes
linearised context have successfully eliminated any penalty caused by
the much more fine-grained allocating sections.
Fixes #3123.
Fixes #3133.
Tests: unit-tests (release)
BEFORE
test iterations median mad min max
memtable.one_partition_one_row 1155362 869.139ns 0.282ns 868.465ns 873.253ns
memtable.one_partition_many_rows 127252 7.871us 15.252ns 7.851us 7.886us
memtable.many_partitions_one_row 58715 17.109us 2.765ns 17.013us 17.112us
memtable.many_partitions_many_rows 4839 206.717us 212.385ns 206.505us 207.448us
AFTER
test iterations median mad min max
memtable.one_partition_one_row 1194453 839.223ns 0.503ns 834.952ns 842.841ns
memtable.one_partition_many_rows 133785 7.477us 4.492ns 7.473us 7.507us
memtable.many_partitions_one_row 60267 16.680us 18.027ns 16.592us 16.700us
memtable.many_partitions_many_rows 4975 201.048us 144.929ns 200.822us 201.699us
./before_sq ./after_sq diff
read 337373.86 353694.24 4.8%
write 388759.99 394135.78 1.4%
* https://github.com/pdziepak/scylla.git memtable-exception-safety/v2:
tests/perf: add microbenchmarks for memtable reader
flat_mutation_reader: add allocation point in push_mutation_fragment
linearization_context: remove non-trivial operations from fast path
lsa: split alloc section into reserving and reclamation-disabled parts
lsa: optimise disabling reclamation and invalidation counter
mutation_fragment: allow creating clustering row in place
partition_snapshot_reader: minimise amount of retryable code
memtable: drop memtable_entry::read()
tests/memtable: add test for reader exception safety
Retryable code that has side effects is a recipe for bugs. This patch
reworks the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has very limited side effects.
Moving a clustering_row is expensive due to the amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in place saves some of that moving.
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of an lsa region: region_impl.
However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.
In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
An allocating section reserves a certain amount of memory, then disables
reclamation and attempts to perform the given operation. If that fails due
to std::bad_alloc, the reserve is increased and the operation is retried.
Reserving memory is expensive, while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled to be as small as possible, even if it means entering
and leaving the section multiple times.
In order to reduce the performance penalty of such a solution, the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
Since linearization_context is thread_local, every time it is accessed
the compiler needs to emit code that checks if it was already
constructed, and constructs it if it wasn't. Moreover, upon leaving the
context from the outermost scope the map needs to be cleared.
All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that linearization_context is trivially
constructible and the map is cleared only if it was modified.
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.
push_mutation_fragment() adds a mutation fragment to a circular_buffer.
In theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests, this patch adds
an explicit allocation point in push_mutation_fragment().
"This patchset makes index_reader consume the promoted index incrementally
on demand as the reader advances through the current partition, instead
of storing the entire promoted index, which can be huge.
When the current page is parsed, data for promoted indices is turned
into input streams that are only read and parsed if a particular
position within a partition is sought. This avoids potentially large
allocations for big partitions."
* 'issues/2981/v10' of https://github.com/argenet/scylla:
Use advance_past for single partition upper bound.
Remove obsolete types and methods.
Simplify continuous_data_consumer::consume_input() interface.
Parse promoted index entries lazily upon request rather than immediately.
Add helper input streams: buffer_input_stream and prepended_input_stream.
Support skipping over bytes from input stream in parsers based on continuous_data_consumer
Add performance tests for large partition slicing using clustering keys.
Before this patch, our Materialized Views implementation could produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows added to the view table, instead of just one with the latest
data. In this patch we add locking which serializes the two conflicting
updates, and solves this problem. The locking for a single base-table
column_family is implemented by the row_locker class introduced in a
previous patch.
A long comment in the code of this patch explains in more detail why
this locking is needed, when, and what types of locks are needed: We
sometimes need to lock a single clustering row, sometimes an entire
partition, sometimes an exclusive lock and sometimes a shared lock.
Fixes #3168
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a unit test for the row_locker facility. It tests various
combinations of shared and exclusive locks on rows and on partitions,
some should succeed immediately and some should block.
This tests the row_locker's API only, it does not use or test anything
in Materialized Views.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a "row_locker" class providing locking (shard-locally) of
individual clustering rows or entire partitions, and both exclusive and
shared locks (a.k.a. reader/writer lock).
As we'll see in a following patch, we need this locking capability for
materialized views, to serialize the read-modify-update modifications
which involve the same rows or partitions.
The new row_locker is significantly different from the existing cell_locker.
The two main differences are that 1. row_locker also supports locking the
entire partition, not just individual rows (or cells in them), and that
2. row_locker also supports shared (reader) locks, not just exclusive locks.
For this reason we opted for a new implementation, instead of making large
modifications to the existing cell_locker. And we put the source files
in the view/ directory, because row_locker's requirements are pretty
specific to the needs of materialized views.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.
When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.
Fixes #3068
Tests: unit-tests (release)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
* seastar 770c450...19efbd9 (3):
> configure.py: add --static-yaml-cpp option to link libyaml-cpp statically
> Merge 'Avoid kernel stalls due to fsync' from Avi
> rwlock: add exception-safe lock/unlock alternative
This patch adds a scripts/find-maintainer script, similar to
scripts/get_maintainer.pl in Linux, which looks up maintainers and
reviewers for a specific file from a MAINTAINERS file.
Example usage looks as follows:
$ ./scripts/find-maintainer cql3/statements/create_view_statement.cc
CQL QUERY LANGUAGE
Tomasz Grabiec <tgrabiec@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
MATERIALIZED VIEWS
Duarte Nunes <duarte@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
Nadav Har'El <nyh@scylladb.com> [reviewer]
Duarte Nunes <duarte@scylladb.com> [reviewer]
The main objective of this script is to make it easier for people to
find reviewers and maintainers for their patches.
Message-Id: <20180119075556.31441-1-penberg@scylladb.com>
Instead of advancing to the next partition, first try to find a more
precise position using promoted index blocks.
advance_past() only seeks within the currently available PI blocks (or
reads the first batch, if never read before) and uses the position if
found; otherwise it resorts to advance_to_next_partition().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
These types and methods are no longer in use, since the index_reader
now consumes the promoted index incrementally.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Remove a redundant input parameter, as continuous_data_consumer
derivatives would only use themselves as a context. So take it
internally and make the function regular (non-template) with no
parameters.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Now the promoted index is converted into an input_stream and skipped over
instead of being consumed immediately and stored as a single buffer.
The only part that is read right away is the deletion time, as it is
likely to be in the already-read buffer, and reading it should both
be cheap and prevent reading the whole promoted index if only the
deletion time mark is needed.
When accessed, the promoted index is parsed in chunks, buffer by buffer,
to limit memory consumption.
Fixes #2981
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
buffer_input_stream is a simple input_stream wrapping a single
temporary_buffer.
prepended_input_stream suits the case when some data has been read
into a buffer and the rest is still in a stream. It accepts a buffer and
a data_source and first reads from the buffer and then, when it ends,
proceeds reading from the data_source.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Currently we don't check data_file_directories existence before running
iotune, so it shows an unclear error message.
To make the message better, check the directory existence in
scylla_io_setup.
Fixes #3137
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517200647-6347-1-git-send-email-syuu@scylladb.com>
* seastar d03896d...770c450 (10):
> tls_test: Fix echo test not setting server trust store
> tls: Do not restrict re-handshake to client
> tls: Actually verify client certificate if requested
> rwlock: add method for determining if an rwlock is locked
> metrics: Add missing `break` to metric_value::operator+()
> memory: fix error injector throwing from noexcept memory allocator functions
> systemwide_memory_barrier: don't use mprotect() on ARM
> sharded: Add const version of sharded::local()
> Add const overloads of front() and back() to the circular_buffer.
> Remove unused lambda captures
Fixes #3072
The exponential_backoff_retry instance is captured by move and is then
indirectly moved again as repeat_until_value() moves the lambda it is
passed into its internal state. This caused problems, as internal
lambdas store references to the instance and these references go stale
after the move.
To fix this, keep hold of the exponential_backoff_retry instance in an
enclosing do_with() to make it safe for internal lambdas to reference
it.
Indentation will be fixed by the next patch.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <adc49d25a6176756d60e092f3713c0c897732382.1517222195.git.bdenes@scylladb.com>