Commit Graph

17595 Commits

Author SHA1 Message Date
Avi Kivity
7534412071 table_helper: de-inline insert() and setup_keyspace()
After previous patches de-templated these functions, we can de-inline them.
This helps reduce compile time and prepares to reduce header dependencies.
2019-01-05 16:28:46 +02:00
Avi Kivity
cfedf4ab0f table_helper: de-template setup_keyspace()
This setup function has no reason to be a template and is easily
converted. We can then later de-inline it to reduce dependencies.
2019-01-05 16:23:10 +02:00
Avi Kivity
659147cd79 table_helper: simplify template body of table_helper::insert()
Move most of the body into a non-template overload to reduce dependencies
in the header (and template bloat). The function is not on any fast path,
and noncopyable_function will likely not even allocate anything.
2019-01-05 16:22:08 +02:00
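The pattern described in 659147cd79 (a thin template that forwards to a non-template overload) can be sketched roughly as follows. This is an illustration only: `insert_impl`, the string return type, and the use of `std::function` in place of `noncopyable_function` are assumptions made for the example, not the actual table_helper code.

```cpp
#include <functional>
#include <string>
#include <utility>

// Non-template overload (hypothetical name). In a real project its body could
// live in a source file, so the header would not need its dependencies.
inline std::string insert_impl(const std::function<std::string()>& make_row) {
    return "INSERT " + make_row();
}

// Thin template wrapper: it only type-erases the callable and its arguments,
// then hands the real work to the non-template overload above.
template <typename Func, typename... Args>
std::string insert(Func&& f, Args&&... args) {
    return insert_impl([&] { return f(std::forward<Args>(args)...); });
}
```

Only the small wrapper is instantiated per call site; the bulk of the logic is compiled once.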
Avi Kivity
c3ef99f84f schema_tables: remove #include of database.hh
Distribute in source files (and one header - table_helper.hh) that need it.
2019-01-05 15:43:07 +02:00
Avi Kivity
f43f82d1d2 cql_type_parser: remove dependency on user_types_metadata
A default parameter of type T (or lw_shared_ptr<T>) requires that T be
defined. Remove the dependency by redefining the default parameter
as an overload, for T = user_types_metadata.
2019-01-05 15:40:58 +02:00
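The idea in f43f82d1d2 (replacing a default argument with an overload so that only a forward declaration is needed) might look like the sketch below. The names `user_types` and `parse` are hypothetical stand-ins, and the "header" and "source" parts are shown in one file only to keep the example self-contained.

```cpp
#include <string>

// "Header" part: only a forward declaration is visible here.
struct user_types; // hypothetical stand-in for user_types_metadata

// A declaration with a default argument, such as
//   int parse(const std::string& name, user_types uts = user_types());
// would be rejected at this point because user_types is still incomplete.
int parse(const std::string& name, const user_types& uts);
int parse(const std::string& name); // overload replaces the default argument

// "Source" part: the full definition is needed only from here on.
struct user_types {
    int count = 0;
};

int parse(const std::string& name, const user_types& uts) {
    return static_cast<int>(name.size()) + uts.count;
}

int parse(const std::string& name) {
    // The default value is constructed where the type is complete.
    return parse(name, user_types{});
}
```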
Avi Kivity
4ba1d4d1dc thrift: add missing include of sleep.hh
Currently obtained indirectly through database.hh.
2019-01-05 15:39:30 +02:00
Avi Kivity
d24962e16c cql3: ks_prop_defs: remove #include "database.hh"
Replace with forward declaration to reduce rebuilds.
2019-01-05 14:26:03 +02:00
Jesse Haber-Kucharsky
17a5f7acab build: Link against libatomic
Since Scylla uses functions from the `atomic` header in its own source
code, we need to explicitly link against the stub library that is
provided for hardware architectures that do not have native support for
atomic operations.

Fixes #4053

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <7d62e762130494d73565ce8c031f53aaf866d3aa.1546645041.git.jhaberku@scylladb.com>
2019-01-05 13:38:57 +02:00
Avi Kivity
36e4e9fb54 Update seastar submodule
* seastar 6c8c229...67fd967 (1):
  > perftune.py: tune only active NVMe HW queues on i3 AWS instances
2019-01-04 13:17:29 +02:00
Avi Kivity
b0980ba7c6 compaction_controller: increase minimum shares to 50 (~5%) for small-data workloads
The workload in #3844 has these characteristics:
 - very small data set size (a few gigabytes per shard)
 - large working set size (all the data, enough for high cache miss rate)
 - high overwrite rate (so a compaction results in 12X data reduction)

As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.

While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.

We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctuations will be
limited to 5%.

Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).

Fixes #3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>
2019-01-04 10:58:43 +01:00
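To illustrate what moving the first control point from (0, 0) to (0, 50) means in b0980ba7c6, here is a minimal sketch of a piecewise-linear mapping from backlog to shares. The interpolation scheme, the function name `shares_for`, and every control point other than (0, 50) are assumptions for the example; the real backlog_controller may differ.

```cpp
#include <cstddef>
#include <vector>

struct control_point {
    double input;  // backlog
    double output; // shares
};

// Linearly interpolate between control points sorted by ascending input.
// Inputs outside the covered range are clamped to the nearest end point.
inline double shares_for(const std::vector<control_point>& pts, double backlog) {
    if (pts.empty()) {
        return 0;
    }
    if (backlog <= pts.front().input) {
        return pts.front().output;
    }
    for (std::size_t i = 1; i < pts.size(); ++i) {
        if (backlog <= pts[i].input) {
            const control_point& a = pts[i - 1];
            const control_point& b = pts[i];
            double t = (backlog - a.input) / (b.input - a.input);
            return a.output + t * (b.output - a.output);
        }
    }
    return pts.back().output;
}
```

With a first point of (0, 50), a backlog of zero still yields 50 shares, so compaction keeps making progress even when the backlog is tiny.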
Duarte Nunes
b851cb1a9a distributed_loader: Forbid uploading MV sstables
Instead suggest that the views be re-created.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190103142933.35354-1-duarte@scylladb.com>
2019-01-03 16:31:20 +02:00
Duarte Nunes
3235c13125 utils/fragmented_temporary_buffer: Correctly implement remove_suffix()
The current implementation breaks the invariant that

_size_bytes = reduce(_fragments, &temporary_buffer::size)

In particular, this breaks algorithms that check the individual
segment size.

Correctly implement remove_suffix() by destroying superfluous
temporary_buffer's and by trimming the last one, if needed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190103133523.34937-1-duarte@scylladb.com>
2019-01-03 13:37:01 +00:00
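The invariant restored by 3235c13125 can be shown with a simplified model. In this sketch `std::string` fragments stand in for `temporary_buffer`, and the struct and member names are hypothetical; it only demonstrates dropping whole trailing fragments and trimming the last remaining one so that the cached size always equals the sum of the fragment sizes.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct fragmented_buffer_sketch {
    std::vector<std::string> fragments;
    std::size_t size_bytes = 0;

    void append(std::string frag) {
        size_bytes += frag.size();
        fragments.push_back(std::move(frag));
    }

    // Remove n bytes from the end, keeping size_bytes == sum of fragment sizes.
    void remove_suffix(std::size_t n) {
        if (n > size_bytes) {
            n = size_bytes;
        }
        size_bytes -= n;
        while (n > 0) {
            std::string& last = fragments.back();
            if (last.size() <= n) {
                n -= last.size();
                fragments.pop_back();         // destroy a superfluous fragment
            } else {
                last.resize(last.size() - n); // trim the last fragment
                n = 0;
            }
        }
    }

    std::size_t sum_of_fragments() const {
        std::size_t total = 0;
        for (const std::string& f : fragments) {
            total += f.size();
        }
        return total;
    }
};
```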
Botond Dénes
021feef513 querier_cache: simplify memory eviction use-after-free fix, add tests
Simplify the fix for memory based eviction, introduced by 918d255 so
there is no need to massage the counters.

Also add a check to `test_memory_based_cache_eviction` which checks for
the bug fixed. While at it also add a check to
`test_time_based_cache_eviction` for the fix to time based eviction
(e5a0ea3).

Tests: tests/querier_cache:debug
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c89e2788a88c2a701a2c39f377328e77ac01e3ef.1546515465.git.bdenes@scylladb.com>
2019-01-03 13:44:08 +02:00
Tomasz Grabiec
1613a623e1 Merge "Fix crash on corrupt sstable" from Rafael
* https://github.com/espindola/scylla espindola/invalid_boundary4:
  sstables: Refactor predicates on bound_kind_m
  Fix crash on corrupt sstable
2019-01-03 12:02:09 +01:00
Duarte Nunes
42d9ca8266 Merge 'Add staging SSTables support to row level repair' from Piotr
"
This series adds staging SSTables support to row level repair.
It was introduced for streaming sessions before, but since row level
repair doesn't leverage sessions at all, it's added separately.

Tests:
unit (release)
dtest (repair_additional_test.py:RepairAdditionalTest,
       excluding repair_abort_test, which fails for me locally on master)
"

* 'add_staging_sstables_generation_to_row_level_repair_2' of https://github.com/psarna/scylla:
  repair: add staging sstables support to row level repair
  main,repair: add params to row level repair init
  streaming,view: move view update checks to separate file
2019-01-03 09:40:13 +00:00
Piotr Sarna
a73d9ccf31 service: mark existing views as built before bootstrap
When a node is bootstrapping, it will receive data from other nodes
via streaming, including materialized views. Regardless of whether these
views are built on other nodes or not, building them on newly
bootstrapped nodes has no effect - updates were either already streamed
completely (if view building has finished) or will be propagated
via view building, if the process is still ongoing.
So, marking all views as 'built' for the bootstrapped node prevents it
from spawning superfluous view building processes.

Fixes #4001
Message-Id: <fd53692c38d944122d1b1013fdb0aedf517fa409.1546498861.git.sarna@scylladb.com>
2019-01-03 09:39:33 +00:00
Botond Dénes
e5a0ea390a querier_cache: unregister queriers evicted due to expired TTL
Currently queriers evicted due to their TTL expiring are not
unregistered from the `reader_concurrency_semaphore`. This can cause a
use-after-free when the semaphore tries to evict the same querier at
some later point in time, as the querier entry it has a pointer to is
now invalid.

Fix by unregistering the querier from the semaphore before destroying
the entry.

Refs: #4018
Refs: #4031

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4adfd09f5af8a12d73c29d59407a789324cd3d01.1546504034.git.bdenes@scylladb.com>
2019-01-03 10:29:26 +02:00
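The ordering fix in e5a0ea390a (unregister before destroying the entry) can be modeled with two containers that must stay in sync. Everything here is a hypothetical simplification: `eviction_registry` stands in for `reader_concurrency_semaphore`, integer ids stand in for the pointers it holds, and `cache_sketch` stands in for the querier cache.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

// Remembers which entries it may try to evict later.
struct eviction_registry {
    std::set<int> registered;
};

class cache_sketch {
    eviction_registry& _registry;
    std::map<int, std::string> _entries;
public:
    explicit cache_sketch(eviction_registry& r) : _registry(r) {}

    void insert(int id, std::string value) {
        _entries.emplace(id, std::move(value));
        _registry.registered.insert(id);
    }

    // Called when an entry's TTL expires.
    void expire(int id) {
        _registry.registered.erase(id); // unregister first...
        _entries.erase(id);             // ...then destroy the entry
    }

    std::size_t size() const { return _entries.size(); }
};
```

If `expire()` skipped the first step, the registry would keep referring to an entry that no longer exists, which is the analogue of the dangling pointer described in the commit message.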
Piotr Sarna
bc74ac6f09 repair: add staging sstables support to row level repair
In some cases, sstables created during row level repair
should be enqueued as staging in order to generate
view updates from them.

Fixes #4034
2019-01-03 08:36:45 +01:00
Piotr Sarna
a0003c52cf main,repair: add params to row level repair init
Row level repair needs references to system distributed keyspace
and view update generator in order to enqueue some sstables
as staging.
2019-01-03 08:31:41 +01:00
Piotr Sarna
9d46715613 streaming,view: move view update checks to separate file
Checking if view update path should be used for sstables
is going to be reused in row level repair code,
so relevant functions are moved to a separate header.
2019-01-03 08:31:40 +01:00
Avi Kivity
918d255168 querier_cache: unregister querier from reader_concurrency_semaphore during eviction
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.

Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.

Fixes #4018.

Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
2019-01-03 09:15:07 +02:00
Rafael Ávila de Espíndola
28c014351f Fix crash on corrupt sstable
The check in consume_range_tombstone was too late. Before getting to
it we would fail an assert calling to_bound_kind.

This moves the check earlier and adds a testcase.

Tests: unit (release)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-01-02 17:52:07 -08:00
Rafael Ávila de Espíndola
3c9178d122 sstables: Refactor predicates on bound_kind_m
This moves the predicate functions to the start of the file, renames
is_in_bound_kind to is_bound_kind for consistency with to_bound_kind
and defines all predicates in a similar fashion.

It also uses the predicates to reduce code duplication.

Tests: unit (release)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-01-02 17:50:44 -08:00
Avi Kivity
2717bdd301 tools: toolchain: allow adjusting "docker run" command line
It is useful to adjust the command line when running the docker image,
for example to attach a data volume or a ccache directory. Add a mechanism
to do that.
Message-Id: <20181228163306.19439-1-avi@scylladb.com>
2019-01-01 21:44:50 +00:00
Avi Kivity
d19660ec0a Merge "commitlog: Use fragmented buffers for reading entries" from Duarte
"
Instead of allocating a contiguous temporary_buffer when reading
mutations during commitlog (or hint) replay, use fragmented
buffers.

Refs #4020
"

* 'commitlog/fragmented-read/v1' of https://github.com/duarten/scylla:
  db/commitlog: Use fragmented buffers to read entries
  db/commitlog: Implement skip in terms of input buffer skipping
  tests/fragmented_temporary_buffer_test: Add unit test for remove_suffix()
  utils/fragmented_temporary_buffer: Add remove_suffix
  tests/fragmented_temporary_buffer_test: Add unit test for skip()
  utils/fragmented_temporary_buffer: Allow skipping in the input stream
2019-01-01 19:08:34 +02:00
Avi Kivity
6641353854 tracing: remove static class_registry
Static class_registries hinder librarification by requiring linking with
all object files (instead of a library from which objects are linked on
demand) and reduce readability by hiding dependencies and by their
horrible syntax. Hide them behind a non-static, non-template tracing
backend registry.
Message-Id: <20181229121000.7885-1-avi@scylladb.com>
2018-12-31 13:24:54 +00:00
Duarte Nunes
b7517183fa db/commitlog: Use fragmented buffers to read entries
Leverage fragmented_temporary_buffer when reading commit log
entries, avoiding large allocations.

Refs #4020

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
0e50a9bc6d db/commitlog: Implement skip in terms of input buffer skipping
This simplifies the code and allows us to get rid of the overload of
advance() taking a temporary_buffer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
8379ac6189 tests/fragmented_temporary_buffer_test: Add unit test for remove_suffix()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
1a88cd7992 utils/fragmented_temporary_buffer: Add remove_suffix
Essentially hide some bytes off the end of the buffer. Needed for
subsequent commit log changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
50dd8b67b2 tests/fragmented_temporary_buffer_test: Add unit test for skip()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
8eab0a3e01 utils/fragmented_temporary_buffer: Allow skipping in the input stream
Add fragmented_temporary_buffer::istream::skip(), needed for
subsequent commit log changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Avi Kivity
c180a18dbb Distribute distributed_loader into its own header and source files
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
2018-12-31 14:27:27 +02:00
Avi Kivity
49958d5836 tools: toolchain: update for lz4 1.8.3
lz4 1.8.3 was released with a fix for data corruption during compression. While
the release notes indicate we aren't vulnerable, be cautious and update anyway.
Message-Id: <20181230144716.7238-1-avi@scylladb.com>
2018-12-31 14:27:27 +02:00
Hagit Segev
141fad9c14 Update README.md
fix a typo
2018-12-31 13:33:04 +02:00
Asias He
d90836a2d3 streaming: Make total_incoming_bytes and total_outgoing_bytes metrics monotonic
Currently, they increase and decrease as the stream sessions are
created and destroyed. Make them monotonically increasing Prometheus
counters for easier monitoring.

Message-Id: <7c07cea25a59a09377292dc8f64ed33ff12eda87.1545959905.git.asias@scylladb.com>
2018-12-30 16:52:17 +02:00
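The difference that d90836a2d3 relies on can be sketched as follows: a monotonic counter is never decreased when a session ends, so a monitoring system can derive throughput from the difference between two samples. The struct and function names are hypothetical and do not reflect the actual streaming metrics code.

```cpp
#include <cstdint>

struct stream_counters_sketch {
    std::uint64_t total_incoming_bytes = 0;

    void on_bytes_received(std::uint64_t n) {
        total_incoming_bytes += n;
    }

    // Closing a session deliberately leaves the counter untouched.
    void on_session_closed() {}
};

// Bytes transferred between two samples of a monotonic counter.
inline std::uint64_t bytes_between(std::uint64_t earlier, std::uint64_t later) {
    return later - earlier;
}
```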
Pekka Enberg
96172b7bca Merge 'Fixes for the view_update_from_staging_generator' from Duarte
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.

Fixes #4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
  db/view/view_update_from_staging_generator: Break semaphore on stop()
  db/view/view_update_from_staging_generator: Restore formatting
  db/view/view_update_from_staging_generator: Avoid creating more than one fiber
2018-12-29 18:31:40 +02:00
Duarte Nunes
f41d13f38c db/view/view_update_from_staging_generator: Break semaphore on stop()
This avoids having fibers wait on _registration_sem without ever being
notified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-29 12:55:04 +00:00
Duarte Nunes
4974addc5c db/view/view_update_from_staging_generator: Restore formatting
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-29 12:55:02 +00:00
Duarte Nunes
201196130d db/view/view_update_from_staging_generator: Avoid creating more than one fiber
If view_update_from_staging_generator::maybe_generate_view_updates()
is called before view_update_from_staging_generator::start(), as can
happen in main.cc, then we can potentially create more than one fiber,
which leads to corrupted state and conflicting operations.

To avoid this, use just one fiber and be explicit about notifying it
that more work is needed, by leveraging a condition-variable.

Fixes #4021

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-29 12:52:51 +00:00
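The approach in 201196130d (a single worker that is explicitly notified when more work arrives) can be approximated with standard threads. Note the substitution: Seastar fibers and its condition variable are replaced here by `std::thread` and `std::condition_variable`, and all names are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

class single_worker_sketch {
    std::mutex _mutex;
    std::condition_variable _cv;
    int _pending = 0;
    bool _stopping = false;
    int _processed = 0;
    std::thread _thread; // declared last so the other members exist first
public:
    single_worker_sketch() : _thread([this] { run(); }) {}
    ~single_worker_sketch() { stop(); }

    // Callers never start a second worker; they only signal the existing one.
    void notify_more_work() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            ++_pending;
        }
        _cv.notify_one();
    }

    // Drains pending work, joins the worker, and returns the processed count.
    int stop() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopping = true;
        }
        _cv.notify_one();
        if (_thread.joinable()) {
            _thread.join();
        }
        return _processed;
    }
private:
    void run() {
        std::unique_lock<std::mutex> lock(_mutex);
        for (;;) {
            _cv.wait(lock, [this] { return _pending > 0 || _stopping; });
            if (_pending > 0) {
                _processed += _pending; // "process" everything queued so far
                _pending = 0;
            } else {
                return; // stopping and nothing left to do
            }
        }
    }
};
```

Because there is exactly one worker, notifications that arrive before or during processing cannot create conflicting concurrent operations.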
Duarte Nunes
66113a2d39 Merge 'Replace query_processor's sharded<database> with plain database' from Avi
"
A sharded<database> is not very useful for accessing data since data is
usually distributed across many nodes, while a sharded<database>
contains only a single node's view. So it is really only used for
accessing replicated metadata, not data. As such only the local shard
is accessed.

Use that to simplify query_processor a little by replacing sharded<database>
with a plain database.

We can probably be more ambitious and make all accesses, data and metadata,
go through storage_proxy, but this is a start.
"

* tag 'qp-unshard-database/v1' of https://github.com/avikivity/scylla:
  query_processor: replace sharded<database> with the local shard
  commitlog_replayer: don't use query_processor
  client_state: change set_keyspace() to accept a single database shard
  legacy_schema_migrator: initialize with database reference
2018-12-29 12:14:19 +00:00
Avi Kivity
0c0cc66ee7 system_keyspace, view: reduce interdependencies
system_keyspace is an implementation detail for most of its users, not
part of the interface, as it's only used to store internal data. Therefore,
including it in a header file causes unneeded dependencies.

This patch removes a dependency between views and system_keyspace.hh
by moving view_name and view_build_progress into a separate header file,
and using forward declarations where possible. This allows us to
remove an inclusion of system_keyspace.hh from a header file (the last
one), so that further changes to system_keyspace.hh will cause fewer
recompilations.
Message-Id: <20181228215736.11493-1-avi@scylladb.com>
2018-12-29 12:12:15 +00:00
Avi Kivity
30745eeb72 query_processor: replace sharded<database> with the local shard
query_processor uses storage_proxy to access data, and the local
database object to access replicated metadata. While it seems strange
that the database object is not used to access data, it is logical
when you consider that a sharded<database> only contains this node's
data, not the cluster data.

Take advantage of this to replace sharded<database> with a single database
shard.
2018-12-29 11:02:15 +02:00
Avi Kivity
f0a709cfc8 commitlog_replayer: don't use query_processor
During normal writes, query processing happens before commitlog, so
logically, replaying the commitlog shouldn't need it. And in
fact the dependency on query_processor can be eliminated; all it needs
is the local node's database.
2018-12-29 11:00:29 +02:00
Avi Kivity
7830086317 client_state: change set_keyspace() to accept a single database shard
set_keyspace() only needs one shard (it is checking replicated state,
not sharded data) so arrange for it to receive only that one shard.
2018-12-29 10:58:39 +02:00
Avi Kivity
e4233262cf legacy_schema_migrator: initialize with database reference
Provide legacy_schema_migrator with a sharded<database> so it doesn't need
to use the one from query_processor. We want to replace query_processor's
sharded<database> with just a local database reference in order to simplify
it, and this is standing in the way.
2018-12-29 10:58:22 +02:00
Duarte Nunes
bab7e6877b streaming/stream_session: Only stage sstables for tables with views
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.

Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>
2018-12-28 18:32:24 +02:00
Avi Kivity
feddf0b021 tools: toolchain: patch boost for use-after-free in Boost.Test XML output
The version of boost in Fedora 29 has a use-after-free bug that is only
exposed when ./test.py is run with the --jenkins flag.  To patch it,
use a fixed version from the copr repository scylladb/toolchain.
Message-Id: <20181228150419.29623-1-avi@scylladb.com>
2018-12-28 16:35:28 +01:00
Tomasz Grabiec
7747f2dde3 Merge "nodetool toppartitions" from Rafi & Avi
Implementation of the nodetool toppartitions query, which samples the most frequent PKs in read/write
operations over a period of time.

Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
  interfaces with data_listeners and the REST api),
- REST api for toppartitions query.

Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).

What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).

Fixes #2811

* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
  top_k: whitespace and minor fixes
  top_k: map template arguments
  top_k: std::list -> chunked_vector
  top_k: support for appending top_k results
  nodetool toppartitions: refactor table::config constructor
  nodetool toppartitions: data listeners
  nodetool toppartitions: add data_listeners to database/table
  nodetool toppartitions: fully_qualified_cf_name
  nodetool toppartitions: Toppartitions query implementation
  nodetool toppartitions: Toppartitions query REST API
  nodetool toppartitions: nodetool-toppartitions script
2018-12-28 16:31:24 +01:00
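The merge 7747f2dde3 mentions a Top-k structure for stream summary statistics. A bounded-memory counter of that kind could be sketched as below, assuming a Space-Saving style eviction rule (the new key replaces the smallest counter and inherits its count plus one). This assumption, the class name, and the `std::map` storage are illustrative and are not taken from the actual top_k implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class top_k_sketch {
    std::size_t _capacity;
    std::map<std::string, unsigned long> _counts;
public:
    explicit top_k_sketch(std::size_t capacity) : _capacity(capacity) {}

    void add(const std::string& key) {
        if (_capacity == 0) {
            return;
        }
        auto it = _counts.find(key);
        if (it != _counts.end()) {
            ++it->second;
            return;
        }
        if (_counts.size() < _capacity) {
            _counts.emplace(key, 1);
            return;
        }
        // Full: evict the smallest counter; the new key inherits its count + 1.
        auto min_it = std::min_element(_counts.begin(), _counts.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        unsigned long inherited = min_it->second + 1;
        _counts.erase(min_it);
        _counts.emplace(key, inherited);
    }

    // Return up to k keys with the highest (possibly overestimated) counts.
    std::vector<std::pair<std::string, unsigned long>> top(std::size_t k) const {
        std::vector<std::pair<std::string, unsigned long>> v(_counts.begin(), _counts.end());
        std::stable_sort(v.begin(), v.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
        if (v.size() > k) {
            v.resize(k);
        }
        return v;
    }
};
```

The capacity plays the role of the `-s CAPACITY` option and `k` the role of `-k LIST_SIZE` in the script's help text; counts for keys that replaced an evicted key are upper bounds rather than exact values.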
Rafi Einstein
7677d2ba2c nodetool toppartitions: nodetool-toppartitions script
A Python script mimicking the nodetool toppartitions utility, utilizing the Scylla REST API.

Examples:
$ ./nodetool-toppartitions --help
usage: nodetool-toppartitions [-h] [-k LIST_SIZE] [-s CAPACITY]
                              keyspace table duration

Samples database reads and writes and reports the most active partitions in a
specified table

positional arguments:
  keyspace      Name of keyspace
  table         Name of column family
  duration      Query duration in milliseconds

optional arguments:
  -h, --help    show this help message and exit
  -k LIST_SIZE  The number of the top partitions to list (default: 10)
  -s CAPACITY   The capacity of stream summary (default: 256)

$ ./nodetool-toppartitions ks test1 10000
READ
  Partition   Count
  30          2
  20          2
  10          2

WRITE
  Partition   Count
  30          1
  20          1
  10          1

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:48:03 +02:00