Commit Graph

17574 Commits

Author SHA1 Message Date
Avi Kivity
527e3a58ff install-dependencies.sh: add maven and ant
Add tools needed to build scylla-jmx and scylla-tools-java. While
not requirements of this repository, it's nicer if a single setup
can be used to build and run everything.

We also install pystache as it's used by packaging scripts.
2019-01-03 16:16:45 +02:00
Avi Kivity
918d255168 querier_cache: unregister querier from reader_concurrency_semaphore during eviction
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.

Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.

Fixes #4018.

Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
2019-01-03 09:15:07 +02:00
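
The fix described above can be pictured with a toy model, assuming two registries that must stay in sync. This is a hedged Python sketch, not Scylla's C++ code: the class names, `register()`, `evict()` and `insert()` are hypothetical stand-ins for reader_concurrency_semaphore and querier_cache. The point it shows is that eviction goes through the handle, so both sides forget the entry in a single step.

```python
class Semaphore:
    """Toy registry of inactive reads, keyed by handle (hypothetical)."""

    def __init__(self):
        self._next = 0
        self.inactive = {}            # handle -> on_evict callback

    def register(self, on_evict):
        handle = self._next
        self._next += 1
        self.inactive[handle] = on_evict
        return handle

    def evict(self, handle):
        # Single eviction path: unregister here, then let the owner clean up.
        on_evict = self.inactive.pop(handle)
        on_evict()


class QuerierCache:
    """Toy cache whose entries are tracked by the semaphore as well."""

    def __init__(self, semaphore, max_entries):
        self.semaphore = semaphore
        self.max_entries = max_entries
        self.entries = {}             # key -> handle (insertion-ordered)

    def insert(self, key):
        while len(self.entries) >= self.max_entries:
            oldest_key = next(iter(self.entries))
            # Evict through the handle so both registries drop the entry.
            self.semaphore.evict(self.entries[oldest_key])
        handle = self.semaphore.register(lambda: self.entries.pop(key))
        self.entries[key] = handle


sem = Semaphore()
cache = QuerierCache(sem, max_entries=2)
for k in ("a", "b", "c"):
    cache.insert(k)
print(list(cache.entries))    # ['b', 'c']
print(sorted(sem.inactive))   # [1, 2]
```

Removing the entry from `entries` directly, without calling `evict()`, would leave a stale handle in `inactive`, which is the kind of dangling registration the commit describes.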
Avi Kivity
2717bdd301 tools: toolchain: allow adjusting "docker run" command line
It is useful to adjust the command line when running the docker image,
for example to attach a data volume or a ccache directory. Add a mechanism
to do that.
Message-Id: <20181228163306.19439-1-avi@scylladb.com>
2019-01-01 21:44:50 +00:00
Avi Kivity
d19660ec0a Merge "commitlog: Use fragmented buffers for reading entries" from Duarte
"
Instead of allocating a contiguous temporary_buffer when reading
mutations from the commitlog - or hint - replaying, use fragmented
buffers instead.

Refs #4020
"

* 'commitlog/fragmented-read/v1' of https://github.com/duarten/scylla:
  db/commitlog: Use fragmented buffers to read entries
  db/commitlog: Implement skip in terms of input buffer skipping
  tests/fragmented_temporary_buffer_test: Add unit test for remove_suffix()
  utils/fragmented_temporary_buffer: Add remove_suffix
  tests/fragmented_temporary_buffer_test: Add unit test for skip()
  utils/fragmented_temporary_buffer: Allow skipping in the input stream
2019-01-01 19:08:34 +02:00
Avi Kivity
6641353854 tracing: remove static class_registry
Static class_registries hinder librarification by requiring linking with
all object files (instead of a library from which objects are linked on
demand) and reduce readability by hiding dependencies and by their
horrible syntax. Hide them behind a non-static, non-template tracing
backend registry.
Message-Id: <20181229121000.7885-1-avi@scylladb.com>
2018-12-31 13:24:54 +00:00
Duarte Nunes
b7517183fa db/commitlog: Use fragmented buffers to read entries
Leverage fragmented_temporary_buffer when reading commit log
entries, avoiding large allocations.

Refs #4020

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
0e50a9bc6d db/commitlog: Implement skip in terms of input buffer skipping
This simplifies the code and allows us to get rid of the overload of
advance() taking a temporary_buffer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
8379ac6189 tests/fragmented_temporary_buffer_test: Add unit test for remove_suffix()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
1a88cd7992 utils/fragmented_temporary_buffer: Add remove_suffix
Essentially hide some bytes off the end of the buffer. Needed for
subsequent commit log changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
50dd8b67b2 tests/fragmented_temporary_buffer_test: Add unit test for skip()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
8eab0a3e01 utils/fragmented_temporary_buffer: Allow skipping in the input stream
Add fragmented_temporary_buffer::istream::skip(), needed for
subsequent commit log changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
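
The skip() and remove_suffix() operations from this series can be sketched on a toy fragmented buffer. This is an illustrative Python model under stated assumptions, not the C++ fragmented_temporary_buffer: the class names and method bodies below are hypothetical, and only the idea (a list of small chunks instead of one contiguous allocation) comes from the commit messages.

```python
class FragmentedBuffer:
    """Toy fragmented buffer: a list of bytes chunks, not one allocation."""

    def __init__(self, fragments):
        self.fragments = [bytes(f) for f in fragments if f]

    def __len__(self):
        return sum(len(f) for f in self.fragments)

    def remove_suffix(self, n):
        # Hide n bytes off the end by dropping or trimming trailing fragments.
        if n > len(self):
            raise ValueError("suffix longer than buffer")
        while n:
            last = self.fragments[-1]
            if len(last) <= n:
                n -= len(last)
                self.fragments.pop()
            else:
                self.fragments[-1] = last[:-n]
                n = 0

    def istream(self):
        return FragmentedIStream(self.fragments)


class FragmentedIStream:
    def __init__(self, fragments):
        self._fragments = list(fragments)
        self._index = 0    # current fragment
        self._offset = 0   # offset inside the current fragment

    def _advance(self, n):
        self._offset += n
        if self._offset == len(self._fragments[self._index]):
            self._index += 1
            self._offset = 0

    def skip(self, n):
        # Move across fragment boundaries without copying any bytes.
        while n:
            if self._index >= len(self._fragments):
                raise EOFError("skip past end of stream")
            avail = len(self._fragments[self._index]) - self._offset
            step = min(n, avail)
            n -= step
            self._advance(step)

    def read(self, n):
        out = bytearray()
        while n:
            if self._index >= len(self._fragments):
                raise EOFError("read past end of stream")
            frag = self._fragments[self._index]
            piece = frag[self._offset:self._offset + n]
            out += piece
            n -= len(piece)
            self._advance(len(piece))
        return bytes(out)


buf = FragmentedBuffer([b"abc", b"defg", b"hi"])
buf.remove_suffix(3)          # hides b"ghi"
stream = buf.istream()
stream.skip(4)                # crosses the first fragment boundary
data = stream.read(2)
print(len(buf), data)         # 6 b'ef'
```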
Avi Kivity
c180a18dbb Distribute distributed_loader into its own header and source files
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
2018-12-31 14:27:27 +02:00
Avi Kivity
49958d5836 tools: toolchain: update for lz4 1.8.3
lz4 1.8.3 was released with a fix for data corruption during compression. While
the release notes indicate we aren't vulnerable, be cautious and update anyway.
Message-Id: <20181230144716.7238-1-avi@scylladb.com>
2018-12-31 14:27:27 +02:00
Hagit Segev
141fad9c14 Update README.md
fix a typo
2018-12-31 13:33:04 +02:00
Asias He
d90836a2d3 streaming: Make total_incoming_bytes and total_outgoing_bytes metrics monotonic
Currently, they increase and decrease as the stream sessions are
created and destroyed. Make them Prometheus monotonically increasing
counters for easier monitoring.

Message-Id: <7c07cea25a59a09377292dc8f64ed33ff12eda87.1545959905.git.asias@scylladb.com>
2018-12-30 16:52:17 +02:00
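
The difference between the two metric styles can be shown with a small model. This is a hedged Python sketch, not the streaming code: `StreamByteMetrics` and its methods are hypothetical names, and the "old" total is modelled as a sum over live sessions, which is an assumption based on the description above.

```python
class StreamByteMetrics:
    """Toy model contrasting a gauge-style total with a monotonic counter."""

    def __init__(self):
        self.active_bytes = {}      # session id -> bytes seen so far
        self.total_incoming = 0     # monotonic: only ever incremented

    def on_bytes(self, session_id, n):
        self.active_bytes[session_id] = self.active_bytes.get(session_id, 0) + n
        self.total_incoming += n

    def on_session_closed(self, session_id):
        # Closing a session does not lower the monotonic total.
        self.active_bytes.pop(session_id, None)

    def gauge_style_total(self):
        # Assumed old behaviour: the sum over live sessions only.
        return sum(self.active_bytes.values())


m = StreamByteMetrics()
m.on_bytes("s1", 100)
m.on_bytes("s2", 50)
m.on_session_closed("s1")
print(m.gauge_style_total())  # 50
print(m.total_incoming)       # 150
```

A monitoring system can take a rate over the monotonic value, while the gauge-style value drops whenever a session ends.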
Pekka Enberg
96172b7bca Merge 'Fixes for the view_update_from_staging_generator' from Duarte
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.

Fixes #4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
  db/view/view_update_from_staging_generator: Break semaphore on stop()
  db/view/view_update_from_staging_generator: Restore formatting
  db/view/view_update_from_staging_generator: Avoid creating more than one fiber
2018-12-29 18:31:40 +02:00
Duarte Nunes
f41d13f38c db/view/view_update_from_staging_generator: Break semaphore on stop()
This avoids having fibers waiting on _registration_sem without ever being
notified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-29 12:55:04 +00:00
Duarte Nunes
4974addc5c db/view/view_update_from_staging_generator: Restore formatting
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-29 12:55:02 +00:00
Duarte Nunes
201196130d db/view/view_update_from_staging_generator: Avoid creating more than one fiber
If view_update_from_staging_generator::maybe_generate_view_updates()
is called before view_update_from_staging_generator::start(), as can
happen in main.cc, then we can potentially create more than one fiber,
which leads to corrupted state and conflicting operations.

To avoid this, use just one fiber and be explicit about notifying it
that more work is needed, by leveraging a condition-variable.

Fixes #4021

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-29 12:52:51 +00:00
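
The "one fiber plus explicit notification" pattern described above can be sketched with asyncio. This is an illustration only, assuming Python tasks as stand-ins for Seastar fibers: `Generator`, `notify_work()` and the `processed` list are hypothetical names, not the view_update_from_staging_generator API.

```python
import asyncio


class Generator:
    """Hypothetical stand-in: a single worker task, woken explicitly."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._pending = []
        self._stopped = False
        self._task = None
        self.processed = []

    def start(self):
        # The only place a worker is created, so there is at most one.
        if self._task is None:
            self._task = asyncio.create_task(self._run())

    async def notify_work(self, item):
        # Safe before start(): queues work and never spawns a worker.
        async with self._cond:
            self._pending.append(item)
            self._cond.notify()

    async def stop(self):
        async with self._cond:
            self._stopped = True
            self._cond.notify()
        if self._task is not None:
            await self._task

    async def _run(self):
        while True:
            async with self._cond:
                await self._cond.wait_for(
                    lambda: self._pending or self._stopped)
                batch, self._pending = self._pending, []
                stopped = self._stopped
            self.processed.extend(batch)
            if stopped:
                return


async def demo():
    g = Generator()
    await g.notify_work("sstable-1")   # arrives before start()
    g.start()
    g.start()                          # second call is a no-op
    await g.notify_work("sstable-2")
    await g.stop()
    return g.processed


result = asyncio.run(demo())
print(result)   # ['sstable-1', 'sstable-2']
```

Because the notifier never starts work itself, calling it before start() cannot create a second worker.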
Duarte Nunes
66113a2d39 Merge 'Replace query_processor's sharded<database> with plain database' from Avi
"
A sharded<database> is not very useful for accessing data since data is
usually distributed across many nodes, while a sharded<database>
contains only a single node's view. So it is really only used for
accessing replicated metadata, not data. As such only the local shard
is accessed.

Use that to simplify query_processor a little by replacing sharded<database>
with a plain database.

We can probably be more ambitious and make all accesses, data and metadata,
go through storage_proxy, but this is a start.
"

* tag 'qp-unshard-database/v1' of https://github.com/avikivity/scylla:
  query_processor: replace sharded<database> with the local shard
  commitlog_replayer: don't use query_processor
  client_state: change set_keyspace() to accept a single database shard
  legacy_schema_migrator: initialize with database reference
2018-12-29 12:14:19 +00:00
Avi Kivity
0c0cc66ee7 system_keyspace, view: reduce interdependencies
system_keyspace is an implementation detail for most of its users, not
part of the interface, as it's only used to store internal data. Therefore,
including it in a header file causes unneeded dependencies.

This patch removes a dependency between views and system_keyspace.hh
by moving view_name and view_build_progress into a separate header file,
and using forward declarations where possible. This allows us to
remove an inclusion of system_keyspace.hh from a header file (the last
one), so that further changes to system_keyspace.hh will cause fewer
recompilations.
Message-Id: <20181228215736.11493-1-avi@scylladb.com>
2018-12-29 12:12:15 +00:00
Avi Kivity
30745eeb72 query_processor: replace sharded<database> with the local shard
query_processor uses storage_proxy to access data, and the local
database object to access replicated metadata. While it seems strange
that the database object is not used to access data, it is logical
when you consider that a sharded<database> only contains this node's
data, not the cluster data.

Take advantage of this to replace sharded<database> with a single database
shard.
2018-12-29 11:02:15 +02:00
Avi Kivity
f0a709cfc8 commitlog_replayer: don't use query_processor
During normal writes, query processing happens before commitlog, so
logically replaying the commitlog shouldn't need it. And in
fact the dependency on query_processor can be eliminated, all it needs
is the local node's database.
2018-12-29 11:00:29 +02:00
Avi Kivity
7830086317 client_state: change set_keyspace() to accept a single database shard
set_keyspace() only needs one shard (it is checking replicated state,
not sharded data) so arrange for it to receive only that one shard.
2018-12-29 10:58:39 +02:00
Avi Kivity
e4233262cf legacy_schema_migrator: initialize with database reference
Provide legacy_schema_migrator with a sharded<database> so it doesn't need
to use the one from query_processor. We want to replace query_processor's
sharded<database> with just a local database reference in order to simplify
it, and this is standing in the way.
2018-12-29 10:58:22 +02:00
Duarte Nunes
bab7e6877b streaming/stream_session: Only stage sstables for tables with views
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.

Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>
2018-12-28 18:32:24 +02:00
Avi Kivity
feddf0b021 tools: toolchain: patch boost for use-after-free in Boost.Test XML output
The version of boost in Fedora 29 has a use-after-free bug that is only
exposed when ./test.py is run with the --jenkins flag.  To patch it,
use a fixed version from the copr repository scylladb/toolchain.
Message-Id: <20181228150419.29623-1-avi@scylladb.com>
2018-12-28 16:35:28 +01:00
Tomasz Grabiec
7747f2dde3 Merge "nodetool toppartitions" from Rafi & Avi
Implementation of the nodetool toppartitions query, which samples the most frequent PKs in read/write
operations over a period of time.

Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
  interfaces with data_listeners and the REST api),
- REST api for toppartitions query.

Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).

What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).

Fixes #2811

* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
  top_k: whitespace and minor fixes
  top_k: map template arguments
  top_k: std::list -> chunked_vector
  top_k: support for appending top_k results
  nodetool toppartitions: refactor table::config constructor
  nodetool toppartitions: data listeners
  nodetool toppartitions: add data_listeners to database/table
  nodetool toppartitions: fully_qualified_cf_name
  nodetool toppartitions: Toppartitions query implementation
  nodetool toppartitions: Toppartitions query REST API
  nodetool toppartitions: nodetool-toppartitions script
2018-12-28 16:31:24 +01:00
Rafi Einstein
7677d2ba2c nodetool toppartitions: nodetool-toppartitions script
A Python script mimicking the nodetool toppartitions utility, utilizing Scylla REST API.

Examples:
$ ./nodetool-toppartitions --help
usage: nodetool-toppartitions [-h] [-k LIST_SIZE] [-s CAPACITY]
                              keyspace table duration

Samples database reads and writes and reports the most active partitions in a
specified table

positional arguments:
  keyspace      Name of keyspace
  table         Name of column family
  duration      Query duration in milliseconds

optional arguments:
  -h, --help    show this help message and exit
  -k LIST_SIZE  The number of the top partitions to list (default: 10)
  -s CAPACITY   The capacity of stream summary (default: 256)

$ ./nodetool-toppartitions ks test1 10000
READ
  Partition   Count
  30          2
  20          2
  10          2

WRITE
  Partition   Count
  30          1
  20          1
  10          1

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:48:03 +02:00
Rafi Einstein
197f38d4ee nodetool toppartitions: Toppartitions query REST API
An HTTP GET operation starts the query (with args: ks/cf name and duration in ms).
It executes synchronously, results are returned as JSON:
$ curl -s -X GET http://localhost:10000/column_family/toppartitions/ks:cf1?duration=10000 | jq
{
  "read": [
    {
      "count": "15",
      "error": "0",
      "partition": "4b504d39354f37353131"
    },
    {
      "count": "15",
      "error": "0",
      "partition": "3738313134394d353530"
    }
  ],
  "write": [
    {
      "count": "15",
      "error": "0",
      "partition": "4b504d39354f37353131"
    },
    {
      "count": "15",
      "error": "0",
      "partition": "3738313134394d353530"
    }
  ]
}

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
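
A client of this endpoint might post-process the JSON as follows. This is a minimal Python sketch, assuming the "partition" field is a hex encoding of the partition key bytes (which matches the sample values above but is not stated in the commit) and that "count" and "error" are decimal strings. The `summarize()` helper is hypothetical, and the response text is an abbreviated copy of the sample rather than live output.

```python
import json

# Abbreviated copy of the sample response from the commit message.
response_text = """
{
  "read": [
    {"count": "15", "error": "0", "partition": "4b504d39354f37353131"},
    {"count": "15", "error": "0", "partition": "3738313134394d353530"}
  ],
  "write": [
    {"count": "15", "error": "0", "partition": "4b504d39354f37353131"}
  ]
}
"""


def summarize(text):
    """Return {operation: [(key_bytes, count, error), ...]}, largest first."""
    doc = json.loads(text)
    summary = {}
    for op, entries in doc.items():
        rows = []
        for e in entries:
            # Assumption: "partition" is the hex encoding of the key bytes.
            key = bytes.fromhex(e["partition"])
            rows.append((key, int(e["count"]), int(e["error"])))
        rows.sort(key=lambda r: r[1], reverse=True)
        summary[op] = rows
    return summary


summary = summarize(response_text)
for op, rows in summary.items():
    print(op.upper())
    for key, count, error in rows:
        print(f"  {key!r:<16} {count:>5} (error {error})")
```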
Rafi Einstein
6b2c21f69b nodetool toppartitions: Toppartitions query implementation
toppartitions_query installs toppartitions_data_listener-s on all database shards, waits for
the designated period, uninstalls the listeners and collects top-k read/write partition keys.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Rafi Einstein
404f75def5 nodetool toppartitions: fully_qualified_cf_name
Encapsulate keyspace:column_family REST API argument parsing into fully_qualified_cf_name class.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Rafi Einstein
0bffe5f83e nodetool toppartitions: add data_listeners to database/table
Add data_listeners member to database.
Add data_listeners* to table::config, to be used by table methods to invoke listeners.
Install on_read() listener in table::make_reader().
Install on_write() listener in database::apply_in_memory().

Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Rafi Einstein
08ba115c16 nodetool toppartitions: data listeners
Mechanism that interfaces with mutation readers in database and table classes, to
allow tracking the most frequent partition keys in read and write operations.
Basic design is specified in #2811.

Tracking top #rows and #bytes will be supported in the future.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Rafi Einstein
038f8c7988 nodetool toppartitions: refactor table::config constructor
Eliminate extra parameters to ctor and deduce them instead from db param.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:57 +02:00
Rafi Einstein
eda43b93c9 top_k: support for appending top_k results
Allow appending results of one top_k into another.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:56 +02:00
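
For readers unfamiliar with stream summaries, the following is a generic Space-Saving sketch with an append step. It is written in Python as an illustration and is not Scylla's top_k class: the `SpaceSaving` name, its methods and the merge rule (re-adding each row with its count and error) are assumptions made for the example.

```python
class SpaceSaving:
    """Generic Space-Saving stream summary (illustrative, not top_k)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}   # item -> [count, error]

    def add(self, item, inc=1, err=0):
        if item in self.counters:
            self.counters[item][0] += inc
            self.counters[item][1] += err
        elif len(self.counters) < self.capacity:
            self.counters[item] = [inc, err]
        else:
            # Evict the smallest counter; the newcomer inherits its count
            # as a bound on how much it may be over-counted.
            victim = min(self.counters, key=lambda k: self.counters[k][0])
            floor = self.counters.pop(victim)[0]
            self.counters[item] = [floor + inc, floor + err]

    def top(self, k):
        ranked = sorted(self.counters.items(), key=lambda kv: -kv[1][0])
        return [(item, c, e) for item, (c, e) in ranked[:k]]

    def append(self, results):
        # Fold another summary's (item, count, error) rows into this one.
        for item, count, error in results:
            self.add(item, count, error)


# Two per-shard summaries merged into one.
shard0, shard1 = SpaceSaving(4), SpaceSaving(4)
for key in ("a", "a", "b"):
    shard0.add(key)
for key in ("a", "c", "c", "c", "c"):
    shard1.add(key)
merged = SpaceSaving(4)
merged.append(shard0.top(10))
merged.append(shard1.top(10))
print(merged.top(2))     # [('c', 4, 0), ('a', 3, 0)]

# Eviction when the summary is full: the error field becomes non-zero.
small = SpaceSaving(2)
for key in ("x", "x", "y", "z"):
    small.add(key)
print(small.top(2))      # [('x', 2, 0), ('z', 2, 1)]
```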
Rafi Einstein
aeebe8e86b top_k: std::list -> chunked_vector
Replaced std::list with chunked_vector. Because chunked_vector requires
a noexcept move constructor from its value type, change the bad_boy type
in the unit test not to throw in the move constructor.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:07 +02:00
Avi Kivity
8e2f6d0513 Merge "Fix use-after-free when destroying partition_snapshots in the background" from Tomasz
"
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive, the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.

The fix is to update the snapshot's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.

When memtable is destroyed without being moved to cache there is no
problem because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.

Introduced in f3da043 (in >= 3.0-rc1)

Fixes #4030.

Tests:

  - mvcc_test (debug)

"

* tag 'fix-snapshot-merging-use-after-free-v1.1' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
  tests: mvcc: Introduce mvcc_container::migrate()
  tests: mvcc: Make mvcc_partition move-constructible
  tests: mvcc: Introduce mvcc_container::make_not_evictable()
  tests: mvcc: Allow constructing mvcc_container without a cache_tracker
  mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
  mvcc: partition_snapshot: Introduce migrate()
  mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner
2018-12-28 12:45:10 +02:00
Tomasz Grabiec
bb1c9cb6f3 tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
2018-12-28 10:32:39 +01:00
Tomasz Grabiec
4d13dea39a tests: mvcc: Introduce mvcc_container::migrate()
2018-12-28 10:32:39 +01:00
Tomasz Grabiec
676868ed31 tests: mvcc: Make mvcc_partition move-constructible
2018-12-28 10:32:39 +01:00
Tomasz Grabiec
c6798f7872 tests: mvcc: Introduce mvcc_container::make_not_evictable()
2018-12-28 10:32:39 +01:00
Tomasz Grabiec
1fa00656ea tests: mvcc: Allow constructing mvcc_container without a cache_tracker
Some test cases will need many containers to simulate memtable ->
cache transitions, but there can be only one cache_tracker per shard
due to metrics. Allow constructing a container without a cache_tracker
(and thus non-evictable).
2018-12-28 10:32:39 +01:00
Tomasz Grabiec
ac49b1def0 mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that,
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.

The fix is to update the snapshot's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.

When memtable is destroyed without being moved to cache there is no
problem, because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.

Introduced in f3da043.

Fixes #4030.
2018-12-27 18:08:50 +01:00
Tomasz Grabiec
20f5d5d1a1 mvcc: partition_snapshot: Introduce migrate()
Snapshots which outlive the memtable will need to have their
_region and _cleaner references updated.

The snapshot can be destroyed after the memtable when it is queued in
the mutation_cleaner.
2018-12-27 18:08:50 +01:00
Tomasz Grabiec
67f9afbd1a mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner
2018-12-27 18:08:50 +01:00
Gleb Natapov
37b4043677 streaming: always read from rpc::source until end-of-stream during mutation sending
rpc::source cannot be abandoned until EOS is reached, but the current code
does not obey this: if an error code is received, it throws an exception that
aborts the reading loop. Fix it by moving the exception throwing out of the
loop.

Fixes: #4025

Message-Id: <20181227135051.GC29458@scylladb.com>
2018-12-27 16:50:53 +02:00
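
The shape of this fix (remember the error, keep reading, throw after the loop) can be sketched as below. This is a hedged Python illustration, not the streaming code: `send_mutations()` and the integer status codes are hypothetical, and a plain iterator stands in for rpc::source.

```python
def send_mutations(source):
    """Consume every status from `source`, then report the first failure."""
    first_error = None
    drained = 0
    for status in source:          # runs until the source signals end-of-stream
        drained += 1
        if status != 0 and first_error is None:
            first_error = status   # remember it, but keep reading
    if first_error is not None:
        raise RuntimeError(
            f"peer reported error code {first_error} after {drained} messages")
    return drained


statuses = iter([0, 0, -1, 0])
message = None
try:
    send_mutations(statuses)
except RuntimeError as e:
    message = str(e)
remaining = list(statuses)
print(message)     # peer reported error code -1 after 4 messages
print(remaining)   # []
```

Raising inside the loop would have left the last status unread, which is the situation the commit message says rpc::source does not tolerate.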
Asias He
4d3c463536 storage_service: Stop cql server before gossip
We saw a failure in dtest concurrent_schema_changes_test.py:
TestConcurrentSchemaChanges.changes_while_node_down_test test.

======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
    self.make_schema_changes(session, namespace='ns2')
  File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
    session.execute('USE ks_%s' % namespace)
  File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
  File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
    raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed

The test:

   session = self.patient_cql_connection(node2)
   self.prepare_for_changes(session, namespace='ns2')
   node1.stop()
   self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception throws

The problem is that, after receiving the DOWN event, the python
Cassandra driver will call Cluster:on_down which checks if this client
has any connections to the node being shut down. If there are any
connections, the Cluster:on_down handler will exit early, so the session
to the node being shut down will not be removed.

If we shut down the cql server first, the connection count will be zero
and the session will be removed.

Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
2018-12-27 14:13:43 +02:00
Duarte Nunes
2f69ba2844 lwt: Remove Paxos-related Cassandra code
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227112526.4180-1-duarte@scylladb.com>
2018-12-27 13:30:10 +02:00
Duarte Nunes
66e45469b2 streaming/stream_session: Don't use table reference across defer points
When creating an sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.

Refs #4021

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
2018-12-27 13:05:46 +02:00