Commit Graph

4972 Commits

Author SHA1 Message Date
Benny Halevy
585ac6e641 database: use std::experimental::filesystem::path for lister::path
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.

TODO: using namespace fs = std::experimental::filesystem,
use fs::path directly, rather than lister::path

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2018-12-02 22:02:10 +02:00
Avi Kivity
4676e07400 consistency_level: simplify validation API
Remove unused parameters, replace refcounted pointers by references.
2018-11-27 13:41:49 +02:00
Avi Kivity
2c08bff8d5 Split consistency_level.hh header
It has two unrelated users: cql for validation, and storage_proxy for
complicated calculations. Split the simple stuff into a new header to reduce
dependencies.
2018-11-27 13:32:10 +02:00
Avi Kivity
b351a9fee7 db/repair_decision.hh: add missing #include
Message-Id: <20181126154948.2453-1-avi@scylladb.com>
2018-11-26 18:49:08 +01:00
Gleb Natapov
b4a8802edc hints: make hints manager more resilient to unexpected directory content
Currently if hints directory contains unexpected directories Scylla fails to
start with unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>
2018-11-21 14:53:03 +00:00
Gleb Natapov
9433d02624 hints: add auxiliary function for scanning high level hints directory
We scan hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates directory name to a shard is duplicated. It is simple now, so
not a bit issue but in case it grows better have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>
2018-11-21 14:53:03 +00:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Duarte Nunes
6fbf792777 db/view/view_builder: Don't timeout waiting for view to be built
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.

This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.

Fixes #3920

Tests: unit release(view_build_test, view_schema_test)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
2018-11-15 19:41:43 +02:00
Piotr Sarna
fc7267c797 db/view: add view_update_from_staging_generator service
A shardable service for generating mv updates after restarts
is added.
2018-11-13 15:01:52 +01:00
Piotr Sarna
ed05d91adc db/view: add view updating consumer
This consumer is used to generate and push view replica updates
from read mutations.
2018-11-13 14:54:39 +01:00
Avi Kivity
d77e044cde db: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Avi Kivity
04b70a2ff8 system_keyspace: simplify complicated sprint()
update_peer_info() uses two sprint()s where one would do, which confuses
the sprint-to-fmt translator. Simplify the code by using just one call.
2018-11-01 13:16:17 +00:00
Nadav Har'El
b8337f8c9d Materalized views: fix race condition in resharding while view building
When a node reshards (i.e., restarts with a different number of CPUs), and
is in the middle of building a view for a pre-existing table, the view
building needs to find the right token from which to start building on all
shards. We ran the same code on all shards, hoping they would all make
the same decision on which token to continue. But in some cases, one
shard might make the decision, start building, and make progress -
all before a second shard goes to make the decision, which will now
be different.

This resulted, in some rare cases, in the new materialized view missing
a few rows when the build was interrupted with a resharding.

The fix is to add the missing synchronization: All shards should make
the same decision on whether and how to reshard - and only then should
start building the view.

Fixes #3890
Fixes #3452

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181028140549.21200-1-nyh@scylladb.com>
2018-10-28 17:20:10 +00:00
Duarte Nunes
e46ef6723b Merge seastar upstream
* seastar d152f2d...c1e0e5d (6):
  > scripts: perftune.py: properly merge parameters from the command line and the configuration file
  > fmt: update to 5.2.1
  > io_queue: only increment statistics when request is admitted
  > Adds `read_first_line.cc` and `read_first_line.hh` to CMake.
  > fstream: remove default extent allocation hint
  > core/semaphore: Change the access of semaphore_units main ctor

Due to a compile-time fight between fmt and boost::multiprecision, a
lexical_cast was added to mediate.

sprint("%s", var) no longer accepts numeric values, so some sprint()s were
converted to format() calls. Since more may be lurking we'll need to remove
all sprint() calls.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-25 12:53:30 +03:00
Benny Halevy
2a57c454f2 update_compaction_history: handle execute_cql exception
Fixes #3774

Tested using view_schema_test with and without injecting an exception in
modification_statement::do_execute for "compaction_history".

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181017105758.9602-3-bhalevy@scylladb.com>
2018-10-24 18:39:53 +03:00
Avi Kivity
a9836ad758 thrift: limit message size
Limit message size according to the configuration, to avoid a huge message from
allocating all of the server's memory.

We also need to limit memory used in aggregate by thrift, but that is left to
another patch.

Fixes #3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>
2018-10-24 09:57:58 +01:00
Vlad Zolotarov
4d1bb719a4 config: enable hinted handoff by default
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181019180401.12400-1-vladz@scylladb.com>
2018-10-24 09:47:36 +03:00
Avi Kivity
1533487ba8 Merge "hinted handoff: give a sender a low priority" from Vlad
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.

In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.

Fixes #3817
"

* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
  db::hints::manager: use "streaming" I/O scheduling class for reads
  commitlog::read_log_file(): set the a read I/O priority class explicitly
  db::hints::manager: add hints sender to the "streaming" CPU scheduling group
2018-10-23 16:55:05 +00:00
Avi Kivity
d9e0ea6bb0 config: mark range_request_timeout_in_ms and request_timeout_in_ms as Used
This makes them available in scylla --help.

Fixes #3884.
Message-Id: <20181023101150.29856-1-avi@scylladb.com>
2018-10-23 11:52:03 +01:00
Duarte Nunes
f3a5ec0fd9 db/view: Don't copy keyspace name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181022104527.14555-1-duarte@scylladb.com>
2018-10-22 13:00:00 +02:00
Vlad Zolotarov
aca0882a3f hinted handoff: enable storing hints before starting messaging_service
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.

We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.

Refs #3828

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:49:58 -04:00
Vlad Zolotarov
cff4186517 db::hints::manager: add a "started" state
Hinting is allowed after "started" before "stopping".
Hints that attempted to be stored outside this time frame are going to
be dropped.

Refs #3828

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:41:36 -04:00
Vlad Zolotarov
fb513a4b23 db::hints::manager: introduce a _state
Introduce a multi-bit state field. In this patch it replaces the _stopping
boolean. We are going to add more states in the following patches.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:41:33 -04:00
Duarte Nunes
624472d16a db/hints/manager: Expose current backlog
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:35:00 +01:00
Duarte Nunes
6dcb7a39d4 db/hints/manager: Move decision about blocking hints to the manager
The space_watchdog enables or disables hints for the managers
associated with a particular device. We encapsulate this decision
inside the hints::managers by introducing the update_backlog()
function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:35:00 +01:00
Duarte Nunes
207c9c8e38 db/hints/resource_manager: Correctly account resources in space_watchdog
A db::hints::resource_manager manages the resources for one or two
db::hints::managers. Each of these can be using the same or different
devices. The db::hints::space_watchdog periodically checks whether
each manager is within their resource allocation, and if not disables
it.

The watchdog iterates over the managers and accounts for the total
size they are using. This is wrong, since it can account in the same
variable the size consumed by managers using different devices.

We fix this while taking advantage of the fact that on_timer is now
called in the context of a seastar::thread, instead of using future
combinators.

Fixes #3821

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:34:54 +01:00
Duarte Nunes
25d266bdc1 db/hints/resource_manager: Replace timer with seastar::thread
Will make on_timer() much simpler to allow fixing a bug in subsequent
patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:32:16 +01:00
Duarte Nunes
278aa13bb0 db/hints/resource_manager: Ensure managers are correctly registered
Registering a manager for a new device used
std::unordered_map::emplace(), which may not insert the specified
value if one with the same key has already been added. This could
happen if both managers were using the same device and the fiber
deferred in-between adding them.

Found during code reading. Could cause hints to not be disabled for an
overloaded manager.

Fixes #3822

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:32:16 +01:00
Duarte Nunes
9e3b09cf48 db/hints/resource_manager: Fix formatting
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:32:16 +01:00
Duarte Nunes
622ac734da db/hints: Disallow moving or copying the managers
Disable the copy and move ctors and assignment operators for both the
hints::manager and the hints::resource_manager.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:32:16 +01:00
Vlad Zolotarov
5b12ec441d db::hints::manager: use "streaming" I/O scheduling class for reads
Make sure that read I/O in the context of HH sending do not overpower I/O
in the context of queries, memtable flushes or compactions.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-10 15:22:43 -04:00
Vlad Zolotarov
a89188de07 commitlog::read_log_file(): set the a read I/O priority class explicitly
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-10 15:22:43 -04:00
Vlad Zolotarov
629972d586 db::hints::manager: add hints sender to the "streaming" CPU scheduling group
Make sure that HH sends do not overpower (CPU wise) regular WRITEs flow.

Fixes #3817

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-10 15:22:43 -04:00
Duarte Nunes
74d809f8be db/hints/manager: Use frozen_mutation instead of mutation
Instead of unfreezing a mutation from the commitlog and then freezing
it again to send, just keep the read frozen mutation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-07 19:57:30 +01:00
Duarte Nunes
6eec9748fc db/hints/manager: Use database::find_schema()
Instead of using find_column_family() and repeatedly asking for
column_family::schema(), use database::find_schema() instead.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-07 19:57:30 +01:00
Duarte Nunes
5b3d08defc db/commitlog/commitlog_entry: Allow moving the contained mutation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-07 19:57:30 +01:00
Nadav Har'El
1d5f8d0015 materialized views: update stats.write statistics in all cases
mutate_MV usually calls send_to_endpoint() to push view update to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.

In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.

Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
2018-10-02 20:44:58 +01:00
Duarte Nunes
40a30d4129 db/schema_tables: Diff tables using ID instead of name
Currently we diff schemas based on table/view name, and if the names
match, then we detect altered schemas by comparing the schema
mutations. This fails to detect transitions which involve dropping and
recreating a schema with the same name, if a node receives these
notifications simultaneously (for example, if the node was temporarily
down or partitioned).

Note that because the ID is persisted and created when executing a
create_table_statement, then even if a schema is re-created with the
exact same structure as before, we will still considered it altered
because the mutations will differ.

This also stops schema pulling from working, since it relies on schema
merging.

The solution is to diff schemas using their ID, and not their name.

Keyspaces and user types are also susceptible to this, but in their
case it's fine: these are values with no identity, and are just
metadata. Dropping and recreating a keyspace can be views as dropping
all tables from the keyspace, altering it, and eventually adding new
tables to the keyspace.

Note that this solution doesn't apply to tables dropped and created
with the same ID (using the `WITH ID = {}` syntax). For that, we would
need to detect deltas instead of applying changes and then reading the
new state to find differences. However, this solution is enough,
because tables are usually created with ID = {} for very specific,
peculiar reasons. The original motivation meant for the new table to
be treated exactly as the old, so the current behavior is in fact the
desired one.

Tests: unit(release), dtests(schema_test, schema_management_test)

Fixes #3797

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230932.47153-2-duarte@scylladb.com>
2018-10-02 20:15:46 +02:00
Duarte Nunes
e404f09a23 db/schema_tables: Drop tables before creating new ones
Doing it by the inverse order doesn't support dropping and creating a
schema with the same name.

Refs #3797

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230932.47153-1-duarte@scylladb.com>
2018-10-02 20:15:32 +02:00
Avi Kivity
337ee6153a Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.

Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.

Tests: unit {release}

+ tested Scylla with both 'la' and 'mc' formats to work fine:

cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};                                                                  [3/1890]
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
    <<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
    <<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;

 pk | ck | rc
----+----+------
  2 |  3 | null
  4 |  6 | null

(2 rows)
    <<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;

 pk | ck | rc
----+----+------
  5 |  7 | null
  8 | 10 | null
  2 |  3 | null
  4 |  6 | null
  7 |  9 | null
  6 |  8 | null

(6 rows)
"

* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
  database: Honour enable_sstables_mc_format configuration option.
  sstables: Support SSTables 'mc' format as a feature.
  db: Add configuration option for enabling SSTables 'mc' format.
  tests: Add test for reading a complex column with zero subcolumns (SST3).
  sstables: Fix parsing of complex columns with zero subcolumns.
  sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
  sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
  bytes: Add helper for turning bytes_view into sstring_view.
  sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
  sstables: Fix string formatting for exception messages in m_format_read_helpers.
  sstables: Don't validate timestamps against the max value on parsing.
  sstables: Always store only min bases in serialization_header.
  sstables: Support 'mc' version parsing from filename.
  SST3: Make sure we call consume_partition_end
2018-09-26 11:10:07 +01:00
Vladimir Krivopalov
650b245657 db: Add configuration option for enabling SSTables 'mc' format.
This flag will only be used for testing purposes until Scylla 3.o
release and will be removed once SSTables 'mc' testing is completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Avi Kivity
c6f651ead4 Merge "Use fragmented buffers in commitlog writes" from Paweł
"
This series changes commitlog write path so that it uses fragmented
buffers and therefore avoids large allocations. This is done by first
switching the code to use seastar memory_output_stream interface, which
can handle fragmented buffer without any additional actions from the
user code needed and then making it use buffers of fixed size 128 kB.

Tests: unit(release, debug) dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table)
"

* tag 'fragmented-commitlog-writes/v3' of https://github.com/pdziepak/scylla:
  commitlog: switch to fragmented buffers
  commitlog: drop buffer pools
  commitlog: drop recovery from bad alloc
  utils: drop data_output
  commitlog: use memory_output_stream
  serialization_visitors: add support for memory_output_stream
  utils: fragmented_temporary_buffer::view: add remove_prefix()
  utils: fragmented_temporary_buffer: add empty() and size_bytes()
  utils: fragmented_temporary_buffer: add get_ostream()
  idl: serializer: don't assume Iterator::value_type is bytes_view
  idl: serializer:  create buffer view from streams
  utils: crc: accept FragmentRange
2018-09-25 12:43:06 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Paweł Dziepak
4469f76e7c commitlog: switch to fragmented buffers
So far commitlog was using contiguous buffers for storing the data that
is about to be written to disk. It was able to coalesce small writes so
that multiple small mutations would use the same buffer, but if a
muation was large the commitlog would attempt to allocate a single,
appropriately large buffer. This excessively stresses the memory
allocator and may cause memory fragmentation to become an issue. The
solution is to use fixed-size buffers of 128 kB, which is the standard
buffer size in Scylla and keep large values fragmented.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
7c1add6769 commitlog: drop buffer pools
Buffer pools were added in 7191a130bb
"Commitlog: recycle buffers to reduce fragmentation." They introduce a
lot of complexity and will become unnecessary once the code is switched
to use fixed-size 128kB buffers.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
9fee8b8d76 commitlog: drop recovery from bad alloc
If a node cannot allocate a 128 kB it is already in a very bad shape, so
there isn't much value in trying to recover by attempting smaller
allocations and it just adds more complexity to the segment allocation.
It actually may be better to let some requests fail and give the node a
chance to recover rather than trying to use every last byte of free
memory and end up with bad_alloc in a noexcept context.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
2e5b375309 utils: drop data_output 2018-09-18 17:22:59 +01:00
Paweł Dziepak
fe48aaae46 commitlog: use memory_output_stream
memory_output_stream deals with all required pointer arithmetic and
allows easy transition to fragmented buffers.
2018-09-18 17:22:59 +01:00
Tomasz Grabiec
cd201d1987 db/batchlog_manager: Do not return a value from timer callback
Timer callbacks are std::function<void()>.

Exposed by changing callback_t to noncopyable_function<>.

Message-Id: <1536138045-29209-1-git-send-email-tgrabiec@scylladb.com>
2018-09-05 12:32:21 +03:00
Botond Dénes
6e59cee244 db::consistency_level::filter_for_query() add preferred_endpoints
To the second overload (the one without read-repair related params) too.
2018-09-03 10:31:44 +03:00