Commit Graph

1655 Commits

Author SHA1 Message Date
Nadav Har'El
c1a7a071ea merge: Remove most inclusions of reactor.hh
Merged patch series from Avi Kivity:

This patchset removes most inclusions of reactor.hh, by switching
to new namespace-scoped API:s instead of those using engine()
as a way to get the reactor. With this, we are down to 12 translation
units depending on reactor.hh, mostly for deprecated API:s like
reactor::at_exit().

Avi Kivity (3):
  logalloc: use namespace-scope seastar::idle_cpu_handler and related
    rather than reactor scope
  test: sstable-utils: deinline do_make_keys()
  treewide: replace calls to engine().some_api() with some_api()

 configure.py                                  | 14 +++-----
 auth/common.hh                                |  3 +-
 checked-file-impl.hh                          |  4 +--
 db/system_keyspace_view_types.hh              |  2 +-
 flat_mutation_reader.hh                       |  1 +
 lister.hh                                     |  2 +-
 message/messaging_service.hh                  |  2 +-
 redis/server.hh                               |  2 +-
 sstables/compress.hh                          |  2 +-
 sstables/integrity_checked_file_impl.hh       |  2 +-
 test/lib/sstable_utils.hh                     | 35 ++++---------------
 test/lib/test_services.hh                     |  2 +-
 thrift/server.hh                              |  2 +-
 transport/server.hh                           |  2 +-
 utils/error_injection.hh                      |  3 +-
 utils/joinpoint.hh                            |  2 +-
 utils/loading_cache.hh                        |  2 +-
 utils/logalloc.hh                             |  6 ++--
 utils/rate_limiter.hh                         |  2 +-
 api/system.cc                                 |  1 +
 auth/default_authorizer.cc                    |  2 +-
 auth/password_authenticator.cc                |  2 +-
 database.cc                                   |  1 +
 db/commitlog/commitlog.cc                     |  4 +--
 db/hints/resource_manager.cc                  |  3 +-
 db/system_distributed_keyspace.cc             |  2 +-
 dht/i_partitioner.cc                          |  2 +-
 gms/feature_service.cc                        |  3 +-
 lister.cc                                     |  4 +--
 locator/ec2_snitch.cc                         |  3 +-
 locator/gce_snitch.cc                         |  1 +
 main.cc                                       |  1 +
 reader_concurrency_semaphore.cc               |  2 +-
 redis/server.cc                               |  4 +--
 sstables/sstables.cc                          | 11 +++---
 table.cc                                      |  3 +-
 test/boost/commitlog_test.cc                  |  2 +-
 test/boost/database_test.cc                   |  2 +-
 test/boost/flush_queue_test.cc                |  2 +-
 test/boost/gossip_test.cc                     |  2 +-
 .../gossiping_property_file_snitch_test.cc    |  1 +
 test/boost/loading_cache_test.cc              |  2 +-
 test/boost/sstable_3_x_test.cc                |  1 +
 test/boost/sstable_datafile_test.cc           |  1 +
 test/boost/sstable_test.cc                    |  1 +
 test/lib/sstable_utils.cc                     | 26 ++++++++++++++
 test/manual/gossip.cc                         |  2 +-
 test/manual/hint_test.cc                      |  2 +-
 test/manual/sstable_scan_footprint_test.cc    |  2 +-
 test/perf/perf_mutation.cc                    |  1 +
 test/perf/perf_row_cache_update.cc            |  1 +
 test/perf/perf_sstable.cc                     |  1 +
 test/tools/cql_repl.cc                        |  2 +-
 thrift/server.cc                              |  2 +-
 transport/server.cc                           |  4 +--
 utils/config_file.cc                          |  3 +-
 utils/file_lock.cc                            |  2 +-
 utils/logalloc.cc                             | 14 ++++----
 utils/updateable_value.cc                     |  2 +-
 59 files changed, 119 insertions(+), 98 deletions(-)
2020-04-05 13:47:39 +03:00
Avi Kivity
88ade3110f treewide: replace calls to engine().some_api() with some_api()
This removes the need to include reactor.hh, a source of compile
time bloat.

In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.

Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().

Ref #1
2020-04-05 12:46:04 +03:00
Avi Kivity
1799cfa88a logalloc: use namespace-scope seastar::idle_cpu_handler and related rather than reactor scope
This allows us to drop a #include <reactor.hh>, reducing compile time.

Several translation units that lost access to required declarations
are updated with the required includes (this can be an include of
reactor.hh itself, in case the translation unit that lost it got it
indirectly via logalloc.hh)

Ref #1.
2020-04-05 12:45:08 +03:00
Piotr Sarna
1a9083b342 db,view: guard view builder startup with a semaphore
The startup routine performs some bookkeeping operations on views,
and so do these events:
 - on_create_view;
 - on_drop_view;
 - on_update_view.
Since the above events are guarded with a semaphore, the startup
routine should also take the same semaphore - in order to ensure
that all bookkeeping operations are serialized.

Refs #6094
2020-04-05 11:41:26 +02:00
Piotr Sarna
8da4a5b78c db,view: nitpick: change & operator to && for booleans
Although it's technically correct to use the bitwise and operator
on booleans as well, it's slightly confusing for the reader.
2020-04-05 11:41:25 +02:00
Piotr Sarna
e49805b7b8 db,view: remove unneeded implicit capture-by-reference
The lambda does not use any other captures, so it does not to
implicitly capture anything by reference.
2020-04-05 11:41:25 +02:00
Piotr Sarna
3f19865493 db,view: fix waiting for a view building future
The future was marked with a `FIXME: discarded future`, but there's really
no reason not to wait for it, and it was probably meant to be waited for
since its implementation.
2020-04-05 11:41:25 +02:00
Botond Dénes
240b5e0594 frozen_schema: key() remove unused schema parameter
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200402092249.680210-1-bdenes@scylladb.com>
2020-04-02 14:43:35 +02:00
Konstantin Osipov
9948f548a5 lwt: remove Paxos from experimental list
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.

Remove the check for LWT from Alternator.

Note that in order for the cluster to work with LWT, all nodes need
to support it.

Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.

Changes in v2:
* remove enable_lwt feature flag, it's always there

Closes #6102

test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
2020-04-01 09:12:21 +02:00
Avi Kivity
dee0b68347 Merge 'Separate sharding and partitioning logic' from Piotr J
"
Currently, both sharding and partitioning logic is encapsulated into partitioners. This is not desirable because these two concepts are totally independent and shouldn't be coupled together in such a way.

This PR separates sharding and partitioning. Partitioning will still live in i_partitioner class and its subclasses. Sharding is extracted to a new class called sharding_info. Both partitioners and sharding_info are still managed by schema class. Partitioner can be accessed with schema::get_partitioner while sharding_info can be accessed with schema::get_sharding_info.

The transition is done in steps:
1. sharding_info class is defined and all the sharding logic is extracted from partitioner to the new class. Temporarily sharding_info is still embedded into i_partitioner and all sharding related functions in i_partitioner call delegate to the embedded sharding_info object.
2. All calls to i_partitioner functions that are related to sharding are gradually switched to calls to sharding_info equivalents. sharding_info.
3. Once everything uses sharding_info, all sharding logic is dropped from i_partitioner.

Tests: unit(dev, release)
"

* haaawk-sharding_info: (32 commits)
  dummy_sharder: rename dummy_sharding_info.* to dummy_sharder.*
  sharding_info: rename the class to sharder
  i_partitioner:remove embeded sharding_info
  i_partitioner: remove unused get_sharding_info
  schema: remove incorrect comment
  schema: make it possible to set sharding_info per schema
  i_partitioner: remove unused shard_count
  multishard_writer: stop calling i_partitioner::shard_count
  i_partitioner: remove sharding_ignore_msb
  partitioner_test: test ranges and sharding_infos
  i_partitioner: remove unused split_ranges_to_shards
  i_partitioner: remove unused shard_of function
  sstable-utils: use sharding_info::shard_of
  create_token_range_from_keys: use sharding info for shard_of
  multishard_mutation_query_test: use sharding info for shard_of
  distribute_reader_and_consume_on_shards: use sharding_info::shard_of
  multishard_mutation_query: use sharding_info::shard_of
  dht::shard_of: use schema::get_sharding_info
  i_partitioner: remove unused token_for_next_shard
  split_range_to_single_shard: use sharding info instead of partitioner
  ...
2020-03-31 13:40:51 +03:00
Gleb Natapov
8a408ac5a8 lwt: remove entries from system.paxos table after successful learn stage
The learning stage of PAXOS protocol leaves behind an entry in
system.paxos table with the last learned value (which can be large). In
case not all participants learned it successfully next round on the same
key may complete the learning using this info. But if all nodes learned
the value the entry does not serve useful purpose any longer.

The patch adds another round, "prune", which is executed in background
(limited to 1000 simultaneous instances) and removes the entry in
case all nodes replied successfully to the "learn" round.  It uses the
ballot's timestamp to do the deletion, so not to interfere with the
next round. Since deletion happens very close to previous writes it will
likely happen in memtable and will never reach sstable, so that reduces
memtable flush and compaction overhead.

Fixes #5779

Message-Id: <20200330154853.GA31074@scylladb.com>
2020-03-30 21:02:14 +03:00
Piotr Jastrzebski
e72696a8e6 sharding_info: rename the class to sharder
Also rename all variables that were named si or sinfo
to sharder.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
2e850421a0 i_partitioner:remove embeded sharding_info
sharding_info embeded into partitioner is no longer
used anywhere and can be removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
7bd2b8d73f schema: make it possible to set sharding_info per schema
Previously schema::get_sharding_info was obtaining
sharding_info from the partitioner but we want to remove
sharding_info from the partitioner so we need a place
in schema to store it there instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Gleb Natapov
b3db6f5b04 lwt: rename "in_progress_ballot" cell to "promise" in system.paxos table
The value that is stored in "in_progress_ballot" cell is the value of
promised ballot, so call the cell accordingly to avoid confusion
especially as we have a notion of "in progress" proposal in the code
which is not the same as in_progress_ballot here.

We can still do it without care about backwards compatibility since LWT
is still marked as experimental.

Fixes #6087.

Message-Id: <20200326095758.GA10219@scylladb.com>
2020-03-30 12:01:55 +03:00
Asias He
743b529c2b gossip: Add an option to force gossip generation
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.

n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)

One year later, user wants the upgrade n1,n2,n3 to a new version

when n3 does a rolling restart with a new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, so g1 and g2 will reject n3's
gossip update and mark g3 as down.

Such unnecessary marking of node down can cause availability issues.
For example:

DC1: n1, n2
DC2: n3, n4

When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.

To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE difference for the new node.

Once all the nodes run the version with commit
0a52ecb6df, the option is no logger
needed.

Fixes #5164
2020-03-27 12:15:21 +01:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Rafael Ávila de Espíndola
eca0ac5772 everywhere: Update for deprecated apply functions
Now apply is only for tuples, for varargs use invoke.

This depends on the seastar changes adding invoke.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200324163809.93648-1-espindola@scylladb.com>
2020-03-25 08:49:53 +02:00
Avi Kivity
a314283469 Merge "Minor cleanups to cql3 code regarding shared_ptr's" from Pavel S
"
This small series consists of several changes that aim to
reduce the number of shared_ptr's in cql3 code.

Also it contains a patch that makes CqlParser::query to return
std::unique_ptr<> instead of seastar::shared_ptr<>, which leads
to more understandable code and lays foundation for further
optimizations (e.g. possibly eliminating shared_ptr's in
`prepared_statement` and just moving raw statements in `prepare`
without copying them).

Tests: unit(dev, debug)
"

* 'feature/cql_cleanups_9' of https://github.com/ManManson/scylla:
  cql3: return raw::parsed_statement as unique_ptr
  cql3: de-pointerize arguments to some of CQL grammar rules and definitions.
  cql3: make abstract_marker::make_in_receiver accept cref to column_specification
2020-03-24 14:51:49 +02:00
Calle Wilund
9fee712d62 db::commitlog: Don't write trailing zero block unless needed
Fixes #5899

When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).

However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).

We should only write trailing zero if we are below the max_size
file size when terminating

Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)

v2:
- Fix test to take into account that files might be deleted
  behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
  queued for delete.

Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
2020-03-24 11:31:55 +01:00
Pavel Solodovnikov
adc6a98b59 cql3: return raw::parsed_statement as unique_ptr
Change CQL parsing routine to return std::unique_ptr
instead of seastar::shared_ptr.

This can help reduce redundant shared_ptr copies even further.

Make some supplementary changes necessary for this transition:
 * Remove enabled_shared_from_this base class from the following
   classes: truncate_statement, authorization_statement,
   authentication_statement: these were previously constructing
   prepared_statement instance in `prepare` method using
   `shared_from_this`.
   Make `prepare` methods implementation of inheriting classes
   mirror implementation from other statements (i.e.
   create a shallow copy of the object when prepairing into
   `prepared_statement`; this could be further refactored
   to avoid copies as much as possible).
 * Remove unused fields in create_role_statement which led to
   error while using compiler-generated copy ctor (copying
   uninitialied bool values via ctor).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-03-23 23:19:21 +03:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations 2020-03-23 09:29:45 +02:00
Pekka Enberg
6b2cd1bd7d Revert "db::commitlog: Don't write trailing zero block unless needed"
This reverts commit 0b34d88957. According
to Rafael Avila de Espindola:

"I have bisected the recent failures [in commitlog_test] on next to this
 patch."
2020-03-20 22:30:58 +02:00
Nadav Har'El
7922b9eb8f materialized views: reduce recompilation when db/view/view.hh changes.
Before this patch, when db/view/view.hh was modified, 89 source files had to
be recompiled. After this patch, this number is down to 5.

Most of the irrelevant source files got view.hh by including database.hh,
which included view.hh just for the definition of statistics. So in this
patch we split the view statistics to a separate header file, view_stats.hh,
and database.hh only includes that. A few source files which included
only database.hh and also needed view.hh (for materialized-view related
functions) now need to include view.hh explicitly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200319121031.540-1-nyh@scylladb.com>
2020-03-19 15:46:14 +02:00
Piotr Sarna
0c11e07faf view,table: fix waiting for view updates during building
View updates sent as part of the view building process should never
be ignored, but fd49fd7 introduced a bug which may cause exactly that:
the updates are mistakenly sent to background, so the view builder
will not receive negative feedback if an update failed, which will
in turn not cause a retry. Consequently, view building may report
that it "finished" building a view, while some of the updates were
lost. A simple fix is to restore previous behaviour - all updates
triggered by view building are now waited for.

Fixes #6038
Tests: unit(dev),
dtest: interrupt_build_process_with_resharding_low_to_half_test
2020-03-19 10:50:54 +02:00
Avi Kivity
ee9df91a76 Merge "Allow setting partitioner per table" from Piotr
"
This PR makes it possible to enable the usage of different partitioner for each table. If no table-specific partitioner is set for a given table then a default partitioner is used.

The PR is composed of the following parts:

 - Introduction of schema::get_partitioner that still returns dht::global_partitioner
 - Replacement of all the usage of dht::global_partitioner with schema::get_partitioner
 - Making it possible to set table-specific partitioner in a schema_builder
 - Remove all the places that were setting default partitioner except for main.cc (mostly tests)
 - Move default partitioner from i_partitioner to schema.cc and hide it from the rest of the codebase
 - Remove dht::global_partitioner

After this PR there's no such thing as global partitioner at all. There is only a default partitioner but it still has to be accessed through schema::get_partitioner.

There are some intermediate states in which i_partitioner is stored as shared_ptr in the schema but the final version keeps it by const&.

The PR does not enable per table partitioner end-to-end. Just the internals of the single node are covered. I still have to deal with:

 - Making sure a table has the same partitioner on each node
 - Allowing user to set up a table-specific partitioner on table
 - Signal driver about what partitioner is used by a given table
 - Persist partitioner info for each table that does not use default partitioner.
Fixes #5493

Tests: unit(dev, release, debug), dtest(byo)
"

* 'per_table_partitioner' of https://github.com/haaawk/scylla:
  schema: drop optional from _partitioner field
  make_multishard_combining_reader: stop taking partitioner
  split_range_to_single_shard: stop taking partitioner as argument
  tests: remove unused murmur3 includes
  partitioner: move default_partitioner to schema.cc
  partitioner: hide dht::default_partitioner
  schema: include partitioner name in scylla tables mutation
  schema: make it possible to set custom partitioner
  scylla_tables: add partitioner column
  schema_features: add PER_TABLE_PARTITIONERS feature
  features: add PER_TABLE_PARTITIONERS feature
2020-03-16 11:13:47 +02:00
Piotr Jastrzebski
57b69fb804 schema: include partitioner name in scylla tables mutation
There are two results of this patch:
1. New partitioner name column is persited on node's disk in scylla_tables
2. New partitioner name column is included into schema digest

This is achieved by including this new column in scylla tables mutation.
For that we:
1. Add partitioner name to the result of make_scylla_tables_mutation.
   If table does not have a specific partitioner set and uses default
   partitioner then we don't include the name of such default partitioner.
   Only the name of custom partitioner is added if a table has one.
2. In create_table_from_mutations we check whether scylla tables mutation
   has a partitioner name set. If so then we use it as a parameter for
   schema_builder.

Note that previous patches have ensured that this new column will be included
into schema digest only after the whole cluster supports per table partitioners.
Before that, during rolling upgrade, new partitioner name column is hidden and
not shared with other nodes.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
f83ff8fda1 scylla_tables: add partitioner column
Following commits make it possible to set a specific
partitioner for a table. We want to persist that information
and include it into schema digest. For that a new column
in scylla_tables is needed. This commit adds such column.

We add the new column to scylla_tables because it's a Scylla
specific extension.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Piotr Jastrzebski
782f2caf41 schema_features: add PER_TABLE_PARTITIONERS feature
With per table partitioners, partitioner name will be a part
of table schema. To allow rolling upgrade we need to perform
special logic that hides new partitioner name schema column
during the upgrade. This commit adds new schema feature that
controls this logic.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-15 10:25:20 +01:00
Nadav Har'El
635e6d887c materialized views: fix corner case of view updates used by Alternator
While CQL does not allow creation of a materialized view with more than one
base regular column in the view's key, in Alternator we do allow this - both
partition and clustering key may be a base regular column. We had a bug in
the logic handling this case:

If the new base row is missing a value for *one* of the view key columns,
we shouldn't create a view row. Similarly, if the existing base row was
missing a value for *one* of the view key columns, a view row does not
exist and doesn't need to be deleted.  This was done incorrectly, and made
decisions based on just one of the key columns, and the logic is now
fixed (and I think, simplified) in this patch.

With this patch, the Alternator test which previously failed because of
this problem now passes. The patch also includes new tests in the existing
C++ unit test test_view_with_two_regular_base_columns_in_key. This tests
was already supposed to be testing various cases of two-new-key-columns
updates, but missed the cases explained above. These new tests failed
badly before this patch - some of them had clean write errors, others
caused crashes. With this patch, they pass.

Fixes #6008.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312162503.8944-1-nyh@scylladb.com>
2020-03-15 07:57:33 +01:00
Piotr Sarna
2061e6a9cc db,view: perform local view updates synchronously
Local view updates (updates applied to a local node,
without remote communication) are from now on performed
synchronously - which adds consistency guarantees, as a local
write failure will be returned to the client instead of being
silently ignored.
2020-03-11 09:05:56 +01:00
Piotr Sarna
fd49fd773c db,view: move putting view updates to background to mutate_MV
Currently, launching view updates as an asynchronous background job
is done via not waiting for mutate_MV() future in
table::generate_and_propagate_view_updates. That has a big downside,
since mutate_MV() handles *all* view updates for *all* views of a table,
so it's not possible to wait for each view independently.
Per-view granularity is required in order to implement synchronous
view updates of local views - because then we'll synchronously
wait for all views that write to a local node (due to having a matching
partition key with the base), while remote view updates will still
be sent asynchronously.
In order to do that, instead of not waiting for mutate_MV,
we do wait for it properly, but instead launch the asynchronous,
unwaited-for futures inside mutate_MV.
Effectively that means no changes for view updates so far - all updates
will be fired in the background. Later, another patch will introduce
a way to wait for selected updates to finish.
2020-03-11 09:05:56 +01:00
Piotr Sarna
3b3659e8cd db,view: drop default parameter for mutate_MV::allow_hints
Default parameters are considered harmful, and as part of a cleanup
before editing view.cc code, a default value for allow_hints parameter
is removed.
2020-03-11 09:05:56 +01:00
Gleb Natapov
cd73f552b9 storage_service, database: do not move sharded services
It may be not safe to move sharded services, so it will be prohibited in
the future seastar update. Remove all current cases where we do it.

Fixes #5814.

Message-Id: <20200301095423.GY434@scylladb.com>
2020-03-10 12:51:02 +02:00
Tomasz Grabiec
3548e85ff7 Merge "features: Properly resolve when_enabled futures on stop" from Pavel E.
If the feature service is stopped without enabling some features,
the latrer may end up with "broken promise" exception on futures
attached to the _pr promise. Fix this by switching the only user
of it onto 'listener' API and remove future-based one.

Tests: unit(debug), manual start-stop and aborted-start
2020-03-10 10:09:24 +02:00
Calle Wilund
0b34d88957 db::commitlog: Don't write trailing zero block unless needed
Fixes #5899

When terminating (closing) a segment, we write a trailing block
of zero so reader can have an empty region after last used chunk
as end marker. This is due to using recycled, pre-allocated
segments with potentially non-zero data extending over the point
where we are ending the segment (i.e. we are not fully filling
the segment due to a huge mutation or similar).

However, if we reach end of segment writing the final block
(typically many small mutations), the file will end naturally
after the data written, and any trailing zero block would in fact
just extend the file further. While this will only happen once per
segment recycled (independent on how many times it is recycled),
it is still both slightly breaking the disk usage contract and
also potentially causing some disk stalls due to metadata changes
(though of course very infrequent).

We should only write trailing zero if we are below the max_size
file size when terminating

Adds a small size check to commitlog test to verify size bounds.
(Which breaks without the patch)

Message-Id: <20200226121601.15347-2-calle@scylladb.com>
2020-03-08 16:51:53 +02:00
Piotr Jastrzebski
968177da04 cdc: store tokens in cdc description as longs
Previously the tokens were stored as strings
because token could have been represented in multiple ways.
Now token representation is always int64_t so we can
store them as ints in cdc description as well.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-06 11:59:59 +01:00
Piotr Dulikowski
861c7b5626 schema: get cdc options from schema extensions
Removes logic responsible for setting cdc_options from dedicated column
in scylla_tables, and uses the "cdc" schema extension instead.
2020-03-05 16:11:21 +01:00
Piotr Dulikowski
6895b0e395 db::extensions: add shorthands for add_schema_extension
This abstract away a pattern used everywhere when adding a schema
extension.
2020-03-05 16:09:44 +01:00
Piotr Sarna
c35160457b Merge 'Clean up stream_id representation' from Piotr
With #5950 we changed the representation of stream_id
in CDC Log from two int columns to a single blob column.

This PR cleans up stream_id representation internally.
Now stream_id is stored as blob both in-memory and in
internal CDC tables.

Tests: unit(dev)

* hawk/stream_id_representation:
  cdc: store stream_ids as blobs in internal tables
  cdc: improve do_update_streams_description
  cdc: Fix generate_topology_description
  cdc: add stream_id::operator<
  cdc: change stream_id representation
2020-03-05 14:14:29 +01:00
Calle Wilund
b48255a4cd db::commitlog: Only zero disk blocks not already allocated in segment
Fixes #5891
Refs #5899

When creating segments with the o_dsync option active, we write max_size
zeros to disk, to ensure actual disk blocks are allocated.

However, if we recycle a segment, we should, when not actually creating
a new file, check the existing size on disk, and only zero any blocks
not already allocated (i.e. if recycled file was smaller than max_size,
due to segement truncation on close).

test: unit
Message-Id: <20200226121601.15347-1-calle@scylladb.com>
2020-03-05 13:27:08 +01:00
Piotr Jastrzebski
57cfe6d0e1 cdc: store stream_ids as blobs in internal tables
In new CDC Log format stream_id is represented by a single
blob column so it makes sense to store it in the same form
everywhere - including internal CDC tables.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 11:31:22 +01:00
Pavel Emelyanov
4fa12f2fb8 header: De-bloat schema.hh
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace shema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
2020-03-03 11:34:00 +01:00
Piotr Jastrzebski
f105f43008 commitlog: remove FIXME
In segment_manager::on_timer() there's a FIXME
to stop discarding future returned from sync()
but sync() does not return any future so it's safe
to remove the FIXME and stop casting to (void).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6d6d819cb2972e47e5f3fbe7b896499c64b09e53.1583230579.git.piotr@scylladb.com>
2020-03-03 12:21:56 +02:00
Pavel Emelyanov
e63f5187b2 system_keyspace: Rework migrate_truncation_records feature subscription
The function in question uses future-based .when_enabled() subscription
on cluster_supports_truncation_table feature. This method is considered
to be unsafe, so here's the patch that changes it onto feature::listener.

The completion of the migration is only awaited by a single test, so
this waiting mechanism is also slightly simplified.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-03-02 19:55:28 +03:00
Botond Dénes
75efa707ce db/config: add config memory limit of otherwise unlimited queries
We have a few kind of queries whose memory consumption is not limited at
all. One of these is reverse queries, which reads entire partitions into
memory, before reversing them. These partitions can be larger than
memory and thus such a query can single-handedly cause OOM.
This patch introduces a configuration for a memory limit for such
queries. This will serve as a hard limit and queries which attempt to
use more memory than this, will be aborted.
The limit is propagated to table objects, with the intention of keeping
system tables unlimited. These tables are usually small and initiators
of system queries are not prepared for failures.
2020-02-27 18:11:54 +02:00
Avi Kivity
956b092012 Merge "Repair based node operation" from Asias
"
Here is a simple introduction to the node operations scylla supports and
some of the issues.

 - Replace operation

It is used to replace a dead node. The token ring does not change. It
pulls data from only one of the replicas which might not be the
latest copy.

 - Rebuild operation

It is used to get all the data this node owns form other nodes. It
pulls data from only one of the replicas which might not be the
latest copy.

 - Bootstrap operation

It is used to add a new node into the cluster. The token ring
changes. Do no suffer from the "not the latest replica” issue. New
node pulls data from existing nodes that are losing the token range.

Suffer from failed streaming. We split the ranges in 10 groups and we
stream one group at a time. Restream the group if failed, causing
unnecessary data transmission on wire.

Bootstrap is not resumable. Failure after 99.99% of data is streamed.
If we restart the node again, we need to stream all the data again
even if the node already has 99.99% of the data.

 - Decommission operation

It is used to remove a live node form the cluster. Token ring
changes. Do not suffer “not the latest replica” issue. The leaving
node pushes data to existing nodes.

It suffers from resumable issue like bootstrap operation.

 - Removenode operation

It is used to remove a dead node out of the cluster. Existing nodes
pulls data from other existing nodes for the new ranges it own. It
pulls from one of the replicas which might not be the latest copy.

To solve all the issues above. We could use repair based node operation.
The idea behind repair based node operations is simple: use repair to
sync data between replicas instead of streaming.

The benefits:

 - Latest copy is guaranteed

 - Resumable in nature

 - No extra data is streamed on wire
   E.g., rebuild twice, will not stream the same data twice

 - Unified code path for all the node operations

 - Free repair operation during bootstrap, replace operation and so on.

Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
"

* 'repair_for_node_ops' of https://github.com/asias/scylla:
  docs: Add doc for repair_based_node_ops
  storage_service: Enable node repair based ops for bootstrap
  storage_service: Enable node repair based ops for decommission
  storage_service: Enable node repair based ops for replace
  storage_service: Enable node repair based ops for removenode
  storage_service: Enable node repair based ops for rebuild
  storage_service: Use the same tokens as previous bootstrap
  storage_service: Add is_repair_based_node_ops_enabled helper
  config: Add enable_repair_based_node_ops
  repair: Add replace_with_repair
  repair: Add rebuild_with_repair
  repair: Add do_rebuild_replace_with_repair
  repair: Add removenode_with_repair
  repair: Add decommission_with_repair
  repair: Add do_decommission_removenode_with_repair
  repair: Add bootstrap_with_repair
  repair: Introduce sync_data_using_repair
  repair: Propagate exception in tracker::run
2020-02-26 20:37:25 +02:00
Piotr Dulikowski
41d82e39ea storage proxy: rename mutate_hint_from_scratch
Changes the name of storage_proxy::mutate_hint_from_scratch function to
another name, whose meaning is more clear: send_hint_to_all_replicas.

Tests: unit(dev)
2020-02-24 17:30:22 +02:00
Asias He
cb4045e11d config: Add enable_repair_based_node_ops
An option to enable the repair based node operations.
2020-02-24 11:11:40 +08:00
Gleb Natapov
df2f67626b commitlog: fix size of a write used to zero a segment
Due to a bug the entire segment is written in one huge write of 32Mb.
The idea was to split it to writes of 128K, so fix it.

Fixes #5857

Message-Id: <20200220102939.30769-1-gleb@scylladb.com>
2020-02-20 17:22:21 +02:00