Commit Graph

18852 Commits

Author SHA1 Message Date
Avi Kivity
af2a3859f6 Update seastar submodule
* seastar b629d5ef7a...a5b9f77d52 (6):
  > perftune.py: add comment explaining why we don't log errors when binding NVMe IRQs for all but i3.nonmetal machines
  > sharded: do a two phase shutdown for sharded services
  > chunked_fifo: add iterator
  > perftune.py: fix the i3 metal detection pattern
  > core/memory: remove translation api
  > reactor: file_type: offer option to not follow symbolic links
2019-06-30 11:32:21 +03:00
Glauber Costa
d916601ea4 toppartitions: fix typo
toppartitons -> toppartitions

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190627160937.7842-1-glauber@scylladb.com>
2019-06-27 19:13:58 +03:00
Tomasz Grabiec
e071445373 Merge "More precise poisoning in logalloc" from Rafael
With this, unused descriptors and objects should always be poisoned.

 * https://github.com/espindola/scylla/ align-descriptors-so-that-they-are-poisoned-v4:
 Convert macros to inline functions
 More precise poisoning in logalloc
2019-06-27 16:30:40 +02:00
Takuya ASADA
eabb872789 dist/redhat: install /usr/sbin symlinks correctly
In the current scylla.spec, the shell glob pattern "scylla_*setup" is not
expanded correctly; it mistakenly creates a symlink literally named
"/usr/sbin/scylla_*setup". We need to expand the pattern and create a symlink
for each setup script.

Fixes #4605

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-2-syuu@scylladb.com>
2019-06-27 14:22:40 +03:00
Takuya ASADA
828b63f4fb dist/redhat: manage *.pyc as a part of package
Since we don't install .pyc files in our package, python3 generates .pyc
files when a setup script is launched for the first time.
This leaves unmanaged files under the script directory, and they remain when
the Scylla package is upgraded or removed.

We need to compile *.py when we generate the relocatable package and add the
compiled .pyc files to the .rpm/.deb packages.

Fixes #4612

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-1-syuu@scylladb.com>
2019-06-27 14:22:39 +03:00
Rafael Ávila de Espíndola
d8dbacc7f6 More precise poisoning in logalloc
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.
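
As a rough illustration of the alignment arithmetic involved (not the logalloc code itself), the sketch below rounds an offset up to the next 8-byte boundary. The names `align_up` and `next_object_offset` are hypothetical, and the 8-byte granularity is taken from the description above.

```cpp
#include <cstddef>

// Hypothetical helper, not taken from logalloc: round `n` up to the next
// multiple of `alignment`, which must be a power of two.
constexpr std::size_t align_up(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

// Offset at which the next 8-byte-aligned descriptor or value would start
// if the previous object ended at `end_offset`.
constexpr std::size_t next_object_offset(std::size_t end_offset) {
    return align_up(end_offset, 8);
}
```

With every object starting on an 8-byte boundary, no two objects share the same 8-byte unit, so poisoning one cannot touch bytes that belong to another.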

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-06-26 13:13:48 -07:00
Rafael Ávila de Espíndola
6a2accb483 Convert macros to inline functions
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-06-26 13:13:48 -07:00
Avi Kivity
dd76943125 Merge "Segregate data when streaming by timestamp for time window compaction strategy" from Botond
"
When writing streamed data into sstables, while using time window
compaction strategy, we have to emit a new sstable for each time window.
Otherwise we can end up with sstables mixing data from wildly different
windows, ruining the compaction strategy's ability to drop entire
sstables when all data within is expired. This gets worse as these mixed
sstables get compacted together with sstables that used to contain a
single time window.

This series provides a solution to this by segregating the data by its
atoms' time-windows. This is done on the new RPC streaming and the
new row-level repair, memtable-flush and compaction, ensuring that the
segregation requirement is respected at all times.

Fixes: #2687
"

* 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla:
  streaming,repair: restore indentation
  repair: pass the data stream through the compaction strategy's interposer consumer
  streaming: pass the data stream through the compaction strategy's interposer consumer
  TWCS: implement add_interposer_consumer()
  compaction_strategy: add add_interposer_consumer()
  Add mutation_source_metadata
  tests: add unit test for timestamp_based_splitting_writer
  Add timestamp_based_splitting_writer
  Introduce mutation_writer namespace
2019-06-26 19:18:52 +03:00
Tomasz Grabiec
3e30a33e31 Merge "Introduce tests::random_schema" from Botond
Most of our tests use overly simplistic schemas (`simple_schema`) or
very specialized ones that focus on exercising a specific area of the
tested code. This is fine in most places as not all code is schema
dependent, however practice has shown that there can be nasty bugs
hiding in dark corners that only appear with a schema that has a
specific combination of types.

This series introduces `tests::random_schema`, a utility class for
generating random schemas and random data for them. An important goal is
to make using random schemas in tests as simple and convenient as
possible, therefore fostering the appearance of tests using random
schemas.

Random schema was developed to help test code I'm currently working
on, which segregates data by time-windows. As I wasn't confident in my
ability to think of every possible combination of types that can break
my code, I came up with random-schema to help me find these corner
cases. So far I consider it a success: it already found bugs in my code
that I'm not sure I would have found if I had relied on specific
schemas. It also found bugs in unrelated areas of the code which proves
my point in the first paragraph.

* https://github.com/denesb/scylla.git random_schema/v5:
  tests/data_model: approximate to the modeled data structures
  data_value: add ascii constructor
  tests/random-utils.hh: add stepped_int_distribution
  tests/random-utils.hh: get_int() add overloads that accept external
    rand engine
  tests/random-utils.hh: add get_real()
  tests: introduce random_schema
2019-06-26 18:10:20 +02:00
Botond Dénes
12b8405720 streaming,repair: restore indentation
Deferred from the previous two patches.
2019-06-26 18:45:36 +03:00
Botond Dénes
e3f4692868 repair: pass the data stream through the compaction strategy's interposer consumer 2019-06-26 18:45:36 +03:00
Botond Dénes
9c2407573c streaming: pass the data stream through the compaction strategy's interposer consumer 2019-06-26 18:45:36 +03:00
Botond Dénes
ee563928df TWCS: implement add_interposer_consumer()
Exploit the interposer customization point to inject a consumer that will
segregate the mutation stream based on the contained atoms' timestamps,
allowing the requirements of TWCS to be maintained every time sstables
are written to disk.
For the implementation, `timestamp_based_splitting_writer` is used,
with a classifier that maps timestamps to windows.
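
A classifier of this kind could look roughly like the sketch below. It is an illustration only: `timestamp_type` and `window_start` are hypothetical names, and the real classifier in the patch may compute windows differently.

```cpp
#include <cstdint>

// Hypothetical alias; the unit of the timestamp is left unspecified here.
using timestamp_type = std::int64_t;

// Map a timestamp to the lower bound of the window that contains it.
// `window_size` must be positive. The remainder is normalised so that
// negative timestamps are floored rather than truncated toward zero.
inline timestamp_type window_start(timestamp_type ts, timestamp_type window_size) {
    timestamp_type rem = ts % window_size;
    if (rem < 0) {
        rem += window_size;
    }
    return ts - rem;
}
```

Two timestamps belong to the same window exactly when `window_start` returns the same value for both.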
2019-06-26 18:45:36 +03:00
Tomasz Grabiec
2d3e3640df Merge "Collection: use utils::chunked_vector to store the cells" from Botond
This is a band-aid patch that is supposed to fix the immediate problem
of large collections causing large allocations. The proper fix is to
use IMR but that will take time. In the meantime, alleviate the
pressure on the memory allocator by using a chunked storage collection
(utils::chunked_vector) instead of std::vector. In the linked issue
seastar::chunked_fifo was also proposed as the container to use,
however chunked fifo is not traversable in reverse which disqualifies
it from this role.
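
To illustrate the idea of chunked storage (this is not the implementation of `utils::chunked_vector`), the sketch below keeps elements in fixed-size chunks, so no single allocation grows with the collection, while indexing still allows reverse traversal. `chunked_storage_sketch` and `for_each_reverse` are hypothetical names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: elements live in chunks of at most ChunkSize items,
// so each element allocation is bounded regardless of the total size.
template <typename T, std::size_t ChunkSize = 4>
class chunked_storage_sketch {
    std::vector<std::vector<T>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T value) {
        if (_size % ChunkSize == 0) {
            // The last chunk is full (or there is none yet): start a new one.
            _chunks.emplace_back();
            _chunks.back().reserve(ChunkSize);
        }
        _chunks.back().push_back(std::move(value));
        ++_size;
    }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
    const T& operator[](std::size_t i) const {
        return _chunks[i / ChunkSize][i % ChunkSize];
    }
};

// Random access by index is what makes reverse traversal possible.
template <typename Storage, typename Func>
void for_each_reverse(const Storage& s, Func f) {
    for (std::size_t i = s.size(); i > 0; --i) {
        f(s[i - 1]);
    }
}
```

A FIFO that only exposes its front and back cannot offer the second function, which is the limitation mentioned above.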

Refs: #3602
2019-06-26 15:32:25 +02:00
Botond Dénes
a280dcfe4c compaction_strategy: add add_interposer_consumer()
This will be the customization point for compaction strategies, used to
inject a specific interposer consumer that can manipulate the fragment
stream so that it satisfies the requirements of the compaction strategy.
For now the only candidate for injecting such an interposer is
time-window compaction strategy, which needs to write sstables that
only contain atoms belonging to the same time-window. By default no
interposer is injected.
Also add an accompanying customization point
`adjust_partition_estimate()` which returns the estimated per-sstable
partition-estimate that the interposer will produce.
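
The shape of such a customization point can be sketched as follows. All types here are simplified stand-ins invented for illustration (`fragment` is an `int`, a consumer is a `std::function`), and the even/odd split is a toy criterion; only the method names come from the description above.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Simplified stand-ins, not the real Scylla types.
using fragment = int;
using consumer = std::function<void(const std::vector<fragment>&)>;

struct compaction_strategy_sketch {
    virtual ~compaction_strategy_sketch() = default;
    // Default: no interposer is injected, the end consumer is used as-is.
    virtual consumer add_interposer_consumer(consumer end_consumer) const {
        return end_consumer;
    }
    // Default: the estimate is passed through unchanged.
    virtual std::size_t adjust_partition_estimate(std::size_t estimate) const {
        return estimate;
    }
};

// A strategy that wraps the end consumer and splits the stream in two
// before handing it on (toy criterion: even vs. odd fragments).
struct splitting_strategy_sketch : compaction_strategy_sketch {
    consumer add_interposer_consumer(consumer end_consumer) const override {
        return [end_consumer](const std::vector<fragment>& frags) {
            std::vector<fragment> even, odd;
            for (fragment f : frags) {
                (f % 2 == 0 ? even : odd).push_back(f);
            }
            if (!even.empty()) { end_consumer(even); }
            if (!odd.empty()) { end_consumer(odd); }
        };
    }
};
```

The caller does not need to know whether an interposer was injected: it always writes into whatever consumer the strategy returns.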
2019-06-26 15:45:59 +03:00
Botond Dénes
3ce902a4be Add mutation_source_metadata
This struct contains metadata regarding a mutation_source. Currently
it contains the min and max timestamp. This will be used later by
compaction strategies to determine whether a given mutation stream has
to be split or not.
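
One way such metadata might be consulted is sketched below. The struct layout and the `needs_split` helper are assumptions made for illustration; the sketch also assumes non-negative timestamps and a positive window size.

```cpp
#include <cstdint>

// Hypothetical stand-in for the metadata described above.
struct source_metadata_sketch {
    std::int64_t min_timestamp;
    std::int64_t max_timestamp;
};

// A stream only needs splitting if its oldest and newest timestamps fall
// into different windows. Assumes non-negative timestamps, window_size > 0.
inline bool needs_split(const source_metadata_sketch& md, std::int64_t window_size) {
    return md.min_timestamp / window_size != md.max_timestamp / window_size;
}
```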
2019-06-26 15:45:59 +03:00
Botond Dénes
25d7cbedc0 tests: add unit test for timestamp_based_splitting_writer 2019-06-26 15:45:59 +03:00
Botond Dénes
df29600eec Add timestamp_based_splitting_writer
This writer implements the core logic of time-window based data
segregation. It splits the fragment stream provided by a reader, such
that each atom (cell) in the stream will be written into a consumer
based on the time-window its timestamp belongs to. The end result is
that each consumer will only see fragments whose atoms all have
timestamps belonging to the same time-window.
When a mutation fragment has atoms belonging to different time-windows,
it is split into as many fragments as needed so each has only atoms
that belong to the same time-window.
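
The splitting step can be sketched on a simplified model, where a "fragment" is just a list of cells with timestamps. `cell`, `row` and `split_by_window` are hypothetical names, and the classifier is passed in as a plain function rather than taken from the real writer.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Simplified model: a row is a list of cells, each with its own timestamp.
struct cell {
    std::string column;
    std::int64_t timestamp;
};
using row = std::vector<cell>;

// Split one row into as many rows as there are distinct windows among its
// cells. `classify` maps a timestamp to a window identifier.
std::map<std::int64_t, row> split_by_window(
        const row& r,
        const std::function<std::int64_t(std::int64_t)>& classify) {
    std::map<std::int64_t, row> per_window;
    for (const cell& c : r) {
        per_window[classify(c.timestamp)].push_back(c);
    }
    return per_window;
}
```

Each entry of the returned map would go to the consumer responsible for that window, so every consumer sees only cells from a single window.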
2019-06-26 15:45:59 +03:00
Botond Dénes
2693f1838a Introduce mutation_writer namespace
Currently there is a single mutation_writer: `multishard_writer`,
however in the next patch we are going to add another one. This is the
right moment to move these into a common namespace (and folder), we
have way too much stuff scattered already in the top-level namespace
(and folder).
Also rename `tests/multishard_writer_test.cc` to
`tests/mutation_writer_test.cc`, this test-suite will be the home of all
the different mutation writers' unit test cases.
2019-06-26 15:45:59 +03:00
Avi Kivity
adcc95dddc Merge "sstable: mc: reader: Optimize multi-partition scans for data sets with small partitions" from Tomasz
"
Currently, the parser and the consumer save their state and return
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.

This results in less CPU overhead.

The ka/la reader is left still stopping after row end.

Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):

perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:

Before:

   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         0.952372            4   1000000    1050009        755    1050765    1046585      976.0    971     124256       1       0        0        0        0        0        0        0  99.7%
After:

   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         0.790178            4   1000000    1265538       1150    1266687    1263684      975.0    971     124256       2       0        0        0        0        0        0        0  99.6%

Tests: unit (dev)
"

* 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla:
  sstable: mc: reader: Do not stop parsing across partitions
  sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader
  sstables: reader: Simplify _single_partition_read checking
  sstables: reader: Update stats from on_next_partition()
  sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range()
  sstables: ka/la: reader make push_ready_fragments() safe to call many times
  sstables: mc: reader: Move out-of-range check out of push_ready_fragments()
  sstables: reader: Return void from push_ready_fragments()
  sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range()
  sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end
2019-06-26 13:19:12 +03:00
Avi Kivity
06a9596491 tests: cql_test_env: disable commitlog O_DSYNC
O_DSYNC causes commitlog to pre-allocate each commitlog segment by writing
zeroes into it. In normal operation, this is amortized over the many
times the segment will be reused. In tests, this is wasteful, but under
the default workstation configuration with /tmp using tmpfs, no actual
writes occur.

However on a non-default configuration with /tmp mounted on a real disk,
this causes huge disk I/O and eventually a crash (observed in
schema_change_test). The crash is likely only caused indirectly, as the
extra I/O (exacerbated by many tests running in parallel) causes timeouts.

I reproduced this problem by running 15 copies of schema_change_test in
parallel with /tmp mounted on a real filesystem. Without this change, I
usually observe one or two of the copies crashing, with the change they
complete (and much more quickly, too).
2019-06-26 12:15:53 +02:00
Asias He
f0f0beba2e repair: Move the global tracker object into repair_service
The tracker object was a static object in repair.cc. At the time we initialize
it, we do not know the smp::count, so we have to initialize the _repairs
object when it is used on the fly.

    void init_repair_info() {
        if (_repairs.size() != smp::count) {
            _repairs.resize(smp::count);
        }
    }

This introduces a race if init_repair_info is called on different
threads (shards).

To fix, put the tracker object inside the newly introduced
repair_service object which is created in main.cc.
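
The gist of the fix can be sketched as below: when the object is constructed at a point where the shard count is already known, the per-shard container is sized once in the constructor and never resized lazily. `tracker_sketch` and its members are hypothetical and far simpler than the real tracker.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: the per-shard container is sized at construction,
// so no later call needs to check or change its size.
class tracker_sketch {
    std::vector<std::vector<int>> _repairs;  // one list of repair ids per shard
public:
    explicit tracker_sketch(std::size_t shard_count) : _repairs(shard_count) {}

    std::size_t shard_count() const { return _repairs.size(); }

    void add(std::size_t shard, int repair_id) {
        _repairs.at(shard).push_back(repair_id);
    }

    std::size_t count(std::size_t shard) const {
        return _repairs.at(shard).size();
    }
};
```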

Fixes #4593
Message-Id: <b1adef1c0528354d2f92f8aaddc3c4bee5dc8a0a.1561537841.git.asias@scylladb.com>
2019-06-26 12:53:10 +03:00
Botond Dénes
572a738777 collection: use chunked_vector to store cells
This is a quick fix to the immediate problem of large collections causing
large allocations, triggering stalls or OOM. The proper fix is to
use IMR for storing the cells, but that is a complex change that will
require time, so let's not stall/OOM in the meantime.
2019-06-26 11:40:44 +03:00
Botond Dénes
c68ffc330e types: don't copy collection_type_impl::mutation_view
Just because it's a view doesn't mean it's cheap to copy.
2019-06-26 11:39:41 +03:00
Rafael Ávila de Espíndola
94d2194c77 dht: token: Simplify operator<
While this is a strict weak ordering, it is not obvious and duplicates
a bit of logic. This patch simplifies it by using tri_compare.
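
The pattern of deriving `operator<` from a three-way comparison can be sketched as follows. `token_sketch` is a hypothetical, simplified type; the real `dht::token` and its `tri_compare` differ in detail.

```cpp
#include <cstdint>

// Hypothetical simplified token: a kind plus a value that is only
// meaningful for kind::key.
struct token_sketch {
    enum class kind { before_all_keys, key, after_all_keys };
    kind _kind;
    std::int64_t _data;
};

// Three-way comparison: negative, zero or positive, like strcmp().
inline int tri_compare(const token_sketch& a, const token_sketch& b) {
    if (a._kind != b._kind) {
        return a._kind < b._kind ? -1 : 1;
    }
    if (a._kind != token_sketch::kind::key || a._data == b._data) {
        return 0;
    }
    return a._data < b._data ? -1 : 1;
}

// The ordering logic lives in one place; operator< just delegates to it.
inline bool operator<(const token_sketch& a, const token_sketch& b) {
    return tri_compare(a, b) < 0;
}
```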

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190621204820.37874-1-espindola@scylladb.com>
2019-06-25 19:06:30 +03:00
Tomasz Grabiec
269e65a8db Merge "Sync schema before repair" from Asias
This series makes sure new schema is propagated to repair master and
follower nodes before repair.

Fixes #4575

* dev.git asias/repair_pull_schema_v2:
  migration_manager: Add sync_schema
  repair: Sync schema from follower nodes before repair
2019-06-25 19:05:29 +03:00
Amos Kong
f0cd589a75 dist: suppress the yaml load warning
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated,
as the default Loader is unsafe. Please read https://msg.pyyaml.org/load
for full details.

Fix it by using the new safe interface - yaml.safe_load()

Signed-off-by: Amos Kong <amos@scylladb.com>
Cc: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <9b68601845117274573474ede0341cc81f80efa6.1561156205.git.amos@scylladb.com>
2019-06-25 19:05:29 +03:00
Avi Kivity
fc629bb14f Merge "cql3: lift infinite bound check" from Benny & Piotr
"
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.

Fixes #432

Update test_range_deletion_scenarios unit test accordingly.
"

* 'cql3-lift-infinite-bound-check' of https://github.com/bhalevy/scylla:
  cql3: lift infinite bound check if it's supported
  service: enable infinite bound range deletions with mc
  database: add flag for infinite bound range deletions
2019-06-25 19:05:29 +03:00
Nadav Har'El
a88c9ca5a5 Merge branch 'add_proper_aggregation_for_paged_indexing_2' of git://github.com/psarna/scylla into next
Piotr Sarna says:

Fixes #4540
This series adds proper handling of aggregation for paged indexed queries.
Before this series, returned results were presented to the user in a per-page
partial manner, while they should have been returned as a single aggregated
value.

Tests: unit(dev)

Piotr Sarna (8):
  cql3: split execute_base_query implementation
  cql3: enable explicit copying of query_options
  cql3: add a query options constructor with explicit page size
  cql3: add proper aggregation to paged indexing
  cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
  tests: add query_options to cquery_nofail
  tests: add indexing + paging + aggregation test case
  tests: add indexing+paging test case for clustering keys
2019-06-25 19:05:29 +03:00
Avi Kivity
7195f75fb2 Update seastar submodule
* seastar ded50bd8a4...b629d5ef7a (9):
  > sharded: no_sharded_instance_exception: fix grammar
  > core,net: output_stream: remove redundant std::move()
  > perftune: make sure that ethtool -K has a chance of succeeding
  > net/dpdk: upgrade to dpdk-19.05
  > perftune.py: Fix a few more places where we use deprecated pyudev.Device ones
  > reactor: provide an uptime function
  > rpc: add sink::flush() to streaming api
  > Use a table to document the various build modes
  > foreign_ptr: Fix compilation error due to unused variable
2019-06-25 19:05:29 +03:00
Avi Kivity
9d21341733 review-checklist.md: add common checks
- code style
 - naming
 - micro-performance
 - concurrency
 - unit-testing
 - templates and type erasure
 - singletons
2019-06-25 19:05:29 +03:00
Piotr Sarna
efa7951ea5 main: stop view builder conditionally
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.
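
The pattern can be sketched as below, with a hypothetical `view_builder_sketch` whose `stop()` asserts that `start()` ran first, mirroring the failure mode described above. `shutdown_services` and the flag name are illustrative only.

```cpp
#include <cassert>

// Hypothetical service: stopping it without having started it trips an
// assertion.
struct view_builder_sketch {
    bool started = false;
    bool stopped = false;

    void start() { started = true; }
    void stop() {
        assert(started);
        stopped = true;
    }
};

// Stop the builder under the same condition that guarded its start.
inline void shutdown_services(view_builder_sketch& vb, bool view_building_enabled) {
    if (view_building_enabled) {
        vb.stop();
    }
}
```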

Fixes #4589
2019-06-25 19:05:29 +03:00
Asias He
bb5665331c repair: Sync schema from follower nodes before repair
Since commit "repair: Use the same schema version for repair master and
followers", repair master and followers use the same schema version
that master decides to use during the whole repair operation. If the master
has an older version of the schema, repair could ignore the data which makes
use of the new schema, e.g., writes to new columns.

To fix, always sync the schema agreement before repair.

The master node pulls schema from followers and applies locally. The
master then uses the "merged" schema. The followers use
get_schema_for_write() to pull the "merged" schema.

Fixes #4575
Backports: 3.1
2019-06-25 17:13:47 +08:00
Asias He
14c1a71860 migration_manager: Add sync_schema
Makes sure this node knows about all schema changes known by
"nodes" that were made prior to this call.

Refs: #4575
Backports: 3.1
2019-06-25 17:13:47 +08:00
Botond Dénes
d00cb4916c tests: introduce random_schema
random_schema is a utility class that provides methods for generating
random schemas as well as generating data (mutations) for them. The aim
is to make using random schemas in tests as simple and convenient as
is using `simple_schema`. For this reason the interface of
`random_schema` follows closely that of `simple_schema` to the extent
that it makes sense. An important difference is that `random_schema`
relies on `data_model` to actually build mutations. So all its
mutation-related operations work with `data_model::mutation_description`
instead of actual `mutation` objects. Once the user has arrived at the
desired mutation description they can generate an actual mutation via
`data_model::mutation_description::build()`.

In addition to the `random_schema` class, the `random_schema.hh` header
exposes the generic utility classes for generating types and values
that it internally uses.

random_schema is fully deterministic. Using the same seed and the same
set of operations is guaranteed to result in generating the same schema
and data.
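
The determinism property rests on seeding the random engine explicitly. The sketch below shows the idea with the standard `std::mt19937` engine; `generate_values` is a hypothetical helper and says nothing about how `random_schema` is actually structured.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draw `n` raw values from an engine seeded with
// `seed`. The same seed and the same sequence of draws always give the
// same values.
std::vector<std::uint32_t> generate_values(std::uint32_t seed, std::size_t n) {
    std::mt19937 engine(seed);
    std::vector<std::uint32_t> values;
    values.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        values.push_back(static_cast<std::uint32_t>(engine()));
    }
    return values;
}
```

A failing test can then be reproduced by logging the seed and feeding it back in.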
2019-06-25 12:01:33 +03:00
Botond Dénes
070d72ee23 tests/random-utils.hh: add get_real() 2019-06-25 12:01:33 +03:00
Botond Dénes
2d9f6c3b63 tests/random-utils.hh: get_int() add overloads that accept external rand engine 2019-06-25 12:01:33 +03:00
Botond Dénes
2a7710129e tests/random-utils.hh: add stepped_int_distribution 2019-06-25 12:01:33 +03:00
Botond Dénes
a3f9932a2f data_value: add ascii constructor
To allow a `data_value` with `ascii_type` to be constructed.
2019-06-25 12:01:33 +03:00
Botond Dénes
1bd8b77770 tests/data_model: approximate to the modeled data structures
Make the data modelling structures model their "real" counterparts
more closely, allowing the user greater control on the produced data.
The changes:
* Add timestamp to atomic_value (which is now a struct, not just an
    alias to bytes).
* Add tombstone to collection.
* Add row_tombstone to row.
* Add bound kinds and tombstone to range_tombstone.

Great care was taken to preserve backward compatibility, to avoid
unnecessary changes in existing code.
2019-06-25 12:01:33 +03:00
Piotr Sarna
add40d4e59 cql3: lift infinite bound check if it's supported
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.

[bhalevy] Update test_range_deletion_scenarios unit test accordingly.

Fixes #432

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-06-24 15:58:34 +03:00
Piotr Sarna
c19fdc4c90 service: enable infinite bound range deletions with mc
As soon as it's agreed that the cluster supports sstables in mc format,
infinite bound range deletions in statements can be safely enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-06-24 15:58:28 +03:00
Piotr Sarna
e77ef849af database: add flag for infinite bound range deletions
Database can only support infinite bound range deletions if sstable mc
format is supported. As a first step to implement these checks,
an appropriate flag is added to database.
2019-06-24 15:57:47 +03:00
Piotr Sarna
b668ee2b2d tests: add indexing+paging test case for clustering keys
Indexing a non-prefix part of the clustering key has a separate
code path (see issue #3405), so it deserves a separate test case.
2019-06-24 14:51:17 +02:00
Piotr Sarna
3d9a37f28f tests: add indexing + paging + aggregation test case
Indexed queries used to erroneously return partial per-page results
for aggregation queries. This test case used to reproduce the problem
and now ensures that there would be no regressions.

Refs #4540
2019-06-24 14:06:42 +02:00
Piotr Sarna
60cafcc39c tests: add query_options to cquery_nofail
The cquery_nofail utility is extended, so it can accept custom
query options, just like execute_cql does.
2019-06-24 14:06:41 +02:00
Piotr Sarna
fe18638de3 cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
The constant will be later used in test scenarios.
2019-06-24 13:21:37 +02:00
Piotr Sarna
bb08af7e68 cql3: add proper aggregation to paged indexing
Aggregated and paged filtering needs to aggregate the results
from all pages in order to avoid returning partial per-page
results. It's a little bit more complicated than regular aggregation,
because each paging state needs to be translated between the base
table and the underlying view. The routine keeps fetching pages
from the underlying view, which are then used to fetch base rows,
which go straight to the result set builder.
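
Reduced to its core, the loop keeps fetching pages and folds them into one running aggregate instead of returning a value per page. The sketch below does this for a COUNT-like aggregate; `page`, `count_all_pages` and the fetch callback are hypothetical, and the translation of paging state between view and base table is left out entirely.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Simplified model: a page is a batch of rows; an empty page means the
// source is exhausted.
using page = std::vector<int>;

// Keep fetching pages and accumulate a single aggregated count.
std::size_t count_all_pages(const std::function<page(std::size_t)>& fetch_page) {
    std::size_t total = 0;
    for (std::size_t index = 0;; ++index) {
        page p = fetch_page(index);
        if (p.empty()) {
            break;
        }
        total += p.size();
    }
    return total;
}
```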

Fixes #4540
2019-06-24 13:21:32 +02:00
Piotr Sarna
97d476b90f cql3: add a query options constructor with explicit page size
For internal use, there already exists a query_options constructor
that copies data from another query_options with overwritten paging
state. This commit adds an option to overwrite page size as well.
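
A constructor overload of this kind might look like the sketch below. `query_options_sketch` and its fields are invented for illustration and do not reflect the real `query_options` layout.

```cpp
#include <string>
#include <utility>

// Hypothetical, heavily simplified options object.
class query_options_sketch {
    int _consistency;
    int _page_size;
    std::string _paging_state;
public:
    query_options_sketch(int consistency, int page_size, std::string paging_state)
        : _consistency(consistency)
        , _page_size(page_size)
        , _paging_state(std::move(paging_state)) {}

    // Copy everything from `other`, overriding only the paging state.
    query_options_sketch(const query_options_sketch& other, std::string paging_state)
        : _consistency(other._consistency)
        , _page_size(other._page_size)
        , _paging_state(std::move(paging_state)) {}

    // Same, but also override the page size.
    query_options_sketch(const query_options_sketch& other, std::string paging_state, int page_size)
        : _consistency(other._consistency)
        , _page_size(page_size)
        , _paging_state(std::move(paging_state)) {}

    int consistency() const { return _consistency; }
    int page_size() const { return _page_size; }
    const std::string& paging_state() const { return _paging_state; }
};
```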
2019-06-24 13:21:32 +02:00
Piotr Sarna
fa89e220ef cql3: enable explicit copying of query_options 2019-06-24 12:57:04 +02:00