Commit Graph

14517 Commits

Author SHA1 Message Date
Duarte Nunes
d757c87107 cql3/query_processor: Remove prepared statements upon dropping a view
Fixes #3198

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180209143652.31852-1-duarte@scylladb.com>
2018-02-09 16:30:28 +00:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's no longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Duarte Nunes
456b678e0b database.hh: Fix data query stage argument type
Fixes a merge gone wrong.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180208163338.25238-1-duarte@scylladb.com>
2018-02-08 16:35:10 +00:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvement for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvement for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
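The two pairs of cassandra-stress results in the xxHash merge message above can be compared directly. The message does not say which metric its 21% and 37% figures are derived from, so the following is only a minimal sketch of the arithmetic, using the op rates and total operation times copied from those results. The function and dictionary names are hypothetical.

```python
def parse_hms(text):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def throughput_gain(new_rate, old_rate):
    """Relative increase in op rate, in percent of the old rate."""
    return (new_rate - old_rate) / old_rate * 100.0


def time_reduction(new_time, old_time):
    """Relative decrease in total run time, in percent of the old time."""
    old_seconds = parse_hms(old_time)
    return (old_seconds - parse_hms(new_time)) / old_seconds * 100.0


# (op rate, total operation time) copied from the results in the message.
RUNS = {
    "1024-byte cells": {"xxhash": (32699, "00:05:05"), "md5": (25241, "00:06:36")},
    "4096-byte cells": {"xxhash": (19964, "00:08:20"), "md5": (12773, "00:13:02")},
}


def summarize(runs=RUNS):
    """Return {label: (throughput gain %, run-time reduction %)}."""
    summary = {}
    for label, run in runs.items():
        new_rate, new_time = run["xxhash"]
        old_rate, old_time = run["md5"]
        summary[label] = (
            throughput_gain(new_rate, old_rate),
            time_reduction(new_time, old_time),
        )
    return summary


if __name__ == "__main__":
    for label, (gain, reduction) in summarize().items():
        print(f"{label}: op rate +{gain:.1f}%, run time -{reduction:.1f}%")
```

With these inputs the run-time reduction comes out at roughly 23% and 36%, and the op-rate gain at roughly 30% and 56%, so the size of the improvement depends noticeably on which metric is quoted.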
Avi Kivity
6298655178 Merge "Inline and optimise more aggressively" from Paweł
"We have noticed in the past that the compiler is too conservative when it comes
to deciding which functions to inline. Since inlining functions enables further
optimisations such as constant folding, in some cases the difference in performance
was significant enough to force us to add the [[gnu::always_inline]] attribute in
numerous places. However, this is neither a practical nor an elegant solution.

A better way to deal with the problem is to adjust the compiler tunables that
control the heuristics used for making inlining decisions. In particular,
inline-unit-growth seems to affect the performance of the emitted code most.

Apart from making the compiler more eager to inline functions, bumping the
optimisation level to -O3 also seems to have a positive impact on
performance.

Fixes #1644.

Tests: unit-test (release)

Performance tested with gcc 7.3.

Macrobenchmark
perf_simple_query
Flags: -c4 --duration 60
All results are medians.

         ./before    ./after   diff
 read   338662.12  405377.80  19.7%
 write  387378.89  466744.15  20.5%

Microbenchmarks
single run duration:      1.000s
number of runs:           5

BEFORE
test                                      iterations      median         mad         min         max
combined.one_row                              858933   536.389ns     0.819ns   534.823ns   537.208ns
combined.single_active                          8469    77.131us    11.000ns    77.118us    77.145us
combined.many_overlapping                       1199   664.105us   160.807ns   663.818us   668.527us
combined.disjoint_interleaved                   8100    75.522us    22.254ns    75.500us    75.732us
combined.disjoint_ranges                        8288    72.580us    10.571ns    72.568us    72.599us
memtable.one_partition_one_row               1216233   825.581ns     0.446ns   821.450ns   826.027ns
memtable.one_partition_many_rows              127336     7.855us     2.153ns     7.853us     7.898us
memtable.many_partitions_one_row               57919    17.356us     6.028ns    17.259us    17.362us
memtable.many_partitions_many_rows              4751   210.496us   102.339ns   210.393us   211.188us

AFTER
test                                      iterations      median         mad         min         max
combined.one_row                             1002321   450.292ns     0.313ns   447.202ns   450.605ns
combined.single_active                          9605    67.086us     8.620ns    67.073us    67.115us
combined.many_overlapping                       1476   519.554us     5.334ns   519.549us   519.953us
combined.disjoint_interleaved                   9280    64.363us     5.328ns    64.335us    64.369us
combined.disjoint_ranges                        9481    61.893us     3.620ns    61.885us    61.903us
memtable.one_partition_one_row               1432668   699.775ns     0.106ns   696.023ns   699.918ns
memtable.one_partition_many_rows              153692     6.536us     6.885ns     6.501us     6.543us
memtable.many_partitions_one_row               63319    15.879us     5.080ns    15.793us    15.884us
memtable.many_partitions_many_rows              5659   176.717us    66.770ns   176.650us   177.778us"

* tag 'optimise-and-inline/v2' of https://github.com/pdziepak/scylla:
  configure.py: set optimisation level to -O3
  configure.py: set inline-unit-growth to 300
  configure.py: flag_supported: support flags with spaces
  configure.py: rename warning_supported to flag_supported
  configure.py: pass optimisation flags to seastar/configure.py
  cql3/select_statement: do not capture stack variables by reference
2018-02-08 17:45:41 +02:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
on his behalf. I've done a ton of testing on the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Paweł Dziepak
eb5b76ea50 configure.py: set optimisation level to -O3 2018-02-08 14:46:11 +00:00
Paweł Dziepak
bc65659a46 configure.py: set inline-unit-growth to 300
It has been discovered that the compiler is too conservative when
deciding which functions to inline. In particular, the limiting tunable
turned out to be inline-unit-growth which limits inlining in large
translation units.
2018-02-08 14:46:11 +00:00
Paweł Dziepak
89063a9cc0 configure.py: flag_supported: support flags with spaces 2018-02-08 14:46:11 +00:00
Paweł Dziepak
8f4b30b572 configure.py: rename warning_supported to flag_supported
warning_supported() can be used to detect support of any compiler flag,
not just warnings.
2018-02-08 14:46:11 +00:00
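The configure.py commits above concern probing whether the compiler accepts a given flag, including flags made of several space-separated words (GCC tunables, for example, are passed as `--param name=value`). The following is a minimal sketch of such a probe, not the code from configure.py: the function body, the `run` parameter and the default compiler name are assumptions made for illustration.

```python
import shlex
import subprocess
import tempfile


def flag_supported(flag, compiler="g++", run=subprocess.run):
    """Return True if `compiler` accepts `flag` (hypothetical helper).

    `flag` may contain spaces, e.g. "--param inline-unit-growth=300";
    it is split into separate argv entries before being passed on.
    `run` is injectable so the probe can be exercised without a compiler.
    """
    with tempfile.NamedTemporaryFile(suffix=".cc", mode="w") as src:
        src.write("int main() { return 0; }\n")
        src.flush()
        # -Werror turns "unknown option" warnings into failures.
        cmd = [compiler, "-Werror", *shlex.split(flag), "-fsyntax-only", src.name]
        result = run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Splitting the flag with `shlex.split` is what lets a multi-word flag reach the compiler as separate arguments rather than as one string containing a space.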
Paweł Dziepak
a8372b87eb configure.py: pass optimisation flags to seastar/configure.py 2018-02-08 14:46:11 +00:00
Paweł Dziepak
b635fec9bf cql3/select_statement: do not capture stack variables by reference
Default capture by reference considered harmful in async code.
2018-02-08 14:46:10 +00:00
Avi Kivity
ee763d889a Merge seastar upstream
* seastar 6d02263...2b0a81d (7):
  > configure.py: add -Wno-stringop-overflow
  > configure.py: add --optflags for specifying optimisation flags
  > build: add protobuf-compiler to docker dev image
  > build: update docker builder to newer Fedora
  > json_element: stream_object to get its parameter by value
  > json_element: stream range object
  > build: add yaml-cpp-devel installation to Dockerfile
2018-02-08 16:45:01 +02:00
Raphael S. Carvalho
312bd9ce25 Remove SSTable's atomic deletion manager
Not used anymore, can be deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:38:45 -02:00
Raphael S. Carvalho
1472cfcc19 Stop using SSTable's atomic deletion manager
The motivation is that it's no longer needed after the new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:27:17 -02:00
Raphael S. Carvalho
b78881c0e9 database: split column_family::rebuild_sstable_list
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:18:18 -02:00
Glauber Costa
4272279bbb controllers: unify the I/O and CPU controllers
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.

Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one.  We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.

In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:30 -05:00
Glauber Costa
7b6f188e27 controllers: allow a static priority to override the controller output
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.

That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
6f295a2a8a controllers: update control points for memtable I/O controller
Right now CPU and I/O controllers have slightly different control points
for no good reason. Let's use the CPU controller ones as the standard, as
we have been using it in the field for longer and trust it more.

The end goal is to fully integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
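The controller commits above mention control points and a static value that overrides the controller's output. One way to picture this, assuming a piecewise-linear mapping from a normalized backlog to shares (an assumption, the messages do not describe the mapping), is sketched below. All names and the sample control points are hypothetical.

```python
from bisect import bisect_right


def controller_shares(backlog, control_points, static_shares=None):
    """Map a backlog value to scheduler shares (illustrative sketch).

    `control_points` is a list of (backlog, shares) pairs sorted by backlog.
    Values between two points are linearly interpolated; values outside
    the range are clamped to the first or last point. When `static_shares`
    is given it overrides the computed output entirely.
    """
    if static_shares is not None:
        return float(static_shares)

    xs = [x for x, _ in control_points]
    if backlog <= xs[0]:
        return float(control_points[0][1])
    if backlog >= xs[-1]:
        return float(control_points[-1][1])

    # Index of the first control point strictly to the right of `backlog`.
    i = bisect_right(xs, backlog)
    (x0, y0), (x1, y1) = control_points[i - 1], control_points[i]
    return y0 + (backlog - x0) * (y1 - y0) / (x1 - x0)


# Hypothetical control points within the 1-1000 shares range.
EXAMPLE_POINTS = [(0.0, 10.0), (0.5, 100.0), (1.0, 1000.0)]
```

Keeping the override as a separate argument means the dynamic path stays untouched when no static value is configured.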
Glauber Costa
b895d495cc controllers: allow memtable I/O controller to have shares statically set
This is so it looks more like the CPU controller. The end goal is to integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c099c98676 controllers: retire auto_adjust_flush_quota
It no longer makes sense now that we have the full scheduler +
controllers.  In its place, we will provide an option to statically set
the controller's shares as a safeguard against us getting this wrong.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
2c1d5cf966 database: remove cpu_flush_quota metric
We can now grab that from the CPU scheduler, which exports both runtime
and shares.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c4974392b7 allow update_cache and clear_gently to use the entire task quota.
We have had a quota of partitions to process in clear_gently /
update_cache, so that we don't overwork. However, with those things now
being in their own task group there is no harm in allowing it to run
until we reach a natural preemption point.

While we are at it, clear_gently did not check for need_preempt()
before, so this patch fixes it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
a3a4d0a17a row_cache: actually use the scheduling group for update_cache
We have moved clear_gently from using a seastar::thread's scheduling_group to
using the CPU scheduler's. However, update_cache was forgotten.

This patch fixes that and gets rid of the old group just in case.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
ce94e6deb7 database: place data_query execution stage into scheduling_group
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.

Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
2018-02-07 17:19:29 -05:00
Avi Kivity
2ee163d32b config: mark background_writer_scheduling_quota as Unused
Since the background writer flush quota config is no longer used, mark
it Unused.
2018-02-07 17:19:29 -05:00
Avi Kivity
ac525c9124 memtable, database: make memtable::clear_gently() inherit scheduling_group
Instead of using a private thread_scheduling_group, make clear_gently use
its caller's scheduling_group to control resource usage.
2018-02-07 17:19:29 -05:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based on a patch from Avi with the same subject, but the
differences are significant enough that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, the submission of compactions
   itself is now run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Glauber Costa
98549775fa sstable_tests: make sure min_threshold is set explicitly
The SSTable tests are a bit fragile now because they rely on min_threshold
having a particular value. That is the default value, but if I change that
default - which I am planning to do - the test breaks.

Right now the test is not broken, but if we are planning on relying on a
property having a particular value in tests, we should explicitly set it.

So I am proactively changing min_threshold in the tests to have the value
of 4 explicitly, so we can change that in the future without breaking anything.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180207155513.12498-1-glauber@scylladb.com>
2018-02-07 18:45:52 +01:00
Tomasz Grabiec
d398aa913e cache: Fix calculation of active_reads()
Message-Id: <1518023341-27855-1-git-send-email-tgrabiec@scylladb.com>
2018-02-07 17:20:00 +00:00
Takuya ASADA
2c2173917c dist/common/scripts/scylla_raid_setup: skip blkdiscard when disk is not supported TRIM
Since we unconditionally run blkdiscard on disks, we may get an ioctl error
message on some disks which do not support TRIM.

This can be ignored, but it's bad UX, so let's skip running blkdiscard when TRIM
is not supported on the disk.

Fixes #2774

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517992904-13838-1-git-send-email-syuu@scylladb.com>
2018-02-07 13:30:05 +02:00
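The check described in the scylla_raid_setup commit above can be sketched as follows. This is not the script's actual code: it assumes the Linux sysfs attribute `queue/discard_max_bytes`, where a value of 0 indicates that the device does not support discard. The function name and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path


def supports_trim(device, sysfs_root="/sys/block"):
    """Return True if the whole-disk `device` (e.g. "sda") reports discard support.

    Reads <sysfs_root>/<device>/queue/discard_max_bytes; a missing or
    unreadable attribute is treated as "not supported".
    """
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        return False
```

A setup script could call this once per disk and run blkdiscard only on the disks for which it returns True.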
Paweł Dziepak
6ccd317c38 Merge "Do not evict from memtable snapshots" from Tomasz
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record the evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.

Tests: unit (release)"

* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test for eviction with non-evictable snapshots
  mutation_partition: Define + operator on tombstones
  tests: mvcc: Check that partition is fully discontinuous after eviction
  tests: row_cache: Add test for memtable readers surviving flush and eviction
  memtable: Make printable
  mvcc: Take partition_entry by const ref in operator<<()
  mvcc: Do not evict from non-evictable snapshots
  mvcc: Drop unnecessary assignment to partition_snapshot::_version
  tests: Use partition_entry::make_evictable() where appropriate
  mvcc: Encapsulate construction of evictable entries
2018-02-06 14:46:24 +00:00
Tomasz Grabiec
3c51cc79d5 tests: mvcc: Add test for eviction with non-evictable snapshots 2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d37131d320 mutation_partition: Define + operator on tombstones 2018-02-06 14:24:19 +01:00
Tomasz Grabiec
ec5fe5b207 tests: mvcc: Check that partition is fully discontinuous after eviction
evict() should remove everything, including range tombstones, so whole
clustering range should be marked as discontinuous.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
c1b82e60e3 tests: row_cache: Add test for memtable readers surviving flush and eviction
Reproduces https://github.com/scylladb/scylla/issues/3186
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d85d651e0f memtable: Make printable
Useful when debugging test failures.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
06b7b54c3d mvcc: Take partition_entry by const ref in operator<<()
Some users will only have const&.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
50f5bee12e mvcc: Do not evict from non-evictable snapshots
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record the evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
c391bff1d2 mvcc: Drop unnecessary assignment to partition_snapshot::_version
merge_partition_versions() is responsible for merging versions
unpinned by the current snapshot. If that fails, we don't need to set
_version back since versions must be still referenced by someone else,
this snapshot is not a unique owner.

This change makes it easier to add tracking of evictability.
2018-02-06 14:24:18 +01:00
Tomasz Grabiec
439cbada2c tests: Use partition_entry::make_evictable() where appropriate 2018-02-06 14:24:18 +01:00
Raphael S. Carvalho
09f4ee808f sstables/compress: Fix race condition in segmented offset reading of shared sstable
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.

So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.

We could serialize access to at(), but that would only lead to contention issues for
shared sstables; it can be avoided by moving state out of the compress structure,
which is expected to be immutable after the sstable is loaded and fed to the shards that
own it. A sequential accessor (wrapping the state and a reference to segmented_offset) is
added to prevent the at() and push_back() interfaces from being polluted.

Tests: release mode.

Fixes #3148.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
2018-02-06 12:10:10 +02:00
Tomasz Grabiec
d899ae0f02 mvcc: Encapsulate construction of evictable entries
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
2018-02-05 17:54:03 +01:00
Avi Kivity
a94564a637 Merge seastar upstream
* seastar 21badbd...6d02263 (4):
  > build: detect name of ninja executable
  > queue: pop_eventually/push_eventually should throw when called after abort
  > build: compile libfmt out-of-line
  > core/gate: Ensure with_gate leaves gate on exception
2018-02-05 14:42:07 +02:00
Tomasz Grabiec
d21fbc26c7 tests: range_tombstone_list: Do not depend on argument evaluation order
next_pos() calls could be reordered resulting in invalid tombstones being
generated.
Message-Id: <1517833688-20022-1-git-send-email-tgrabiec@scylladb.com>
2018-02-05 12:31:37 +00:00
Tomasz Grabiec
d2baa49313 tests: Do not produce invalid range tombstones
Upper bound should not be smaller than lower bound. Found by
asserting on valid bounds.
Message-Id: <1517833602-19732-1-git-send-email-tgrabiec@scylladb.com>
2018-02-05 12:29:03 +00:00
Takuya ASADA
6d134c0c2b dist/redhat: block installing Scylla on older kernel
We use the AmbientCapabilities directive in the systemd unit, but it does not work
on older kernels and causes the following error:
"systemd[5370]: Failed at step CAPABILITIES spawning /usr/bin/scylla: Invalid argument"

It only works on kernel-3.10.0-514 (CentOS 7.3) or later, so block installing the rpm
to prevent the error.

Fixes #3176

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517822764-2684-1-git-send-email-syuu@scylladb.com>
2018-02-05 12:57:17 +02:00
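The kernel requirement in the dist/redhat commit above is enforced at package-install time. Purely as an illustration of the version comparison involved, and not of the rpm mechanism itself, a runtime check could look like the sketch below. It assumes RHEL/CentOS-style release strings such as "3.10.0-514.el7.x86_64"; the function and constant names are hypothetical.

```python
import re

# Minimum kernel named in the commit message: 3.10.0-514.
MIN_KERNEL = (3, 10, 0, 514)


def kernel_tuple(release):
    """Parse "X.Y.Z[-N]..." into (X, Y, Z, N); N defaults to 0 when absent."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(group) if group else 0 for group in match.groups())


def kernel_supported(release, minimum=MIN_KERNEL):
    """Return True if `release` is at least `minimum`."""
    return kernel_tuple(release) >= minimum
```

For example, `kernel_supported(platform.release())` would apply the check to the running kernel, with `platform` imported from the standard library.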
Duarte Nunes
46099e4f58 tests/role_manager_test: Stop role_manager
Not stopping them may cause the tests to fail due to an asynchronous
process being scheduled and accessing freed data.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180202221640.28609-1-duarte@scylladb.com>
2018-02-05 09:39:59 +00:00
Avi Kivity
6919c7434e Merge seastar upstream
* seastar 19efbd9...21badbd (4):
  > reactor: change adjustment method for tasks becoming active
  > Merge 'Update ARM port' from Avi
  > http: Do not wait for close connection on stop if listen did not completed
  > core/future-util: Don't allow rvalues in do_for_each()
2018-02-04 14:28:28 +02:00