Commit Graph

10714 Commits

Author SHA1 Message Date
Glauber Costa
aa375cd33d commitlog: use commitlog priority for replay
Right now replay is being issued with the standard seastar priority.
The rationale for that at the time is that it is an early event that
doesn't really share the disk with anybody.

That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.

Fixes #1856

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:02 -05:00
Glauber Costa
4d3d774757 commitlog: close replay file
Replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 12:35:24 -05:00
Avi Kivity
eaf83ab59c Merge seastar upstream
* seastar 3001c08...31c5fd7 (2):
  > Safe use of collectd during shutdown
  > udp: abort reader and writer when udp channel close
2016-11-17 18:44:28 +02:00
Piotr Jastrzebski
9d33948487 mutation_rebuilder: fix fragment size calculation
It wasn't calculating the size of data correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c03dfff7bf1ca3199991e5864189f98bfa2942ea.1479397736.git.piotr@scylladb.com>
2016-11-17 16:23:42 +00:00
Raphael S. Carvalho
3dc9294023 db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
2016-11-17 17:46:36 +02:00
Gleb Natapov
c052a1bc4f sstable: use schema's min_index_interval config when generating missing summary
Message-Id: <20161116181937.GA25303@scylladb.com>
2016-11-17 15:24:03 +02:00
Avi Kivity
5d067eebf2 Merge "get rid of memtable size parameter and rework flush logic" from Glauber
"This patchset allows Scylla to determine the size of a memtable instead
of relying in the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.

Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.

After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allow us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.

I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:

Before:

op rate                   : 632 [WRITE:632]
partition rate            : 632 [WRITE:632]
row rate                  : 632 [WRITE:632]
latency mean              : 157.8 [WRITE:157.8]
latency median            : 115.5 [WRITE:115.5]
latency 95th percentile   : 486.7 [WRITE:486.7]
latency 99th percentile   : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max               : 722.6 [WRITE:722.6]
Total partitions          : 189667 [WRITE:189667]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

After:

op rate                   : 951 [WRITE:951]
partition rate            : 951 [WRITE:951]
row rate                  : 951 [WRITE:951]
latency mean              : 104.8 [WRITE:104.8]
latency median            : 102.5 [WRITE:102.5]
latency 95th percentile   : 155.8 [WRITE:155.8]
latency 99th percentile   : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max               : 1081.4 [WRITE:1081.4]
Total partitions          : 285324 [WRITE:285324]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."

* 'issue-1817-v4' of github.com:glommer/scylla:
  database: rework memtable flush logic
  get rid of max_memtable_size
  pass a region to dirty_memory_manager accounting API
  memtable: add a method to expose the region_group
  logalloc: allow region group reclaimer to specify a soft limit
  database: remove outdated comment
  database: uphold virtual dirty for system tables.
2016-11-17 14:36:43 +02:00
Avi Kivity
18078bea9b storage_proxy: avoid calculating digest when only one replica is contacted
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all.  The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).

This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.

In the future we may add other values for more hardware-friendly digest
algorithms.
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
2016-11-17 13:04:30 +02:00
Asias He
dc50ce0ce5 streaming: Make the mutation readers when streaming starts
Currenlty we make the mutation readers for streaming at different
time point, i.e.,

do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
     make a mutation reader for this range
     read mutations from the reader and send
})

If there are write workload in the background, we will stream extra
data, since the later the reader is made the more data we need to send.

Fix it by making all the readers before starting to stream.

Fixes #1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
2016-11-17 12:41:53 +02:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If sstable Summary is not present Scylla does not refuses to boot but
instead creates summary information on the fly. There is a bug in this
code though. Summary files is a map between keys and offsets into Index
file, but the code creates map between keys and Data file offsets
instead. Fix it by keeping offset of an index entry in index_entry
structure and use it during Summary file creation.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
Glauber Costa
f08162e181 database: rework memtable flush logic
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.

This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.

After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.

For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.

Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.

Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges.  That is the perfect moment for us to act.
It indicates that all the data flushing part is done.

The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:58 -05:00
Glauber Costa
895e838ac0 get rid of max_memtable_size
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.

It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.

The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.

The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When set the threshold to 1/4 of dirty, we
don't throttle at all.

This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
2ed3f342c1 pass a region to dirty_memory_manager accounting API
We would like to know from which region is a particular flush coming
from, and account accordingly. The reasoning behind that, is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explcitly triggering them.

We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
0b337dab14 memtable: add a method to expose the region_group
That is technically not needed because a memtable inherits from group. So
whenever we have a memtable, we can use it's group() method to obtain a
group for it, and then from there go to the region_group.

However, region() is a const method in the memtable, so we have to play
trick with the const_cast, or remove the constness from the region. An
alternative to that, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
f86c9e36f4 logalloc: allow region group reclaimer to specify a soft limit
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.

However, For some applications, we would be interested in knowing about
memory build up earlier, so we can start doing something about it before
we reach that condition.

This patch introduce soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.

There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
da738a6cd1 database: remove outdated comment
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
919de98aa5 database: uphold virtual dirty for system tables.
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start thottling earlier as it was supposed to.

In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Avi Kivity
f26c6569d2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 61ff5c6...25e101f (1):
  > scylla_install_ami: delete unneeded authorized_keys from AMI image
2016-11-16 22:36:31 +02:00
Takuya ASADA
3802f289f8 dist: remove bc from dependency
Since we replaced shellscript based cpuset generator with python based one,
we no longer depends to bc command.
d571123afd

So drop it from .rpm/.deb dependency.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479152876-11020-1-git-send-email-syuu@scylladb.com>
2016-11-16 15:02:55 +02:00
Amnon Heiman
a4be7afbb0 API: cache_capacity should use uint for summing
Using integer as a type for the map_reduce causes number over overflow.

Fixes #1801

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1479299425-782-1-git-send-email-amnon@scylladb.com>
2016-11-16 13:55:46 +01:00
Avi Kivity
31d0e31de2 Merge seastar upstream
* seastar 47e1821...3001c08 (5):
  > core: Introduce weak_ptr<>
  > timer: Add missing include
  > tutorial: fix TeX template
  > Merge "Adding the metrics layer" from Amnon
  > core/memory: let malloc(0) return a valid pointer
2016-11-16 14:20:49 +02:00
Pekka Enberg
8a4bd6ecd5 README: Guidelines for contributing
Message-Id: <1479288359-14168-1-git-send-email-penberg@scylladb.com>
2016-11-16 12:50:02 +02:00
Paweł Dziepak
f877be50b0 Merge "Keep wide partition cache entry longer than others" from Piotr
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
2016-11-15 20:44:52 +00:00
Paweł Dziepak
b8d737ff0a tests/row_cache_test: verify that eviction follows lru
Refs #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:57:54 +01:00
Paweł Dziepak
999dafbe57 row_cache: touch entries read during range queries
Fixes #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:54:11 +01:00
Tomasz Grabiec
11c5f4ab50 storage_proxy: Add counters for throttled writes 2016-11-15 17:18:25 +01:00
Piotr Jastrzebski
5ec668c9c6 Add separate LRU for wide partitions.
Evict wide partitions only every 1000 normal partition
evictions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 16:19:13 +01:00
Piotr Jastrzebski
9a41bfbf69 Add collectd metric for wide partition evictions.
This will allow us to see how big is an amount
of evictions of cached info about wide partitions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 15:53:14 +01:00
Paweł Dziepak
055d78ee4c query_pagers: distinct queries do not have clustering keys
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.

The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries regardless of the schema which will make pager
correctly expect each partition to return at most one rows.

Fixes #1822.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 11:06:01 +01:00
Glauber Costa
93386bcec7 histograms: do not use latency_in_nano
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.

This patch will keep everything as a duration until last moment, and
then we'll convert when needed.

This was suggested by Amnon.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
2016-11-14 18:01:43 +02:00
Nadav Har'El
c5254b6502 repair: fix undefined variable
If the "trace" parameter of the repair was not given, we will use the
"trace" variable without setting it. We need to set a default value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1479136239-14204-1-git-send-email-nyh@scylladb.com>
2016-11-14 17:16:19 +02:00
Raphael S. Carvalho
e86de40b49 compaction_manager: inform about compaction cancelled by shutdown
After some changes in compaction manager, user no longer is informed
that compaction was cancelled in event of shutdown. That's because
we only ignore ready future when compaction manager was asked to
stop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>
2016-11-14 16:37:33 +02:00
Piotr Jastrzebski
4fe989d58e Cleanup sstables::mutation_reader::impl
Pointer to sstable seems unnecessary.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>
2016-11-14 11:52:52 +00:00
Avi Kivity
14c1b17105 storage_service: fix construct_range_to_endpoint_map with semi-infinite range
After the conversion to nonwrapping ranges, construct_range_to_endpoint_map()
may be called with semi-infinite token ranges, but it does not expect this,
calling nonwrapping_range::end()->value() unconditionally.

Fix by checking whether this is a semi-infinite range on the right, and
replace ->value() by maximum_token() instead.

Fixes `nodetool describering` (once more).
Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>
2016-11-14 11:39:48 +01:00
Raphael S. Carvalho
9a9f0d3a0f main: fix exception handling when initializing data or commitlog dirs
Exception handling was broken because after io checker, storage_io_error
exception is wrapped around system error exceptions. Also the message
when handling exception wasn't precise enough for all cases. For example,
lack of permission to write to existing data directory.

Fixes #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
2016-11-14 12:34:10 +02:00
Takuya ASADA
d571123afd dist/common/scripts/scylla_sysconfig_setup: stop using 'bc' command to generate cpuset parameter, use python script instead
We get error from bc command when we run the script on >34 ncpus,
to prevent the issue add a python script to generate cpuset parameter.

Fixes #1824

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478887624-12737-1-git-send-email-syuu@scylladb.com>
2016-11-14 11:45:23 +02:00
Duarte Nunes
66f6a367a4 ring_position_range_sharder: Avoid copying eagerly
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161104115632.15974-1-duarte@scylladb.com>
2016-11-13 11:42:23 +02:00
Avi Kivity
bf20aa722b Merge "Fixes for histogram and moving average calculations" from Glauber
"JMX metrics were found to be either not showing, or showing absurd
values.  Turns out there were multiple things wrong with them. The
patches were sent separately but conflict with one another. This series
is a collection of the patches needed to fix the issues we saw.

Fixes #1832, #1836, #1837"
2016-11-13 11:16:32 +02:00
Avi Kivity
2670e46f3e storage_service: deinline most methods
Most inline methods in storage_service are too large to be inlined, and
just increase compile time.  De-inline them.
2016-11-12 21:12:28 +02:00
Glauber Costa
608d825790 histogram: fix reporting units
We are tracking latencies in microseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future proof and embed the unit into the
type - and we then default to microseconds.

I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.

Fixes #1836

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-11 11:36:56 -05:00
Glauber Costa
1342d044eb moving averages: change metrics calculation
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.

With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.

This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.

Fixes  #1837

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-10 22:33:34 -05:00
Amnon Heiman
a977ea85e1 histogram: moving_average and total rate should be calculate in seconds
The moving average and the total average should be calculated in seconds
and not nanoseconds.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-11-10 22:32:53 -05:00
Glauber Costa
d3f11fbabf histogram: moving averages: fix inverted parameters
moving_averages constructor is defined like this:

    moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)

But when it is time to initialize them, we do this:

	... {tick_interval(), std::chrono::minutes(1)} ...

As it can be seen, the interval and tick interval are inverted. This
leads to the metrics being assigned bogus values.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
2016-11-10 11:28:51 -08:00
Paweł Dziepak
f16d6f9c40 partition_version: make sure that snapshot is destroyed under LSA
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that exception thrown by _read_section
prevented that from happenning making snapshot destoryed implicitly
without current allocator set to LSA.

Refs #1831.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
2016-11-10 13:13:10 +01:00
Gleb Natapov
27e041606b fix LOCAL_ONE printout
Message-Id: <20161109125307.GH7766@scylladb.com>
2016-11-09 12:53:55 +00:00
Duarte Nunes
e680587b8a sstable_test: Be explicit about uncompressed tables
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
2016-11-09 11:21:59 +02:00
Pekka Enberg
b3dea313dd Merge "API changes for Cassandra 3.x migration" from Calle
"Mostly small changes/additions to the API calls to match Cv3
 requirements/semantics, i.e. updated scylla-jmx can implement required
 nodetool etc calls in a working fashion."
2016-11-09 10:30:32 +02:00
Duarte Nunes
e33c02aa60 cql3: Disable compression on empty properties
The CQL 3.1 documentation specifies that for disabling compression,
users should use an empty string:

ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''};

However, Cassandra also accepts the absence of the sstable_compression
option to disable compression. The patch 7c28ed prevented this behavior
in Scylla, which this patch aims to fix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>
2016-11-09 10:03:59 +02:00
Gleb Natapov
93f068bd44 storage_proxy: fix speculation target selection logic
Current speculation target selection logic has several bugs in multi-dc
setup. It may select a non local target for CL=LOCAL and it may select
more than one target to speculate, one of which is non local.

Examples:

1. Two dataceneters: DC1 RF 2, DC2 RF 2  and read with LOCAL_QUORUM.

In this scenario db::filter_for_query() will return both replicas from
local DC and speculation target selection logic will peek one one which
will be in different DC.

2. Two dataceneters: DC1 RF 2, DC2 RF 2  and read with LOCAL_ONE + RRD.DC_LOCAL

In this scenario db::filter_for_query() will return all nodes in local DC and
there already be enough nodes to speculate, but current logic will add
one node from non local dc as a speculation target.

The patch below fixed both of those scenarios.

Message-Id: <20161103154637.GS7766@scylladb.com>
2016-11-08 18:32:47 +01:00
Paweł Dziepak
a8308e2a8d row_cache: dummy entry does not count as partition
Since continuity flag introduction row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
2016-11-08 13:54:44 +01:00