Commit Graph

10897 Commits

Author SHA1 Message Date
Paweł Dziepak
5d7185fd39 db: add result_memory_limiter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
ee89d80d5c query: add result size limiter
This patch introduces infrastructure for limiting result size.

There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an individual limit which restricts a result to 4 MB.

In order to avoid sending tiny results there is minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
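The three limits described in ee89d80d5c can be sketched roughly as follows. This is a minimal illustration, not the patch's code: the class name, its methods, and the choice to return bool instead of waiting are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: class and member names are illustrative only.
class result_size_limiter_sketch {
    std::size_t _available;  // bytes left in the shard-local budget
public:
    static constexpr std::size_t minimum_result_size = 4 * 1024;         // 4 kB guaranteed
    static constexpr std::size_t maximum_result_size = 4 * 1024 * 1024;  // 4 MB per result

    // The shard-local budget is 10% of the shard memory.
    explicit result_size_limiter_sketch(std::size_t shard_memory)
        : _available(shard_memory / 10) {}

    std::size_t available() const { return _available; }

    // A query reserves the guaranteed minimum before producing a result.
    bool reserve_minimum() {
        if (_available < minimum_result_size) {
            return false;
        }
        _available -= minimum_result_size;
        return true;
    }

    // Grow a result of current_size bytes by extra bytes, if both the
    // individual limit and the shard-local budget allow it.
    bool try_grow(std::size_t current_size, std::size_t extra) {
        if (current_size + extra > maximum_result_size) {
            return false;  // individual limit reached
        }
        if (extra > _available) {
            return false;  // shard-local budget exhausted
        }
        _available -= extra;
        return true;
    }

    // Return memory to the budget once the result has been sent.
    void release(std::size_t bytes) { _available += bytes; }
};
```

A failed try_grow() is the point at which a query would stop and return a short read rather than keep accumulating rows.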
Paweł Dziepak
43fe3439ca reconcilable_result: properly propagate short_read flag
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
837d24f1b2 query_pagers: handle short reads properly
Currently, the paging implementation assumes that the server returns
either as many rows as it was asked for or has reached the end. Soon,
that's not going to be true so instead of making any assumptions about
the number of the rows returned use the new "short read" flag to
determine whether there is going to be more data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
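The difference between the two ways of detecting the end of data in 837d24f1b2 can be sketched like this. The struct and function names are hypothetical and the real pager tracks more state; the sketch only shows why the row count alone is no longer sufficient.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one page returned by the server.
struct page_sketch {
    std::uint32_t rows;  // rows actually returned
    bool short_read;     // set when the page was cut short for a reason
                         // other than running out of data
};

// Old assumption: fewer rows than requested means there is no more data.
inline bool exhausted_by_row_count(const page_sketch& p, std::uint32_t requested) {
    return p.rows < requested;
}

// With short reads: fewer rows than requested only means the end of data
// if the page is not flagged as a short read.
inline bool exhausted_with_short_read(const page_sketch& p, std::uint32_t requested) {
    return p.rows < requested && !p.short_read;
}
```

For a page of 10 rows flagged as a short read out of 100 requested, the first check would wrongly stop paging while the second keeps going.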
Paweł Dziepak
da7ca85040 query: allow short reads
When paging is used, the cluster is allowed to return fewer rows than the
client asked for. However, if it does so, we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to send and short reads caused by any other means.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Paweł Dziepak
7a15c89b1d serializer_impl: add serializer for bool_class<Tag>
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Takuya ASADA
8918a4be57 dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed
Instead of aborting scylla_setup, print a warning message and continue to the next setup script.

Fixes #1357

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>
2016-12-14 13:31:50 +02:00
Tomasz Grabiec
c9344826e9 tests: Remove unintentional enablement of trace-level logging
Sneaked in by mistake.
2016-12-14 10:58:07 +01:00
Tomasz Grabiec
fe6a70dba1 tests: commitlog: Fix assumption about write visibility
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.

Fix the test so that it expects that the number of mutations read
this way is <= the number of mutations written, and that after all
segments are synced, the number of mutations read is equal.

Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
2016-12-14 11:29:33 +02:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Takuya ASADA
c18a95cddf dist/redhat: add scylla_lib.sh to scylla.spec
Fix .rpm build error.

Fixes #1932

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>
2016-12-14 10:27:37 +02:00
Glauber Costa
56df53f51e compaction_manager: fix shutdown sequence
By the time we are able to acquire this semaphore, we may be stopped
already. So we need to test it before we go ahead. I can see shutdown
hangs before this patch that are fixed with it applied.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>
2016-12-14 09:26:24 +01:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00
Glauber Costa
80440c0d79 database: rework dirty memory hierarchy
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) for pressure purposes, and
the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have to acquire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors.

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happens we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 14:07:53 -05:00
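The effect of the linear hierarchy described in 80440c0d79 can be illustrated with a toy model. The class below is hypothetical, and the limits (60 and 50) and the amount written (70) are made-up numbers chosen only to show the coupling. It is not the region_group implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model of a dirty memory group with an optional parent.
class region_group_sketch {
    region_group_sketch* _parent;
    std::size_t _soft_limit;
    std::size_t _used = 0;
public:
    region_group_sketch(region_group_sketch* parent, std::size_t soft_limit)
        : _parent(parent), _soft_limit(soft_limit) {}

    // Memory used by a group is also accounted in all of its ancestors.
    void add(std::size_t bytes) {
        for (region_group_sketch* g = this; g != nullptr; g = g->_parent) {
            g->_used += bytes;
        }
    }

    bool under_pressure() const { return _used > _soft_limit; }
};
```

With a parent link, writes to the regular group alone push the system group over its soft limit as well. Without the link, the two groups come under pressure independently, which is the arrangement the patch moves to.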
Glauber Costa
db7cc3cba8 system keyspace: write batchlog mutation in user memory
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:35 -05:00
Glauber Costa
be9e4c71ad database: remove flush_token
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
98030ad66c database: abstract pressure condition notification
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
c9a8b03311 database: encapsulate semaphore_units into a flush_permit
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.

Preparation patch for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
2e8c7d2c62 database: remove friendship declaration
Not needed anymore since memtable started having a direct pointer to the
memtable list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
bb1509c21e database: simplify flush_one
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.

It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
8ab7c04caa database: make memtable_list aware in cases it can't flush
Some of our CFs can't be flushed. Those are the ones that are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.

We have already had trouble with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.

It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
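The idea in 8ab7c04caa, a constructor that takes no seal function and leaves an empty std::function behind, can be sketched as follows. The class and method names are hypothetical and the real memtable_list carries much more state.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical sketch of a list that knows whether it can flush.
class memtable_list_sketch {
    std::function<void()> _seal_fn;  // empty means "cannot flush"
public:
    // Regular tables provide a seal function.
    explicit memtable_list_sketch(std::function<void()> seal_fn)
        : _seal_fn(std::move(seal_fn)) {}

    // Tables without durable writes use this constructor instead of
    // passing a function that does nothing.
    memtable_list_sketch() = default;

    bool may_flush() const { return bool(_seal_fn); }

    // Returns true if a flush was actually started.
    bool request_flush() {
        if (!may_flush()) {
            return false;
        }
        _seal_fn();
        return true;
    }
};
```

The flush logic can then skip such lists when picking a memtable, instead of "flushing" one that returns right away without freeing anything.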
Takuya ASADA
0a6312d254 dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant
Use it as "if is_debian_variant; then".
Fixes #1931

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:42 +02:00
Takuya ASADA
ed4cd1908f dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection
CentOS/RHEL is using SELinux, and it's NOT Debian variant, so fixed from
"is_debian_variant" to "! is_debian_variant".

Fixes #1930

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:29 +02:00
Takuya ASADA
6c0dc55495 dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh
This fixes following command not found error:
```
/usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found
```

Fixes #1929

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:13 +02:00
Takuya ASADA
3b74c50546 dist/ubuntu: add uuidgen to package dependency
We haven't added uuidgen to Ubuntu/Debian package dependency, so scylla_setup
script may abort because of command not found.

Fixes #1928

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:28:48 +02:00
Duarte Nunes
1e75a4950e database: Complete query when hitting partition limit
Currently, we don't complete a query as early as possible if it
reaches the partition limit; instead, we have to wait until reaching the
end of the specified partition ranges. This patch fixes that by
including a check of the partition limit in the termination condition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
2016-12-13 14:53:46 +02:00
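The termination condition described in 1e75a4950e can be sketched as a loop that stops as soon as either limit is used up. The types and the function are hypothetical, and counting every visited partition against the limit is a simplification made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical limits carried by a query; not Scylla's actual types.
struct query_limits_sketch {
    std::uint32_t rows;
    std::uint32_t partitions;
};

// Returns how many partitions are read before the query completes.
std::size_t partitions_read(const std::vector<std::uint32_t>& rows_per_partition,
                            query_limits_sketch limits) {
    std::size_t read = 0;
    for (std::uint32_t rows : rows_per_partition) {
        // Stop when either limit is exhausted instead of walking to the
        // end of the partition ranges.
        if (limits.rows == 0 || limits.partitions == 0) {
            break;
        }
        limits.rows -= std::min(limits.rows, rows);
        --limits.partitions;
        ++read;
    }
    return read;
}
```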
Tomasz Grabiec
f451014785 schema: Implement operator<< for column_mapping
Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:20:46 +02:00
Tomasz Grabiec
059a1a4f22 db: Fix commitlog replay to not drop cell mutations with older schema
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestations of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.

During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.

Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.

Fixes #1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:19:32 +02:00
Avi Kivity
32d55bbb4c Merge seastar upstream
* seastar 0773e98...6fbd792 (2):
  > tls: Only run our "verify" function in client session
  > Merge "Clean the metric definition" from Amnon

Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
2016-12-13 12:17:14 +02:00
Avi Kivity
6f9c317b91 Merge "Use uuid file in housekeeping" from Amnon
"This patch adds the use of uuid file to the housekeeping daily version check.
The uuid file is optional; if the file is missing, no uuid will be used."
2016-12-13 10:52:44 +02:00
Avi Kivity
c67782f169 Merge seastar upstream
* seastar 0a74317...0773e98 (6):
  > tls: Add support for client cetrificate verification & priority strings
  > semaphore: add consume_units
  > semaphore: add available_units()
  > thread: check need_preempt for threads in a scheduling group as well
  > tutorial: fix semaphore example, and text
  > stop_iteration: add && and || operators
2016-12-12 18:06:19 +02:00
Avi Kivity
c801cc4bd1 Merge "streaming and repair updates" from Asias
"This series:
- We can make reader with ranges
- Fix possible use after free of 'si'
- Streaming ranges now are sorted and merged
- Fix shard_begin shard_end loop in both streaming and repair"
2016-12-12 11:32:42 +02:00
Asias He
ba54654af3 streaming: Use interval_set to sort and merge ranges
So that the ranges are sorted and have no overlaps. We can have fewer
ranges to deal with and it can help the mutation readers to optimize.

Here is an example of ranges generated by repair:

Before:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    before ranges = {(-3383928698815274642, -3376937163195039606],
    (-7260764223708720005, -7251657821052234309], (-4767213984179237293,
    -4747032371925842389], (-7645879646119667643, -7589962743703481776],
    (-2340199306656526861, -2320523117224780931], (-576028861239229331,
    -560973674020019962], (-4070378863644120252, -3987599893827407860],
    (-2551584407739673151, -2498779102482524711], (-5416061903556353312,
    -5354212455975869358], (37594980457713898, 67885601051654285],
    (3083778975065200884, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (5765437476661317163, 5778671293583720541],
    (5960610072466058818, 5972289771228014343], (7749618183851698485,
    7758080813117351135], (-3987599893827407860, -3899198931034439776],
    (-7251657821052234309, -7131649010279865221], (-3576581915808403133,
    -3383928698815274642], (-417850207760366422, -327959672080599465],
    (-2671876682129336880, -2551584407739673151], (-1305178847032904465,
    -1137497074548854552], (8540448858050275827, 8610171849752115483],
    (-560973674020019962, -417850207760366422], (-2498779102482524711,
    -2340199306656526861], (2394447940525988167, 2523396860109747637],
    (-6703329224557608009, -6517757811218772762], (-3675103288021821677,
    -3576581915808403133], (-5622185785296846551, -5416061903556353312],
    (8610171849752115483, 8742605005068551458], (8068079250973315241,
    8185655671734937642], (560264964510741191, 790641981923757238],
    (5581202487214475094, 5765437476661317163], (8742605005068551458,
    8923908282731801645], (-6038176423022601107, -5622185785296846551],
    (5778671293583720541, 5960610072466058818], (-3899198931034439776,
    -3675103288021821677], (8356739976149429222, 8540448858050275827],
    (-6517757811218772762, -6038176423022601107], (-8052600134279395253,
    -7645879646119667643], (-327959672080599465, 37594980457713898],
    (7758080813117351135, 8019254284118543066], (4781565016737645510,
    5067070718000527886], (2523396860109747637, 3083778975065200884],
    (-5354212455975869358, -4767213984179237293], (6784138025918878582,
    7190719703944308372], (67885601051654285, 447405341661896387],
    (-2190610927722759275, -1305178847032904465], (-4747032371925842389,
    -4070378863644120252]}, size=48

After:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    after  ranges = {(-8052600134279395253, -7589962743703481776],
    (-7260764223708720005, -7131649010279865221], (-6703329224557608009,
    -3376937163195039606], (-2671876682129336880, -2320523117224780931],
    (-2190610927722759275, -1137497074548854552], (-576028861239229331,
    447405341661896387], (560264964510741191, 790641981923757238],
    (2394447940525988167, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (4781565016737645510, 5067070718000527886],
    (5581202487214475094, 5972289771228014343], (6784138025918878582,
    7190719703944308372], (7749618183851698485, 8019254284118543066],
    (8068079250973315241, 8185655671734937642], (8356739976149429222,
    8923908282731801645]}, size=15
2016-12-12 11:09:26 +08:00
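The sort-and-merge step in ba54654af3 is done with boost::icl::interval_set. The sketch below swaps in plain std::sort and a linear pass so that it needs only the standard library. The function name and the use of std::pair for a (start, end] token range are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// A token range (start, end], as in the log output above.
using token_range = std::pair<std::int64_t, std::int64_t>;

// Sort ranges by start and join those that overlap or touch, so that
// (a, b] followed by (b, c] becomes (a, c].
std::vector<token_range> sort_and_merge(std::vector<token_range> ranges) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<token_range> merged;
    for (const token_range& r : ranges) {
        assert(r.first <= r.second);
        if (!merged.empty() && r.first <= merged.back().second) {
            // Overlaps or touches the previous range: extend it.
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}
```

Fed the two ranges from the log that meet at -3383928698815274642, this returns a single range from -3576581915808403133 to -3376937163195039606.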
Asias He
e523803a5d token_metadata: Introduce interval_to_range helper
It is used to convert a boost::icl::interval<token> interval back to a
range<token>.
2016-12-12 11:09:26 +08:00
Asias He
af3d76e6ac repair: Fix a typo in the log
sucessfully -> successfully
2016-12-12 11:09:26 +08:00
Asias He
374324e6fb repair: Fix shard_begin and shard_end
A range now alternates between different shards: the first part of the
range goes to shard X, the next to shard X+1, but after a while we go
back to shard X. So we can't do a simple loop between shard_begin and
shard_end.

Fix by using the newly introduced dht::split_range_to_shards

Use the cf.make_streaming_reader with ranges to simplify the code a bit.
2016-12-12 11:09:26 +08:00
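Why a simple loop between shard_begin and shard_end no longer works in 374324e6fb can be seen with a toy sharding function. The mapping below (fixed-size chunks assigned round-robin) and the function name are assumptions for illustration. The real code uses ring_position_range_sharder and the partitioner's own token-to-shard mapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// A half-open token range [start, end) in a toy unsigned token space.
using range_sketch = std::pair<std::uint64_t, std::uint64_t>;

// Split a range into per-shard sub-ranges, assuming (hypothetically) that
// consecutive chunks of `chunk` tokens are assigned to shards round-robin.
// Assumes the range end is far enough below UINT64_MAX not to overflow.
std::map<unsigned, std::vector<range_sketch>>
split_range_to_shards_sketch(range_sketch r, std::uint64_t chunk, unsigned shards) {
    assert(chunk > 0 && shards > 0);
    std::map<unsigned, std::vector<range_sketch>> result;
    std::uint64_t pos = r.first;
    while (pos < r.second) {
        std::uint64_t chunk_end = (pos / chunk + 1) * chunk;
        std::uint64_t end = std::min(chunk_end, r.second);
        unsigned shard = static_cast<unsigned>((pos / chunk) % shards);
        result[shard].push_back({pos, end});
        pos = end;
    }
    return result;
}
```

Splitting [5, 45) with a chunk of 10 over 2 shards gives shard 0 three separate pieces and shard 1 two, so each shard has to be handed a list of ranges rather than one contiguous slice.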
Asias He
1987264beb streaming: Make streaming reader with ranges
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.

1) Fewer readers are needed
before: number of ranges of readers
after: smp::count readers at most

2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.

3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr

4) Fix possible use after free of 'si' in do_send_mutations
We need to take a reference of 'si' when sending the mutation with
send_stream_mutation rpc call, otherwise:
   msg1 got exception
   si->mutations_done.broken()
   si is freed
   msg2 got exception
   si is used again
The issue is introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts) which is master only, branch 1.5 is not
affected.
2016-12-12 09:04:21 +08:00
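The use-after-free fix in point 4 of 1987264beb comes down to each in-flight send holding its own reference to 'si'. The sketch below uses std::shared_ptr and std::function as stand-ins for lw_shared_ptr and the rpc continuation. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the per-stream send_info state.
struct send_info_sketch {
    int completed = 0;
};

// Build a completion callback for one in-flight message. Capturing the
// shared pointer by value keeps the state alive until the callback has
// run, even if the owner has already dropped its own reference.
std::function<void()> make_completion(std::shared_ptr<send_info_sketch> si) {
    assert(si);
    return [si] { ++si->completed; };
}
```

If the callback captured a raw pointer or reference instead, the second message's handler in the sequence above would touch freed memory.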
Asias He
463cc4fbde dht: Introduce split_ranges_to_shards
Split ranges into a shard ranges map with the ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
044c4ff44c dht: Introduce split_range_to_shards
Split a range into a shard ranges map with the ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
cd2105b8bd database: make_streaming_reader for ranges
Allow making a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in a following streaming patch.

We can make the reader more efficient later.
2016-12-12 09:04:21 +08:00
Duarte Nunes
ada2f1092e dht: Make i_partitioner::tri_compare pure virtual
This patch makes the i_partitioner::tri_compare() function pure
virtual as it is overridden by all partitioners.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211172037.16496-1-duarte@scylladb.com>
2016-12-11 19:29:37 +02:00
Duarte Nunes
bb66b051ed dht: Make i_partitioner::tri_compare memory safe
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
2016-12-11 18:58:10 +02:00
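The class of bug fixed in bb66b051ed can be shown with a byte-wise three-way comparison. This is a sketch with a hypothetical function name, not the actual i_partitioner::tri_compare code. The point is only that the common prefix length must come from std::min.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Three-way comparison of two byte sequences: returns -1, 0 or 1.
int tri_compare_bytes(const std::vector<std::uint8_t>& a,
                      const std::vector<std::uint8_t>& b) {
    // Compare only the common prefix. Using std::max here would read
    // past the end of the shorter buffer.
    const std::size_t common = std::min(a.size(), b.size());
    const int c = common ? std::memcmp(a.data(), b.data(), common) : 0;
    if (c != 0) {
        return c < 0 ? -1 : 1;
    }
    // Equal prefixes: the shorter sequence sorts first.
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```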
Amnon Heiman
08dcd8cb4a scylla housekeeping ubuntu service: use uuid file
This patch adds uuid file support for Ubuntu systems. It also splits the
behaviour between restart and daily checks: the first runs in r mode and
the second in d mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:35:07 +02:00
Amnon Heiman
6fef24aaf0 housekeeping systemd service: use uuid file
This sets the housekeeping systemd service to use a uuid file and use
daily mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:02:16 +02:00
Amnon Heiman
17b8306bc4 scylla-housekeeping support uuid file
Allows scylla-housekeeping to get the uuid from a file instead of the
command line.

If the file is missing no uuid will be used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:00:34 +02:00
Avi Kivity
299d1fad0b Merge "reduce bloom filter overhead in compaction" from Raphael
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."

* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
  compaction: reduce bloom filter overhead with incremental selector
  tests: add test for sstable set's incremental selector
  sstable_set: introduce incremental selector
  compatible_ring_position: add function to return token
2016-12-11 09:46:58 +02:00
Glauber Costa
5803957ab5 compaction: fix build
Commit 732ee275 moved tracking of one statistics value inside a lambda
without capturing this in that lambda. Compilation fails as a result.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>
2016-12-10 09:00:20 +02:00
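The build failure described in 5803957ab5 is the usual one for a lambda that touches a member without capturing this. A minimal sketch, with a hypothetical class and member names:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical class that tracks a statistic from inside a lambda.
class compaction_sketch {
    std::uint64_t _keys_written = 0;
public:
    std::function<void()> make_key_tracker() {
        // `this` must be captured: with an empty capture list `[]`, the
        // access to _keys_written below does not compile.
        return [this] { ++_keys_written; };
    }

    std::uint64_t keys_written() const { return _keys_written; }
};
```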
Raphael S. Carvalho
fcfc84e836 compaction: reduce bloom filter overhead with incremental selector
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.

Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.

Fixes #1322.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
548f6066c5 tests: add test for sstable set's incremental selector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
02541e15c1 sstable_set: introduce incremental selector
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just returns
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:16 -02:00
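The incremental selection in 02541e15c1 can be sketched as a cursor over sstables sorted by first token. The struct, the class, and the use of plain integers as tokens are assumptions for illustration. The actual selector works on the sstable set's own interval structure and behaves differently for non-leveled strategies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sstable covering the token interval [first_token, last_token].
struct sstable_sketch {
    int id;
    std::int64_t first_token;
    std::int64_t last_token;
};

class incremental_selector_sketch {
    std::vector<sstable_sketch> _by_first;  // sorted by first_token
    std::size_t _next = 0;                  // next sstable not yet activated
    std::vector<sstable_sketch> _active;    // sstables overlapping the last token
    std::int64_t _last = std::numeric_limits<std::int64_t>::min();
public:
    explicit incremental_selector_sketch(std::vector<sstable_sketch> sstables)
        : _by_first(std::move(sstables)) {
        std::sort(_by_first.begin(), _by_first.end(),
                  [](const sstable_sketch& a, const sstable_sketch& b) {
                      return a.first_token < b.first_token;
                  });
    }

    // Tokens must be passed in ascending order, as during compaction.
    const std::vector<sstable_sketch>& select(std::int64_t token) {
        assert(token >= _last);
        _last = token;
        // Activate sstables that start at or before the token.
        while (_next < _by_first.size() && _by_first[_next].first_token <= token) {
            _active.push_back(_by_first[_next++]);
        }
        // Drop sstables that end before the token.
        _active.erase(std::remove_if(_active.begin(), _active.end(),
                                     [token](const sstable_sketch& s) {
                                         return s.last_token < token;
                                     }),
                      _active.end());
        return _active;
    }
};
```

Because tokens only move forward, each sstable is activated and dropped at most once, so only the sstables returned for the current key need their bloom filters consulted.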