Commit Graph

Glauber Costa
f2e0affcc5 database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same issue.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.
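
As a sketch of the pattern (the write attempt and component paths are
hypothetical stand-ins; the flusher-specific retry policy stays with the
caller):

    #include <seastar/core/do_with.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>
    #include <seastar/core/seastar.hh>   // seastar::remove_file
    #include <seastar/core/sstring.hh>
    #include <vector>

    // Sketch: run one streaming-write attempt; on failure, delete every
    // component file the attempt created (tmp TOC included) before
    // rethrowing, so nothing is left dangling for the caller's retry policy.
    seastar::future<> write_or_cleanup(seastar::future<> (*attempt)(),
                                       std::vector<seastar::sstring> created) {
        return seastar::do_with(std::move(created),
                [attempt] (std::vector<seastar::sstring>& paths) {
            return attempt().handle_exception([&paths] (std::exception_ptr ep) {
                return seastar::do_for_each(paths, [] (const seastar::sstring& p) {
                    return seastar::remove_file(p);
                }).then([ep] { return seastar::make_exception_future<>(ep); });
            });
        });
    }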

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
(cherry picked from commit 1aabbc75ab)
2017-12-13 10:26:07 +02:00
Nadav Har'El
b594f21f91 Allow reading exactly desired byte ranges and fast_forward_to

In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition, and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.
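
A sketch of that choice, with hypothetical names:

    #include <cstdint>

    // Sketch: pick the stream's hard end from the forwarding mode.
    // forwarding::yes may later move to any partition range, so the stream
    // must stay open to the end of the file; forwarding::no never reads
    // (or reads ahead) past the range's end.
    uint64_t choose_last_end(bool forwarding, uint64_t range_end, uint64_t file_end) {
        return forwarding ? file_end : range_end;
    }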

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement would not
help when the overall range is always very large, in which case avoiding
over-reading at its end will not noticeably improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170718110643.8667-1-nyh@scylladb.com>
2017-07-18 16:54:11 +03:00
Raphael S. Carvalho
611c25234e database: fix potential use-after-free in sstable cleanup
When do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, the sstable object will be used
after being freed: it was taken by reference, and the container it
lives in was destroyed prematurely.

Let's fix it with a do_with, which also makes the code nicer.
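
A hedged sketch of the fixed shape (element type and cleanup body are
stand-ins):

    #include <seastar/core/do_with.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>       // do_for_each
    #include <seastar/core/semaphore.hh>
    #include <vector>

    // Sketch: do_with extends the container's lifetime until the returned
    // future resolves, so even if with_semaphore defers on the last
    // iteration, the element taken by reference is still alive.
    seastar::future<> cleanup_all(std::vector<int> sstables, seastar::semaphore& sem) {
        return seastar::do_with(std::move(sstables), [&sem] (std::vector<int>& ssts) {
            return seastar::do_for_each(ssts, [&sem] (int& sst) {
                return seastar::with_semaphore(sem, 1, [&sst] {
                    // ... per-sstable cleanup would run here ...
                    return seastar::make_ready_future<>();
                });
            });
        });
    }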

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
(cherry picked from commit b9d0645199)
2017-07-03 12:49:34 +03:00
Calle Wilund
665d14584c database: Fix assert in truncate to handle empty memtables+sstables
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and so will
get no resulting truncation replay position.

Fixes #2489

Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>

(cherry picked from commit 525730e135)
2017-06-14 16:25:57 +03:00
Raphael S. Carvalho
befd4c9819 db: fix computation of live disk usage stat after compaction
rebuild_statistics() uses sstable::data_size(), which only returns the
uncompressed data size, while the stat it updates expects the actual
disk space used by all components. Boot uses add_sstable(), which
correctly updates the stat with sstable::bytes_on_disk(); that's what
rebuild_statistics() needs to use too.
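
In sketch form, with hypothetical field names:

    #include <cstdint>

    // Sketch: the live-disk-space stat must be fed the on-disk footprint
    // of all components, not the uncompressed data size.
    struct sstable_sizes {
        uint64_t data_size;      // uncompressed Data.db size only
        uint64_t bytes_on_disk;  // every component, as stored on disk
    };

    void account_live_disk_space(uint64_t& live_disk_space_used,
                                 const sstable_sizes& s) {
        live_disk_space_used += s.bytes_on_disk;  // not s.data_size
    }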

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
(cherry picked from commit 3b5ad23532)
2017-05-28 10:39:14 +03:00
Avi Kivity
eb2fe0fbd3 Merge "reduce memory requirement for loading sstables" from Raphael
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."

* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
  database: fix indentation of distributed_loader::open_sstable
  database: reduce memory requirement to load sstables
  sstables: loads components for a sstable in parallel
  sstables: enable read ahead for read of in-memory components
  sstables: make random_access_reader work with read ahead
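
A common Seastar shape for bounding the parallelism described above is a
semaphore inside parallel_for_each; a sketch, with the limit of 4 chosen
arbitrarily:

    #include <seastar/core/do_with.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>       // parallel_for_each
    #include <seastar/core/semaphore.hh>
    #include <vector>

    // Sketch: load sstables concurrently, but cap in-flight loads so that
    // in-memory components don't blow up memory usage.
    seastar::future<> load_all(std::vector<int> generations) {
        return seastar::do_with(seastar::semaphore(4), std::move(generations),
                [] (seastar::semaphore& limit, std::vector<int>& gens) {
            return seastar::parallel_for_each(gens, [&limit] (int gen) {
                return seastar::with_semaphore(limit, 1, [gen] {
                    // ... open the sstable's components for `gen` here ...
                    return seastar::make_ready_future<>();
                });
            });
        });
    }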

(cherry picked from commit ef428d008c)
2017-05-25 12:59:55 +03:00
Raphael S. Carvalho
eb6b0b1267 db: remove partial sstable created by memtable flush which failed
Partial sstable files aren't being removed after each failed attempt
to flush a memtable, which happens periodically. If the cause of the
failure is ENOSPC, the memtable flush will be attempted forever, and
as a result the column family may be left with a huge number of
partial files, which will overwhelm a subsequent boot when removing
temporary TOCs. In the past, this led to OOM because removal of
temporary TOCs took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
(cherry picked from commit b7e1575ad4)
2017-05-25 11:50:17 +03:00
Gleb Natapov
df56c108b7 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Removing a lot of
sstables in parallel may start a lot of threads, each of which takes
128k for its stack. There is not much benefit in running deletion in
parallel anyway, so fix it by deleting sstables sequentially.
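
Roughly, as a sketch:

    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>
    #include <seastar/core/seastar.hh>   // seastar::remove_file
    #include <seastar/core/sstring.hh>
    #include <vector>

    // Sketch: delete temporary sstable files one after another, instead of
    // spawning a seastar::thread (and its 128k stack) per sstable. The
    // caller must keep `paths` alive until the future resolves.
    seastar::future<> remove_temporaries(std::vector<seastar::sstring>& paths) {
        return seastar::do_for_each(paths, [] (const seastar::sstring& p) {
            return seastar::remove_file(p);
        });
    }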

Fixes #2384

Message-Id: <20170516103018.GQ3874@scylladb.com>
(cherry picked from commit c7ad3b9959)
2017-05-21 18:56:22 +03:00
Paweł Dziepak
98d782cfe1 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
(cherry picked from commit 24f4dcf9e4)
2017-04-30 19:17:55 +03:00
Raphael S. Carvalho
66c4dcba8e database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all the sstables being cleaned up,
if almost all sstables need cleanup and each one discards only an
insignificant portion of its data.
Given that cleanup is frequently issued when a node is running out of
disk space, we should serialize cleanups on every shard to decrease
the disk space requirement.
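
The serialization itself can be as small as a one-unit semaphore per
shard; a hedged sketch:

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>
    #include <utility>

    // Sketch: with one unit per shard, at most one cleanup rewrites
    // sstables at any moment, so the transient extra disk space is bounded
    // by a single sstable's rewrite rather than by all of them at once.
    static thread_local seastar::semaphore cleanup_gate{1};

    template <typename Cleanup>
    seastar::future<> serialized_cleanup(Cleanup cleanup) {
        return seastar::with_semaphore(cleanup_gate, 1, std::move(cleanup));
    }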

Fixes #192.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
(cherry picked from commit 7deeffc953)
2017-03-19 17:16:33 +02:00
Amnon Heiman
63bec22d28 database: requests_blocked_memory metric should be unique
Metric names should be unique per type.

requests_blocked_memory was registered twice, once as a gauge and once
as derived.

This is not allowed.

Fixes #2165

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
(cherry picked from commit 0a2eba1b94)
2017-03-15 12:43:01 +02:00
Paweł Dziepak
b6db9e3d51 db: make do_apply_counter_update() propagate timeout to db_apply()
db_apply() expects to be given a time point at which the request will
time out. Originally, do_apply_counter_update() passed 0, which meant
that all requests were timed out if do_apply() needed to wait. The
caller of do_apply_counter_update() is already given a correct timeout
time point, so the only thing needed to fix this problem is to propagate
it properly inside do_apply_counter_update() to the call to do_apply().
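
The shape of the fix, sketched with stand-in names:

    #include <chrono>

    using timeout_point = std::chrono::steady_clock::time_point;

    // Stand-in for db_apply(): its waits race against `timeout`.
    void db_apply_sketch(timeout_point /*timeout*/) {}

    // Sketch of the fix: forward the caller's deadline instead of a
    // default-constructed time point (the epoch), which is always in the
    // past and therefore fails any wait immediately.
    void apply_counter_update_sketch(timeout_point caller_timeout) {
        db_apply_sketch(caller_timeout);  // was, in effect, db_apply_sketch({})
    }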

Fixes #2119.
Message-Id: <20170307104405.5843-1-pdziepak@scylladb.com>
2017-03-07 12:44:11 +01:00
Avi Kivity
4f416c7272 Merge "Avoid avalanche of tasks after memtable flush" from Tomasz
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.

However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may start many request releasing
threads due to group size updates.

Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
the server becomes unresponsive until all of the released requests are done,
which happens either when they start to block on dirty memory again or when we
run out of blocked requests. It may take a while to reach the pressure
condition after a memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
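
A hedged sketch of the one-fiber-per-group idea described above, with
stand-in predicates:

    #include <seastar/core/condition-variable.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>   // do_until

    // Sketch: one releasing fiber per region group. Size changes only
    // signal a condition variable (cheap, never queues a new task); the
    // fiber wakes, drains blocked requests until pressure returns, and
    // goes back to sleep.
    struct region_group_sketch {
        seastar::condition_variable _relief;
        bool _stopping = false;

        bool under_pressure() const { return false; }         // stand-in
        bool release_one_blocked_request() { return false; }  // stand-in

        seastar::future<> releaser_fiber() {
            return seastar::do_until([this] { return _stopping; }, [this] {
                return _relief.wait([this] { return _stopping || !under_pressure(); })
                        .then([this] {
                    while (!_stopping && !under_pressure()
                           && release_one_blocked_request()) {
                    }
                });
            });
        }

        void on_size_change() { _relief.signal(); }
    };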

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()

(cherry picked from commit 7a00dd6985)
2017-02-03 09:47:50 +01:00
Paweł Dziepak
5a0955e89d db: add operations for applying counter updates 2017-02-02 10:35:14 +00:00
Raphael S. Carvalho
1857ba0abc db: fix bad resource usage distribution when resharding due to refresh
That's because a single shard is used to calculate the generation for
new sstables in the upload directory, and that results in that single
shard sharing all the resources with the other shards.
For refresh without the upload dir, it currently works fine because we
reshuffle the column family dir instead.

flush_upload_dir() is now a free function that takes a distributed
database object and uses calculate_shard_from_sstable_generation() to
decide which shard will move each sstable, using its own generation
namespace.
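
In sketch form (the exact allocation scheme here is an assumption):

    #include <cstdint>

    // Sketch: generations encode their owning shard, so that
    // generation % smp_count recovers the shard, and each shard can hand
    // out new generations without coordinating with the others.
    unsigned shard_of(uint64_t generation, unsigned smp_count) {
        return generation % smp_count;
    }

    uint64_t next_generation(uint64_t prev_generation_of_shard, unsigned smp_count) {
        // stays congruent to the same shard (mod smp_count),
        // e.g. shard 2 of 8 produces 2, 10, 18, ...
        return prev_generation_of_shard + smp_count;
    }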

Fixes #2008.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
2017-01-19 18:55:21 +02:00
Duarte Nunes
d53f96e0da column_family: Only update stats once for a shared sstable
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimate the on-disk size of sstables when resharding.

This fixes only a temporary miscount of the current load, since shared
sstables are eventually re-written, but it fixes a permanent miscount
of the total load.

Refs #1592

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
2017-01-19 17:40:35 +02:00
Tomasz Grabiec
ea9ab36ad5 db: Move operator<<() definition to .cc
Message-Id: <1484656119-8386-2-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:43 +02:00
Vlad Zolotarov
cda382e8d6 database: move collectd registrations to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Raphael S. Carvalho
68dfcf5256 db: avoid excessive memory usage during resharding
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to the number of shards. That can easily
lead to OOM.

SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.

For this approach to work, we now only populate keyspaces from
shard 0, which is now solely responsible for iterating through
column family dirs. In addition, most of the population functions are
now free functions that take a distributed database object as a
parameter.
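
A hedged sketch of the owner-shard idea; the lookup body is a stand-in:

    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>

    // Sketch: the shard computed from the generation opens the immutable
    // components once; any other shard routes its request to the owner
    // instead of loading a private copy.
    seastar::future<> with_shared_sstable(uint64_t gen) {
        auto owner = gen % seastar::smp::count;
        return seastar::smp::submit_to(owner, [gen] {
            // ... open (or find already-open) components for `gen` here ...
            return seastar::make_ready_future<>();
        });
    }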

Fixes #1951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-09 15:24:36 -02:00
Avi Kivity
be11b054e1 Merge "Reduce the size of mutation_partition" from Piotr
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.

This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes

This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.

Fixes #742."
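
For flavor, a minimal boost::intrusive example of a size-optimized hook;
the actual patch goes further and drives bi::rbtree_algorithms directly:

    #include <boost/intrusive/set.hpp>

    namespace bi = boost::intrusive;

    // optimize_size<true> typically packs the rb-tree color bit into the
    // parent pointer, shrinking every node.
    struct row : bi::set_base_hook<bi::optimize_size<true>> {
        int key;
        explicit row(int k) : key(k) {}
        bool operator<(const row& o) const { return key < o.key; }
    };

    int main() {
        row a(1), b(2);
        bi::set<row> rows;
        rows.insert(a);
        rows.insert(b);
        bool ok = rows.size() == 2;
        rows.clear();  // unlink before hook destructors run (safe_link mode)
        return ok ? 0 : 1;
    }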

* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
  intrusive_set: rename size() to calculate_size()
  Make intrusive_set_external_comparator::_value_traits static
  Implement intrusive set using rbtree_algorithms
  mutation_partition: make apply_reversibly_intrusive_set nongeneric
  mutation_partition: take schema in find_row and clustered_row
  mutation_partition: Extract intrusive set logic to a class.
  mutation_partition: Replace value_comp with key_comp calls
2017-01-05 17:34:10 +02:00
Tomasz Grabiec
cd630fece6 db: Make system tables use the commitlog
Before this patch, system table writes did not go to the commit log,
because database::add_column_family() disables commit log writes for
the table being added if _commitlog is not set at that time. Fix by
initializing the commit log before system tables are created.

Fixes #1986.

Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.

Could cause system keyspace writes to be delayed for longer than before
under a heavy write workload. Refs #1926.

Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
2017-01-05 14:53:51 +02:00
Piotr Jastrzebski
4bbe05dd47 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Paweł Dziepak
1a52569f7d storage_proxy: pass maximum result size to replicas
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator and not hardcoded in the
replicas this can be done without causing data query digest mismatches
or wasteful mutation query results.
2016-12-22 17:16:23 +01:00
Paweł Dziepak
a0523df8d6 result_memory_limiter: add accounter for digest reads
Digest reads differ from data reads in that they do not really
consume any memory. We still want them to stop in the same place that
data reads would, but the per-shard semaphore shouldn't be updated by
them.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
aa083d3d85 result_memory_limiter: split new_read() to new_{data, mutation}_read()
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only digest). That's
why they shouldn't be affected by per-shard result memory limit.
Moreover, we should make sure that individual memory limits are the
same, making the coordinator provide it for replicas which allow to
safely change it in the future.

Mutation queries are not as sensitive but it is still beneficial to make
sure that all replicas use the same individual limit.
2016-12-22 13:35:04 +01:00
Raphael S. Carvalho
27fb8ec512 db: avoid excessive disk usage during sstable resharding
Shared sstables will now be resharded in the same order, to guarantee
that all shards owning an sstable will agree on its deletion at nearly
the same time, therefore reducing the disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
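
In sketch form:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch: a deterministic resharding order shared by all shards, so
    // every shard finishes (and can delete) the same shared sstable at
    // nearly the same time.
    struct table_to_reshard {
        uint64_t uuid_key;                     // stand-in for the table UUID
        std::vector<int64_t> shared_generations;
    };

    void order_for_resharding(std::vector<table_to_reshard>& tables) {
        // Column families in UUID order...
        std::sort(tables.begin(), tables.end(),
                  [] (const auto& a, const auto& b) { return a.uuid_key < b.uuid_key; });
        // ...and each one's shared sstables in generation order.
        for (auto& t : tables) {
            std::sort(t.shared_generations.begin(), t.shared_generations.end());
        }
    }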

Fixes #1952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
2016-12-21 23:18:06 +02:00
Avi Kivity
875635554d Merge "Reduce overhead of partition presence checker during cache update" from Tomasz
Refs #1943.

* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
  db: Compute key hash once in partition_presence_checker
  bloom_filter: Allow checking presence using pre-hashed key
  db: Use incremental selector in partition_presence_checker
2016-12-21 14:24:54 +02:00
Duarte Nunes
3fd79bb6d6 schema_tables: Merge views for schema merging
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
06ab61a570 schema_tables: Extract update_column_family
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
ecc4290bc6 database: Remove view from base table upon drop
This patch changes the drop_column_family() function to remove
a view schema from the list of views of its base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
4f166cfa6a database: Parse views schema table upon init
This patch adds code for parsing the views schema table upon init, and
also ensures that, when adding a view column family, we add it to its
base table's list of views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
40c684b5f5 database: Extract common create cf code
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
2b231f22b8 keyspace_metadata: Add tables() and views() functions
This patch adds utility functions to keyspace_metadata to select only
the tables or only the views out of all the schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
7818339791 materialized views: Add view class
This patch adds the view class, which will contain functions related
to populating a view, either from the base table's write path or from
the view-building mechanism that copies over already-existing data in
the base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Tomasz Grabiec
0e487b3499 db: Compute key hash once in partition_presence_checker
I measured a reduction of cache update time by 20% for 6 sstables and
by 40% for 16.

Refs #1943.
2016-12-19 14:20:58 +01:00
Tomasz Grabiec
78844fa2e5 db: Use incremental selector in partition_presence_checker
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.

Refs #1943.

Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.

Presence checker interface was changed to accept decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
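
A sketch of the hash-once probing (the filter type is a stand-in):

    #include <functional>
    #include <string>
    #include <vector>

    // Stand-in bloom filter that accepts a pre-computed hash.
    struct filter_sketch {
        bool is_present(size_t hash) const { return hash % 2 == 0; }
    };

    // Sketch: hash the key once, then probe only the sstables whose token
    // range overlaps the key (here: a pre-selected vector), rather than
    // re-hashing per sstable and scanning every filter.
    bool maybe_present(const std::string& key,
                       const std::vector<filter_sketch>& overlapping) {
        size_t h = std::hash<std::string>{}(key);  // computed once
        for (const auto& f : overlapping) {
            if (f.is_present(h)) {
                return true;
            }
        }
        return false;
    }
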
2016-12-19 14:20:58 +01:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Asias He
85034c1b57 Convert to use dht::partition_range 2016-12-19 08:04:30 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Avi Kivity
6bb875bdb7 Merge "storage_proxy: Enforce partition limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
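
The merger-side trimming is conceptually tiny; a sketch with a stand-in
result type:

    #include <cstddef>
    #include <vector>

    struct partition_result { /* merged partition payload */ };

    // Sketch: after merging per-replica results, drop partitions beyond
    // the query's partition limit so the merged result never exceeds it.
    void trim_partitions(std::vector<partition_result>& merged, size_t limit) {
        if (merged.size() > limit) {
            merged.erase(merged.begin() + limit, merged.end());
        }
    }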

* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
  storage_proxy: Decrease limits when retrying command
  storage_proxy: Don't fetch superfluous partitions
  query::result: Add partition count
  column_family: Use counters in query::result::builder
  query_result_builder: Use the underlying counters
  mutation_partition: Count partitions in query_compacted
  mutation_partition: Remove tabs in query_compacted
  query::result::builder: Add partition count
  query_result_merger: Limit partitions
2016-12-16 13:57:37 +02:00
Glauber Costa
7133583797 track streaming and system virtual dirty memory
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now, by looking solely at dirty, we will have no idea
how much memory we are using in those groups.

After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
2016-12-16 10:59:40 +02:00
Paweł Dziepak
cf679a413c db: use multi range reader for streaming readers
A naive approach was to create a set of readers for each range and pass
them all to a combining reader. This, however, performed badly if the
number of ranges was high.

The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
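
In sketch form (the types are stand-ins for the real readers and ranges):

    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>
    #include <vector>

    struct range {};  // stand-in for dht::partition_range
    struct reader {   // stand-in for the multi-range reader
        seastar::future<> fast_forward_to(const range&) {
            return seastar::make_ready_future<>();
        }
        seastar::future<> consume_current_range() {
            return seastar::make_ready_future<>();
        }
    };

    // Sketch: one reader fast-forwarded through sorted, disjoint ranges,
    // instead of one reader per range behind a combining reader.
    seastar::future<> consume_ranges(reader& r, const std::vector<range>& ranges) {
        return seastar::do_for_each(ranges, [&r] (const range& pr) {
            return r.fast_forward_to(pr).then([&r] {
                return r.consume_current_range();
            });
        });
    }
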
2016-12-15 13:54:43 +00:00
Duarte Nunes
781cd82cb8 column_family: Use counters in query::result::builder
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Paweł Dziepak
cfd4d0f680 db: add metrics for short reads and memory used for results
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
ba51e7e8db data_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
6c33a4f177 db: create result_memory_accounters when starting query
This patch ensures that, when we start executing a query, a minimum
result size is reserved from result_memory_limiter.

Moreover, range queries need a way of merging memory usage information
from different shards.
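
A hedged sketch of the up-front reservation; the minimum size is an
arbitrary stand-in, not the real value:

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>

    // Sketch: take a minimum reservation from the per-shard limiter before
    // the query runs, so a query that has started can always produce at
    // least a small result instead of stalling mid-way.
    constexpr size_t minimum_result_size = 4096;  // assumption

    seastar::future<> with_result_memory(seastar::semaphore& shard_limit) {
        return seastar::with_semaphore(shard_limit, minimum_result_size, [] {
            // ... build the result, growing the accounting as it grows ...
            return seastar::make_ready_future<>();
        });
    }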

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
15de8de9e5 reconcilable_result: keep result_memory_tracker object
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100% of the region. So the total amount of memory used will certainly
be bigger than the system pressure threshold, which is now 50%
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00