This patch adds an utility function that allows fetching the set of
column_families that do not belong to the system keyspace.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch allows a set of a column_family's sstables to be
selected according to a range of ring_positions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
memtable_list::seal_on_overlflow() is called on each mutation to check
if current memtable should be flushed. It will call
memtable_list::seal_active_memtable() when that is the case.
The number of concurrent seals is guarded by a semaphore, starting
from commit 0f64eb7e7d, and allows
at most 4 of them.
If there are 4 flushes already pending, every incoming mutation will
enqueue a new flush task on the semaphore's wait list, without waiting
for it. The wait queue can grow without bounds, eventually leading to
out-of-memory.
The fix is to seal the memtable immediately to satisfy should_flush()
condition, but limit concurrency of actual flushes. This way the wait
queue size on the semaphore is limited by memtables pending a flush,
which is fairly limited.
Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>
If mutations are fragmented during streaming a special care must be
taken so that isolation guarantees are not broken.
Mutations received with flag "fragmented" set are applied to a memtable
that is used only by that particular streaming task and the sstables
created by flushing such memtables are not made visible until the task
is complte. Also, in case the streaming fails all data is dropped.
This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence separate way of handling
them so that there is no loss of performance for small partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.
Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When flush_streaming_mutations() is called at the end of streaming it is
supposed to flush all data and then invalidate cache. ranges However, if
there are already some memtable flushes in progress it won't wait for them.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Issue 1195 describes a scenario with a fairly easy reproducer in which we can
freeze the database. That involves writing simultaneously to multiple CFs, such
that the sum of all the memory they are using is larger than the dirty memory
limit, without not any of them individually being larger than the memtable size.
Because we will never reach the individual memtable seal size for any of them,
none of them will initiate a flush leading the database to a halt.
The LSA has now gained infrastructure that allow us to be notified when pressure
conditions mount. What we will do in this case is initiate a flush ourselves.
Fixes#1195
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.
Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently have a semaphore in the column family level that protects us against
multiple concurrent sstable flushes. However, storing that semaphore into the CF,
not the database, was a (implementation, not design) mistake.
One comment in particular makes it quite clear:
// Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
// would take care of the rest. But that still has issues, so we'll limit parallelism to some
// number (4), that we will hopefully reduce to 1 when write behind works.
So I aimed for the shard, but ended up coding it into the CF because that's closer to the
flush point - my bad.
This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore
and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we
don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The
long term benefit is that we now have a one unified structure that can hold shared flush data in all
of the CFs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA infrastructure, through the use of its region groups, now have
a throttler mechanism built-in. This patch converts the current throttlers
so that the LSA throttler is used instead.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Using sstable_set will allow us to filter sstables during a query before
actually creating a reader (this is left to the next patch; here we just
convert the users of the _sstables field).
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.
To limit the damage, add two limits on the queue:
- a timeout, which is equal to the read timeout
- a queue length limit, which is equal to 2% of the shard memory divided
by an estimate of the queued request size (1kb)
Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>
Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard. This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.
Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled. This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.
Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads. Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
We currently have a problem in update_cache, that can be trigger by ordering
issues related to memtable flush termination (not initiation) and/or
update_cache() call duration.
That issue is described in #1364, and in short, happens if a call to
update_cache starts before and ongoing call finishes. There is now a new SSTable
that should be consulted by the presence checker that is not.
The partition checker operates in a stale list because we need to make sure the
SSTable we just wrote is excluded from it. This patch changes the partition
checker so that all SSTables currently in use are consulted, except for the one
we have just flushed. That provides both the guarantee that we won't check our
own SSTable and access to the most up-to-date SSTable list.
Fixes#1364
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>
After commit faa4581, each shard only starts splitting its shared sstables
after opening all sstables. This was important because compaction needs to
be aware of all sstables.
However, another bug remained: If one shard finishes loading its sstables
and starts the splitting compactions, and in parallel a different shard is
still opening sstables - the second shard might find a half-written sstable
being written by the first shard, and abort on a malformed sstable.
So in this patch we start the shared sstable rewrites - on all shards -
only after all shards finished loading their sstables. Doing this is easy,
because main.cc already contains a list of sequential steps where each
uses invoke_on_all() to make sure the step completes on all shards before
continuing to the next step.
Fixes#1371
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>
Starting in commit 721f7d1d4f, we start "rewriting" a shared sstable (i.e.,
splitting it into individual shards) as soon as it is loaded in each shard.
However as discovered in issue #1366, this is too soon: Our compaction
process relies in several places that compaction is only done after all
the sstables of the same CF have been loaded. One example is that we
need to know the content of the other sstables to decide which tombstones
we can expire (this is issue #1366). Another example is that we use the
last generation number we are aware of to decide the number of the next
compaction output - and this is wrong before we saw all sstables.
So with this patch, while loading sstables we only make a list of shared
sstables which need to be rewritten - and the actual rewrite is only started
when we finish reading all the sstables for this CF. We need to do this in
two cases: reboot (when we load all the existing sstables we find on disk),
and nodetool referesh (when we import a set of new sstables).
Fixes#1366.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>
"Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.
To provide such guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly like row_cache::update()
does.
Fixes #1291."
Several shards may share the same sstable - e.g., when re-starting scylla
with a different number of shards, or when importing sstables from an
external source. Sharing an sstable is fine, but it can result in excessive
disk space use because the shared sstable cannot be deleted until all
the shards using it have finished compacting it. Normally, we have no idea
when the shards will decide to compact these sstables - e.g., with size-
tiered-compaction a large sstable will take a long time until we decide
to compact it. So what this patch does is to initiate compaction of the
shared sstables - on each shard using it - so that a soon as possible after
the restart, we will have the original sstable is split into separate
sstables per shard, and the original sstable can be deleted. If several
sstables are shared, we serialize this compaction process so that each
shard only rewrites one sstable at a time. Regular compactions may happen
in parallel, but they will not not be able to choose any of the shared
sstables because those are already marked as being compacted.
Commit 3f2286d0 increased the need for this patch, because since that
commit, if we don't delete the shared sstable, we also cannot delete
additional sstables which the different shards compacted with it. For one
scylla user, this resulted in so much excessive disk space use, that it
literally filled the whole disk.
After this patch commit 3f2286d0, or the discussion in issue #1318 on how
to improve it, is no longer necessary, because we will never compact a shared
sstable together with any other sstable - as explained above, the shared
sstables are marked as "being compacted" so the regular compactions will
avoid them.
Fixes#1314.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously, we were using a stat to decide if compaction should be
retried, but that's not efficient. The information is also lost
after node is restarted.
After these changes, compaction will be retried until strategy is
satisfied, i.e. there is nothing to compact.
We will now be doing the following in a loop:
Get compaction job from compaction strategy.
If cannot run, finish the loop.
Otherwise, compact this column family.
Go back to start of the loop.
By the way, pending_compactions stat will be deprecated after this
commit. Previously, it was increased to indicate the want for
compaction and decreased when compaction finished. Now, we can
compact more than we asked for, so it would be decreased below 0.
Also, it's the strategy that will tell the want for compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>
Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.
To provide such guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly like row_cache::update()
does.
We can only free memory for a region_group when the entire memtable is released.
This means that while the disk can handle requests from multiple memtables just fine,
we won't free any memory until all of them finish. If we are under a pressure situation
we will take a lot more time to leave it.
Ideally, with write-behind, we would allow just one memtable to be flushed at a
time. But since we don't have it enabled, it's better to serialize the flushes
so that only some memtables (4) are flushed at a time. Having the memtable writer
bandwidth all to itself, the memtable will finish sooner, release memory sooner,
and recover the system's health sooner.
We would like to do that without having streaming and memtables starve each
other. Ideally, that should mean half the bandwidth for each - but that
sacrifices memtable writes in the common case there is no streaming. Again,
write behind will help here, and since this is something we intend to do, there
is no need to complicate the code too much for an interim solution.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch introduces an explicit behavior enum class - one of delayed or
immediate, that allow callers to tell the memtable list whether they want a
delayed flush (default), or force an immediate flush. So far this only affects
the streaming code (memtables just ignore it), but the concept is one that can
be easily generalized.
With that in place, we can revert back the stop function to use the standard
flush. I have argued before that adding infrastructure like that would not be
worth it for the sake of stop alone, but some other code could now use it.
Specifically, the active reclaimer for the throttler would like to force
immediate flushes, as delayed flushes really won't make a lot of difference in
reducing memory usage.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and run 'nodetool refresh'.
That's supposed to be the preferred option from now on."
This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and call 'nodetool refresh'.
That's supposed to be the preferred option from now on.
For each sstable in upload directory, refresh will do the following:
1) Mutate sstable level to 0.
2) Create hard links to its components in column family dir, using
a new generation. We make it safe by creating a hard link to temporary
TOC first.
3) Remove all of its components in upload directory.
This new code runs after refresh checked for new sstables in the column
family directory. Otherwise, we could have a generation conflict.
Unlike the first step, this new step runs with sstable write enabled.
It's easier here because we know exactly which sstables are new.
After that, refresh will load new sstables found in column family
and upload directories.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In a preparation move for the LSA throttler, we have reordered the
initialization fields in database.hh so that the sizes of the regions are
computed before the initialization of the region.
However, that seemingly innocent move broke one of our tests. The reason behind
that, is that if we don't destroy the column families before destroying the
region, we may end up with a use after free in the memtable destructor - that
itself expects to call into the region.
This patch reorders the initialization so that the CF list still comes after the
dirty regions (therefore being destroyed first), while maintaining the relative
ordering between size / region that we needed in the first place.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0669984b5bccdb2c950f2444bdee4427abad56ba.1463508884.git.glauber@scylladb.com>
timed_rate_moving_average_and_histogram
As part of moving the derived statistic in to scylla, this replaces the
histogram object in the column_family to
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We've been keeping two constructors for the column family to allow for a
version without the commitlog. But it's by now quite complicated to maintain
the two, because changes always have to be made in two places.
This patch adds a private constructor that does the actual construction, and
have the public constructors to call it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>
Reloads keyspace metadata and replaces in existing keyspace.
Note: since keyspace metadata, and consequently, replication
strategy now becomes volatile, keyspace::metadata now returns
shared pointer by value (i.e. keep-alive).
Replication strategy should receive the same treatment, but
since it is extensively used, but never kept across a
continuation, I've just added a comment for now.
It was noticed that small sstables will accumulate for a column family because
scylla was limited to two compaction per shard, and a column family could have
at most one compaction running at a given shard. With the number of sstables
increasing rapidly, read performance is degraded.
At the moment, our compaction manager works by running two compaction task
handlers that run in parallel to the rest of the system. Each task handler
gets to run when needed, gets a column family from compaction manager queue,
runs compaction on it, and goes to sleep again. That's basically its cycle.
Compaction manager only allows one instance of a column family to be on its
queue, meaning that it's impossible for a column family to be compacted in
parallel. One compaction starts after another for a given column family.
To solve the problem described, we want to concurrently run compaction jobs
of a column family that have different "size tier" (or "weight").
For those unfamiliar, compaction job contains a list of sstables that will be
compacted together.
The "size tier" of a compaction job is the log of the total size of the input
sstables. So a compaction job only gets to run if its "size tier" is not the
same of an ongoing compaction. There is no point in compacting concurrently at
the same "size tier", because that slows down both compactions.
We will no longer queue column families in compaction manager. Instead, we
create a new fiber to run compaction on demand.
This fiber that runs asynchronously will do the following:
1) Get a compaction job from compaction strategy.
2) Calculate "size tier" of compaction job.
3) Run compaction job if its "size tier" is not the same of an ongoing
compaction for the given column family.
As before, it may decide to re-compact a column family based on a stat stored
in column family object.
Ran all compaction-related dtests.
Fixes#1216.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>
"This series introduces some additional metrics (mostly) in a storage_proxy and
a database level that are meant to create a better picture of how data flows
in the cluster.
First of all where possible counters of each category (e.g. total writes in the storage
proxy level) are split into the following categories:
- operations performed on a local Node
- operations performed on remote Nodes aggregated per DC
In a storage_proxy level there are the following metrics that have this "split"
nature (all on a sending side):
- total writes (attempts/errors)
- writes performed as a result of a Read Repair logic
- total data reads (attempts/completed/errors)
- total digest reads (attempts/completed/errors)
- total mutations data reads (attempts/completed/errors)
In a batchlog_manager:
- writes performed as a result of a batchlog replay logic
Thereby if for instance somebody wants to get an idea of how many writes
the current Node performs due to user requested mutations only he/she has
to take a counter of total writes and subtract the writes resulted by Read
Repairs and batchlog replays.
On a receiving side of a storage_proxy we add the two following counters:
- total number of received mutations
- total number of forwarded mutations (attempts/errors)
In order to get a better picture of what is going on on a local Node
we are adding two counters on a database level:
- total number of writes
- total number of reads
Comparing these to total writes/reads in a storage_proxy may give a good
idea if there is an excessive access to a local DB for example."
This patch adds a counter of total writes and reads
for each shard.
It seems that nothing ensures that all database queries are
ready before database object is destroyed.
Make _stats lw_shared_ptr in order to ensure that the object is
alive when lambda gets to incrementing it.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This field is superfluous and adds confusion regarding the user_types
field in the keyspace metadata.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
From Glauber:
There are current some outstanding issues with the throttling code. It's
easier to see them with the streaming code, but at least one of them is general.
One of them is related to situations in which the amount of memory available
leaves only one memtable fitting in memory. That would only happen with the
general code if we set the memtable cleanup threshold to 100 % - and I don't
even know if it is valid - but will happen quite often with the streaming code.
If that happens, we'll start throttling when that memtable is being written,
but won't be able to put anything else in its place - leading to unnecessary
throttling.
The second, and more serious, happens when we start throttling and the amount
of available memory is not at least 1MB. This can deadlock the database in
the sense that it will prevent any request from continuing, and in turn causing
a flush due to memtable size. It is a good practice anyway to always guarantee
progress.
Fixes#1144
This is usually not a problem for the main memtable list - although it can be,
depending on settings, but shows up easily for the streaming memtables list.
We would like to have at least two memtables, even if we have to cut it short.
If we don't do that, one memtable will have use all available memory and we'll
force throttling until the memtable gets totally flushed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database for us to do so.
We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>