* seastar c82c36f...9267dfa (6):
> app_template: Make run() wait for func when reactor exit is triggered externally
> core: Introduce futurize_apply() helper
> rpc: make unexpected eof messages more informative
> Fix boost version check
> reactor: more fix for smp poll with older boost
> reactor: fix build on older boost due to spsc_queue::read_available()
2a46410f4a changed sstable_list from a map
to a set, so it is no longer sorted by generation. The code for finding
the list of sstables not being compacted relied on this sort order, and
now broke, returning a longer list than needed (including some of the
sstables being compacted). As a result, the compaction code preserved
the tombstones, incorrectly thinking there was still live data they
referenced.
Fix by sorting the set explicitly.
Fixes#1429.
Message-Id: <1467793026-6571-1-git-send-email-avi@scylladb.com>
"After this patchset, date tiered compaction strategy is supported by Scylla.
For those who don't know what it is about, the following article may help:
https://labs.spotify.com/2014/12/18/date-tiered-compaction/
It's also nicely explained here by our wiki page:
https://github.com/scylladb/scylla/wiki/SSTable-compaction#date-tiered-compaction
Basically, date tiered strategy was developed to help the database perform better
when facing a time series workload. Date tiered strategy will work to keep data
written at nearly the same time together, such that the number of relevant sstables
for a time-based query is relatively low. We still lacks support to filter out
sstables based on time parameters of a query, but that feature should come ASAP.
The following dtests now pass:
compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_delete_test
compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test
Used cassandra-stress with the parameter '-schema compaction\(strategy=DateTieredCompactionStrategy\)'
to check stability.
Fixes #511."
"Issue 1195 describes a scenario with a fairly easy reproducer in which
we can freeze the database. That involves writing simultaneously to
multiple CFs, such that the sum of all the memory they are using is larger
than the dirty memory limit, without not any of them individually being
larger than the memtable size.
This patchset rewrites the throttling code, including now active flushes
so that this situation cannot happen.
Fixes#1195"
This commit is basically about converting Java to C++.
Date tiered compaction strategy isn't wired yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Strongly based on org.apache.cassandra.db.compaction.
CompactionController.getFullyExpiredSSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We weren't updating max local deletion time for cells that contain
ttl, or for tombstone cells.
If there is a live cell with no ttl, then max local deletion time
is supposed to store maximum value, which means that the sstable
will not be fully expired later on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File can be found at the following C* directory:
src/java/org/apache/cassandra/db/compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Issue 1195 describes a scenario with a fairly easy reproducer in which we can
freeze the database. That involves writing simultaneously to multiple CFs, such
that the sum of all the memory they are using is larger than the dirty memory
limit, without not any of them individually being larger than the memtable size.
Because we will never reach the individual memtable seal size for any of them,
none of them will initiate a flush leading the database to a halt.
The LSA has now gained infrastructure that allow us to be notified when pressure
conditions mount. What we will do in this case is initiate a flush ourselves.
Fixes#1195
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.
Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently have a semaphore in the column family level that protects us against
multiple concurrent sstable flushes. However, storing that semaphore into the CF,
not the database, was a (implementation, not design) mistake.
One comment in particular makes it quite clear:
// Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
// would take care of the rest. But that still has issues, so we'll limit parallelism to some
// number (4), that we will hopefully reduce to 1 when write behind works.
So I aimed for the shard, but ended up coding it into the CF because that's closer to the
flush point - my bad.
This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore
and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we
don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The
long term benefit is that we now have a one unified structure that can hold shared flush data in all
of the CFs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.
The easiest way to convert from region to memtable, is having memtable inherit
from region. Despite the fact that this requires multiple inheritance, which
always raise a flag a bit, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worthy for us to do it.
Once we have the memtable, grabing the column family is easy provided we have a
database object. We can grab it from the schema.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA infrastructure, through the use of its region groups, now have
a throttler mechanism built-in. This patch converts the current throttlers
so that the LSA throttler is used instead.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This series contains remaining changes necessary to safely enable read
ahead of sstables. Basically, it makes sure that input_streams are
always properly closed (even in case of exception during read)."
At the moment, it's not possible to know how many compaction are needed for
compaction strategy to be satisfied. It's not possible to know exactly the
number of pending compaction, but the strategy can provide an estimation.
For size tiered, it's based on number of sstables in each bucket. By dividing
bucket size by max threshold, we get number of compaction needed to compact
that single bucket.
For leveled, it's about the number of sstables that exceeds the limit in
each level.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <e209e52f6159ee274a8358b69961a7c0ce357f7d.1467667054.git.raphaelsc@scylladb.com>
Checking features for seed node is a bit more complicated than non-seed
node, because non-seed node can always talk to at least one seed node,
seed node may not.
In this patch, we distingush new cluster and existing cluster by
checking if the system table is empty. We relax the feature check for
new cluster because the feature check is mostly useful when upgrading an
existing cluster to prevent old node to join new cluster.
When talking to a seed node failed during the check, we fallback to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).
I tested the following scenarios.
1) start a completely new seed node in a new cluster
* system table is empty, skip the check.
2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
* succeed
3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check.
4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check
In 3a36ec33db (gossip: Wait longer for seed node during boot up), we
increased the timeout by the factor of 60, i.e., ring_dealy * 60 = 5
seconds * 60 = 5 minutes.
In 57ee9676c2 (storage_service: Fix default ring_delay time), we fixed
the default ring_dealy to 30 seconds. Now the timeout is 30 * 60 seconds
= 30 minutes, which is too long.
Make it 5 minues.