"Issue 1195 describes a scenario with a fairly easy reproducer in which
we can freeze the database. That involves writing simultaneously to
multiple CFs, such that the sum of all the memory they are using is larger
than the dirty memory limit, without not any of them individually being
larger than the memtable size.
This patchset rewrites the throttling code, including now active flushes
so that this situation cannot happen.
Fixes#1195"
Issue 1195 describes a scenario with a fairly easy reproducer in which we can
freeze the database. That involves writing simultaneously to multiple CFs, such
that the sum of all the memory they are using is larger than the dirty memory
limit, without not any of them individually being larger than the memtable size.
Because we will never reach the individual memtable seal size for any of them,
none of them will initiate a flush leading the database to a halt.
The LSA has now gained infrastructure that allow us to be notified when pressure
conditions mount. What we will do in this case is initiate a flush ourselves.
Fixes#1195
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.
Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently have a semaphore in the column family level that protects us against
multiple concurrent sstable flushes. However, storing that semaphore into the CF,
not the database, was a (implementation, not design) mistake.
One comment in particular makes it quite clear:
// Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
// would take care of the rest. But that still has issues, so we'll limit parallelism to some
// number (4), that we will hopefully reduce to 1 when write behind works.
So I aimed for the shard, but ended up coding it into the CF because that's closer to the
flush point - my bad.
This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore
and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we
don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The
long term benefit is that we now have a one unified structure that can hold shared flush data in all
of the CFs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.
The easiest way to convert from region to memtable, is having memtable inherit
from region. Despite the fact that this requires multiple inheritance, which
always raise a flag a bit, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worthy for us to do it.
Once we have the memtable, grabing the column family is easy provided we have a
database object. We can grab it from the schema.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA infrastructure, through the use of its region groups, now have
a throttler mechanism built-in. This patch converts the current throttlers
so that the LSA throttler is used instead.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This series contains remaining changes necessary to safely enable read
ahead of sstables. Basically, it makes sure that input_streams are
always properly closed (even in case of exception during read)."
At the moment, it's not possible to know how many compaction are needed for
compaction strategy to be satisfied. It's not possible to know exactly the
number of pending compaction, but the strategy can provide an estimation.
For size tiered, it's based on number of sstables in each bucket. By dividing
bucket size by max threshold, we get number of compaction needed to compact
that single bucket.
For leveled, it's about the number of sstables that exceeds the limit in
each level.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <e209e52f6159ee274a8358b69961a7c0ce357f7d.1467667054.git.raphaelsc@scylladb.com>
Checking features for seed node is a bit more complicated than non-seed
node, because non-seed node can always talk to at least one seed node,
seed node may not.
In this patch, we distingush new cluster and existing cluster by
checking if the system table is empty. We relax the feature check for
new cluster because the feature check is mostly useful when upgrading an
existing cluster to prevent old node to join new cluster.
When talking to a seed node failed during the check, we fallback to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).
I tested the following scenarios.
1) start a completely new seed node in a new cluster
* system table is empty, skip the check.
2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
* succeed
3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check.
4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
* fail, fallback to system table check
In 3a36ec33db (gossip: Wait longer for seed node during boot up), we
increased the timeout by the factor of 60, i.e., ring_dealy * 60 = 5
seconds * 60 = 5 minutes.
In 57ee9676c2 (storage_service: Fix default ring_delay time), we fixed
the default ring_dealy to 30 seconds. Now the timeout is 30 * 60 seconds
= 30 minutes, which is too long.
Make it 5 minues.
If someone tried to naively use utf8_type->decompose("18wX"), this would
mysteriously fail, returning an empty key.
decompose takes a data_value, so the compiler looked for an implict
conversion from the string constant (const char*) to data_value. We did
not have such a conversion, only conversion from sstring. But the compiler
chose (backed by the C++ standard, no doubt) to implicitly convert the
const char* to a bool (!), and then use data_value(bool). It did not
convert the const char* to an sstring, nor did it warn about the possible
ambiguity.
So this patch adds a data_value(const char*) constructor, so people will
not fall into the same trap that I fell into...
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1467643462-6349-1-git-send-email-nyh@scylladb.com>
In a leveled column family, there can be many thousands of sstables, since
each sstable is limited to a relatively small size (160M by default).
With the current approach of reading from all sstables in parallel, cpu
quickly becomes a bottleneck as we need to check the bloom filter for each
of these sstables.
This patch addresses the problem by introducing a
compaction-strategy-specific data structure for holding sstables. This
data structure has a method to obtain the sstables used for a read.
For leveled compaction strategy, this data structure is an interval map,
which can be efficiently used to select the right sstables.
When a seed node boots up with more than one node in the seed list, it
will fail to talk to the other seed node which is not up yet.
This fails the feature check, so the seed node will not boot.
Skip the feature check for seed node for now, util we have a proper solution.
Fixes recent dtest failure due to fail to boot the seed node.
Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>
"This series converts sstable writers (including compaction) to streamed
mutations and makes them use consumer-style interface.
Code related to sstable writes and compaction is converted to consumers
that can be used with consume_flattened_in_thread() (which is a variant
of consume_flattened() intended to be run inside a thread).
compac_for_query is improved so that it can be reused by sstable
compaction."
Amnon says:
The API that returns the version, currently returns the compatibility
version
(e.g. the version the compatible origin version - currently 2.1.8).
The check version functionality need to know what is the current running
version of scylla. For that a new API was added that return the current
version.
The result is equivalent of running scylla --version.
After this series a call to:
$ curl -X GET
"http://localhost:10000/storage_service/scylla_release_version"
"666.development-20160703.72f0d4d"
Which is the json representation of:
$ ./build/release/scylla --version
666.development-20160703.72f0d4d
We currently log as follow:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
Howerver, user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes#1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
data_resource lookup uses data_resource::name(), which uses sprint(), which
uses (indirectly) locale, which takes a global lock. This is a bottleneck
on large machines.
Fix by not using name() during lookup.
Fixes#1419
Message-Id: <1467616296-17645-1-git-send-email-avi@scylladb.com>
This adds a definition to the scylla release version. The API already
return the compatibility version (ie. the compatible origin version)
This definition returns the scylla version, a call to the API should
return the same result as running scylla --version.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Apply compaction strategy specific logic to narrow down the set of sstables
used for a query; can speed up reads using LeveledCompactionStrategy
significantly.
Fixes#1185.
Using sstable_set will allow us to filter sstables during a query before
actually creating a reader (this is left to the next patch; here we just
convert the users of the _sstables field).
Allow compaction_strategy to create a container for sstables that is
optimized for the strategy.
Most compaction_strategies return bag_sstable_set; leveled compaction
returns the specialized partitioned_sstable_set.
partitioned_sstable_set assumes that sstable are mostly partitioned along
the token range: only a few sstables will be needed to access a particular
token. It is implemented as an interval_map.
bag_sstable_set is a generic sstable_set implementation: it assumes nothing
about the sstables. It is implemented as a vector, and any select will
return the entire sstable set.
sstable_set abstracts the notion of a container of sstables, allowing
different compaction strategies to supply their own implementation. The
intended user is leveled compaction strategy; since it partitions sstables,
it can quickly restrict the number of sstables that participate in a query
by looking at the min/max partition key.
sstable_set also maintains an internal lw_shared_ptr<sstable_list>,
in parallel with the abstract container. This is to support
column_family::get_sstable(), which returns a lw_shared_ptr<sstable_list>
which must be anchored somewhere if it is not saved at the caller side,
as it isn't in most current callers.
ring_position is built for modern code that does not require default
constructors or stateless comparators. But not all code is modern, so
supply a compatible_ring_position that works with old code, at the cost
of some extra storage. Intended user is boost's interval container
library.
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
If read ahead is going to be enabled it is important to close
input_stream<> properly (and wait for completion) before destroying it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>