The query_size_estimates() function queries the size_estimates system
table for a given keyspace and table, filtering out the token ranges
according to the specified tokens.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit ecfa04da77)
This patch fixes stop() by checking if the current CPU instead of
whether the service is active (which it won't be at the time stop() is
called).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit d984cc30bf)
This patch makes range_estimates a proper struct, where tokens are
represented as dht::tokens rather than dht::ring_position*.
We also pass other arguments to update_ and clear_size_estimates by
copy, since one will already be required.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit e16f3f2969)
Fixes#1484.
We drop tables as part of keyspace drop. Table drop starts with
creating a snapshot on all shards. All shards must use the same
snapshot timestamp which, among other things, is part of the snapshot
name. The timestamp is generated using supplied timestamp generating
function (joinpoint object). The joinpoint object will wait for all
shards to arrive and then generate and return the timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
The fix is to give each table a separate joinpoint instance.
Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 5e8f0efc85)
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This patch implements the size_estimates_recorder, which periodically
writes estimations for all the non-system column families in the
size_estimates system table. The size_estimates_recorder class
corresponds to the one in Cassandra's SizeEstimatesRecorder.java.
Estimation is carried out by shard 0. Since we're estimating based on
data in shared sstables, having multiple shards doing this would skew
the results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements functions that allow the size_estimates system
table to be updated and cleared. The size_estimates table is updated
per schema with a set of token ranges and the associated estimations
of how many partitions there are and their mean size.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
filter_for_query() gets sorted by preference list of endpoints and
should preserve that order after filtering out non local endpoints for
local query. partition() does not guaranty this while stable_partition()
does, so use it instead.
Fixes#1450.
Message-Id: <20160713100909.GM10767@scylladb.com>
We have imported most of our data about config options from Cassandra. Due to
that, many options that mention the database by name are still using
"Cassandra".
Specially for the user visible options, which is something that a user sees, we
should really be using Scylla here.
This patch was created by automatically replacing every occurrence of "Cassandra"
with "Scylla" and then later on discarding the ones in which the change didn't
make sense (such as Unused options and mentions to the Cassandra documentation)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1423e1d7e36874a1f46bd091aec96dcb4d8482d9.1468267193.git.glauber@scylladb.com>
Continuation reordering could cause us to repeatedly see the
segment-local flag var even though actual write/sync ops are done.
Can cause wild recursion without actual delayed continuation ->
SOE.
Fix by also checking queue status, since this is the wait object.
Message-Id: <1468234873-13581-1-git-send-email-calle@scylladb.com>
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.
This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.
I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).
Refs #1322.
Enable --partitioner option so that user can choose partitioner other
than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When non-supported partitioner
is specifed, error will be propogated to user.
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.
Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.
To limit the damage, add two limits on the queue:
- a timeout, which is equal to the read timeout
- a queue length limit, which is equal to 2% of the shard memory divided
by an estimate of the queued request size (1kb)
Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>
Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard. This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.
Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled. This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.
Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads. Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
If a CF does not have any sstables at all, we should treat it
as having a replay position of zero. However, since we also
must deal with potential re-sharding, we cannot just set
shard->uuid->zero initially, because we don't know what shards
existed.
Go through all CF:s post map-reduce, and for every shard where
a CF does not have an RP-mapping (no sstables found), set the
global min pos (for shard) to zero.
Fixes#1372
Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>
Try to emulate the origin behaviour for batch reply. They use an
explicit write handler, combinging
1.) Hinting to all known dead endpoints
2.) Sending to all persumed live, requiring ack from all
3.) Hinting to endpoint to which send failed.
We don't have hints, so try to work around by doing send with
cl=ALL, and if send fails (wholly or partially), retain the
batch in the log.
This is still slight behavioural difference, and we also risk
filling up the batch log in extreme cases. (Though probably not
in any real environment).
Refs #1222
Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>
Using an ordering mechanism better than rw-locks for write/flush
means we can wait for pending write in batch mode, and coalesce
data from more than one mutation into a chunk.
It also means we can wait for a specific read+flush pair (based on
file position).
Downside is that we will not do parallel writes in batch mode (unless
we run out of buffer), which might underutilize the disk bandwidth.
Upside is that running in batch mode (i.e. per-write consistency)
now has way better bandwidth, and also, at least with high mutation
rate, better average latency.
Message-Id: <1465990064-2258-1-git-send-email-calle@scylladb.com>
Config provides operators << >> for string_map which makes it impossible
to have generic stream operators for unordered_map. Fix it by making
string_map a separate type and not just an alias.
Message-Id: <20160602102642.GJ9939@scylladb.com>
compact_on_idle will lead users to thinking we're talking about sstable
compaction, not log-structured-allocator compaction.
Rename the variable to reduce the probability of confusion.
Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>
"This series introduces some additional metrics (mostly) in a storage_proxy and
a database level that are meant to create a better picture of how data flows
in the cluster.
First of all where possible counters of each category (e.g. total writes in the storage
proxy level) are split into the following categories:
- operations performed on a local Node
- operations performed on remote Nodes aggregated per DC
In a storage_proxy level there are the following metrics that have this "split"
nature (all on a sending side):
- total writes (attempts/errors)
- writes performed as a result of a Read Repair logic
- total data reads (attempts/completed/errors)
- total digest reads (attempts/completed/errors)
- total mutations data reads (attempts/completed/errors)
In a batchlog_manager:
- writes performed as a result of a batchlog replay logic
Thereby if for instance somebody wants to get an idea of how many writes
the current Node performs due to user requested mutations only he/she has
to take a counter of total writes and subtract the writes resulted by Read
Repairs and batchlog replays.
On a receiving side of a storage_proxy we add the two following counters:
- total number of received mutations
- total number of forwarded mutations (attempts/errors)
In order to get a better picture of what is going on on a local Node
we are adding two counters on a database level:
- total number of writes
- total number of reads
Comparing these to total writes/reads in a storage_proxy may give a good
idea if there is an excessive access to a local DB for example."
This patch ensures type_parser can handle user defined types. It also
prefixes user_type_impl::make_name() with
org.apache.cassandra.db.marshal.UserType.
Fixes#631
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the merge_types() function,
allowing mutations to user defined types to be applied.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>