Following Cassandra, our default sstable compression chunk size is 64 KB.
The big downside of this default size is that small reads need to read
and uncompress a large chunk, around 32 KB (if compression halves the data
size). In this patch we switch the default chunk size to 4 KB, which allows
faster small reads (the report in issue #1337 was of a 60-fold speedup...).
Since commit 2f56577, large reads will not be signficantly slowed down by
changing to a small chunk size. The remaining potential downside of this
change is lowering of the compression ratio because of the smaller chunks
individually compressed. However, experimentation shows that the compression
ratio is hurt somewhat, but not dramatically, by lowering the chunk size:
A recent survey of Cassandra compression in
https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/
reports a compression ratio of 2 for 64 KB chunks, vs. 1.75 for 4 KB chunks.
My own test on a cassandra-stress workload (whose data is relatively hard
to compress), showed compression ratio 1.25 for 64 KB chunk, vs. 1.23 for
4 KB chunks.
Also remember that if a user wants to control the chunk length for a
particular table, he can - the 64 KB or 4 KB sizes are just the default.
Fixes#1337
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1467063335-12096-1-git-send-email-nyh@scylladb.com>
Helps in identifying pointers allocated through seastar
allocator. Shows to which thread the pointer belongs, to which size
class, whether it's live or free, what's the offset realtive to the
live object.
Example:
(gdb) scylla ptr 0x6040abe88170
thread 1, small (size <= 320), live (0x6040abe88140 +48)
Message-Id: <1467047215-1763-1-git-send-email-tgrabiec@scylladb.com>
* seastar 3029ebe...c15055c (5):
> memory: add option to mlock() all memory
> reactor: run idle poll handler with a pure poll function
> ignore all but one failed futures in map_reduce
> tutorial: more general exception printout on startup
> resource: don't abort on too-high io queue count
Fixes#1395.
Fixes#1400.
From Avi:
Both the cql binary transport and the rpc server have protection against
too many concurrent requests overwhelming the database due to transient
allocations. There work by estimating the amount of memory a request
requires, and accounting that against a semaphore. When the semaphore
blocks, we stop dequeing requests from the tcp connection.
Unfortunately, this doesn't work for reads, because we can't estimate the
required memory size. A small read request can require many sstables to be
read, perhaps concurrently, and a large response to be generated.
Fix by limiting the number of concurrent reads in a shard to 100. This
is more than enough concurrency for any reasonable disk, and there is no
network communication at this level, so we're safe from high network
latency requiring high concurrency.
Fixes#1398.
Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard. This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.
Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled. This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.
Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads. Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
A restricting_reader wraps a mutation_reader, and restricts it concurrency
using a provided semaphore; this allows controlling read concurrency, which
is important since reads can consume a lot of resources ((number of
participating sstables) * 128k after we have streaming mutations, and a lot
more before).
gcc 6 complains that deleting a managed_bytes::external isn't defined
because the size isn't known. I'm not sure it's correct, but there's no
way to tell because flexible arrays aren't standardized.
Fix by using an array of zero size.
Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>
Using a template lambda invokes a bug in Fedora 24's boost where the
lambda's parameter is an internal boost type rather than a range_tombestone.
Constraining the parameter with an explicit type avoids the problem.
Message-Id: <1466844211-17298-1-git-send-email-avi@scylladb.com>
This adds to the definition of the collectd API the ability to turn on
and off specific collectd metrics.
For the GET end point a POST option was added that allow to enable or
disable a metric.
The general GET endpoint now returns the enable flag that indicates if
the metric is enable.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1466932139-19264-2-git-send-email-amnon@scylladb.com>
"This patchset implements the thrib describe verbs:
- describe_keyspace
- describe_keyspaces
- describe_cluster_name
- describe_version
- describe_ring
- describe_local_ring
- describe_token_map
- describe_partitioner
- describe_snitch
- describe_schema_versions
The verbs describe_splits and describe_splits_ex are not implemented
because they are marked as experimentail (Origin's thrift interface has
this to say about them: "experimental API for hadoop/parallel query
support. may change violently and without warning."). Some drivers have
moved away from depending on this verb (SPARKC-94). The correct way to
implement the verbs for us would be to use the size_estimates system table
(CASSANDRA-7688). However, we currently don't populate size_estimates, which
is done by SizeEstimatesRecorder.java in Origin."
This patch removes a conversion function from an internal type
name to Origin's naming, which isn't needed because the
abstract_type hierarchy already keeps that mapping.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't implement describe_splits, and this patch describes why that
it. In a nutshell, to properly implement this, we would need something
like Origin's SizeEstimatesRecorder.java, but as the verb is marked as
experimental, we don't do it for now.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch completes the system_add_keyspace verb by setting all
relevant options on the new schemas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In thrift, a static column family is one where all columns are
defined upon schema creation. It maps to a CQL table with a singular
partition key and a set of regular columns.
On the other hand, a dynamic column family is one which allows column
to be dynamically added by insertion requests. It maps to a CQL table
with a partition key and a clustering key, which will hold the names of
the dynamic columns, and a regular column, which will how the respective
values. If the thrift comparator type is composite, then there will be a
clustering column for each of the composite's components.
There can also be mixed column families; supporting those is future
work.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the make_exception function from thrift/handler.cc to
the new header file thrift/utils.hh.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We currently have a problem in update_cache, that can be trigger by ordering
issues related to memtable flush termination (not initiation) and/or
update_cache() call duration.
That issue is described in #1364, and in short, happens if a call to
update_cache starts before and ongoing call finishes. There is now a new SSTable
that should be consulted by the presence checker that is not.
The partition checker operates in a stale list because we need to make sure the
SSTable we just wrote is excluded from it. This patch changes the partition
checker so that all SSTables currently in use are consulted, except for the one
we have just flushed. That provides both the guarantee that we won't check our
own SSTable and access to the most up-to-date SSTable list.
Fixes#1364
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>
Add contiguity flag to cache entry and set it in scanning reader.
Partitions fetched during scanning are continuous
and we know there's nothing between them.
Clear contiguity flag on cache entries
when the succeeding entry is removed.
Use continuous flag in range queries.
Don't go do disk if we know that there's nothing
between two entries we have in cache. We know that
when continuous flag of the first one is set to true.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <72bae432717037e95d1ac9465deaccfa7c7da707.1466627603.git.piotr@scylladb.com>
If we don't, std::terminate() causes a core dump, even though an
exception is sort-of-expected here and can be handled.
Add an exception handler to fix.
Fixes#1379.
Message-Id: <1466595221-20358-1-git-send-email-avi@scylladb.com>
gdb gets confused if a non-fully-qualified class name is used when
we are in some namespace context. Help it out by adding a :: prefix.
Message-Id: <1466587895-8690-1-git-send-email-avi@scylladb.com>
This patchset adds two new types of query limits:
- Per partition row limit, which limits how many rows
a given partition may return; needed both for thrift
and for future CQL features;
- Limit on the number of partitions returned, needed
by thrift.