"This series optimizes CQL query parameter handling by avoiding memory
allocation and copies where possible. I have only tested with the POSIX
stack and have not seen a performance difference in cassandra-stress
because Linux networking dominates the profiles. The optimizations
should improve things with DPDK, though, because the
cql_server::read_query_options() hotspot is effectively eliminated."
Store values as bytes_view when possible. This improves the CQL protocol
option parsing path by avoiding the memory allocation and copy needed to
store each individual value as a "bytes" object.
Please note that we retain the non-view version for internal queries
where performance is not as important.
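As a rough illustration of the view-versus-copy distinction described above, here is a minimal sketch that parses length-prefixed values out of a request buffer. It assumes the CQL wire layout of a 4-byte big-endian length followed by the payload, with a negative length meaning null. The names bytes, bytes_view, read_be32, read_values_view and read_values_owned are hypothetical stand-ins built on the standard library, not the actual code touched by this patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-ins for the owning and non-owning value types.
using bytes = std::string;
using bytes_view = std::string_view;

// Reads a 4-byte big-endian signed length starting at `pos`.
inline int32_t read_be32(bytes_view buf, size_t pos) {
    uint32_t v = 0;
    for (size_t i = 0; i < 4; ++i) {
        v = (v << 8) | static_cast<unsigned char>(buf[pos + i]);
    }
    return static_cast<int32_t>(v);
}

// Parses `count` values as views into `buf`: no allocation or copy per value.
// The returned views are valid only while the storage behind `buf` is alive.
inline std::vector<std::optional<bytes_view>>
read_values_view(bytes_view buf, size_t count) {
    std::vector<std::optional<bytes_view>> out;
    out.reserve(count);
    size_t pos = 0;
    for (size_t i = 0; i < count; ++i) {
        if (buf.size() - pos < 4) {
            throw std::runtime_error("truncated value length");
        }
        int32_t len = read_be32(buf, pos);
        pos += 4;
        if (len < 0) {              // negative length encodes a null value
            out.push_back(std::nullopt);
            continue;
        }
        size_t n = static_cast<size_t>(len);
        if (buf.size() - pos < n) {
            throw std::runtime_error("truncated value body");
        }
        out.emplace_back(buf.substr(pos, n));
        pos += n;
    }
    return out;
}

// Owning variant: copies every value, as an internal-query path might.
inline std::vector<std::optional<bytes>>
read_values_owned(bytes_view buf, size_t count) {
    std::vector<std::optional<bytes>> out;
    out.reserve(count);
    for (const auto& v : read_values_view(buf, count)) {
        if (v) {
            out.emplace_back(bytes(*v));
        } else {
            out.push_back(std::nullopt);
        }
    }
    return out;
}
```

The view variant is only safe while the request buffer outlives the parsed options, which is the usual trade-off of this approach.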
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
While we have a with_gate() call when using the connection, it is
somewhere deep in the bowels of process_request_one(). If that
continuation is deferred, the gate can be closed without anyone
noticing that a request is still working on it; when we do hit the
with_gate, we'll see a use-after-free.
Large allocations can fail, and since we can't retry them after compaction
(yet), these failures are propagated to the client.
This patchset removes the largest offenders.
Instead of failing normal allocations when the seastar allocator cannot
allocate a segment, provide a generous reserve. An allocation failure
will now be satisfied from the reserve, but it will still trigger a
reclaim. This allows hiding low-memory conditions from the user.
"This series exposes statistics from the row_cache in the cache_service API.
After this series the following methods will be available:
get_row_hits
get_row_requests
get_row_hit_rate
get_row_size
get_row_entries"
Large vectors require contiguous storage, which may not be available (or may
be expensive to obtain). Switch to deque<> instead, which allocates
discontiguous storage.
Allocation problems were observed with the summary and with the bloom
filter bitmaps.
Like boost::dynamic_bitset, but less capable. On the other hand it avoids
very large allocations, which are incurred by the bloom filter's bitset
on even moderately sized sstables.
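To make the idea concrete, here is a minimal sketch of a bitset that stores its bits in fixed-size chunks so that no single allocation grows with the number of bits. The class name chunked_bitset and the 128 KiB chunk size are assumptions chosen for illustration; the class added by this patch may be organized differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch: a bitset backed by many small chunks instead of
// one large contiguous array.
class chunked_bitset {
    static constexpr size_t bits_per_word = 64;
    static constexpr size_t words_per_chunk = 16 * 1024;  // 128 KiB (assumed)
    static constexpr size_t bits_per_chunk = bits_per_word * words_per_chunk;

    std::vector<std::unique_ptr<uint64_t[]>> _chunks;
    size_t _nbits;
public:
    explicit chunked_bitset(size_t nbits) : _nbits(nbits) {
        size_t nchunks = (nbits + bits_per_chunk - 1) / bits_per_chunk;
        _chunks.reserve(nchunks);
        for (size_t i = 0; i < nchunks; ++i) {
            // make_unique<T[]> value-initializes, so every bit starts cleared.
            _chunks.push_back(std::make_unique<uint64_t[]>(words_per_chunk));
        }
    }

    size_t size() const { return _nbits; }

    // Callers must pass i < size(); no bounds checking is done here.
    bool test(size_t i) const {
        size_t off = i % bits_per_chunk;
        uint64_t word = _chunks[i / bits_per_chunk][off / bits_per_word];
        return (word >> (off % bits_per_word)) & 1u;
    }

    void set(size_t i) {
        size_t off = i % bits_per_chunk;
        _chunks[i / bits_per_chunk][off / bits_per_word] |=
            uint64_t(1) << (off % bits_per_word);
    }

    void clear(size_t i) {
        size_t off = i % bits_per_chunk;
        _chunks[i / bits_per_chunk][off / bits_per_word] &=
            ~(uint64_t(1) << (off % bits_per_word));
    }
};
```

The vector of chunk pointers is still contiguous, but it needs only one pointer per chunk, so it stays small even for bitsets of many millions of bits.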
To stop a compaction manager task, we first close the gate used by
compaction and then break the semaphore via semaphore::broken().
The problem is that semaphore::broken() only signals waiters, and so
subsequent semaphore::wait() calls would succeed and the task would
remain alive forever.
The fix is to signal the semaphore, forcing the task to exit via a gate
exception, so we no longer rely on semaphore::broken() for finishing
the task. That's possible because we try to enter the gate right after
waiting on the semaphore.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Add an abstract_replication_strategy::get_primary_ranges() method, which is
very similar to the existing get_ranges(), except that only the "primary"
owner of each range will return it in its list.
This is needed for the "primary range" repair option, which asks to repair
only the primary range. This option is useful when the user plans to start
a repair on *all* nodes: in that case we shouldn't repair the same token
range multiple times, so each range should be repaired by only one of the nodes.
abstract_replication_strategy::get_primary_ranges() is similar to Origin's
StorageService.getPrimaryRangesForEndpoint().
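The difference between the two methods can be sketched with a toy token ring. This is only an illustration under stated assumptions: the ring is modeled as a map from each range's end token to its ordered replica list, and the "primary" owner is taken to be the first replica in that list. The types token_range and endpoint and the free functions below are hypothetical and are not the real abstract_replication_strategy interface.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified types for illustration only.
using endpoint = std::string;
struct token_range {
    long start;  // exclusive
    long end;    // inclusive
};
// Maps the end token of each range to its replicas, primary first (assumed).
using ring_map = std::map<long, std::vector<endpoint>>;

// All ranges for which `ep` is a replica.
inline std::vector<token_range> get_ranges(const ring_map& ring,
                                           const endpoint& ep) {
    std::vector<token_range> out;
    if (ring.empty()) {
        return out;
    }
    long prev = ring.rbegin()->first;  // the first range wraps around the ring
    for (const auto& [tok, replicas] : ring) {
        if (std::find(replicas.begin(), replicas.end(), ep) != replicas.end()) {
            out.push_back({prev, tok});
        }
        prev = tok;
    }
    return out;
}

// Only the ranges for which `ep` is the first (primary) replica.
inline std::vector<token_range> get_primary_ranges(const ring_map& ring,
                                                   const endpoint& ep) {
    std::vector<token_range> out;
    if (ring.empty()) {
        return out;
    }
    long prev = ring.rbegin()->first;
    for (const auto& [tok, replicas] : ring) {
        if (!replicas.empty() && replicas.front() == ep) {
            out.push_back({prev, tok});
        }
        prev = tok;
    }
    return out;
}
```

With this model, every range has exactly one primary owner, so running a primary-range repair on all nodes covers each range once.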
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
There can be multiple sends underway when the first one detects an error
and destroys the rpc client, while the rpc client is still in use by the
other sends. Fix this by making the rpc client pointer shared and holding
a reference to it for the duration of each send operation.
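The lifetime rule can be illustrated with plain std::shared_ptr. This is a minimal sketch, not the messaging code itself: rpc_client and client_registry are hypothetical stand-ins, and a real send would keep its reference inside an asynchronous continuation rather than in a local variable.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the real rpc client.
struct rpc_client {
    int sends_completed = 0;
};

// Hypothetical registry that owns one client per peer address.
class client_registry {
    std::unordered_map<std::string, std::shared_ptr<rpc_client>> _clients;
public:
    // Each send takes its own reference, so the client stays alive until
    // the last in-flight send drops it.
    std::shared_ptr<rpc_client> get(const std::string& addr) {
        auto& slot = _clients[addr];
        if (!slot) {
            slot = std::make_shared<rpc_client>();
        }
        return slot;
    }

    // Called by the send that detects an error. This only drops the
    // registry's reference; other sends still hold theirs.
    void remove(const std::string& addr) {
        _clients.erase(addr);
    }

    bool contains(const std::string& addr) const {
        return _clients.count(addr) != 0;
    }
};
```

After remove(), a later get() for the same address creates a fresh client, while sends that started earlier finish safely on the old one.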
Origin forbids empty values in a clustering key only if that clustering
key is non-composite (i.e. there is only one column).
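The rule can be written down as a small predicate. This is a hedged sketch with a hypothetical helper name and signature; the actual check in the patch lives elsewhere and works on the project's own key types.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: returns true if the clustering key values are
// acceptable under the rule above. `clustering_columns` is the number of
// clustering columns defined by the schema.
inline bool clustering_key_values_valid(
        const std::vector<std::string_view>& values,
        size_t clustering_columns) {
    const bool is_composite = clustering_columns > 1;
    if (is_composite) {
        return true;  // empty components are allowed in composite keys
    }
    for (const auto& v : values) {
        if (v.empty()) {
            return false;  // a single-column key must not be empty
        }
    }
    return true;
}
```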
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
Requested by Avi. The added benefit is that the code for repairing
all the ranges in parallel is now identical to the code of repairing
the ranges one by one - just replace do_for_each with parallel_for_each,
and no need for a different implementation using semaphores like I had
before this patch.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
If we don't yield, we can run out of memory while moving a memtable into
cache.
This reduces the chance that writing an sstable will fail because we could
not transfer the memtable into the cache.