Commit Graph

12919 Commits

Author SHA1 Message Date
Paweł Dziepak
9d82a1ebfd abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Paweł Dziepak
31afc2f242 shared_index_lists: restore indentation
Message-Id: <20170821162934.25386-4-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Paweł Dziepak
93eaa95378 sstables: make shared_index_lists::get_or_load exception safe
Message-Id: <20170821162934.25386-3-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Avi Kivity
ef85cf1cb3 Merge "Compress in-memory compression-info" from Botond
"Overly large metadata can hog memory which especially hurts in setups
with bad disk/memory ratio. To ease the pain compress the in-memory
compression-info.

The compression is implemented based on Avi's idea which is to group n
offsets together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size).  Also offsets are allocated
only just enough bits to store their maximum value. The offsets are thus
packed in a buffer like so:
    arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
This of course means that stored offsets will not be aligned, not even
on a byte boundary, but the size reduction pretty convincing.
In addition, segments are stored in buckets, where each bucket has its
own base offset. In addition, segments in a buckets are optimized to
address as large of a chunk of the data as possible for a given chunk
size."

Ref #1946.

* 'bdenes/compress-compression-v3' of https://github.com/denesb/scylla:
  Add unit test for compress::offsets
  Optimise the storage of compression chunk offsets
  Add script to precompute segmented compression parameters
2017-08-22 10:30:58 +03:00
Botond Dénes
62c18da35c Add unit test for compress::offsets 2017-08-21 17:06:20 +03:00
Botond Dénes
028c7a0888 Optimise the storage of compression chunk offsets
To reduce the memory footprint of compression-info, n offsets are
grouped together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are
allocated only just enough bits to store their maximum value. The
offsets are thus packed in a buffer like so:
     arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.

The optimal value of n can be calculated for a given file_size (f) and
chunk_size (c), by finding the minima of the following function:

f(n) = (f/c)/n * (log2(f) + (n - 1)*log2((n-1)*(c + 64)))

This is done in an empirical way, using a script (see below).

Furthermore segments are stored in buckets, where each bucket has its
own base offset. Each bucket therefore can address an equal chunk of the
file and furthermore each segment in a bucket can address an equal
sub-chunk of this area.
The value of a given offset i is thus:
    bucket_base_offset_for(i) + segment_base_offset_for(i) + offset(i)

To account for the bucketed storage we calculate a local_f, which is
optimized so that a bucketful of segmented offsets can address the
largest possible chunk of f. As value of this local_f only depends on
the bucket_size (b) and c the value of n can be made independent of f
and therefore only depend on one dynamic value, c. This makes life much
simpler as we don't need to know the size of the file up-front, we can
just append buckets to the storage on demand, while the required storage
is still less than a third [1] of the original storage requirements
(std::deque<uint64>).

The table with the minima(f(n)) for different f and c values is
pre-computed by gen_segmented_compress_params.py and
stored in sstables/segmented_compress_params.hh. This script also
creates a table with the best values of local_f for the given
bucket_size. At runtime we only select the best params based on c.

[1] This was calculated for c=4K and b=4K
2017-08-21 17:06:12 +03:00
Avi Kivity
de011ece52 main: deprecate non-murmur3 partitioners more forcefully
Some (most?) users don't read logs or release notes, so they won't notice
that the ByteOrdered and Random partitioners were deprecated in 2.0. Make
them notice by refusing to start with a deprecated partitioner, unless a
switch is explicitly enabled.
Message-Id: <20170820073424.8331-1-avi@scylladb.com>
2017-08-21 14:32:22 +02:00
Avi Kivity
9f415ef870 sstables: accurate summary entry size calculation
Calculate the summary entry size correctly, so we don't end up with oversize
summaries.
Message-Id: <20170819184255.14181-2-avi@scylladb.com>
2017-08-21 14:28:57 +02:00
Avi Kivity
17c372bf0e sstables: get rid of 64kB minimum index advance to generate summary
Limiting summary entry generation to at most one summary entry
per 64k of index data can lead to large index pages, with thousands
of index entries per summary entry. These are slow to parse, and there
is no real gain from the limit, since we already enforce a size
limit on the summary.

Remove the limit and allow summary entry generation based solely on
spanned data size.

Fixes #2711.
Message-Id: <20170819184255.14181-1-avi@scylladb.com>
2017-08-21 14:26:44 +02:00
Avi Kivity
81a33df25d dht: reduce split_range_to_single_shard contiguous memory demand
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.

Fix by using a deque.

Fixes #2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
2017-08-21 14:25:45 +02:00
Piotr Jastrzebski
c602ffd610 Make Scylla ttl expiration behave like in Cassandra
Fixes #2497

[tgrabiec: reworked the title]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2f5a99dce6ef11fe0ef135c9fa0592078fc9a056.1502886874.git.piotr@scylladb.com>
2017-08-21 14:25:45 +02:00
Botond Dénes
eae33a1f19 Add script to precompute segmented compression parameters
The script generates sstables/segmented_compress_params.hh which
contains a list with the optimal number of grouped offsets for
different data and chunk sizes as well as a list with the best
nominal data sizes for different chunk sizes, given a bucket size.
Data sizes are in the range of [2**4,2**50] and chunks in the
range of [2**4, 2**30]. Data sizes that are not used with the current
bucket_size are ommited.
See next commit for details of how the calculated values are used.
2017-08-21 10:44:08 +03:00
Avi Kivity
5a2439e702 main: check for large allocations
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.

Message-Id: <20170819135212.25230-1-avi@scylladb.com>
2017-08-21 10:25:40 +03:00
Pekka Enberg
318423d50b Merge seastar upstream
* seastar 2d16aca...e96881a (4):
  > memory: add detector for large allocations
  > memory: reduce large allocations for small pools
  > net: Fix potential NULL pointer dereference in udp.cc
  > Update dpdk submodule
2017-08-21 10:24:08 +03:00
Tomasz Grabiec
8f2ca52740 tests: Run test_query_only_static_row test case on all mutation sources
The test checks behavior common to all mutation readers, so it's
better to run it against all mutation sources rather than only for
cache reader.

Message-Id: <1503072333-17995-1-git-send-email-tgrabiec@scylladb.com>
2017-08-20 12:23:28 +03:00
Raphael S. Carvalho
10eaa2339e compaction: Make resharding go through compaction manager
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.

Fixes #2671.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
2017-08-20 11:35:14 +03:00
Takuya ASADA
38b2ff617f dist/redhat: follow the change on libgcc/libstdc++ package name
Since we moved to external 3rdparty repository, we added '53' suffix on gcc
packages, so follow the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170819092039.1090-2-syuu@scylladb.com>
2017-08-19 16:01:28 +03:00
Takuya ASADA
f1b5401d1f dist/redhat: Change g++ command name on CentOS
We have added '-5.3' suffix on g++ command from scylla-gcc53-c++-5.3.1-2.2,
follow the change on scylla build script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170819092039.1090-1-syuu@scylladb.com>
2017-08-19 16:01:27 +03:00
Avi Kivity
e428805ba5 Merge "Optimize query result partition and row counts" from Duarte
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.

This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."

* 'calculate-counts/v3' of github.com:duarten/scylla:
  query-result: Send row and partition count over the wire
  query::result: Optimize calculate_counts()
2017-08-17 13:41:21 +03:00
Alexys Jacob
e5ff8efea3 dist: Fix Gentoo Linux scylla-jmx and scylla-tools packages detection
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.

This fixes the detection path of the packages for scylla_setup.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
2017-08-17 13:20:43 +03:00
Nadav Har'El
7832d8a883 get rid of unused part in configure.py
Scylla's configure.py contains stuff we copied from Seastar's
configure.py, but is no longer used. Let's get rid of some of it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170813150842.12603-1-nyh@scylladb.com>
2017-08-17 12:05:44 +03:00
Duarte Nunes
1e7f0eab82 memtable: Created readers should be fast forwardable by default
mutation_reader::forwarding defaults to yes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170816180304.2121-1-duarte@scylladb.com>
2017-08-17 10:21:01 +03:00
Botond Dénes
e70cfc8f36 incremental_reader_selector: account for possibly disengaged lower bound
In addition to the constructor (fixed previously) the check for no
sstables on the first call to select() also has to be prepared for the
lower bound of the range being disengaged.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4ab1296c71814fcd492996fa36fd00fd7bbbbc7f.1502949875.git.bdenes@scylladb.com>
2017-08-17 10:07:26 +03:00
Botond Dénes
af83b7f57b incremental_reader_selector: use lazy_deref instead of tertiary operator
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f4b884c6a1f517bd654f3b27608d854b17a66e1.1502948635.git.bdenes@scylladb.com>
2017-08-17 08:45:46 +03:00
Botond Dénes
eb7eee510d combined_mutation_reader_test: use the global const objects directly
Instead of local ones.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3ec1a70e4c0198c0563dff9688bbaa7fcfcace71.1502891190.git.bdenes@scylladb.com>
2017-08-16 16:56:42 +03:00
Paweł Dziepak
784dcbf1ca sstables: initialise index metrics on all shards
Fixes #2702.

Message-Id: <20170816085454.21554-1-pdziepak@scylladb.com>
2017-08-16 15:44:26 +03:00
Avi Kivity
d7e3fbc6fe Merge seastar upstream
* seastar 2a43102...2d16aca (1):
  > fstream: do not ignore unresolved future

Fixes #2697.
2017-08-16 15:09:59 +03:00
Botond Dénes
611774b1d9 Use the incremental reader for compaction
As leveled compaction strategy stands to gain the most from
incrementally opening sstables.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <292648d3fa4ea97376c0b4360754a20132194f63.1502822066.git.bdenes@scylladb.com>
2017-08-15 21:38:04 +03:00
Takuya ASADA
0f9b095867 dist/common/scripts: prevent ignoreing flag that passed after another flag which requires parameter
When user mistakenly forgot to pass parameter for a flag, our scripts misparses
next flag as the parameter.
ex) Correct usage is '--ntp-domain <domain> --setup-nic', but passed
    '--ntp-domain --setup-nic'.
Result of that, next flag will ignore by scripts.
To prevent such behavior, reject any parameter that start with '--'.

Fixes #2609

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170815114751.6223-1-syuu@scylladb.com>
2017-08-15 18:27:32 +03:00
Duarte Nunes
c7aa3ea069 mutation_partition: Remove obsolete short read detection
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.

This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
2017-08-15 12:01:55 +01:00
Avi Kivity
8df6dd1fa0 database: make incremental_reader_selector robust vs. full-range partition_range
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.

Fix by checking whether the bound exists or not.
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
2017-08-15 11:03:22 +01:00
Avi Kivity
a35bfb3ea9 Merge seastar upstream
* seastar 47b31f6...2a43102 (1):
  > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb

Fixes #2690
2017-08-14 16:23:02 +03:00
Avi Kivity
e892a0082a Merge "Drop exhausted mutation_readers when possible" from Duarte
"Exhausted readers belonging to a combined_mutation_reader can be fast
forwarded, so we have to keep them around. However, if the reader is
not fast forwardable, then we can drop the contained readers and their
buffers."

* 'ff-reader/v2' of github.com:duarten/scylla:
  combined_mutation_reader: Drop exhausted readers if not in FF mode
  combined_mutation_reader: Remove superfluous mutation_readers list
  memtable_snapshot_source: Created readers should be fast forwardable
2017-08-14 16:20:38 +03:00
Duarte Nunes
7fb6a74302 combined_mutation_reader: Drop exhausted readers if not in FF mode
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Duarte Nunes
0b53f88a42 combined_mutation_reader: Remove superfluous mutation_readers list
The _all_readers variable can do the same job.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Duarte Nunes
77477605c1 memtable_snapshot_source: Created readers should be fast forwardable
As they're used by the cache tests.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Avi Kivity
afff29bdb9 Merge seastar upstream
* seastar edb73ab...47b31f6 (1):
  > tls: Only recurse once in shutdown code

Fixes #2691.
2017-08-14 15:09:42 +03:00
Duarte Nunes
a17cef76b2 query-result-writer: Remove unneeded field
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811102940.22747-1-duarte@scylladb.com>
2017-08-14 12:33:33 +01:00
Duarte Nunes
ec75eac37d ring_position_exponential_vector_sharder: Take ranges by rvalue
Avoids some copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170814093310.29200-1-duarte@scylladb.com>
2017-08-14 12:55:43 +03:00
Duarte Nunes
3b9a9b7321 query-result: Send row and partition count over the wire
To avoid calculating them on the coordinator side.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:29:06 +02:00
Duarte Nunes
d7bab684ea query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:28:29 +02:00
Avi Kivity
cb2c5016ea Merge seastar upstream
* seastar 7a49ae5...edb73ab (11):
  > scripts: perftune.py: change the network module mode auto selection heuristic
  > net/tls: explicitly ignore ready future during shutdown
  > Use python2 explicitly as an interpreter for Python v2 scripts
  > peering_sharded_service: prevent over-run the container
  > Add link to documentation to the README.md
  > Add guidelines for contributing to Seastar
  > sharded: fix move constructor for peering_sharded_service services
  > Provide a convenient way to lazy-convert to string the values of pointers
  > tutorial: overhaul semaphores section
  > simple-stream: Make fragmented::write_substream return simple if possible
  > simple-stream: Make simple/fragmented memory output stream top level
2017-08-14 10:29:27 +03:00
Raphael S. Carvalho
050a7019b8 sstables/index_reader: fix index reader for summary entry spanning lots of keys
quantity prevents index_reader from reading all index entries of a summary
entry that span more than min_index_interval entries. That can happen after
introduction of size-based sampling, and consequently, sstable will not be
able to return a key which logical position in summary entry is beyond
min_index_interval. It's ok to not use quantity because index_reader will
read all indexes until either next summary entry or end of file is reached.

Fixes test_sstable_conforms_to_mutation_source

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
2017-08-12 09:44:16 +03:00
Duarte Nunes
08e284a07e combined_mutation_reader: Don't drop mutation readers
This patch fixes a regression introduced in a6b9186ca.

We should keep the readers around in case a subsequent call to
fast_forward() will require them.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811160444.12795-1-duarte@scylladb.com>
2017-08-11 19:17:29 +03:00
Duarte Nunes
44b6da2e90 test.py: Add combined_mutation_reader_test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811155017.9899-1-duarte@scylladb.com>
2017-08-11 18:54:11 +03:00
Avi Kivity
dbf8625ac9 Merge "size-based sampling for sstable summary" from Raphael
"Fixes #1842."

* 'size_based_sampling_v3' of github.com:raphaelsc/scylla:
  tests: test summary entry spanning more keys than min interval
  db/config: introduce sstable_summary_ratio option
  sstables: introduce size-based sampling for sstable summary
  sstables: make components_writer::offset const qualified and uint64_t
  sstables: make writer::offset const qualified and uint64_t
2017-08-11 18:41:45 +03:00
Duarte Nunes
e7d56884c0 list_reader_selector: Prevent infinite loop
In case the readers are empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811153142.8926-1-duarte@scylladb.com>
2017-08-11 18:34:55 +03:00
Vladimir Krivopalov
003e8cf250 Use python2 explicitly as an interpreter for Python v2 scripts
Signed-off-by: Vladimir Krivopalov <vladimir.krivopalov@gmail.com>
Message-Id: <20170811032712.4362-1-vladimir.krivopalov@gmail.com>
2017-08-11 18:08:11 +03:00
Duarte Nunes
20337053ad Don't use literal lambdas
These are only available in C++17. Fixes the build after b5460c2.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-11 13:08:42 +02:00
Duarte Nunes
b5460c2990 Merge "Support duration type" from Jesse
"This patch series adds support for the `duration` type in CQL, which
was added to Cassandra in 3.10.

As part of this work, it was necessary also to add support for the
`vint` and `unsigned vint` types to the native protocol implementation,
which are part of v5 of the specification.

To test interactively, it is necessary to use cqlsh distributed with
Cassandra, as the version we distribute does not yet support the
duration type."

* 'jhk/duration_protocol/v5' of https://github.com/hakuch/scylla:
  Support `duration` CQL native type
  CQL native protocol: Add support for `vint` serialization
  duration_test.cc: Add test for printing zero duration
  duration.cc: Remove nop `const` qualifier on return type
  Change `const` qualifier declaration order for `duration`
  duration.cc: Simplify range checking
  Rename `duration` to `cql_duration`
2017-08-11 10:56:55 +01:00