Commit Graph

15391 Commits

Glauber Costa
4b4e9f6c8c STCS_backlog: remove unused attribute
This attribute ended up being unused in the final version.
Spotted now while reading the code for other purposes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
10046593be compaction strategy: move size tiered backlog to a header
It's very common for other strategies to include a SizeTiered
step somewhere inside their algorithms: LCS will do SizeTiered on
L0, TWCS will do SizeTiered within a window, etc.

To make it easier for those strategies to consume the SizeTiered
backlog tracker, we will move that to its own file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
36ccb1dd7c compaction_strategy: delete major_compaction_strategy class
It was already unused before this series. In an earlier version I
used it to provide an ad-hoc backlog for major compactions. But now
that this is done by the compaction manager, this class really isn't
being used.

And it is likely it won't be: major compaction is not a compaction
strategy a user can choose, unlike the others that need to be built
through make_compaction_strategy.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:59 -04:00
Glauber Costa
9320d6f17f compaction: make sure that user-initiated compactions always have a minimum priority
We have observed the following behavior with user initiated compactions,
like major compactions:

- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
  progress.

Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-initiated compactions we would like to guarantee a minimum baseline.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:25 -04:00
Glauber Costa
c55ab93178 backlog_controller: add constants to represent a globally disabled controller
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.

We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.

So the mechanism to globally disable the controller is still needed,
and infinity is a good way to represent that: it's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it the other way
around from what it means, as just "a very large backlog".

Let's turn that into a constant instead. It will help us convey meaning.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:25:23 -04:00
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager-- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and their consequences, it is a lot more natural
to have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
it there, where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Glauber Costa
d3f985ef46 backlog_controller: allow users to compute inverse function of shares
There are some situations in which we want to force a specific number of
shares but don't have a backlog to derive them from. We can provide an
inverse function on the controller that returns the backlog corresponding
to a given number of shares.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-21 19:35:07 -04:00
Avi Kivity
d9c80cac26 dist: move Red Hat installation from .spec %install to new install.sh
Move code to a traditional install.sh script (more traditional would be
a "make install", but this is close enough).

This allows testing installation independently of packaging. In addition,
non-Red Hat-packaging can share much of the code in install.sh.

Ref #3243.

Tests: build+install rpm
Message-Id: <20180517114147.30863-1-avi@scylladb.com>
2018-05-17 13:46:27 +02:00
Avi Kivity
98967da94f Merge seastar upstream
* seastar 0a1a327...a6cb005 (1):
  > Merge " misc fixes for iotune" from Glauber
2018-05-17 12:42:46 +03:00
Avi Kivity
3b8118d4e5 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed in 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it causes the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
2018-05-16 15:38:29 +01:00
Avi Kivity
20271b3890 Update scylla-ami submodule
* dist/ami/files/scylla-ami e0b35dc...025644d (1):
  > Merge "AMI build fix" from Takuya
2018-05-16 12:33:45 +03:00
Avi Kivity
05cec4a265 Merge "Reduce LSA memory reclamation overhead" from Tomasz
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".

I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:

  c3f9e6ce1f/tests/perf_row_cache_update.cc

Other workloads, and other allocation sites probably also could see the
improvement.
"

* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
  lsa: Expose counters for allocation and compaction throughput
  lsa: Reduce amount of segment compactions
  lsa: Avoid the call to segment_pool::descriptor() in compact()
  lsa: Make reclamation on reserve refill more efficient
2018-05-16 10:24:20 +03:00
Tomasz Grabiec
534068a0f7 Update seastar submodule
Fixes #3339

* seastar 840002c...0a1a327 (7):
  > Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
  > Merge 'Handle Intel's NICs in a special way'  from Vlad
  > reactor: fix calculation of idle ticks
  > log: streamline logging internals a little
  > Merge "CMake imrovements and compatibility" from Jesse
  > iotune: fix typo in property name
  > cmake: do not find_package(Boost ...) if Boost is a target
2018-05-16 09:11:22 +02:00
Avi Kivity
832e8fb1e0 Merge "Support writing counters in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing counter cells in SSTables 3.x
format ('m'). The logic of writing counters is almost identical to that
used for the old 2.x format ('k'/'l') with the only difference that the
data length preceding serialised shards is written as a vint.

Tests: unit {release}.

Generated SSTables are verified to be processed fine by sstabledump
(note that sstabledump only outputs the binary data for counters, not
their actual values, same as sstable2json).

Verified with Cassandra 3.11 to get the expected values from the
counters table:
cqlsh> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)

Verified that the deleted counter can no longer be updated:
cqlsh> use sst3 ;
cqlsh:sst3> UPDATE counter_table SET rc1 = rc1 + 2 WHERE pk = 'key' AND ck = 'ck2';
cqlsh:sst3> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)
"

* 'projects/sstables-30/write_counters/v1' of https://github.com/argenet/scylla:
  tests: Unit tests to cover writing counters in SSTables 3.x format.
  sstables: Support writing counters for SSTables 3.x.
  sstables: Move code writing counter value into a separate helper.
2018-05-16 08:46:15 +03:00
Raphael S. Carvalho
59c57861ae tests/sstable_test: switch to dynamic temporary dir creation
The sstable test fails when running concurrently (for example, in release
and debug mode) because it uses a static temporary dir in lots of tests.
Fix it by switching to a dynamic temporary dir created with mkdtemp().
The sstable tests also now run in /tmp, which makes them much faster.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180516042044.15336-1-raphaelsc@scylladb.com>
2018-05-16 08:00:29 +03:00
Tomasz Grabiec
4fdd61f1b0 lsa: Expose counters for allocation and compaction throughput
Allow observing amplification induced by segment compaction.
2018-05-15 21:49:01 +02:00
Tomasz Grabiec
3775a9ecec lsa: Reduce amount of segment compactions
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).

This patch reduces the number of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, just eviction.

In the perf_row_cache_update test case for a large partition with lots
of rows, which simulates an appending workload, I measured that before
the patch, for each new object allocated, 2 needed to be migrated. After
the patch, only 0.003 objects are migrated. This reduces the run time of
the cache update part by 50%.
2018-05-15 21:49:01 +02:00
Vladimir Krivopalov
a16b8d5d77 tests: Unit tests to cover writing counters in SSTables 3.x format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
ffd8886da9 sstables: Support writing counters for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
28c3c21c73 sstables: Move code writing counter value into a separate helper.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Avi Kivity
5f3a5c436e Merge "chunked vector memory estimation" from Glauber
"
The memory estimations we have when using the chunked vector
are usually slightly wrong. We can make them more accurate by
exporting the memory usage directly as a chunked_vector API.
"

* 'chunked_memory-v2' of github.com:glommer/scylla:
  large_bitset: be more accurate with memory usage
  chunked_vector: exports its current memory usage
2018-05-15 19:00:36 +03:00
Glauber Costa
2ba08178ca large_bitset: be more accurate with memory usage
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can export its internal memory usage, we can use that
directly.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Glauber Costa
7190bb4f95 chunked_vector: exports its current memory usage
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:

1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.

2) multiply the number of chunks by 128kB. That is also wrong, because
the chunked_vector will not always allocate the entire chunk if there are
only a few elements in it.

The best way to deal with this is to allow the chunked_vector to export
its current memory usage.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Raphael S. Carvalho
83e64192d3 tests/perf: fix compaction and write mode of perf_sstable
storage_service_for_tests must be instantiated only once at a global
scope.

Fixes #3369.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180510042200.2548-1-raphaelsc@scylladb.com>
2018-05-15 18:00:18 +03:00
Avi Kivity
e0ef39705f dist: redhat: properly package scylla_blocktune.py
Commit 9eb8ea8b11 installed
scylla_blocktune.py as part of preparing the rpm, but forgot
to add it to the installed file list, breaking the rpm build.

Fix by listing the file in the %files section.
Message-Id: <20180506202807.5719-1-avi@scylladb.com>
2018-05-15 18:00:05 +03:00
Piotr Sarna
40bf5d671b cql: add secondary index metrics
This commit adds basic secondary index metrics to cql_stats:
 * total number of indexes created
 * total number of indexes dropped
 * total number of reads from a secondary index
 * total number of rows read from a secondary index

References #3384
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <d5eda7a343cee547c921dd4d289ecb1ac1c2bf24.1526374243.git.sarna@scylladb.com>
2018-05-15 17:59:53 +03:00
Avi Kivity
4f81e1f55a Merge "Use CRC32 to calculate checksums for SSTables 3.0." from Vladimir
"
SSTables 3.x (format 'm') use CRC32 instead of Adler32 for calculating
checksums. This patchset introduces support for CRC32 along with Adler32
in checksummed_file_writer to be used for SSTables written in 'mc'
format.

Structures and helpers introduced for CRC32 will be later used for
calculating checksums for compressed files as well (not a part of this
patchset).

Tests: unit {release}
"

* 'projects/sstables-30/write-digest-crc/v3' of https://github.com/argenet/scylla:
  tests: Add test covering checksumming SSTables 3.0 with CRC32.
  sstables: Support CRC32 checksum for SSTables 3.x.
  sstables: Move adler32 routines under the scope of a class.
  sstables: Move checksum utils into separate header.
  sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
2018-05-15 10:18:14 +03:00
Duarte Nunes
3a7d655d01 Merge 'transport: reduce unneeded continuations' from Avi
"
The native protocol server generates mant reactor tasks that
can be easily eliminated. I measured a read workload with 100%
cache hit rate, seeing the number of tasks per request drop
from ~31 to ~27, and an increase of 3% in throughput.
"

* tag 'transport-optimize-1/v1' of https://github.com/avikivity/scylla:
  transport: remove unused capture of flags variable
  transport: merge response write and error handling continuations
  transport: make write_repsonse() return void
  transport: de-template a lambda
  transport: merge memory-management and logging continuations
  transport: remove gate continuation
  transport: merge two response processing continuations
  transport: simplify response processing continuation
  transport: remove gratuitous continuation from process_request_one()
2018-05-14 10:12:07 +01:00
Avi Kivity
4500baaaf4 transport: remove unused capture of flags variable 2018-05-14 09:41:06 +03:00
Avi Kivity
88f8fe3168 transport: merge response write and error handling continuations
The response write continuation does not defer, so traditional try/catch
works well and saves a continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
3e8d1c8fd7 transport: make write_repsonse() return void
It just schedules the response, and returns immediately.

(I thought about calling it schedule_response(), but usually it will
write the response immediately, since waiting for network writes is
rare in a local network).
2018-05-14 09:41:06 +03:00
Avi Kivity
b26f36c2ec transport: de-template a lambda
Generic templates = annoying.
2018-05-14 09:41:06 +03:00
Avi Kivity
7a9b73f166 transport: merge memory-management and logging continuations
Merge a continuation that just keeps things alive with another that
just logs things.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0887a55e4 transport: remove gate continuation
with_gate() generates a continuation if the protected function defers.
Avoid that by merging a gate::leave() call with another, preexisting,
continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
876837a5da transport: merge two response processing continuations
We have one continuation transforming the result, and another shutting
down tracing. Since the first cannot defer, we can merge the two, reducing
the number of tasks processed by the reactor.
2018-05-14 09:41:06 +03:00
Avi Kivity
38619138be transport: simplify response processing continuation
A continuation in the response processing path is only doing
transformation on the output. Make that clear by returning a value,
not a future.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0a1478b6c transport: remove gratuitous continuation from process_request_one()
No need to call then() just to convert exceptions to futures,
futurize_apply() does this with less ado.
2018-05-14 09:41:06 +03:00
Vladimir Krivopalov
1da6144f90 tests: Add test covering checksumming SSTables 3.0 with CRC32.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
e6dfa008d8 sstables: Support CRC32 checksum for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
adb43959d1 sstables: Move adler32 routines under the scope of a class.
This is a step towards making digest algorithm customizable at compile
time.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
4e4030676f sstables: Move checksum utils into separate header.
Checksummed writer doesn't need to include all compression stuff.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Nadav Har'El
f5536d607e secondary index: fix multiple appearance of rows
This patch fixes a bug where queries using a secondary index would, in
some cases, produce the same rows multiple times.

The problem was that the code begins by finding a list of primary keys
that match the search, and then works on the partitions containing them.
If multiple rows matched in the same partition, the partition was considered
multiple times, and the same rows were output multiple times.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180510203141.17157-1-nyh@scylladb.com>
2018-05-13 20:08:14 +02:00
Avi Kivity
7d29addb1f mutation_reader: optimize make_combined_reader for the single-reader case
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.

Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
2018-05-13 20:07:10 +02:00
Duarte Nunes
a23bda3393 Merge 'Implement separate timeout for range queries' from Avi
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.

While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.

Fixes #3013.

Tests: unit (release)
"

* tag '3013/v1.1' of https://github.com/avikivity/scylla:
  storage_proxy: remove default_query_timeout()
  storage_proxy: don't use default timeouts
  query_options: augment with timeout_config
  thrift: configure thrift transport and handler with a timeout_config
  transport: configure native transport with a timeout_config
  cql3: define and populate timeout_config_selector
  timeout_config: introduce timeout configuration
2018-05-13 20:05:50 +02:00
Glauber Costa
3d2c4c1cf8 main: change I/O scheduler verification code
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.

Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.

If the property file is set, I am deciding here not to test for the
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards dispatch, making
that less important.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Glauber Costa
2e0c673432 database: release flush permits earlier
There is an ongoing discussion in issue 2678 about the right time to
release permits. Right now we are releasing the permit after we flush
all data for the memtable plus the SSTable's accompanying components --
flushing them, closing them, etc.

During all that time, we are increasing virtual dirty by adding more
data to the buffers but we are not able to decrease it-- until we
release the permit we can't start flushing the next memtable. This is
much more of a concern than I/O overlapping as described in the issue.

We have a hook in the SSTable write process that is (should be) called
as soon as data is written. We should move the permit release there.

We aren't, though, calling it as early as we could. The call to the
data-written hook happens only after the Index is closed, the summary is
sealed, etc.

This patch fixes that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-2-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Tomasz Grabiec
8faafdaae5 lsa: Avoid the call to segment_pool::descriptor() in compact() 2018-05-11 19:07:23 +02:00
Tomasz Grabiec
19edf3970e lsa: Make reclamation on reserve refill more efficient
Currently reserve refill allocates segments repeatedly until the
reserve threshold is met. If a single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we are trying to allocate. In particular, it would not attempt
to compact any segment until it first evicts the total amount of memory,
which may reduce the total number of segment compactions during
refill.

This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
2018-05-11 19:07:23 +02:00
Takuya ASADA
6fa3c4dcad dist/redhat: replace scylla-libgcc72/scylla-libstdc++72 with scylla-2.2 metapackage
We have a conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73; we need to replace the *72 packages
with the scylla-2.2 metapackage to prevent it.

Fixes #3373

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
2018-05-11 09:41:57 +03:00
Vladimir Krivopalov
f443e85476 sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 11:11:06 -07:00