Commit Graph

17174 Commits

Author SHA1 Message Date
Vladimir Krivopalov
c53afd7bba tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-12-05 15:29:28 -08:00
Vladimir Krivopalov
cfbde5b89c tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-12-05 15:29:28 -08:00
Vladimir Krivopalov
9955710cac tests: Extend test checking tombstones histogram to cover all SSTables versions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-12-05 12:36:22 -08:00
Vladimir Krivopalov
cdae62ec29 sstables: Properly track row-level tombstones when writing SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-12-05 12:36:22 -08:00
Vladimir Krivopalov
0f3fb32028 tests: Run min_max_clustering_key_test_2 for all SSTables versions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-12-05 12:36:22 -08:00
Vladimir Krivopalov
c474b0d851 tests: Make reusable_sst() helper accept SSTables version parameter.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-12-05 12:36:22 -08:00
Asias He
a5d8b66f2c gossip: Make favor newly added node log debug level
It is not very useful for user to know this.

Message-Id: <6c2dfc522d6974adb97c34fbc1e3a0339d2d530c.1543997137.git.asias@scylladb.com>
2018-12-05 10:45:03 +02:00
Avi Kivity
b0cb69ec25 Merge "Make sstable reader fail on unknown colum names in MC format" from Piotr
"
Before the reader was just ignoring such columns but this creates a risk of data loss.

Refs #2598
"

* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
  sstables: Add test_sstable_reader_on_unknown_column
  sstables: Exception on sstable's column not present in schema
  sstables: store column name in column_translation::column_info
  sstables: Make test_dropped_column_handling test dropped columns
2018-12-05 10:43:29 +02:00
Takuya ASADA
9388f3d626 reloc: drop --jobs from build_deb.sh/build_rpm.sh scripts
Since we merged relocatable package, build_deb.sh/build_rpm.sh only does
packaging using prebuilt binary taken from relocatable package, won't compile
anything.

So passing --jobs option to build_deb.sh/build_rpm.sh becomes meaningless,
we can drop it.

Note that we still can specify --jobs option on reloc/build_reloc.sh, it
runs "ninja-build -jN" to compile Scylla, then generate relocatable package.

See #3956

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181204205652.25138-1-syuu@scylladb.com>
2018-12-04 21:00:51 +00:00
Tomasz Grabiec
b8c405c019 Merge "Correct the usage of row ttl and add write-read test" from Piotr
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.

* seastar-dev.git haaawk/sst3/write-read-test/v3:
  Fix use_row_ttl condition
  Add test_all_data_is_read_back
2018-12-04 19:47:28 +01:00
Tomasz Grabiec
9a4c00beb7 utils/gz: Fix compilation on non-x86 archs
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.

Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>
2018-12-04 18:17:27 +00:00
Piotr Jastrzebski
fed3b51abe Add test_all_data_is_read_back
This tests that a source after being populated with a mutation
returns exactly the same mutation when read.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-12-04 11:42:08 +01:00
Piotr Sarna
7b0a3fbf8a auth: add abort_source to waiting for schema agreement
When the auth service is requested to stop during bootstrap,
it might have still not reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.

Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>
2018-12-04 10:41:09 +00:00
Piotr Jastrzebski
75b99838fc Fix use_row_ttl condition
Previous condition was wrong and was using row ttl too often.

We also have to change test_dead_row_marker to compare
resulting sstable with sstable generated by Origin not
by sstableupgrade.
This is because sstableupgrade transmits information about deleted row
marker automatically to cells in that row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-12-04 10:51:36 +01:00
Avi Kivity
c3e664eec2 Merge "Improve corrupt sstable reporting" from Rafael
"
This is a small step in fixing issue #2347. It is mostly tests and
testing infrastructure, but it does include a fix for a case where we
were missing the filename in the malformed_sstable_exception.
"

* 'espindola/sstable-corruption-v2' of https://github.com/espindola/scylla:
  Add a filename to a malformed_sstable_exception.
  Try to read the full sst in broken_sst.
  Convert tests to SEASTAR_THREAD_TEST_CASE.
  Check the exception message.
  Move some tests to broken_sstable_test.cc
2018-12-04 10:32:10 +02:00
Avi Kivity
414b14a6bd Merge "Make inactive shard readers evictable" from Botond
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
  reads.
* Behave very badly when too many of them is running concurrently
  (trashing).
* May deadlock if enough of them is running without a timeout.

The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
  duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
  handle) as each scan now can make progress by evicting inactive shard
  readers belonging to other scans.
* Will not deadlock at all.

In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrecy_semaphore` to be notified of
  newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
  `foreign_reader`.

I can unbundle these mini series and send them separately, if the
maintainers so prefer, altough considering that this series will have to
be backported to 3.0, I think this present form is better.

Fixes: #3835
"

* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
  tests/multishard_mutation_query_test: test stateless query too
  tests/querier_cache: fail resource-based eviction test gracefully
  tests/querier_cache: simplify resource-based eviction test
  tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
  tests/mutation_reader_test: restore indentation
  tests/mutation_reader_test: enrich pause-related multishard reader test
  multishard_combining_reader: use pause-resume API
  query::partition_slice: add clear_ranges() method
  position_in_partition: add region() accessor
  foreign_reader: add pause-resume API
  tests/mutation_reader_test: implement the pause-resume API
  query_mutations_on_all_shards(): implement pause-resume API
  make_multishard_streaming_reader(): implement the pause-resume API
  database: add accessors for user and streaming concurrency semaphores
  reader_lifecycle_policy: extend with a pause-resume API
  query_mutations_on_all_shards(): restore indentation
  query_mutations_on_all_shards(): simplify the state-machine
  multishard_combining_reader: use the reader lifecycle policy
  multishard_combining_reader: add reader lifecycle policy
  multishard_combining_reader: drop unnecessary `reader_promise` member
  ...
2018-12-04 10:22:35 +02:00
Botond Dénes
9de4f3a834 tests/multishard_mutation_query_test: test stateless query too
In the `test_read_all`, do a stateless read as well to ensure that path
works correctly as well.
2018-12-04 08:51:05 +02:00
Botond Dénes
6676ceba7f tests/querier_cache: fail resource-based eviction test gracefully
Currently when this test fails, resources are not released in the
correct order, which results in ASAN complaining about use-after-free
in debug builds. This is due to the BOOST_REQUIRE macro aborting the
test when the predicate fails, not allowing for correct destruction
order to take place.
To avoid this ugly failure, that adds noise and might cause a developer
investigating into the failure starting on the wrong path, use the more
mild BOOST_CHECK family of test macros. These will allow the test to run
to completion even when the predicate fails, allowing for the correct
destruction of the resources.
2018-12-04 08:51:05 +02:00
Botond Dénes
93e41397f7 tests/querier_cache: simplify resource-based eviction test
Now that we have an accessor for all concurrency semaphores, we don't
need the tricks of creating a dummy keyspace to get them. Use the
accessors instead.
2018-12-04 08:51:05 +02:00
Botond Dénes
dcd2d116a3 tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
Test the interaction of the multishard reader with the foreign reader
w.r.t next_partition(). next_partition() is a special operation, as it
its execution is deferred until the next cross-shard operations. Give it
some extra stress-testing.
2018-12-04 08:51:05 +02:00
Botond Dénes
20e994e526 tests/mutation_reader_test: restore indentation
Left over from the previous patch.
2018-12-04 08:51:05 +02:00
Botond Dénes
a577ff97e9 tests/mutation_reader_test: enrich pause-related multishard reader test
Enrich the existing test_multishard_combining_reader_as_mutation_source
test case with delaying the pause/resume and eviction of paused
readers.
2018-12-04 08:51:05 +02:00
Botond Dénes
22b14d593b multishard_combining_reader: use pause-resume API
Refactor the multishard combining reader to make use of the new
pause-resume API to pause inactive shard readers.

Make the pause-resume API mandatory to implement, as by now all existing
clients have adapted it.
2018-12-04 08:51:05 +02:00
Botond Dénes
77b758707c query::partition_slice: add clear_ranges() method
Allows for clearing any custom partition ranges, effectively resetting
them to the default ones. Useful for code that needs to set several
different specific partition ranges, one after the other, but doesn't
want to remember the last key it set a range for to be able to clear the
previous range with `clear_range()`.
2018-12-04 08:51:05 +02:00
Botond Dénes
a594fd39ce position_in_partition: add region() accessor 2018-12-04 08:51:05 +02:00
Botond Dénes
9601d23e0d foreign_reader: add pause-resume API
Allowing for pausing the reader and later resume it. Pausing the reader
waits on the ongoing read ahead (if any), executes any pending
`next_partition()` calls and than detaches the shard reader's buffer.
The paused shard reader is returned to the client.
Resuming the reader consists of getting the previously detached reader
back, or one that has the same position as the old reader had.
This API allows for making the inactive shard readers of the
`multishard_combining_reader` evictable.
The API is private, it's only accessible for classes knowing the full
definition of the `foreign_reader` (which resides in a .cc file).
2018-12-04 08:51:05 +02:00
Botond Dénes
a12fae366d tests/mutation_reader_test: implement the pause-resume API 2018-12-04 08:51:05 +02:00
Botond Dénes
f334d3717f query_mutations_on_all_shards(): implement pause-resume API 2018-12-04 08:51:05 +02:00
Botond Dénes
72ed655ef0 make_multishard_streaming_reader(): implement the pause-resume API 2018-12-04 08:51:05 +02:00
Botond Dénes
bf0d1f4eea database: add accessors for user and streaming concurrency semaphores
These will soon be needed to register inactive user and streaming reads
with the respective semaphores.
2018-12-04 08:51:05 +02:00
Botond Dénes
5f67a065c6 reader_lifecycle_policy: extend with a pause-resume API
This API provides a way for the mulishard reader to pause inactive shard
readers and later resume them when they are needed again. This allows
for these paused shard readers to be evicted when the node is under
pressure.
How the readers are made evictable while paused is up to the clients.

Using this API in the `multishard_combining_reader` and implementing it
in the clients will be done in the next patches.

Provide default implementation for the new virtual methods to facilitate
gradual adoption.
2018-12-04 08:51:05 +02:00
Botond Dénes
6f0e0c4ed7 query_mutations_on_all_shards(): restore indentation
The previous patch added half-aligned lines to improve readability of
that patch.
2018-12-04 08:51:05 +02:00
Botond Dénes
aa6083a75b query_mutations_on_all_shards(): simplify the state-machine
The `read_context` which handles creating, saving and looking-up the
shard readers has to deal with its `destroy_reader()` method called any
time, even before some other method finished its work. For example it is
valid for a reader to be requested to be destroyed, even before the
contexts finishes creating it.
This means that state transitions that take time can be interleaved with
another state transition request. To deal with this the read context
uses `future_` states, states that mark an ongoing state transitions.
This allows for state transition request that arrive in the middle of
another state transition to be attached as a continuation to the ongoing
transition, and to be executed after that finishes. This however
resulted in complex code, that has to handle readers being in all sorts
of different states, when the `save_readers()` method is called.
To avoid all this complexity, exploit the fact that `destroy_reader()`
receives a future<> as its argument, which resolves when all previous
state transitions have finished. Use a gate to wait on all these futures
to resolve. This way we don't need all those transitional states,
instead in `save_readers()` we only need to wait on the gate to close.
Thus the number of states `save_readers()` has to consider drops
drastically.

This has the theoretical drawback of the process of saving the readers
having to wait on each of the readers to stop, but in practice the
process finishes when the last reader is saved anyway, so I don't expect
this to result in any slowdown.
2018-12-04 08:51:05 +02:00
Botond Dénes
007619de4c multishard_combining_reader: use the reader lifecycle policy
Refactor the multishard combining reader and its clients to use the
reader lifecycle policy introduced in the previous patch.
2018-12-04 08:51:05 +02:00
Botond Dénes
0a616c899e multishard_combining_reader: add reader lifecycle policy
Currently `multishard_combining_reader` takes two functors, one for
creating the readers and optionally one for destroying them.
A bag of functors (`std::function`) however make for a terrible
interface, and as we are about to add some more customization points,
it's time to use something more formal: policy based design, a
well-known design pattern.

As well as merging the job of the two functors into a single policy
class, also widen the area of responsibility of the policy to include
keeping alive any resource the shard readers might need on their home
shard. Implementing a proper reader cleanup is now not optional either.

This patch only adds the `reader_managing_policy` interface,
refactoring the multishard reader to use it will be done in the next patch.
2018-12-04 08:51:05 +02:00
Botond Dénes
301abaca07 multishard_combining_reader: drop unnecessary reader_promise member
The `reader_promise` member of the `shard_reader` was used to
synchronize a foreground request to create the underlying reader with an
ongoing background request with the same goal. This is however
unnecessary. The underlying reader is created in the background only as
part of a read ahead. In this case there is no need for extra
synchronization point, the foreground reader create request can just
wait for the read ahead to finish, for which there already exists a
mean. Furthermore, foreground reader create requests are always followed
by a `fill_buffer()` request, so by waiting on the read ahead we ensure
that the following `fill_buffer()` call will not block.
2018-12-04 08:51:05 +02:00
Botond Dénes
a73175fdbc multishard_combining_reader: drop tracking of pending next_partition calls
Shard readers used to track pending `next_partition()` calls that they
couldn't execute, because their underlying reader wasn't created yet.
These pending calls were then executed after the reader was created.
However the only situation where a shard reader can receive a
`next_partition()` call, before its underlying reader wasn't created is
when `next_partition()` is called on the multishard reader before a
single fragment is read. In this case we know we are at a partition
boundary and thus this call has no effect, therefore it is safe to
ignore it.
2018-12-04 08:51:05 +02:00
Botond Dénes
ab3e639c3b foreign_reader: use bool for pending_next_partition
Foreign reader doesn't execute `next_partition()` calls straight away,
when this would require interaction with the remote reader. Instead
these calls are "remembered" and executed on the next occasion the
foreign reader has to interact with the remote reader. This was
implemented with a counter that counts the number of pending
`next_partition()` calls.
However when `next_partition()` is called multiple times, without
interleaving calls to `operator()()` or `fast_forward_to()`, only the
first such call has effect. Thus it doesn't make sense to count these
calls, it is enough to just set a flag if there was at least one such
call.
2018-12-04 08:51:05 +02:00
Botond Dénes
5a4fd1abab multishard_combining_reader: drop support for streamed_mutation fast-forwarding
It doesn't make sense for the multishard reader anyway, as it's only
used by the row-cache. We are about to introduce the pausing of inactive
shard readers, and it would require complex data structures and code
to maintain support for this feature that is not even used. So drop it.
2018-12-04 08:51:05 +02:00
Botond Dénes
b36733971b mutation_source_test: add option to skip intra-partition fast-forwarding tests
To allow for using this test suite for testing mutation sources that
don't support intra-partition fast-forwarding.
2018-12-04 08:51:05 +02:00
Botond Dénes
37f0117747 reader_concurrency_semaphore: refactor eviction mechanism
As we are about to add multiple sources of evictable readers, we need a
more scalable solution than a single functor being passed that opaquely
evicts a reader when called.
Add a generic way to register and unregister evictable (inactive)
readers to the semaphore. The readers are expected to be registered when
they become evictable and are expected to be unregistered when they
cease to become evictable. The semaphore might evict any reader that is
registered to it, when it sees fit.

This also solves the problem of notifying the semaphore when new readers
become evictable. Previously there was no such mechanism, and the
semaphore would only evict any such new readers when a new permit was
requested from it.
2018-12-04 08:51:00 +02:00
Rafael Ávila de Espíndola
21199a7a5c Add a filename to a malformed_sstable_exception.
It is reasonable for parse() to throw when it finds something wrong
with the format. This seems to be the best spot to add the filename
and rethrow.

Also add a testcase to make sure we keep handling this error
gracefully.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2018-12-03 13:50:23 -08:00
Rafael Ávila de Espíndola
a6e25e4bd0 Try to read the full sst in broken_sst.
With this patch we use data_consume_rows to try to read the entire
sstable. The patch also adds a test with a particular corruption that
would not be found without parsing the file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2018-12-03 13:47:49 -08:00
Rafael Ávila de Espíndola
b1190c58ec Convert tests to SEASTAR_THREAD_TEST_CASE.
This will simplify future changes to broken_sst.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2018-12-03 13:26:06 -08:00
Rafael Ávila de Espíndola
e5c5afffc9 Check the exception message.
This makes the tests a bit more strict by also checking the message
returned by the what() function.

This shows that some of the tests are out of sync with which errors
they check for. I will hopefully fix this in another pass.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2018-12-03 12:31:53 -08:00
Rafael Ávila de Espíndola
f9d81bcd43 Move some tests to broken_sstable_test.cc
sstable_test.cc was already a bit too big and there is potential for
having a lot of tests about broken sstables.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2018-12-03 12:16:30 -08:00
Rafael Ávila de Espíndola
cf4dc38259 Simplify state machine loop.
These loops have the structure :

while (true) {
  switch (state) {
  case state1:
  ...
  break;
  case state2:
  if (...) { ... break; } else {... continue; }
  ...
  }
  break;
}

There a couple things I find a bit odd on that structure:

* The break refers to the switch, the continue to the loop.
* A while (true) loop always hits a break or a continue.

This patch uses early returns to simplify the logic to

while (true) {
  switch (state) {
  case state1:
  ...
  return
  case state2:
  if (...) { ... return; }
  ...
  }
}

Now there are no breaks or continues.

Tests: unit (release)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181126171726.84629-1-espindola@scylladb.com>
2018-12-03 20:34:03 +01:00
Avi Kivity
b098b5b987 Merge "Optimize checksum_combine() for CRC32" from Tomek
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.

This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.

Performance results using perf_checksum and second buffer of length 64 KiB:

zlib CRC32 combine:   38'851   ns
libdeflate CRC32:      4'797   ns
fast_crc32_combine():     11   ns

So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.

Tested on i7-5960X CPU @ 3.00GHz

Performance was also evaluated using sstable writer benchmark:

  perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
     --value-size=10000 --rows 1000000 --datasets small-part

It yielded 9% improvement in median frag/s (129'055 vs 117'977).

Refs #3874
"

* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
  tests: perf_checksum: Test fast_crc32_combine()
  tests: Rename libdeflate_test to checksum_utils_test
  tests: libdeflate: Add more tests for checksum_combine()
  tests: libdeflate: Check both libdeflate and default checksummers
  sstables: Use fast_crc_combine() in the default checksummer
  utils/gz: Add fast implementation of crc32_combine()
  utils/gz: Add pre-computed polynomials
  utils/gz: Import Barett reduction implementation from libdeflate
  utils: Extract clmul() from crc.hh
2018-12-03 19:02:01 +02:00
Tomasz Grabiec
aa19f98d18 sstables: Write Statistics.db offset map entries in the same order as Cassandra
Before this patch we were writing offset map enteies in unspecified
order, the one returned by std::unorderd_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.

Fixes #3955.

Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
2018-12-03 16:40:24 +02:00
Avi Kivity
4dc402b53f Merge "Create sstable in a sub-directory" from Benny
"
Due to an XFS heuristic, if all files are in one (or a few) directories,
then block allocation can become very slow. This is because XFS divides
the disk into a few allocation groups (AGs), and each directory allocates
preferentially from a single AG. That AG can become filled long before
the disk is full.

This patchset works around the problem by:
- creating sstable component files in their own temporary, per-sstable sub-directory,
- moving the files back into the canonical location right after begin created, and finally
- removing the temp sub-directory when the sstable is sealed.
- In addition, any temporary sub-directories that might have been left over if scylla
  crashed while creating sstables are looked up and removed when populating the table.

Fixes: #3167

Tests: unit (release)
"

* 'issues/3167/v7' of https://github.com/bhalevy/scylla:
  distributed_loader::populate_column_family: lookup and remove temp sstable directories
  database: directly use std::experimental::filesystem::path for lister::path
  database: use std::experimental::filesystem::path for lister::path
  sstable: use std::experimental::filesystem rather than boost
  sstable::seal_sstable: fixup indentation
  sstable: create sstable component files in a subdirectory
  sstable::new_sstable_component_file: pass component_type rather than filename
  sstable: cleanup filename related functions
  sstable: make write_crc, write_digest, and new_sstable_component_file private methods
2018-12-03 16:26:12 +02:00