Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.
Refactor tests to use seastar::thread.
"
This series adds DEFAULT UNSET and DEFAULT NULL keyword support
to INSERT JSON statement, as stated in #3909.
Tests: unit (release)
"
* 'add_json_default_unset_2' of https://github.com/psarna/scylla:
tests: add DEFAULT UNSET case to JSON cql tests
tests: split JSON part of cql query test
cql3: add DEFAULT UNSET to INSERT JSON
(cherry picked from commit 447f953a2c)
libdeflate's build places some object files in the source directory, which is
shared between the debug and release builds. If the same object file (for the two
modes) is written concurrently, or if one mode reads it while the other writes it,
it will be corrupted.
Fix by not building the executables at all. They aren't needed, and we already
placed the libraries' objects in the build directory (which is unshared). We only
need the libraries anyway.
Fixes #4130.
Branches: master, branch-3.0
Message-Id: <20190123145435.19049-1-avi@scylladb.com>
(cherry picked from commit c83ae62aed)
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and reduce incremental compilation times.
Also reduces dependency on sstable writer code by removing writer bits from
sstables.hh.
The ka/la format writers are still left in sstables.cc, they could be also extracted.
"
* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
sstables: Make variadic write() not picked on substitution error
sstables: Extract MC format writer to mc/writer.cc
sstables: Extract maybe_add_summary_entry() out of components_writer
sstables: Publish functions used by writers in writer.hh
sstables: Move common write functions to writer.hh
sstables: Extract sstable_writer_impl to a header
sstables: Do not include writer.hh from sstables.hh
sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
sstables: types: Extract sstable_enabled_features::all()
sstables: Move components_writer to .cc
tests: sstable_datafile_test: Avoid dependency on components_writer
(cherry picked from commit b023e8b45d)
"
This series contains several optimizations of the MC format sstable writer, mainly:
- Avoiding output_stream when serializing into memory (e.g. a row)
- Faster serialization of primitive types when serializing into memory
I measured the improvement in throughput (frag/s) using perf_fast_forward for
datasets with a single large partition with many small rows:
- 10% for a row with a single cell of 8 bytes
- 10% for a row with a single cell of 100 bytes
- 9% for a row with a single cell of 1000 bytes
- 13% for a row with 6 cells of 100 bytes
"
* tag 'avoid-output-stream-in-sstable-writer-v2' of github.com:tgrabiec/scylla:
bytes_ostream: Optimize writing of fixed-size types
sstables: mc: Write temporary data to bytes_ostream rather than file_writer
sstables: mc: Avoid double-serialization of a range tombstone marker
sstables: file_writer: Generalize bytes& writer to accept bytes_view
sstables: Templatize write() functions on the writer
sstables: Turn m_format_write_helpers.cc into an impl header
sstables: De-futurize file_writer
bytes_ostream: Implement clear()
bytes_ostream: Make initial chunk size configurable
(cherry picked from commit e3f53542c9)
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMEs. A loader running on a
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo
Fixes #2538
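The shape of the backlog-to-delay mapping can be sketched as follows. This is a hypothetical saturating function for illustration only; the actual d(x) used in the series differs:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical delay function: the delay applied to a client write grows
// with the fraction of the view-update backlog currently in use on the
// base replica, saturating at max_delay when the backlog is full.
std::chrono::microseconds backlog_delay(uint64_t backlog,
                                        uint64_t max_backlog,
                                        std::chrono::microseconds max_delay) {
    if (max_backlog == 0) {
        return std::chrono::microseconds(0);
    }
    double fraction = std::min(1.0, double(backlog) / double(max_backlog));
    return std::chrono::microseconds(uint64_t(fraction * max_delay.count()));
}
```

With this shape an empty backlog adds no delay and a full backlog saturates at max_delay; the constant d(x) = 1s experiment above corresponds to ignoring the fraction entirely.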
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...
(cherry picked from commit b66f59aa3d)
Fixes a build failure when only the scylla binary was selected for
building like this:
./configure.py --with scylla
In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o
Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit edbef7400b)
"
zlib's crc32_combine() is not very efficient. It is faster to re-checksum
the buffer using crc32(). That is still a substantial amount of work, which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
Refs #3874
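The combine identity the fast implementation relies on can be illustrated on an unconditioned (no init/final XOR) reflected CRC-32. This is only a sketch: the real code handles zlib's pre/post-conditioning and replaces the O(len) zero-feeding loop below with carry-less multiplication and Barrett reduction, which is where the 11 ns figure comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const uint32_t crc_poly = 0xEDB88320u; // reflected CRC-32 polynomial

// Bitwise, unconditioned (no init/final XOR) reflected CRC-32.
uint32_t raw_crc32(uint32_t crc, const uint8_t* data, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ ((crc & 1) ? crc_poly : 0);
        }
    }
    return crc;
}

// Advance the CRC state as if `len` zero bytes were appended.
uint32_t shift_by_zeros(uint32_t crc, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ ((crc & 1) ? crc_poly : 0);
        }
    }
    return crc;
}

// CRC is linear over GF(2), so the checksum of a concatenation follows
// from the two checksums and the second length alone:
//   crc(A || B) == shift_by_zeros(crc(A), len(B)) ^ crc(B)
uint32_t raw_crc32_combine(uint32_t crc1, uint32_t crc2, size_t len2) {
    return shift_by_zeros(crc1, len2) ^ crc2;
}
```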
"
* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
tests: perf_checksum: Test fast_crc32_combine()
tests: Rename libdeflate_test to checksum_utils_test
tests: libdeflate: Add more tests for checksum_combine()
tests: libdeflate: Check both libdeflate and default checksummers
sstables: Use fast_crc_combine() in the default checksummer
utils/gz: Add fast implementation of crc32_combine()
utils/gz: Add pre-computed polynomials
utils/gz: Import Barett reduction implementation from libdeflate
utils: Extract clmul() from crc.hh
(cherry picked from commit b098b5b987)
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.
perf_checksum test was introduced to measure performance of various
checksumming operations.
Results for 514 B (relevant for writing with compression enabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 58414 16.711us 3.483ns 16.708us 16.725us
crc_test.perf_adler_combine 165788278 6.059ns 0.031ns 6.027ns 7.519ns
crc_test.perf_zlib_crc32_combine 59546 16.767us 26.191ns 16.741us 16.801us
---
crc_test.perf_deflate_crc32_checksum 12705072 83.267ns 4.580ns 78.687ns 98.964ns
crc_test.perf_adler_checksum 3918014 206.701ns 23.469ns 183.231ns 258.859ns
crc_test.perf_zlib_crc32_checksum 2329682 428.787ns 0.085ns 428.702ns 510.085ns
Results for 64 KB (relevant for writing with compression disabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 25364 38.393us 17.683ns 38.375us 38.545us
crc_test.perf_adler_combine 169797143 5.842ns 0.009ns 5.833ns 6.901ns
crc_test.perf_zlib_crc32_combine 26067 38.663us 95.094ns 38.546us 40.523us
---
crc_test.perf_deflate_crc32_checksum 202821 4.937us 14.426ns 4.912us 5.093us
crc_test.perf_adler_checksum 44684 22.733us 206.263ns 22.492us 25.258us
crc_test.perf_zlib_crc32_checksum 18839 53.049us 36.117ns 53.013us 53.274us
The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet; it delegates to zlib, so it's as slow as the latter.
Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, whose combine() is faster than
checksum().
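Why adler32's combine() is cheap while CRC32's is not can be seen from a naive sketch (hypothetical helper names, not the real implementation): the Adler-32 state is a pair of running sums, so the checksum of A||B follows from the two checksums and len(B) in O(1).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const uint32_t adler_mod = 65521; // largest prime below 2^16

uint32_t adler32_naive(const uint8_t* data, size_t len) {
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % adler_mod;
        b = (b + a) % adler_mod;
    }
    return (b << 16) | a;
}

// O(1) combine: processing B from the state left by A only shifts every
// running sum by (a_A - 1), so:
//   a(A||B) = a(A) + a(B) - 1                     (mod 65521)
//   b(A||B) = b(A) + b(B) + len(B) * (a(A) - 1)   (mod 65521)
uint32_t adler32_combine_naive(uint32_t ad1, uint32_t ad2, size_t len2) {
    uint32_t a1 = ad1 & 0xffff, b1 = ad1 >> 16;
    uint32_t a2 = ad2 & 0xffff, b2 = ad2 >> 16;
    uint64_t rem = len2 % adler_mod;
    uint32_t a = (a1 + a2 + adler_mod - 1) % adler_mod;
    uint32_t b = uint32_t((b1 + b2 + rem * ((a1 + adler_mod - 1) % adler_mod)) % adler_mod);
    return (b << 16) | a;
}
```

No such additive shortcut exists for CRC32, whose combine requires polynomial arithmetic over GF(2), hence the decision above to avoid it.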
SSTable write performance was evaluated by running:
perf_fast_forward --populate --data-directory /tmp/perf-mc \
--rows=10000000 -c1 -m4G --datasets small-part
Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.
Before:
[1] MC,lz4: 330'903
[2] LA,lz4: 450'157
[3] MC,checksum: 419'716
[4] LA,checksum: 459'559
After:
[1'] MC,lz4: 446'917 ([1] + 35%)
[2'] LA,lz4: 456'046 ([2] + 1.3%)
[3'] MC,checksum: 462'894 ([3] + 10%)
[4'] LA,checksum: 467'508 ([4] + 1.7%)
After this series, the performance of the MC format writer is similar to that
of the LA format before the series.
There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"
* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
tests: perf: Introduce perf_checksum
tests: Add test for libdeflate CRC32 implementation
sstables: compress: Use libdeflate for crc32
sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
licenses: Add libdeflate license
Integrate libdeflate with the build system
Add libdeflate submodule
sstables: Avoid checksum_combine() for the crc32 checksummer
sstables: compress: Avoid unnecessary checksum_combine()
sstables: checksum_utils: Add missing include
(cherry picked from commit 5e759b0c07)
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them are running concurrently
(thrashing).
* May deadlock if enough of them are running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
handle) as each scan now can make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini series and send them separately, if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think this present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
(cherry picked from commit 414b14a6bd)
"
This series adds proper handling of filtering queries with LIMIT.
Previously the limit was erroneously applied before filtering,
which led to truncated results.
To avoid that, paged filtering queries now use an enhanced pager,
which remembers how many rows were dropped and uses that information
to fetch more pages if the limit is not yet reached.
For unpaged filtering queries, paging is done internally, as in the
case of aggregations, to avoid keeping huge results in memory.
Also, previously, all limited queries used a page size computed
from max(page size, limit). That is not good for filtering,
because with LIMIT 1 we would then query for rows one by one.
To avoid that, filtered queries ask for the whole page and the results
are truncated afterwards if need be.
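The enhanced pager's fetch loop can be sketched with plain containers standing in for the real data source (all names hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fetch up to page_size rows starting at pos (a stand-in for a real page query).
std::vector<int> fetch_page(const std::vector<int>& data, size_t& pos, size_t page_size) {
    std::vector<int> page;
    while (pos < data.size() && page.size() < page_size) {
        page.push_back(data[pos++]);
    }
    return page;
}

// Keep requesting whole pages, filter them, and stop once `limit` rows
// have passed the filter -- rather than applying the limit before
// filtering, which would truncate the results.
std::vector<int> filtered_limited_query(const std::vector<int>& data,
                                        std::function<bool(int)> filter,
                                        size_t limit, size_t page_size) {
    std::vector<int> results;
    size_t pos = 0;
    while (results.size() < limit && pos < data.size()) {
        for (int row : fetch_page(data, pos, page_size)) {
            if (filter(row) && results.size() < limit) {
                results.push_back(row);
            }
        }
    }
    return results;
}
```

Rows dropped by the filter never count towards the limit, so a restrictive filter simply triggers more page fetches.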
Tests: unit (release)
"
* 'fix_filtering_with_limit_2' of https://github.com/psarna/scylla:
tests: add filtering with LIMIT test
tests: split filtering tests from cql_query_test
cql3: add proper handling of filtering with LIMIT
service/pager: use dropped_rows to adjust how many rows to read
service/pager: virtualize max_rows_to_fetch function
cql3: add counting dropped rows in filtering pager
(cherry picked from commit 1afda28cf3)
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.
The design constraints are:
1) The streamed writes should be visible to new writes, but the
sstable should not participate in compaction, or we would lose the
ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.
We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).
Fixes #3275
* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
tests: add materialized views test
tests: add view update generator to cql test env
main: add registering staging sstables read from disk
database: add a check if loaded sstable is already staging
database: add get_staging_sstable method
streaming: stream tables with views through staging sstables
streaming: add system distributed keyspace ref to streaming
streaming: add view update generator reference to streaming
main: add generating missed mv updates from staging sstables
storage_service: move initializing sys_dist_ks before bootstrap
db/view: add view_update_from_staging_generator service
db/view: add view updating consumer
table: add stream_view_replica_updates
table: split push_view_replica_updates
table: add as_mutation_source_excluding
table: move push_view_replica_updates to table.cc
database: add populating tables with staging sstables
database: add creating /staging directory for sstables
database: add sstable-excluding reader
table: add move_sstable_from_staging_in_thread function
...
(cherry picked from commit a38f6078fb)
This is a helper flat_mutation_reader that wraps another reader and
splits range tombstones over rows before emitting them.
It is used to produce the same mutation streams for both old (ka/la) and
new (mc) SSTables formats in unit tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/),
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host, including libc and the dynamic
linker (ld.so).
We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.
With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.
The packages are created by running
ninja build/release/scylla-package.tar
or
ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
"
This series is a refactor of password management, motivated by a
combination of correctness bugs, improving testability, improving
clarity, and adding documentation.
Tests: unit (release)
"
* 'jhk/passwords_refactor/v2' of https://github.com/hakuch/scylla:
auth: Clean up implementation comments
auth: Remove unnecessary local variable
auth: Allow different random engines for salt
auth: Correct modulo bias in salt generation
auth: Extract random byte generation for salt
auth: Split out test for best supported scheme
auth: Rename function to use full words
auth: Add domain-specific exception for passwords
auth: Document passwords interface
auth: Move password stuff to its own namespace
auth: Identify password hashing errors correctly
auth: Add unit tests for password handling
auth: Move password handling to its own files
auth: Construct `std::random_device` instances once
While the `password_authenticator` is a complex component with lots of
dependencies, password hashing and checking itself is a process with
limited logical state and dependencies, which makes it easy to isolate
and test.
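For reference, a bias-free salt generator can be sketched like this (a simplified stand-in for the actual auth code): std::uniform_int_distribution performs internally the rejection sampling that a plain engine() % 256 would get subtly wrong whenever the engine's range is not a multiple of 256.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Generate n uniformly distributed random bytes for a salt.
// Note: std::uniform_int_distribution may not be instantiated with a
// char type, hence the unsigned parameter and the narrowing cast.
std::vector<uint8_t> generate_salt(size_t n, std::mt19937& engine) {
    std::uniform_int_distribution<unsigned> dist(0, 255);
    std::vector<uint8_t> salt(n);
    for (auto& b : salt) {
        b = uint8_t(dist(engine));
    }
    return salt;
}
```

The real code is parameterized over the engine type (per the "Allow different random engines for salt" commit above); fixing on std::mt19937 here is purely for brevity.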
Compressing debug section reduces build size by 30% with no
significant increase in build time.
Results on a 4-core system (ninja release, size in MB):
before:
18056 build
real 59m43.138s
user 229m3.180s
sys 6m49.460s
after:
12387 build
real 60m30.112s
user 232m8.962s
sys 6m49.364s
Presumably, the difference in debug mode is even greater.
Message-Id: <20180811180444.30578-1-avi@scylladb.com>
After the open-source-parsers/jsoncpp@42a161f commit, jsoncpp's version
of valueToQuotedString no longer fits our needs, because too many
UTF-8 characters are unnecessarily escaped. To remedy that,
this commit provides our own string quoting implementation.
Reported-by: Nadav Har'El <nyh@scylladb.com>
Refs #3622
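The replacement quoting approach can be sketched as follows (a simplified stand-in, not the actual implementation): escape only what the JSON grammar requires (quote, backslash, control characters) and pass multi-byte UTF-8 sequences through verbatim.

```cpp
#include <cstdio>
#include <string>

std::string quote_json_string(const std::string& s) {
    std::string out = "\"";
    for (unsigned char c : s) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\b': out += "\\b"; break;
        case '\f': out += "\\f"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if (c < 0x20) {
                // Remaining control characters must be \u-escaped.
                char buf[8];
                std::snprintf(buf, sizeof(buf), "\\u%04x", unsigned(c));
                out += buf;
            } else {
                out += char(c); // includes UTF-8 continuation bytes, untouched
            }
        }
    }
    out += "\"";
    return out;
}
```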
An observable is used to decouple an information producer from a consumer
(in the same way as a callback), while allowing multiple consumers (called
observers) to coexist and to manage their lifetime separately.
Two classes are introduced:
observable: a producer class; when an observable is invoked all observers
receive the information
observer: a consumer class; receives information from an observable
Modelled after boost::signals2, with the following changes
- all signals return void; information is passed from the producer to
the consumer but not back
- thread-unsafe
- modern C++ without preprocessor hacks
- connection lifetime is always managed rather than leaked by default
- renamed to avoid the funky "slot" name
Message-Id: <20180709172726.5079-1-avi@scylladb.com>
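A minimal sketch of the two classes (simplified: thread-unsafe, void-returning callbacks, no disconnect-during-invocation handling, and the observable must outlive its observers):

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>

template <typename... Args>
class observable {
    using callback = std::function<void(Args...)>;
    std::list<callback> _callbacks;
public:
    class observer {
        observable* _parent;
        typename std::list<callback>::iterator _it;
    public:
        observer(observable* parent, typename std::list<callback>::iterator it)
            : _parent(parent), _it(it) {}
        observer(observer&& other) : _parent(other._parent), _it(other._it) {
            other._parent = nullptr;
        }
        // Managed connection lifetime: destruction disconnects.
        ~observer() {
            if (_parent) {
                _parent->_callbacks.erase(_it);
            }
        }
    };
    // Register a consumer; the returned observer disconnects on destruction.
    observer observe(callback f) {
        _callbacks.push_back(std::move(f));
        return observer(this, std::prev(_callbacks.end()));
    }
    // Invoking the observable delivers the arguments to every live observer.
    void operator()(Args... args) {
        for (auto& f : _callbacks) {
            f(args...);
        }
    }
};
```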
The multishard_writer class gets mutation_fragments generated from
flat_mutation_reader and consumes the mutation_fragments with
multishard_writer::_consumer. If the mutation_fragment does not belong to the
shard multishard_writer is on, it will forward the mutation_fragment to the
correct shard. The future returned by multishard_writer() becomes ready
when all the mutation_fragments have been consumed.
Tests: tests/multishard_writer_test.cc
Tests: dtest update_cluster_layout_tests.py
Fixes #3497
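The routing step can be sketched conceptually (hypothetical names; plain tokens stand in for mutation_fragments and their partition keys):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real token-to-shard mapping.
size_t shard_of(unsigned token, size_t shard_count) {
    return token % shard_count;
}

// Dispatch every fragment to the queue of the shard that owns its token
// instead of consuming it on the local shard.
std::vector<std::vector<unsigned>> distribute_fragments(
        const std::vector<unsigned>& tokens, size_t shard_count) {
    std::vector<std::vector<unsigned>> per_shard(shard_count);
    for (auto token : tokens) {
        per_shard[shard_of(token, shard_count)].push_back(token);
    }
    return per_shard;
}
```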
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored, they can be either a part of another IMR
object (once the rows are converted to the IMR) or a separate objects
(just like current atomic_cell).
"
These patches were extracted from a much larger series that introduces a new
in-memory representation of cells. They contain various enhancements and
fixes that to a varying degree make sense on their own. Sending them
separately will hopefully ease the review and merging process of the whole
IMR effort.
Tests: unit(release).
"
* tag 'pre-imr/v1' of https://github.com/pdziepak/scylla:
tests/perf: add microbenchmarks for basic row operations
tests: simple_schema: add make_row_from_serialized_value()
row: add clear_hash()
types: move compare_unsigned() to bytes.hh
lsa: provide migrator with the object size
lsa: add free() that does not require object size
db/view/build_progress: avoid copying mutation fragment
mutation_partition: enable ADL for cell swap
types: make some collection_type_impl functions non-static
counters: drop revertability of apply()
mutable_view: add default constructor and const_iterator
tests/mutation_reader: do not apply mutations created on another shard
sstables: do not call atomic_cell::value() for dead cells
lsa: sanitize use of migrators
lsa: reuse registered migrator ids
lsa: make migrators table thread-local
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.
It comes with two implementations:
* NOP, used in tests, which does nothing on large partition
update/delete
* CQL TABLE, which inserts/deletes information on particular sstable
to system.large_partitions table, in order to be retrievable from
cqlsh later.
References #3292
The db/index directory contains just a few lines of code that exists
there for historical reasons. It's confusing that we have both db/index
and index/ directory related to secondary-indexing.
This patch moves what little is still in db/index/ to index/. In the
future we should probably get rid of the "secondary_index" class we had
there, but for now, let's at least not have a whole new directory for it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501101246.21143-1-nyh@scylladb.com>
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now, the serialization header from the 3.x format is ignored.
Tests: units (release)
"
* 'haaawk/sstables3/loading_v4' of ssh://github.com/scylladb/seastar-dev:
Add test for loading the whole sstable
Add test for loading statistics
Add support for 3_x stats metadata
Pass sstable version to describe_type
Pass sstable version to write methods
metadata_type: add Serialization type
Pass sstable_version_types to parse methods
Add test for reading filter
Add test for read_summary
sstables 3.x: Add test for reading TOC
sstable: Make component_map version dependent
sstable::component_type: add operator<<
Extract sstable::component_type to separate header
Remove unused sstable::get_shared_components
sstable_version_types: add mc version