The query processor sits upper than the migration manager,
in the services layering, it's started after and (will be)
stopped before the migration manager.
The migration manager is needed in schema altering statements
which are called with query processor argument. They will
later get the migration manager from the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
has_monotonic_positions() wants to check for a greater-than-or-equal-to
relation, but actually tests for not-equal, since it treats a
trichotomic comparator as a less-than comparator. This is clearly seen
in the BOOST_FAIL message just below.
Fix by aligning the test with the intended invariant. Luckily, the tests
still pass.
Ref #1449.
Closes#8222
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until its merged, given that it also depends on other unmerged pull requests.
The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).
Closes#8140
* github.com:scylladb/scylla:
treewide: remove timeout config from query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
service: add timeout config to client state
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.
Fixes#2622
Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.comCloses#8111
* github.com:scylladb/scylla:
sstables: add test for checking the latency of updating the sstable_set in a table
sstables: move column_family_test class from test/boost to test/lib
sstables: use fast copying of the sstable_set instead of rebuilding it
sstables: replace the sstable_set with a versioned structure
sstables: remove potential ub
sstables: make sstable_set constructor less error-prone
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.
However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
The implementation is a stub for now, the service reacts to generation
changes by simply logging the event.
The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
initialization
As a preparation for introducing CDC generation management service.
cdc_service will depend on the generation service.
But the generation service needs some other services to work
properly. In particular, it uses the local database, so it should be
initialized after the local database.
The only service that will need the cdc generation service is
storage_service, so we can place the generation service initialization
code right before storage_service initialization code. So the order will
be cdc_generation_service -> cdc_service -> storage_service.
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage and then it's switched to std::set. Once switched, the whole
union becomes the waste of space, as it's size is
sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes
and only 3 pointers from it are used (std::set header). Also the
overhead to keep cell_and_hash as a set entry is more then the size
of the structure itself.
Column ids are 32-bit integers that most likely come sequentialy.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.
This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.
The most notable result is the memory footprint decrease, for wide
rows down to 2x times. The performance of micro-benchmarks is a bit
lower for small rows and (!) higer for longer (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2)
v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
- changed functions not to return values via refs-arguments
- fixed nested classes to properly use language constructors
- renamed index_to to key_t to distinguish from node_index_t
- improved recurring variadic templates not to use sentinel argument
- use standard concepts
v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)
A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as cells container most of this growth drowned in lsa alignments.
next todo:
- aarch64 version of 16-keys node search
tests: unit(dev), unit(debug for radix*), pref(dev)
"
* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
test/memory_footpring: Print radix tree node sizes
row: Remove old storages
row: Prepare row::equal for switch
row: Prepare row::difference for switch
row: Introduce radix tree storage type
row-equal: Re-declare the cells_equal lambda
test: Add tests for radix tree
utils: Compact radix tree
array-search: Add helpers to search for a byte in array
test/perf_collection: Add callback to check the speed of clone
test/perf_mutation: Add option to run with more than 1 columns
test/perf_mutation: Prepare to have several regular columns
test/perf_mutation: Use builder to build schema
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode yet the stall detector's purpose is to detect
reactor stalls during normal operation where they can increase
the latency of other queries running in parallel.
Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just refrence
the respective github issues rather than auto-close them.
Refs #7150
Refs #5192
Refs #5960
Restore blocked_reactor_notify_ms right before
starting storage_proxy. Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.
Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
Now when the 3rd storage type (radix tree) is all in, old
storage can be safely removed. The result is:
1. memory footprint
sizeof(class row): 112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes
the "in cache" value depends on the number of cells:
num of cells master patch
1 752 656
2 808 712
3 864 768
4 920 824
5 968 936
6 1136 992
...
16 1840 1672
17 1904 1992 (+88)
18 1976 2048 (+72)
19 2048 2104 (+56)
20 2120 2160 (+40)
21 2184 2208 (+24)
22 2256 2264 ( +8)
23 2328 2320
...
32 2960 2808
After 32 cells the storage switches into rbtree with
24-bytes per-cell overhead and the radix tree improvement
rocketlaunches
64 7872 6056
128 15040 9512
256 29376 18568
2. perf_mutation test is enhanced by this series and the
results differ depending on the number of columns used
tps value
--column-count master patch
1 59.9k 57.6k (-3.8%)
2 59.9k 57.5k
4 59.8k 57.6k
8 57.6k 57.7k <- eq
16 56.3k 57.6k
32 53.2k 57.4k (+7.9%)
A note on this. Last time 1-column test was ~5% worse which
was explained by inline storage of 5 cells that's present on
current implementation and was absent in radix tree.
An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.
The 1.2% improvement from v2 surprisingly came from the
tree::clone_from() which in v2 was work-around-ed by slow
walk+emplace sequence while this version has the optimized
API call for cloning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Column_family_test allows performing private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache information needed to created them
(temporary_buffer<>). This solution takes this approach.
This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index
entry is parsed, by it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.
Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.
Load driver command line:
./scylla-bench \
-workload uniform \
-mode read \
--partition-count=10 \
-clustering-row-count=1 \
-concurrency 100
Scylla command line:
scylla --developer-mode=1 -c1 -m1G --enable-cache=0
The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.
Before:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 4706 4706 0 35ms 30ms 27ms 25ms 24ms 21ms 21ms
2s 4646 4646 0 42ms 31ms 31ms 27ms 25ms 21ms 22ms
3.1s 4670 4670 0 40ms 27ms 26ms 25ms 25ms 21ms 21ms
4.1s 4581 4581 0 39ms 33ms 33ms 27ms 26ms 21ms 22ms
5.1s 4345 4345 0 40ms 37ms 35ms 32ms 31ms 21ms 23ms
6.1s 4328 4328 0 49ms 40ms 34ms 32ms 31ms 22ms 23ms
7.1s 4198 4198 0 45ms 36ms 35ms 31ms 30ms 22ms 24ms
8.2s 3913 3913 0 51ms 50ms 50ms 39ms 35ms 24ms 26ms
9.2s 4524 4524 0 34ms 31ms 30ms 28ms 27ms 21ms 22ms
After:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 7913 7913 0 25ms 25ms 20ms 15ms 14ms 12ms 13ms
2s 7913 7913 0 18ms 18ms 18ms 16ms 14ms 12ms 13ms
3s 8125 8125 0 20ms 20ms 17ms 15ms 14ms 12ms 12ms
4s 5609 5609 0 41ms 35ms 29ms 28ms 27ms 13ms 18ms
5.1s 8020 8020 0 18ms 17ms 17ms 15ms 14ms 12ms 13ms
6.1s 7102 7102 0 27ms 27ms 24ms 19ms 18ms 13ms 14ms
7.1s 5780 5780 0 26ms 26ms 26ms 23ms 22ms 17ms 18ms
8.1s 6530 6530 0 37ms 34ms 26ms 22ms 20ms 15ms 15ms
9.1s 7937 7937 0 19ms 19ms 17ms 17ms 16ms 12ms 13ms
Tests:
- unit [release]
- scylla-bench
"
* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
sstables: Share partition index pages between readers
sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
sstables: index_reader: Do not store cluster index cursor inside partition indexes
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).
In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.
We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parmeters that calls the base-class'
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If possible, test the highest sstable format version,
as it's the mostly used.
If there pre-written sstables we need to load from the
test directory from an older version, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speedup tests by not needing to start the gossiper.
The thing is we always start gossiper in our cql tests, so the flag only
introduce noise. And, of course, since we want to move schema to use raft
it goes against the nature of the raft to be able to apply modification only
locally, so we better get rid of the capability ASAP.
Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevent [[nodiscard]] violations from warning.
Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.
Closes#7767
* github.com:scylladb/scylla:
build: reinstate -Wunused-value warning for [[nodiscard]]
test: lib: don't ignore future in compare_readers()
test: mutation_test: check both ranges when comparing summaries
serialializer: silence unused value warning in variant deserializer
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The used set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations: 0
single run duration: 1.000s
number of runs: 10
test iterations median mad min max
clustering_combined.ranges_generic 2911678 117.598ns 0.685ns 116.175ns 119.482ns
clustering_combined.ranges_specialized 3005618 111.015ns 0.349ns 110.063ns 111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes#7688
* github.com:scylladb/scylla:
tests: mutation_source_test for clustering_order_reader_merger
perf: microbenchmark for clustering_order_reader_merger
mutation_reader_test: test clustering_order_reader_merger in memory
test: generalize `random_subset` and move to header
mutation_reader: introduce clustering_order_reader_merger
Now that CDC is GA, it should be enabled in all the tests by default.
To achieve that the PR adds a special db::config::add_cdc_extension()
helper which is used in cql_test_envm to make sure CDC is usable in
all the tests that use cql_test_env.m As a result, cdc_tests can be
simplified.
Finally, some trailing whitespaces are removed from cdc_tests.
Tests: unit(dev)
Closes#7657
* github.com:scylladb/scylla:
cdc: Remove trailing whitespaces from cdc_tests
cdc: Remove mk_cdc_test_config from tests
config: Add add_cdc_extension function for testing
cdc: Add missing includes to cdc_extension.hh
"
The qctx is global object that references query processor and
database to let the rest of the code query system keyspace.
As the first step of de-globalizing it -- remove the database
reference from it. After the set the qctx remains a simple
wrapper over the query processor (which is already de-globalized)
and the query processor in turn is mostly needed only to parse
the query string into prepared statement only. This, in turn,
makes it possible to remove the qctx later by parsing the
query strings on boot and carrying _them_ around, not the qctx
itself.
tests: unit(dev), dtest(simple_cluster_driver_test:dev), manual start/stop
"
* 'br-remove-database-from-qctx' of https://github.com/xemul/scylla:
query-context: Remove database from qctx
schema-tables: Use query processor referece in save_system(_keyspace)?_schema
system-keyspace: Rewrite force_blocking_flush
system-keyspace: Use cluster_name string in check_health
system-keyspace: Use db::config in setup_version
query-context: Kill global helpers
test: Use cql_test_env::evecute_cql instead of qctx version
code: Use qctx::evecute_cql methods, not global ones
system-keyspace: Do not call minimal_setup for the 2nd time
system-keyspace: Fix indentation after previous patch
system-keyspace: Do not do invoke_on_all by hands
system-keyspace: Remove dead code
This PR allows changing the hinted_handoff_enabled option in runtime, either by modifying and reloading YAML configuration, or through HTTP API.
This PR also introduces an important change in semantics of hinted_handoff_enabled:
- Previously, hinted_handoff_enabled controlled whether _both writing and sending_ hints is allowed at all, or to particular DCs,
- Now, hinted_handoff_enabled only controls whether _writing hints_ is enabled. Sending hints from disk is now always enabled.
Fixes: #5634
Tests:
- unit(dev) for each commit of the PR
- unit(debug) for the last commit of the PR
Closes#6916
* github.com:scylladb/scylla:
api: allow changing hinted handoff configuration
storage_proxy: fix wrong return type in swagger
hints_manager: implement change_host_filter
storage_proxy: always create hints manager
config: plug in hints::host_filter object into configuration
db/hints: introduce host_filter
hints/resource_manager: allow registering managers after start
hints: introduce db::hints::directory_initializer
directories.cc: prepare for use outside main.cc
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.
This patch replaces all occurences of disable_failure_guard with
scoped_critical_alloc_section.
Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The constructors just set up the references, real start happens in .start()
so it is safe to do this early. This helps not carrying migration manager
and query processor down the storage service cluster joining code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.
To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).
The qctx itself is a global pointer, which waits for its fix too, of course.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add additional comments to select_statement_utils, fix formatting, add
missing #pragma once and introduce set_internal_paging_size_guard to
set internal_paging in RAII fashion.
Closes#7507
Before this change, internal page_size when doing aggregate, GROUP BY
or nonpaged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug build), because the tests had
to be large to test cases when there are multiple internal pages.
This change adds new internal_paging_size variable, which is
configurable by set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is only meant for testing purposes.
Require a schema and an operation name to be given to each permit when
created. The schema is of the table the read is executed against, and
the operation name, which is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kind of
reads, but different code paths the read took.
As not all read can be associated with one schema, the schema is allowed
to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
Allow the evictable reader managing the underlying reader to pass its
own permit to it when creating it, making sure they share the same
permit. Note that the two parts can still end up using different
permits, when the underlying reader is kept alive between two pages of a
paged read and thus keeps using the permit received on the previous
page.
Also adjust the `reader_context` in multishard_mutation_query.cc to use
the passed-in permit instead of creating a new one when creating a new
reader.