"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.
Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.
Tests: unit {release}
+ tested Scylla with both 'la' and 'mc' formats to work fine:
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; [3/1890]
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
<<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
<<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;
pk | ck | rc
----+----+------
2 | 3 | null
4 | 6 | null
(2 rows)
<<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;
pk | ck | rc
----+----+------
5 | 7 | null
8 | 10 | null
2 | 3 | null
4 | 6 | null
7 | 9 | null
6 | 8 | null
(6 rows)
"
* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
database: Honour enable_sstables_mc_format configuration option.
sstables: Support SSTables 'mc' format as a feature.
db: Add configuration option for enabling SSTables 'mc' format.
tests: Add test for reading a complex column with zero subcolumns (SST3).
sstables: Fix parsing of complex columns with zero subcolumns.
sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
bytes: Add helper for turning bytes_view into sstring_view.
sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
sstables: Fix string formatting for exception messages in m_format_read_helpers.
sstables: Don't validate timestamps against the max value on parsing.
sstables: Always store only min bases in serialization_header.
sstables: Support 'mc' version parsing from filename.
SST3: Make sure we call consume_partition_end
There is a bad interaction between may_need_paging() and query result
size limiter. The former is trying to avoid the complexity of paged
queries when the number of returned rows is going to be smaller than the
page size. The latter uses the fact that paged queries need not return
all requested rows to limit the size of a query results. Since
may_need_paging() may turn a paged query into non-paged one as a side
effect it disables the oversized result protection.
This patch limits the cases when may_need_paging() disables paging to
the situations when we know for sure that query result size limiter
won't be needed, i.e.: the result is not going to contain more than one
row. If the client knows for sure that the paging is not needed and
the performance impact is worthwhile it can disable paging on its side.
Otherwise, let's default to the safer behaviour.
Fixes#3620.
Message-Id: <20180925134431.24329-1-pdziepak@scylladb.com>
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.
Usually, the gap in which this happens is short, because executor
reference holders timeout quickly as well. It's not always the case
though. For instance, local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.
Fixes#3734.
Another problem is that all reads which received CL responses are
accounted as background, until all replicas respond, but if such read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.
Fixes#3745.
This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.
Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
When a joining node announcing join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes
that causes the error like below:
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from 127.0.4.1#0:
std::runtime_error (sorted_tokens is empty in first_token_index!)
To fix, wait for the token range setup before announcing the join
status.
Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test
Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.
This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"
* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
database: make database's mutation apply stage inherit its scheduling group from the caller
database: make database::_mutation_query_stage inherit the scheduling group
database: make database::_data_query_stage inheriting its caller's scheduling_group
storage_proxy: make _mutate_stage inherit its caller's scheduling_group
This error is transient, since as soon as the node is up we will be able
to send the migration request. Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).
The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.
Fixes#3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.
Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statment scheduling groups, later.
After ac27d1c93b if a read executor has just enough targets to
achieve request's CL and a connection to one of them will be dropped
during execution ReadFailed error will be returned immediately and
client will not have a chance to issue speculative read (retry). The
patch changes the code to not return ReadFailed error immediately, but
wait for timeout instead and give a client chance to issue speculative
read in case read executor does not have additional targets to send
speculative reads to by itself.
Fixes#3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
Query options may contain bound values needed for checking filtering
restrictions. Previously, empty query_options{} were used, which
caused prepared statements to fail.
Fixes#3677
We need the mapping between dht::token_range to
std::vector<inet_address> and inet_address to dht::token_range_vector in
various places. Currently, we use std::unordered_multimap and convert to
std::unordered_map. It is better to use std::unordered_map in the first
place. The changes like below:
- Change from
std::unordered_multimap<dht::token_range, inet_address>
to
std::unordered_map<dht::token_range, std::vector<inet_address>>
- Change from
std::unordered_multimap<inet_address, dht::token_range>
to
std::unordered_map<inet_address, dht::token_range_vector>
Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.
Fixes#3636
Message-Id: <20180801093147.GD23569@scylladb.com>
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.
In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.
Removing it simplifies the cluster operation logic and code.
Fixes#3475
Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
"
This series adds some optimisations to the paging logic, that attempt to
close the performance gap between paged and not paged queries. The
former are more complex so always are going to be slower, but the
performance loss was unacceptably large.
Fixes#3619.
Performance with paging:
./perf_paging_before ./perf_paging_after diff
read 271246.13 312815.49 15.3%
Without paging:
./perf_nopaging_before ./perf_nopaging_after diff
read 343732.17 342575.77 -0.3%
Tests: unit(release), dtests(paging_test.py, paging_additional_test.py)
"
* tag 'optimise-paging/v1' of https://github.com/pdziepak/scylla:
cql3: select statement: don't copy metadata if not needed
cql3: query_options: make simple getter inlineable
cql3: metadata: avoid copying column information
query_pager: avoid visiting result_view if not needed
query::result_view: add get_last_partition_and_clustering_key()
query::result_reader: fix const correctness
tests/uuid: add more tests including make_randm_uuid()
utils: uuid: don't use std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; This hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for to uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
query::result_visitor provides get_last_partition_and_clustering_key()
which allows getting those without iterating through the whole result.
Moreover, row count may be precomputed in the result, if it isn't there
is query::result_view::count_partitions_and_rows() for getting it.
Count operations which were started on one shard and
were performed on another, due to non-shard-aware driver
and/or RPC.
Message-Id: <20180723155118.8545-1-avi@scylladb.com>
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.
Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes appearantly reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.
Fixes#3605.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.
Fixes#3590.
Message-Id: <20180708123522.GC28899@scylladb.com>
Require a timeout parameter for storage_proxy::mutate_begin() and
all its callers (all the way to thrift and cql modification_statement
and batch_statement).
This should fix spurious debug-mode test failures, where overcommit
and general debug slowness result in the default timeouts being
exceeded. Since the tests use infinite timeouts, they should not
time out any more.
Tests: unit (release), with an extra patch that aborts
when a non-infinite timeout is detected.
Message-Id: <20180707204424.17116-1-avi@scylladb.com>
"
The main idea of this series is to provide a filtering_visitor
as a specialised result_set_builder::visitor implementation
that keeps restriction info and applies it on query results.
Also, since allow_filtering checking is not correct now (e.g. #2025)
on select_statement level, this series tries to fix any issues
related to it.
Still in TODO:
* handling CONTAINS relation in single column restriction filtering
* handling multi-column restrictions - especially EQ, which can be
split into multiple single-column restrictions
* more tests - it's never enough; especially esoteric cases
like filtering queries which also use secondary indexes,
paging tests, etc.
Tests: unit (release)
"
* 'allow_filtering_6' of https://github.com/psarna/scylla:
tests: add allow_filtering tests to cql_query_test
cql3: enable ALLOW FILTERING
service: add filtering_pager
cql3: optimize filtering partition keys and static rows
cql3: add filtering visitor
cql3: move result_set_builder functions to header
cql3: amend need_filtering()
cql3: add single column primary key restrictions getters
cql3: expose single column primary key restrictions
cql3: add needs_filtering to primary key restrictions
cql3: add simpler single_column_restriction::is_satisfied_by
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.
Found during code-review.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not it
checks whether the query is for a singulular partitions or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.
Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
Initializing write response id to the same value on each reboot may
cause stale id to be taken for active one if node restarts after
sending only a couple of write request and before receiving replies.
On next reboot it will start assigning id's from the same value and
receiving old replies will confuse it. Mitigate this by assigning
initial id to wall clock value in milliseconds. It will not solve the
problem completely, but will mitigate it.
In theory we should not get write reply from a node we did not send
write to, but in practice stale reply can be received if node reboot
between sending write and getting a reply. Do not assert, but log the
warning instead and ignore the reply.
Fixes: #3153
In the removenode operation, if the message servicing is stopped, e.g., due
to disk io error isolation, the node can keep retrying the
REPLICATION_FINISHED verb infinitely.
Scylla log full of such message was observed:
[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)
To fix, limit the number of retires.
Tests: update_cluster_layout_tests.py
Fixes#3542
Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
The presence of a plain reference prohibits the bound_view class from
being copyable. The trick employed to work around that was to use
'placement new' for copy-assigning bound_view objects, but this approach
is ill-formed and causes undefined behaviour for classes that have const
and/or reference members.
The solution is to use a std::reference_wrapper instead.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a0c951649c7aef2f66612fc006c44f8a33713931.1530113273.git.vladimir@scylladb.com>
So far query_result_visitor was tied to result_set_builder. The goal is
to enable result_generator to work with paged queries as well so we need
to decouple them.
Shared pointers make code harder to reason about, it is not easy to get
rid of them in this piece of the code, but we can restore at least a bit
of sanity by adding consts.
There is just a single implementation of query_pager and there is no
reason to make anything virtual. Devirtualising this code will allow
higher layers to pass visitors via templates.
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
References #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But, it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.
Tests: unit (release)
Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>