After this patch, the Materialized Views and Secondary Index features
are considered generally-available and no longer require passing an
explicit "--experimental=on" flag to Scylla.
The "--experimental=on" flag and the db::config::check_experimental()
function remain unused, as we graduated the only two features which used
this flag. However, we leave the support for experimental features in
the code, to make it easier to add new experimental features in the future.
Another reason to leave the command-line parameter behind is so existing
scripts that still use it will not break.
Fixes#3917
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115144456.25518-1-nyh@scylladb.com>
Bootstrapping process may need system distributed keyspace
to generate view updates, so initializing sys_dist_ks
is moved before the bootstrapping process is launched.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
Limit message size according to the configuration, to avoid a huge message from
allocating all of the server's memory.
We also need to limit memory used in aggregate by thrift, but that is left to
another patch.
Fixes#3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>
Every call of a tracing::global_trace_state_ptr object instead of a
tracing::tracing_state_ptr or a call to tracing::global_trace_state_ptr::get()
creates a new tracing session (span) object.
This should never be done unless query handling moves to a different shard.
Fixes#3862
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181018003500.10030-1-vladz@scylladb.com>
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.
We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.
Fixes#3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.
Tests: unit(release)
"
* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
db/hints/manager: Use frozen_mutation instead of mutation
db/hints/manager: Use database::find_schema()
db/commitlog/commitlog_entry: Allow moving the contained mutation
service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
service/storage_proxy: Build a shared_mutation from a frozen_mutation
service/storage_proxy: Lift frozen_mutation_and_schema
service/storage_proxy: Allow non-const ranges in mutate_prepare()
write_stats is referenced from write handler which is available in
send_to_live_endpoints already. No need to pass it down.
Message-Id: <20181009133017.GA14449@scylladb.com>
It is useful to have this counter to investigate the reason for read
repairs. Non zero value means that writes were lost after CL is reached
and RR is expected.
Message-Id: <20181009120900.GF22665@scylladb.com>
The pager::state() function returns a valid paging object even
if the pager itself is exhausted. It may also not contain the partition
key, so using it unconditionally was a bug - now, in case there is no
partition key present, paging state will contain an empty partition key.
Fixes#3829
Message-Id: <28401eb21ab8f12645c0a33d9e92ada9de83e96b.1539074813.git.sarna@scylladb.com>
It might take long time for get_all_ranges_with_sources_for and
get_all_ranges_with_strict_sources_for to calculate which cause reactor
stall. To fix, run them in a thread and yield. Those functions are used in
the slow path, it is ok to yield more than needed.
Fixes#3639
Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>
Fixes#3798Fixes#3694
Tests:
unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)
* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
gms/gossiper: Replicate enpoint states in add_saved_endpoint()
gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
gms/gossiper: Always override states from older generations
Add an overload to send_to_endpoint() which accepts a frozen_mutation.
The motivation is to allow better accounting of pending view updates,
but this change also allows some callers to avoid unfreezing already
frozen mutations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Lift frozen_mutation_and_schema to frozen_mutation.hh, since other
subsystems using frozen_mutations will likely want to pass it around
together with the schema.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
So we don't attempt to send mutations to unreachable endpoints and
instead store a hint for them, we now check the endpoint status and
populate dead_endpoints accordingly in
storage_proxy::send_to_endpoint().
Fixes#3820
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181007100640.2182-1-duarte@scylladb.com>
We allow tables to be created with the ID property, mostly for
advanced recovery cases. However, we need to validate that the ID
doesn't match an existing one. We currently do this in
database::add_column_family(), but this is already too late in the
normal workflow: if we allow the schema change to go through, then
it is applied to the system tables and loaded the next time the node
boots, regardless of us throwing from database::add_column_family().
To fix this, we perform this validation when announcing a new table.
Note that the check wasn't removed from database::add_column_family();
it's there since 2015 and there might have been other reasons to add
it that are not related to the ID property.
Refs #2059
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230142.46743-1-duarte@scylladb.com>
If service::pager is exhausted, state() function used to return
a nullptr instead of a pointer to a valid paging state and the
documented return type in this case was 'unspecified'.
Sometimes a paging state may be needed anyway, even if the pager
is already exhausted - thus, state() return value becomes defined
after this commit. Exhausted pagers will return a valid object
to a state with _remaining field set to 0.
A standard way for passing a timeout parameter is specifying
a time_point, while pagers used to take a duration in order
to compute time points on the fly. This patch adds a timeout
parameter, which is a time_point, to fetch_page().
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.
Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.
Tests: unit {release}
+ tested Scylla with both 'la' and 'mc' formats to work fine:
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; [3/1890]
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
<<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
<<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;
pk | ck | rc
----+----+------
2 | 3 | null
4 | 6 | null
(2 rows)
<<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;
pk | ck | rc
----+----+------
5 | 7 | null
8 | 10 | null
2 | 3 | null
4 | 6 | null
7 | 9 | null
6 | 8 | null
(6 rows)
"
* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
database: Honour enable_sstables_mc_format configuration option.
sstables: Support SSTables 'mc' format as a feature.
db: Add configuration option for enabling SSTables 'mc' format.
tests: Add test for reading a complex column with zero subcolumns (SST3).
sstables: Fix parsing of complex columns with zero subcolumns.
sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
bytes: Add helper for turning bytes_view into sstring_view.
sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
sstables: Fix string formatting for exception messages in m_format_read_helpers.
sstables: Don't validate timestamps against the max value on parsing.
sstables: Always store only min bases in serialization_header.
sstables: Support 'mc' version parsing from filename.
SST3: Make sure we call consume_partition_end
There is a bad interaction between may_need_paging() and query result
size limiter. The former is trying to avoid the complexity of paged
queries when the number of returned rows is going to be smaller than the
page size. The latter uses the fact that paged queries need not return
all requested rows to limit the size of a query results. Since
may_need_paging() may turn a paged query into non-paged one as a side
effect it disables the oversized result protection.
This patch limits the cases when may_need_paging() disables paging to
the situations when we know for sure that query result size limiter
won't be needed, i.e.: the result is not going to contain more than one
row. If the client knows for sure that the paging is not needed and
the performance impact is worthwhile it can disable paging on its side.
Otherwise, let's default to the safer behaviour.
Fixes#3620.
Message-Id: <20180925134431.24329-1-pdziepak@scylladb.com>
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.
Usually, the gap in which this happens is short, because executor
reference holders timeout quickly as well. It's not always the case
though. For instance, local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.
Fixes#3734.
Another problem is that all reads which received CL responses are
accounted as background, until all replicas respond, but if such read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.
Fixes#3745.
This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.
Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
When a joining node announcing join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes
that causes the error like below:
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from 127.0.4.1#0:
std::runtime_error (sorted_tokens is empty in first_token_index!)
To fix, wait for the token range setup before announcing the join
status.
Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test
Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.
This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"
* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
database: make database's mutation apply stage inherit its scheduling group from the caller
database: make database::_mutation_query_stage inherit the scheduling group
database: make database::_data_query_stage inheriting its caller's scheduling_group
storage_proxy: make _mutate_stage inherit its caller's scheduling_group
This error is transient, since as soon as the node is up we will be able
to send the migration request. Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).
The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.
Fixes#3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.
Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statment scheduling groups, later.
After ac27d1c93b if a read executor has just enough targets to
achieve request's CL and a connection to one of them will be dropped
during execution ReadFailed error will be returned immediately and
client will not have a chance to issue speculative read (retry). The
patch changes the code to not return ReadFailed error immediately, but
wait for timeout instead and give a client chance to issue speculative
read in case read executor does not have additional targets to send
speculative reads to by itself.
Fixes#3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
Query options may contain bound values needed for checking filtering
restrictions. Previously, empty query_options{} were used, which
caused prepared statements to fail.
Fixes#3677
We need the mapping between dht::token_range to
std::vector<inet_address> and inet_address to dht::token_range_vector in
various places. Currently, we use std::unordered_multimap and convert to
std::unordered_map. It is better to use std::unordered_map in the first
place. The changes like below:
- Change from
std::unordered_multimap<dht::token_range, inet_address>
to
std::unordered_map<dht::token_range, std::vector<inet_address>>
- Change from
std::unordered_multimap<inet_address, dht::token_range>
to
std::unordered_map<inet_address, dht::token_range_vector>
Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.
Fixes#3636
Message-Id: <20180801093147.GD23569@scylladb.com>
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.
In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.
Removing it simplifies the cluster operation logic and code.
Fixes#3475
Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
"
This series adds some optimisations to the paging logic, that attempt to
close the performance gap between paged and not paged queries. The
former are more complex so always are going to be slower, but the
performance loss was unacceptably large.
Fixes#3619.
Performance with paging:
./perf_paging_before ./perf_paging_after diff
read 271246.13 312815.49 15.3%
Without paging:
./perf_nopaging_before ./perf_nopaging_after diff
read 343732.17 342575.77 -0.3%
Tests: unit(release), dtests(paging_test.py, paging_additional_test.py)
"
* tag 'optimise-paging/v1' of https://github.com/pdziepak/scylla:
cql3: select statement: don't copy metadata if not needed
cql3: query_options: make simple getter inlineable
cql3: metadata: avoid copying column information
query_pager: avoid visiting result_view if not needed
query::result_view: add get_last_partition_and_clustering_key()
query::result_reader: fix const correctness
tests/uuid: add more tests including make_randm_uuid()
utils: uuid: don't use std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; This hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for to uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>