On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.
Fixes#3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
In the previous patch, we added a "view virtual" flag on columns. In this
patch we add persistance to this flag: I.e., writing it to the on-disk
schema table and reading it back on startup. But the implementation is
not as simple as adding a flag:
In the on-disk system tables, we have a "columns" table listing all the
columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED
VIEW" works by reading this "columns" table, and listing all of the
requested view's columns. Therefore, we cannot add "virtual columns" -
which are columns not added by the user and not intended to be seen -
to this list.
We therefore need to create in this patch a separate list for virtual
columns, in a new table "view_virtual_columns". This table is essentially
identical to the existing "columns" table, just separate. We need to write
each column to the appropriate table (columns with the view_virtual flag to
"view_virtual_columns", columns without it to the old "columns"), read
from both on startup, and remember to delete columns from both when a table
is dropped.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Gossip SYN and ACK uses std::vector to store a list of gossip_digest,
the larger the cluster, the more continuous memory is needed. To reduce
the memory pressure which might cause std::bad_alloc, switch the std::vector
to chunked_vector.
In addition, change add_local_application_state to use std::list instead
of std::vector.
Refs #2782
This patch introduces IDL definition as well as serialisers and
deserialisers for freezing mutation_fragment so that they can be
transferred between nodes in a cluster.
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query as old coordinators will not store
nor respect the decision.
Helps paged queries consistently hit the same replicas for each
subsequent page. Replicas that already served a page will keep the
readers used for filling it around in a cache. Subsequent page request
hitting the same replicas can reuse these readers to fill the pages
avoiding the work of creating these readers from scratch on every page.
In a mixed cluster older coordinators will ignore this value.
The value of last_replicas may change between pages as nodes may become
available/unavailable or the coordinator may decide to send the read
requests to different replicas at its discretion.
Replicas are identified by an opaque uuid which should only make sense
to the storage-proxy.
This patch adds the parameter to read_command which is needed for
caching of readers during multiple pages of a paged queries, which
we will introduce in the next patches.
The query_uuid is a UUID of a previously saved reader, which
the replica is now asked to recall and resume (if this saved reader is
no longer in the cache, it is fine, a new reader will be started).
Additionally a helper flag is_first_page is added so that the replica
can avoid doing any cache lookups (and incrementing miss counters) for
the first page.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to the "paging_state", the opaque cookie that clients are
supposed to provide when asking for the next page on a paged query, a
unique id field. This new field will be used to tell that a new request
for a page really continues the previous page, and doesn't just by chance
start at the same position the previous page stopped.
We need to support setups with mixed versions - a client may get a paging
state from a coordinator running a new version of Scylla and send it to
a different coordinator running an old version - or vice versa. So the new
uuid field is set up to have a default uuid of UUID() (a recognizable
invalid uuid 0), so new versions receiving no uuid from an old version will
set this invalid uuid, and old versions receiving a uuid from a new version
will simply ignore it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.
If we ever should add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate what digest algorithm to
use with the set of replicas it is contacting.
Fixes#2884
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
It will be used to store Scylla spcific table metadata. We cannot
store it in the standard "tables" table for compatibility reasons -
Cassandra will fail to read schema if it encounteres columns it is not
expecting.
"
- Introduce a parent span IP and span ID paradigm.
- Introduce time series tables to simplify traces processing.
- Add the "How to get traces?" chapter to the tracing.md.
"
* 'tracing-span-ids-and-time-series-helpers-v4' of github.com:cloudius-systems/seastar-dev:
docs: tracing.md: add a "how to get traces" chapter
tracing::trace_keyspace_helper: introduce a time series helper tables
tracing: cleanup: use nullptr instead of trace_state_ptr()
tracing: introduce a span ID and parent span ID
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."
* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
schema: Kill index_type enum
schema: Kill index_info class
cql3/statements/create_index_statement: Use database::existing_index_names() in validation
cql3/statements: Use secondary index manager in alter_table_statement class
index: Add secondary_index_manager
thrift/handler: Use index_metadata
db/schema_tables: Index persistence
schema: Add all_indices() to schema class
schema: Remove add_default_index_names() from schema_builder class
db/schema_tables: Add system table for indices
cql3/Cgl.g: DROP INDEX
cql3/statements: Add drop_index_statement class
database: Add find_indexed_table() to database class
cql3: Return change event from announce_migration()
cql3/statements: Multiple index targets for CREATE INDEX
cql3/statements: Use index_metadata in create_index_statement class
cql3/statements: Use feature flag in create_index_statement class
service/storage_service: Add feature flag for secondary indices
database: Add get_available_index_name() to database class
schema: Add get_default_index_name() to index_metadata class
...
This patch makes the tracing framework follow the general idea of Google's
Dapper paper: traces generated in a context of the same query are forming
a single-rooted acyclic tree where in a ScyllaDB case vertexes are spans running
on each involved replica Node and edges are RPCs sent from one Node to another.
- Each vertex in the tree above has an ID - "span ID".
- In order to be able to build the tree from the sessions traces we need
to know the parent "span ID" - the ID of a span that sent an RPC that created
the current span.
- Each span of a tracing session is given a 64-bit random span ID.
- The root span has a span_id::illegal_id value.
This patch adds:
- The described above parent span ID and a span ID to the one_session_records
object.
- The current span ID is passed in the trace_info struct to the remote replica.
- Add parent_id and span_id columns to system_traces.events table for the parent
ID and span ID.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patch replaces the current row tombstone representation by a
row_tombstone.
The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.
We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:
1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3
These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.
This patch only addresses the memory representation of such
row_tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.
This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.
Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
before after diff
write 79434.35 89247.54 +12.3%
This patch allows a view schema to be frozen. To unfreeze such a
schema, we add an is_view attribute to the schema idl.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When paging is used the cluster is allowed to return less rows than the
client asked for. However, if such possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to sent and short reads caused by any other means.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all. The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).
This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.
In the future we may add other values for more hardware-friendly digest
algorithms.
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
The main idea is to log queries that take "too long" to complete.
The "too long" is above the given threshold.
To achieve the above this patch does the following:
- Introduce two new properties to the tracing::trace_state:
- "Full tracing": when the tracing of this query was explicitly requested.
In this state we will record all possible traces related to this query:
both on the coordinator and on any replica involved.
- "Log slow query": when slow query logging is enabled.
If slow query logging is enabled and a session's "duration" is above
the specified threshold we will create a record in the "slow queries log"
and write all trace records created on the coordinator and on a replica
if a replica's session lasts longer than that threshold.
(We will propagate the Coordinator's slow query logging threshold to replicas
in the context of a specific tracing/logging session).
The properties above are independent, namely they may be enabled and/or disabled
independently and any combination of them is legal (naturally, creating a tracing
session when both states above are disabled makes no sense).
- Instrument the tracing::tracing service to allow the following:
- Enable/disable slow query logging.
- Set/get the slow query duration threshold (in microseconds).
- Set/get the slow query log record TTL value (in seconds).
- Instrument the trace_keyspace_helper to write a slow query log entry
when requested.
- The slow query logging is disabled by default and the threshold is set to half a second.
- The TTL of a slow log record is set to 86400 seconds by default.
- It makes sense to use the same "slow query logging threshold" and a "slow query record TTL"
both on a coordinator and on a replica Nodes in a context of the same tracing session:
- Pass both TTL and a threshold to the replica in a trace_info.
This patch also implements the new slow query logging specific logic:
- Don't write the pending tracing records before the end of a tracing session
until "duration" reaches the logging threshold.
- Don't build the parameters<sstring, sstring> map unless we know we will write it
to I/O.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Instead of keeping separate booleans introduce a trace_state_props_set enum_set and
pass it around instead of separate booleans.
- Change the trace_info to hold this value in addition to write_on_close. Initialize
a corresponding bit in an enum_set based on a write_on_close value in a trace_info
constructor for a backward compatibility.
- Separate a trace_state constructor into two:
- For a primary session object.
- For a secondary session object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This patch changes the type of query::clustering_range to express that
ranges that wrap around are not allowed, and ranges that have the
start bound after the end bound are considered empty.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This patch adds support to send a cell's ttl as part of a query's
result. This is needed for thrift support.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes hashing for repair calculate checksums in a way that
doesn't require rebuilding whole mutation.
Unfortunately, such checksums are incompatible with the old ones so the
old way for computing checksums is preserved for compatibility reasons.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch as a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the range_tombstone class, composed of
a [start, end] pair of clustering_key_prefixes, the type
of inclusiveness of each bound, and a tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
... and make it a clustering_key_prefix, in preparation of
supporting not-whole-row range tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>