In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::forwarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exceptions are functions which are known
to read only a single partition, and do not support fast_forward_to() to a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
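The rule in this paragraph can be sketched as a small function. This is
only an illustration under stated assumptions: the enum "forwarding" and
the helper choose_last_end() are hypothetical stand-ins, not the actual
Scylla types or code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for mutation_reader::forwarding.
enum class forwarding { no, yes };

// Hypothetical helper: picks the byte position up to which the disk stream
// is opened. range_end is the end of the byte range being read, file_size
// is the end of the sstable data file.
inline uint64_t choose_last_end(forwarding fwd, uint64_t range_end, uint64_t file_size) {
    // forwarding::yes: the reader may later be fast-forwarded to any other
    // range, so the stream has to stay open until the end of the file.
    // forwarding::no: nothing beyond range_end can ever be requested, so
    // the stream (and its read-ahead) stops at range_end.
    return fwd == forwarding::yes ? file_size : range_end;
}
```

With forwarding::no, a reader of a small range never issues read-ahead past
that range, which is the case the repair checksumming code relies on.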
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
"This patch set ensures we quote the name of a UDT when it
contains characters that may cause parsing by the CQL parser
to fail.
Fixes #2491"
* 'cql3-quote-type/v1' of https://github.com/duarten/scylla:
cql3/util: Make maybe_quote() take argument by const reference
cql3/cql3_type: Quote UDT name if needed
schema: Lift maybe_quote() into cql3/util
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.
Fixes #2500
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2). It causes crashes
during range scans, reported by Gleb:
"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.
Backtrace:
at /home/gleb/work/seastar/seastar/core/apply.hh:36
rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
at ./seastar/core/future.hh:890
at /home/gleb/work/seastar/seastar/core/future-util.hh:119
at /home/gleb/work/seastar/seastar/core/future-util.hh:142
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::forwarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exceptions are functions which are known
to read only a single partition, and do not support fast_forward_to() to a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
"During read query with CL<ALL not all replicas are contacted. It is
possible for some replicas to cache less data for some CF's (for instance
because of node restart), so the replica choice may have a big impact
on request's completion latency and on amount of work it generates in
a cluster.
This patch series keeps track of the per-CF cache hit ratio and uses this
information to choose best replicas for a request. Nodes with lower
hit ratios are still contacted in order to populate their cache, but
less frequently."
* 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev:
storage_proxy: load balance read requests according to cache hit rates
choose extra replica for speculation in filter_for_query()
consistency_level: drop filter_for_query_dc_local function
database: reset node's hit rate information on connection drop
messaging_service: connection drop notifier
Store cluster wide cache hit statistics in CF
messaging_service: return cache hit ratio as part of data read
Distribute cache temperature over gossiper.
periodically calculate avg cache hit rate between all shards
database: introduce cache_temperature class
Rename load_broadcaster.cc to misc_services.cc
storage_proxy: use db::count_local_endpoints function instead open code it
Currently the commitlog_entry_writer constructor calculates the serialized
size before it knows whether a schema should be included in the entry. The
result is never used since it is recalculated when schema information is
supplied. The patch removes the needless calculation.
Message-Id: <20170614114607.GA21915@scylladb.com>
This patch makes storage proxy choose replicas to read from based on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see fewer. The local
node has a special bonus and will get more requests even if another node
has a slightly higher cache hit rate (same goes for local vs remote DC),
but after the patch it is no longer guaranteed that a coordinator node
will be chosen as a replica for the read (if the feature is enabled).
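One way to picture this kind of selection is as a set of per-replica
weights. The sketch below is not the storage_proxy algorithm: the struct
"replica", the function read_weights(), and the local_bonus and floor
parameters are all assumptions made up for illustration. It only shows how
a bonus can let the local node win against a slightly warmer remote node
while colder nodes still receive some traffic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of a candidate replica.
struct replica {
    bool local;       // true for the coordinator itself
    double hit_rate;  // cache hit rate in [0, 1]
};

// Hypothetical helper: returns, for each replica, the fraction of read
// requests it should receive. Weights are proportional to the hit rate,
// with a bonus for the local node and a floor so that cold replicas are
// still contacted (and can populate their cache), just less frequently.
inline std::vector<double> read_weights(const std::vector<replica>& rs,
                                        double local_bonus = 0.1,
                                        double floor = 0.05) {
    std::vector<double> w;
    w.reserve(rs.size());
    double total = 0;
    for (const auto& r : rs) {
        double x = std::max(r.hit_rate, floor) + (r.local ? local_bonus : 0.0);
        w.push_back(x);
        total += x;
    }
    for (auto& x : w) {
        x /= total;  // normalize so the weights sum to 1
    }
    return w;
}
```

A caller would then draw replicas at random according to these weights, so
the coordinator is likely, but not certain, to be among the chosen replicas.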
Currently storage proxy has to loop over the remaining replicas to search
for a suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
Merge filter_for_query_dc_local() functionality into filter_for_query().
This is more efficient since filter_for_query_dc_local() partitions
endpoints into 'local' and 'remote' set but filter_for_query() already
does it for CL=LOCAL so for such queries we needlessly do it twice.
Use per CF-id reference count instead, and use handles as result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.
Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
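The handle scheme described above can be sketched with a move-only RAII
type. This is a minimal illustration, not the commitlog code: the names
segment_usage, handle and cf_id are hypothetical, and the real handles
release a replay_position rather than a bare counter.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

using cf_id = uint64_t;  // hypothetical stand-in for the column family id type

// Hypothetical per-segment bookkeeping: one reference count per CF id.
class segment_usage {
    std::unordered_map<cf_id, uint64_t> _refs;
public:
    // Move-only handle returned by add(). It drops its reference when it is
    // explicitly released or, failing that, when it is destroyed.
    class handle {
        segment_usage* _owner = nullptr;
        cf_id _id = 0;
    public:
        handle() = default;
        handle(segment_usage& o, cf_id id) : _owner(&o), _id(id) {}
        handle(handle&& h) noexcept
            : _owner(std::exchange(h._owner, nullptr)), _id(h._id) {}
        handle& operator=(handle&& h) noexcept {
            if (this != &h) {
                release();
                _owner = std::exchange(h._owner, nullptr);
                _id = h._id;
            }
            return *this;
        }
        handle(const handle&) = delete;
        handle& operator=(const handle&) = delete;
        ~handle() { release(); }
        void release() {
            if (_owner) {
                std::exchange(_owner, nullptr)->remove(_id);
            }
        }
    };

    handle add(cf_id id) {
        ++_refs[id];
        return handle(*this, id);
    }
    uint64_t count(cf_id id) const {
        auto it = _refs.find(id);
        return it == _refs.end() ? 0 : it->second;
    }
    // The segment is live as long as any CF still holds a reference.
    bool in_use() const { return !_refs.empty(); }
private:
    void remove(cf_id id) {
        auto it = _refs.find(id);
        if (it != _refs.end() && --it->second == 0) {
            _refs.erase(it);
        }
    }
};
```

Storing such handles in a container plays the role of rp_set in the text:
the references stay alive until the container is cleared or destroyed.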
Mimic origin behaviour, iff TLS encryption is enabled, and
native_transport_port_ssl is set and different from
native_transport_port, start both tls- and non-tls
listeners.
Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>
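The listener rule in the commit above can be written down as a small
decision function. This is a sketch only: plan_listeners() and the listener
struct are hypothetical, and the last branch (TLS enabled without a
distinct SSL port means a single TLS listener on the native port) is an
assumption based on how the Cassandra configuration describes these
options, not something stated in the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of one CQL listener.
struct listener {
    uint16_t port;
    bool tls;
};

// Hypothetical helper: decides which listeners to start.
inline std::vector<listener> plan_listeners(bool tls_enabled,
                                            uint16_t native_port,
                                            std::optional<uint16_t> native_port_ssl) {
    if (!tls_enabled) {
        // No encryption configured: plain listener only.
        return {listener{native_port, false}};
    }
    if (native_port_ssl && *native_port_ssl != native_port) {
        // The case described in the commit: start both listeners.
        return {listener{native_port, false}, listener{*native_port_ssl, true}};
    }
    // Assumption: TLS without a separate SSL port encrypts the native port.
    return {listener{native_port, true}};
}
```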
- introduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."
* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
tests/view_schema_test: Add more test cases
tests/cql_assertions: Add assertion for row set equality
single_column_relation: Correctly print IN relation
statement_restrictions: Allow filtering regular columns for views
statement_restrictions: Relax clustering restrictions for views
statement_restrictions: Relax partition restrictions for views
cql3/statements: Prevent setting default ttl on view
cql3/restrictions: Complete implementation of is_satisfied_by()
db/view: Re-implement clustering_prefix_matches()
db/view: Re-implement partition_key_matches()
db/view: Generate regular tombstone for base deletions
db/view: Consider cell liveness when generating updates
db/view: Don't generate view updates for static rows
"There are numerous issues in the current implementation of permissions
cache starting from the logical errors and bugs and ending with the
suboptimal implementation described in the issue #2262."
* 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev:
utils::loading_cache: avoid the reads storm when the key is not in the cache
utils::loading_cache: cleanup
utils::loading_cache: align the constrains in the constructor with the parameters description
utils::loading_cache: refresh in the background
auth::auth: add operator<<() for a permission_cache key
auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
db::config: define a saner default value for permissions_validity_in_ms
It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms.
This may cause the values to be invalidated only because of some minor delays in the timer scheduling.
It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms.
This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling.
In addition, 2s seems to be too small a value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s.
This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data
in the foreground.
Setting it to 10s would allow a second read attempt before we enforce the foreground read.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Not 100% proper, but in line with how we still store the info.
Ensures (helps at least) to keep schema loaded from tables
and schema from builder comparable.
Fixes schema_changes_test error.
Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Introduce the calculate_affected_clustering_ranges() function to
calculate the smallest subset of affected clustering ranges that we
need to query for.
The update_requires_read_before_write() function checks whether
a view is potentially affected by the base update.
The patch also cleans up the may_be_affected_by() function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If a base table's regular column is part of the view's pk, and if that
column changes, we should replace the entry, by deleting the row(s)
with the old value and inserting a new one.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements clustering_prefix_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a subset of the clustering columns.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements partition_key_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a component of a compound partition key.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we take into account the liveness of the base's
regular column in the view's pk when generating view updates.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch stores the base_non_pk_column_in_view column as column_id,
which is more convenient, and it also stores a two-level optional to
encode both lazy initialization and the absence of such a column.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
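The "two-level optional" mentioned in the commit above can be illustrated
as follows. The class view_info_sketch and its simulated lookup are
hypothetical; only the idea of nesting two optionals comes from the commit
message. The outer optional records whether the value has been computed
yet, the inner one records whether such a column exists.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using column_id = uint32_t;  // hypothetical stand-in for Scylla's column_id

class view_info_sketch {
    // Result that a (simulated) scan of the schema would produce.
    std::optional<column_id> _answer;
    // Outer level empty: not computed yet.
    // Outer level set, inner empty: computed, and there is no such column.
    std::optional<std::optional<column_id>> _cached;
    int _scans = 0;
public:
    explicit view_info_sketch(std::optional<column_id> answer) : _answer(answer) {}

    const std::optional<column_id>& base_non_pk_column_in_view() {
        if (!_cached) {
            ++_scans;                 // the expensive lookup runs only once
            _cached.emplace(_answer);
        }
        return *_cached;
    }
    int scans() const { return _scans; }
};
```

A single optional could not distinguish "not looked up yet" from "looked up
and absent", so the lookup would be repeated for views without such a column.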
More pointedly: Expose columns as is (currently
all_columns_in_select_order), expose name->column mapping more
appropriately named.
Renaming like this is not strictly necessary, but there is a point to
trying to keep nomenclature similar-ish with origin, esp. when select
order columns need to become filtered (spoiler alert).
Cassandra 3 uses cql names for column/field types, thus
we need to parse these out-of-line, and resolve more akin
to the cql parser.
Also wrap building user types similarly to origin, using
a "builder" wrapper, and usage graph resolving.
Having a variadic parameter being used in implicit sprint is not
very readable + makes it less intuitive when suddenly system keyspace
becomes more than one -> multiple sprints in the chain -> more confusion
or more execution paths.
It's not that horrible with some spread out sprint:s
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."
* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
schema: Kill index_type enum
schema: Kill index_info class
cql3/statements/create_index_statement: Use database::existing_index_names() in validation
cql3/statements: Use secondary index manager in alter_table_statement class
index: Add secondary_index_manager
thrift/handler: Use index_metadata
db/schema_tables: Index persistence
schema: Add all_indices() to schema class
schema: Remove add_default_index_names() from schema_builder class
db/schema_tables: Add system table for indices
cql3/Cql.g: DROP INDEX
cql3/statements: Add drop_index_statement class
database: Add find_indexed_table() to database class
cql3: Return change event from announce_migration()
cql3/statements: Multiple index targets for CREATE INDEX
cql3/statements: Use index_metadata in create_index_statement class
cql3/statements: Use feature flag in create_index_statement class
service/storage_service: Add feature flag for secondary indices
database: Add get_available_index_name() to database class
schema: Add get_default_index_name() to index_metadata class
...
"This series introduces partial support for range deletions. This allows
deletion operations such as
delete from cf where p=1 and c > 0 and c <= 3.
This series only adds support for single-column range restrictions.
We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom or top bound.
We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.
This limitation should stay in place until we can properly represent range
tombstones in the storage format."
* 'range-deletions/v2' of https://github.com/duarten/scylla:
mutation: Set cell using clustering_key_prefix
mutation_partition: Harmonize apply_delete overloads
prefix_compound_view_wrapper: Add is_full and is_empty functions
tests/cql_query_test: Add range deletion tests
cql3: Partially support ranged deletions
single_column_primary_key_restrictions: Implement has_bound()
modification_statement: Use statement_restrictions for where clause
statement_restrictions: Expose primary key restrictions
to_string: Add missing include