Strategy prefers promoting oldest sstables in L0. Because sort
procedure is incorrectly sorting elements in descending order,
newest sstables will be promoted first *if and only if* L0 falls
behind (more than 32 sstables). If L0 doesn't fall behind, we'll
have all L0 sstables compacted with overlapping ones in L1.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2). It causes crashes
during range scans, reported by Gleb:
"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.
Backtrace:
at /home/gleb/work/seastar/seastar/core/apply.hh:36
rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
at ./seastar/core/future.hh:890
at /home/gleb/work/seastar/seastar/core/future-util.hh:119
at /home/gleb/work/seastar/seastar/core/future-util.hh:142
Add a performance test for deletion in addition to the existing update
and query tests. The deletion performance test is executed using the
'--delete' argument to perf_simple_query.
Fixes#2417.
Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170615232500.26987-1-el@loadavg.io>
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
Range tombstones were introduced in version 1.3 and there exists
no direct upgrade from 1.2 to vnext, so we can retire the code
enforcing backwards compatibility.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170614211654.82501-1-duarte@scylladb.com>
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.
Test shows no huge memory is used for repair with large data set.
In addition, we now have a progress reporter in the log how many ranges are processed.
Jun 06 14:18:22 [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Jun 06 14:19:55 [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Fixes #2430."
* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
repair: Remove unused sub_ranges_max
repair: Reduce parallelism in repair_ranges
repair: Tweak the log a bit
repair: Use more stream_plan
repair: iterator over subranges instead of list
Use per CF-id reference count instead, and use handles as result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.
Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
Test should
a.) Wait for the flush semaphore
b.) Only compare segement sets between start and end, not start,
end and inbetwen. I.e. the test sort of assumed we started
with < 2 (or so) segments. Not always the case (timing)
Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
When starting repair, we divided the large token ranges (vnodes) linto small
subranges of a desired length (around 100 partition), and built a huge list
of those subranges - to iterate over them later and compare checksums of
those chunks.
However, building this list up-front is completely unnecessary, and wastes
a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was
spent on this list. Instead, what we do in this patch is to find the next
chunk in a DFS-like splitting algorithm, using only the token range
midpoint() function (as before). The amount of memory needed for this is
O(logN), instead of O(N) in the previous implementation.
Refs #2430.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"Enforces commutativity of addition:
m1 + m2 == m2 + m1
and consistency of difference and addition with equality:
m1 + (m2 - m1) == m1 + m2"
* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
mutation: Make compare_*_for_merge() consistent with equals()
tests: mutation: Improve assertion failure message
tests: Use default equality in test_mutation_diff_with_random_generator
mutation: Make counter cell difference consistent with apply
tests: range_tombstone_list_test: Improve error message
tests: range_tombstone_list: Check adjacent range merging
range_tombstone_list: Merge adjacent range tombstones in apply()
tests: mutation: Check commutativity of mutation addition
range_tombstone_list: Avoid violating set invariant
range_tombstone_list: Make tombstone merging commutative
range_tombstone_list: Add erase() operation to the reverter
range_tombstone_list: Make all undo operations ordered relative to each other
utils: Extract to_boost_visitor() to a separate header
allocating_strategy: Introduce alloc_strategy_unique_ptr<>
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."
* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
tests/view_schema_test: Add more test cases
tests/cql_assertions: Add assertion for row set equality
single_column_relation: Correctly print IN relation
statement_restrictions: Allow filtering regular columns for views
statement_restrictions: Relax clustering restrictions for views
statement_restrictions: Relax partition restrictions for views
cql3/statements: Prevent setting default ttl on view
cql3/restrictions: Complete implementation of is_satisfied_by()
db/view: Re-implement clustering_prefix_matches()
db/view: Re-implement partition_key_matches()
db/view: Generate regular tombstone for base deletions
db/view: Consider cell liveness when generating updates
db/view: Don't generate view updates for static rows
"Notably:
- add validation of the results (e.g. fragment count, expectations about disk activity)
- add cache-specific tests"
* 'tgrabiec/add-cache-tests-to-perf-fast-forward' of github.com:cloudius-systems/seastar-dev:
tests: perf_fast_forward: Report cache stats
row_cache: Keep counters in a struct
tests: perf_fast_forward: Add cache-specific tests
tests: perf_fast_forward: Extract test_reading_all()
tests: perf_fast_forward: Add validation of the results
tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
tests: perf_fast_forward: Allow the test to be interrupted
tests: perf_fast_forward: Allow testing with cache enabled
row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
"When mutation reader enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request."
* 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test forwarding in single-key readers
sstables: Remove unused code
sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
make_pkeys() needs to be invoked with n equal to the number of keys
which the table was populated with. Otherwise the extra keys, which
are missing in the table, may be placed anywhere in the vector due to
ring order sorting, and break the assumption that the table contains
all keys from the array up to index n. This resulted in the test
reading slighlty less fragments than it would follow from the desired
count.
Another problem is that we should not skip the fast_forward_to() call
for the inital range (workaround for a bug in sstable mutation
reader), otherwise we will read slightly less than expected as well.
Right now, next_token_for_shard() only allows iterating linearly in shard
order. Add the ability to select a specific shard to skip to (in case we're
only interested in a single shard), and to select larger ranges (so that
exponential increases are not implemented by iteration).
For row set equality, the order of the actual rows doesn't need to
match the order of the expected rows.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrival.
Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if encountering such a feature used.
Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.
Tested against dtests. Note that patches for dtest + ccm
will follow."
* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
legacy_schema_migrator: Actually truncate legacy schema tables on finish
database: Extract "remove" from "drop_columnfamily"
v3 schema test fixes
thrift: Update CQL mapping of static CFs
schema_tables: Use v3 schema tables and formats
type_parser: Origin expects empty string -> bytes_type
cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s
cql3_type: v3 to_string
cql_types: Introduce cql3_type::empty and associate with empty data_type
schema: rename column accessors to be in line with origin
schema: Add "is_static_compact_table"
schema_builder: Add helper to generate unique column names akin origin
schema: Add utility functions for static columns
schema: Use heterogeneous comparator for columns bounds
cql3_type_parser: Resolve from cql3 names/expressions
cql3_type: Add "prepare_interal" and "references_user_type"
cql3::cql3_type: Add prepare_internal path using only "local" holders
cql3_type: Add virtual destructor.
database/main: encapsulate system CF dir touching
...
The standard serialization API (e.g. in data_value) includes the following methods:
size_t serialized_size() const;
void serialize(bytes::iterator& it) const;
bytes serialize() const;
Align the utils::UUID API with the pattern above.
The only addition is that we are going to make an output iterator parameter of a second method above
a template so that we may serialize into different output sources.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Use the same templated implementation for all different serialize_XXX(...).
The chosen implementation is based on the std::copy_n(char*, size, OutputIterator),
which is heavily optimized and will be using memcpy/memmove where possible.
This patch also removes the not needed specializations that accept signed integer
values since we were casting them to unsigned value anyway.
The std::ostream based specifications are also removed since they are not used
anywhere except for a test-serialization.cc and adjusting the ostream to the iterator
is a single-liner.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
More pointedly: Expose columns as is (currently
all_columns_in_select_order), expose name->column mapping more
appropriately named.
Renaming like this is not strictly neccesary, but there is a point to
trying to keep nomenclature similar-ish with origin, esp. when select
order column need to become filtered (spoiler alert).