sstable::data_size() is used by rebuild_statistics() which only
returns uncompressed data size, and the function called by it
expects actual disk space used by all components.
Boot uses add_sstable() which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
r__s() too.
Fixes#1592
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
We can make the dependency more abstract by using mutation_source
instead of an sstable.
Will be useful in some stress tests which want to avoid the disk, but
is also good for the sake of decoupling.
Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>
partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.
Fixes#2407.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
"Enforces commutativity of addition:
m1 + m2 == m2 + m1
and consistency of difference and addition with equality:
m1 + (m2 - m1) == m1 + m2"
* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
mutation: Make compare_*_for_merge() consistent with equals()
tests: mutation: Improve assertion failure message
tests: Use default equality in test_mutation_diff_with_random_generator
mutation: Make counter cell difference consistent with apply
tests: range_tombstone_list_test: Improve error message
tests: range_tombstone_list: Check adjacent range merging
range_tombstone_list: Merge adjacent range tombstones in apply()
tests: mutation: Check commutativity of mutation addition
range_tombstone_list: Avoid violating set invariant
range_tombstone_list: Make tombstone merging commutative
range_tombstone_list: Add erase() operation to the reverter
range_tombstone_list: Make all undo operations ordered relative to each other
utils: Extract to_boost_visitor() to a separate header
allocating_strategy: Introduce alloc_strategy_unique_ptr<>
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.
Fixes#1859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple() which uses 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.
Due to the fact that sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. Higher the number of sstables higher the memory overhead.
To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each of which metadata is ~300kb, memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d
To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrival.
Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if encountering such a feature used.
Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.
Tested against dtests. Note that patches for dtest + ccm
will follow."
* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
legacy_schema_migrator: Actually truncate legacy schema tables on finish
database: Extract "remove" from "drop_columnfamily"
v3 schema test fixes
thrift: Update CQL mapping of static CFs
schema_tables: Use v3 schema tables and formats
type_parser: Origin expects empty string -> bytes_type
cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s
cql3_type: v3 to_string
cql_types: Introduce cql3_type::empty and associate with empty data_type
schema: rename column accessors to be in line with origin
schema: Add "is_static_compact_table"
schema_builder: Add helper to generate unique column names akin origin
schema: Add utility functions for static columns
schema: Use heterogeneous comparator for columns bounds
cql3_type_parser: Resolve from cql3 names/expressions
cql3_type: Add "prepare_interal" and "references_user_type"
cql3::cql3_type: Add prepare_internal path using only "local" holders
cql3_type: Add virtual destructor.
database/main: encapsulate system CF dir touching
...
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is no much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.
Fixes#2384
Message-Id: <20170516103018.GQ3874@scylladb.com>
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."
* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
schema: Kill index_type enum
schema: Kill index_info class
cql3/statements/create_index_statement: Use database::existing_index_names() in validation
cql3/statements: Use secondary index manager in alter_table_statement class
index: Add secondary_index_manager
thrift/handler: Use index_metadata
db/schema_tables: Index persistence
schema: Add all_indices() to schema class
schema: Remove add_default_index_names() from schema_builder class
db/schema_tables: Add system table for indices
cql3/Cgl.g: DROP INDEX
cql3/statements: Add drop_index_statement class
database: Add find_indexed_table() to database class
cql3: Return change event from announce_migration()
cql3/statements: Multiple index targets for CREATE INDEX
cql3/statements: Use index_metadata in create_index_statement class
cql3/statements: Use feature flag in create_index_statement class
service/storage_service: Add feature flag for secondary indices
database: Add get_available_index_name() to database class
schema: Add get_default_index_name() to index_metadata class
...
NOTE: it's not wired yet.
Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.
Another benefit is that we'll no longer have resharding resulting
in number of sstables growing considerably after resharding.
A full-sized leveled sstable is usually 160MB, so after resharding,
we could have N files of 160MB/N. Now, leveled strategy will help
resharding. N adjacent sstables of same level will be resharded
together, so we'll end up with N files of N*160MB/N.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A that belongs to shard 0 and 1 and
sstable B that belongs to shard 1 and 2. SStables were generated for
shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1 for example will remove sstables A and
B (ancestors) and add the new one. Then it comes this new function.
We'll forward new sstables to their target shards using foreign sstable
open info.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That's doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.
Some highlights:
One optimization is to use the index for skipping across clustering restrictions.
Currently we read whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.
Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
form the data file in case we would not continue reading, but skip to the
middle of that partition. Or we may not even attempt to read anything from
that partition, if after we determine the key that reader will be put behind
other readers, which will exhaust the query limit first.
Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."
* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
sstables: Remove unused code
sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
sstables: mutation_reader: Use index_reader for single-partition reads
sstables: mutation_reader: Add trace-level logging
sstables: mutation_reader: Move partition reading code to sstable_data_source
sstables: mutation_reader: Move definitions out of the class body
sstables: Move binary_search() to a header
database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
sstables: index_reader: Introduce advance_to_next_partition()
sstables: index_reader: Introduce advance_and_check_if_present()
sstables: index_reader: Introduce advance_past()
sstables: index_reader: Make copyable
sstables: index_reader: Optimize advancing to extreme positions
sstables: index_reader: Keep two last pages alive
dht: ring_position_view: Add key getter
dht: ring_position_view: Add constructor and factory from ring_position_view
sstables: mutation_reader: Advance to next partition using index in some cases
sstables: index_reader: Expose access to partition key and tombstone
sstables: index_reader: Introduce promoted_index_view
sstables: mutation_reader: Move _index_in_current to sstable_data_source
...
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().
perf_fast_forward, rows: 1000000, value size: 100
Before:
Testing forwarding with clustering restriction in a large partition:
pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu
no 0.002182 2 916 3 152 2 0 0 1 1 88.1%
After:
Testing forwarding with clustering restriction in a large partition:
pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu
no 0.000758 2 2639 3 152 2 0 0 1 1 48.6%
This is also a cleanup, a step towards converting all code to use the
index_reader.
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.
Fixes#1156.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
* seastar 6b21197...2ebe842 (6):
> Merge "Various improvements to execution stages" from Paweł
> app-template: allow apps to specify a name for help message
> bool_class: avoid initializing object of incomplete type
> app-template: make sure we can still get help with required options
> prometheus: Http handler that returns prometheus 0.4 protobuf or text format
> Update DPDK to 17.02
Includes patch from Pawel to adjust to updated execution_stage interface.
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.
Fixes#192.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that the it matches the most
up to date views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.
To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.
The changes in the patch are thus:
1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.
All changes are structural and don't affect current behavior.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Metrics name should be unique per type.
requests_blocked_memory was registered twice, one as a gauge and one as
derived.
This is not allowed.
Fixes#2165
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.
Performance was tested using:
perf-simple-query -c4 --counters --duration 60
The following results are medians.
before after diff
write 18640.41 33156.81 +77.9%
read 58002.32 62733.93 +8.2%"
* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
cell_locker: add metrics for lock acquisition
storage_proxy: count counter updates for which the node was a leader
storage_proxy: use counter-specific timeout for writes
storage_proxy: transform counter timeouts to mutation_write_timeout_exception
db: avoid allocations in do_apply_counter_update()
tests/counters: add test for apply reversability
counters: attempt to apply in place
atomic_cell: add COUNTER_IN_PLACE_REVERT flag
counters: add equality operators
counters: implement decrement operators for shard_iterator
counters: allow using both views and mutable_views
atomic_cell: introduce atomic_cell_mutable_view
managed_bytes: add cast to mutable_view
bytes: add bytes_mutable_view
utils: introduce mutable_view
db: add more tracing events for counter writes
db: propagate tracing state for counter writes
tests/cell_locker: add test for timing out lock acquisition
counter_cell_locker: allow setting timeouts
db: propagate timeout for counter writes
...
"Work on this series started with fixing the 'nodetool clearsnapshot'.
The current master code ignores the snapshots in deleted keyspaces (issue #2045).
I noticed that in many places our code has to build the path to some directory/file
it simply had the sstring(<path1>) + "/" + sstring(<path2>) constructs which may cause us issues
if somebody decides to complile/run scylla on not-Unix-based OS, like Microsoft Windows.
I understand that this is a long shot but if we can make it right now - why not to.
The answer is boost::filesystem::path class - its synchronous parts, of course.
I decided to take an initiative and fix the issues above and then use the fixed code for
fixing the issue #2045:
- Fix some minor issues in the existing code.
- Extend the lister class and move it into the separate files outside database.cc.
On the way I've found an issue in the existing code (issue #2071).
This series fixes this one too (PATCH2)."
In the later stages of counter write path a mutation is produced that
already has all cells transformed to counter shards and can be applied
to the memtable and written to the commitlog.
The current interface expectes a frozen mutation, which is suboptimal
for counters. The freeze itself is unaviodable -- it is required by
commitlog, but we can avoid later deserialization of frozen_mutation
when it is applied to the memtable if we pass the unfrozen mutation
along.
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.
Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.