779 Commits

Author SHA1 Message Date
Avi Kivity
f5dae826ce Merge "Migrate schema tables to v3 format" from Calle
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrieval.

Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail out if it encounters any of them in use.

Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.

Tested against dtests. Note that patches for dtest + ccm
will follow."

* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
  legacy_schema_migrator: Actually truncate legacy schema tables on finish
  database: Extract "remove" from "drop_columnfamily"
  v3 schema test fixes
  thrift: Update CQL mapping of static CFs
  schema_tables: Use v3 schema tables and formats
  type_parser: Origin expects empty string -> bytes_type
  cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
  types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s
  cql3_type: v3 to_string
  cql_types: Introduce cql3_type::empty and associate with empty data_type
  schema: rename column accessors to be in line with origin
  schema: Add "is_static_compact_table"
  schema_builder: Add helper to generate unique column names akin origin
  schema: Add utility functions for static columns
  schema: Use heterogeneous comparator for columns bounds
  cql3_type_parser: Resolve from cql3 names/expressions
  cql3_type: Add "prepare_interal" and "references_user_type"
  cql3::cql3_type: Add prepare_internal path using only "local" holders
  cql3_type: Add virtual destructor.
  database/main: encapsulate system CF dir touching
  ...
2017-05-17 11:25:52 +03:00
Asias He
0abfe39d8f database: Log compaction strategy setting on shard 0 only
The compaction strategy is per node not per shard. Do not duplicate the
same log on all shards.

Message-Id: <1494835519.git.asias@scylladb.com>
2017-05-17 11:17:41 +03:00
Gleb Natapov
c7ad3b9959 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is not much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.

Fixes #2384

Message-Id: <20170516103018.GQ3874@scylladb.com>
2017-05-16 15:06:10 +03:00
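The memory argument in c7ad3b9959 can be sketched as simple arithmetic. This is a minimal standalone model, not the Scylla code: the 128k stack figure comes from the commit message, while `peak_stack_bytes` and `remove_sequentially` are hypothetical names used only for illustration.

```python
STACK_BYTES = 128 * 1024  # per-thread stack size quoted in the commit message


def peak_stack_bytes(num_sstables: int, parallel: bool) -> int:
    """Peak stack memory if every removal runs in its own thread."""
    # Parallel: one live thread per sstable. Sequential: at most one at a time.
    live_threads = num_sstables if parallel else min(num_sstables, 1)
    return live_threads * STACK_BYTES


def remove_sequentially(paths, remove_one):
    """Remove items one after another; only one removal is ever in flight."""
    for path in paths:
        remove_one(path)
```

Under this model, removing 10000 temporary sstables in parallel would pin 10000 stacks at once, while the sequential variant never needs more than one.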
Calle Wilund
3514123677 database: Extract "remove" from "drop_columnfamily" 2017-05-10 16:44:48 +00:00
Calle Wilund
6c8b5fc09d schema_tables: Use v3 schema tables and formats
Switches system/schema_* for system_schema/*, updates schema/schema
builder and uses to hold/expect v3 style info (i.e. types & dropped).
2017-05-10 16:44:48 +00:00
Calle Wilund
48ddcbb77b database/main: encapsulate system CF dir touching 2017-05-10 16:44:47 +00:00
Calle Wilund
2e1c23f2f2 database: Relax rp ordering check to allow non-commitlog mutations
Allow replay to come after certain operations, such as schema migration.
2017-05-09 13:48:55 +00:00
Calle Wilund
27fdc5cfef schema_tables/system_tables: Add v3 tables to "ALL" and handle in init
I.e. deal with more than one keyspace in system_keyspace::make
2017-05-09 13:48:55 +00:00
Calle Wilund
4378dca6e1 schema_tables: Hide/abstract schema keyspace name 2017-05-09 13:48:55 +00:00
Avi Kivity
8c5c5d3004 Merge "CQL front-end for secondary indices" from Pekka
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed by anything, nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."

* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
  schema: Kill index_type enum
  schema: Kill index_info class
  cql3/statements/create_index_statement: Use database::existing_index_names() in validation
  cql3/statements: Use secondary index manager in alter_table_statement class
  index: Add secondary_index_manager
  thrift/handler: Use index_metadata
  db/schema_tables: Index persistence
  schema: Add all_indices() to schema class
  schema: Remove add_default_index_names() from schema_builder class
  db/schema_tables: Add system table for indices
  cql3/Cgl.g: DROP INDEX
  cql3/statements: Add drop_index_statement class
  database: Add find_indexed_table() to database class
  cql3: Return change event from announce_migration()
  cql3/statements: Multiple index targets for CREATE INDEX
  cql3/statements: Use index_metadata in create_index_statement class
  cql3/statements: Use feature flag in create_index_statement class
  service/storage_service: Add feature flag for secondary indices
  database: Add get_available_index_name() to database class
  schema: Add get_default_index_name() to index_metadata class
  ...
2017-05-08 17:04:40 +03:00
Pekka Enberg
f26b8d7afb database: Add find_indexed_table() to database class 2017-05-04 14:59:12 +03:00
Pekka Enberg
930fa79aff database: Add get_available_index_name() to database class 2017-05-04 14:59:11 +03:00
Pekka Enberg
c6e7d4484a database: Make existing_index_names() per-keyspace operation 2017-05-04 14:59:11 +03:00
Pekka Enberg
8c729f0f5f database: Rewrite existing_index_names() to use new index metadata 2017-05-04 14:59:11 +03:00
Paweł Dziepak
24f4dcf9e4 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
2017-04-30 19:17:22 +03:00
Raphael S. Carvalho
8bae413bcf database: fix format msg for sprint
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224920.16607-1-raphaelsc@scylladb.com>
2017-04-26 17:18:58 +03:00
Raphael S. Carvalho
662fe77c11 database: kill column_family::start_rewrite
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:33 -03:00
Raphael S. Carvalho
43ac19eb52 database: wire up new resharding algorithm
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:31 -03:00
Raphael S. Carvalho
cf45333588 database: implement new sstable resharding algorithm
NOTE: it's not wired yet.

Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.

Another benefit is that resharding will no longer cause the number of
sstables to grow considerably.
A full-sized leveled sstable is usually 160MB, so after resharding,
we could have N files of 160MB/N. Now, leveled strategy will help
resharding. N adjacent sstables of same level will be resharded
together, so we'll end up with N files of N*160MB/N.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:30 -03:00
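The algorithm described in cf45333588 can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual implementation: sstables are plain lists of integer keys, and `shard_of` is a hypothetical stand-in for the real token-to-shard mapping. `output_file_mb` only restates the size arithmetic from the commit message.

```python
from collections import defaultdict


def shard_of(key: int, num_shards: int) -> int:
    # Hypothetical stand-in for the real token-to-shard mapping.
    return key % num_shards


def reshard(shared_sstables, num_shards):
    """Read every input key once and emit one unshared output per owner shard."""
    outputs = defaultdict(list)
    for sstable in shared_sstables:
        for key in sstable:
            outputs[shard_of(key, num_shards)].append(key)
    return {shard: sorted(keys) for shard, keys in outputs.items()}


def output_file_mb(num_inputs: int, input_mb: float, num_shards: int) -> float:
    """Size of each output when inputs are split evenly across shards."""
    return num_inputs * input_mb / num_shards
```

With one 160MB input and 4 shards, each output is 40MB; resharding 4 adjacent inputs together yields 4 outputs of 160MB each, which is the point made about the leveled strategy.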
Raphael S. Carvalho
6513252e91 database: introduce function to replace new sstables by their ancestors
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A that belongs to shard 0 and 1 and
sstable B that belongs to shard 1 and 2. SStables were generated for
shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1 for example will remove sstables A and
B (ancestors) and add the new one. That is where this new function comes in.
We'll forward new sstables to their target shards using foreign sstable
open info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:27 -03:00
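The shard 0/1/2 example from 6513252e91 can be worked through in a few lines. This is only a sketch of the bookkeeping, with hypothetical names (`plan_replacement`, `new-0` and so on); it does not model foreign sstable open info or any real Scylla data structure.

```python
def plan_replacement(ancestors, new_sstables):
    """Compute, per shard, which ancestors to drop and which new sstable to load.

    ancestors: sstable name -> set of shards that own it.
    new_sstables: shard -> name of the unshared sstable generated for it.
    Returns shard -> (sorted ancestor names to remove, new sstable or None).
    """
    shards = set(new_sstables)
    for owners in ancestors.values():
        shards |= owners
    plan = {}
    for shard in sorted(shards):
        to_remove = sorted(n for n, owners in ancestors.items() if shard in owners)
        plan[shard] = (to_remove, new_sstables.get(shard))
    return plan


plan = plan_replacement(
    {"A": {0, 1}, "B": {1, 2}},
    {0: "new-0", 1: "new-1", 2: "new-2"},
)
```

Here shard 1 ends up removing both A and B and adding its new sstable, matching the description in the commit message.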
Raphael S. Carvalho
c44a2319e6 prevent regular compaction from choosing shared sstables
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:26 -03:00
Avi Kivity
68f0df12ee Merge "Optimize reads with clustering restrictions" from Tomasz
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.

Some highlights:

One optimization is to use the index for skipping across clustering restrictions.
Currently we read whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.

Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
from the data file in case we do not continue reading but skip to the
middle of that partition. We may not even attempt to read anything from
that partition if, once the key is determined, the reader is placed behind
other readers that exhaust the query limit first.

Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."

* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
  sstables: Remove unused code
  sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
  sstables: mutation_reader: Use index_reader for single-partition reads
  sstables: mutation_reader: Add trace-level logging
  sstables: mutation_reader: Move partition reading code to sstable_data_source
  sstables: mutation_reader: Move definitions out of the class body
  sstables: Move binary_search() to a header
  database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
  sstables: index_reader: Introduce advance_to_next_partition()
  sstables: index_reader: Introduce advance_and_check_if_present()
  sstables: index_reader: Introduce advance_past()
  sstables: index_reader: Make copyable
  sstables: index_reader: Optimize advancing to extreme positions
  sstables: index_reader: Keep two last pages alive
  dht: ring_position_view: Add key getter
  dht: ring_position_view: Add constructor and factory from ring_position_view
  sstables: mutation_reader: Advance to next partition using index in some cases
  sstables: index_reader: Expose access to partition key and tombstone
  sstables: index_reader: Introduce promoted_index_view
  sstables: mutation_reader: Move _index_in_current to sstable_data_source
  ...
2017-04-20 13:58:37 +03:00
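The first optimization in the series above (using the index to skip across clustering restrictions) can be sketched as a binary search over block boundaries. This is a simplified model under assumptions: clustering keys are integers, the index is just a sorted list of the first key in each block, and `blocks_to_read` is a hypothetical helper, not the index_reader API.

```python
import bisect


def blocks_to_read(block_starts, ranges):
    """Pick the index blocks that may hold rows for the given restrictions.

    block_starts: sorted first clustering key of each index block.
    ranges: list of (lo, hi) inclusive clustering ranges.
    Returns the sorted indices of blocks that need to be read.
    """
    wanted = set()
    for lo, hi in ranges:
        # Block containing lo (or the first block if lo precedes all blocks).
        first = max(bisect.bisect_right(block_starts, lo) - 1, 0)
        # Last block whose first key is not past hi.
        last = bisect.bisect_right(block_starts, hi) - 1
        wanted.update(range(first, last + 1))
    return sorted(wanted)
```

For two single-row restrictions in a five-block partition, only two blocks are selected instead of reading the whole partition.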
Tomasz Grabiec
4742008b70 sstables: mutation_reader: Use index_reader for single-partition reads
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().

perf_fast_forward, rows: 1000000, value size: 100

Before:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.002182         2        916      3        152       2       0        0        1        1  88.1%

After:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.000758         2       2639      3        152       2       0        0        1        1  48.6%

This is also a cleanup, a step towards converting all code to use the
index_reader.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
bedd0ab6f9 database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating 2017-04-20 10:54:38 +02:00
Raphael S. Carvalho
3286f7aaa6 compaction: make major compaction go through compaction manager
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor or major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.

Fixes #1156.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
2017-04-19 15:44:21 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Raphael S. Carvalho
7deeffc953 database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.

Fixes #192.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
2017-03-19 12:33:03 +02:00
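The disk-space reasoning in 7deeffc953 can be checked with a small calculation. This is a rough model, assuming each cleanup writes its output before the input is deleted and that sizes are given in whole units; `peak_extra_space` and `kept_percent` are hypothetical names.

```python
def peak_extra_space(sizes, kept_percent, serialized):
    """Peak additional disk space needed while cleaning up the given sstables.

    sizes: size of each sstable being cleaned up.
    kept_percent: share of each sstable that survives cleanup (0-100).
    serialized: True if cleanups run one at a time, False if all run at once.
    """
    outputs = [size * kept_percent // 100 for size in sizes]
    # Serialized: only one output exists next to its input at any moment.
    # Parallel: every output coexists with every input.
    return max(outputs, default=0) if serialized else sum(outputs)
```

With four sstables of size 100 that each keep 99% of their data, the parallel case needs 396 extra units (almost twice the original 400 in total), while the serialized case needs 99.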
Duarte Nunes
876a514743 database: Upgrade mutation to current schema to push view updates
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that it matches the most
up to date views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 18:15:27 +01:00
Duarte Nunes
bfb8a3c172 materialized views: Replace db::view::view class
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.

To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.

The changes in the patch are thus:

1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.

All changes are structural and don't affect current behavior.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:50:05 +01:00
Amnon Heiman
0a2eba1b94 database: requests_blocked_memory metric should be unique
Metrics name should be unique per type.

requests_blocked_memory was registered twice, one as a gauge and one as
derived.

This is not allowed.

Fixes #2165

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
2017-03-14 19:36:45 +02:00
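The constraint behind 0a2eba1b94 can be pictured with a tiny registry that refuses a second registration of the same name. This is an illustrative model only, assuming uniqueness is enforced on the metric name; `MetricRegistry` is hypothetical and is not the Seastar metrics API.

```python
class MetricRegistry:
    """Toy registry that allows each metric name to be registered once."""

    def __init__(self):
        self._kinds = {}

    def register(self, name: str, kind: str) -> None:
        if name in self._kinds:
            raise ValueError(
                f"{name} already registered as {self._kinds[name]}, "
                f"cannot register again as {kind}"
            )
        self._kinds[name] = kind
```

Registering `requests_blocked_memory` once as a gauge and again as a derived metric would fail on the second call in this model.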
Paweł Dziepak
b5f0e590be db: make database::query() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
38c1501f4d db: make apply an execution stage 2017-03-09 09:27:43 +00:00
Avi Kivity
439b38f5ab Merge "Improvements to counter implementation" from Paweł
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.

Performance was tested using:
perf-simple-query -c4 --counters --duration 60

The following results are medians.
          before       after      diff
write   18640.41    33156.81    +77.9%
read    58002.32    62733.93     +8.2%"

* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
  cell_locker: add metrics for lock acquisition
  storage_proxy: count counter updates for which the node was a leader
  storage_proxy: use counter-specific timeout for writes
  storage_proxy: transform counter timeouts to mutation_write_timeout_exception
  db: avoid allocations in do_apply_counter_update()
  tests/counters: add test for apply reversability
  counters: attempt to apply in place
  atomic_cell: add COUNTER_IN_PLACE_REVERT flag
  counters: add equality operators
  counters: implement decrement operators for shard_iterator
  counters: allow using both views and mutable_views
  atomic_cell: introduce atomic_cell_mutable_view
  managed_bytes: add cast to mutable_view
  bytes: add bytes_mutable_view
  utils: introduce mutable_view
  db: add more tracing events for counter writes
  db: propagate tracing state for counter writes
  tests/cell_locker: add test for timing out lock acquisition
  counter_cell_locker: allow setting timeouts
  db: propagate timeout for counter writes
  ...
2017-03-07 11:48:13 +02:00
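The "diff" column in the counter benchmark table above is a plain relative change of the medians. A minimal check, with `percent_change` as a hypothetical helper name:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


write_diff = round(percent_change(18640.41, 33156.81), 1)
read_diff = round(percent_change(58002.32, 62733.93), 1)
```

Both values agree with the figures quoted in the merge message (+77.9% for writes, +8.2% for reads).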
Avi Kivity
1af9e3a5cb Merge "database: fix the 'nodetool clearsnapshot'" from Vlad
"Work on this series started with fixing 'nodetool clearsnapshot'.
The current master code ignores the snapshots in deleted keyspaces (issue #2045).

I noticed that in many places where our code has to build the path to some directory/file,
it simply uses sstring(<path1>) + "/" + sstring(<path2>) constructs, which may cause us issues
if somebody decides to compile/run scylla on a non-Unix-based OS, like Microsoft Windows.

I understand that this is a long shot, but if we can get it right now, why not?
The answer is the boost::filesystem::path class - its synchronous parts, of course.

I decided to take an initiative and fix the issues above and then use the fixed code for
fixing the issue #2045:
   - Fix some minor issues in the existing code.
   - Extend the lister class and move it into the separate files outside database.cc.

On the way I've found an issue in the existing code (issue #2071).
This series fixes this one too (PATCH2)."
2017-03-06 16:45:31 +02:00
Paweł Dziepak
04b80272f2 cell_locker: add metrics for lock acquisition 2017-03-02 09:05:12 +00:00
Paweł Dziepak
f93a766db4 db: avoid allocations in do_apply_counter_update() 2017-03-02 09:05:12 +00:00
Paweł Dziepak
774241648d db: add more tracing events for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
277501f42f db: propagate tracing state for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
25173f8095 db: propagate timeout for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
f25fa6566f db: avoid deserialization when applying counter mutation
In the later stages of counter write path a mutation is produced that
already has all cells transformed to counter shards and can be applied
to the memtable and written to the commitlog.
The current interface expects a frozen mutation, which is suboptimal
for counters. The freeze itself is unavoidable -- it is required by
commitlog, but we can avoid later deserialization of frozen_mutation
when it is applied to the memtable if we pass the unfrozen mutation
along.
2017-03-01 16:33:37 +00:00
Paweł Dziepak
582d397c41 introduce counter_write_query()
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.

Standard mutation queries are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
2017-03-01 16:33:36 +00:00
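The read-modify-write shape mentioned in 582d397c41 can be sketched on a nested dictionary. This is a toy model under assumptions: the store layout, the per-writer sub-counters and the name `counter_write` are all hypothetical and only illustrate that the read touches a single partition and ignores everything but the live value.

```python
def counter_write(store, partition, column, delta, writer_id):
    """Apply a counter delta and return the new total for the column.

    store: partition -> column -> writer_id -> value (hypothetical layout).
    """
    # Read: look up the current value in exactly one partition.
    current = store.get(partition, {}).get(column, {}).get(writer_id, 0)
    # Modify: apply the delta to this writer's sub-counter.
    updated = current + delta
    # Write: store the new value back.
    store.setdefault(partition, {}).setdefault(column, {})[writer_id] = updated
    return sum(store[partition][column].values())
```

A dedicated query for the read step, as the commit introduces, only has to serve this narrow access pattern rather than the general mutation query logic.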
Paweł Dziepak
426345e1d4 storage_proxy: avoid excessive mutation freezes 2017-03-01 16:33:36 +00:00
Duarte Nunes
c0e5964462 database: Explicitly use discard_result()
Values returned from the lambda passed to finally() are immediately
destroyed, so make that explicit by using discard_result().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235541.28330-1-duarte@scylladb.com>
2017-02-28 18:41:19 +02:00
Paweł Dziepak
0198d8e470 Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.

API outline:

  When in forwarding mode, the stream does not return all fragments right away,
  but only those belonging to the current range. Initially current range only
  covers the static row. The stream can be forwarded, even before reaching end-
  of-stream for current range, to a later range with fast_forward_to().
  Forwarding doesn't change initial restrictions of the stream, it can only be
  used to skip over data.

  Monotonicity of positions is preserved by forwarding. That is, fragments
  emitted after forwarding will have greater positions than any fragments
  emitted before forwarding.

  For any range, all range tombstones relevant for that range which are present
  in the original stream will be emitted. Range tombstones emitted before
  forwarding which overlap with the new range are not necessarily re-emitted.

  When not in forwarding mode, the stream acts as if the current range was equal
  to the full range. This implies that fast_forward_to() cannot be
  used.

  Whether stream is in forwarding mode or not is specified when the stream
  is created, typically via mutation_source interface.

What's left for later series:

  Optimization by providing specialized implementations. This series implements
  forwarding support in all mutation sources via generic wrapper which simply
  drops fragments."

* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_tests: Verify monotonicty of positions
  tests: random_mutation_generator: Spread the keys more
  tests: mutation_source_test: Make blobs more easily distinguishable
  tests: streamed_mutation: Test that merged stream passes mutation source tests
  tests: mutation_source_test: Add tests for forwarding of streamed_mutation
  tests: streamed_mutation_assertions: Add methods for navigating the stream
  tests: Add range generators to random_mutation_generator
  partition_slice_builder: Add with_ranges()
  query: Introduce full_clustering_range
  streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
  db: Enable creating forwardable readers via mutation_source
  mutation_source: Document liveness requirements
  mutation_source: Cleanup
  db: Replace virtual_reader_type with mutation_source_opt
  partition_version: Refactor make_partition_snapshot_reader() overloads
  database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
  memtable: Accept all mutation_source parameters
  streamed_mutation: Implement fast_forward_to() in stream merger
  streamed_mutation: Add generic implementation of forwardable streamed_mutation
  streamed_mutation: Add fast_forward_to() API
  position_in_partition: Introduce position_range
  position_in_partition: Introduce position constructor for right after the static row
  streamed_mutation: Make cast to view non-explicit
  streamed_mutation: Make schema() getter non-copying
2017-02-24 10:37:51 +00:00
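The forwarding behaviour outlined in the merge message above can be modelled with a small class. This is a sketch under assumptions, not the streamed_mutation API: positions are integers, the static row is represented by position -1, ranges are half-open, and a new range is required to start at or after the end of the previous one. Range tombstones are not modelled. The generic approach of simply dropping fragments outside the range follows the description of the generic wrapper.

```python
class ForwardingStream:
    """Toy stream of fragment positions that supports forward navigation."""

    STATIC_ROW = -1  # assumed position of the static row

    def __init__(self, positions):
        self._positions = sorted(positions)
        self._next = 0
        # Initially the current range covers only the static row.
        self._end = 0

    def fast_forward_to(self, start, end):
        """Move to a later half-open range [start, end), skipping data before it."""
        if start < self._end:
            raise ValueError("ranges must move forward")
        self._end = end
        # Drop fragments that fall before the new range.
        while self._next < len(self._positions) and self._positions[self._next] < start:
            self._next += 1

    def next_fragment(self):
        """Return the next position in the current range, or None at its end."""
        if self._next < len(self._positions) and self._positions[self._next] < self._end:
            position = self._positions[self._next]
            self._next += 1
            return position
        return None
```

A consumer reads the static row, sees end-of-stream for the current range, then forwards to each sub-range of interest; positions returned after forwarding are always greater than those returned before.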
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Tomasz Grabiec
586dbaa8d3 db: Replace virtual_reader_type with mutation_source_opt
Virtual reader is a mutation_source.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
f46ae8128d database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
It was using the state passed via as_mutation_source() instead. Let's
respect mutation_source contract instead, and use the state passed via
mutation_source invocation.

Technically just a cleanup. Also a prerequisite for more cleanup.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
2cc27f72ca memtable: Accept all mutation_source parameters 2017-02-23 18:23:52 +01:00
Calle Wilund
e20b804a65 commitlog/database: Add "release" method to ensure we free segments
On database stop, we do flush memtables and clean up commit log segment usage.
However, since we never actually destroy the distributed<database>, we
don't actually free the commitlog either, and thus never clear out
the remaining (clean) segments. Thus we leave perfectly clean segments
on disk.

This just adds a "release" method to commitlog, and calls it from
database::stop, after flushing CF:s.
Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>
2017-02-21 18:17:47 +01:00
Paweł Dziepak
359c617821 db: restore call to check_valid_rp()
5a0955e89d "db: add operations for
applying counter updates" merged two column_family::apply() overloads
into do_apply() in order to reduce code duplication. Unfortunately,
a call to check_valid_rp() didn't survive that change.
Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>
2017-02-21 15:26:04 +01:00