1049 Commits

Author SHA1 Message Date
Paweł Dziepak
9d82a1ebfd abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Avi Kivity
e428805ba5 Merge "Optimize query result partition and row counts" from Duarte
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.

This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."

* 'calculate-counts/v3' of github.com:duarten/scylla:
  query-result: Send row and partition count over the wire
  query::result: Optimize calculate_counts()
2017-08-17 13:41:21 +03:00
Duarte Nunes
ec75eac37d ring_position_exponential_vector_sharder: Take ranges by rvalue
Avoids some copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170814093310.29200-1-duarte@scylladb.com>
2017-08-14 12:55:43 +03:00
Duarte Nunes
d7bab684ea query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:28:29 +02:00
Duarte Nunes
bcf21aacc2 storage_proxy: Directly call query_nonsingular_mutations_locally
Instead of duplicating the branch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811001559.25788-1-duarte@scylladb.com>
2017-08-11 09:06:01 +03:00
Duarte Nunes
a3ee99554b service/storage_proxy: Remove out of date comment
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).

Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
2017-08-11 09:04:23 +03:00
Asias He
49360992d9 storage_service: Use the new range_streamer interface for removenode
So that removenode operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
6b8dc85f12 storage_service: Use the new range_streamer interface for decommission
So that decommission operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
24584b8509 storage_service: Use the new range_streamer interface for rebuild
So that rebuild operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:47 +08:00
Gleb Natapov
d2a2a6d471 storage_proxy: make range_slice_read_executor go through digest matching state
Currently scanning reads go directly to the reconciliation stage, which
requires asking for mutation data from all peers. This patch makes
them try matching digests first, like a single partition read.

The change requires internode protocol changes, since currently it is not
possible to ask for multi-partition data/digest over RPC. This means that
the capability has to be guarded by a new gossip feature flag, which the
patch also adds.
2017-08-03 11:37:03 +03:00
Gleb Natapov
3b7d8c8767 storage_proxy: add capability to read data/digest for non singular ranges
Currently only mutation_data read supports non singular ranges. This
patch extends data/digest reads to support them too.
2017-08-03 10:35:09 +03:00
Gleb Natapov
c619ef258b storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
never_speculating_read_executor always waits for all targets so
block_for parameter is always equal to targets.size(). No need to
pass it explicitly.
2017-08-03 10:08:44 +03:00
Tomasz Grabiec
e09220dbff migration_manager: Log schema pulls
2017-07-27 20:08:25 +02:00
Tomasz Grabiec
350d98d4e1 migration_manager: Prevent pull requests from accumulating
If schema merging completes at a lower rate than pull requests arrive,
then merge processes will accumulate and needlessly request and hold schema mutations.

In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:

  concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test

Allowing only one active and one queued pull request per remote
endpoint is enough.
2017-07-27 20:08:25 +02:00
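
The "one active and one queued pull request per remote endpoint" rule described in 350d98d4e1 can be sketched roughly as follows. This is only an illustration of the policy, not the migration_manager code: `pull_gate`, `pull_decision`, `request()` and `complete()` are hypothetical names, and endpoints are plain strings here.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch, not Scylla code: a per-endpoint gate that admits
// at most one active and one queued schema pull.
enum class pull_decision { start_now, queued, dropped };

class pull_gate {
    struct state {
        bool active = false;  // a pull from this endpoint is in progress
        bool queued = false;  // one more pull is waiting behind it
    };
    std::unordered_map<std::string, state> _endpoints;
public:
    // Called when a pull from `endpoint` is requested.
    pull_decision request(const std::string& endpoint) {
        auto& s = _endpoints[endpoint];
        if (!s.active) {
            s.active = true;
            return pull_decision::start_now;
        }
        if (!s.queued) {
            s.queued = true;
            return pull_decision::queued;
        }
        // One active and one queued already cover any newer schema change.
        return pull_decision::dropped;
    }

    // Called when the active pull finishes. Returns true if a queued pull
    // exists and should start right away (it becomes the active one).
    bool complete(const std::string& endpoint) {
        auto it = _endpoints.find(endpoint);
        if (it == _endpoints.end() || !it->second.active) {
            return false;
        }
        if (it->second.queued) {
            it->second.queued = false;
            return true;
        }
        _endpoints.erase(it);
        return false;
    }
};
```

Under this policy a burst of requests from one endpoint costs at most two merges, because the queued pull starts after the burst and therefore observes the latest schema.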
Vlad Zolotarov
e98adb13d5 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because as part of system_auth initialization
SELECT and possibly INSERT CQL statements are going to be issued.

This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
2017-07-27 10:54:36 +02:00
Vlad Zolotarov
9086c643a6 service::storage_proxy: add a trace points pair in the SELECT replica flow
Add two trace points: at the beginning and at the end of the replica flow on the
replica shard.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>
2017-07-20 16:44:25 +02:00
Calle Wilund
247c36e048 system_schema: Fix remaining places not handling two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
2017-07-19 16:18:45 +03:00
Duarte Nunes
b8235f2e88 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
2017-07-14 12:11:22 +03:00
Gleb Natapov
f88723e739 storage_proxy: pass pending_endpoints by reference instead of by value
This makes the lifetime of the dead_endpoints object clearer, and move() also has its price.

Message-Id: <20170710084549.GX2324@scylladb.com>
2017-07-11 16:52:21 +03:00
Tomasz Grabiec
07ed512060 migration_manager: Give empty response to schema pulls from incompatible nodes
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
2017-07-07 19:09:57 +02:00
Tomasz Grabiec
5f613d0527 migration_manager: Don't pull schema from incompatible nodes
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. In the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
2017-07-07 19:08:59 +02:00
Tomasz Grabiec
18a9e1762c service: Advertise schema tables format version through gossip
Will be needed to inhibit schema exchange on per-peer basis.
2017-07-07 19:07:59 +02:00
Tomasz Grabiec
ae4b24db06 misc_services: Switch to using reads_with[_no]_misses counters
They better approximate the intended meaning than hits/misses, which
according to Gleb is whether a read did any I/O or not.
2017-07-04 13:55:06 +02:00
Piotr Jastrzebski
05b56fcfb0 mutation_partition: Add support for specifying continuity
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.

Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.

[tgrabiec:
  - based on the following commits:
     4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
     773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
  - documented that partition tombstone is always complete
  - require specifying the partition tombstone when creating an incomplete entry
  - replaced rows_entry(dummy_tag, ...) constructor with more general
    rows_entry(position_in_partition, ...)
  - documented continuity semantics on mutation_partition
  - fixed _static_row_cached being lost by mutation_partition copy constructors
  - fixed conversion to streamed_mutation to ignore dummy entries
  - fixed mutation_partition serializer to drop dummy entries
  - documented semantics of continuity on mutation_partition level
  - dropped assumptions that dummy entries can be only at the last position
  - changed equality to ignore continuity completely, rather than
    partially (it was not ignoring dummy entries, but ignoring
    continuity flag)
  - added printout of continuity information in mutation_partition
  - fixed handling of empty entries in apply_reversibly() with regards
    to continuity; we no longer can remove empty entries before
    merging, since that may affect continuity of the right-hand
    mutation. Added _erased flag.
  - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
  - fixed partition_builder to not ignore continuity
  - renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
  - standardized all APIs on is_dummy and is_continuous bool_class:es
  - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
  - dropped unused remove_dummy_entry()
  - simplified and inlined cache_entry::add_dummy_entry()
  - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
  ]
2017-06-24 18:06:11 +02:00
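
The continuity rules stated in 05b56fcfb0 can be read as a small lookup procedure. The sketch below is an assumption-laden illustration, not the mutation_partition implementation: `entry_flags`, `rows` and `is_continuous_at()` are hypothetical names, integer keys stand in for clustering positions, and the range before the first entry is assumed to be governed by the first entry's flag.

```cpp
#include <map>

// Hypothetical stand-in for the two flags the commit message describes;
// this is not Scylla's rows_entry.
struct entry_flags {
    bool continuous;  // the key range between the previous entry and this one is continuous
    bool dummy;       // the entry only marks a position and carries no row
};

// Integer keys stand in for clustering positions (an assumption of this sketch).
using rows = std::map<int, entry_flags>;

// Returns whether the information at `key` is complete, following the rules
// quoted above.
bool is_continuous_at(const rows& r, int key) {
    auto it = r.lower_bound(key);  // first entry with position >= key
    if (it == r.end()) {
        return true;               // the range after the last entry is assumed continuous
    }
    if (it->first == key) {
        return !it->second.dummy;  // the key of an entry is continuous iff it is not a dummy
    }
    return it->second.continuous;  // between entries: the later entry's flag decides
}
```

For example, with entries at 10 (not continuous, not dummy) and 20 (continuous, dummy), key 15 is continuous, key 20 is not, and any key above 20 is.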
Gleb Natapov
9b8499df0e cache_hitrate_calculator: filter cfs based on replication strategy instead of a name
The code filters CFs by name to not include system keyspace, but v3
schema added yet another system namespace. Better filter according to
replication strategy to accommodate for schema v4 adding even more
system keyspaces.

Fixes: #2516

Message-Id: <20170621073816.GB3944@scylladb.com>
2017-06-22 11:26:34 +03:00
Gleb Natapov
72a4554dd9 storage_proxy: Fix compilation on older (1.55) boost
Boost 1.55 (Ubuntu 14) fails to compile because an iterator produced by
boost::adaptors::transformed(), when a std::ref to a lambda is passed to
it, does not match the iterator concept. It cannot be default constructed
because std::reference_wrapper is not default constructible.
boost::range::min_element() never actually default constructs it, but the
concept is checked anyway. The patch fixes it by providing an explicit
functor that is default constructible.

Message-Id: <20170618131836.GD3944@scylladb.com>
2017-06-18 16:54:41 +03:00
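
The property that 72a4554dd9 relies on can be checked without Boost. In the sketch below, `by_abs` and `transforming_iterator` are hypothetical names; the latter is only a stand-in for an iterator that stores its function object by value, which is assumed here to be how the transformed iterator ends up inheriting the problem.

```cpp
#include <functional>
#include <type_traits>

// Explicit functor: stateless, hence default constructible.
struct by_abs {
    int operator()(int v) const { return v < 0 ? -v : v; }
};

// Hypothetical stand-in for an iterator that stores its function object by
// value. It is default constructible only if Fn is.
template <typename Fn>
struct transforming_iterator {
    const int* pos;
    Fn fn;
};

// std::reference_wrapper has no default constructor...
static_assert(!std::is_default_constructible<std::reference_wrapper<const by_abs>>::value,
              "reference_wrapper is not default constructible");
// ...so an iterator holding one cannot be default constructed either,
static_assert(!std::is_default_constructible<
                  transforming_iterator<std::reference_wrapper<const by_abs>>>::value,
              "iterator holding a reference_wrapper is not default constructible");
// while the same iterator holding the explicit functor can.
static_assert(std::is_default_constructible<transforming_iterator<by_abs>>::value,
              "iterator holding the functor is default constructible");
```

The static assertions only demonstrate the type property; whether a given Boost release enforces the concept is version dependent, as the commit message notes.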
Duarte Nunes
b2c5aca4cf db/schema_tables: View mutations shouldn't always include base ones
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.

Fixes #2500

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
2017-06-18 16:29:59 +03:00
Gleb Natapov
87094849fa storage_proxy: load balance read requests according to cache hit rates
This patch makes storage proxy choose replicas to read from based on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see fewer. The local
node has a special bonus and will get more requests even if another node
has a slightly higher cache hit rate (same goes for local vs remote DC),
but after the patch it is no longer guaranteed that a coordinator node
will be chosen as a replica for the read (if the feature is enabled).
2017-06-13 09:57:14 +03:00
Gleb Natapov
bc8aa1b4ee choose extra replica for speculation in filter_for_query()
Currently storage proxy has to loop over remaining replicas to search
for suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
2017-06-13 09:57:14 +03:00
Gleb Natapov
0e4d5bc2f3 Store cluster wide cache hit statistics in CF
2017-06-13 09:57:14 +03:00
Gleb Natapov
69c5526301 messaging_service: return cache hit ratio as part of data read
2017-06-13 09:57:14 +03:00
Gleb Natapov
8ca1432b04 Distribute cache temperature over gossiper.
When a node starts it does not have any information about the cache temperature
of other nodes in the cluster, and it is hard (if not impossible) to make
the right guess. During cluster startup all nodes have cold caches, so there
is no point in redirecting reads to other nodes even though the local cache is
cold, but if only that node restarted then other nodes have populated
caches and reads should be redirected.

The node will get up-to-date information about other nodes' caches,
but only after receiving the first reply; until then it does not have the
information to make the right decisions, which may cause unwanted spikes
immediately after restart. Having cache temperature in gossiper helps
to solve the problem.
2017-06-13 09:57:14 +03:00
Gleb Natapov
991ec4a16c periodically calculate avg cache hit rate between all shards
This patch adds new class cache_hitrate_calculator whose responsibility
is to periodically calculate average cache hit rates between all shards
for each CF.
2017-06-13 09:57:14 +03:00
Gleb Natapov
f59ecc2687 Rename load_broadcaster.cc to misc_services.cc
load_broadcaster is a very small class; move it into a generic file so that
we can put other small services there to save on compilation time.
2017-06-13 09:57:14 +03:00
Gleb Natapov
7bcf4c690f storage_proxy: use db::count_local_endpoints function instead of open-coding it
2017-06-13 09:57:14 +03:00
Calle Wilund
3512ed4596 storage_service/config: Add "native_transport_port_ssl" option
Mimic origin behaviour, iff TLS encryption is enabled, and
native_transport_port_ssl is set and different from
native_transport_port, start both tls- and non-tls
listeners.

Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>
2017-05-29 15:53:56 +03:00
Avi Kivity
ebaeefa02b Merge seastar upstream (seastar namespace)
 - introduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Avi Kivity
1a99ebaa65 storage_proxy: switch to the exponential sharder for nonsingular queries
Nonsingular queries used exponential expansion of the token space to
avoid spending too much cpu time on near-empty tables, but the generation
of the search space was itself exponential.  Switch to the exponential sharder
which has linear cost.
2017-05-17 13:50:30 +03:00
Avi Kivity
f5dae826ce Merge "Migrate schema tables to v3 format" from Calle
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrieval.

Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if encountering such a feature used.

Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.

Tested against dtests. Note that patches for dtest + ccm
will follow."

* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
  legacy_schema_migrator: Actually truncate legacy schema tables on finish
  database: Extract "remove" from "drop_columnfamily"
  v3 schema test fixes
  thrift: Update CQL mapping of static CFs
  schema_tables: Use v3 schema tables and formats
  type_parser: Origin expects empty string -> bytes_type
  cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
  types_test: v3 style schemas enforce explicit "frozen" in tuples/ut:s
  cql3_type: v3 to_string
  cql_types: Introduce cql3_type::empty and associate with empty data_type
  schema: rename column accessors to be in line with origin
  schema: Add "is_static_compact_table"
  schema_builder: Add helper to generate unique column names akin origin
  schema: Add utility functions for static columns
  schema: Use heterogeneous comparator for columns bounds
  cql3_type_parser: Resolve from cql3 names/expressions
  cql3_type: Add "prepare_internal" and "references_user_type"
  cql3::cql3_type: Add prepare_internal path using only "local" holders
  cql3_type: Add virtual destructor.
  database/main: encapsulate system CF dir touching
  ...
2017-05-17 11:25:52 +03:00
Vlad Zolotarov
a0737abdc5 cql_server::response: rework the tracing session ID insertion
Insert the tracing session ID into the response body in the cql_server::response constructor.

Fixes #2356

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:57:28 -04:00
Gleb Natapov
385645e8df storage_proxy: Fix mutation logging
Log mutation type only if mutation set is not empty.

Message-Id: <20170510142406.GA30426@scylladb.com>
2017-05-11 15:49:52 +01:00
Vlad Zolotarov
a855e82eff service::client_state: don't allow dropping the system_auth and system_traces objects
Prevent the accidental dropping of system_auth and system_traces objects (keyspaces and tables)
but allow their modification (including tables).

We need to be able to modify keyspaces in order to set/modify the replication strategy and its parameters.
We need to be able to ALTER the tables in order to allow rolling upgrades when some of the tables have changed.

Fixes #2346
Fixes #2338

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1494363335-20424-1-git-send-email-vladz@scylladb.com>
2017-05-11 13:03:30 +01:00
Paweł Dziepak
ba6b74e305 storage_service: counters are no longer experimental
Message-Id: <20170510124552.23558-1-pdziepak@scylladb.com>
2017-05-10 17:18:32 +03:00
Gleb Natapov
ab92406585 storage_proxy: optimize reconcile logic for CL=ONE
A regular single key query will never reconcile with CL=ONE since there
will be no digest mismatch, but range queries do not have a digest stage,
so they always go through the reconcile code. For CL=ONE there will be only one
result though, so there is no need to run complicated reconciliation logic and the
only result can be returned directly.

Message-Id: <20170509100334.GQ28272@scylladb.com>
2017-05-10 17:09:34 +03:00
Calle Wilund
539b65fc90 client_state: Make "has_access" auth check schema ks name independent
2017-05-09 13:48:55 +00:00
Gleb Natapov
2d5a7c8058 storage_proxy: make read repair stats accessible through Prometheus
Currently they can be read only through JMX.

Message-Id: <20170509075546.GN28272@scylladb.com>
2017-05-09 11:23:38 +03:00
Avi Kivity
8c5c5d3004 Merge "CQL front-end for secondary indices" from Pekka
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."

* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
  schema: Kill index_type enum
  schema: Kill index_info class
  cql3/statements/create_index_statement: Use database::existing_index_names() in validation
  cql3/statements: Use secondary index manager in alter_table_statement class
  index: Add secondary_index_manager
  thrift/handler: Use index_metadata
  db/schema_tables: Index persistence
  schema: Add all_indices() to schema class
  schema: Remove add_default_index_names() from schema_builder class
  db/schema_tables: Add system table for indices
  cql3/Cql.g: DROP INDEX
  cql3/statements: Add drop_index_statement class
  database: Add find_indexed_table() to database class
  cql3: Return change event from announce_migration()
  cql3/statements: Multiple index targets for CREATE INDEX
  cql3/statements: Use index_metadata in create_index_statement class
  cql3/statements: Use feature flag in create_index_statement class
  service/storage_service: Add feature flag for secondary indices
  database: Add get_available_index_name() to database class
  schema: Add get_default_index_name() to index_metadata class
  ...
2017-05-08 17:04:40 +03:00
Calle Wilund
2049303399 query_pagers: bugfix: must count pk only/pk + static rows as 1
Previously only counted clustered/regular

Message-Id: <1494249013-4069-1-git-send-email-calle@scylladb.com>
2017-05-08 16:35:27 +03:00
Avi Kivity
9e67bd5aac Merge "Add partial range deletion support" from Duarte
"This series introduces partial support for range deletions. This allows
deletion operations such as

delete from cf where p=1 and c > 0 and c <= 3.

This series only adds support for single-column range restrictions.

We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom or top bound.

We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.

This limitation should stay in place until we can properly represent range
tombstones in the storage format."

* 'range-deletions/v2' of https://github.com/duarten/scylla:
  mutation: Set cell using clustering_key_prefix
  mutation_partition: Harmonize apply_delete overloads
  prefix_compound_view_wrapper: Add is_full and is_empty functions
  tests/cql_query_test: Add range deletion tests
  cql3: Partially support ranged deletions
  single_column_primary_key_restrictions: Implement has_bound()
  modification_statement: Use statement_restrictions for where clause
  statement_restrictions: Expose primary key restrictions
  to_string: Add missing include
2017-05-07 19:27:09 +03:00
Avi Kivity
a592573491 Remove exception specifications
C++17 removed exception specifications from the language, and gcc 7 warns
about them even in C++14 mode.  Remove them from the code base.
2017-05-05 17:02:31 +03:00
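
For readers unfamiliar with the change in a592573491, the sketch below shows the general shape of such a cleanup. It is not taken from the Scylla code base: `parse_port()` and `is_valid_port()` are hypothetical functions used only to contrast a dynamic exception specification (shown in a comment, since it no longer compiles as C++17) with the forms that remain valid.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Old style, ill-formed since C++17 (dynamic exception specification):
//
//   int parse_port(const std::string& s) throw(std::invalid_argument);
//
// New style: the specification is dropped and the failure mode is documented.
// Throws std::invalid_argument if `s` is not a port number in [0, 65535].
// (std::stoi itself may also throw std::out_of_range for very large inputs.)
int parse_port(const std::string& s) {
    std::size_t used = 0;
    int port = std::stoi(s, &used);
    if (used != s.size() || port < 0 || port > 65535) {
        throw std::invalid_argument("bad port: " + s);
    }
    return port;
}

// noexcept is unaffected by the removal and remains the way to promise
// that a function does not throw.
bool is_valid_port(int port) noexcept {
    return port >= 0 && port <= 65535;
}
```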