Commit Graph

11716 Commits

Author SHA1 Message Date
Raphael S. Carvalho
b9f67351da db: expose clustering filter info via collectd
That's needed to observe behavior of clustering filter, and to
check if it's worthwhile for a specific workload.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:32:23 -03:00
Raphael S. Carvalho
a2dc88889d db: enable clustering optimization only on dtcs
Leveled strategy will not benefit from this optimization because
there are only a few sstables that will contain a given partition
key, which means that a clustering key that belongs to a specific
partition key can only be in a few sstables as well.

Date tiered strategy is the one that will actually benefit the
most from this optimization. Size tiered may benefit from it too
if clustering key isn't overwritten, but it will not use the
clustering optimization.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:31:07 -03:00
Raphael S. Carvalho
8d03ccd604 sstables: optimize reads with clustering filter
If user specifies a clustering filter, it's possible to filter out
sstable based on its metadata that tracks min/max clustering value.

For example, if sstable stores clustering key from 'a' through 'c',
it's possible to filter out that sstable if user asks for data
with clustering key greater than 'c'.

That's done by comparing each component separately because
clustering key may be composite. Further information can be found
here: https://issues.apache.org/jira/browse/CASSANDRA-5514
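
A minimal sketch of the idea, not the code from this patch: it assumes the
sstable metadata holds one min/max pair per clustering component and that the
query is expressed as optional per-component bounds. The names component_range,
component_query and may_contain are hypothetical, and plain string comparison
stands in for the schema's type comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical per-component metadata: smallest and largest value stored.
struct component_range {
    std::string min;
    std::string max;
};

// Hypothetical per-component restriction taken from the query.
struct component_query {
    std::optional<std::string> lower;
    bool lower_inclusive = true;
    std::optional<std::string> upper;
    bool upper_inclusive = true;
};

// Returns false only when some component's restriction cannot intersect
// the [min, max] interval recorded for that component.
bool may_contain(const std::vector<component_range>& meta,
                 const std::vector<component_query>& query) {
    const std::size_t n = std::min(meta.size(), query.size());
    for (std::size_t i = 0; i < n; ++i) {
        const auto& m = meta[i];
        const auto& q = query[i];
        if (q.lower && (*q.lower > m.max ||
                        (!q.lower_inclusive && *q.lower == m.max))) {
            return false;
        }
        if (q.upper && (*q.upper < m.min ||
                        (!q.upper_inclusive && *q.upper == m.min))) {
            return false;
        }
    }
    return true;
}
```

With metadata 'a' through 'c', a query for keys greater than 'c' is ruled out,
while a query for keys greater than or equal to 'c' is kept as a candidate.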

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:51:50 -03:00
Raphael S. Carvalho
768aced741 partition_slice: introduce key-independent function to get ranges
That will be important for sstable code that will rule out a sstable
if it doesn't cover a given clustering key range.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:56 -03:00
Raphael S. Carvalho
dce61ddb02 types: introduce abstract_type::as_tri_comparator()
That's akin to abstract_type::as_less_comparator's nature.
So we don't have to repeat something like the following everywhere:
auto cmp = [&type] (const bytes_view& b1, const bytes_view& b2) {
	return type->compare(b1, b2); }
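
A rough sketch of what such a helper could look like. type_sketch and
bytes_type_sketch are hypothetical stand-ins, not the real abstract_type, and
std::string_view replaces bytes_view so the example is self-contained.

```cpp
#include <cassert>
#include <functional>
#include <string_view>

// Hypothetical stand-in for abstract_type; only compare() is modelled.
struct type_sketch {
    virtual ~type_sketch() = default;

    // Negative, zero or positive, in the style of memcmp.
    virtual int compare(std::string_view a, std::string_view b) const = 0;

    // Returns a callable that forwards to compare(), so callers do not
    // have to write the lambda by hand. It must not outlive *this.
    std::function<int(std::string_view, std::string_view)>
    as_tri_comparator() const {
        return [this](std::string_view a, std::string_view b) {
            return compare(a, b);
        };
    }
};

// Hypothetical concrete type that orders values bytewise.
struct bytes_type_sketch final : type_sketch {
    int compare(std::string_view a, std::string_view b) const override {
        return a.compare(b);
    }
};
```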

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:53 -03:00
Raphael S. Carvalho
004617839d database: check bloom filter of all sstables earlier
All sstables will now have bloom filter checked in a single pass
before the reader iterates through all candidates. It's possible that
we will need to futurize the procedure if it holds cpu for too
long. This change is also a step towards the optimization that
will rule out sstables based on clustering filter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:08 -03:00
Raphael S. Carvalho
2a426ab248 tests: add test to check tombstone metadata
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:35 -03:00
Raphael S. Carvalho
94c8ef39c3 sstables: store components ranges in sstable object
Store range for each clustering component in sstable itself to
optimize sstable filtering based on clustering key.
If schema defines no clustering key, this new field will be
empty. Each range stores min and max value of that specific
component. With this information, it's possible to know if a
sstable possibly stores a given clustering component.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:32 -03:00
Raphael S. Carvalho
026853fabb tests: add test to check composite validity
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:30 -03:00
Raphael S. Carvalho
0a5af61176 sstables: introduce function to validate min max clustering values
Scylla was generating a sstable with incorrect min max clustering
values. This information is used to filter out a sstable when user
asks for a range of clustering rows. So it's important to detect
wrong metadata and make sure that it will not be used.
The validation is fast and will only happen when loading a sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:28 -03:00
Raphael S. Carvalho
1f31223f32 sstables: store schema in sstable object
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:17 -03:00
Avi Kivity
7a140a306e Revert "sstables: optimize selection of sstables for leveled strategy"
This reverts commit c75b07fc34f0e7267a8e49276b96bbd4686cb78d; does not
deduplicate the sstable list.
2016-09-01 18:34:08 +03:00
Raphael S. Carvalho
c75b07fc34 sstables: optimize selection of sstables for leveled strategy
It's possible to copy sstables directly into vector, and that will
improve performance. My benchmark tool[1] shows that new version
reduces running time of *copy procedure* by factor of two after
1024^2 calls.
Switching to back_inserter improves throughput even further.
[1]: gist.github.com/raphaelsc/a4b27290f362cdecdef399770dda759c

Refs #1632.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <7153514a9b5f5eb24dff518ee9fa3680e0881dae.1472741401.git.raphaelsc@scylladb.com>
2016-09-01 18:08:53 +03:00
Glauber Costa
dc5d8e33af Revert "row_cache: update sstable histograms on cache hits"
This reverts commit 1726b1d0cc.

Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.
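
To illustrate why a value of 0 cannot be told apart from 1, here is a small
sketch. It assumes the bucket layout used by Cassandra's EstimatedHistogram
(first offset 1, each following offset about 1.2 times the previous one, and at
least one larger); make_offsets and bucket_for are hypothetical names, not
functions from this tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds n bucket upper bounds: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, ...
// (assumed layout, see the note above).
std::vector<std::int64_t> make_offsets(std::size_t n) {
    std::vector<std::int64_t> offsets;
    if (n == 0) {
        return offsets;
    }
    std::int64_t last = 1;
    offsets.push_back(last);
    for (std::size_t i = 1; i < n; ++i) {
        std::int64_t next = std::llround(last * 1.2);
        if (next == last) {
            ++next;  // offsets must be strictly increasing
        }
        offsets.push_back(next);
        last = next;
    }
    return offsets;
}

// Index of the first bucket whose upper bound is >= value.
std::size_t bucket_for(const std::vector<std::int64_t>& offsets,
                       std::int64_t value) {
    return static_cast<std::size_t>(
        std::lower_bound(offsets.begin(), offsets.end(), value) -
        offsets.begin());
}
```

Under this layout both 0 and 1 land in bucket 0, so a cache hit recorded as
"0 sstables" would be reported as a read that touched one sstable.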

Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.

The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format on the other side.

Conflicts:
	row_cache.hh

Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-01 18:07:31 +03:00
Avi Kivity
e33671c285 Merge "tracing: Trace read sstables" from Duarte
"This patchset traces sstables we read from. To do that, we
need to flow the trace_state_ptr to the mutation_readers."
2016-09-01 13:24:16 +03:00
Duarte Nunes
ba374da043 database: Trace sstable accesses
This patch traces when we read from an sstable, be it a key range or a
single one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
f4cf2f2aef tracing: Make trace_state_ptr argument required
This patch makes the optional trace_state_ptr arguments introduced in
previous patches mandatory where possible. Functions which are called
internally don't have a trace context, so for those we keep the
argument's default value for convenience.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
46b86ff801 storage_proxy: Pass along trace_state for queries
This patch changes the storage_proxy so it passes along a
trace_state_ptr to the layers below, when querying locally or
receiving a remote query request.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
030db65c62 database: Accept a trace_state_ptr
This patch changes the database and column_family types so a
trace_state_ptr can be passed in when querying. This enables tracing
of the inner components.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:28 +02:00
Duarte Nunes
9269256246 row_cache: Accept a trace_state_ptr
This patch changes the row_cache so it accepts a trace_state_ptr,
which it is responsible for passing to the underlying mutation_reader
if needed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:00:55 +02:00
Duarte Nunes
5fd66f00c2 mutation_reader: Accept trace_state_ptr
This patch changes the mutation_reader so it optionally accepts a
trace_state_ptr. This will allow us to trace, for example, which
sstables are accessed during a request.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:00:31 +02:00
Avi Kivity
cc127295e9 Merge "Fill in information for sstables per read histogram" from Glauber
"Nodetool cfhistograms is supposed to tell us how many SSTables were touched per
read. Currently, we are a bit in the dark as we don't export that information.

This patch exports that, so that we can start using it."
2016-09-01 12:54:24 +03:00
Glauber Costa
1726b1d0cc row_cache: update sstable histograms on cache hits
If we have a cache hit, we still need to update our sstable histogram - noting
that we have touched 0 SSTables.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:22 -04:00
Glauber Costa
ce24fd05fe database: keep statistics on SSTables touched per read
That is done for single partition queries only - mimicking what
Cassandra does on that matter.

For this to be correct, we also need to update this histogram on cache
hits - in which case we update the read as having touched 0 SSTables. That
will be done on a separate patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:21 -04:00
Glauber Costa
0f413695ac database: make column family stats mutable
The make_reader method is currently a const method, but we would like to start
keeping hit statistics from it.

Instead of relaxing the const condition too much, we can just mark the _stats
field as mutable, indicating that make_reader will not be able to change
anything in the CF, except for keeping statistics.
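
The pattern can be sketched as follows. cf_stats and column_family_sketch are
hypothetical stand-ins with a single counter; the real class and its reader
type are far richer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical statistics block; the real one tracks much more.
struct cf_stats {
    std::uint64_t reads = 0;
};

class column_family_sketch {
    // mutable: may be updated from const member functions.
    mutable cf_stats _stats;
    int _data = 42;  // stands in for the actual table contents

public:
    // Still const: it cannot modify _data, only the statistics.
    int make_reader() const {
        ++_stats.reads;
        return _data;
    }

    const cf_stats& stats() const { return _stats; }
};
```

Any attempt to assign to _data inside make_reader() would still fail to
compile, which is the point of keeping the method const.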

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:24 -04:00
Glauber Costa
5c4d73577a initialize sstables_per_read histogram with 35 instead of 90 buckets
This is to match what Cassandra does. Nodetool may be expecting this on the
other side.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:24 -04:00
Glauber Costa
4310635bae move estimated histogram to utils
Nothing sstable-specific in it, really.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:23 -04:00
Glauber Costa
ffc2131c51 decouple estimated_histogram from sstables
There is nothing really that fundamentally ties the estimated histogram to
sstables. This patch gets rid of the few incidental ties. They are:

 - the namespace name, which is now moved to utils. Users inside sstables/
   now need to add a namespace prefix, while the ones outside have to change
   it to the right one
 - sstables::merge, which has a very non-descriptive name to begin with, is
   changed to a more descriptive name that can live inside utils/
 - the disk_types.hh include has to be removed - but it had no reason to be
   here in the first place.

What remains to be done is to actually move the file outside sstables/. That is done in a separate
step for clarity.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:13:23 -04:00
Yoav Kleinberger
624165da79 scyllatop: dump all output to stdout instead of running a fancy console interface
Sometimes the user would like to dump all the metrics into a file or
pipe it to another program, as requested in issue #1506.
This patch makes scyllatop check if stdout is connected to a TTY,
and if not - it does not fire up the fancy urwid UI but instead, just
writes all its collected metrics to stdout.

Optionally, the user can tell the program to quit after a specific
number of iterations via the -n or --iterations flag.

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1471777516-9903-1-git-send-email-yoav@scylladb.com>
2016-08-31 08:31:36 +03:00
Paweł Dziepak
e981101fa9 Merge "Remove clustering_key_filtering_context" from Piotr
"clustering_key_filtering_context is no longer needed.
partition_slice can be used instead so this series removes
clustering_key_filtering_context and passes partition_slice down where
it's needed. Then a static get_ranges method is used to obtain
clustering key ranges for a given partition.

Fixes #1614."
2016-08-30 22:30:15 +01:00
Piotr Jastrzebski
3607d99269 Remove clustering_key_filtering_context.
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 20:31:55 +02:00
Piotr Jastrzebski
b05b90b3a5 Introduce clustering_key_filter_ranges.
This fixes the problem of multiple concurrent get_ranges calls.
Previously each call was invalidating the result of the previous
call. Now they don't step on each other's feet.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 19:46:38 +02:00
Duarte Nunes
39e0fb1260 storage_proxy: Support multiple partition ranges
This patch adds the ability to query multiple partition ranges. This
is needed since 55f2cf1626, where we
started unwrapping partition ranges in Thrift.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1472474594-15368-1-git-send-email-duarte@scylladb.com>
2016-08-30 17:43:40 +03:00
Takuya ASADA
533dc0485d dist/common/scripts/scylla_sysconfig_setup: sync cpuset parameters with rps_cpus settings when posix_net_conf.sh is enabled and NIC is single queue
In posix_net_conf.sh's single-queue NIC mode (that is, RPS-enabled mode), cpu0 and its sibling are excluded from the CPUs used for network stack processing, and the NIC IRQ is assigned to cpu0.
Since the network stack never runs on cpu0 and its sibling, we need to exclude these CPUs from scylla too to get better performance.
To do this, we get the RPS cpu mask from posix_net_conf.sh and pass it to scylla_cpuset_setup to construct /etc/scylla.d/cpuset.conf when scylla_setup is executed.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-2-git-send-email-syuu@scylladb.com>
2016-08-30 16:51:16 +03:00
Takuya ASADA
0c3bb2ee63 dist/common/scripts/scylla_prepare: drop unnecessary multiqueue NIC detection code on scylla_prepare
Right now scylla_prepare passes the -mq option to posix_net_conf.sh when the number of RX queues is > 1, but posix_net_conf.sh sets the NIC mode to sq when queues < ncpus / 2.
So the logic differs, and posix_net_conf.sh no longer needs -sq/-mq to be specified; it autodetects the queue mode.
So we drop the detection logic from scylla_prepare and let posix_net_conf.sh detect it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-1-git-send-email-syuu@scylladb.com>
2016-08-30 16:51:15 +03:00
Pekka Enberg
eff14bae0e transport/server: Explicit CQL type IDs
The CQL type IDs are specified as hex in the CQL binary protocol
specification. Define CQL type IDs in the code explicitly to make
reviewing the code and adding new types easier.

Message-Id: <1472537971-26053-1-git-send-email-penberg@scylladb.com>
2016-08-30 09:45:26 +03:00
Avi Kivity
809d739ae8 Merge seastar upstream
* seastar 2b07b1f...0303e0c (3):
  > scripts/posix_net_conf.sh: add support --cpu-mask mode
  > file: improve tmpfs support
  > file::close: remove trailing newline in log message
2016-08-29 13:26:04 +03:00
Pekka Enberg
2d3aee73a6 systemd: Don't start Scylla service until network is up
Alexandr Porunov reports that Scylla fails to start up after reboot as follows:

  Aug 25 19:44:51 scylla1 scylla[637]: Exiting on unhandled exception of type 'std::system_error': Error system:99 (Cannot assign requested address)

The problem is that because there's no dependency on the network service,
Scylla simply attempts to start up too soon in the boot sequence and
fails.

Fixes #1618.

Message-Id: <1472212447-21445-1-git-send-email-penberg@scylladb.com>
2016-08-29 13:15:39 +03:00
Takuya ASADA
74d994f6a1 dist/common/scripts/scylla_setup: support enabling services on Ubuntu 15.10/16.04
Right now it ignores Ubuntu, but we share the .service file between Fedora/CentOS and Ubuntu >= 15.10, so support it.

Fixes #1556.

Message-Id: <1471932814-17347-1-git-send-email-syuu@scylladb.com>
2016-08-29 13:13:14 +03:00
Avi Kivity
fb3a83a811 Merge "Slow query logging" from Vlad
"This series introduces a "slow query logging" feature that
allows logging the queries that take more than a specified
threshold time to complete.

Once such a query is detected, it will be logged in a system_traces.node_slow_log table.

In addition, all traces for that query that have been collected on a Coordinator
are going to be written as well.

If the handling time on a replica in the context of a query exceeds the same threshold,
the replica's traces are going to be written too.

The row in node_slow_log contains the session_id of the corresponding tracing session,
thereby allowing the user to query the system_traces tables for the corresponding trace
records.

The schema of the node_slow_log table is as follows:

CREATE TABLE system_traces.node_slow_log (
    node_ip inet,
    shard int,
    session_id uuid,
    date timestamp,
    start_time timeuuid,
    command text,
    duration int,
    parameters map<text, text>,
    source_ip inet,
    table_names set<text>,
    username text,
    PRIMARY KEY (start_time, node_ip, shard))
    WITH default_time_to_live = 86400

where
 - node_ip: IP of the coordinator Node.
 - shard: shard ID on a Coordinator where the query was handled.
 - session_id: ID of a corresponding tracing session.
 - date: the time when the query began.
 - start_time: a time-based UUID for this query (needed for a primary key mostly).
 - command: a query string.
 - duration: a time it took to handle this query (in microseconds).
 - parameters: a map of query parameters (like in system_traces.sessions).
 - source_ip: IP of a Client that sent this query.
 - table_names: a set of "<keyspace>.<table name>" strings representing column
                families used in this query.
 - username: a user name used for this query.

The good thing is that most of the data we needed is already
collected by the regular tracing framework. The only missing ones
are a username and tables' names. So, this series makes the framework collect them too.

The whole feature is integrated in the Tracing framework. The main
changes to the framework that were made are as follows:
 - Store the constant capabilities of the tracing session in an enum_set, e.g.:
  - primary/secondary.
  - write on close.
 - Introduce two new capabilities to a tracing session of a specific query:
  - full tracing: collect all traces for this query (as it is before this series).
  - log slow query: log this query if its duration is above the threshold.
  These two capabilities may be defined independently.
 - Add the logic that handles the "log slow query"-only case:
  - Build the parameters<sstring, sstring> map only if the "duration" is above
    the given threshold.
  - The same about writing the trace entries.
 - In a not-only "log slow query" case:
  - Write the node_slow_log entry.
  - Extend the trace_info struct to pass slow query threshold and TTL to the replica
    Node.

In addition to the above, this series adds the capability to configure the slow query logging
threshold and a TTL for the node_slow_log records.

The heaviest patch in the series is the last one. The series contains a few cosmetic (renaming)
patches that are meant to align the naming of the existing methods with the ones the last one
is going to add."
2016-08-29 13:11:36 +03:00
Gleb Natapov
a2cdddb795 storage_proxy: forward mutation write with correct timeout value
Now that mutation handler knows how much time is left for mutation
write to be handled it can use this knowledge to set correct timeout
for forwarded mutations.

Message-Id: <20160828080637.GE9243@scylladb.com>
2016-08-29 13:06:36 +03:00
Avi Kivity
6cb796f38b Merge seastar upstream
* seastar ef063c5...2b07b1f (1):
  > file: make close() more robust against concurrent calls
2016-08-29 12:25:57 +03:00
Avi Kivity
f5f58b46c7 sstables: enable write-behind
Write-behind allows a single sstable write to saturate the disk,
improving throughput.  Later we can take advantage of this to reduce
the number of sstables being written concurrently.
2016-08-29 12:25:15 +03:00
Pekka Enberg
c5e5e7bb40 dist/docker: Clean up Scylla description for Docker image
Message-Id: <1472145307-3399-1-git-send-email-penberg@scylladb.com>
2016-08-29 10:48:06 +03:00
Vlad Zolotarov
a491ac0f18 tracing: introduce a log_slow_query logic
The main idea is to log queries that take "too long" to complete.
The "too long" is above the given threshold.

To achieve the above this patch does the following:
   - Introduce two new properties to the tracing::trace_state:
      - "Full tracing": when the tracing of this query was explicitly requested.
        In this state we will record all possible traces related to this query:
        both on the coordinator and on any replica involved.
      - "Log slow query": when slow query logging is enabled.
        If slow query logging is enabled and a session's "duration" is above
        the specified threshold we will create a record in the "slow queries log"
        and write all trace records created on the coordinator and on a replica
        if a replica's session lasts longer than that threshold.
        (We will propagate the Coordinator's slow query logging threshold to replicas
        in the context of a specific tracing/logging session).

     The properties above are independent, namely they may be enabled and/or disabled
     independently and any combination of them is legal (naturally, creating a tracing
     session when both states above are disabled makes no sense).
   - Instrument the tracing::tracing service to allow the following:
    - Enable/disable slow query logging.
    - Set/get the slow query duration threshold (in microseconds).
    - Set/get the slow query log record TTL value (in seconds).
   - Instrument the trace_keyspace_helper to write a slow query log entry
     when requested.
   - The slow query logging is disabled by default and the threshold is set to half a second.
   - The TTL of a slow log record is set to 86400 seconds by default.
   - It makes sense to use the same "slow query logging threshold" and a "slow query record TTL"
     both on a coordinator and on a replica Nodes in a context of the same tracing session:
     - Pass both TTL and a threshold to the replica in a trace_info.

This patch also implements the new slow query logging specific logic:
   - Don't write the pending tracing records before the end of a tracing session
     until "duration" reaches the logging threshold.
   - Don't build the parameters<sstring, sstring> map unless we know we will write it
     to I/O.
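
The decision logic described above can be sketched roughly as follows. The
names trace_prop, slow_query_config, should_write_trace_records and
should_write_slow_log_entry are hypothetical and do not mirror the actual
tracing::trace_state interface; only the defaults (half a second, 86400
seconds) come from the text.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Two independent per-session properties, kept as bit flags.
enum trace_prop : std::uint8_t {
    full_tracing = 1 << 0,
    log_slow_query = 1 << 1,
};

// Defaults taken from the description: 0.5 s threshold, 86400 s TTL.
struct slow_query_config {
    std::chrono::microseconds threshold{500000};
    std::chrono::seconds record_ttl{86400};
};

// A slow-log entry is written only when slow query logging is enabled
// and the session lasted longer than the threshold.
bool should_write_slow_log_entry(std::uint8_t props,
                                 std::chrono::microseconds duration,
                                 const slow_query_config& cfg) {
    return (props & log_slow_query) != 0 && duration > cfg.threshold;
}

// Trace records are written when full tracing was requested, or when
// the query turned out to be slow.
bool should_write_trace_records(std::uint8_t props,
                                std::chrono::microseconds duration,
                                const slow_query_config& cfg) {
    if ((props & full_tracing) != 0) {
        return true;
    }
    return should_write_slow_log_entry(props, duration, cfg);
}
```

Deferring the write until the duration is known is what lets the
"log slow query"-only case avoid building the parameters map for fast queries.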

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-08-28 18:28:44 +03:00
Avi Kivity
e81c1df557 Merge seastar upstream
* seastar 6fadd98...ef063c5 (2):
  > rpc: pass a timeout to a verb's server handler if the one was specified by a client
  > rpc: cleanup the old metaprogramming craft
2016-08-25 17:53:19 +03:00
Paweł Dziepak
6012a7e733 mutation_partition: fix iterator invalidation in trim_rows
Reversed iterators are adaptors for 'normal' iterators. These underlying
iterators point to different objects than the reversed iterators
themselves.

The consequence of this is that removing an element pointed to by a
reversed iterator may invalidate a reversed iterator which points to a
completely different object.

This is what happens in trim_rows for reversed queries. Erasing a row
can invalidate end iterator and the loop would fail to stop.

The solution is to introduce
reversal_traits::erase_dispose_and_update_end() function, which erases and
disposes of the object pointed to by a given iterator but also takes a
reference to an end iterator and updates it if necessary to make sure
that it stays valid.
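
The same effect can be reproduced with std::list, as a simplified sketch. The
helper erase_and_update_end below is hypothetical and only models the idea; the
real function works on the intrusive row container and also disposes of the
erased object.

```cpp
#include <cassert>
#include <iterator>
#include <list>

using list_t = std::list<int>;
using rev_it = std::reverse_iterator<list_t::iterator>;

// Erases the element that `it` refers to. A reverse iterator refers to
// the node *before* its base(), so the reverse `end` may be built on the
// very node being erased; in that case `end` is rebuilt from the
// iterator returned by erase().
rev_it erase_and_update_end(list_t& l, rev_it it, rev_it& end) {
    auto victim = std::prev(it.base());  // node that *it denotes
    const bool end_uses_victim = (end.base() == victim);
    auto next = l.erase(victim);         // forward successor of victim
    if (end_uses_victim) {
        end = rev_it(next);
    }
    return rev_it(next);  // refers to the element that preceded victim
}

// Erases everything in the reverse range [rbegin, end).
void trim_reversed(list_t& l, rev_it end) {
    auto it = rev_it(l.end());
    while (it != end) {
        it = erase_and_update_end(l, it, end);
    }
}
```

Without the update, erasing the last element of the range would leave `end`
holding an invalidated base iterator, and the `it != end` check would have
undefined behaviour.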

Fixes #1609.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>
2016-08-25 16:52:35 +03:00
Paweł Dziepak
5f84348ce1 test.py: add missing nonwrapping_range_test
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472126087-15484-1-git-send-email-pdziepak@scylladb.com>
2016-08-25 15:36:10 +03:00
Piotr Jastrzebski
cda2e8f833 Remove stateless_clustering_key_filter_factory
It can be easily replaced with partition_slice_clustering_key_filter_factory.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-25 08:53:31 +02:00
Piotr Jastrzebski
5bf8807f9b Remove clustering_key_filtering_context::get_filter*
These methods are not used any more so they can go away.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-25 08:53:31 +02:00