It was discussed that the leveled strategy may not benefit from the
parallel compaction feature because almost all compaction jobs will have
a similar size. It was also found that the leveled strategy wasn't working
correctly with it because two overlapping sstables (targeting the same
level) could be created in parallel by two ongoing compactions.
Fixes #1293.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <60fe165d611c0283ca203c6d3aa2662ab091e363.1464883077.git.raphaelsc@scylladb.com>
From Duarte:
This patchset adds the range_tombstone_list data structure,
used to hold a set of disjoint range tombstones, and changes
the internal representation of row tombstones to use that
data structure.
Fixes #1155
[tgrabiec: Added compound_wrapper::make_empty(const schema&) overload
to fix compilation failure in tracing code]
This patch enables the RANGE_TOMBSTONES supported feature, meaning
that the node is capable of accepting row entry tombstones as range
tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the difference between two mutation_partition's
row_tombstones inside the range_tombstone_list.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes #1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the range tombstones feature, which is not enabled
yet, to the storage_service, so that consumers can query for it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the gms::feature destructor so it
checks whether the gossiper has been stopped before trying
to unregister the feature.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the code from sstables/partition.cc which is used
to transform a set of range tombstones into a set of overlapping
scylladb tombstones.
The range_tombstone_merger will be used to send mutations to nodes not
yet updated to support the internal range tombstone representation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This class is responsible for representing a set of range tombstones
as a set of non-overlapping (disjoint) range tombstones.
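The idea of keeping tombstones disjoint can be illustrated with a minimal
sketch. It is not the code from this patch: it assumes integer keys instead of
clustering key prefixes, half-open ranges [start, end), and that the higher
timestamp wins where ranges overlap. The names disjoint_tombstones, range_ts,
apply(), timestamp_at() and ranges() are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// A deleted range [start, end) with its deletion timestamp (hypothetical type).
struct range_ts {
    int start;
    int end;
    int64_t timestamp;
};

class disjoint_tombstones {
    // boundary -> timestamp in effect from this boundary up to the next one.
    // 0 means "no tombstone"; the last boundary always maps to 0.
    std::map<int, int64_t> _bounds;
public:
    // Timestamp covering `key`, or 0 if the key is not deleted.
    int64_t timestamp_at(int key) const {
        auto it = _bounds.upper_bound(key);
        return it == _bounds.begin() ? 0 : std::prev(it)->second;
    }

    // Adds a tombstone for [start, end); overlaps keep the newer timestamp.
    void apply(int start, int end, int64_t ts) {
        if (start >= end) {
            return;
        }
        // Split existing segments at both bounds (emplace keeps existing keys).
        const int64_t at_end = timestamp_at(end);
        const int64_t at_start = timestamp_at(start);
        _bounds.emplace(end, at_end);
        _bounds.emplace(start, at_start);
        for (auto it = _bounds.find(start); it->first != end; ++it) {
            it->second = std::max(it->second, ts);
        }
    }

    // Returns the disjoint ranges, coalescing neighbours with equal timestamps.
    std::vector<range_ts> ranges() const {
        std::vector<range_ts> out;
        for (auto it = _bounds.begin(); it != _bounds.end(); ++it) {
            auto next = std::next(it);
            if (next == _bounds.end() || it->second == 0) {
                continue;
            }
            if (!out.empty() && out.back().end == it->first
                    && out.back().timestamp == it->second) {
                out.back().end = next->first;
            } else {
                out.push_back({it->first, next->first, it->second});
            }
        }
        return out;
    }
};
```

With this sketch, applying [0, 10) at timestamp 5 and then [5, 15) at
timestamp 3 yields two disjoint ranges: [0, 10) at 5 and [10, 15) at 3.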
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the range_tombstone class, composed of
a [start, end] pair of clustering_key_prefixes, the type
of inclusiveness of each bound, and a tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the idl-compiler so that the default value of a
field can be set to the value of a previous field in the class:
    class P {
        uint32_t x;
        uint32_t y = x;
    };
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
... and make it a clustering_key_prefix, in preparation for
supporting not-whole-row range tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Config provides operators << >> for string_map which makes it impossible
to have generic stream operators for unordered_map. Fix it by making
string_map a separate type and not just an alias.
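A minimal sketch of the idea, not the code from this patch: once string_map is
its own type, a config-specific operator<< and a generic operator<< for
unordered_map can coexist and are selected by type. The output formats and the
to_text() helper are made up for illustration.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Generic printer for any unordered_map (illustrative format: "{k: v, ...}").
template <typename K, typename V>
std::ostream& operator<<(std::ostream& os, const std::unordered_map<K, V>& m) {
    os << "{";
    bool first = true;
    for (const auto& kv : m) {
        if (!first) {
            os << ", ";
        }
        first = false;
        os << kv.first << ": " << kv.second;
    }
    return os << "}";
}

// string_map as a separate type rather than an alias of unordered_map.
struct string_map : std::unordered_map<std::string, std::string> {
    using base = std::unordered_map<std::string, std::string>;
    using base::base;
};

// Config-specific printer (illustrative format: "k=v,k=v").
std::ostream& operator<<(std::ostream& os, const string_map& m) {
    bool first = true;
    for (const auto& kv : m) {
        if (!first) {
            os << ",";
        }
        first = false;
        os << kv.first << "=" << kv.second;
    }
    return os;
}

// Helper for the example: formats a value through its operator<<.
template <typename T>
std::string to_text(const T& v) {
    std::ostringstream os;
    os << v;
    return os.str();
}
```

Here a string_map is printed by the config-specific operator (an exact match),
while every other unordered_map falls through to the generic template.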
Message-Id: <20160602102642.GJ9939@scylladb.com>
Limit disk bandwidth to 5MB/s to emulate a slow disk:
    echo "8:0 5000000" > /cgroup/blkio/limit/blkio.throttle.write_bps_device
    echo "8:0 5000000" > /cgroup/blkio/limit/blkio.throttle.read_bps_device
Start scylla node 1 with low memory:
    scylla -c 1 -m 128M --auto-bootstrap false
Run c-s:
    taskset -c 7 cassandra-stress write duration=5m cl=ONE -schema 'replication(factor=1)' -pop seq=1..100000 -rate threads=20 limit=2000/s -node 127.0.0.1
Start scylla node 2 with low memory:
    scylla -c 1 -m 128M --auto-bootstrap true
Without this patch, I saw std::bad_alloc during streaming:
    ERROR 2016-06-01 14:31:00,196 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: std::bad_alloc (std::bad_alloc)
    ...
    ERROR 2016-06-01 14:31:10,172 [shard 0] database - failed to move memtable to cache: std::bad_alloc (std::bad_alloc)
    ...
To fix:
1. Apply the streaming mutation limiter before we read the mutation into
memory, to avoid wasting memory holding a mutation which we cannot yet
send.
2. Reduce the parallelism of sending streaming mutations. Before this patch
we sent all ranges in parallel; with it we send the ranges one by one.
before: nr_vnode * nr_shard * (send_info + cf.make_reader memory usage)
after: nr_shard * (send_info + cf.make_reader memory usage)
We can at least save memory usage by the factor of nr_vnode, 256 by
default.
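The before/after estimate above can be written out as a small calculation.
This is only a sketch of the arithmetic: the struct, the function names and the
per-range figure used in the usage note are hypothetical, and per_range_bytes
stands for "send_info + cf.make_reader memory usage".

```cpp
#include <cstdint>

// Inputs of the estimate (hypothetical names).
struct streaming_params {
    uint64_t nr_vnode;         // number of vnodes (ranges), 256 by default
    uint64_t nr_shard;         // number of shards
    uint64_t per_range_bytes;  // send_info + cf.make_reader memory usage
};

// Before: every range is sent in parallel on every shard.
constexpr uint64_t peak_before(const streaming_params& p) {
    return p.nr_vnode * p.nr_shard * p.per_range_bytes;
}

// After: each shard sends its ranges one by one.
constexpr uint64_t peak_after(const streaming_params& p) {
    return p.nr_shard * p.per_range_bytes;
}
```

For example, assuming 1 MiB per range on a single shard, the estimate drops
from 256 MiB to 1 MiB, which matters for a node started with -m 128M.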
In my setup, fix 1) alone was not enough; with both fix 1) and 2), I saw
no std::bad_alloc. Also, I did not see streaming bandwidth drop due
to 2).
In addition, I tested grow_cluster_test.py:GrowClusterTest.test_grow_3_to_4,
as described:
https://github.com/scylladb/scylla/issues/1270#issuecomment-222585375
With this patch, I saw no std::bad_alloc any more.
Fixes: #1270
Message-Id: <7703cf7a9db40e53a87f0f7b5acbb03fff2daf43.1464785542.git.asias@scylladb.com>
"This series introduces a tracing infrastructure that may be used
for tracing CQL commands execution and measuring latencies of separate
stages of CQL handling as defined by a CQL binary protocol specification.
To begin tracing one should create a "tracing session", which may then
be used to issue tracing events.
If execution of a specific CQL command involves other Nodes (not only a Coordinator),
then a "tracing session ID" is passed to that Node (in the context of the
corresponding RPC call). Then this "session ID" may be used to create a
"secondary tracing session" to issue tracing events in the context of the original session.
The series contains an implementation of tracing that uses a keyspace in the current
cluster for storing tracing information.
This series contains a demo per-request tracing instrumentation of a QUERY
CQL command and even this instrumentation is partial: it only fully instruments
a QUERY->SELECT->read_data call chain.
This is by all means the very beginning of the proper instrumentation which is
to come.
Right now the latencies for a single SELECT of a single row with RF 1 from a 2-Node cluster
on my laptop, started using ccm (for C* all default parameters, for scylla - memory 256MB, --smp 2),
are as follows (pseudo-graphics warning):
---------------------------------------------------------------------------------------------
                                       | scylla (2 Nodes x 2 shards each) | C* 2.1.8
---------------------------------------|----------------------------------|------------------
Coordinator and replica are same Node  |                                  |
(TRACING OFF):                         | 0.3ms                            | 0.3ms
c-s with a single thread mean latency  | (was 0.2ms before the last       |
value                                  | rebase with a master)            |
---------------------------------------------------------------------------------------------
Coordinator and replica are same Node  |                                  |
(TRACING ON)                           | ~250us                           | ~1200us
Running a SELECT command from a cqlsh  |                                  |
a few times                            |                                  |
---------------------------------------------------------------------------------------------
Coordinator and replica are not on the |                                  |
same Node                              | ~700us                           | >2500us
(TRACING ON)                           |                                  |
---------------------------------------------------------------------------------------------
To begin tracing one may use the cqlsh "TRACING ON/OFF" commands:
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> select "C0", "C1" from keyspace1.standard1 where key=0x12345679;

 C0                 | C1
--------------------+------
 0x000000000001e240 | null

(1 rows)

Tracing session: 146f0180-21e7-11e6-b244-000000000000

 activity                                                         | timestamp                  | source    | source_elapsed
------------------------------------------------------------------+----------------------------+-----------+----------------
 select "C0", "C1" from keyspace1.standard1 where key=0x12345679; | 2016-05-24 22:38:24.536000 | 127.0.0.1 |              0
 message received from /127.0.0.1 [0]                             | 2016-05-24 22:38:24.537000 | 127.0.0.2 |             --
 Done reading options [0]                                         | 2016-05-24 22:38:24.537000 | 127.0.0.1 |              3
 read_data handling is done [0]                                   | 2016-05-24 22:38:24.537000 | 127.0.0.2 |             37
 Parsing a statement [0]                                          | 2016-05-24 22:38:24.537000 | 127.0.0.1 |              3
 Processing a statement [0]                                       | 2016-05-24 22:38:24.537000 | 127.0.0.1 |             56
 Done processing - preparing a result [0]                         | 2016-05-24 22:38:24.537000 | 127.0.0.1 |            550
 Request complete                                                 | 2016-05-24 22:38:24.536560 | 127.0.0.1 |            560

cqlsh>"
This is a demo instrumentation:
- Check if a tracing info is present in the read_command.
- If yes - create a tracing session with the given tracing
session ID.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Instrument a coordinator of a SELECT query to send tracing session
info to the corresponding replica Nodes.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Store a trace state inside a client_state.
- Start tracing in a cql_server::connection::process_query().
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Add an optional tracing ID (UUID) field to cql_server::response.
- If _tracing_id is set, make_frame() inserts the tracing ID
in the response message. According to the CQL spec it should be the
first thing in the response "body" and the TRACING bit (0x02) should be
set in the "flags" field.
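The frame layout described above can be sketched as follows. This is an
illustration, not the cql_server code: it assumes the 9-byte header of CQL
binary protocol v3/v4 (version, flags, 2-byte stream, opcode, 4-byte length),
and the free function make_frame() with these parameters is hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

using tracing_id = std::array<uint8_t, 16>;  // a UUID is 16 bytes on the wire
constexpr uint8_t flag_tracing = 0x02;       // TRACING bit of the "flags" field

// Builds a response frame; if `id` is set, it is placed first in the body
// and the TRACING flag is raised. All integers are big-endian.
std::vector<uint8_t> make_frame(uint8_t version, int16_t stream, uint8_t opcode,
                                const std::vector<uint8_t>& body,
                                const std::optional<tracing_id>& id) {
    std::vector<uint8_t> frame;
    // The length covers the tracing ID as well as the rest of the body.
    const uint32_t length =
        static_cast<uint32_t>(body.size() + (id ? id->size() : 0));
    frame.push_back(static_cast<uint8_t>(version | 0x80));  // response bit
    frame.push_back(static_cast<uint8_t>(id ? flag_tracing : 0));
    const uint16_t s = static_cast<uint16_t>(stream);
    frame.push_back(static_cast<uint8_t>(s >> 8));
    frame.push_back(static_cast<uint8_t>(s & 0xff));
    frame.push_back(opcode);
    for (int shift = 24; shift >= 0; shift -= 8) {
        frame.push_back(static_cast<uint8_t>(length >> shift));
    }
    if (id) {
        frame.insert(frame.end(), id->begin(), id->end());
    }
    frame.insert(frame.end(), body.begin(), body.end());
    return frame;
}
```

A RESULT (opcode 0x08) response with a one-byte body then occupies 26 bytes
when traced (9 header + 16 tracing ID + 1 body) and 10 bytes otherwise.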
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When client_state is created with an external_tag - store
a client address in the client state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
trace_state: Is a single tracing session.
tracing: A sharded service that contains an i_trace_backend_helper instance
and is a "factory" of trace_state objects.
trace_state main interface functions are:
- begin(): Start time counting (should be used via tracing::begin() wrapper).
- trace(): Create a tracing event - it's coupled with a time passed since begin()
(should be used via tracing::trace() wrapper).
- ~trace_state(): Destructor will close the tracing session.
"tracing" service main interface functions are:
- start(): Initialize a backend.
- stop(): Shut down a backend.
- create_session(): Creates a new tracing session.
(tracing::end_session(): Is called by a trace_state destructor).
When trace_state needs to store a tracing event it uses a backend helper from
a "tracing" service.
The "tracing" service limits the number of open tracing sessions to a static number.
If this number is reached, subsequent sessions are dropped.
trace_state applies a similar limit to the number of tracing events per single
session.
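The relationship between the two classes and the two limits can be sketched as
below. This is a simplified, single-threaded illustration, not the sharded
service from this patch: tracing_service, the constructor parameters and the
counters are hypothetical, and no backend helper is involved.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class tracing_service;

// A single tracing session.
class trace_state {
    using clock = std::chrono::steady_clock;
    tracing_service& _owner;
    std::size_t _max_events;
    clock::time_point _started;
    std::vector<std::pair<std::string, std::chrono::microseconds>> _events;
public:
    trace_state(tracing_service& owner, std::size_t max_events)
        : _owner(owner), _max_events(max_events) {}
    ~trace_state();  // closes the session

    // Starts time counting.
    void begin() { _started = clock::now(); }

    // Records an event with the time passed since begin(); returns false
    // if the per-session event limit caused the event to be dropped.
    bool trace(std::string msg) {
        if (_events.size() >= _max_events) {
            return false;
        }
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
            clock::now() - _started);
        _events.emplace_back(std::move(msg), elapsed);
        return true;
    }
};

// Factory of trace_state objects with a static limit on open sessions.
class tracing_service {
    std::size_t _max_sessions;
    std::size_t _max_events;
    std::size_t _open = 0;
    std::size_t _dropped = 0;
public:
    tracing_service(std::size_t max_sessions, std::size_t max_events)
        : _max_sessions(max_sessions), _max_events(max_events) {}

    // Returns nullptr (session dropped) once the limit is reached.
    std::unique_ptr<trace_state> create_session() {
        if (_open >= _max_sessions) {
            ++_dropped;
            return nullptr;
        }
        ++_open;
        return std::make_unique<trace_state>(*this, _max_events);
    }

    // Called by the trace_state destructor.
    void end_session() { --_open; }

    std::size_t open_sessions() const { return _open; }
    std::size_t dropped_sessions() const { return _dropped; }
};

inline trace_state::~trace_state() { _owner.end_session(); }
```

Destroying a trace_state frees its slot, so a later create_session() call can
succeed again after the limit had been hit.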
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Uses a CQL keyspace system_traces to store tracing information.
Uses two tables:
    CREATE TABLE system_traces.sessions (
        session_id uuid,
        command text,
        client inet,
        coordinator inet,
        duration int,
        parameters map<text, text>,
        request text,
        started_at timestamp,
        PRIMARY KEY ((session_id)))
and
    CREATE TABLE system_traces.events (
        session_id uuid,
        event_id timeuuid,
        activity text,
        source inet,
        source_elapsed int,
        thread text,
        PRIMARY KEY ((session_id), event_id))
system_traces.sessions table contains records of tracing sessions.
system_traces.sessions columns description:
- session_id: an ID of the session.
- command: type of a command this session was created for
(currently supported "NONE", "QUERY" and "REPAIR").
- client: IP of the client that issued the command.
- coordinator: IP of a coordinator that received the command.
- duration: total duration of the tracing session (in us).
- parameters: optional parameters for this session, passed to
i_trace_state::begin() call.
- request: a CQL command this tracing session is created for.
- started_at: the time the session has been started at.
system_traces.events contains records of separate tracing events.
system_traces.events columns description:
- session_id: an ID of the session.
- event_id: an ID of the event.
- activity: the trace point description - a message given to
i_trace_state::trace().
- source: IP of the Node where trace event was issued.
- source_elapsed: time passed since creation of a tracing session (in us) on
the Node where this trace event was issued.
- thread: name of the thread in whose context this trace event was
issued (currently it's "core N", where 'N' is the index of
the shard the trace event was issued on).
This class will cache lambdas creating the corresponding mutations for each tracing
record requested to be stored until the flush() method is called.
flush() will merge all pending mutations to the "sessions" and "events" tables and
then apply a mutation to the "events" table and, when it completes, to the "sessions"
table. This way it ensures that when some tracing session is visible, all its
events are visible too.
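The caching and the flush ordering can be sketched as below. This is a
synchronous simplification, not trace_keyspace_helper itself: mutations are
modelled as strings, they are applied one by one rather than merged per table,
and the class name and the applied() accessor are hypothetical.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

class keyspace_helper_sketch {
    // A cached lambda that builds the "mutation" only when flushed.
    using mutation_maker = std::function<std::string()>;
    std::vector<mutation_maker> _pending_sessions;
    std::vector<mutation_maker> _pending_events;
    std::vector<std::string> _applied;  // stand-in for the database

    void apply_all(std::vector<mutation_maker>& pending) {
        for (auto& make : pending) {
            _applied.push_back(make());
        }
        pending.clear();
    }
public:
    void store_session_record(std::string session_id) {
        _pending_sessions.push_back([id = std::move(session_id)] {
            return "sessions:" + id;
        });
    }

    void store_event_record(std::string session_id, std::string activity) {
        _pending_events.push_back(
            [id = std::move(session_id), act = std::move(activity)] {
                return "events:" + id + ":" + act;
            });
    }

    // Events are applied first, sessions second, so a visible session
    // always has its events visible too.
    void flush() {
        apply_all(_pending_events);
        apply_all(_pending_sessions);
    }

    const std::vector<std::string>& applied() const { return _applied; }
};
```

Even if the session record is stored before its events, flush() writes the
events first; nothing reaches the stand-in database until flush() is called.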
trace_keyspace_helper exposes a few metrics via collectd:
- tracing_error - a total number of errors (not including OOM)
- bad_column_family_errors - number of times a tracing record wasn't
stored because system_trace tables' schema
didn't match the expected value. This may happen if
a DB administrator is doing funny things like altering
the schemas of the above tables.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This class represents an interface for a specific backend that is
going to store tracing information.
The specific implementation may, and is expected to, implement caching
of pending tracing records.
Interface functions are:
- start(): Initialize a backend (e.g. create keyspace and tables).
- stop(): Flush all pending work and shut down the backend.
- store_session_record()/store_event_record():
Cache/store the corresponding tracing records.
- flush(): Flush pending tracing records.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
writes_attempts is supposed to count how many times data was sent out, but
currently it also counts replicas in other DCs that get the data
through a coordinator. Fix it by counting only when data is actually sent.
Message-Id: <20160601153124.GB9939@scylladb.com>
"One of the things we need to do as part of the throttle rework I am doing is to
serialize memtable flushes to some extent - that will guarantee that in case
we're throttling, the flushes finish earlier and release memory earlier, if
compared to the case in which we just let all tables flush freely and
simultaneously."