Commit Graph

688 Commits

Author SHA1 Message Date
Avi Kivity
98226a14ac Merge "Exception propagation writers in commitlog batch"
"
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.

Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.

Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.

Also includes some rudimentary tests for flush queue mechanisms.

Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
2016-08-08 15:33:26 +03:00
Duarte Nunes
e0a43a82c6 system_keyspace: Correctly deal with wrapped ranges
This patch ensures we correctly deal with ranges that wrap around when
querying the size_estimates system table.

Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470412433-7767-1-git-send-email-duarte@scylladb.com>
2016-08-05 19:17:00 +03:00
Avi Kivity
b0a275945f Merge "Remove compact columns" from Duarte
"The compact column is a dense schema's single regular column. Its
existence has been a source of bugs, so this patchset removes the
column_kind::compact_column, as well as further references to compact
columns from the code base.

Fixes #1542"
2016-08-05 12:39:23 +03:00
Duarte Nunes
cb0516a76c schema: Remove compact_column concept
This is a confusing one, and can be replaced the fact that dense
schemas have a single regular column.

Ref #1542

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-03 17:21:41 +00:00
Duarte Nunes
529c3a3ae6 column_kind: Drop compact_column
A compact column is a dense schema's single regular column. The fact
that it is a different column_kind has lead to various bugs (#1535,
derived by the schema being dense and the column being regular.

Fixes #1542

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-03 17:21:37 +00:00
Calle Wilund
0f9e868839 commitlog: Use exception propagation in flush_queue (for batch)
Fixes: #1490

While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.

Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.

Thus, adding a muation to commit log in batch, will now, if an error
is generated, result in an exception to the caller, which should be
interpreted as "data might not have been persisted".

The failing segment is then closed, and we happily hope things will
get better in the next. Which they probably wont.

Missing: registration of some sort of "error-handling policy", similar
to origin, which can either kill transports or shut down process.
(A reasonable guess is that disk errors in commit log are not gonna
be recoverable).
2016-08-03 14:49:43 +00:00
Tomasz Grabiec
9476bc5a31 Introduce --abort-on-lsa-bad-alloc command line option
Useful for triggerring core dump on allocation failure inside LSA,
which makes it easier to debug allocation failures. They normally
don't cause aborts, just fail the current operation, which makes it
hard to figure out what was the cause of allocation failure.

Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>
2016-08-03 17:26:44 +03:00
Amnon Heiman
bb4268a8a5 Add prometheus API
This patch adds the prometheus API it adds the proto library to the
compilation, adds an optional configuration parameter to change the
prometheus listening port and start the prometheus API in main.

To disable the prometheus API, set its listening port to 0.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470228764-19545-2-git-send-email-amnon@scylladb.com>
2016-08-03 15:55:18 +03:00
Avi Kivity
75ee8fc2a7 size_estimates_recorder: adjust indentation 2016-07-30 20:10:12 +03:00
Avi Kivity
64d0cf58ea size_estimates_recorder: unwrap ranges before searching for sstables
column_family::select_sstables() requires unwrapped ranges, so unwrap
them.  Fixes crash with Leveled Compaction Strategy.

Fixes #1507.

Reviewed-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469563488-14869-1-git-send-email-avi@scylladb.com>
2016-07-27 10:06:21 +03:00
Duarte Nunes
ecfa04da77 system_keyspace: Add query_size_estimates() function
The query_size_estimates() function queries the size_estimates system
table for a given keyspace and table, filtering out the token ranges
according to the specified tokens.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:43:58 +00:00
Duarte Nunes
d984cc30bf size_estimates_recorder: Fix stop()
This patch fixes stop() by checking if the current CPU instead of
whether the service is active (which it won't be at the time stop() is
called).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:43:58 +00:00
Duarte Nunes
e16f3f2969 system_keyspace: Avoid pointers in range_estimates
This patch makes range_estimates a proper struct, where tokens are
represented as dht::tokens rather than dht::ring_position*.

We also pass other arguments to update_ and clear_size_estimates by
copy, since one will already be required.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-24 22:43:35 +00:00
Tomasz Grabiec
5e8f0efc85 schema_tables: Fix hang during keyspace drop
Fixes #1484.

We drop tables as part of keyspace drop. Table drop starts with
creating a snapshot on all shards. All shards must use the same
snapshot timestamp which, among other things, is part of the snapshot
name. The timestamp is generated using supplied timestamp generating
function (joinpoint object). The joinpoint object will wait for all
shards to arrive and then generate and return the timestamp.

However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.

The fix is to give each table a separate joinpoint instance.

Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>
2016-07-21 19:14:57 +03:00
Paweł Dziepak
8a386a51bd Merge "Don't cache wide partitions" from Piotr
"When reading a partition try to read it all
but once more bytes are read than a given limit
we decide that partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."
2016-07-21 10:24:25 +01:00
Piotr Jastrzebski
636a4acfd0 Add flag to configure
max size of a cached partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:20 +02:00
Pekka Enberg
aff8cf319d db/config: Start Thrift server by default
We have Thrift support now so start the server by default.
Message-Id: <1469002000-26767-1-git-send-email-penberg@scylladb.com>
2016-07-20 09:25:44 +01:00
Tomasz Grabiec
a0832f08d2 schema_tables: Add more logging
Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:00 +03:00
Vlad Zolotarov
b36b69c1d6 service::storage_proxy: remove a default value for a tracing::trace_state_ptr parameter
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:59 +03:00
Vlad Zolotarov
baa6496816 service::storage_proxy: READ instrumentation: store trace state object in abstract_read_executor
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-07-19 18:21:59 +03:00
Duarte Nunes
9ffdf4a5cd db: Implement size_estimates_recorder
This patch implements the size_estimates_recorder, which periodically
writes estimations for all the non-system column families in the
size_estimates system table. The size_estimates_recorder class
corresponds to the one in Cassandra's SizeEstimatesRecorder.java.

Estimation is carried out by shard 0. Since we're estimating based on
data in shared sstables, having multiple shards doing this would skew
the results.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-19 09:44:58 +00:00
Duarte Nunes
f8f61cf246 system_keyspace: Record and clear size estimates
This patch implements functions that allow the size_estimates system
table to be updated and cleared. The size_estimates table is updated
per schema with a set of token ranges and the associated estimations
of how many partitions there are and their mean size.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-18 23:58:31 +00:00
Gleb Natapov
9cc076c9f3 storage_proxy: preserve endpoint's order while filtering local nodes for query
filter_for_query() gets sorted by preference list of endpoints and
should preserve that order after filtering out non local endpoints for
local query. partition() does not guaranty this while stable_partition()
does, so use it instead.

Fixes #1450.
Message-Id: <20160713100909.GM10767@scylladb.com>
2016-07-13 13:17:28 +03:00
Glauber Costa
73a70e6d0a config: Use Scylla in user visible options
We have imported most of our data about config options from Cassandra.  Due to
that, many options that mention the database by name are still using
"Cassandra".

Specially for the user visible options, which is something that a user sees, we
should really be using Scylla here.

This patch was created by automatically replacing every occurrence of "Cassandra"
with "Scylla" and then later on discarding the ones in which the change didn't
make sense (such as Unused options and mentions to the Cassandra documentation)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1423e1d7e36874a1f46bd091aec96dcb4d8482d9.1468267193.git.glauber@scylladb.com>
2016-07-12 09:18:17 +03:00
Gleb Natapov
726b79ea91 messaging_service: enable internode_compression option
Use LZ4 for internode compression if enabled.

Message-Id: <20160711141734.GZ18455@scylladb.com>
2016-07-11 18:30:21 +03:00
Calle Wilund
4ab03e98cf commitlog: Ensure we don't end up in a loop when we must wait for alloc
Continuation reordering could cause us to repeatedly see the
segment-local flag var even though actual write/sync ops are done.
Can cause wild recursion without actual delayed continuation ->
SOE.

Fix by also checking queue status, since this is the wait object.

Message-Id: <1468234873-13581-1-git-send-email-calle@scylladb.com>
2016-07-11 14:12:38 +03:00
Calle Wilund
14b0fe23c5 commitlog: Ensure we don't end up in a loop when we must wait for alloc
Continuation reordering could cause us to repeatedly see the 
segment-local flag var even though actual write/sync ops are done. 
Can cause wild recursion without actual delayed continuation ->
SOE. 

Fix by also checking queue status, since this is the wait object.
2016-07-11 07:45:36 +00:00
Tomasz Grabiec
8c4b5e4283 db: Avoiding checking bloom filters during compaction
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.

This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.

I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).

Refs #1322.
2016-07-10 09:54:20 +02:00
Asias He
f4389349e4 config: Enable partitioner option
Enable --partitioner option so that user can choose partitioner other
than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When non-supported partitioner
is specifed, error will be propogated to user.
2016-07-08 17:44:55 +08:00
Glauber Costa
7169b727ea move system tables to its own region
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.

Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-07-05 17:46:28 -04:00
Avi Kivity
76cc6408cd Merge "feature check for seed node" from Asias
""This series implemnts feature check for seed node.
2016-07-05 19:01:01 +03:00
Asias He
6f69963ef9 system_keyspace: Simplify load_host_ids implementation
- Use plain loop instead of do_for_each

- Use row.get_as() instead of row.template get_as()
Message-Id: <3e108d3a6258c0caaf569eb9c79532d9789ea411.1467703722.git.asias@scylladb.com>
2016-07-05 09:47:21 +02:00
Asias He
3f31be58b6 system_keyspace: Simplify load_tokens implemntation
- Use plain loop instead of do_for_each

- Use row.get_as() instead of row.template get_as()
Message-Id: <f959ace4f30078695d383c849ed4520169228f97.1467703722.git.asias@scylladb.com>
2016-07-05 09:47:21 +02:00
Asias He
31df4e5316 system_keyspace: Introduce load_peer_features
To get the peer features stored in the system.peers table.
2016-07-05 10:09:53 +08:00
Avi Kivity
2a46410f4a Change sstable_list from a map to a set
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set.  The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
2016-07-03 10:26:57 +03:00
Avi Kivity
9ac730dcc9 mutation_reader: make restricting_mutation_reader even more restricting
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.

To limit the damage, add two limits on the queue:
 - a timeout, which is equal to the read timeout
 - a queue length limit, which is equal to 2% of the shard memory divided
   by an estimate of the queued request size (1kb)

Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>
2016-06-29 15:17:35 +02:00
Avi Kivity
edeef03b34 db: restrict replica read concurrency
Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard.  This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.

Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled.  This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.

Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads.  Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
2016-06-27 17:17:56 +03:00
Duarte Nunes
dfbf68cd24 commitlog: Define operator<< in namespace db
Needed for compilation with gcc6.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1466852874-8448-1-git-send-email-duarte@scylladb.com>
2016-06-26 10:05:28 +03:00
Duarte Nunes
aacc7193f2 schema: Replace keyspace's schema_ptr on CF update
This patch ensures we replace the schema_ptr held by its respective
keyspace object when a column family is being updated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20160623085710.26168-1-duarte@scylladb.com>
2016-06-23 11:11:52 +02:00
Calle Wilund
2b812a392a commitlog_replayer: Fix calculation of global min pos per shard
If a CF does not have any sstables at all, we should treat it
as having a replay position of zero. However, since we also
must deal with potential re-sharding, we cannot just set
shard->uuid->zero initially, because we don't know what shards
existed.

Go through all CF:s post map-reduce, and for every shard where
a CF does not have an RP-mapping (no sstables found), set the
global min pos (for shard) to zero.

Fixes #1372

Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>
2016-06-21 10:05:05 +03:00
Calle Wilund
88ffe60138 batchlog_manager: Change replay mutation CL to ALL
Try to emulate the origin behaviour for batch reply. They use an
explicit write handler, combinging

1.) Hinting to all known dead endpoints
2.) Sending to all persumed live, requiring ack from all
3.) Hinting to endpoint to which send failed.

We don't have hints, so try to work around by doing send with
cl=ALL, and if send fails (wholly or partially), retain the
batch in the log.

This is still slight behavioural difference, and we also risk
filling up the batch log in extreme cases. (Though probably not
in any real environment).

Refs #1222

Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>
2016-06-21 09:41:09 +03:00
Calle Wilund
7cdea1b889 commitlog: Use flush queue for write/flush ordering, improve batch
Using an ordering mechanism better than rw-locks for write/flush
means we can wait for pending write in batch mode, and coalesce
data from more than one mutation into a chunk.

It also means we can wait for a specific read+flush pair (based on
file position).

Downside is that we will not do parallel writes in batch mode (unless
we run out of buffer), which might underutilize the disk bandwidth.

Upside is that running in batch mode (i.e. per-write consistency)
now has way better bandwidth, and also, at least with high mutation
rate, better average latency.

Message-Id: <1465990064-2258-1-git-send-email-calle@scylladb.com>
2016-06-20 13:09:16 +03:00
Tomasz Grabiec
75f899cc93 lsa: Make reclamation step configurable via config 2016-06-14 15:13:15 +02:00
Gleb Natapov
9635e67a84 config: adjust boost::program_options validator to work with db::string_map
Fixes #1320

Message-Id: <20160607064511.GX9939@scylladb.com>
2016-06-07 10:42:27 +03:00
Gleb Natapov
9132604a90 config: make string_map to be a unique type instead of an alias to unordered_map
Config provides operators << >> for string_map which makes it impossible
to have generic stream operators for unordered_map. Fix it by making
string_map a separate type and not just an alias.

Message-Id: <20160602102642.GJ9939@scylladb.com>
2016-06-02 13:28:40 +03:00
Gleb Natapov
1476becd28 config: put operators << and >> into db namespace
Makes ADL find the right version of the overload.

Message-Id: <20160601130952.GJ2381@scylladb.com>
2016-06-02 10:45:01 +03:00
Avi Kivity
b50cb3eca8 config: rename compact_on_idle
compact_on_idle will lead users to thinking we're talking about sstable
compaction, not log-structured-allocator compaction.

Rename the variable to reduce the probability of confusion.
Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>
2016-05-30 08:39:13 +03:00
Piotr Jastrzebski
136b8148d2 Use idle CPU to compact LSA memory
Register an idle CPU handler that compacts a single segment
every time there's nothing better to execute on CPU.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c26aa608a1e0752fb9e6db1833ef3ba1de95f161.1464169748.git.piotr@scylladb.com>
2016-05-26 12:43:53 +03:00
Tomasz Grabiec
f0c2b1d161 config: Fix typos
Message-Id: <1464201938-4778-1-git-send-email-tgrabiec@scylladb.com>
2016-05-26 08:19:57 +03:00
Calle Wilund
8cdf4e37fb schema_tables: Fix merge_keyspaces to handle alter keyspace
Must keep "altered" alive into the call chain.
2016-05-10 14:32:51 +00:00