Compare commits

...

151 Commits

Author SHA1 Message Date
Pekka Enberg
737a9019a4 dist/docker: Add missing "hostname" package
The Fedora base image has changed so we need to add "hostname" that's
used by the Docker-specific launch script to our image.

Fixes Scylla startup.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-15 13:44:38 +03:00
Takuya ASADA
eb1924a4e4 dist: fix file not found error on centos_dep/build_dependency.sh
We don't have boost.diff, and we don't need it, so return to using rpmbuild --rebuild.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-14 14:12:46 +03:00
Pekka Enberg
2ed34b0e96 Merge seastar upstream
* seastar 1995676...78e3924 (5):
  > fix output stream batching
  > rpc: server connection shutdown fix
  > doc: add Seastar tutorial
  > resource: increase default reserve memory
  > http client: moved http_response_parser.rl from apps/seawreck into http directory

Adjust transport/server.cc for the demise of output_stream::batch_flush()
2015-10-12 16:12:35 +03:00
Glauber Costa
12ac9a1fbd do not calculate truncation time independently
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.

The main problem here is that this time is used as the snapshot name when auto
snapshots are enabled. Prior to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.

But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.

Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-09 17:39:47 +03:00
Glauber Costa
4460f243a3 snapshots: fix json type
We are generating a general object ({}), whereas Cassandra 2.1.x generates an
array ([]). Let's do that as well to avoid surprising parsers.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 19:06:38 +03:00
Glauber Costa
55a5877d82 snapshots: handle jsondir creation for empty files case
We still need to write a manifest when there are no files in the snapshot.
But because we never reach the touch_directory part of the sstables loop
in that case, nobody would have created jsondir.

Since all the file handling is now done in the seal_snapshot phase, we should
just make sure the directory exists before initiating any other disk activity.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 19:06:33 +03:00
Glauber Costa
b03a474ca6 snapshots: get rid of empty tables optimization
We currently have one optimization that returns early when there are no tables
to be snapshotted.

However, because of the way we are writing the manifest now, this will cause
the shard that happens to have tables to be waiting forever. So we should get
rid of it. All shards need to pass through the synchronization point.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 19:06:28 +03:00
Glauber Costa
9ec7b9a213 snapshots: don't hash pending snapshots by snapshot name
If we are hashing more than one CF, the snapshots themselves will all have the
same name. This will cause the files from one of them to spill into the other
when writing the manifest.

The proper hash is the jsondir: that one is unique per manifest file.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 19:06:22 +03:00
Pekka Enberg
fc4e167ffd release: prepare for 0.10 2015-10-08 14:44:36 +03:00
Pekka Enberg
c7c6ebb813 Merge "Switch to gcc-5 on CentOS rpm, with some related fixes" from Takuya 2015-10-08 14:43:29 +03:00
Pekka Enberg
95012793e5 db/schema_tables: Wire up drop keyspace notifications
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-08 13:10:48 +02:00
Pekka Enberg
87d45cc58a service/migration_manager: Simplify notify_drop_keyspace()
There's no need to pass keyspace_metadata to notify_drop_keyspace()
because all we are interested in is the name. The keyspace has been
dropped so there's not much we could do with its metadata either.

Simplifies the next patch that wires up drop keyspace notification.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-08 13:10:48 +02:00
Avi Kivity
e5dca96af3 Merge "snapshots: fix global generation of the manifest file" from Glauber
"snapshotting the files themselves is easy: if more than one CF happens to link
an SSTable twice, all but one will fail, and we will end up with one copy.

The problem for us is that the snapshot procedure is supposed to leave a
manifest file inside its directory.  So if we just call snapshot() from
multiple shards, only the last one will succeed, writing its own SSTables to
the manifest leaving all other shards' SSTables unaccounted for.

Moreover, for things like drop table, the operation should only proceed when
the snapshot is complete. That includes the manifest file being correctly
written, and for this reason we need to wait for all shards to finish their
snapshotting before we can move on."
2015-10-08 13:08:31 +03:00
Glauber Costa
725ae03772 snapshots: write the manifest file from a single shard
Currently, the snapshot code has all shards writing the manifest file. This is
wrong, because all writes before the last one are overwritten. This patch
fixes it by synchronizing all writes and leaving just one of the shards with
the task of closing the manifest.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 11:36:36 +02:00
Glauber Costa
25d24222fe snapshots: separate manifest creation
The way manifest creation is currently done is wrong: instead of a final
manifest containing all files from all shards, the current code writes a
manifest containing just the files from the shard that happens to be the
unlucky loser of the writing race.

In preparation to fix that, separate the manifest creation code from the rest.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 11:36:36 +02:00
Glauber Costa
abc63e4669 snapshots: clarify and fix sync behavior
We do need to sync jsondir after we write the manifest file (previously done,
but with a question), and before we start it (not previously done) to guarantee
that the manifest file won't reference any file that is not visible yet.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 11:36:36 +02:00
Glauber Costa
ca4babdb57 snapshots: close file after flush
We are currently flushing it, but not closing it.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 11:36:36 +02:00
Avi Kivity
bd7bf3ea84 Merge seastar upstream
* seastar 6664a83...1995676 (1):
  > introduce sync_directory
2015-10-08 12:29:17 +03:00
Takuya ASADA
3a77188d47 dist: move yum install first
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-08 06:29:06 +09:00
Takuya ASADA
10dd1781be dist: Stop specify required libraries manually, use AutoReqProv
We don't need to specify dynamically linked libraries here; AutoReqProv detects them.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-08 06:15:46 +09:00
Takuya ASADA
137fe19ea9 dist: support glob pattern on do_install()
Currently do_install() does not function correctly when passing a glob pattern while the packages are already installed.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-08 06:15:46 +09:00
Takuya ASADA
9cb2776606 dist: switch CentOS gcc to 5.1.1-4
Since we don't want to require users to upgrade libstdc++, we will link libstdc++ statically, using ./configure.py --static-stdc++

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-08 06:15:46 +09:00
Takuya ASADA
0e13757d92 configure.py: add --static-stdc++ to link libstdc++ statically
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-08 06:15:46 +09:00
Avi Kivity
bffcbc592f Merge seastar upstream
* seastar fba8ac6...6664a83 (2):
  > do not add failed stream to output stream poller.
  > rpc: wait for all data to be sent before closing
2015-10-07 18:33:04 +03:00
Calle Wilund
42c086a5cd batchlog_manager: Fixup includes + exception handling
* Fix exception handling in batch loop (report + still re-arm)
* Cleanup seastar include reference style
2015-10-07 17:06:34 +03:00
Avi Kivity
19f36cd3cc Merge "Batchlog manager - run loop on only one shard" from Calle
"* Runs the batchlog loop on only main cpu, but round-robins the actual work
   to each available shard in round-robin fashion.
 * Use gate to guard work loop instead of semaphore (better shutdown,
   eventually)
 * Actually _start_ the batch loop (not done previously)
 * Rename logger + add cpu# hint"

Fixes #424
2015-10-07 16:52:10 +03:00
Calle Wilund
a4c14d3d1d batchlog_manager: Add hint of which cpu timer callback is running on 2015-10-07 14:57:55 +02:00
Calle Wilund
6416c62d39 main: Actually start the batchlog_manager service loop
Was not invoked previously.
2015-10-07 14:30:09 +02:00
Calle Wilund
b46496da34 batchlog_manager: Rename logger
* More useful/referrable on command line (--log*)
* Matches class name (though not origin)
2015-10-07 14:30:09 +02:00
Calle Wilund
6f94a3bdad batchlog_manager: Use gate instead of semaphore
Since that exists now.
2015-10-07 14:30:09 +02:00
Calle Wilund
874da0eb67 batchlog_manager: Run timer loop on only one shard
Since replay is a "node global" operation, we should not attempt to
do it in parallel on each shard. It will just overlap/interfere.
We could just run this on cpu 0, but since this _could_ be a
lengthy operation, each timer callback is round-robined across shards just in case...
2015-10-07 14:30:09 +02:00
Avi Kivity
a151268bfe Merge 2015-10-07 14:35:02 +03:00
Calle Wilund
246e8e24f2 replay_position: Make <= comparator simpler and cleaner 2015-10-07 14:34:22 +03:00
Avi Kivity
eccbf85e9d Merge "Truncation records per shard"
Fixes  #423

"Changes the "truncated_at" blob contents of system.local table. It now stores
N replay_positions, where N == # shards.

The system.local table schema remains unchanged, and older truncation data
is accepted, though it will for obvious reasons still be insufficient.

Since the data is opaque to the running instance, blob compatibility with
origin should be irrelevant (and we're not really that now anyway).

Note that technically, changing the shard count in between runs could make
us hold on to RP data "longer than required", but this is
a.) Insignificant data sizes
b.) Data that is valid exactly once: When restarting a failed node and
    replaying. The "shards" only refer to "last run", and after that we don't
    care. At worst, we can get less than fresh data (not all shards manage
    to save truncation records before crash).

It is worth noting (and I've done so in the code) that the system.local table
+ sharding cause some rather silly inefficiencies, since for this (and others)
we store a value for each shard, each save of which causes a global flush of
the systable, in turn delegated to all cores. So the op is N^2 in "db complexity".
At some point we should maybe consider if operations like "drop table" and
"truncate" should not be done on shard level, but on machine level, so it can
coordinate itself. But otoh, it is rare and not _very_ expensive either."
2015-10-07 14:33:22 +03:00
Avi Kivity
c48a826c65 db: fix string type incorrectly unvalidated
We call the conversion function that expects a NUL terminated string,
but provide a string view, which is not NUL terminated.

Fix by using the begin/end variant, which doesn't require a NUL terminator.

Fixes #437.
2015-10-07 12:22:01 +02:00
Calle Wilund
a66c22f1ec commitlog_replayer: Acquire truncation RP:s per replayed shard
I.e. get them in bulk and fill in for all shards
2015-10-07 09:00:22 +02:00
Calle Wilund
17bd18b59c commitlog_replayer: Add logging message for exceptions in multi-file recover 2015-10-07 08:59:54 +02:00
Calle Wilund
3f1fa77979 commitlog_replayer: Fix broken comparison
A commitlog entry should be ignored if its position is <= highest recorded
position, not <.
2015-10-07 08:59:53 +02:00
Calle Wilund
271eb3ba02 replay_position: Add <= comparator 2015-10-07 08:59:53 +02:00
Calle Wilund
6b0ab79ecb system_keyspace: Keep per-shard truncation records
Fixes  #423
* CF ID now maps to a truncation record comprised of a set of 
  per-shard RP:s and a high-mark timestamp
* Retrieving RP:s is done in "bulk"
* Truncation time is calculated as max of all shards.

This version of the patch will accept "old" truncation data, though the 
result of applying it will most likely not be correct (just one shard)

Record is still kept as a blob, "new" format is indicated by 
record size.
2015-10-07 08:59:52 +02:00
Calle Wilund
199b72c6f3 commitlog: fix broken reader "offset" handling + ensure exceptions propagate
Must ensure we still find a chunk/entry boundary even when run
with a start offset, since file navigation is chunk based.
This was not observed as broken previously because
1.) We did not run with offsets
2.) The exception never reached the caller.

Also make the reader silently ignore empty files.
2015-10-07 08:54:49 +02:00
Calle Wilund
024041c752 commitlog: make log message slightly more informative/correct 2015-10-07 08:54:49 +02:00
Calle Wilund
f7151cac61 cql3::untyped_result_set: Allow "get_map" to be explicit about result type

Allow providing both hash/equal etc for resulting map, as well
as explicit data_types for the deserialization.
Also allow direct extraction of kv-pairs to iterator, for more advanced
unpacking.
2015-10-07 08:54:49 +02:00
Avi Kivity
29106ab802 Merge seastar upstream
* seastar 4a3071e...fba8ac6 (3):
  > stream.hh: Fix broken "set_exception".
  > configure.py: fix use of "echo -e"
  > deannoyify touch_directory
2015-10-07 09:44:58 +03:00
Gleb Natapov
358d93112f replace ad-hoc cql connection polling with new batch_flush() output stream API 2015-10-06 19:22:23 +03:00
Pekka Enberg
b40999b504 database: Fix drop_column_family() UUID lookup race
Remove the about to be dropped CF from the UUID lookup table before
truncating and stopping it. This closes a race window where new
operations based on the UUID might be initiated after truncate
completes.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 17:10:17 +02:00
Pekka Enberg
5878f62b18 db/schema_tables: Clean up indentation
Almost the whole file is (accidentally) indented four spaces to the
right for no reason. Fix that up because it's annoying as hell.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 17:09:27 +02:00
Pekka Enberg
1f9e769dd3 db/schema_tables: Remove obsolete ifdef'd code
Remove ifdef'd code that we won't be converting to C++ because of design
differences.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 17:09:27 +02:00
Avi Kivity
75dd123d01 Merge "CQL DROP KEYSPACE support" from Pekka
"This patch series implements support for CQL DROP KEYSPACE and makes the
test_keyspace CQL test in dtest pass:

  [penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.keyspace_test
  keyspace_test (cql_tests.TestCQL) ... ok

  ----------------------------------------------------------------------
  Ran 1 test in 12.166s

  OK

  [penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
  table_test (cql_tests.TestCQL) ... ok

  ----------------------------------------------------------------------
  Ran 1 test in 23.841s

  OK"
2015-10-06 15:19:33 +03:00
Pekka Enberg
da7b741f64 service/migration_manager: Implement announce_keyspace_drop()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 14:53:35 +03:00
Pekka Enberg
6e304cd58c db/schema_tables: Fix merge_keyspaces() to actually drop keyspaces
When we query schema keyspaces after we have applied a delete mutation,
the dropped keyspace does not exist in the "after" result set. Fix the
merge_keyspaces() algorithm to take that into account.

Makes merge_keyspaces() really call to database::drop_keyspace() when a
keyspace is dropped.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 14:53:35 +03:00
Pekka Enberg
5d9d1e28cb db/schema_tables: Implement make_drop_keyspace_mutations()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 14:53:35 +03:00
Pekka Enberg
9576b0ef23 database: Implement drop_keyspace()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 14:53:35 +03:00
Pekka Enberg
b66154e43a cql3: Fix capture-by-reference in drop_keyspace_statement
We need to capture the "is_local_only" boolean by value because it's an
argument to the function. Fixes an annoying bug where we failed to update
the schema version because we accidentally passed "true".

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 14:53:35 +03:00
Tomasz Grabiec
bc1d159c1b Merge branch 'penberg/cql-drop-table/v3' from seastar-dev.git
From Pekka:

This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:

  [penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
  table_test (cql_tests.TestCQL) ... ok

  ----------------------------------------------------------------------
  Ran 1 test in 23.841s

  OK
2015-10-06 13:39:25 +02:00
Shlomi Livne
f347a024a1 update boost testsuite output
We are generating huge output xml files with the --jenkins flag. Update
the printout from all to test_suite, to reduce size and include the
info we need.

Error messages / failed assertions are still printed

Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com>
2015-10-06 14:27:19 +03:00
Pekka Enberg
042e9252d5 service/migration_manager: Implement announce_column_family_drop()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
633279415d db/schema_tables: Fix merge_tables() to actually drop tables
When we query schema tables after we have applied a delete mutation, the
dropped table does not exist in the "after" result set. Fix the
merge_tables() algorithm to take that into account.

Makes merge_tables() really call to database::drop_column_family() when
a table is dropped.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
82d20dba65 db/schema_tables: Implement make_drop_table_mutations()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
b89b70daa8 db/schema_tables: Wire up drop column notifications
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
b1e6ab144a database: Implement drop_column_family()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
afbb2f865d database: Add keyspace_metadata::remove_column_family() helper
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
0651ab6901 database: Futurize drop_column_family() function
Futurize drop_column_family() so that we can call truncate() from it.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
85ffaa5330 database: Add truncate() variant that does not look up CF by name
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:54 +03:00
Pekka Enberg
baff913d91 cql3: Fix capture-by-reference in drop_table_statement
We need to capture the "is_local_only" boolean by value because it's an
argument to the function. Fixes an annoying bug where we failed to update
the schema version because we accidentally passed "true". Spotted by ASan.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:54 +03:00
Avi Kivity
2f56f72466 Merge seastar upstream
* seastar 0c402e1...4a3071e (3):
  > output stream flush batching
  > Update README with compilation issues - OOM
  > resource: fix memory leak in resource::allocate()
2015-10-06 11:17:29 +03:00
Avi Kivity
e342914265 Merge "Fixes for incremental backup" from Glauber
"The control over backups is now moved to the CF itself, from the storage
service. That allows us to simplify the code (while making it correct) for cases
in which the storage service is not available.

With this change, we no longer need the database config passed down to the
storage_service object. So that patch is reverted."
2015-10-05 14:36:26 +03:00
Glauber Costa
651937becf Revert "pass db::config to storage service as well"
This reverts commit c2b981cd82.
2015-10-05 13:21:33 +02:00
Glauber Costa
639ba2b99d incremental backups: move control to the CF level
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.

The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:16:11 +02:00
Glauber Costa
b619d244e8 storage_service: public access to the database object
Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:15:27 +02:00
Glauber Costa
69d1358627 database: non const versions of get_keyspaces/column_families
We will need to change some properties of the keyspace / cf, so we need an
accessor that is not marked as const.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:13:37 +02:00
Pekka Enberg
b74a9d99d5 db/schema_tables: Fix UTF-8 serialization
Use the utf8_type to serialize strings instead of using to_bytes().

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-05 09:26:15 +02:00
Avi Kivity
21bb5ea5c7 Add .gitattributes file to classify C++ source
With this, diffs become more pleasant to read, as access specifiers
no longer find their way into the hunk header.
2015-10-05 08:51:51 +02:00
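As a sketch, such a .gitattributes can map C++ extensions to git's built-in `cpp` diff driver, whose funcname patterns pick the enclosing function for the `@@` hunk header instead of the nearest `public:`/`private:` line (the actual file contents in the commit may differ):

```
*.cc diff=cpp
*.hh diff=cpp
```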
Avi Kivity
7c23ec49ae Merge "Support incremental backups" from Glauber
"Generate backups when the configuration file indicates we should;
toggle behavior on/off through the API."
2015-10-04 13:49:20 +03:00
Avi Kivity
4ca4efbc9c Merge "Add cfstats support" from Amnon
"This series adds the functionality that is required so the nodetool cfstats
would work.

It complete the histogram support for read and write latency and add stub for
functionality that is needed but is not supported yet."
2015-10-04 13:38:30 +03:00
Amnon Heiman
a04401d5a4 API: Column family to return sum of the total read and write
This adds the implementation that returns the estimated total latency
of the read and of the write.

First, the method that sums the counts was renamed to get_cf_stats_count
and a method named get_cf_stats_sum was added to sum the estimated
latencies.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Amnon Heiman
4145a48335 API: return estimated sum from histogram
The histograms that are used typically only sample the data, so to get
an estimate of the actual sum, we use the estimated mean multiplied by
the actual count.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Amnon Heiman
7d3a0f0789 histogram: initialization and mean calculation
This patch contains two changes to the histogram implementation. It uses
a simpler method to calculate the estimated mean (simply divide the
estimated sum by the number of samples), and, to make sure that there
will always be values in the histogram, it starts by taking a sample
(when there are no samples) and then uses the mask to decide whether to
sample or not.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Amnon Heiman
1f16765140 column family: setting the read and write latency histogram
This patch contains the following changes: in the definition of the read
and write latency histograms, it removes the mask value so that the
default value will be used.

To support gathering the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.

The read statistic is sample based and should have no real impact on
performance; if there is an impact, we can always change it in the
future to a lower sampling rate.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Amnon Heiman
8e9729371f API: Add functionality to column family to support nodetool cfstats
This adds the API definitions with stub implementations that make
nodetool cfstats run.

After this patch the nodetool cfstats command works, but with stub
implementations.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Amnon Heiman
2b59bb2d2b API: storage proxy definition cas read and write
This patch adds some missing definitions for CAS read and write; the API
definition is for completeness only, as we do not support CAS yet.

It also changes a part of the definition from storage_service to
storage_proxy, as it should be.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Glauber Costa
700b37635f api: incremental backups
GET and POST methods are implemented in the API.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-02 18:23:27 +02:00
Glauber Costa
d4edb82c9e column_family: incremental backups
Only tables that arise from flushes are backed up. Compacted tables are not.
Therefore, the place for that to happen is right after our flush.

Note that due to our sharded architecture, it is possible that in the face of a
value change some shards will backup sstables while others won't.

This is, in theory, possible to mitigate through a rwlock. However, this
doesn't differ from the situation where all tables are coming from a single
shard and the toggle happens in the middle of them.

The code as is guarantees that we'll never partially backup a single sstable,
so that is enough of a guarantee.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-02 18:23:27 +02:00
Glauber Costa
a5fb145084 storage_service: incremental backups
Query and set the state of incremental backups. The initial value comes from
the configuration file through the local db reference. Later on, it can be
changed through the interface.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-02 18:23:27 +02:00
Glauber Costa
c2b981cd82 pass db::config to storage service as well
We would like to access configuration, but don't want to poke other services
in order to do so.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-02 18:23:26 +02:00
Takuya ASADA
05db25bfc5 dist: fix yum install error on CentOS dependency rpms
We are not able to install the boost packages one by one, so install them all at once.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-01 18:09:39 +03:00
Pekka Enberg
5e27d476d4 database: Improve exception error messages
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-01 11:23:46 +03:00
Avi Kivity
01e01a2bd7 main: fix typo in 'check_direct_io_support()' 2015-09-30 20:16:07 +03:00
Glauber Costa
73a1fab273 sanity check the filesystem
For a lot of users, running Scylla on filesystems that do not support
O_DIRECT is quite frustrating: it will fail at some point, with random
error messages that aren't really meaningful.

We should try to check for that, and fail with a good error message. Also, since our
performance claims won't really hold in anything other than XFS, we should warn the user
if that is not the setup we encounter.

Fixes #409

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-30 17:58:27 +03:00
Avi Kivity
3c72977291 Merge seastar upstream
* seastar 432a771...0c402e1 (1):
  > function to check for direct_io capabilities
2015-09-30 17:58:05 +03:00
Gleb Natapov
2998b891f3 storage_proxy: fix crash during background read repair
Lazy digest calculation code introduced a bug in background read repair.
The problem is that digest_read_resolver::resolve() destroys one data
result (it is moved to a caller to be sent as a reply), so during
background digest match there is no value to calculate a digest from.
Copying the data to the caller would be the most elegant solution, but
also the slowest one, so let's just handle the case where only one
target is queried and skip the digest calculation there, since we know
digest_match() will do nothing.
2015-09-30 16:35:12 +03:00
Avi Kivity
c7be4911f3 Merge seastar upstream
* seastar 283901a...432a771 (2):
  > provide an api to query the filesystem type
  > modernize reactor::stop()
2015-09-30 16:34:30 +03:00
Avi Kivity
c84ed13dac Merge "CQL truncate" from Calle
"First iteration implementation of CQL truncate, transposed from
Origin.

Includes a workable impl. of snapshots, since that is sort of an integral
part of the origin code.

Note: This is still incomplete/incorrect in two ways:

1.) Since we have no way to ensure sstables are finished writing,
    the flush-snapshots are unreliable. Needs basically the same
    fix as correct commitlog management, namely flush queues and
    the ability to wait-force "active" flushes to finish before
    continuing.
2.) System table truncation record saving does not handle sharding.
    This means we basically save the "last" RP from any of the shards
    truncating, and consequently if we have a crash and do commitlog
    replay, we could resurrect truncated data.
    Fix is to have truncation records be per cf+shard just as RP:s
    are per shard.

However, since some people are waiting for at least a semi-functional
truncate, I'm submitting this without fixing the two above issues,
since they can be dealt with in subsequent patches."
2015-09-30 15:46:37 +03:00
Shlomi Livne
9e86b6273c dist: add dependency on xfsprogs
Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com>
2015-09-30 12:23:59 +03:00
Avi Kivity
9c5a36efd0 logalloc: fix segment free in debug mode
Must match allocation function.
2015-09-30 09:45:25 +02:00
Avi Kivity
489f737351 build: fix order of libasan on link command
gcc 5.1 requires libasan to be first, humor it.
2015-09-30 09:45:25 +02:00
Pekka Enberg
3cb60556e9 cql3: Implement truncate_statement::execute()
Implement the execute() function by using the underlying
truncate_blocking() API from storage proxy.
2015-09-30 09:09:43 +02:00
Pekka Enberg
455d382bac cql3: Implement check_access() and validate() for truncate_statement
Implement the check_access() and validate() functions as stubs to avoid
tripping over the unimplemented exception from cqlsh.
2015-09-30 09:09:43 +02:00
Pekka Enberg
f1fa2ec758 cql3: Move truncate statement implementation to source file
Clean up the truncate_statement class before we start modifying it.
Saves us from recompilation pain.
2015-09-30 09:09:43 +02:00
Calle Wilund
d0864be20f storage_proxy: Implement "truncate_blocking" 2015-09-30 09:09:43 +02:00
Calle Wilund
a8742cd199 to_string: Add << operator for std::set 2015-09-30 09:09:43 +02:00
Calle Wilund
80ade2e2d3 storage_proxy: Add TRUNCATE verb handler 2015-09-30 09:09:43 +02:00
Calle Wilund
37131fcc05 messaging_service: TRUNCATE verb methods 2015-09-30 09:09:42 +02:00
Calle Wilund
68b8d8f48c database: Implement "truncate" for column family
Including snapshotting.
2015-09-30 09:09:42 +02:00
Pekka Enberg
ac4007153d row_cache: Implement clear() helper
We need to clear the row cache for column family truncate operation.
2015-09-30 09:09:42 +02:00
Calle Wilund
7856d7fe02 config: Change "auto_snapshot" to "used" 2015-09-30 09:09:42 +02:00
Calle Wilund
56228fba24 column family: Add "snapshot" operation. 2015-09-30 09:09:42 +02:00
Calle Wilund
428557a66d sstables: add "create_links" method
Adds hard links in the requested directory to all components of the sstable.
Used for snapshotting.
2015-09-30 09:09:42 +02:00
Calle Wilund
cdaafb0505 sstables: Expose directory, max age and all active files 2015-09-30 09:09:42 +02:00
Calle Wilund
c141e15a4a column family: Add "run_with_compaction_disabled" helper
À la origin. Could just as well have been RAII.
2015-09-30 09:09:41 +02:00
Calle Wilund
b3c95ce42d system_keyspace: Change truncation record method to use context qp
Align with the rest of the file (for better or worse). This allows calls from
an entity without a query_processor handy (i.e. storage_proxy).

Added a "minimal" setup method for the "global" state, to facilitate
tests. Doing a full setup either in cql_test_env or after it is created
breaks badly (not sure why), so this is a quick workaround.

Updated the current two users (batchlog_manager and commitlog_replayer)
callsites to conform.
2015-09-30 09:09:41 +02:00
Calle Wilund
3abd8b38b6 query_context: Expose query_processor (local) 2015-09-30 09:09:41 +02:00
Calle Wilund
0444029a16 cql_test_env: expose distributed db and query processor 2015-09-30 09:09:41 +02:00
Calle Wilund
713860602b cql3/maps.cc : implement maps::marker::bind
Needed for system table (truncation pos)
2015-09-30 09:09:41 +02:00
Avi Kivity
4ae6c8c875 Merge seastar upstream
* seastar 5fe596a...283901a (10):
  > Add filesystem "link_file"
  > scripts: posix_net_conf.sh: take care of the case with more than 32 CPUs
  > posix: Add explanatory string to throw_system_error_on()
  > tests: fix memory leak in thread_test test_thread_1
  > tests: fix memory leak in timertest cancellation test
  > README: require xfsprogs-devel
  > file: query dma alignment from OS
  > file: separate disk and memory dma alignment
  > scripts: posix_net_conf.sh: Exclude CPU0 in RDS config for EN.
  > README: Add missing Ubuntu 14.04 dependencies to README.md
2015-09-29 19:06:37 +03:00
Avi Kivity
0ec0e32014 Merge "commitlog: preallocate segments" from Calle
"Modified version of the initial patch (which was reverted), further
reducing the possible delay states in CL allocation and segment management."
2015-09-29 17:02:54 +03:00
Glauber Costa
91408d3cbc warn users on 100 % CPU usage
Although it is technically a seastar problem, most complaints about it
come from the Scylla side. I prefer to keep the message here so we can reference
a Scylla issue.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-29 16:40:24 +03:00
Avi Kivity
c30feb714c Merge "gossip heart_beat_version + time to wait for seed" from Asias 2015-09-29 15:49:24 +03:00
Avi Kivity
d5d271a45b Merge "improvement on code that handles temporary TOC" from Raphael 2015-09-29 11:47:35 +03:00
Tomasz Grabiec
4863d16fb6 Merge tag 'bloom-memory' from git@github.com:glommer/scylla.git
From Glauber:

We will export the total memory used by the filter as its "off heap"
size for the purposes of statistics.
2015-09-29 10:05:17 +02:00
Glauber Costa
22294dd6a0 do not re-read sstable components after write
When we write an SSTable, all its components are already in memory. load() is
too big of a hammer.

We still want to keep the write operation separated from the preparation to
read, but in the case of a newly written SSTable, all we need to do is to open
the index and data file.

Fixes #300

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-29 10:00:26 +02:00
Calle Wilund
0ec50e8d36 create_index_statement: Bugfix. Inverted logic in full index check 2015-09-29 09:47:15 +02:00
Avi Kivity
c52d9f8da4 db: fix circular reference collection_type_impl <-> cql3_type
cql3_type is a simple wrapper around data_type.  But some data_types
(collection_type_impl) contain a cql3_type as a cache to avoid recomputing
it, resulting in a circular reference.  This leaks memory when as_cql3_type()
is called.

Fix by using a static hash table for the cache.
2015-09-29 08:38:15 +02:00
Raphael S. Carvalho
549a9e2ed4 sstable: rename file_existence to file_exists
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-28 15:49:57 -03:00
Raphael S. Carvalho
59506eba24 sstable: close file returned by open_file_dma in file_existence
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-28 15:49:47 -03:00
Raphael S. Carvalho
da316c982d sstable: fsync cf dir before removing temporary toc
That's important to guarantee that all other components were
deleted before deleting TemporaryTOC.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-28 15:49:24 -03:00
Glauber Costa
5dd0953bb9 api: implement filter off heap memory calculation
For us, everything is "off heap", so this will just be the total amount of
memory used by the filters.

Fixes #339

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-28 16:44:26 +02:00
Glauber Costa
8b3a6f19a1 sstables: export filter memory size
Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-28 16:43:20 +02:00
Glauber Costa
cbedd9ee41 Export Bloom Filter's memory size
Do it so we can estimate how much memory is being used by the filters. This
estimate is not 100% accurate: the implementation of the bloom_filter class
uses a thread-local variable that is common to all filters, which we won't
include in the estimate. Aside from that, it should be quite accurate.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-28 16:43:06 +02:00
Pekka Enberg
f43f0d6f04 keys: Add compound_wrapper::from_singular()
Clean up the code by adding a from_singular() helper function to the compound
wrapper and using it.
2015-09-28 16:29:44 +02:00
Asias He
3a36ec33db gossip: Wait longer for seed node during boot up
When starting a cluster on AWS, the seed node might become ready after the
non-seed nodes are ready to contact it. Wait longer for the seed node to make
the boot-up process more robust.
2015-09-28 11:11:11 +08:00
Asias He
e43b2c2c89 api: Add get_current_heart_beat_version
curl -X GET "http://127.0.0.1:10000/gossiper/heart_beat_version/127.0.0.2"

This is useful to check if the gossip code is still running when
debugging.

Now we can get both the generation version and heart beat version of a
node.

curl -X GET "http://127.0.0.1:10000/gossiper/generation_number/127.0.0.2"
2015-09-28 09:38:33 +08:00
Asias He
817c138034 gossip: Add get_current_heart_beat_version interface
HTTP API will use it.
2015-09-28 09:38:22 +08:00
Gleb Natapov
f0c3caa43b Do not ignore exceptions during compaction
As the comment explains, if both the read and the write fail, the write
exception is ignored. To fix that, create one exception that contains both errors.
2015-09-27 14:16:35 +03:00
Gleb Natapov
d53be0a91e Move operator<< for std::exception_ptr to the std namespace and make it take const
If the operator is not in the std namespace, it cannot be found in non-global
contexts.
2015-09-27 14:16:35 +03:00
Takuya ASADA
b2630db514 dist: remove rpm dependency to libvirt
This was only for testing virtio mode; since we don't officially recommend using virtio mode, we should drop it.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-09-25 17:14:37 -07:00
Gleb Natapov
140641689b messaging: do not use rpc client in error state
Using an rpc client in an error state will result in message loss. Try to
reconnect instead.
2015-09-24 17:50:51 +02:00
Raphael S. Carvalho
ce855577b6 add compaction stats to collectd
With this change, we can see the number and length of compaction
activity per shard from collectd.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-24 16:51:11 +02:00
Asias He
e77cea382e rpm: Improve rpm build scripts
This makes it possible to build in a CentOS container.
2015-09-23 21:42:51 -07:00
Tomasz Grabiec
1b1cfd2cbf tests: Introduce tests/memory_footprint_test 2015-09-23 21:27:44 -07:00
Tomasz Grabiec
d033cdcefe db: Move "Populating Keyspace ..." message from WARN to INFO level
WARN level is for messages which should draw log reader's attention,
journalctl highlights them for example. Populating of keyspace is a
fairly normal thing, so it should be logged on lower level.
2015-09-23 15:28:44 +02:00
Avi Kivity
b3b6fc2f39 Merge branch 'branch-0.9' 2015-09-23 06:27:55 -07:00
Avi Kivity
36c3439fae Merge branch 'branch-0.9' 2015-09-22 05:33:32 -07:00
Calle Wilund
4941d91063 Commitlog: add some more verbosity 2015-09-22 12:57:33 +02:00
Avi Kivity
99e19a9f73 Merge branch 'branch-0.9' 2015-09-21 17:03:47 -07:00
Avi Kivity
37344c19e7 version: update for next cycle 2015-09-22 00:41:57 +03:00
Avi Kivity
eca0228f15 Merge branch 'branch-0.9' 2015-09-22 00:40:52 +03:00
Tomasz Grabiec
83dbea5b3a Merge branch 'branch-0.9'
tests: Fix row_cache_alloc_stress
    dist: remove conflicts with cassandra21 to allow side by side rpm installation
    dist: update ami base image id to one that supports enhanced networking
2015-09-21 23:06:35 +02:00
Tomasz Grabiec
a588c72ef2 Merge branch 'branch-0.9'
Changes:

    transport: fix poller removal
    dist: Add CentOS packaging
    row_cache: Use allocating_section in row_cache::populate()
2015-09-21 20:28:21 +02:00
Calle Wilund
a10745cf0e Commitlog: Delay timer by period/ncpus for each cpu
To avoid having all shards doing sync at the same time.
2015-09-21 13:30:35 +02:00
Calle Wilund
dcabf8c1d2 Commitlog: Pre-allocate "reserve" segments
Refs #356

Pre-allocates N segments from the timer task. N is "adaptive" in that it is
increased (to a max) every time segment acquisition is forced to allocate
a new one instead of picking from the pre-alloc (reserve) list. The idea is that it is
easier to adapt how many segments we consume per timer quantum than the timer
quantum itself.

Also does the disk pressure check and flush from the timer task now. Note that the
check is still only done at most once per new segment.

Also some logging cleanup to make behaviour easier to trace.

Reserve segments start out at zero length, and are still deleted when finished.
This is because otherwise we'd still have to clear the file to be able to
properly parse it later (given that it can be a "half" file due to power fail
etc). This might need revisiting as well.

With this patch, there should be no case (except flush starvation) where
"add_mutation" actually waits for a (potentially) blocking op (disk).
Note that since the amount of reserve is increased as needed, there will
be occasional cases where a new segment is created in the alloc path
until the system finds equilibrium. But this should only be during a brief
warmup.

v2: Fixed timestamp not being reset on reserve acquire
2015-09-21 13:04:39 +02:00
78 changed files with 2601 additions and 1285 deletions

2
.gitattributes vendored Normal file
View File

@@ -0,0 +1,2 @@
*.cc diff=cpp
*.hh diff=cpp

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=0.9
VERSION=0.10
if test -f version
then

View File

@@ -950,6 +950,33 @@
}
]
},
{
"path":"/column_family/metrics/estimated_row_count/{name}",
"operations":[
{
"method":"GET",
"summary":"Get estimated row count",
"type":"array",
"items":{
"type":"long"
},
"nickname":"get_estimated_row_count",
"produces":[
"application/json"
],
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/column_family/metrics/estimated_column_count_histogram/{name}",
"operations":[

View File

@@ -93,6 +93,30 @@
}
]
},
{
"path":"/gossiper/heart_beat_version/{addr}",
"operations":[
{
"method":"GET",
"summary":"Get heart beat version for a node",
"type":"int",
"nickname":"get_current_heart_beat_version",
"produces":[
"application/json"
],
"parameters":[
{
"name":"addr",
"description":"The endpoint address",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/gossiper/assassinate/{addr}",
"operations":[
@@ -126,4 +150,4 @@
]
}
]
}
}

View File

@@ -546,7 +546,58 @@
]
},
{
"path": "/storage_service/metrics/cas_write/unfinished_commit",
"path":"/storage_proxy/metrics/cas_read/unavailables",
"operations":[
{
"method":"GET",
"summary":"Get CAS read unavailables",
"type":"long",
"nickname":"get_cas_read_unavailables",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_proxy/metrics/cas_write/timeouts",
"operations":[
{
"method":"GET",
"summary":"Get CAS write timeout",
"type":"long",
"nickname":"get_cas_write_timeouts",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_proxy/metrics/cas_write/unavailables",
"operations":[
{
"method":"GET",
"summary":"Get CAS write unavailables",
"type":"long",
"nickname":"get_cas_write_unavailables",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path": "/storage_proxy/metrics/cas_write/unfinished_commit",
"operations": [
{
"method": "GET",
@@ -561,7 +612,7 @@
]
},
{
"path": "/storage_service/metrics/cas_write/contention",
"path": "/storage_proxy/metrics/cas_write/contention",
"operations": [
{
"method": "GET",
@@ -576,7 +627,7 @@
]
},
{
"path": "/storage_service/metrics/cas_write/condition_not_met",
"path": "/storage_proxy/metrics/cas_write/condition_not_met",
"operations": [
{
"method": "GET",
@@ -591,7 +642,7 @@
]
},
{
"path": "/storage_service/metrics/cas_read/unfinished_commit",
"path": "/storage_proxy/metrics/cas_read/unfinished_commit",
"operations": [
{
"method": "GET",
@@ -606,7 +657,7 @@
]
},
{
"path": "/storage_service/metrics/cas_read/contention",
"path": "/storage_proxy/metrics/cas_read/contention",
"operations": [
{
"method": "GET",
@@ -621,7 +672,7 @@
]
},
{
"path": "/storage_service/metrics/cas_read/condition_not_met",
"path": "/storage_proxy/metrics/cas_read/condition_not_met",
"operations": [
{
"method": "GET",
@@ -636,7 +687,7 @@
]
},
{
"path": "/storage_service/metrics/read/timeouts",
"path": "/storage_proxy/metrics/read/timeouts",
"operations": [
{
"method": "GET",
@@ -651,7 +702,7 @@
]
},
{
"path": "/storage_service/metrics/read/unavailables",
"path": "/storage_proxy/metrics/read/unavailables",
"operations": [
{
"method": "GET",
@@ -696,7 +747,7 @@
]
},
{
"path": "/storage_service/metrics/range/timeouts",
"path": "/storage_proxy/metrics/range/timeouts",
"operations": [
{
"method": "GET",
@@ -711,7 +762,7 @@
]
},
{
"path": "/storage_service/metrics/range/unavailables",
"path": "/storage_proxy/metrics/range/unavailables",
"operations": [
{
"method": "GET",
@@ -726,7 +777,7 @@
]
},
{
"path": "/storage_service/metrics/write/timeouts",
"path": "/storage_proxy/metrics/write/timeouts",
"operations": [
{
"method": "GET",
@@ -741,7 +792,7 @@
]
},
{
"path": "/storage_service/metrics/write/unavailables",
"path": "/storage_proxy/metrics/write/unavailables",
"operations": [
{
"method": "GET",

View File

@@ -144,7 +144,9 @@ inline httpd::utils_json::histogram add_histogram(httpd::utils_json::histogram r
res.max = val.max;
}
double ncount = res.count() + val.count;
res.sum = res.sum() + val.sum;
// To get an estimated sum we take the estimated mean
// and multiply it by the true count
res.sum = res.sum() + val.mean * val.count;
double a = res.count()/ncount;
double b = val.count/ncount;

View File

@@ -76,14 +76,30 @@ future<json::json_return_type> get_cf_stats(http_context& ctx,
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_stats_sum(http_context& ctx, const sstring& name,
static future<json::json_return_type> get_cf_stats_count(http_context& ctx, const sstring& name,
utils::ihistogram column_family::stats::*f) {
return map_reduce_cf(ctx, name, 0, [f](const column_family& cf) {
return (cf.get_stats().*f).count;
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_stats_sum(http_context& ctx,
static future<json::json_return_type> get_cf_stats_sum(http_context& ctx, const sstring& name,
utils::ihistogram column_family::stats::*f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([uuid, f](database& db) {
// Histograms information is sample of the actual load
// so to get an estimation of sum, we multiply the mean
// with count. The information is gather in nano second,
// but reported in micro
column_family& cf = db.find_column_family(uuid);
return ((cf.get_stats().*f).count/1000.0) * (cf.get_stats().*f).mean;
}, 0.0, std::plus<double>()).then([](double res) {
return make_ready_future<json::json_return_type>((int64_t)res);
});
}
static future<json::json_return_type> get_cf_stats_count(http_context& ctx,
utils::ihistogram column_family::stats::*f) {
return map_reduce_cf(ctx, 0, [f](const column_family& cf) {
return (cf.get_stats().*f).count;
@@ -285,19 +301,26 @@ void set_column_family(http_context& ctx, routes& r) {
sstables::merge, utils_json::estimated_histogram());
});
cf::get_estimated_column_count_histogram.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
std::vector<double> res;
return make_ready_future<json::json_return_type>(res);
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {
uint64_t res = 0;
for (auto i: *cf.get_sstables() ) {
res += i.second->get_stats_metadata().estimated_row_size.count();
}
return res;
},
std::plus<uint64_t>());
});
cf::get_compression_ratio.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
return make_ready_future<json::json_return_type>(0);
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {
sstables::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
res.merge(i.second->get_stats_metadata().estimated_column_count);
}
return res;
},
sstables::merge, utils_json::estimated_histogram());
});
cf::get_all_compression_ratio.set(r, [] (std::unique_ptr<request> req) {
@@ -315,25 +338,33 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_read.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx,req->param["name"] ,&column_family::stats::reads);
return get_cf_stats_count(ctx,req->param["name"] ,&column_family::stats::reads);
});
cf::get_all_read.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx, &column_family::stats::reads);
return get_cf_stats_count(ctx, &column_family::stats::reads);
});
cf::get_write.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx, req->param["name"] ,&column_family::stats::writes);
return get_cf_stats_count(ctx, req->param["name"] ,&column_family::stats::writes);
});
cf::get_all_write.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx, &column_family::stats::writes);
return get_cf_stats_count(ctx, &column_family::stats::writes);
});
cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, req->param["name"], &column_family::stats::reads);
});
cf::get_read_latency.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx,req->param["name"] ,&column_family::stats::reads);
});
cf::get_write_latency.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_stats_sum(ctx, req->param["name"] ,&column_family::stats::writes);
});
cf::get_all_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return get_cf_histogram(ctx, &column_family::stats::writes);
});
@@ -490,20 +521,20 @@ void set_column_family(http_context& ctx, routes& r) {
}, std::plus<uint64_t>());
});
cf::get_bloom_filter_off_heap_memory_used.set(r, [] (std::unique_ptr<request> req) {
//TBD
// FIXME
// We are missing the off heap memory calculation
// Return 0 is the wrong value. It's a work around
// until the memory calculation will be available
//auto id = get_uuid(req->param["name"], ctx.db.local());
return make_ready_future<json::json_return_type>(0);
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst.second->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(0);
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst.second->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_index_summary_off_heap_memory_used.set(r, [] (std::unique_ptr<request> req) {
@@ -560,7 +591,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_true_snapshots_size.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
// FIXME
//auto id = get_uuid(req->param["name"], ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -641,17 +672,30 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_tombstone_scanned_histogram.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
// FIXME
//auto id = get_uuid(req->param["name"], ctx.db.local());
std::vector<double> res;
httpd::utils_json::histogram res;
res.count = 0;
res.mean = 0;
res.max = 0;
res.min = 0;
res.sum = 0;
res.variance = 0;
return make_ready_future<json::json_return_type>(res);
});
cf::get_live_scanned_histogram.set(r, [] (std::unique_ptr<request> req) {
//TBD
unimplemented();
// FIXME
//auto id = get_uuid(req->param["name"], ctx.db.local());
std::vector<double> res;
//std::vector<double> res;
httpd::utils_json::histogram res;
res.count = 0;
res.mean = 0;
res.max = 0;
res.min = 0;
res.sum = 0;
res.variance = 0;
return make_ready_future<json::json_return_type>(res);
});
@@ -741,8 +785,9 @@ void set_column_family(http_context& ctx, routes& r) {
// TBD
// FIXME
// This is a workaround, until there will be an API to return the count
// per level, we return 0
return make_ready_future<json::json_return_type>(0);
// per level, we return an empty array
vector<uint64_t> res;
return make_ready_future<json::json_return_type>(res);
});
}
}

View File

@@ -53,6 +53,13 @@ void set_gossiper(http_context& ctx, routes& r) {
});
});
httpd::gossiper_json::get_current_heart_beat_version.set(r, [](std::unique_ptr<request> req) {
gms::inet_address ep(req->param["addr"]);
return gms::get_current_heart_beat_version(ep).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
});
httpd::gossiper_json::assassinate_endpoint.set(r, [](std::unique_ptr<request> req) {
if (req->get_query_param("unsafe") != "True") {
return gms::assassinate_endpoint(req->param["addr"]).then([] {

View File

@@ -219,7 +219,29 @@ void set_storage_proxy(http_context& ctx, routes& r) {
sp::get_cas_read_timeouts.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
// FIXME
// cas is not supported yet, so just return 0
return make_ready_future<json::json_return_type>(0);
});
sp::get_cas_read_unavailables.set(r, [](std::unique_ptr<request> req) {
//TBD
// FIXME
// cas is not supported yet, so just return 0
return make_ready_future<json::json_return_type>(0);
});
sp::get_cas_write_timeouts.set(r, [](std::unique_ptr<request> req) {
//TBD
// FIXME
// cas is not supported yet, so just return 0
return make_ready_future<json::json_return_type>(0);
});
sp::get_cas_write_unavailables.set(r, [](std::unique_ptr<request> req) {
//TBD
// FIXME
// cas is not supported yet, so just return 0
return make_ready_future<json::json_return_type>(0);
});

View File

@@ -513,16 +513,38 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::is_incremental_backups_enabled.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(false);
// If this is issued in parallel with an ongoing change, we may see values not agreeing.
// Reissuing is asking for trouble, so we will just return true upon seeing any true value.
return service::get_local_storage_service().db().map_reduce(adder<bool>(), [] (database& db) {
for (auto& pair: db.get_keyspaces()) {
auto& ks = pair.second;
if (ks.incremental_backups_enabled()) {
return true;
}
}
return false;
}).then([] (bool val) {
return make_ready_future<json::json_return_type>(val);
});
});
ss::set_incremental_backups_enabled.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
auto value = req->get_query_param("value");
return make_ready_future<json::json_return_type>(json_void());
auto val_str = req->get_query_param("value");
bool value = (val_str == "True") || (val_str == "true") || (val_str == "1");
return service::get_local_storage_service().db().invoke_on_all([value] (database& db) {
// Change both KS and CF, so they are in sync
for (auto& pair: db.get_keyspaces()) {
auto& ks = pair.second;
ks.set_incremental_backups(value);
}
for (auto& pair: db.get_column_families()) {
auto cf_ptr = pair.second;
cf_ptr->set_incremental_backups(value);
}
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::rebuild.set(r, [](std::unique_ptr<request> req) {

View File

@@ -120,7 +120,7 @@ class Antlr3Grammar(object):
modes = {
'debug': {
'sanitize': '-fsanitize=address -fsanitize=leak -fsanitize=undefined',
'sanitize_libs': '-lubsan -lasan',
'sanitize_libs': '-lasan -lubsan',
'opt': '-O0 -DDEBUG -DDEBUG_SHARED_PTR -DDEFAULT_ALLOCATOR',
'libs': '',
},
@@ -147,6 +147,7 @@ urchin_tests = [
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/perf/perf_simple_query',
'tests/memory_footprint',
'tests/perf/perf_sstable',
'tests/cql_query_test',
'tests/storage_proxy_test',
@@ -213,6 +214,8 @@ arg_parser.add_argument('--dpdk-target', action = 'store', dest = 'dpdk_target',
help = 'Path to DPDK SDK target location (e.g. <DPDK SDK dir>/x86_64-native-linuxapp-gcc)')
arg_parser.add_argument('--debuginfo', action = 'store', dest = 'debuginfo', type = int, default = 1,
help = 'Enable(1)/disable(0)compiler debug information generation')
arg_parser.add_argument('--static-stdc++', dest = 'staticcxx', action = 'store_true',
help = 'Link libgcc and libstdc++ statically')
add_tristate(arg_parser, name = 'hwloc', dest = 'hwloc', help = 'hwloc support')
add_tristate(arg_parser, name = 'xen', dest = 'xen', help = 'Xen support')
args = arg_parser.parse_args()
@@ -277,6 +280,7 @@ urchin_core = (['database.cc',
'cql3/statements/index_prop_defs.cc',
'cql3/statements/index_target.cc',
'cql3/statements/create_index_statement.cc',
'cql3/statements/truncate_statement.cc',
'cql3/update_parameters.cc',
'cql3/ut_name.cc',
'thrift/handler.cc',
@@ -379,6 +383,7 @@ urchin_core = (['database.cc',
'partition_slice_builder.cc',
'init.cc',
'repair/repair.cc',
'exceptions/exceptions.cc',
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -448,6 +453,7 @@ tests_not_using_seastar_test_framework = set([
'tests/perf/perf_cql_parser',
'tests/message',
'tests/perf/perf_simple_query',
'tests/memory_footprint',
'tests/test-serialization',
'tests/gossip',
'tests/compound_test',
@@ -548,6 +554,8 @@ for mode in build_modes:
cfg = dict([line.strip().split(': ', 1)
for line in open('seastar/' + pc[mode])
if ': ' in line])
if args.staticcxx:
cfg['Libs'] = cfg['Libs'].replace('-lstdc++ ', '')
modes[mode]['seastar_cflags'] = cfg['Cflags']
modes[mode]['seastar_libs'] = cfg['Libs']
@@ -556,6 +564,9 @@ seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_re
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem'
user_cflags = args.user_cflags
user_ldflags = args.user_ldflags
if args.staticcxx:
user_ldflags += " -static-libgcc -static-libstdc++"
outdir = 'build'
buildfile = 'build.ninja'
@@ -597,11 +608,11 @@ with open(buildfile, 'w') as f:
description = CXX $out
depfile = $out.d
rule link.{mode}
command = $cxx $cxxflags_{mode} $ldflags {seastar_libs} -o $out $in $libs $libs_{mode}
command = $cxx $cxxflags_{mode} {sanitize_libs} $ldflags {seastar_libs} -o $out $in $libs $libs_{mode}
description = LINK $out
pool = link_pool
rule link_stripped.{mode}
command = $cxx $cxxflags_{mode} -s $ldflags {seastar_libs} -o $out $in $libs $libs_{mode}
command = $cxx $cxxflags_{mode} -s {sanitize_libs} $ldflags {seastar_libs} -o $out $in $libs $libs_{mode}
description = LINK (stripped) $out
pool = link_pool
rule ar.{mode}

View File

@@ -255,7 +255,14 @@ maps::delayed_value::bind(const query_options& options) {
::shared_ptr<terminal>
maps::marker::bind(const query_options& options) {
throw std::runtime_error("");
auto val = options.get_value_at(_bind_index);
return val ?
::make_shared<maps::value>(
maps::value::from_serialized(*val,
static_pointer_cast<const map_type_impl>(
_receiver->type),
options.get_serialization_format())) :
nullptr;
}
void

View File

@@ -97,7 +97,7 @@ cql3::statements::create_index_statement::validate(distributed<service::storage_
}
} else {
// validateNotFullIndex
if (target->type != index_target::target_type::full) {
if (target->type == index_target::target_type::full) {
throw exceptions::invalid_request_exception("full() indexes can only be created on frozen collections");
}
// validateIsValuesIndexIfTargetColumnNotCollection

View File

@@ -77,7 +77,7 @@ const sstring& drop_keyspace_statement::keyspace() const
future<bool> drop_keyspace_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
return make_ready_future<>().then([&] {
return make_ready_future<>().then([this, is_local_only] {
return service::get_local_migration_manager().announce_keyspace_drop(_keyspace, is_local_only);
}).then_wrapped([this] (auto&& f) {
try {

View File

@@ -76,7 +76,7 @@ void drop_table_statement::validate(distributed<service::storage_proxy>&, const
future<bool> drop_table_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
return make_ready_future<>().then([&] {
return make_ready_future<>().then([this, is_local_only] {
return service::get_local_migration_manager().announce_column_family_drop(keyspace(), column_family(), is_local_only);
}).then_wrapped([this] (auto&& f) {
try {

View File

@@ -0,0 +1,105 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2014 Cloudius Systems
*
* Modified by Cloudius Systems
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "cql3/statements/truncate_statement.hh"
#include "cql3/cql_statement.hh"
#include <experimental/optional>
namespace cql3 {
namespace statements {
truncate_statement::truncate_statement(::shared_ptr<cf_name> name)
: cf_statement{std::move(name)}
{
}
uint32_t truncate_statement::get_bound_terms()
{
return 0;
}
::shared_ptr<parsed_statement::prepared> truncate_statement::prepare(database& db)
{
return ::make_shared<parsed_statement::prepared>(this->shared_from_this());
}
bool truncate_statement::uses_function(const sstring& ks_name, const sstring& function_name) const
{
return parsed_statement::uses_function(ks_name, function_name);
}
void truncate_statement::check_access(const service::client_state& state)
{
warn(unimplemented::cause::AUTH);
#if 0
state.hasColumnFamilyAccess(keyspace(), columnFamily(), Permission.MODIFY);
#endif
}
void truncate_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state)
{
warn(unimplemented::cause::VALIDATION);
#if 0
ThriftValidation.validateColumnFamily(keyspace(), columnFamily());
#endif
}
future<::shared_ptr<transport::messages::result_message>>
truncate_statement::execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options)
{
return service::get_local_storage_proxy().truncate_blocking(keyspace(), column_family()).handle_exception([](auto ep) {
throw exceptions::truncate_exception(ep);
}).then([] {
return ::shared_ptr<transport::messages::result_message>{};
});
}
future<::shared_ptr<transport::messages::result_message>>
truncate_statement::execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options)
{
throw std::runtime_error("unsupported operation");
}
}
}


@@ -52,64 +52,23 @@ namespace statements {
class truncate_statement : public cf_statement, public cql_statement, public ::enable_shared_from_this<truncate_statement> {
public:
truncate_statement(::shared_ptr<cf_name> name)
: cf_statement{std::move(name)}
{ }
truncate_statement(::shared_ptr<cf_name> name);
virtual uint32_t get_bound_terms() override {
return 0;
}
virtual uint32_t get_bound_terms() override;
virtual ::shared_ptr<prepared> prepare(database& db) override {
return ::make_shared<parsed_statement::prepared>(this->shared_from_this());
}
virtual ::shared_ptr<prepared> prepare(database& db) override;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return parsed_statement::uses_function(ks_name, function_name);
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
virtual void check_access(const service::client_state& state) override {
throw std::runtime_error("not implemented");
#if 0
state.hasColumnFamilyAccess(keyspace(), columnFamily(), Permission.MODIFY);
#endif
}
virtual void check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override {
throw std::runtime_error("not implemented");
#if 0
ThriftValidation.validateColumnFamily(keyspace(), columnFamily());
#endif
}
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual future<::shared_ptr<transport::messages::result_message>>
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override {
throw std::runtime_error("not implemented");
#if 0
try
{
StorageProxy.truncateBlocking(keyspace(), columnFamily());
}
catch (UnavailableException e)
{
throw new TruncateException(e);
}
catch (TimeoutException e)
{
throw new TruncateException(e);
}
catch (IOException e)
{
throw new TruncateException(e);
}
return null;
#endif
}
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override;
virtual future<::shared_ptr<transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override {
throw std::runtime_error("unsupported operation");
}
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override;
};
}


@@ -70,17 +70,25 @@ public:
}
// this could maybe be done as an overload of get_as (or something), but that just
// muddles things for no real gain. Let user (us) attempt to know what he is doing instead.
template<typename K, typename V>
std::unordered_map<K, V> get_map(const sstring& name) const {
auto vec = boost::any_cast<const map_type_impl::native_type&>(
map_type_impl::get_instance(data_type_for<K>(),
data_type_for<V>(), false)->deserialize(
get_blob(name)));
std::unordered_map<K, V> res;
std::transform(vec.begin(), vec.end(),
std::inserter(res, res.end()), [](auto& p) {
template<typename K, typename V, typename Iter>
void get_map_data(const sstring& name, Iter out, data_type keytype =
data_type_for<K>(), data_type valtype =
data_type_for<V>()) const {
auto vec =
boost::any_cast<const map_type_impl::native_type&>(
map_type_impl::get_instance(keytype, valtype, false)->deserialize(
get_blob(name)));
std::transform(vec.begin(), vec.end(), out,
[](auto& p) {
return std::pair<K, V>(boost::any_cast<const K&>(p.first), boost::any_cast<const V&>(p.second));
});
}
template<typename K, typename V, typename ... Rest>
std::unordered_map<K, V, Rest...> get_map(const sstring& name,
data_type keytype = data_type_for<K>(), data_type valtype =
data_type_for<V>()) const {
std::unordered_map<K, V, Rest...> res;
get_map_data<K, V>(name, std::inserter(res, res.end()), keytype, valtype);
return res;
}
const std::vector<::shared_ptr<column_specification>>& get_columns() const {


@@ -52,6 +52,8 @@
#include "service/storage_service.hh"
#include "mutation_query.hh"
#include "sstable_mutation_readers.hh"
#include <core/fstream.hh>
#include "utils/latency.hh"
using namespace std::chrono_literals;
@@ -496,7 +498,26 @@ column_family::try_flush_memtable_to_sstable(lw_shared_ptr<memtable> old) {
newtab->set_unshared();
dblog.debug("Flushing to {}", newtab->get_filename());
return newtab->write_components(*old).then([this, newtab, old] {
return newtab->load();
return newtab->open_data().then([this, newtab] {
// Note that due to our sharded architecture, it is possible that
// in the face of a value change some shards will backup sstables
// while others won't.
//
// This is, in theory, possible to mitigate through a rwlock.
// However, this doesn't differ from the situation where all tables
// are coming from a single shard and the toggle happens in the
// middle of them.
//
// The code as is guarantees that we'll never partially backup a
// single sstable, so that is enough of a guarantee.
if (!incremental_backups_enabled()) {
return make_ready_future<>();
}
auto dir = newtab->get_dir() + "/backups/";
return touch_directory(dir).then([dir, newtab] {
return newtab->create_links(dir);
});
});
}).then([this, old, newtab] {
dblog.debug("Flushing done");
// We must add sstable before we call update_cache(), because
@@ -641,8 +662,10 @@ void column_family::start_compaction() {
void column_family::trigger_compaction() {
// Submitting compaction job to compaction manager.
_stats.pending_compactions++;
_compaction_manager.submit(this);
if (!_compaction_disabled) {
_stats.pending_compactions++;
_compaction_manager.submit(this);
}
}
future<> column_family::run_compaction() {
@@ -811,7 +834,7 @@ future<> database::populate_keyspace(sstring datadir, sstring ks_name) {
if (i == _keyspaces.end()) {
dblog.warn("Skipping undefined keyspace: {}", ks_name);
} else {
dblog.warn("Populating Keyspace {}", ks_name);
dblog.info("Populating Keyspace {}", ks_name);
return lister::scan_dir(ksdir, directory_entry_type::directory, [this, ksdir, ks_name] (directory_entry de) {
auto comps = parse_fname(de.name);
if (comps.size() < 2) {
@@ -965,7 +988,7 @@ void database::update_keyspace(const sstring& name) {
}
void database::drop_keyspace(const sstring& name) {
throw std::runtime_error("not implemented");
_keyspaces.erase(name);
}
void database::add_column_family(schema_ptr schema, column_family::config cfg) {
@@ -1005,8 +1028,18 @@ future<> database::update_column_family(const sstring& ks_name, const sstring& c
});
}
void database::drop_column_family(const sstring& ks_name, const sstring& cf_name) {
throw std::runtime_error("not implemented");
future<> database::drop_column_family(db_clock::time_point dropped_at, const sstring& ks_name, const sstring& cf_name) {
auto uuid = find_uuid(ks_name, cf_name);
auto& ks = find_keyspace(ks_name);
auto cf = _column_families.at(uuid);
_column_families.erase(uuid);
ks.metadata()->remove_column_family(cf->schema());
_ks_cf_to_uuid.erase(std::make_pair(ks_name, cf_name));
return truncate(dropped_at, ks, *cf).then([this, cf] {
return cf->stop();
}).then([this, cf] {
return make_ready_future<>();
});
}
const utils::UUID& database::find_uuid(const sstring& ks, const sstring& cf) const throw (std::out_of_range) {
@@ -1051,7 +1084,7 @@ column_family& database::find_column_family(const sstring& ks_name, const sstrin
try {
return find_column_family(find_uuid(ks_name, cf_name));
} catch (...) {
std::throw_with_nested(no_such_column_family("Can't find a column family " + cf_name + " in a keyspace " + ks_name));
std::throw_with_nested(no_such_column_family(ks_name, cf_name));
}
}
@@ -1059,7 +1092,7 @@ const column_family& database::find_column_family(const sstring& ks_name, const
try {
return find_column_family(find_uuid(ks_name, cf_name));
} catch (...) {
std::throw_with_nested(no_such_column_family("Can't find a column family " + cf_name + " in a keyspace " + ks_name));
std::throw_with_nested(no_such_column_family(ks_name, cf_name));
}
}
@@ -1067,7 +1100,7 @@ column_family& database::find_column_family(const utils::UUID& uuid) throw (no_s
try {
return *_column_families.at(uuid);
} catch (...) {
std::throw_with_nested(no_such_column_family("Can't find a column family with UUID: " + uuid.to_sstring()));
std::throw_with_nested(no_such_column_family(uuid));
}
}
@@ -1075,7 +1108,7 @@ const column_family& database::find_column_family(const utils::UUID& uuid) const
try {
return *_column_families.at(uuid);
} catch (...) {
std::throw_with_nested(no_such_column_family("Can't find a column family with UUID: " + uuid.to_sstring()));
std::throw_with_nested(no_such_column_family(uuid));
}
}
@@ -1116,6 +1149,7 @@ keyspace::make_column_family_config(const schema& s) const {
cfg.enable_cache = _config.enable_cache;
cfg.max_memtable_size = _config.max_memtable_size;
cfg.dirty_memory_region_group = _config.dirty_memory_region_group;
cfg.enable_incremental_backups = _config.enable_incremental_backups;
return cfg;
}
@@ -1132,6 +1166,21 @@ keyspace::make_directory_for_column_family(const sstring& name, utils::UUID uuid
return make_directory(column_family_directory(name, uuid));
}
no_such_keyspace::no_such_keyspace(const sstring& ks_name)
: runtime_error{sprint("Can't find a keyspace %s", ks_name)}
{
}
no_such_column_family::no_such_column_family(const utils::UUID& uuid)
: runtime_error{sprint("Can't find a column family with UUID %s", uuid)}
{
}
no_such_column_family::no_such_column_family(const sstring& ks_name, const sstring& cf_name)
: runtime_error{sprint("Can't find a column family %s in keyspace %s", cf_name, ks_name)}
{
}
column_family& database::find_column_family(const schema_ptr& schema) throw (no_such_column_family) {
return find_column_family(schema->id());
}
@@ -1151,7 +1200,7 @@ schema_ptr database::find_schema(const sstring& ks_name, const sstring& cf_name)
try {
return find_schema(find_uuid(ks_name, cf_name));
} catch (std::out_of_range&) {
std::throw_with_nested(no_such_column_family(ks_name + ":" + cf_name));
std::throw_with_nested(no_such_column_family(ks_name, cf_name));
}
}
@@ -1261,7 +1310,9 @@ struct query_state {
};
future<lw_shared_ptr<query::result>>
column_family::query(const query::read_command& cmd, const std::vector<query::partition_range>& partition_ranges) const {
column_family::query(const query::read_command& cmd, const std::vector<query::partition_range>& partition_ranges) {
utils::latency_counter lc;
_stats.reads.set_latency(lc);
return do_with(query_state(cmd, partition_ranges), [this] (query_state& qs) {
return do_until(std::bind(&query_state::done, &qs), [this, &qs] {
auto&& range = *qs.current_partition_range++;
@@ -1284,6 +1335,8 @@ column_family::query(const query::read_command& cmd, const std::vector<query::pa
return make_ready_future<lw_shared_ptr<query::result>>(
make_lw_shared<query::result>(qs.builder.build()));
});
}).finally([lc, this]() mutable {
_stats.reads.mark(lc);
});
}
@@ -1438,6 +1491,7 @@ database::make_keyspace_config(const keyspace_metadata& ksm) {
cfg.max_memtable_size = std::numeric_limits<size_t>::max();
}
cfg.dirty_memory_region_group = &_dirty_memory_region_group;
cfg.enable_incremental_backups = _cfg->incremental_backups();
return cfg;
}
@@ -1511,6 +1565,47 @@ future<> database::flush_all_memtables() {
});
}
future<> database::truncate(db_clock::time_point truncated_at, sstring ksname, sstring cfname) {
auto& ks = find_keyspace(ksname);
auto& cf = find_column_family(ksname, cfname);
return truncate(truncated_at, ks, cf);
}
future<> database::truncate(db_clock::time_point truncated_at, const keyspace& ks, column_family& cf)
{
const auto durable = ks.metadata()->durable_writes();
const auto auto_snapshot = get_config().auto_snapshot();
future<> f = make_ready_future<>();
if (durable || auto_snapshot) {
// TODO:
// this is not really a guarantee at all that we've actually
// gotten all things to disk. Again, need queue-ish or something.
f = cf.flush();
} else {
cf.clear();
}
return cf.run_with_compaction_disabled([truncated_at, f = std::move(f), &cf, auto_snapshot, cfname = cf.schema()->cf_name()]() mutable {
return f.then([truncated_at, &cf, auto_snapshot, cfname = std::move(cfname)] {
dblog.debug("Discarding sstable data for truncated CF + indexes");
// TODO: notify truncation
future<> f = make_ready_future<>();
if (auto_snapshot) {
auto name = sprint("%d-%s", truncated_at.time_since_epoch().count(), cfname);
f = cf.snapshot(name);
}
return f.then([&cf, truncated_at] {
return cf.discard_sstables(truncated_at).then([&cf, truncated_at](db::replay_position rp) {
// TODO: indexes.
return db::system_keyspace::save_truncation_record(cf, truncated_at, rp);
});
});
});
});
}
const sstring& database::get_snitch_name() const {
return _cfg->endpoint_snitch();
}
@@ -1529,6 +1624,131 @@ future<> update_schema_version_and_announce(distributed<service::storage_proxy>&
});
}
// Snapshots: snapshotting the files themselves is easy: if more than one CF
// happens to link an SSTable twice, all but one will fail, and we will end up
// with one copy.
//
// The problem for us is that the snapshot procedure is supposed to leave a
// manifest file inside its directory. So if we just call snapshot() from
// multiple shards, only the last one will succeed, writing its own SSTables to
// the manifest leaving all other shards' SSTables unaccounted for.
//
// Moreover, for things like drop table, the operation should only proceed when the
// snapshot is complete. That includes the manifest file being correctly written,
// and for this reason we need to wait for all shards to finish their snapshotting
// before we can move on.
//
// To know which files we must account for in the manifest, we will keep an
// SSTable set. Theoretically, we could just rescan the snapshot directory and
// see what's in there. But we would need to wait for all shards to finish
// before we can do that anyway. That is the hard part, and once that is done
// keeping the files set is not really a big deal.
//
// This code assumes that all shards will be snapshotting at the same time. So
// far this is a safe assumption, but if we ever want to take snapshots from a
// group of shards only, this code will have to be updated to account for that.
struct snapshot_manager {
std::unordered_set<sstring> files;
semaphore requests;
semaphore manifest_write;
snapshot_manager() : requests(0), manifest_write(0) {}
};
static thread_local std::unordered_map<sstring, lw_shared_ptr<snapshot_manager>> pending_snapshots;
static future<>
seal_snapshot(sstring jsondir) {
std::ostringstream ss;
int n = 0;
ss << "{" << std::endl << "\t\"files\" : [ ";
for (auto&& rf: pending_snapshots.at(jsondir)->files) {
if (n++ > 0) {
ss << ", ";
}
ss << "\"" << rf << "\"";
}
ss << " ]" << std::endl << "}" << std::endl;
auto json = ss.str();
auto jsonfile = jsondir + "/manifest.json";
dblog.debug("Storing manifest {}", jsonfile);
return recursive_touch_directory(jsondir).then([jsonfile, json = std::move(json)] {
return engine().open_file_dma(jsonfile, open_flags::wo | open_flags::create | open_flags::truncate).then([json](file f) {
return do_with(make_file_output_stream(std::move(f)), [json] (output_stream<char>& out) {
return out.write(json.c_str(), json.size()).then([&out] {
return out.flush();
}).then([&out] {
return out.close();
});
});
});
}).then([jsondir] {
return sync_directory(std::move(jsondir));
}).finally([jsondir] {
pending_snapshots.erase(jsondir);
return make_ready_future<>();
});
}
future<> column_family::snapshot(sstring name) {
return flush().then([this, name = std::move(name)]() {
auto tables = boost::copy_range<std::vector<sstables::shared_sstable>>(*_sstables | boost::adaptors::map_values);
return do_with(std::move(tables), [this, name](std::vector<sstables::shared_sstable> & tables) {
auto jsondir = _config.datadir + "/snapshots/" + name;
return parallel_for_each(tables, [name](sstables::shared_sstable sstable) {
auto dir = sstable->get_dir() + "/snapshots/" + name;
return recursive_touch_directory(dir).then([sstable, dir] {
return sstable->create_links(dir);
});
}).then([jsondir, &tables] {
// This is not just an optimization. If we have no files, jsondir may not have been created,
// and sync_directory would throw.
if (tables.size()) {
return sync_directory(std::move(jsondir));
} else {
return make_ready_future<>();
}
}).then([this, &tables, jsondir] {
auto shard = std::hash<sstring>()(jsondir) % smp::count;
std::unordered_set<sstring> table_names;
for (auto& sst : tables) {
auto f = sst->get_filename();
auto rf = f.substr(sst->get_dir().size() + 1);
table_names.insert(std::move(rf));
}
return smp::submit_to(shard, [requester = engine().cpu_id(), jsondir = std::move(jsondir),
tables = std::move(table_names), datadir = _config.datadir] {
if (pending_snapshots.count(jsondir) == 0) {
pending_snapshots.emplace(jsondir, make_lw_shared<snapshot_manager>());
}
auto snapshot = pending_snapshots.at(jsondir);
for (auto&& sst: tables) {
snapshot->files.insert(std::move(sst));
}
snapshot->requests.signal(1);
auto my_work = make_ready_future<>();
if (requester == engine().cpu_id()) {
my_work = snapshot->requests.wait(smp::count).then([jsondir = std::move(jsondir),
snapshot] () mutable {
return seal_snapshot(jsondir).then([snapshot] {
snapshot->manifest_write.signal(smp::count);
return make_ready_future<>();
});
});
}
return my_work.then([snapshot] {
return snapshot->manifest_write.wait(1);
}).then([snapshot] {});
});
});
});
});
}
future<> column_family::flush() {
// FIXME: this will synchronously wait for this write to finish, but doesn't guarantee
// anything about previous writes.
@@ -1558,6 +1778,40 @@ future<> column_family::flush(const db::replay_position& pos) {
return seal_active_memtable();
}
void column_family::clear() {
_cache.clear();
_memtables->clear();
add_memtable();
}
// NOTE: does not need to be futurized, but might eventually, depending on
// if we implement notifications, whatnot.
future<db::replay_position> column_family::discard_sstables(db_clock::time_point truncated_at) {
assert(_stats.pending_compactions == 0);
db::replay_position rp;
auto gc_trunc = to_gc_clock(truncated_at);
auto pruned = make_lw_shared<sstable_list>();
for (auto&p : *_sstables) {
if (p.second->max_data_age() <= gc_trunc) {
rp = std::max(p.second->get_stats_metadata().position, rp);
p.second->mark_for_deletion();
continue;
}
pruned->emplace(p.first, p.second);
}
_sstables = std::move(pruned);
dblog.debug("cleaning out row cache");
_cache.clear();
return make_ready_future<db::replay_position>(rp);
}
std::ostream& operator<<(std::ostream& os, const user_types_metadata& m) {
os << "org.apache.cassandra.config.UTMetaData@" << &m;
return os;


@@ -106,6 +106,7 @@ public:
bool enable_disk_reads = true;
bool enable_cache = true;
bool enable_commitlog = true;
bool enable_incremental_backups = false;
size_t max_memtable_size = 5'000'000;
logalloc::region_group* dirty_memory_region_group = nullptr;
};
@@ -120,8 +121,8 @@ public:
int64_t live_sstable_count = 0;
/** Estimated number of compactions pending for this column family */
int64_t pending_compactions = 0;
utils::ihistogram reads{256, 100};
utils::ihistogram writes{256, 100};
utils::ihistogram reads{256};
utils::ihistogram writes{256};
sstables::estimated_histogram estimated_read;
sstables::estimated_histogram estimated_write;
};
@@ -143,6 +144,7 @@ private:
compaction_manager& _compaction_manager;
// Whether or not a cf is queued by its compaction manager.
bool _compaction_manager_queued = false;
int _compaction_disabled = 0;
private:
void update_stats_for_new_sstable(uint64_t new_sstable_data_size);
void add_sstable(sstables::sstable&& sstable);
@@ -195,7 +197,7 @@ public:
void apply(const mutation& m, const db::replay_position& = db::replay_position());
// Returns at most "cmd.limit" rows
future<lw_shared_ptr<query::result>> query(const query::read_command& cmd, const std::vector<query::partition_range>& ranges) const;
future<lw_shared_ptr<query::result>> query(const query::read_command& cmd, const std::vector<query::partition_range>& ranges);
future<> populate(sstring datadir);
@@ -203,6 +205,8 @@ public:
future<> stop();
future<> flush();
future<> flush(const db::replay_position&);
void clear(); // discards memtable(s) without flushing them to disk.
future<db::replay_position> discard_sstables(db_clock::time_point);
// FIXME: this is just an example, should be changed to something more
// general. compact_all_sstables() starts a compaction of all sstables.
@@ -212,6 +216,16 @@ public:
// Compact all sstables provided in the vector.
future<> compact_sstables(std::vector<lw_shared_ptr<sstables::sstable>> sstables);
future<> snapshot(sstring name);
const bool incremental_backups_enabled() const {
return _config.enable_incremental_backups;
}
void set_incremental_backups(bool val) {
_config.enable_incremental_backups = val;
}
lw_shared_ptr<sstable_list> get_sstables();
size_t sstables_count();
int64_t get_unleveled_sstables() const;
@@ -236,6 +250,15 @@ public:
return _stats;
}
template<typename Func, typename Result = futurize_t<std::result_of_t<Func()>>>
Result run_with_compaction_disabled(Func && func) {
++_compaction_disabled;
return _compaction_manager.remove(this).then(std::forward<Func>(func)).finally([this] {
if (--_compaction_disabled == 0) {
trigger_compaction();
}
});
}
private:
// One does not need to wait on this future if all we are interested in, is
// initiating the write. The writes initiated here will eventually
@@ -345,6 +368,9 @@ public:
void add_column_family(const schema_ptr& s) {
_cf_meta_data.emplace(s->cf_name(), s);
}
void remove_column_family(const schema_ptr& s) {
_cf_meta_data.erase(s->cf_name());
}
friend std::ostream& operator<<(std::ostream& os, const keyspace_metadata& m);
};
@@ -356,6 +382,7 @@ public:
bool enable_disk_reads = true;
bool enable_disk_writes = true;
bool enable_cache = true;
bool enable_incremental_backups = false;
size_t max_memtable_size = 5'000'000;
logalloc::region_group* dirty_memory_region_group = nullptr;
};
@@ -384,6 +411,14 @@ public:
// FIXME to allow simple registration at boostrap
void set_replication_strategy(std::unique_ptr<locator::abstract_replication_strategy> replication_strategy);
const bool incremental_backups_enabled() const {
return _config.enable_incremental_backups;
}
void set_incremental_backups(bool val) {
_config.enable_incremental_backups = val;
}
const sstring& datadir() const {
return _config.datadir;
}
@@ -393,12 +428,13 @@ private:
class no_such_keyspace : public std::runtime_error {
public:
using runtime_error::runtime_error;
no_such_keyspace(const sstring& ks_name);
};
class no_such_column_family : public std::runtime_error {
public:
using runtime_error::runtime_error;
no_such_column_family(const utils::UUID& uuid);
no_such_column_family(const sstring& ks_name, const sstring& cf_name);
};
// Policy for distributed<database>:
@@ -463,7 +499,7 @@ public:
void add_column_family(schema_ptr schema, column_family::config cfg);
future<> update_column_family(const sstring& ks_name, const sstring& cf_name);
void drop_column_family(const sstring& ks_name, const sstring& cf_name);
future<> drop_column_family(db_clock::time_point changed_at, const sstring& ks_name, const sstring& cf_name);
/* throws std::out_of_range if missing */
const utils::UUID& find_uuid(const sstring& ks, const sstring& cf) const throw (std::out_of_range);
@@ -507,9 +543,19 @@ public:
const std::unordered_map<sstring, keyspace>& get_keyspaces() const {
return _keyspaces;
}
std::unordered_map<sstring, keyspace>& get_keyspaces() {
return _keyspaces;
}
const std::unordered_map<utils::UUID, lw_shared_ptr<column_family>>& get_column_families() const {
return _column_families;
}
std::unordered_map<utils::UUID, lw_shared_ptr<column_family>>& get_column_families() {
return _column_families;
}
const std::unordered_map<std::pair<sstring, sstring>, utils::UUID, utils::tuple_hash>&
get_column_families_mapping() const {
return _ks_cf_to_uuid;
@@ -520,6 +566,9 @@ public:
}
future<> flush_all_memtables();
/** Truncates the given column family */
future<> truncate(db_clock::time_point truncated_at, sstring ksname, sstring cfname);
future<> truncate(db_clock::time_point truncated_at, const keyspace& ks, column_family& cf);
const logalloc::region_group& dirty_memory_region_group() const {
return _dirty_memory_region_group;


@@ -39,8 +39,8 @@
*/
#include <chrono>
#include <core/future-util.hh>
#include <core/do_with.hh>
#include <seastar/core/future-util.hh>
#include <seastar/core/do_with.hh>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/sliced.hpp>
@@ -57,7 +57,7 @@
#include "db/config.hh"
#include "gms/failure_detector.hh"
static logging::logger logger("BatchLog Manager");
static logging::logger logger("batchlog_manager");
const uint32_t db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
@@ -68,21 +68,37 @@ db::batchlog_manager::batchlog_manager(cql3::query_processor& qp)
{}
future<> db::batchlog_manager::start() {
_timer.set_callback(
std::bind(&batchlog_manager::replay_all_failed_batches, this));
_timer.arm(
lowres_clock::now()
+ std::chrono::milliseconds(
service::storage_service::RING_DELAY),
std::experimental::optional<lowres_clock::duration> {
std::chrono::milliseconds(replay_interval) });
// Since replay is a "node global" operation, we should not attempt to
// do it in parallel on each shard. It will just overlap/interfere.
// Could just run this on cpu 0 or so, but since this _could_ be a
// lengthy operation, we'll round-robin it between shards just in case...
if (smp::main_thread()) {
auto cpu = engine().cpu_id();
_timer.set_callback(
[this, cpu]() mutable {
auto dest = (cpu++ % smp::count);
return smp::submit_to(dest, [] {
return get_local_batchlog_manager().replay_all_failed_batches();
}).handle_exception([](auto ep) {
logger.error("Exception in batch replay: {}", ep);
}).finally([this] {
_timer.arm(lowres_clock::now()
+ std::chrono::milliseconds(replay_interval)
);
});
});
_timer.arm(
lowres_clock::now()
+ std::chrono::milliseconds(
service::storage_service::RING_DELAY));
}
return make_ready_future<>();
}
future<> db::batchlog_manager::stop() {
_stop = true;
_timer.cancel();
return _sem.wait(std::chrono::milliseconds(60));
return _gate.close();
}
future<size_t> db::batchlog_manager::count_all_batches() const {
@@ -98,7 +114,7 @@ mutation db::batchlog_manager::get_batch_log_mutation_for(const std::vector<muta
mutation db::batchlog_manager::get_batch_log_mutation_for(const std::vector<mutation>& mutations, const utils::UUID& id, int32_t version, db_clock::time_point now) {
auto schema = _qp.db().local().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto key = partition_key::from_exploded(*schema, {uuid_type->decompose(id)});
auto key = partition_key::from_singular(*schema, id);
auto timestamp = db_clock::now_in_usecs();
auto data = [this, &mutations] {
std::vector<frozen_mutation> fm(mutations.begin(), mutations.end());
@@ -164,7 +180,7 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
}
auto& fm = fms->front();
auto mid = fm.column_family_id();
return system_keyspace::get_truncated_at(_qp, mid).then([this, &fm, written_at, mutations](db_clock::time_point t) {
return system_keyspace::get_truncated_at(mid).then([this, &fm, written_at, mutations](db_clock::time_point t) {
auto schema = _qp.db().local().find_schema(fm.column_family_id());
if (written_at > t) {
auto schema = _qp.db().local().find_schema(fm.column_family_id());
@@ -206,7 +222,7 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
}).then([this, id] {
// delete batch
auto schema = _qp.db().local().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto key = partition_key::from_exploded(*schema, {uuid_type->decompose(id)});
auto key = partition_key::from_singular(*schema, id);
mutation m(key, schema);
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, {}, tombstone(now, gc_clock::now()));
@@ -214,8 +230,8 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
});
};
return _sem.wait().then([this, batch = std::move(batch)] {
logger.debug("Started replayAllFailedBatches");
return seastar::with_gate(_gate, [this, batch = std::move(batch)] {
logger.debug("Started replayAllFailedBatches (cpu {})", engine().cpu_id());
typedef ::shared_ptr<cql3::untyped_result_set> page_ptr;
sstring query = sprint("SELECT id, data, written_at, version FROM %s.%s LIMIT %d", system_keyspace::NAME, system_keyspace::BATCHLOG, page_size);
@@ -257,8 +273,6 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
}).then([this] {
logger.debug("Finished replayAllFailedBatches");
});
}).finally([this] {
_sem.signal();
});
}


@@ -42,9 +42,11 @@
#pragma once
#include <unordered_map>
#include "core/future.hh"
#include "core/distributed.hh"
#include "core/timer.hh"
#include <seastar/core/future.hh>
#include <seastar/core/distributed.hh>
#include <seastar/core/timer.hh>
#include <seastar/core/gate.hh>
#include "cql3/query_processor.hh"
#include "gms/inet_address.hh"
#include "db_clock.hh"
@@ -61,7 +63,7 @@ private:
size_t _total_batches_replayed = 0;
cql3::query_processor& _qp;
timer<clock_type> _timer;
semaphore _sem;
seastar::gate _gate;
bool _stop = false;
std::random_device _rd;


@@ -193,7 +193,7 @@ public:
cfg.commit_log_location = "/var/lib/scylla/commitlog";
}
logger.trace("Commitlog maximum disk size: {} MB / cpu ({} cpus)",
max_disk_size / (1024*1024));
max_disk_size / (1024*1024), smp::count);
_regs = create_counters();
}
@@ -204,6 +204,8 @@ public:
future<> init();
future<sseg_ptr> new_segment();
future<sseg_ptr> active_segment();
future<sseg_ptr> allocate_segment(bool active);
future<> clear();
future<> sync_all_segments();
future<> shutdown();
@@ -213,9 +215,10 @@ public:
void discard_unused_segments();
void discard_completed_segments(const cf_id_type& id,
const replay_position& pos);
void on_timer();
void sync();
void arm() {
_timer.arm(std::chrono::milliseconds(cfg.commitlog_sync_period_in_ms));
void arm(uint32_t extra = 0) {
_timer.arm(std::chrono::milliseconds(cfg.commitlog_sync_period_in_ms + extra));
}
std::vector<sstring> get_active_names() const;
@@ -241,11 +244,21 @@ public:
private:
segment_id_type _ids = 0;
std::vector<sseg_ptr> _segments;
std::deque<sseg_ptr> _reserve_segments;
std::vector<buffer_type> _temp_buffers;
std::unordered_map<flush_handler_id, flush_handler> _flush_handlers;
flush_handler_id _flush_ids = 0;
replay_position _flush_position;
timer<clock_type> _timer;
size_t _reserve_allocating = 0;
// # segments to try to keep available in reserve
// i.e. the number of segments we expect to consume in between timer
// callbacks.
// The idea is that since the files are 0 len at start, and thus cost little,
// it is easier to adapt this value compared to timer freq.
size_t _num_reserve_segments = 0;
seastar::gate _gate;
uint64_t _new_counter = 0;
};
/*
@@ -296,12 +309,12 @@ public:
// TODO : tune initial / default size
static constexpr size_t default_size = align_up<size_t>(128 * 1024, alignment);
segment(segment_manager* m, const descriptor& d, file && f)
segment(segment_manager* m, const descriptor& d, file && f, bool active)
: _segment_manager(m), _desc(std::move(d)), _file(std::move(f)), _sync_time(
clock_type::now())
{
++_segment_manager->totals.segments_created;
logger.debug("Created new segment {}", *this);
logger.debug("Created new {} segment {}", active ? "active" : "reserve", *this);
}
~segment() {
if (is_clean()) {
@@ -324,7 +337,7 @@ public:
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
now - _sync_time).count();
if ((_segment_manager->cfg.commitlog_sync_period_in_ms * 2) < uint64_t(ms)) {
logger.debug("Need sync. {} ms elapsed", ms);
logger.debug("{} needs sync. {} ms elapsed", *this, ms);
return true;
}
return false;
@@ -337,12 +350,16 @@ public:
sync();
return _segment_manager->active_segment();
}
void reset_sync_time() {
_sync_time = clock_type::now();
}
future<sseg_ptr> sync() {
// Note: this is not a marker for when sync was finished.
// It is when it was initiated
_sync_time = clock_type::now();
reset_sync_time();
if (position() <= _flush_pos) {
logger.trace("Sync not needed : ({} / {})", position(), _flush_pos);
logger.trace("Sync not needed {}: ({} / {})", *this, position(), _flush_pos);
return make_ready_future<sseg_ptr>(shared_from_this());
}
return cycle().then([](auto seg) {
@@ -358,7 +375,7 @@ public:
pos = _file_pos;
}
if (pos != 0 && pos <= _flush_pos) {
logger.trace("Already synced! ({} < {})", pos, _flush_pos);
logger.trace("{} already synced! ({} < {})", *this, pos, _flush_pos);
return make_ready_future<sseg_ptr>(std::move(me));
}
logger.trace("Syncing {} -> {}", _flush_pos, pos);
@@ -370,7 +387,7 @@ public:
_dwrite.write_unlock(); // release it already.
pos = std::max(pos, _file_pos);
if (pos <= _flush_pos) {
logger.trace("Already synced! ({} < {})", pos, _flush_pos);
logger.trace("{} already synced! ({} < {})", *this, pos, _flush_pos);
return make_ready_future<sseg_ptr>(std::move(me));
}
++_segment_manager->totals.pending_operations;
@@ -389,7 +406,7 @@ public:
}).then([this, pos, me = std::move(me)]() {
_flush_pos = std::max(pos, _flush_pos);
++_segment_manager->totals.flush_count;
logger.trace("Synced to {}", _flush_pos);
logger.trace("{} synced to {}", *this, _flush_pos);
return make_ready_future<sseg_ptr>(std::move(me));
}).finally([this] {
--_segment_manager->totals.pending_operations;
@@ -488,16 +505,13 @@ public:
}
// gah, partial write. should always get here with dma chunk sized
// "bytes", but lets make sure...
logger.debug("Partial write: {}/{} bytes", *written, size);
logger.debug("Partial write {}: {}/{} bytes", *this, *written, size);
*written = align_down(*written, alignment);
return make_ready_future<stop_iteration>(stop_iteration::no);
// TODO: retry/ignore/fail/stop - optional behaviour in origin.
// we fast-fail the whole commit.
} catch (std::exception& e) {
logger.error("Failed to persist commits to disk: {}", e.what());
throw;
} catch (...) {
logger.error("Failed to persist commits to disk.");
logger.error("Failed to persist commits to disk for {}: {}", *this, std::current_exception());
throw;
}
});
@@ -688,11 +702,11 @@ future<> db::commitlog::segment_manager::init() {
// base id counter is [ <shard> | <base> ]
_ids = replay_position(engine().cpu_id(), id).id;
if (cfg.mode != sync_mode::BATCH) {
_timer.set_callback(std::bind(&segment_manager::sync, this));
this->arm();
}
// always run the timer now, since we need to handle segment pre-alloc etc as well.
_timer.set_callback(std::bind(&segment_manager::on_timer, this));
auto delay = engine().cpu_id() * std::ceil(double(cfg.commitlog_sync_period_in_ms) / smp::count);
logger.trace("Delaying timer loop {} ms", delay);
this->arm(delay);
});
}
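The `init()` hunk above staggers the first timer tick per cpu: each shard delays by `cpu_id * ceil(sync_period / smp::count)` milliseconds so all shards do not hit the disk at the same instant. A minimal sketch of that arithmetic, with illustrative values (the function name `first_tick_delay_ms` is not from the patch):

```cpp
#include <cmath>
#include <cstdint>

// Stagger computed as in segment_manager::init(): spread the first tick of
// each shard's timer evenly across one sync period.
static uint64_t first_tick_delay_ms(unsigned cpu_id, uint64_t period_ms, unsigned cpu_count) {
    return cpu_id * static_cast<uint64_t>(std::ceil(double(period_ms) / cpu_count));
}
```

With the default 10 s sync period on 8 cpus, shard 0 starts immediately, shard 3 after 3750 ms, shard 7 after 8750 ms.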
@@ -803,22 +817,37 @@ void db::commitlog::segment_manager::flush_segments(bool force) {
}
}
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::allocate_segment(bool active) {
descriptor d(next_id());
return engine().open_file_dma(cfg.commit_log_location + "/" + d.filename(), open_flags::wo | open_flags::create).then([this, d, active](file f) {
auto s = make_lw_shared<segment>(this, d, std::move(f), active);
return make_ready_future<sseg_ptr>(s);
});
}
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::new_segment() {
if (_shutdown) {
throw std::runtime_error("Commitlog has been shut down. Cannot add data");
}
descriptor d(next_id());
return engine().open_file_dma(cfg.commit_log_location + "/" + d.filename(), open_flags::wo | open_flags::create).then([this, d](file f) {
_segments.emplace_back(make_lw_shared<segment>(this, d, std::move(f)));
auto max = max_disk_size;
auto cur = totals.total_size_on_disk;
if (max != 0 && cur >= max) {
logger.debug("Size on disk {} MB exceeds local maximum {} MB", cur / (1024 * 1024), max / (1024 * 1024));
flush_segments();
++_new_counter;
if (_reserve_segments.empty()) {
if (_num_reserve_segments < cfg.max_reserve_segments) {
++_num_reserve_segments;
logger.trace("Increased segment reserve count to {}", _num_reserve_segments);
}
}).then([this] {
return make_ready_future<sseg_ptr>(_segments.back());
});
return allocate_segment(true).then([this](sseg_ptr s) {
_segments.push_back(s);
return make_ready_future<sseg_ptr>(s);
});
}
_segments.push_back(_reserve_segments.front());
_reserve_segments.pop_front();
_segments.back()->reset_sync_time();
logger.trace("Acquired segment {} from reserve", _segments.back());
return make_ready_future<sseg_ptr>(_segments.back());
}
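The reworked `new_segment()` above prefers a pre-allocated segment from `_reserve_segments` and only allocates directly (bumping `_num_reserve_segments`) when the reserve has run dry. A minimal single-threaded sketch of that pattern, with simplified types — `segment_pool`, `target_reserve`, and the bare `segment` struct are illustrative stand-ins, not the patch's API:

```cpp
#include <deque>
#include <memory>
#include <vector>

struct segment { int id; };               // stand-in for the commitlog segment
using sseg_ptr = std::shared_ptr<segment>;

struct segment_pool {
    std::deque<sseg_ptr> reserve;         // pre-allocated, zero-length segments
    std::vector<sseg_ptr> active;
    size_t target_reserve = 0;            // grown on a dry hit, like _num_reserve_segments
    int next_id = 0;

    sseg_ptr new_segment() {
        if (reserve.empty()) {
            ++target_reserve;             // adapt reserve size rather than timer frequency
            active.push_back(std::make_shared<segment>(segment{next_id++}));
        } else {
            active.push_back(reserve.front());
            reserve.pop_front();
        }
        return active.back();
    }
    // Plays the on_timer() role: top the reserve back up between ticks.
    void replenish() {
        while (reserve.size() < target_reserve) {
            reserve.push_back(std::make_shared<segment>(segment{next_id++}));
        }
    }
};
```

Since empty files cost little, growing the reserve count on demand is cheaper than tuning the timer period.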
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::active_segment() {
@@ -841,7 +870,7 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
*/
void db::commitlog::segment_manager::discard_completed_segments(
const cf_id_type& id, const replay_position& pos) {
logger.debug("discard completed log segments for {}, table {}", pos, id);
logger.debug("Discard completed segments for {}, table {}", pos, id);
for (auto&s : _segments) {
s->mark_clean(id, pos);
}
@@ -849,7 +878,7 @@ void db::commitlog::segment_manager::discard_completed_segments(
}
std::ostream& db::operator<<(std::ostream& out, const db::commitlog::segment& s) {
return out << "commit log segment (" << s._desc.filename() << ")";
return out << s._desc.filename();
}
std::ostream& db::operator<<(std::ostream& out, const db::commitlog::segment::cf_mark& m) {
@@ -863,10 +892,14 @@ std::ostream& db::operator<<(std::ostream& out, const db::replay_position& p) {
void db::commitlog::segment_manager::discard_unused_segments() {
auto i = std::remove_if(_segments.begin(), _segments.end(), [=](auto& s) {
if (s->is_unused()) {
logger.debug("{} is unused", *s);
logger.debug("Segment {} is unused", *s);
return true;
}
logger.debug("Not safe to delete {}; dirty is {}", s, segment::cf_mark {*s});
if (s->is_still_allocating()) {
logger.debug("Not safe to delete segment {}; still allocating.", s);
} else {
logger.debug("Not safe to delete segment {}; dirty is {}", s, segment::cf_mark {*s});
}
return false;
});
if (i != _segments.end()) {
@@ -878,16 +911,22 @@ future<> db::commitlog::segment_manager::sync_all_segments() {
logger.debug("Issuing sync for all segments");
return parallel_for_each(_segments, [this](sseg_ptr s) {
return s->sync().then([](sseg_ptr s) {
logger.debug("Synced {}", *s);
logger.debug("Synced segment {}", *s);
});
});
}
future<> db::commitlog::segment_manager::shutdown() {
_shutdown = true;
return parallel_for_each(_segments, [this](sseg_ptr s) {
return s->shutdown();
});
if (!_shutdown) {
_shutdown = true;
_timer.cancel();
return _gate.close().then([this] {
return parallel_for_each(_segments, [this](sseg_ptr s) {
return s->shutdown();
});
});
}
return make_ready_future<>();
}
@@ -898,6 +937,8 @@ future<> db::commitlog::segment_manager::shutdown() {
*/
future<> db::commitlog::segment_manager::clear() {
logger.debug("Clearing all segments");
_shutdown = true;
_timer.cancel();
flush_segments(true);
return sync_all_segments().then([this] {
for (auto& s : _segments) {
@@ -913,6 +954,51 @@ void db::commitlog::segment_manager::sync() {
for (auto& s : _segments) {
s->sync(); // we do not care about waiting...
}
}
void db::commitlog::segment_manager::on_timer() {
if (cfg.mode != sync_mode::BATCH) {
sync();
}
// IFF a new segment was put in use since last we checked, and we're
// above threshold, request flush.
if (_new_counter > 0) {
auto max = max_disk_size;
auto cur = totals.total_size_on_disk;
if (max != 0 && cur >= max) {
_new_counter = 0;
logger.debug("Size on disk {} MB exceeds local maximum {} MB", cur / (1024 * 1024), max / (1024 * 1024));
flush_segments();
}
}
// Gate, because we are starting potentially blocking ops
// without waiting for them, so segment_manager could be shut down
// while they are running.
seastar::with_gate(_gate, [this] {
// take outstanding allocations into account. This is paranoid,
// but if for some reason the file::open takes longer than timer period,
// we could flood the reserve list with new segments
auto n = _reserve_segments.size() + _reserve_allocating;
return parallel_for_each(boost::irange(n, _num_reserve_segments), [this, n](auto i) {
++_reserve_allocating;
return this->allocate_segment(false).then([this](sseg_ptr s) {
if (!_shutdown) {
// insertion sort.
auto i = std::upper_bound(_reserve_segments.begin(), _reserve_segments.end(), s, [](auto s1, auto s2) {
const descriptor& d1 = s1->_desc;
const descriptor& d2 = s2->_desc;
return d1.id < d2.id;
});
i = _reserve_segments.emplace(i, std::move(s));
logger.trace("Added reserve segment {}", *i);
}
}).finally([this] {
--_reserve_allocating;
});
});
}).handle_exception([](auto ep) {
logger.warn("Exception in segment reservation: {}", ep);
});
arm();
}
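Because reserve allocations in `on_timer()` can complete out of order, the patch inserts each finished segment into `_reserve_segments` with `std::upper_bound`, keeping the deque sorted by descriptor id. The insertion step can be sketched like this, with the element type simplified to a bare id (`add_sorted` is an illustrative name, not from the patch):

```cpp
#include <algorithm>
#include <deque>

// Insertion sort into an already-sorted deque: find the first element
// greater than `id` and insert before it, preserving ascending order.
static void add_sorted(std::deque<int>& reserve, int id) {
    auto i = std::upper_bound(reserve.begin(), reserve.end(), id);
    reserve.insert(i, id);
}
```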
@@ -944,6 +1030,7 @@ db::commitlog::segment_manager::buffer_type db::commitlog::segment_manager::acqu
if (a == nullptr) {
throw std::bad_alloc();
}
logger.trace("Allocated {} k buffer", s / 1024);
return buffer_type(reinterpret_cast<char *>(a), s, make_free_deleter(a));
}
@@ -956,6 +1043,7 @@ void db::commitlog::segment_manager::release_buffer(buffer_type&& b) {
constexpr const size_t max_temp_buffers = 4;
if (_temp_buffers.size() > max_temp_buffers) {
logger.trace("Deleting {} buffers", _temp_buffers.size() - max_temp_buffers);
_temp_buffers.erase(_temp_buffers.begin() + max_temp_buffers, _temp_buffers.end());
}
totals.buffer_list_bytes = std::accumulate(_temp_buffers.begin(),
@@ -1104,7 +1192,10 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
}
future<> read_header() {
return fin.read_exactly(segment::descriptor_header_size).then([this](temporary_buffer<char> buf) {
advance(buf);
if (!advance(buf)) {
// zero length file. accept it just to be nice.
return make_ready_future<>();
}
// Will throw if we got eof
data_input in(buf);
auto ver = in.read<uint32_t>();
@@ -1124,9 +1215,6 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
this->id = id;
this->next = 0;
if (start_off > pos) {
return skip(start_off - pos);
}
return make_ready_future<>();
});
}
@@ -1154,6 +1242,10 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
this->next = next;
if (start_off >= next) {
return skip(next - pos);
}
return do_until(std::bind(&work::end_of_chunk, this), std::bind(&work::read_entry, this));
});
}
@@ -1181,6 +1273,10 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
throw std::runtime_error("Invalid entry size");
}
if (start_off > pos) {
return skip(size - entry_header_size);
}
return fin.read_exactly(size - entry_header_size).then([this, size, checksum, rp](temporary_buffer<char> buf) {
advance(buf);
@@ -1213,8 +1309,10 @@ db::commitlog::read_log_file(file f, commit_load_reader_func next, position_type
auto w = make_lw_shared<work>(std::move(f), off);
auto ret = w->s.listen(std::move(next));
w->s.started().then(std::bind(&work::read_file, w.get())).finally([w] {
w->s.started().then(std::bind(&work::read_file, w.get())).then([w] {
w->s.close();
}).handle_exception([w](auto ep) {
w->s.set_exception(ep);
});
return ret;
@@ -1236,6 +1334,14 @@ uint64_t db::commitlog::get_pending_tasks() const {
return _segment_manager->totals.pending_operations;
}
uint64_t db::commitlog::get_num_segments_created() const {
return _segment_manager->totals.segments_created;
}
uint64_t db::commitlog::get_num_segments_destroyed() const {
return _segment_manager->totals.segments_destroyed;
}
future<std::vector<db::commitlog::descriptor>> db::commitlog::list_existing_descriptors() const {
return list_existing_descriptors(active_config().commit_log_location);
}


@@ -111,6 +111,9 @@ public:
uint64_t commitlog_total_space_in_mb = 0;
uint64_t commitlog_segment_size_in_mb = 32;
uint64_t commitlog_sync_period_in_ms = 10 * 1000; //TODO: verify default!
// Max number of segments to keep in pre-alloc reserve.
// Not (yet) configurable from scylla.conf.
uint64_t max_reserve_segments = 12;
sync_mode mode = sync_mode::PERIODIC;
};
@@ -229,6 +232,8 @@ public:
uint64_t get_total_size() const;
uint64_t get_completed_tasks() const;
uint64_t get_pending_tasks() const;
uint64_t get_num_segments_created() const;
uint64_t get_num_segments_destroyed() const;
/**
* Returns the largest amount of data that can be written in a single "mutation".


@@ -117,11 +117,16 @@ future<> db::commitlog_replayer::impl::init() {
logger.warn("Could not read sstable metadata {}", std::current_exception());
}
}
// TODO: this is not correct. Truncation does not fully take sharding into consideration
return db::system_keyspace::get_truncated_position(qp, uuid).then([&map, uuid](auto truncated_rp) {
if (truncated_rp != replay_position()) {
auto& pp = map[engine().cpu_id()][uuid];
pp = std::max(pp, truncated_rp);
// We do this on each cpu, for each CF, which technically is a little wasteful, but the values are
// cached, this is only startup, and it makes the code easier.
// Get all truncation records for the CF and initialize max rps if
// present. Cannot do this on demand, as there may be no sstables to
// mark the CF as "needed".
return db::system_keyspace::get_truncated_position(uuid).then([&map, &uuid](std::vector<db::replay_position> tpps) {
for (auto& p : tpps) {
logger.trace("CF {} truncated at {}", uuid, p);
auto& pp = map[p.shard_id()][uuid];
pp = std::max(pp, p);
}
});
}).then([&map] {
@@ -183,8 +188,8 @@ future<> db::commitlog_replayer::impl::process(stats* s, temporary_buffer<char>
auto uuid = fm.column_family_id();
auto& map = _rpm[shard];
auto i = map.find(uuid);
if (i != map.end() && rp < i->second) {
logger.trace("entry {} at {} is less than recorded replay position {}. skipping", fm.column_family_id(), rp, i->second);
if (i != map.end() && rp <= i->second) {
logger.trace("entry {} at {} is younger than recorded replay position {}. skipping", fm.column_family_id(), rp, i->second);
s->skipped_mutations++;
return make_ready_future<>();
}
@@ -248,7 +253,10 @@ future<> db::commitlog_replayer::recover(std::vector<sstring> files) {
logger.info("Replaying {}", files);
return parallel_for_each(files, [this](auto f) {
return this->recover(std::move(f));
return this->recover(f).handle_exception([f](auto ep) {
logger.error("Error recovering {}: {}", f, ep);
std::rethrow_exception(ep);
});
});
}


@@ -71,6 +71,9 @@ struct replay_position {
bool operator<(const replay_position & r) const {
return id < r.id ? true : (r.id < id ? false : pos < r.pos);
}
bool operator<=(const replay_position & r) const {
return !(r < *this);
}
bool operator==(const replay_position & r) const {
return id == r.id && pos == r.pos;
}
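The new `operator<=` above is derived from the existing lexicographic `operator<` (compare `id`, then `pos`), so only one comparison is hand-written; it is what lets the replayer skip entries at exactly the recorded truncation point (`rp <= i->second`). A self-contained sketch mirroring the operators shown in the hunk:

```cpp
#include <cstdint>

// Mirrors db::replay_position's comparisons: order by id, then by pos.
struct replay_position {
    uint64_t id;
    uint32_t pos;
    bool operator<(const replay_position& r) const {
        return id < r.id ? true : (r.id < id ? false : pos < r.pos);
    }
    bool operator<=(const replay_position& r) const {
        return !(r < *this);   // derived: a <= b  iff  !(b < a)
    }
    bool operator==(const replay_position& r) const {
        return id == r.id && pos == r.pos;
    }
};
```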


@@ -407,7 +407,7 @@ public:
"The port for inter-node communication." \
) \
/* Advanced automatic backup setting */ \
val(auto_snapshot, bool, true, Unused, \
val(auto_snapshot, bool, true, Used, \
"Enable or disable whether a snapshot is taken of the data before keyspace truncation or dropping of tables. To prevent data loss, using the default setting is strongly advised. If you set to false, you will lose data on truncation or drop." \
) \
/* Key caches and global row properties */ \


@@ -55,6 +55,9 @@ struct query_context {
api::timestamp_type next_timestamp() {
return _qp.local().next_timestamp();
}
cql3::query_processor& qp() {
return _qp.local();
}
};
// This does not have to be thread local, because all cores will share the same context.

File diff suppressed because it is too large.


@@ -87,6 +87,8 @@ future<std::set<sstring>> merge_keyspaces(distributed<service::storage_proxy>& p
std::vector<mutation> make_create_keyspace_mutations(lw_shared_ptr<keyspace_metadata> keyspace, api::timestamp_type timestamp, bool with_tables_and_types_and_functions = true);
std::vector<mutation> make_drop_keyspace_mutations(lw_shared_ptr<keyspace_metadata> keyspace, api::timestamp_type timestamp);
lw_shared_ptr<keyspace_metadata> create_keyspace_from_schema_partition(const schema_result::value_type& partition);
future<> merge_tables(distributed<service::storage_proxy>& proxy, schema_result&& before, schema_result&& after);
@@ -100,7 +102,9 @@ std::vector<mutation> make_create_table_mutations(lw_shared_ptr<keyspace_metadat
future<std::map<sstring, schema_ptr>> create_tables_from_tables_partition(distributed<service::storage_proxy>& proxy, const schema_result::mapped_type& result);
void add_table_to_schema_mutation(schema_ptr table, api::timestamp_type timestamp, bool with_columns_and_triggers, const partition_key& pkey, std::vector<mutation>& mutations);
std::vector<mutation> make_drop_table_mutations(lw_shared_ptr<keyspace_metadata> keyspace, schema_ptr table, api::timestamp_type timestamp);
future<schema_ptr> create_table_from_name(distributed<service::storage_proxy>& proxy, const sstring& keyspace, const sstring& table);
future<schema_ptr> create_table_from_table_row(distributed<service::storage_proxy>& proxy, const query::result_set_row& row);
@@ -109,6 +113,8 @@ void create_table_from_table_row_and_column_rows(schema_builder& builder, const
future<schema_ptr> create_table_from_table_partition(distributed<service::storage_proxy>& proxy, lw_shared_ptr<query::result_set>&& partition);
void drop_column_from_schema_mutation(schema_ptr table, const column_definition& column, long timestamp, std::vector<mutation>& mutations);
std::vector<column_definition> create_columns_from_column_rows(const schema_result::mapped_type& rows,
const sstring& keyspace,
const sstring& table,/*,


@@ -40,6 +40,8 @@
#include <boost/range/algorithm_ext/push_back.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/range/adaptor/map.hpp>
#include "system_keyspace.hh"
#include "types.hh"
@@ -50,6 +52,7 @@
#include "cql3/query_options.hh"
#include "cql3/query_processor.hh"
#include "utils/fb_utilities.hh"
#include "utils/hash.hh"
#include "dht/i_partitioner.hh"
#include "version.hh"
#include "thrift/server.hh"
@@ -482,10 +485,14 @@ future<> init_local_cache() {
});
}
future<> setup(distributed<database>& db, distributed<cql3::query_processor>& qp) {
void minimal_setup(distributed<database>& db, distributed<cql3::query_processor>& qp) {
auto new_ctx = std::make_unique<query_context>(db, qp);
qctx.swap(new_ctx);
assert(!new_ctx);
}
future<> setup(distributed<database>& db, distributed<cql3::query_processor>& qp) {
minimal_setup(db, qp);
return setup_version().then([&db] {
return update_schema_version(db.local().get_version());
}).then([] {
@@ -499,24 +506,40 @@ future<> setup(distributed<database>& db, distributed<cql3::query_processor>& qp
}).then([] {
return db::schema_tables::save_system_keyspace_schema();
});
return make_ready_future<>();
}
typedef std::pair<db::replay_position, db_clock::time_point> truncation_entry;
typedef std::unordered_map<utils::UUID, truncation_entry> truncation_map;
typedef std::pair<replay_positions, db_clock::time_point> truncation_entry;
typedef utils::UUID truncation_key;
typedef std::unordered_map<truncation_key, truncation_entry> truncation_map;
static thread_local std::experimental::optional<truncation_map> truncation_records;
future<> save_truncation_record(cql3::query_processor& qp, const column_family& cf, db_clock::time_point truncated_at, const db::replay_position& rp) {
db::serializer<replay_position> rps(rp);
bytes buf(bytes::initialized_later(), sizeof(db_clock::rep) + rps.size());
future<> save_truncation_records(const column_family& cf, db_clock::time_point truncated_at, replay_positions positions) {
auto size =
sizeof(db_clock::rep)
+ positions.size()
* db::serializer<replay_position>(
db::replay_position()).size();
bytes buf(bytes::initialized_later(), size);
data_output out(buf);
rps(out);
// Old version would write a single RP. We write N. Resulting blob size
// will determine how many.
// An external entity reading this blob would get a "correct" RP
// and a garbled time stamp. But an external entity has no business
// reading this data anyway, since it is meaningless outside this
// machine instance.
for (auto& rp : positions) {
db::serializer<replay_position>::write(out, rp);
}
out.write<db_clock::rep>(truncated_at.time_since_epoch().count());
map_type_impl::native_type tmp;
tmp.emplace_back(boost::any{ cf.schema()->id() }, boost::any{ buf });
sstring req = sprint("UPDATE system.%s SET truncated_at = truncated_at + ? WHERE key = '%s'", LOCAL, LOCAL);
return qp.execute_internal(req, {tmp}).then([&qp](auto rs) {
return qctx->qp().execute_internal(req, {tmp}).then([](auto rs) {
truncation_records = {};
return force_blocking_flush(LOCAL);
});
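As the comment in `save_truncation_records()` notes, the blob is N fixed-size replay positions followed by one timestamp, and the reader recovers N from the blob size alone, consuming positions while more than a timestamp's worth of bytes remains. A minimal sketch of that layout under simplified types — the raw-`memcpy` encoding and the names `encode`/`decode` are illustrative, not the patch's `db::serializer`:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct replay_position { uint32_t shard; uint64_t pos; };  // fixed-size record

// Layout: [rp0][rp1]...[rpN-1][truncated_at]; N is implied by total length.
static std::vector<char> encode(const std::vector<replay_position>& rps, int64_t truncated_at) {
    std::vector<char> buf;
    for (auto& rp : rps) {
        const char* p = reinterpret_cast<const char*>(&rp);
        buf.insert(buf.end(), p, p + sizeof(rp));
    }
    const char* t = reinterpret_cast<const char*>(&truncated_at);
    buf.insert(buf.end(), t, t + sizeof(truncated_at));
    return buf;
}

static std::vector<replay_position> decode(const std::vector<char>& buf, int64_t& truncated_at) {
    std::vector<replay_position> rps;
    size_t off = 0;
    // Keep reading positions while more than a timestamp remains,
    // as get_truncation_record() does with data_input::avail().
    while (buf.size() - off > sizeof(int64_t)) {
        replay_position rp;
        std::memcpy(&rp, buf.data() + off, sizeof(rp));
        rps.push_back(rp);
        off += sizeof(rp);
    }
    std::memcpy(&truncated_at, buf.data() + off, sizeof(truncated_at));
    return rps;
}
```

This is why an old single-position blob still decodes as a one-element vector, while an old reader pointed at a new blob would see one "correct" position and garbage for the timestamp, as the comment warns.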
@@ -525,49 +548,84 @@ future<> save_truncation_record(cql3::query_processor& qp, const column_family&
/**
* This method is used to remove information about truncation time for specified column family
*/
future<> remove_truncation_record(cql3::query_processor& qp, utils::UUID id) {
future<> remove_truncation_record(utils::UUID id) {
sstring req = sprint("DELETE truncated_at[?] from system.%s WHERE key = '%s'", LOCAL, LOCAL);
return qp.execute_internal(req, {id}).then([&qp](auto rs) {
return qctx->qp().execute_internal(req, {id}).then([](auto rs) {
truncation_records = {};
return force_blocking_flush(LOCAL);
});
}
static future<truncation_entry> get_truncation_record(cql3::query_processor& qp, utils::UUID cf_id) {
static future<truncation_entry> get_truncation_record(utils::UUID cf_id) {
if (!truncation_records) {
sstring req = sprint("SELECT truncated_at FROM system.%s WHERE key = '%s'", LOCAL, LOCAL);
return qp.execute_internal(req).then([&qp, cf_id](::shared_ptr<cql3::untyped_result_set> rs) {
return qctx->qp().execute_internal(req).then([cf_id](::shared_ptr<cql3::untyped_result_set> rs) {
truncation_map tmp;
if (!rs->empty() && rs->one().has("truncated_set")) {
if (!rs->empty() && rs->one().has("truncated_at")) {
auto map = rs->one().get_map<utils::UUID, bytes>("truncated_at");
for (auto& p : map) {
auto uuid = p.first;
auto buf = p.second;
truncation_entry e;
data_input in(p.second);
e.first = db::serializer<replay_position>::read(in);
data_input in(buf);
while (in.avail() > sizeof(db_clock::rep)) {
e.first.emplace_back(db::serializer<replay_position>::read(in));
}
e.second = db_clock::time_point(db_clock::duration(in.read<db_clock::rep>()));
tmp[p.first] = e;
tmp[uuid] = e;
}
}
truncation_records = std::move(tmp);
return get_truncation_record(qp, cf_id);
return get_truncation_record(cf_id);
});
}
return make_ready_future<truncation_entry>((*truncation_records)[cf_id]);
}
future<db::replay_position> get_truncated_position(cql3::query_processor& qp, utils::UUID cf_id) {
return get_truncation_record(qp, cf_id).then([](truncation_entry e) {
return make_ready_future<db::replay_position>(e.first);
future<> save_truncation_record(const column_family& cf, db_clock::time_point truncated_at, db::replay_position rp) {
// TODO: this is horribly ineffective, we're doing a full flush of all system tables for all cores
// once, for each core (calling us). But right now, redesigning so that calling here (or, rather,
// save_truncation_records), is done from "somewhere higher, once per machine, not shard" is tricky.
// Mainly because drop_tables also uses truncate. And is run per-core as well. Gah.
return get_truncated_position(cf.schema()->id()).then([&cf, truncated_at, rp](replay_positions positions) {
auto i = std::find_if(positions.begin(), positions.end(), [rp](auto& p) {
return p.shard_id() == rp.shard_id();
});
if (i == positions.end()) {
positions.emplace_back(rp);
} else {
*i = rp;
}
return save_truncation_records(cf, truncated_at, positions);
});
}
future<db_clock::time_point> get_truncated_at(cql3::query_processor& qp, utils::UUID cf_id) {
return get_truncation_record(qp, cf_id).then([](truncation_entry e) {
future<db::replay_position> get_truncated_position(utils::UUID cf_id, uint32_t shard) {
return get_truncated_position(std::move(cf_id)).then([shard](replay_positions positions) {
for (auto& rp : positions) {
if (shard == rp.shard_id()) {
return make_ready_future<db::replay_position>(rp);
}
}
return make_ready_future<db::replay_position>();
});
}
future<replay_positions> get_truncated_position(utils::UUID cf_id) {
return get_truncation_record(cf_id).then([](truncation_entry e) {
return make_ready_future<replay_positions>(e.first);
});
}
future<db_clock::time_point> get_truncated_at(utils::UUID cf_id) {
return get_truncation_record(cf_id).then([](truncation_entry e) {
return make_ready_future<db_clock::time_point>(e.second);
});
}
set_type_impl::native_type prepare_tokens(std::unordered_set<dht::token>& tokens) {
set_type_impl::native_type tset;
for (auto& t: tokens) {

View File

@@ -84,6 +84,9 @@ extern schema_ptr hints();
extern schema_ptr batchlog();
extern schema_ptr built_indexes(); // TODO (from Cassandra): make private
// Only for testing.
void minimal_setup(distributed<database>& db, distributed<cql3::query_processor>& qp);
future<> init_local_cache();
future<> setup(distributed<database>& db, distributed<cql3::query_processor>& qp);
future<> update_schema_version(utils::UUID version);
@@ -274,10 +277,14 @@ enum class bootstrap_state {
return CompactionHistoryTabularData.from(queryResultSet);
}
#endif
future<> save_truncation_record(cql3::query_processor&, const column_family&, db_clock::time_point truncated_at, const db::replay_position&);
future<> remove_truncation_record(cql3::query_processor&, utils::UUID);
future<db::replay_position> get_truncated_position(cql3::query_processor&, utils::UUID);
future<db_clock::time_point> get_truncated_at(cql3::query_processor&, utils::UUID);
typedef std::vector<db::replay_position> replay_positions;
future<> save_truncation_record(const column_family&, db_clock::time_point truncated_at, db::replay_position);
future<> save_truncation_records(const column_family&, db_clock::time_point truncated_at, replay_positions);
future<> remove_truncation_record(utils::UUID);
future<replay_positions> get_truncated_position(utils::UUID);
future<db::replay_position> get_truncated_position(utils::UUID, uint32_t shard);
future<db_clock::time_point> get_truncated_at(utils::UUID);
#if 0


@@ -4,7 +4,7 @@ MAINTAINER Avi Kivity <avi@cloudius-systems.com>
ADD scylla.repo /etc/yum.repos.d/
RUN dnf -y update
RUN dnf -y install scylla-server
RUN dnf -y install scylla-server hostname
RUN dnf clean all
ADD start-scylla /start-scylla
RUN chown scylla /start-scylla


@@ -37,6 +37,6 @@ if [ "$OS" = "Fedora" ]; then
rpmbuild -bs --define "_topdir $RPMBUILD" $RPMBUILD/SPECS/scylla-server.spec
mock rebuild --resultdir=`pwd`/build/rpms $RPMBUILD/SRPMS/scylla-server-$VERSION*.src.rpm
else
sudo yum-builddep $RPMBUILD/SPECS/scylla-server.spec
sudo yum-builddep -y $RPMBUILD/SPECS/scylla-server.spec
rpmbuild -ba --define "_topdir $RPMBUILD" $RPMBUILD/SPECS/scylla-server.spec
fi


@@ -1,55 +1,111 @@
#!/bin/sh -e
export RPMBUILD=`pwd`/build/rpmbuild
do_install()
{
pkg=$1
sudo yum install -y $RPMBUILD/RPMS/*/$pkg 2> build/err || if [ "`cat build/err`" != "Error: Nothing to do" ]; then cat build/err; exit 1;fi
echo Install $pkg done
}
sudo yum install -y wget yum-utils rpm-build rpmdevtools gcc gcc-c++ make patch
mkdir -p build/srpms
cd build/srpms
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/b/boost-1.57.0-6.fc22.src.rpm
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/n/ninja-build-1.5.3-2.fc22.src.rpm
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/r/ragel-6.8-3.fc22.src.rpm
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/r/re2c-0.13.5-9.fc22.src.rpm
if [ ! -f binutils-2.25-5.fc22.src.rpm ]; then
wget http://ftp.riken.jp/Linux/fedora/releases/22/Everything/source/SRPMS/b/binutils-2.25-5.fc22.src.rpm
fi
if [ ! -f isl-0.14-3.fc22.src.rpm ]; then
wget http://ftp.riken.jp/Linux/fedora/releases/22/Everything/source/SRPMS/i/isl-0.14-3.fc22.src.rpm
fi
if [ ! -f gcc-5.1.1-4.fc22.src.rpm ]; then
wget http://ftp.riken.jp/Linux/fedora/updates/22/SRPMS/g/gcc-5.1.1-4.fc22.src.rpm
fi
if [ ! -f boost-1.57.0-6.fc22.src.rpm ]; then
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/b/boost-1.57.0-6.fc22.src.rpm
fi
if [ ! -f ninja-build-1.5.3-2.fc22.src.rpm ]; then
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/n/ninja-build-1.5.3-2.fc22.src.rpm
fi
if [ ! -f ragel-6.8-3.fc22.src.rpm ]; then
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/r/ragel-6.8-3.fc22.src.rpm
fi
if [ ! -f re2c-0.13.5-9.fc22.src.rpm ]; then
wget http://download.fedoraproject.org/pub/fedora/linux/releases/22/Everything/source/SRPMS/r/re2c-0.13.5-9.fc22.src.rpm
fi
cd -
sudo yum install -y epel-release
sudo yum install -y cryptopp cryptopp-devel jsoncpp jsoncpp-devel lz4 lz4-devel yaml-cpp yaml-cpp-devel thrift thrift-devel scons gtest gtest-devel python34
sudo ln -sf /usr/bin/python3.4 /usr/bin/python3
sudo yum install -y scl-utils
sudo yum install -y https://www.softwarecollections.org/en/scls/rhscl/devtoolset-3/epel-7-x86_64/download/rhscl-devtoolset-3-epel-7-x86_64.noarch.rpm
sudo yum install -y devtoolset-3-gcc-c++
sudo yum install -y python-devel libicu-devel openmpi-devel mpich-devel libstdc++-devel bzip2-devel zlib-devel
rpmbuild --define "_topdir $RPMBUILD" --without python3 --rebuild build/srpms/boost-1.57.0-6.fc22.src.rpm
sudo yum install -y `ls $RPMBUILD/RPMS/x86_64/boost*|grep -v debuginfo`
rpmbuild --define "_topdir $RPMBUILD" --rebuild build/srpms/re2c-0.13.5-9.fc22.src.rpm
sudo yum install -y $RPMBUILD/RPMS/x86_64/re2c-0.13.5-9.el7.centos.x86_64.rpm
rpm --define "_topdir $RPMBUILD" -ivh build/srpms/ninja-build-1.5.3-2.fc22.src.rpm
patch $RPMBUILD/SPECS/ninja-build.spec < dist/redhat/centos_dep/ninja-build.diff
rpmbuild --define "_topdir $RPMBUILD" -ba $RPMBUILD/SPECS/ninja-build.spec
sudo yum install -y $RPMBUILD/RPMS/x86_64/ninja-build-1.5.3-2.el7.centos.x86_64.rpm
sudo yum install -y flex bison dejagnu zlib-static glibc-static sharutils bc libstdc++-static gmp-devel texinfo texinfo-tex systemtap-sdt-devel mpfr-devel libmpc-devel elfutils-devel elfutils-libelf-devel glibc-devel.x86_64 glibc-devel.i686 gcc-gnat libgnat doxygen graphviz dblatex texlive-collection-latex docbook5-style-xsl python-sphinx cmake
sudo yum install -y gcc-objc
rpm --define "_topdir $RPMBUILD" -ivh build/srpms/ragel-6.8-3.fc22.src.rpm
patch $RPMBUILD/SPECS/ragel.spec < dist/redhat/centos_dep/ragel.diff
rpmbuild --define "_topdir $RPMBUILD" -ba $RPMBUILD/SPECS/ragel.spec
sudo yum install -y $RPMBUILD/RPMS/x86_64/ragel-6.8-3.el7.centos.x86_64.rpm
mkdir build/antlr3-tool-3.5.2
cp dist/redhat/centos_dep/antlr3 build/antlr3-tool-3.5.2
cd build/antlr3-tool-3.5.2
wget http://www.antlr3.org/download/antlr-3.5.2-complete-no-st3.jar
cd -
cd build
tar cJpf $RPMBUILD/SOURCES/antlr3-tool-3.5.2.tar.xz antlr3-tool-3.5.2
cd -
rpmbuild --define "_topdir $RPMBUILD" -ba dist/redhat/centos_dep/antlr3-tool.spec
sudo yum install -y $RPMBUILD/RPMS/noarch/antlr3-tool-3.5.2-1.el7.centos.noarch.rpm
if [ ! -f $RPMBUILD/RPMS/x86_64/binutils-2.25-5.el7.centos.x86_64.rpm ]; then
rpmbuild --define "_topdir $RPMBUILD" --rebuild build/srpms/binutils-2.25-5.fc22.src.rpm
fi
do_install binutils-2.25-5.el7.centos.x86_64.rpm
wget -O build/3.5.2.tar.gz https://github.com/antlr/antlr3/archive/3.5.2.tar.gz
mv build/3.5.2.tar.gz $RPMBUILD/SOURCES
rpmbuild --define "_topdir $RPMBUILD" -ba dist/redhat/centos_dep/antlr3-C++-devel.spec
sudo yum install -y $RPMBUILD/RPMS/x86_64/antlr3-C++-devel-3.5.2-1.el7.centos.x86_64.rpm
if [ ! -f $RPMBUILD/RPMS/x86_64/isl-0.14-3.el7.centos.x86_64.rpm ]; then
rpmbuild --define "_topdir $RPMBUILD" --rebuild build/srpms/isl-0.14-3.fc22.src.rpm
fi
do_install isl-0.14-3.el7.centos.x86_64.rpm
do_install isl-devel-0.14-3.el7.centos.x86_64.rpm
if [ ! -f $RPMBUILD/RPMS/x86_64/gcc-5.1.1-4.el7.centos.x86_64.rpm ]; then
rpmbuild --define "_topdir $RPMBUILD" --define "fedora 21" --rebuild build/srpms/gcc-5.1.1-4.fc22.src.rpm
fi
do_install *5.1.1-4*
if [ ! -f $RPMBUILD/RPMS/x86_64/boost-1.57.0-6.el7.centos.x86_64.rpm ]; then
rpmbuild --define "_topdir $RPMBUILD" --without python3 --rebuild build/srpms/boost-1.57.0-6.fc22.src.rpm
fi
do_install boost*
if [ ! -f $RPMBUILD/RPMS/x86_64/re2c-0.13.5-9.el7.centos.x86_64.rpm ]; then
rpmbuild --define "_topdir $RPMBUILD" --rebuild build/srpms/re2c-0.13.5-9.fc22.src.rpm
fi
do_install re2c-0.13.5-9.el7.centos.x86_64.rpm
if [ ! -f $RPMBUILD/RPMS/x86_64/ninja-build-1.5.3-2.el7.centos.x86_64.rpm ]; then
rpm --define "_topdir $RPMBUILD" -ivh build/srpms/ninja-build-1.5.3-2.fc22.src.rpm
patch $RPMBUILD/SPECS/ninja-build.spec < dist/redhat/centos_dep/ninja-build.diff
rpmbuild --define "_topdir $RPMBUILD" -ba $RPMBUILD/SPECS/ninja-build.spec
fi
do_install ninja-build-1.5.3-2.el7.centos.x86_64.rpm
if [ ! -f $RPMBUILD/RPMS/x86_64/ragel-6.8-3.el7.centos.x86_64.rpm ]; then
rpm --define "_topdir $RPMBUILD" -ivh build/srpms/ragel-6.8-3.fc22.src.rpm
patch $RPMBUILD/SPECS/ragel.spec < dist/redhat/centos_dep/ragel.diff
rpmbuild --define "_topdir $RPMBUILD" -ba $RPMBUILD/SPECS/ragel.spec
fi
do_install ragel-6.8-3.el7.centos.x86_64.rpm
if [ ! -f $RPMBUILD/RPMS/noarch/antlr3-tool-3.5.2-1.el7.centos.noarch.rpm ]; then
mkdir build/antlr3-tool-3.5.2
cp dist/redhat/centos_dep/antlr3 build/antlr3-tool-3.5.2
cd build/antlr3-tool-3.5.2
wget http://www.antlr3.org/download/antlr-3.5.2-complete-no-st3.jar
cd -
cd build
tar cJpf $RPMBUILD/SOURCES/antlr3-tool-3.5.2.tar.xz antlr3-tool-3.5.2
cd -
rpmbuild --define "_topdir $RPMBUILD" -ba dist/redhat/centos_dep/antlr3-tool.spec
fi
do_install antlr3-tool-3.5.2-1.el7.centos.noarch.rpm
if [ ! -f $RPMBUILD/RPMS/x86_64/antlr3-C++-devel-3.5.2-1.el7.centos.x86_64.rpm ];then
wget -O build/3.5.2.tar.gz https://github.com/antlr/antlr3/archive/3.5.2.tar.gz
mv build/3.5.2.tar.gz $RPMBUILD/SOURCES
rpmbuild --define "_topdir $RPMBUILD" -ba dist/redhat/centos_dep/antlr3-C++-devel.spec
fi
do_install antlr3-C++-devel-3.5.2-1.el7.centos.x86_64.rpm


@@ -8,13 +8,10 @@ License: AGPLv3
URL: http://www.scylladb.com/
Source0: %{name}-@@VERSION@@-@@RELEASE@@.tar
BuildRequires: libaio-devel boost-devel libstdc++-devel cryptopp-devel hwloc-devel numactl-devel libpciaccess-devel libxml2-devel zlib-devel thrift-devel yaml-cpp-devel lz4-devel snappy-devel jsoncpp-devel systemd-devel xz-devel openssl-devel libcap-devel libselinux-devel libgcrypt-devel libgpg-error-devel elfutils-devel krb5-devel libcom_err-devel libattr-devel pcre-devel elfutils-libelf-devel bzip2-devel keyutils-libs-devel ninja-build ragel antlr3-tool antlr3-C++-devel make
BuildRequires: libaio-devel boost-devel libstdc++-devel cryptopp-devel hwloc-devel numactl-devel libpciaccess-devel libxml2-devel zlib-devel thrift-devel yaml-cpp-devel lz4-devel snappy-devel jsoncpp-devel systemd-devel xz-devel openssl-devel libcap-devel libselinux-devel libgcrypt-devel libgpg-error-devel elfutils-devel krb5-devel libcom_err-devel libattr-devel pcre-devel elfutils-libelf-devel bzip2-devel keyutils-libs-devel ninja-build ragel antlr3-tool antlr3-C++-devel xfsprogs-devel make
%{?fedora:BuildRequires: python3 gcc-c++ libasan libubsan}
%{?rhel:BuildRequires: python34 devtoolset-3-gcc-c++}
Requires: libaio boost-program-options boost-system libstdc++ boost-thread cryptopp hwloc-libs numactl-libs libpciaccess libxml2 zlib thrift yaml-cpp lz4 snappy jsoncpp boost-filesystem systemd-libs xz-libs openssl-libs libcap libselinux libgcrypt libgpg-error elfutils-libs krb5-libs libcom_err libattr pcre elfutils-libelf bzip2-libs keyutils-libs
# TODO: create our own bridge device for virtio
Requires: libvirt-daemon
%{?rhel:BuildRequires: python34 gcc-c++ >= 5.1.1}
Requires: systemd-libs xfsprogs
%description
@@ -26,7 +23,7 @@ Requires: libvirt-daemon
./configure.py --with scylla --disable-xen --enable-dpdk --mode=release
%endif
%if 0%{?rhel}
./configure.py --with scylla --disable-xen --enable-dpdk --mode=release --compiler=/opt/rh/devtoolset-3/root/usr/bin/g++
./configure.py --with scylla --disable-xen --enable-dpdk --mode=release --static-stdc++
%endif
ninja-build -j2

exceptions/exceptions.cc

@@ -0,0 +1,49 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright 2015 Cloudius Systems
*
* Modified by Cloudius Systems
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <sstream>
#include "exceptions.hh"
#include "log.hh"
exceptions::truncate_exception::truncate_exception(std::exception_ptr ep)
: request_execution_exception(exceptions::exception_code::PROTOCOL_ERROR, sprint("Error during truncate: %s", ep))
{}


@@ -107,6 +107,19 @@ struct unavailable_exception : cassandra_exception {
{}
};
class request_execution_exception : public cassandra_exception {
public:
request_execution_exception(exception_code code, sstring msg)
: cassandra_exception(code, std::move(msg))
{ }
};
class truncate_exception : public request_execution_exception
{
public:
truncate_exception(std::exception_ptr ep);
};
class request_timeout_exception : public cassandra_exception {
public:
db::consistency_level consistency;


@@ -840,6 +840,10 @@ int gossiper::get_current_generation_number(inet_address endpoint) {
return endpoint_state_map.at(endpoint).get_heart_beat_state().get_generation();
}
int gossiper::get_current_heart_beat_version(inet_address endpoint) {
return endpoint_state_map.at(endpoint).get_heart_beat_state().get_heart_beat_version();
}
future<bool> gossiper::do_gossip_to_live_member(gossip_digest_syn message) {
size_t size = _live_endpoints.size();
if (size == 0) {
@@ -1280,11 +1284,11 @@ future<> gossiper::do_shadow_round() {
return make_ready_future<>();
}).get();
}
if (clk::now() > t + storage_service_ring_delay()) {
if (clk::now() > t + storage_service_ring_delay() * 60) {
throw std::runtime_error(sprint("Unable to gossip with any seeds (ShadowRound)"));
}
if (this->_in_shadow_round) {
logger.trace("Sleep 1 second and retry ...");
logger.info("Sleep 1 second and connect seeds again ... ({} seconds passed)", std::chrono::duration_cast<std::chrono::seconds>(clk::now() - t).count());
sleep(std::chrono::seconds(1)).get();
}
}
@@ -1477,6 +1481,12 @@ future<int> get_current_generation_number(inet_address ep) {
});
}
future<int> get_current_heart_beat_version(inet_address ep) {
return smp::submit_to(0, [ep] {
return get_local_gossiper().get_current_heart_beat_version(ep);
});
}
future<> unsafe_assassinate_endpoint(sstring ep) {
return smp::submit_to(0, [ep] {
return get_local_gossiper().unsafe_assassinate_endpoint(ep);


@@ -305,6 +305,7 @@ public:
bool is_known_endpoint(inet_address endpoint);
int get_current_generation_number(inet_address endpoint);
int get_current_heart_beat_version(inet_address endpoint);
bool is_gossip_only_member(inet_address endpoint);
private:
@@ -462,6 +463,7 @@ future<std::set<inet_address>> get_unreachable_members();
future<std::set<inet_address>> get_live_members();
future<int64_t> get_endpoint_downtime(inet_address ep);
future<int> get_current_generation_number(inet_address ep);
future<int> get_current_heart_beat_version(inet_address ep);
future<> unsafe_assassinate_endpoint(sstring ep);
future<> assassinate_endpoint(sstring ep);

keys.hh

@@ -160,6 +160,17 @@ public:
return TopLevel::from_bytes(get_compound_type(s)->serialize_single(std::move(v)));
}
template <typename T>
static
TopLevel from_singular(const schema& s, const T& v) {
auto ct = get_compound_type(s);
if (!ct->is_singular()) {
throw std::invalid_argument("compound is not singular");
}
auto type = ct->types()[0];
return from_single_value(s, type->decompose(v));
}
TopLevelView view() const {
return TopLevelView::from_bytes(_bytes);
}

log.cc

@@ -214,8 +214,8 @@ logging::log_level lexical_cast(const std::string& source) {
}
std::ostream& operator<<(std::ostream&out, std::exception_ptr eptr) {
namespace std {
std::ostream& operator<<(std::ostream&out, const std::exception_ptr eptr) {
if (!eptr) {
out << "<no exception>";
return out;
@@ -241,3 +241,4 @@ std::ostream& operator<<(std::ostream&out, std::exception_ptr eptr) {
}
return out;
}
}
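The log.cc/log.hh change above moves the exception_ptr pretty-printer into namespace std so that argument-dependent lookup can find it wherever an exception_ptr is streamed. A self-contained sketch of that pattern; `format_eptr` is a hypothetical helper added here only to exercise the operator:

```cpp
#include <exception>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

namespace std {
// Declared in namespace std (as the patch does) so ADL finds the
// overload for std::exception_ptr from any calling namespace.
ostream& operator<<(ostream& out, const exception_ptr eptr) {
    if (!eptr) {
        return out << "<no exception>";
    }
    try {
        rethrow_exception(eptr);  // the only portable way to inspect it
    } catch (const exception& e) {
        out << e.what();
    } catch (...) {
        out << "<unknown exception>";
    }
    return out;
}
}

// Hypothetical helper, not part of the patch: render an exception_ptr
// to a string via the streaming operator above.
std::string format_eptr(std::exception_ptr ep) {
    std::ostringstream os;
    os << ep;
    return os.str();
}
```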

log.hh

@@ -179,6 +179,8 @@ logger::do_log(log_level level, const char* fmt, Args&&... args) {
}
// Pretty-printer for exceptions to be logged, e.g., std::current_exception().
std::ostream& operator<<(std::ostream&, std::exception_ptr);
namespace std {
std::ostream& operator<<(std::ostream&, const std::exception_ptr);
}
#endif /* LOG_HH_ */

main.cc

@@ -43,6 +43,7 @@
#include "init.hh"
#include "release.hh"
#include <cstdio>
#include <core/file.hh>
logging::logger startlog("init");
@@ -89,6 +90,17 @@ static logging::log_level to_loglevel(sstring level) {
}
}
static future<> disk_sanity(sstring path) {
return check_direct_io_support(path).then([path] {
return file_system_at(path).then([path] (auto fs) {
if (fs != fs_type::xfs) {
startlog.warn("{} is not on XFS. This is a non-supported setup, and performance is expected to be very bad.\n"
"For better performance, placing your data on XFS-formatted directories is strongly recommended", path);
}
});
});
};
static void apply_logger_settings(sstring default_level, db::config::string_map levels,
bool log_to_stdout, bool log_to_syslog) {
logging::logger_registry().set_all_loggers_level(to_loglevel(default_level));
@@ -246,6 +258,12 @@ int main(int ac, char** av) {
return dirs.touch_and_lock(db.local().get_config().data_file_directories());
}).then([&db, &dirs] {
return dirs.touch_and_lock(db.local().get_config().commitlog_directory());
}).then([&db] {
return parallel_for_each(db.local().get_config().data_file_directories(), [] (sstring pathname) {
return disk_sanity(pathname);
}).then([&db] {
return disk_sanity(db.local().get_config().commitlog_directory());
});
}).then([&db] {
return db.invoke_on_all([] (database& db) {
return db.init_system_keyspace();
@@ -282,6 +300,10 @@ int main(int ac, char** av) {
}).then([] {
auto& ss = service::get_local_storage_service();
return ss.init_server();
}).then([] {
return db::get_batchlog_manager().invoke_on_all([] (db::batchlog_manager& b) {
return b.start();
});
}).then([rpc_address] {
return dns::gethostbyname(rpc_address);
}).then([&db, &proxy, &qp, rpc_address, cql_port, thrift_port, start_thrift] (dns::hostent e) {
@@ -319,6 +341,8 @@ int main(int ac, char** av) {
}).then([api_address, api_port] {
print("Seastar HTTP server listening on %s:%s ...\n", api_address, api_port);
});
}).then([] {
startlog.warn("Polling mode enabled. ScyllaDB will use 100% of all your CPUs.\nSee https://github.com/scylladb/scylla/issues/417 for a more detailed explanation");
});
}).or_terminate();
});


@@ -178,6 +178,9 @@ public:
// FIXME: not really true, a previous stop could be in progress?
return make_ready_future<>();
}
bool error() {
return _p->error();
}
operator rpc_protocol::client&() { return *_p; }
};
@@ -295,14 +298,19 @@ static unsigned get_rpc_client_idx(messaging_verb verb) {
shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::get_rpc_client(messaging_verb verb, shard_id id) {
auto idx = get_rpc_client_idx(verb);
auto it = _clients[idx].find(id);
if (it == _clients[idx].end()) {
auto remote_addr = ipv4_addr(id.addr.raw_addr(), _port);
auto client = make_shared<rpc_protocol_client_wrapper>(*_rpc, remote_addr, ipv4_addr{_listen_address.raw_addr(), 0});
it = _clients[idx].emplace(id, shard_info(std::move(client))).first;
return it->second.rpc_client;
} else {
return it->second.rpc_client;
if (it != _clients[idx].end()) {
auto c = it->second.rpc_client;
if (!c->error()) {
return c;
}
remove_rpc_client(verb, id);
}
auto remote_addr = ipv4_addr(id.addr.raw_addr(), _port);
auto client = make_shared<rpc_protocol_client_wrapper>(*_rpc, remote_addr, ipv4_addr{_listen_address.raw_addr(), 0});
it = _clients[idx].emplace(id, shard_info(std::move(client))).first;
return it->second.rpc_client;
}
void messaging_service::remove_rpc_client(messaging_verb verb, shard_id id) {
@@ -536,4 +544,18 @@ future<query::result_digest> messaging_service::send_read_digest(shard_id id, qu
return send_message<query::result_digest>(this, net::messaging_verb::READ_DIGEST, std::move(id), cmd, pr);
}
// Wrapper for TRUNCATE
void messaging_service::register_truncate(std::function<future<> (sstring, sstring)>&& func) {
register_handler(this, net::messaging_verb::TRUNCATE, std::move(func));
}
void messaging_service::unregister_truncate() {
_rpc->unregister_handler(net::messaging_verb::TRUNCATE);
}
future<> messaging_service::send_truncate(shard_id id, std::chrono::milliseconds timeout, sstring ks, sstring cf) {
return send_message_timeout<void>(this, net::messaging_verb::TRUNCATE, std::move(id), std::move(timeout), std::move(ks), std::move(cf));
}
} // namespace net
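The get_rpc_client() change above reuses a cached client only while it reports no error, and otherwise evicts it before building a fresh connection. That cache-with-staleness-check shape can be sketched with an ordinary map; `client` here is a hypothetical stand-in for rpc_protocol_client_wrapper:

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for rpc_protocol_client_wrapper: only the
// error() probe matters for this sketch.
struct client {
    bool failed = false;
    bool error() const { return failed; }
};

std::map<std::string, std::shared_ptr<client>> clients;

std::shared_ptr<client> get_client(const std::string& id) {
    auto it = clients.find(id);
    if (it != clients.end()) {
        auto c = it->second;
        if (!c->error()) {
            return c;       // healthy cached client: reuse it
        }
        clients.erase(it);  // broken client: evict before reconnecting
    }
    auto c = std::make_shared<client>();
    clients.emplace(id, c);
    return c;
}
```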


@@ -528,6 +528,11 @@ public:
void unregister_read_digest();
future<query::result_digest> send_read_digest(shard_id id, query::read_command& cmd, query::partition_range& pr);
// Wrapper for TRUNCATE
void register_truncate(std::function<future<>(sstring, sstring)>&& func);
void unregister_truncate();
future<> send_truncate(shard_id, std::chrono::milliseconds, sstring, sstring);
public:
// Return rpc::protocol::client for a shard which is a ip + cpuid pair.
shared_ptr<rpc_protocol_client_wrapper> get_rpc_client(messaging_verb verb, shard_id id);


@@ -519,6 +519,7 @@ class mutation_partition final {
boost::intrusive::compare<row_tombstones_entry::compare>>;
friend rows_entry;
friend row_tombstones_entry;
friend class size_calculator;
private:
tombstone _tombstone;
row _static_row;


@@ -210,12 +210,7 @@ row_cache::make_reader(const query::partition_range& range) {
}
row_cache::~row_cache() {
with_allocator(_tracker.allocator(), [this] {
_partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
_tracker.on_erase();
deleter(p);
});
});
clear();
}
void row_cache::populate(const mutation& m) {
@@ -235,6 +230,15 @@ void row_cache::populate(const mutation& m) {
});
}
void row_cache::clear() {
with_allocator(_tracker.allocator(), [this] {
_partitions.clear_and_dispose([this, deleter = current_deleter<cache_entry>()] (auto&& p) mutable {
_tracker.on_erase();
deleter(p);
});
});
}
future<> row_cache::update(memtable& m, partition_presence_checker presence_checker) {
_tracker.region().merge(m._region); // Now all data in memtable belongs to cache
auto attr = seastar::thread_attributes();


@@ -56,6 +56,7 @@ class cache_entry {
mutation_partition _p;
lru_link_type _lru_link;
cache_link_type _cache_link;
friend class size_calculator;
public:
friend class row_cache;
friend class cache_tracker;
@@ -182,6 +183,9 @@ public:
// information there is for its partition in the underlying data sources.
void populate(const mutation& m);
// Clears the cache.
void clear();
// Synchronizes cache with the underlying data source from a memtable which
// has just been flushed to the underlying data source.
// The memtable can be queried during the process, but must not be written.

Submodule seastar updated: 5fe596a764...78e3924fcf


@@ -222,11 +222,11 @@ public void notifyUpdateAggregate(UDAggregate udf)
}
#endif
future<> migration_manager::notify_drop_keyspace(const lw_shared_ptr<keyspace_metadata>& ksm)
future<> migration_manager::notify_drop_keyspace(sstring ks_name)
{
return get_migration_manager().invoke_on_all([name = ksm->name()] (auto&& mm) {
return get_migration_manager().invoke_on_all([ks_name] (auto&& mm) {
for (auto&& listener : mm._listeners) {
listener->on_drop_keyspace(name);
listener->on_drop_keyspace(ks_name);
}
});
}
@@ -381,13 +381,12 @@ future<> migration_manager::announce_keyspace_drop(const sstring& ks_name, bool
{
try {
auto& db = get_local_storage_proxy().get_db().local();
/*auto&& keyspace = */db.find_keyspace(ks_name);
auto& keyspace = db.find_keyspace(ks_name);
#if 0
logger.info(String.format("Drop Keyspace '%s'", oldKsm.name));
announce(LegacySchemaTables.makeDropKeyspaceMutation(oldKsm, FBUtilities.timestampMicros()), announceLocally);
#endif
// FIXME
throw std::runtime_error("not implemented");
auto&& mutations = db::schema_tables::make_drop_keyspace_mutations(keyspace.metadata(), db_clock::now_in_usecs());
return announce(std::move(mutations), announce_locally);
} catch (const no_such_keyspace& e) {
throw exceptions::configuration_exception(sprint("Cannot drop non existing keyspace '%s'.", ks_name));
}
@@ -399,14 +398,11 @@ future<> migration_manager::announce_column_family_drop(const sstring& ks_name,
{
try {
auto& db = get_local_storage_proxy().get_db().local();
/*auto&& cfm = */db.find_schema(ks_name, cf_name);
/*auto&& ksm = */db.find_keyspace(ks_name);
#if 0
logger.info(String.format("Drop table '%s/%s'", oldCfm.ksName, oldCfm.cfName));
announce(LegacySchemaTables.makeDropTableMutation(ksm, oldCfm, FBUtilities.timestampMicros()), announceLocally);
#endif
// FIXME
throw std::runtime_error("not implemented");
auto&& old_cfm = db.find_schema(ks_name, cf_name);
auto&& keyspace = db.find_keyspace(ks_name);
logger.info("Drop table '{}/{}'", old_cfm->ks_name(), old_cfm->cf_name());
auto mutations = db::schema_tables::make_drop_table_mutations(keyspace.metadata(), old_cfm, db_clock::now_in_usecs());
return announce(std::move(mutations), announce_locally);
} catch (const no_such_column_family& e) {
throw exceptions::configuration_exception(sprint("Cannot drop non existing table '%s' in keyspace '%s'.", cf_name, ks_name));
}


@@ -79,7 +79,7 @@ public:
static future<> notify_update_column_family(schema_ptr cfm);
static future<> notify_drop_keyspace(const lw_shared_ptr<keyspace_metadata>& ksm);
static future<> notify_drop_keyspace(sstring ks_name);
static future<> notify_drop_column_family(schema_ptr cfm);


@@ -61,7 +61,6 @@
#include <boost/range/adaptor/transformed.hpp>
#include <boost/iterator/counting_iterator.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/range/adaptor/indirected.hpp>
#include <boost/range/algorithm/count_if.hpp>
#include <boost/range/algorithm/find.hpp>
#include <boost/range/algorithm/find_if.hpp>
@@ -1293,21 +1292,22 @@ class digest_read_resolver : public abstract_read_resolver {
_digest_results.clear();
}
virtual size_t response_count() const override {
return _digest_results.size() + _data_results.size();
return _digest_results.size();
}
bool digests_match() const {
assert(response_count());
if (response_count() == 1) {
return true;
}
auto digests = boost::range::join(_digest_results, _data_results | boost::adaptors::indirected | boost::adaptors::transformed(std::mem_fn(&query::result::digest)));
const query::result_digest& first = *digests.begin();
return std::find_if(digests.begin() + 1, digests.end(), [&first] (const query::result_digest& digest) { return digest != first; }) == digests.end();
auto& first = *_digest_results.begin();
return std::find_if(_digest_results.begin() + 1, _digest_results.end(), [&first] (query::result_digest digest) { return digest != first; }) == _digest_results.end();
}
public:
digest_read_resolver(db::consistency_level cl, size_t block_for, std::chrono::high_resolution_clock::time_point timeout) : abstract_read_resolver(cl, 0, timeout), _block_for(block_for) {}
void add_data(gms::inet_address from, foreign_ptr<lw_shared_ptr<query::result>> result) {
if (!_timedout) {
// if only one target was queried digest_check() will be skipped so we can also skip digest calculation
_digest_results.emplace_back(_targets_count == 1 ? query::result_digest(bytes()) : result->digest());
_data_results.emplace_back(std::move(result));
got_response(from);
}
@@ -2225,6 +2225,42 @@ bool storage_proxy::should_hint(gms::inet_address ep) {
#endif
}
future<> storage_proxy::truncate_blocking(sstring keyspace, sstring cfname) {
logger.debug("Starting a blocking truncate operation on keyspace {}, CF {}", keyspace, cfname);
auto& gossiper = gms::get_local_gossiper();
if (!gossiper.get_unreachable_token_owners().empty()) {
logger.info("Cannot perform truncate, some hosts are down");
// Since the truncate operation is so aggressive and is typically only
// invoked by an admin, for simplicity we require that all nodes are up
// to perform the operation.
auto live_members = gossiper.get_live_members().size();
throw exceptions::unavailable_exception(db::consistency_level::ALL,
live_members + gossiper.get_unreachable_members().size(),
live_members);
}
auto all_endpoints = gossiper.get_live_token_owners();
auto& ms = net::get_local_messaging_service();
auto timeout = std::chrono::milliseconds(_db.local().get_config().truncate_request_timeout_in_ms());
logger.trace("Enqueuing truncate messages to hosts {}", all_endpoints);
return parallel_for_each(all_endpoints, [keyspace, cfname, &ms, timeout](auto ep) {
return ms.send_truncate(net::messaging_service::shard_id{ep, 0}, timeout, keyspace, cfname);
}).handle_exception([cfname](auto ep) {
try {
std::rethrow_exception(ep);
} catch (rpc::timeout_error& e) {
logger.trace("Truncation of {} timed out: {}", cfname, e.what());
} catch (...) {
throw;
}
});
}
#if 0
/**
* Performs the truncate operation, which effectively deletes all data from
@@ -2529,6 +2565,12 @@ void storage_proxy::init_messaging_service() {
return p->query_singular_local_digest(cmd, pr);
});
});
ms.register_truncate([](sstring ksname, sstring cfname) {
const auto truncated_at = db_clock::now();
return get_storage_proxy().invoke_on_all([truncated_at, ksname, cfname](storage_proxy& sp) {
return sp._db.local().truncate(truncated_at, ksname, cfname);
});
});
}
void storage_proxy::uninit_messaging_service() {
@@ -2540,6 +2582,7 @@ void storage_proxy::uninit_messaging_service() {
ms.unregister_read_data();
ms.unregister_read_mutation_data();
ms.unregister_read_digest();
ms.unregister_truncate();
}
// Merges reconcilable_result:s from different shards into one
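The simplified digests_match() in the hunk above reduces to "every collected digest equals the first one", now that data-result digests are no longer mixed into the comparison. A minimal sketch, with `result_digest` as a stand-in type for query::result_digest:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for query::result_digest; any equality-comparable type works.
using result_digest = std::uint64_t;

bool digests_match(const std::vector<result_digest>& digests) {
    // A single response trivially matches (the comparison is skipped).
    if (digests.size() <= 1) {
        return true;
    }
    const auto& first = digests.front();
    return std::find_if(digests.begin() + 1, digests.end(),
                        [&](const result_digest& d) { return d != first; })
           == digests.end();
}
```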


@@ -160,6 +160,14 @@ public:
*/
future<> mutate_atomically(std::vector<mutation> mutations, db::consistency_level cl);
/**
* Performs the truncate operation, which effectively deletes all data from
* the column family cfname
* @param keyspace
* @param cfname
*/
future<> truncate_blocking(sstring keyspace, sstring cfname);
/*
* Executes data query on the whole cluster.
*


@@ -96,6 +96,9 @@ public:
void gossip_snitch_info();
distributed<database>& db() {
return _db;
}
private:
bool is_auto_bootstrap();
inet_address get_broadcast_address() {


@@ -215,7 +215,7 @@ future<> compact_sstables(std::vector<shared_sstable> sstables,
future<> write_done = newtab->write_components(
std::move(mutation_queue_reader), estimated_partitions, schema).then([newtab, stats, start_time] {
return newtab->load().then([newtab, stats, start_time] {
return newtab->open_data().then([newtab, stats, start_time] {
uint64_t endsize = newtab->data_size();
double ratio = (double) endsize / (double) stats->start_size;
auto end_time = std::chrono::high_resolution_clock::now();
@@ -237,10 +237,29 @@ future<> compact_sstables(std::vector<shared_sstable> sstables,
});
// Wait for both read_done and write_done fibers to finish.
// FIXME: if write_done throws an exception, we get a broken pipe
// exception on read_done, and then we don't handle write_done's
// exception, causing a warning message of "ignored exceptional future".
return read_done.then([write_done = std::move(write_done)] () mutable { return std::move(write_done); });
return when_all(std::move(read_done), std::move(write_done)).then([] (std::tuple<future<>, future<>> t) {
sstring ex;
try {
std::get<0>(t).get();
} catch(...) {
ex += "read exception: ";
ex += sprint("%s", std::current_exception());
}
try {
std::get<1>(t).get();
} catch(...) {
if (ex.size()) {
ex += ", ";
}
ex += "write exception: ";
ex += sprint("%s", std::current_exception());
}
if (ex.size()) {
throw std::runtime_error(ex);
}
});
}
class compaction_strategy_impl {
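The compaction change above waits for both the read and write fibers and folds any exceptions from either into a single error, instead of letting one go unhandled. The same aggregation shape can be sketched with plain std::future standing in for seastar futures:

```cpp
#include <future>
#include <stdexcept>
#include <string>

// Wait on both tasks; if either threw, combine the messages into one
// runtime_error, mirroring the read_done/write_done handling above.
std::string wait_both(std::future<void> read_done,
                      std::future<void> write_done) {
    std::string ex;
    try {
        read_done.get();
    } catch (const std::exception& e) {
        ex += "read exception: ";
        ex += e.what();
    }
    try {
        write_done.get();
    } catch (const std::exception& e) {
        if (!ex.empty()) {
            ex += ", ";
        }
        ex += "write exception: ";
        ex += e.what();
    }
    if (!ex.empty()) {
        throw std::runtime_error(ex);
    }
    return "ok";
}
```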


@@ -41,6 +41,7 @@
#include "memtable.hh"
#include <boost/filesystem/operations.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/range/adaptor/map.hpp>
#include <regex>
#include <core/align.hh>
@@ -877,7 +878,11 @@ future<> sstable::open_data() {
_index_file = std::get<file>(std::get<0>(files).get());
_data_file = std::get<file>(std::get<1>(files).get());
return _data_file.size().then([this] (auto size) {
_data_file_size = size;
if (this->has_component(sstable::component_type::CompressionInfo)) {
_compression.update(size);
} else {
_data_file_size = size;
}
}).then([this] {
return _index_file.size().then([this] (auto size) {
_index_file_size = size;
@@ -911,12 +916,6 @@ future<> sstable::load() {
return read_summary();
}).then([this] {
return open_data();
}).then([this] {
// After we have _compression and _data_file_size, we can update
// _compression with additional information it needs:
if (has_component(sstable::component_type::CompressionInfo)) {
_compression.update(_data_file_size);
}
});
}
@@ -1386,7 +1385,7 @@ future<uint64_t> sstable::bytes_on_disk() {
});
}
const bool sstable::has_component(component_type f) {
const bool sstable::has_component(component_type f) const {
return _components.count(f);
}
@@ -1394,6 +1393,16 @@ const sstring sstable::filename(component_type f) const {
return filename(_dir, _ks, _cf, _version, _generation, _format, f);
}
std::vector<sstring> sstable::component_filenames() const {
std::vector<sstring> res;
for (auto c : _component_map | boost::adaptors::map_keys) {
if (has_component(c)) {
res.emplace_back(filename(c));
}
}
return res;
}
sstring sstable::toc_filename() const {
return filename(component_type::TOC);
}
@@ -1413,6 +1422,21 @@ const sstring sstable::filename(sstring dir, sstring ks, sstring cf, version_typ
return dir + "/" + strmap[version](entry_descriptor(ks, cf, version, generation, format, component));
}
future<> sstable::create_links(sstring dir) const {
return parallel_for_each(component_filenames(), [this, dir](sstring f) {
auto sdir = get_dir();
auto name = f.substr(sdir.size());
auto dst = dir + name;
return ::link_file(f, dst);
}).then([dir] {
// sync dir
return ::open_directory(dir).then([](file df) {
auto f = df.flush();
return f.finally([df = std::move(df)] {});
});
});
}
entry_descriptor entry_descriptor::make_descriptor(sstring fname) {
static std::regex la("la-(\\d+)-(\\w+)-(.*)");
static std::regex ka("(\\w+)-(\\w+)-ka-(\\d+)-(.*)");
@@ -1584,9 +1608,9 @@ remove_by_toc_name(sstring sstable_toc_name) {
}
static future<bool>
file_existence(sstring filename) {
file_exists(sstring filename) {
return engine().open_file_dma(filename, open_flags::ro).then([] (file f) {
return make_ready_future<>();
return f.close().finally([f] {});
}).then_wrapped([] (future<> f) {
bool exists = true;
try {
@@ -1603,11 +1627,11 @@ file_existence(sstring filename) {
future<>
sstable::remove_sstable_with_temp_toc(sstring ks, sstring cf, sstring dir, unsigned long generation, version_types v, format_types f) {
return seastar::async([ks, cf, dir, generation, v, f] {
auto toc = file_existence(filename(dir, ks, cf, v, generation, f, component_type::TOC)).get0();
auto toc = file_exists(filename(dir, ks, cf, v, generation, f, component_type::TOC)).get0();
// assert that toc doesn't exist for sstable with temporary toc.
assert(toc == false);
auto tmptoc = file_existence(filename(dir, ks, cf, v, generation, f, component_type::TemporaryTOC)).get0();
auto tmptoc = file_exists(filename(dir, ks, cf, v, generation, f, component_type::TemporaryTOC)).get0();
// assert that temporary toc exists for this sstable.
assert(tmptoc == true);
@@ -1627,12 +1651,13 @@ sstable::remove_sstable_with_temp_toc(sstring ks, sstring cf, sstring dir, unsig
auto file_path = filename(dir, ks, cf, v, generation, f, entry.first);
// Skip component that doesn't exist.
auto exists = file_existence(file_path).get0();
auto exists = file_exists(file_path).get0();
if (!exists) {
continue;
}
remove_file(file_path).get();
}
fsync_directory(dir).get();
// Removing temporary
remove_file(filename(dir, ks, cf, v, generation, f, component_type::TemporaryTOC)).get();
// Fsync'ing column family dir to guarantee that deletion completed.
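The renamed file_exists() above probes for a file by opening it read-only, and the fix also closes the descriptor on success instead of leaking it. A plain POSIX sketch of the same idea:

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <string>

// Probe existence by opening read-only; crucially (the leak the patch
// fixes), close the descriptor when the open succeeds.
bool file_exists(const std::string& path) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) {
        return false;  // treat any open failure (e.g. ENOENT) as absent
    }
    ::close(fd);
    return true;
}
```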


@@ -183,6 +183,7 @@ public:
version_types v, format_types f);
future<> load();
future<> open_data();
void set_generation(unsigned long generation) {
_generation = generation;
@@ -259,20 +260,40 @@ public:
return _filter_file_size;
}
uint64_t filter_memory_size() {
return _filter->memory_size();
}
// Returns the total bytes of all components.
future<uint64_t> bytes_on_disk();
partition_key get_first_partition_key(const schema& s) const;
partition_key get_last_partition_key(const schema& s) const;
const sstring get_filename() {
const sstring get_filename() const {
return filename(component_type::Data);
}
const sstring& get_dir() const {
return _dir;
}
sstring toc_filename() const;
metadata_collector& get_metadata_collector() {
return _collector;
}
future<> create_links(sstring dir) const;
/**
* Note. This is using the Origin definition of
* max_data_age, which is load time. This could maybe
* be improved upon.
*/
gc_clock::time_point max_data_age() const {
return _now;
}
std::vector<sstring> component_filenames() const;
private:
sstable(size_t wbuffer_size, sstring ks, sstring cf, sstring dir, unsigned long generation, version_types v, format_types f, gc_clock::time_point now = gc_clock::now())
: sstable_buffer_size(wbuffer_size)
@@ -328,7 +349,7 @@ private:
gc_clock::time_point _now;
-const bool has_component(component_type f);
+const bool has_component(component_type f) const;
const sstring filename(component_type f) const;
@@ -360,7 +381,6 @@ private:
future<> read_statistics();
void write_statistics();
-future<> open_data();
future<> create_data();
future<index_list> read_indexes(uint64_t summary_idx);

View File

@@ -127,7 +127,7 @@ if __name__ == "__main__":
if test[0].startswith(os.path.join('build','debug')):
mode = 'debug'
xmlout = args.jenkins+"."+mode+"."+os.path.basename(test[0])+".boost.xml"
-path = path + " --output_format=XML --log_level=all --report_level=no --log_sink=" + xmlout
+path = path + " --output_format=XML --log_level=test_suite --report_level=no --log_sink=" + xmlout
print(path)
proc = subprocess.Popen(path.split(' '), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=env,preexec_fn=os.setsid)
signal.alarm(args.timeout)

View File

@@ -43,6 +43,7 @@ static atomic_cell make_atomic_cell(bytes value) {
SEASTAR_TEST_CASE(test_execute_batch) {
return do_with_cql_env([] (auto& e) {
db::system_keyspace::minimal_setup(e.db(), e.qp());
auto& qp = e.local_qp();
auto bp = make_lw_shared<db::batchlog_manager>(qp);

View File

@@ -310,13 +310,15 @@ SEASTAR_TEST_CASE(test_commitlog_delete_when_over_disk_limit){
cfg.commitlog_segment_size_in_mb = 2;
cfg.commitlog_total_space_in_mb = 1;
return make_commitlog(cfg).then([](tmplog_ptr log) {
auto sem = make_lw_shared<semaphore>(0);
// add a flush handler that simply says we're done with the range.
-auto r = log->second.add_flush_handler([log](cf_id_type id, replay_position pos) {
+auto r = log->second.add_flush_handler([log, sem](cf_id_type id, replay_position pos) {
log->second.discard_completed_segments(id, pos);
sem->signal();
});
auto set = make_lw_shared<std::set<segment_id_type>>();
auto uuid = utils::UUID_gen::get_time_UUID();
-return do_until([set]() {return set->size() > 1;},
+return do_until([set, sem]() {return set->size() > 1 && sem->try_wait();},
[log, set, uuid]() {
sstring tmp = "hej bubba cow";
return log->second.add_mutation(uuid, tmp.size(), [tmp](db::commitlog::output& dst) {
@@ -327,8 +329,9 @@ SEASTAR_TEST_CASE(test_commitlog_delete_when_over_disk_limit){
});
}).then([log]() {
auto n = log->second.get_active_segment_names().size();
+auto d = log->second.get_num_segments_destroyed();
BOOST_REQUIRE(n > 0);
-BOOST_REQUIRE(n < 2);
+BOOST_REQUIRE(d > 0);
}).finally([log, r = std::move(r)]() {
return log->second.clear().then([log] {});
});

View File

@@ -250,6 +250,14 @@ public:
return _qp->local();
}
distributed<database>& db() override {
return *_db;
}
distributed<cql3::query_processor>& qp() override {
return *_qp;
}
future<> start() {
return seastar::async([this] {
locator::i_endpoint_snitch::create_snitch("SimpleSnitch").get();

View File

@@ -24,6 +24,7 @@
#include <functional>
#include <vector>
#include <core/distributed.hh>
#include "core/sstring.hh"
#include "core/future.hh"
#include "core/shared_ptr.hh"
@@ -71,6 +72,10 @@ public:
virtual database& local_db() = 0;
virtual cql3::query_processor& local_qp() = 0;
virtual distributed<database>& db() = 0;
virtual distributed<cql3::query_processor> & qp() = 0;
};
future<::shared_ptr<cql_test_env>> make_env_for_test();

tests/memory_footprint.cc Normal file
View File

@@ -0,0 +1,227 @@
/*
* Copyright (C) 2015 Cloudius Systems, Ltd.
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/range/irange.hpp>
#include <seastar/util/defer.hh>
#include <seastar/core/app-template.hh>
#include <seastar/core/thread.hh>
#include "schema_builder.hh"
#include "memtable.hh"
#include "row_cache.hh"
#include "frozen_mutation.hh"
#include "tmpdir.hh"
#include "sstables/sstables.hh"
class size_calculator {
class nest {
public:
static thread_local int level;
nest() { ++level; }
~nest() { --level; }
};
static std::string prefix() {
std::string s(" ");
for (int i = 0; i < nest::level; ++i) {
s += "-- ";
}
return s;
}
public:
static void print_cache_entry_size() {
std::cout << prefix() << "sizeof(cache_entry) = " << sizeof(cache_entry) << "\n";
{
nest n;
std::cout << prefix() << "sizeof(decorated_key) = " << sizeof(dht::decorated_key) << "\n";
std::cout << prefix() << "sizeof(lru_link_type) = " << sizeof(cache_entry::lru_link_type) << "\n";
std::cout << prefix() << "sizeof(cache_link_type) = " << sizeof(cache_entry::cache_link_type) << "\n";
print_mutation_partition_size();
}
std::cout << "\n";
std::cout << prefix() << "sizeof(rows_entry) = " << sizeof(rows_entry) << "\n";
std::cout << prefix() << "sizeof(deletable_row) = " << sizeof(deletable_row) << "\n";
std::cout << prefix() << "sizeof(row) = " << sizeof(row) << "\n";
std::cout << prefix() << "sizeof(atomic_cell_or_collection) = " << sizeof(atomic_cell_or_collection) << "\n";
}
static void print_mutation_partition_size() {
std::cout << prefix() << "sizeof(mutation_partition) = " << sizeof(mutation_partition) << "\n";
{
nest n;
std::cout << prefix() << "sizeof(_static_row) = " << sizeof(mutation_partition::_static_row) << "\n";
std::cout << prefix() << "sizeof(_rows) = " << sizeof(mutation_partition::_rows) << "\n";
std::cout << prefix() << "sizeof(_row_tombstones) = " << sizeof(mutation_partition::_row_tombstones) <<
"\n";
}
}
};
thread_local int size_calculator::nest::level = 0;
static schema_ptr cassandra_stress_schema() {
return schema_builder("ks", "cf")
.with_column("KEY", bytes_type, column_kind::partition_key)
.with_column("C0", bytes_type)
.with_column("C1", bytes_type)
.with_column("C2", bytes_type)
.with_column("C3", bytes_type)
.with_column("C4", bytes_type)
.build();
}
[[gnu::unused]]
static mutation make_cs_mutation() {
auto s = cassandra_stress_schema();
mutation m(partition_key::from_single_value(*s, bytes_type->from_string("4b343050393536353531")), s);
for (auto&& col : s->regular_columns()) {
m.set_clustered_cell(clustering_key::make_empty(*s), col,
atomic_cell::make_live(1, bytes_type->from_string("8f75da6b3dcec90c8a404fb9a5f6b0621e62d39c69ba5758e5f41b78311fbb26cc7a")));
}
return m;
}
bytes random_bytes(size_t size) {
bytes result(bytes::initialized_later(), size);
for (size_t i = 0; i < size; ++i) {
result[i] = std::rand() % std::numeric_limits<uint8_t>::max();
}
return result;
}
sstring random_string(size_t size) {
sstring result(sstring::initialized_later(), size);
static const char chars[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
for (size_t i = 0; i < size; ++i) {
result[i] = chars[std::rand() % (sizeof(chars) - 1)]; // exclude the trailing '\0'
}
return result;
}
struct mutation_settings {
size_t column_count;
size_t column_name_size;
size_t row_count;
size_t partition_key_size;
size_t clustering_key_size;
size_t data_size;
};
static mutation make_mutation(mutation_settings settings) {
auto builder = schema_builder("ks", "cf")
.with_column("pk", bytes_type, column_kind::partition_key)
.with_column("ck", bytes_type, column_kind::clustering_key);
for (size_t i = 0; i < settings.column_count; ++i) {
builder.with_column(to_bytes(random_string(settings.column_name_size)), bytes_type);
}
auto s = builder.build();
mutation m(partition_key::from_single_value(*s, bytes_type->decompose(random_bytes(settings.partition_key_size))), s);
for (size_t i = 0; i < settings.row_count; ++i) {
auto ck = clustering_key::from_single_value(*s, bytes_type->decompose(random_bytes(settings.clustering_key_size)));
for (auto&& col : s->regular_columns()) {
m.set_clustered_cell(ck, col,
atomic_cell::make_live(1,
bytes_type->decompose(random_bytes(settings.data_size))));
}
}
return m;
}
struct sizes {
size_t memtable;
size_t cache;
size_t sstable;
size_t frozen;
};
static sizes calculate_sizes(const mutation& m) {
sizes result;
auto s = m.schema();
auto mt = make_lw_shared<memtable>(s);
cache_tracker tracker;
row_cache cache(s, mt->as_data_source(), tracker);
assert(tracker.region().occupancy().used_space() == 0);
assert(mt->occupancy().used_space() == 0);
mt->apply(m);
cache.populate(m);
result.memtable = mt->occupancy().used_space();
result.cache = tracker.region().occupancy().used_space();
result.frozen = freeze(m).representation().size();
tmpdir sstable_dir;
auto sst = make_lw_shared<sstables::sstable>(s->ks_name(), s->cf_name(),
sstable_dir.path,
1 /* generation */,
sstables::sstable::version_types::la,
sstables::sstable::format_types::big);
sst->write_components(*mt).get();
sst->load().get();
result.sstable = sst->data_size();
return result;
}
int main(int argc, char** argv) {
namespace bpo = boost::program_options;
app_template app;
app.add_options()
("column-count", bpo::value<size_t>()->default_value(5), "column count")
("column-name-size", bpo::value<size_t>()->default_value(2), "column name size")
("row-count", bpo::value<size_t>()->default_value(1), "row count")
("partition-key-size", bpo::value<size_t>()->default_value(10), "partition key size")
("clustering-key-size", bpo::value<size_t>()->default_value(10), "clustering key size")
("data-size", bpo::value<size_t>()->default_value(32), "cell data size");
return app.run(argc, argv, [&] {
return seastar::async([&] {
mutation_settings settings;
settings.column_count = app.configuration()["column-count"].as<size_t>();
settings.column_name_size = app.configuration()["column-name-size"].as<size_t>();
settings.row_count = app.configuration()["row-count"].as<size_t>();
settings.partition_key_size = app.configuration()["partition-key-size"].as<size_t>();
settings.clustering_key_size = app.configuration()["clustering-key-size"].as<size_t>();
settings.data_size = app.configuration()["data-size"].as<size_t>();
auto m = make_mutation(settings);
auto sizes = calculate_sizes(m);
std::cout << "mutation footprint:" << "\n";
std::cout << " - in cache: " << sizes.cache << "\n";
std::cout << " - in memtable: " << sizes.memtable << "\n";
std::cout << " - in sstable: " << sizes.sstable << "\n";
std::cout << " - frozen: " << sizes.frozen << "\n";
std::cout << "\n";
size_calculator::print_cache_entry_size();
});
});
}

View File

@@ -273,6 +273,7 @@ SEASTAR_TEST_CASE(test_multiple_memtables_one_partition) {
column_family::config cfg;
cfg.enable_disk_reads = false;
cfg.enable_disk_writes = false;
cfg.enable_incremental_backups = false;
return with_column_family(s, cfg, [s] (column_family& cf) {
const column_definition& r1_col = *s->get_column_definition("r1");
auto key = partition_key::from_exploded(*s, {to_bytes("key1")});
@@ -319,6 +320,7 @@ SEASTAR_TEST_CASE(test_flush_in_the_middle_of_a_scan) {
cfg.enable_disk_reads = true;
cfg.enable_disk_writes = true;
cfg.enable_cache = true;
cfg.enable_incremental_backups = false;
return with_column_family(s, cfg, [s](column_family& cf) {
return seastar::async([s, &cf] {
@@ -391,6 +393,7 @@ SEASTAR_TEST_CASE(test_multiple_memtables_multiple_partitions) {
column_family::config cfg;
cfg.enable_disk_reads = false;
cfg.enable_disk_writes = false;
cfg.enable_incremental_backups = false;
auto cm = make_lw_shared<compaction_manager>();
return do_with(make_lw_shared<column_family>(s, cfg, column_family::no_commitlog(), *cm), [s, cm] (auto& cf_ptr) mutable {
column_family& cf = *cf_ptr;

View File

@@ -980,6 +980,7 @@ SEASTAR_TEST_CASE(compaction_manager_test) {
column_family::config cfg;
cfg.datadir = tmp->path;
cfg.enable_commitlog = false;
cfg.enable_incremental_backups = false;
auto cf = make_lw_shared<column_family>(s, cfg, column_family::no_commitlog(), *cm);
cf->start();
cf->set_compaction_strategy(sstables::compaction_strategy_type::size_tiered);

View File

@@ -65,3 +65,9 @@ std::ostream& operator<<(std::ostream& os, const std::unordered_set<T>& items) {
os << "{" << join(", ", items) << "}";
return os;
}
template <typename T>
std::ostream& operator<<(std::ostream& os, const std::set<T>& items) {
os << "{" << join(", ", items) << "}";
return os;
}
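The hunk above adds a `std::set` printer next to the existing `std::unordered_set` one. A self-contained sketch of the same operator, with a hypothetical `join()` helper standing in for Scylla's (which is defined elsewhere in the tree):

```cpp
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-in for Scylla's join(): elements joined by a separator.
template <typename Container>
std::string join(const std::string& sep, const Container& items) {
    std::ostringstream os;
    bool first = true;
    for (const auto& item : items) {
        if (!first) { os << sep; }
        os << item;
        first = false;
    }
    return os.str();
}

// Mirrors the operator<< added in the diff: sets print as "{a, b, c}".
template <typename T>
std::ostream& operator<<(std::ostream& os, const std::set<T>& items) {
    os << "{" << join(", ", items) << "}";
    return os;
}

std::string format_set(const std::set<int>& s) {
    std::ostringstream os;
    os << s;
    return os.str();
}
```

Since `std::set` iterates in key order, the printed elements come out sorted, unlike the unordered_set overload.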

View File

@@ -27,7 +27,6 @@
#include <boost/assign.hpp>
#include <boost/locale/encoding_utf.hpp>
#include <boost/range/adaptor/sliced.hpp>
-#include <boost/range/algorithm/remove.hpp>
#include "cql3/statements/batch_statement.hh"
#include "service/migration_manager.hh"
@@ -206,17 +205,6 @@ cql_server::cql_server(distributed<service::storage_proxy>& proxy, distributed<c
{
}
-bool
-cql_server::poll_pending_responders() {
-while (!_pending_responders.empty()) {
-auto c = _pending_responders.front();
-_pending_responders.pop_front();
-c->do_flush();
-c->_flush_requested = false;
-}
-return false;
-}
scollectd::registrations
cql_server::setup_collectd() {
return {
@@ -431,16 +419,8 @@ future<> cql_server::connection::process()
}
}).finally([this] {
return _pending_requests_gate.close().then([this] {
-// Remove ourselves from poll list
-auto i = std::remove(_server._pending_responders.begin(), _server._pending_responders.end(), this);
-if (i != _server._pending_responders.end()) {
-_server._pending_responders.pop_back();
-}
-// prevent the connection from being added to the poller
-_flush_requested = true;
-return std::move(_ready_to_respond).then([this] {
-// do the final flush here since poller was disabled for the connection
-return _write_buf.flush();
+return _ready_to_respond.finally([this] {
+return _write_buf.close();
});
});
});
@@ -829,22 +809,12 @@ future<> cql_server::connection::write_response(shared_ptr<cql_server::response>
{
_ready_to_respond = _ready_to_respond.then([this, response = std::move(response)] () mutable {
return response->output(_write_buf, _version).then([this, response] {
-if (!_flush_requested) {
-_flush_requested = true;
-_server._pending_responders.push_back(this);
-}
return _write_buf.flush();
});
});
return make_ready_future<>();
}
-void
-cql_server::connection::do_flush() {
-_ready_to_respond = _ready_to_respond.then([this] {
-return _write_buf.flush();
-});
-}
void cql_server::connection::check_room(temporary_buffer<char>& buf, size_t n)
{
if (buf.size() < n) {

View File

@@ -67,7 +67,6 @@ struct [[gnu::packed]] cql_binary_frame_v3 {
class cql_server {
class event_notifier;
-class connection;
static constexpr int current_version = 3;
@@ -76,10 +75,7 @@ class cql_server {
distributed<cql3::query_processor>& _query_processor;
std::unique_ptr<scollectd::registrations> _collectd_registrations;
std::unique_ptr<event_notifier> _notifier;
-circular_buffer<connection*> _pending_responders;
-reactor::poller _poller{[this] { return poll_pending_responders(); }}; // FIXME: register before tcp poller
private:
-bool poll_pending_responders();
scollectd::registrations setup_collectd();
uint64_t _connects = 0;
uint64_t _connections = 0;
@@ -92,6 +88,7 @@ public:
future<> stop();
private:
class fmt_visitor;
+class connection;
class response;
friend class type_codec;
};
@@ -153,13 +150,11 @@ class cql_server::connection {
serialization_format _serialization_format = serialization_format::use_16_bit();
service::client_state _client_state;
std::unordered_map<uint16_t, cql_query_state> _query_states;
-bool _flush_requested = false;
public:
connection(cql_server& server, connected_socket&& fd, socket_address addr);
~connection();
future<> process();
future<> process_request();
-void do_flush();
private:
future<> process_request_one(temporary_buffer<char> buf,
@@ -215,7 +210,6 @@ private:
void init_serialization_format();
friend event_notifier;
friend class cql_server;
};
}

View File

@@ -225,7 +225,7 @@ struct string_type_impl : public abstract_type {
}
} else {
try {
-boost::locale::conv::utf_to_utf<char>(v.data(), boost::locale::conv::stop);
+boost::locale::conv::utf_to_utf<char>(v.data(), v.end(), boost::locale::conv::stop);
} catch (const boost::locale::conv::conversion_error& ex) {
throw marshal_exception(ex.what());
}
@@ -1182,6 +1182,8 @@ struct empty_type_impl : abstract_type {
logging::logger collection_type_impl::_logger("collection_type_impl");
const size_t collection_type_impl::max_elements;
+thread_local std::unordered_map<data_type, shared_ptr<cql3::cql3_type>> collection_type_impl::_cql3_type_cache;
const collection_type_impl::kind collection_type_impl::kind::map(
[] (shared_ptr<cql3::column_specification> collection, bool is_key) -> shared_ptr<cql3::column_specification> {
// FIXME: implement
@@ -1241,14 +1243,16 @@ collection_type_impl::is_compatible_with(const abstract_type& previous) const {
shared_ptr<cql3::cql3_type>
collection_type_impl::as_cql3_type() const {
-if (!_cql3_type) {
+auto ret = _cql3_type_cache[shared_from_this()];
+if (!ret) {
auto name = cql3_type_name();
if (!is_multi_cell()) {
name = "frozen<" + name + ">";
}
-_cql3_type = make_shared<cql3::cql3_type>(name, shared_from_this(), false);
+ret = make_shared<cql3::cql3_type>(name, shared_from_this(), false);
+_cql3_type_cache[shared_from_this()] = ret;
}
-return _cql3_type;
+return ret;
}
bytes
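The change above replaces a `mutable` per-instance cache member with a `thread_local` map keyed by the type, so the cached `cql3_type` is built lazily per thread instead of mutating shared state. A minimal sketch of that memoization pattern; the names (`cql3_type_stub`, `as_cql3_type`) are illustrative, not Scylla's actual API:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for the cql3_type object being cached.
struct cql3_type_stub {
    std::string name;
};

// Thread-local memoization: each thread (shard) keeps its own cache,
// mirroring the _cql3_type_cache added in the diff.
std::shared_ptr<cql3_type_stub> as_cql3_type(const std::string& type_key, bool multi_cell) {
    static thread_local std::unordered_map<std::string,
                                           std::shared_ptr<cql3_type_stub>> cache;
    auto& slot = cache[type_key];   // default-constructed (null) on first lookup
    if (!slot) {
        auto name = type_key;
        if (!multi_cell) {
            name = "frozen<" + name + ">";   // same frozen<> wrapping as the diff
        }
        slot = std::make_shared<cql3_type_stub>(cql3_type_stub{name});
    }
    return slot;
}
```

Repeated calls with the same key return the same shared object, and no cross-thread sharing occurs, which is what the `thread_local` buys over a `mutable` member on a type instance shared between shards.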

View File

@@ -387,7 +387,7 @@ bool equal(data_type t, bytes_view e1, bytes_view e2) {
class collection_type_impl : public abstract_type {
static logging::logger _logger;
-mutable shared_ptr<cql3::cql3_type> _cql3_type; // initialized on demand, so mutable
+static thread_local std::unordered_map<data_type, shared_ptr<cql3::cql3_type>> _cql3_type_cache; // initialized on demand
public:
static constexpr size_t max_elements = 65535;

View File

@@ -96,6 +96,10 @@ public:
}
virtual void close() override { }
virtual size_t memory_size() override {
return sizeof(_hash_count) + _bitset.memory_size();
}
};
struct murmur3_bloom_filter: public bloom_filter {
@@ -118,6 +122,10 @@ struct always_present_filter: public i_filter {
virtual void clear() override { }
virtual void close() override { }
virtual size_t memory_size() override {
return 0;
}
};
filter_ptr create_filter(int hash, large_bitset&& bitset);

View File

@@ -21,6 +21,7 @@
#include "compaction_manager.hh"
#include "database.hh"
#include "core/scollectd.hh"
static logging::logger cmlog("compaction_manager");
@@ -46,6 +47,7 @@ void compaction_manager::task_start(lw_shared_ptr<compaction_manager::task>& tas
_stats.pending_tasks--;
}
_stats.active_tasks++;
return task->compacting_cf->run_compaction().then([this, task] {
// If compaction completed successfully, let's reset
// sleep time of compaction_retry.
@@ -64,6 +66,8 @@ void compaction_manager::task_start(lw_shared_ptr<compaction_manager::task>& tas
task->compacting_cf = nullptr;
_stats.completed_tasks++;
}).finally([this] {
_stats.active_tasks--;
});
});
}).then_wrapped([this, task] (future<> f) {
@@ -139,9 +143,22 @@ compaction_manager::~compaction_manager() {
assert(_stopped == true);
}
void compaction_manager::register_collectd_metrics() {
auto add = [this] (auto type_name, auto name, auto data_type, auto func) {
_registrations.push_back(
scollectd::add_polled_metric(scollectd::type_instance_id("compaction_manager",
scollectd::per_cpu_plugin_instance,
type_name, name),
scollectd::make_typed(data_type, func)));
};
add("objects", "compactions", scollectd::data_type::GAUGE, [&] { return _stats.active_tasks; });
}
void compaction_manager::start(int task_nr) {
_stopped = false;
_tasks.reserve(task_nr);
register_collectd_metrics();
for (int i = 0; i < task_nr; i++) {
auto task = make_lw_shared<compaction_manager::task>();
task_start(task);
@@ -150,6 +167,7 @@ void compaction_manager::start(int task_nr) {
}
future<> compaction_manager::stop() {
_registrations.clear();
return do_for_each(_tasks, [this] (auto& task) {
return this->task_stop(task);
}).then([this] {
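The compaction_manager hunks above follow a register-at-start / clear-at-stop pattern: polled collectd metrics capture `this` in a callback, so the registrations must be dropped (in `stop()`) before the manager goes away. A minimal sketch with illustrative types standing in for scollectd's:

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for a scollectd registration: a named polling callback.
struct registration {
    std::string name;
    std::function<long()> poll;
};

struct manager_stub {
    long active_tasks = 0;
    std::vector<registration> registrations;

    // As in register_collectd_metrics(): the callback samples live state.
    void register_metrics() {
        registrations.push_back({"compactions", [this] { return active_tasks; }});
    }

    // As in compaction_manager::stop(): drop callbacks first, so nothing
    // polls this object while (or after) it shuts down.
    void stop() {
        registrations.clear();
    }
};
```

The ordering matters: clearing `registrations` before tearing down tasks is what keeps a poller from invoking a callback into a half-destroyed manager.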

View File

@@ -42,6 +42,7 @@ public:
struct stats {
int64_t pending_tasks = 0;
int64_t completed_tasks = 0;
uint64_t active_tasks = 0; // Number of compactions in progress.
};
private:
struct task {
@@ -64,6 +65,7 @@ private:
bool _stopped = true;
stats _stats;
std::vector<scollectd::registration> _registrations;
private:
void task_start(lw_shared_ptr<task>& task);
future<> task_stop(lw_shared_ptr<task>& task);
@@ -73,6 +75,8 @@ public:
compaction_manager();
~compaction_manager();
void register_collectd_metrics();
// Creates N fibers that will allow N compaction jobs to run in parallel.
// Defaults to only one fiber.
void start(int task_nr = 1);

View File

@@ -58,7 +58,7 @@ public:
double old_m = mean;
double old_s = variance;
-mean = old_m + ((value - old_m) / (total + 1));
+mean = ((double)(sum + value)) / (total + 1);
variance = old_s + ((value - old_m) * (value - mean));
}
sum += value;
@@ -81,7 +81,7 @@ public:
* Call set_latency, that would start a latency object if needed.
*/
bool should_sample() const {
-return total & sample_mask;
+return total == 0 || (count & sample_mask);
}
/**
* Set the latency according to the sample rate.
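The changed line above derives the mean from the running `sum` rather than nudging the previous mean, while keeping the Welford-style variance accumulator. A simplified, self-contained sketch of the fixed update (not the actual Scylla histogram class; field names are illustrative):

```cpp
// Running statistics with the fixed mean update: recompute from the sum,
// keep the Welford-style M2 accumulator for variance.
struct running_stats {
    double sum = 0;
    double mean = 0;
    double variance = 0;   // sum of squared deviations (M2), as in the diff
    long total = 0;

    void add(double value) {
        double old_m = mean;
        double old_s = variance;
        mean = (sum + value) / (total + 1);              // the corrected line
        variance = old_s + (value - old_m) * (value - mean);
        sum += value;
        ++total;
    }
};
```

For values {2, 4, 6} this yields mean 4 and M2 = 8, matching the direct computation; dividing M2 by `total` or `total - 1` gives the population or sample variance.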

View File

@@ -58,6 +58,8 @@ struct i_filter {
virtual void clear() = 0;
virtual void close() = 0;
virtual size_t memory_size() = 0;
/**
* @return The smallest bloom_filter that can provide the given false
* positive probability rate for the given number of elements.

View File

@@ -52,6 +52,11 @@ public:
size_t size() const {
return _nr_bits;
}
size_t memory_size() const {
return block_size() * _storage.size() + sizeof(_nr_bits);
}
bool test(size_t idx) const {
auto idx1 = idx / bits_per_block();
idx %= bits_per_block();

View File

@@ -378,7 +378,7 @@ public:
auto i = _segments.find(seg);
assert(i != _segments.end());
_segments.erase(i);
-delete seg;
+::free(seg);
}
segment* containing_segment(void* obj) const {
uintptr_t addr = reinterpret_cast<uintptr_t>(obj);
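The one-line fix above pairs the segment's release with `::free()`, on the assumption that segments come from a C allocator (malloc-family memory must never be released with `delete`, which is undefined behavior). A minimal sketch of the matching allocate/free discipline; the `segment` type and helpers here are illustrative, not the actual LSA code:

```cpp
#include <cstdlib>
#include <new>

// Illustrative segment: a fixed-size block of raw storage.
struct segment {
    char data[4096];
};

segment* allocate_segment() {
    // C allocation, as a log-structured allocator might use under the hood.
    void* p = std::malloc(sizeof(segment));
    if (!p) {
        throw std::bad_alloc();
    }
    return new (p) segment();   // placement-new: constructs, does not allocate
}

void free_segment(segment* seg) {
    seg->~segment();   // run the destructor explicitly...
    ::free(seg);       // ...then release with the matching deallocator
}
```

The rule is symmetry: `malloc`/`free`, `new`/`delete`, placement-new/explicit-destructor-plus-original-deallocator; mixing families, as the removed `delete seg;` did, is the bug.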