scylladb

Author	SHA1	Message	Date
Gleb Natapov	2d630e068b	mutation_query_test: add test for result size calculation Check that digest only and digest+data query calculate result size to be the same. Message-Id: <20180906153800.GK2326@scylladb.com> (cherry picked from commit `9e438933a2`)	2018-09-08 18:55:23 +03:00
Gleb Natapov	5a8e9698d8	mutation_partition: accurately account for result size in digest only queries When measuring_output_stream is used to calculate a result element's size, it incorrectly takes into account not only the serialized element size but also a placeholder that the ser::qr_partition__rows/qr_partition__static_row__cells constructors put at the beginning. Fix it by taking the starting point in the stream before element serialization and subtracting it afterwards. Fixes #3755 Message-Id: <20180906153609.GJ2326@scylladb.com> (cherry picked from commit `d7674288a9`)	2018-09-08 18:55:23 +03:00
Gleb Natapov	64f1aa8d99	mutation_partition: correctly measure static row size when doing digest calculation The code uses incorrect output stream in case only digest is requested and thus getting incorrect data size. Failing to correctly account for static row size while calculating digest may cause digest mismatch between digest and data query. Fixes #3753. Message-Id: <20180905131219.GD2326@scylladb.com> (cherry picked from commit `98092353df`)	2018-09-06 16:51:31 +03:00
Eliran Sinvani	280e6eedb9	cql3: ensure repeated values in IN clauses don't return repeated rows When the list of values in the IN list of a single column contains duplicates, multiple executors are activated, since the assumption is that each value in the IN list corresponds to a different partition. This results in the same row appearing in the result as many times as the partition value is duplicated. Added queries for the IN restriction unit test and fixed with a bad result check. Fixes #2837 Tests: Queries as in the use case from the GitHub issue in both forms, prepared and plain (using the Python driver), unit test. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com> Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com> (cherry picked from commit `d734d316a6`)	2018-08-26 15:52:18 +03:00
Tomasz Grabiec	f80f15a6af	Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte Fixes a regression introduced in `9e88b60ef5`, which broke the lookup for prefetched values of lists when a clustering key is specified. This is the code that was removed from some list operations: std::experimental::optional<clustering_key> row_key; if (!column.is_static()) { row_key = clustering_key::from_clustering_prefix(params._schema, prefix); } ... auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column); Put it back, in the form of common code in the update_parameters class. Fixes #3703 https://github.com/duarten/scylla cql-list-fixes/v1: tests/cql_query_test: Test multi-cell static list updates with ckeys cql3/lists: Fix multi-cell static list updates in the presence of ckeys keys: Add factory for an empty clustering_key_prefix_view (cherry picked from commit `6937cc2d1c`)	2018-08-21 17:37:36 +01:00
Duarte Nunes	d0eb0c0b90	cql3/query_options: Use _value_views in prepare() _value_views is the authoritative data structure for the client-specified values. Indeed, the ctor called transport::request::read_options() leaves _values completely empty. In query_options::prepare() we were, however, using _values to associate values with the client-specified column names, and not _value_views. Fix this by using _value_views instead. As for the reasons we didn't see this bug earlier, I assume it's because very few drivers set the 0x04 query options flag, which means column names are omitted. This is the right thing to do since most drivers have enough information to correctly position the values. Fixes #3688 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180814234605.14775-1-duarte@scylladb.com> (cherry picked from commit `a4355fe7e7`)	2018-08-21 18:24:06 +03:00
Jesse Haber-Kucharsky	1427c4d428	auth: Don't use unsupported hashing algorithms In previous versions of Fedora, the `crypt_r` function returned `nullptr` when a requested hashing algorithm was not supported. This is consistent with the documentation of the function in its man page. As of Fedora 28, the function's behavior changes so that the encrypted text is not `nullptr` on error, but instead the string "*0". The info pages for `crypt_r` clarify somewhat (and contradict the man pages): Some implementations return `NULL` on failure, and others return an _invalid_ hashed passphrase, which will begin with a `*` and will not be the same as SALT. Because of this change of behavior, users running Scylla on a Fedora 28 machine which was upgraded from a previous release would not be able to authenticate: an unsupported hashing algorithm would be selected, producing encrypted text that did not match the entry in the table. With this change, unsupported algorithms are correctly detected and users should be able to continue to authenticate themselves. Fixes #3637. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com> (cherry picked from commit `fce10f2c6e`)	2018-08-05 10:30:58 +03:00
Gleb Natapov	034f2cb42d	cache_hitrate_calculator: fix race when new table is added during calculations The calculation consists of several parts with preemption point between them, so a table can be added while calculation is ongoing. Do not assume that table exists in intermediate data structure. Fixes #3636 Message-Id: <20180801093147.GD23569@scylladb.com> (cherry picked from commit `44a6afad8c`)	2018-08-01 14:30:58 +03:00
Amos Kong	e043a5c276	scylla_setup: fix conditional statement of silent mode Commit `300af65555` introduced a problem in the conditional statement: the script always aborts in silent mode, regardless of the return value. Fixes #3485 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <1c12ab04651352964a176368f8ee28f19ae43c68.1528077114.git.amos@scylladb.com> (cherry picked from commit `364c2551c8`)	2018-07-25 12:34:11 +03:00
Takuya ASADA	5da9bd3a6e	dist/common/scripts/scylla_setup: abort running script when one of setup failed in silent mode The current script silently continues even if one of the setup steps fails; it needs to abort. Fixes #3433 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180522180355.1648-1-syuu@scylladb.com> (cherry picked from commit `300af65555`)	2018-07-25 12:34:11 +03:00
Avi Kivity	3578027e2e	Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz " The problem happens under the following circumstances: - we have a partially populated partition in cache, with a gap in the middle - a read with no clustering restrictions trying to populate that gap - eviction of the entry for the lower bound of the gap concurrent with population The population may incorrectly mark the range before the gap as continuous. This may result in temporary loss of writes in that clustering range. The problem heals by clearing cache. Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been failing sporadically. The problem is in ensure_population_lower_bound(), which returns true if current clustering range covers all rows, which means that the populator has a right to set continuity flag to true on the row it inserts. This is correct only if the current population range actually starts since before all clustering rows. Otherwise, we're populating since _last_row and should consult it. Fixes #3608. " * 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla: row_cache: Fix violation of continuity on concurrent eviction and population position_in_partition: Introduce is_before_all_clustered_rows() (cherry picked from commit `31151cadd4`)	2018-07-25 12:34:11 +03:00
Shlomi Livne	7d2150a057	release: prepare for 2.1.6 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-07-01 22:35:26 +03:00
Avi Kivity	afd3c571cc	Merge "Backport "Disable sstable filtering based on min/max clustering key components" to 2.1" from Tomasz " Changes made: - switched the test to use do_with_cql_env_thread due to lack of SEASTAR_TEST_CASE_THREAD macro - imported make_local_key() from master, needed for the database_test to pass " * tag 'tgrabiec/disable-min-max-sstable-filtering-v1-branch-2.1' of github.com:tgrabiec/scylla: Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz tests: simple_schema: Generate local keys from make_pkeys() tests: Import make_local_key() from master	2018-06-28 12:41:00 +03:00
Avi Kivity	093c8512db	Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz " With DateTiered and TimeWindow, there is a read optimization enabled which excludes sstables based on overlap with recorded min/max values of clustering key components. The problem is that it doesn't take into account partition tombstones and static rows, which should still be returned by the reader even if there is no overlap in the query's clustering range. A read which returns no clustering rows can mispopulate cache, which will appear as partition deletion or writes to the static row being lost until node restart or eviction of the partition entry. There is also a bad interaction between cache population on read and that optimization. When the clustering range of the query doesn't overlap with any sstable, the reader will return no partition markers for the read, which leads the cache populator to assume there is no partition in sstables, and it will cache an empty partition. This will cause later reads of that partition to miss prior writes to that partition until it is evicted from cache or the node is restarted. Disable until a more elaborate fix is implemented. Fixes #3552 Fixes #3553 " * tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla: tests: Add test for slicing a mutation source with date tiered compaction strategy tests: Check that database conforms to mutation source database: Disable sstable filtering based on min/max clustering key components (cherry picked from commit `e1efda8b0c`)	2018-06-28 11:10:41 +02:00
Tomasz Grabiec	9c0b8ec736	tests: simple_schema: Generate local keys from make_pkeys() Extracted from commit `2b0b703615`	2018-06-28 11:10:41 +02:00
Tomasz Grabiec	1794b732b0	tests: Import make_local_key() from master Imported from master at 8a25bd467c69df94ea3f3638b42d36beee20adf0	2018-06-28 11:10:41 +02:00
Avi Kivity	c1ac4fb8b0	Update seastar submodule * seastar 2a2c1d2...c89c8b8 (1): > tests/test-utils: Add macro for running tests within a seastar thread Needed for tests in the following patch.	2018-06-28 10:00:05 +03:00
Asias He	2e7e59fb50	gossip: Fix tokens assignment in assassinate_endpoint The tokens vector is defined a few lines above and is needed outside the if block. Do not redefine it in the if block, otherwise the tokens will be empty. Found by code inspection. Fixes #3551. Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com> (cherry picked from commit `c3b5a2ecd5`)	2018-06-27 12:00:58 +03:00
Vladimir Krivopalov	af29d4bed3	Fix Scylla compilation with Crypto++ v6. In Crypto++ v6, the `byte` typedef has been moved from the global namespace to the `CryptoPP::` namespace. This fix brings in the CryptoPP namespace so that the `byte` typedef is seen with both old and new versions of Crypto++. Fixes #3252. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <799d055be710231884d101a52c0be8ed8b0a9806.1520125889.git.vladimir@scylladb.com> (cherry picked from commit `99bd5180ba`)	2018-06-25 17:49:32 +03:00
Shlomi Livne	72494bbe05	release: prepare for 2.1.5 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-06-19 09:05:55 +03:00
Avi Kivity	5784823888	Update scylla-ami submodule * dist/ami/files/scylla-ami c5d9e96...0df779d (1): > scylla_install_ami: Update CentOS to latest version Fixes #3523.	2018-06-17 12:12:21 +03:00
Takuya ASADA	a7633be1a9	Revert "dist/ami: update CentOS base image to latest version" This reverts commit `69d226625a`. Since ami-4bf3d731 is Market Place AMI, not possible to publish public AMI based on it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180523112414.27307-1-syuu@scylladb.com> (cherry picked from commit `55d6be9254`)	2018-06-17 11:33:55 +03:00
Takuya ASADA	e78ded74ce	dist/debian: add --jobs <njobs> option just like build_rpm.sh In some build environments we may want to limit the number of parallel jobs: ninja-build runs ncpus jobs by default, which may be too many since g++ uses a lot of memory. So support --jobs <njobs> just like in the rpm build script. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180425205439.30053-1-syuu@scylladb.com> (cherry picked from commit `782ebcece4`)	2018-06-14 15:05:09 +03:00
Avi Kivity	6615c2a6a9	database: stop using incremental selectors There is a bug in incremental_selector for partitioned_sstable_set, so until it is found, stop using it. This degrades scan performance of Leveled Compaction Strategy tables. Fixes #3513. (as a workaround) Introduced: 2.1 Message-Id: <20180613131547.19084-1-avi@scylladb.com> (cherry picked from commit `aeffbb6732`)	2018-06-14 10:52:39 +03:00
Vlad Zolotarov	11500ccd3a	locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting() ec2_snitch::gossiper_starting() calls the base class (default) method, which sets _gossip_started to TRUE and thereby prevents the subsequent reconnectable_snitch_helper registration. Fixes #3454 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com> (cherry picked from commit `2dde372ae6`)	2018-06-14 10:52:39 +03:00
Shlomi Livne	955f3eeb56	release: prepare for 2.1.4 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-06-06 11:27:01 +03:00
Avi Kivity	08bfd96774	Update seastar submodule * seastar 675acd5...2a2c1d2 (1): > tls: Ensure handshake always drains output before return/throw Fixes #3461.	2018-05-31 12:06:13 +03:00
Mika Eloranta' via ScyllaDB development	f6c4d558eb	build: fix rpm build script --jobs N handling Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands and makes the --jobs argument work as supposed. Signed-off-by: Mika Eloranta <mel@aiven.io> Message-Id: <20180113212904.85907-1-mel@aiven.io> (cherry picked from commit `7266446227`)	2018-05-27 10:25:26 +03:00
Avi Kivity	0040ff6de2	Update seastar submodule * seastar 0e6dcd5...675acd5 (1): > net/tls: Wait for output to be sent when shutting down Fixes #3459.	2018-05-24 12:03:10 +03:00
Glauber Costa	c238bc7a81	commitlog: don't move pointer to segment We are currently moving the pointer we acquired to the segment inside the lambda in which we'll handle the cycle. The problem is, we also use that same pointer inside the exception handler. If an exception happens we'll access it and we'll crash. Probably #3440. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180518125820.10726-1-glauber@scylladb.com> (cherry picked from commit `596a525950`)	2018-05-19 19:13:58 +03:00
Avi Kivity	3b984a4293	dist: redhat: get rid of raid0.devices_discard_performance This parameter is not available on recent Red Hat kernels or on non-Red Hat kernels (it was removed on 3.10.0-772.el7, RHBZ 1455932). The presence of the parameter on kernels that don't support it causes the module load to fail, with the result that the storage is not available. Fix by removing the parameter. For someone running an older Red Hat kernel the effect will be that discard is disabled, but they can fix that by updating the kernel. For someone running a newer kernel, the effect will be that they can access their data. Fixes #3437. Message-Id: <20180516134913.6540-1-avi@scylladb.com> (cherry picked from commit `3b8118d4e5`)	2018-05-19 19:13:58 +03:00
Takuya ASADA	156761d77e	dist/ami: update CentOS base image to latest version Since we require an updated version of systemd, we need to update the CentOS base image. Fixes #3184 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1518118694-23770-1-git-send-email-syuu@scylladb.com> Conflicts: dist/ami/build_ami.sh Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20180508083521.18661-1-syuu@scylladb.com>	2018-05-19 19:13:58 +03:00
Avi Kivity	8e33e80ad3	release: prepare for 2.1.3	2018-04-25 09:01:30 +03:00
Duarte Nunes	c35dd86c87	db/schema_tables: Only drop UDTs after merging tables Dropping a user type requires that all tables using that type also be dropped. However, a type may appear to be dropped at the same time as a table, for instance due to the order in which a node receives schema notifications, or when dropping a keyspace. When dropping a table, if we build a schema in a shard through a global_schema_pointer, then we'll check for the existence of any user type the schema employs. We thus need to ensure types are only dropped after tables, similarly to how it's done for keyspaces. Fixes #3068 Tests: unit-tests (release) Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180129114137.85149-1-duarte@scylladb.com> (cherry picked from commit `1e3fae5bef`)	2018-04-25 01:15:25 +03:00
Pekka Enberg	87cb8a1fa4	release: prepare for 2.1.2	2018-04-17 09:45:00 +03:00
Takuya ASADA	26f3340c32	dist/debian: use ~root as HOME to place .pbuilderrc When 'always_set_home' is specified in /etc/sudoers, pbuilder won't read .pbuilderrc from the current user's home directory, and we don't have a way to change that behavior from a sudo command parameter. So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo is executed; this works both in environments that specify always_set_home and in those that don't. Fixes #3366 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `ace44784e8`)	2018-04-17 09:39:15 +03:00
Avi Kivity	aaba093371	Update seastar submodule * seastar af1b789...0e6dcd5 (1): > tls: Ensure we always pass through semaphores on shutdown Fixes #3358.	2018-04-14 20:52:02 +03:00
Gleb Natapov	a64c6e6be9	cql_server: fix a race between closing of a connection and notifier registration There is a race between cql connection closure and notifier registration. If a connection is closed before notification registration is complete, a stale pointer to the connection will remain in the notification list, since the attempt to unregister the connection will happen too early. The fix is to move notifier unregistration after the connection's gate is closed, which ensures that there is no outstanding registration request. But this means that a connection with a closed gate can now be in the notifier list, so with_gate() may throw and abort a notifier loop. Fix that by replacing with_gate() with a call to is_closed(). Fixes: #3355 Tests: unit(release) Message-Id: <20180412134744.GB22593@scylladb.com> (cherry picked from commit `1a9aaece3e`)	2018-04-12 16:57:18 +03:00
Duarte Nunes	c83d2d0d77	db/view: Reject view entries with non-composite, empty partition key Empty partition keys are not supported on normal tables - they cannot be inserted or queried (surprisingly, the rules for composite partition keys are different: all components are then allowed to be empty). However, the (non-composite) partition key of a view could end up being empty if that column is: a base table regular column, a base table clustering key column, or a base table partition key column, part of a composite key. Fixes #3262 Refs CASSANDRA-14345 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180403122244.10626-1-duarte@scylladb.com> (cherry picked from commit `ec8960df45`)	2018-04-03 19:08:38 +03:00
Asias He	0aa49d0311	gossip: Relax generation max difference check start node 1 2 3 shutdown node2 shutdown node1 and node3 start node1 and node3 nodetool removenode node2 clean up all scylla data on node2 bootstrap node2 as a new node I saw node2 could not bootstrap, stuck waiting for schema information to complete forever: On node1, node3 [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704 On node2 [shard 0] storage_service - JOINING: waiting for schema information to complete This is because in the nodetool removenode operation, the generation of node1 was increased from 0 to 2. gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe(); gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe(); Each force_newer_generation_unsafe increases the generation by 1. Here is an example, Before nodetool removenode: ``` curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool { "addrs": "127.0.0.2", "generation": 0, "is_alive": false, "update_time": 1521778757334, "version": 0 }, ``` After nodetool removenode: ``` curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool { "addrs": "127.0.0.2", "application_state": [ { "application_state": 0, "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246", "version": 214 }, { "application_state": 6, "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a", "version": 153 } ], "generation": 2, "is_alive": false, "update_time": 1521779276246, "version": 0 }, ``` In gossiper::apply_state_locally, we have this check: ``` if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) { // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself) 
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation); } ``` to skip the gossip update. To fix, we relax the generation max difference check to allow the generation of a removed node. After this patch, the removed node bootstraps successfully. Tests: dtest:update_cluster_layout_tests.py Fixes #3331 Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com> (cherry picked from commit `f539e993d3`)	2018-04-03 19:08:38 +03:00
Shlomi Livne	cce455b1f5	release: prepare for 2.1.1 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-03-25 09:32:02 +03:00
Avi Kivity	6772f3806b	tests: mutation_source_test: fix scattering of partition tombstone The partition tombstone is not part of a mutation_fragment in the old streamed_mutation, so it was not scattered correctly by fragment_scatterer. This causes test failures if the mutations to be scattered have a partition tombstone. Fix by calling consume(tombstone) directly. This isn't nice, but the code is dead anyway.	2018-03-24 15:15:02 +03:00
Avi Kivity	6c9d699835	Merge "Fix abort during counter table read-on-delete" from Tomasz " This fixes an abort in an sstable reader when querying a partition with no clustering ranges (happens on counter table mutation with no live rows) which also doesn't have any static columns. In such case, the sstable_mutation_reader will setup the data_consume_context such that it only covers the static row of the partition, knowing that there is no need to read any clustered rows. See partition.cc::advance_to_upper_bound(). Later when the reader is done with the range for the static row, it will try to skip to the first clustering range (missing in this case). If clustering_ranges_walker tells us to skip to after_all_clustering_rows(), we will hit an assert inside continuous_data_consumer::fast_forward_to() due to attempt to skip past the original data file range. If clustering_ranges_walker returns before_all_clustering_rows() instead, all is fine because we're still at the same data file position. Fixes #3304. " * 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev: tests: mutation_source_test: Test reads with no clustering ranges and no static columns tests: simple_schema: Allow creating schema with no static column clustering_ranges_walker: Stop after static row in case no clustering ranges (cherry picked from commit `054854839a`)	2018-03-23 10:47:23 +03:00
Vlad Zolotarov	a75e1632c8	test.py: limit the tests to run on 2 shards with 4GB of memory Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> (cherry picked from commit `57a6ed5aaa`)	2018-03-22 12:45:25 +02:00
Jesse Haber-Kucharsky	c5718bf620	auth: Fix improper sharing of sharded `service` This change is backported from `092f2e659c`. Previously, the sharded permissions cache was only accessible to the implementation of `auth::service` in `auth/service.cc`. The intention was that invoking `auth::service::get_permissions` on shard `k` would query the cache on shard `k`, which would in turn depend on `auth::service` on shard k to check for superuser status. The problem is in `auth::service::start`. `seastar::sharded<auth::permissions_cache>::start` is invoked with `*this` of shard 0, causing all instances of the cache to reference the same object. I wasn't able to locally reproduce errors or crashes due to this bug when I compiled a release build of Scylla. However, running a debug build meant that the glorious `seastar::debug_shared_ptr_counter_type` quickly saved the day with its checks that `seastar::shared_ptr` isn't being misused. To eliminate this problem, we move ownership of a single instance of `auth::permissions_cache` to a single instance of `auth::service`. When `auth::service` is sharded, so is the permissions cache. I verified interactively that no assertions failed in debug mode with this change. Fixes #3296. Tests: unit (debug, release) Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <280a889f551180db1c00d8a80eddf85b2ff0ac60.1521696176.git.jhaberku@scylladb.com>	2018-03-22 10:04:50 +02:00
Duarte Nunes	2315fcd6cf	gms/gossiper: Synchronize endpoint state destruction In gossiper::handle_major_state_change() we set the endpoint_state for a particular endpoint and replicate the changes to other cores. This is totally unsynchronized with the execution of gossiper::evict_from_membership(), which can happen concurrently, and can remove the very same endpoint from the map (in all cores). Replicating the changes to other cores in handle_major_state_change() can interleave with replicating the changes to other cores in evict_from_membership(), and result in an undefined final state. Another issue happened in debug mode dtests, where a fiber executes handle_major_state_change(), calls into the subscribers, of which storage_service is one, and ultimately lands on storage_service::update_peer_info(), which iterates over the endpoint's application state with deferring points in between (to update a system table). gossiper::evict_from_membership() was executed concurrently by another fiber, which freed the state the first one is iterating over. Fixes #3299. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180318123211.3366-1-duarte@scylladb.com> (cherry picked from commit `810db425a5`)	2018-03-18 14:55:32 +02:00
Asias He	8c5464d2fd	range_streamer: Stream 10% of ranges instead of 10 ranges per time If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per stream plan will cause a large number of stream plans to be created, each carrying very little data. As a result each stream plan has low transfer bandwidth, so the total time to complete the streaming increases. It makes more sense to send a percentage of the total ranges per stream plan than a fixed number of ranges. Here is an example of streaming a keyspace with 513 ranges in total, 10 ranges vs. 10% of ranges: Before: [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=system_traces, 510 out of 513 ranges: ranges = 51 [shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1 succeeded, took 107 seconds After: [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=system_traces, 510 out of 513 ranges: ranges = 10 [shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1 succeeded, took 22 seconds Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com> (cherry picked from commit `9b5585ebd5`)	2018-03-14 10:13:01 +02:00
Asias He	346d2788e3	Revert "streaming: Do not abort session too early in idle detection" This reverts commit `f792c78c96`. With the "Use range_streamer everywhere" (`7217b7ab36`) series, all users of streaming now stream relatively small ranges and can retry streaming at a higher level. Reduce the time-to-recover from 5 hours to 10 minutes per stream session. Even if the 10-minute idle detection might cause more false positives, that is fine, since we can retry the "small" stream session anyway. In the long term, we should replace the whole idle detection logic with this rule: whenever the stream initiator goes away, the stream slave goes away. Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com> (cherry picked from commit `ad7b132188`)	2018-03-14 10:12:59 +02:00
Avi Kivity	4f68fede6d	Merge "Make reader concurrency dual-restricted by count and memory" from Botond " Refs #2692 Fixes #3246 The current restricting algorithm [1] restricts the active-reader queue based on the memory consumption of the existing active readers. When this memory consumption is above the limit new readers are not admitted. The inactive reader queue on the other hand has a fixed length. This caused performance regressions on two workloads: * read-only: since the inactive-reader queue length is severely limited (compared to the previous situation) reads will time out at loads comfortably handled before. * mixed: since the memory consumption is checked only at admission time (already created active readers are not limited) memory consumption grew significantly, causing problems when compactions kicked in. The solution is to reintroduce the old limit of 100 active concurrent user-reads while still keeping the memory-based limit as well. For workloads that don't consume a lot of memory or on large boxes with lots of memory the count-based limit will be reached, which reverts to the old well-known behaviour. For memory-hungry workloads or on small boxes with little memory the memory-based limit will kick in sooner, avoiding memory overconsumption. [1] introduced by `bdbbfe9390` " * 'restricted-reader-dual-limit/v3-backport-2.1' of https://github.com/denesb/scylla: Modify unit tests so that they test the dual-limits Use the reader_concurrency_semaphore to limit reader concurrency Add reader_concurrency_semaphore Add reader_resource_tracker param to mutation_source mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh	2018-03-08 19:10:06 +02:00
Botond Dénes	681f9e4f50	Modify unit tests so that they test the dual-limits	2018-03-08 18:54:16 +02:00
Botond Dénes	c503bc7693	Use the reader_concurrency_semaphore to limit reader concurrency	2018-03-08 18:54:15 +02:00
Botond Dénes	de7024251b	Add reader_concurrency_semaphore This semaphore implements the new dual, count- and memory-based active reader limiting. As purely memory-based limiting proved to cause problems on big boxes admitting a large number of readers (more than any disk could handle), the previous count-based limit is reintroduced in addition to the existing memory-based limit. When creating new readers, first the count-based limit is checked. If that clears, the memory limit is checked before admitting the reader. reader_concurrency_semaphore wraps the two semaphores that implement these limits and enforces the correct order of limit checking. This class also completely replaces the restricted_reader_config struct; it encapsulates all data and related functionality of the latter, making client code simpler.	2018-03-08 18:54:15 +02:00
Botond Dénes	9a0eb2319c	Add reader_resource_tracker param to mutation_source Soon, reader_resource_tracker will only be constructible after the reader has been admitted. This means that the resource tracker cannot be preconstructed and just captured by the lambda stored in the mutation source and instead has to be passed in along the other parameters.	2018-03-08 18:54:12 +02:00
Botond Dénes	9ef462449b	mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh In preparation to reader_concurrency_semaphore being added to the file. The reader_resource_tracker is really only a helper class for reader_concurrency_semaphore so the latter is better suited to provide the name of the file.	2018-03-08 15:34:48 +02:00
Amnon Heiman	6271f30716	dist/docker: Add support for housekeeping This patch takes a modified version of the Ubuntu 14.04 housekeeping service script and uses it in Docker to validate the current version. To disable the version validation, pass the --disable-version-check flag when running the container. Message-Id: <20180220161231.1630-1-amnon@scylladb.com> (cherry picked from commit `edcfab3262`)	2018-03-07 16:17:13 +02:00
Takuya ASADA	8b64e80c88	dist/debian: install scylla-housekeeping upstart script correctly on Ubuntu 14.04 Since we split the scylla-housekeeping service into two different services for systemd, we don't share the same service name between systemd and upstart anymore. So handle it independently for each distribution, and try to install /etc/init/scylla-housekeeping.conf on Ubuntu 14.04. Fixes #3239 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1519852659-10688-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `101e909483`)	2018-03-07 16:16:36 +02:00
Amnon Heiman	c5bffcaa68	scylla-housekeeping: need to support both debian/ubuntu variations Debian and Ubuntu list files come in two variations. The housekeeping should support both. This patch changes the regexp that matches the OS in the repository file. After the introduction of the second list variation, the OS name can be in the middle of the path, not only at the end. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20180227092543.19538-1-amnon@scylladb.com> (cherry picked from commit `57d46c6959`)	2018-03-07 16:15:54 +02:00
Tomasz Grabiec	8aa0b60e91	tests: cache: Fix invalidate() not being waited for Probably responsible for occasional failures of subsequent assertion. Didn't manage to reproduce. Message-Id: <1520330967-584-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `d9f0c1f097`)	2018-03-06 12:17:16 +02:00
Asias He	dccf762654	storage_service: Add missing return in pieces empty check If pieces is empty, it is bogus to access pieces[0]: sstring move_name = pieces[0]; Fix by adding the missing return. Spotted by Vlad Zolotarov <vladz@scylladb.com> Fixes #3258 Message-Id: <bcb446f34f953bc51c3704d06630b53fda82e8d2.1520297558.git.asias@scylladb.com> (cherry picked from commit `8900e830a3`)	2018-03-06 09:58:21 +02:00
Tomasz Grabiec	e5344079d9	intrusive_set_external_comparator: Fix _header having undefined color on move swap_tree() doesn't change the color of the header, and because the header was not initialized, it is undefined (can be either red or black). One problem this causes is that algo::is_header() expects the header to be always red. It is used by unlink(), which for trees which have a black header would infinite-loop. The fix is to initialize the header. Fixes #3242. Message-Id: <1519815091-13111-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `30635510a2`)	2018-02-28 13:57:33 +02:00
Paweł Dziepak	7bc8515c48	tests/cql3: increase TTL to avoid spurious failures The test inserts some values with a TTL of 1 second and then reads them back expecting them not to be expired yet. That may not always be the case if the machine is slow and we are running in the debug mode. Increasing the TTLs by x100 should help avoid these false positives. Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com> (cherry picked from commit `d97eebe82d`)	2018-02-22 14:14:41 +00:00
Duarte Nunes	1228a41eaa	cql3/query_processor: Remove prepared statements upon dropping a view Fixes #3198 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180209143652.31852-1-duarte@scylladb.com> (cherry picked from commit `d757c87107`)	2018-02-22 14:11:08 +00:00
Tomasz Grabiec	58b90ceee0	tests: row_cache: Improve test for snapshot consistency on eviction Reproduces https://github.com/scylladb/scylla/issues/3215. Message-Id: <1518710592-21925-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `9c3e56fb16`)	2018-02-16 11:42:33 +01:00
Tomasz Grabiec	ef46067606	mvcc: Do not move unevictable snapshots to cache Commit `6ccd317` introduced a bug in partition_entry::evict() where a partition entry may be partially evicted if there are non-evictable snapshots in it. Partially evicting some of the versions may violate consistency of a snapshot which includes evicted versions. For one, continuity flags are interpreted relative to the merged view, not within a version, so evicting from some of the versions may mark ranges as continuous when before they were discontinuous. Also, range tombstones of the snapshot are taken from all versions, so we can't partially evict some of them without marking all affected ranges as discontinuous. The fix is to revert back to full eviction, and avoid moving non-evictable snapshots to cache. When moving whole partition entry to cache, we first create a neutral empty partition entry and then merge the memtable entry into it just like we would if the entry already existed. Fixes #3215. Tests: unit (release) Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `b0b57b8143`)	2018-02-16 11:26:13 +01:00
Shlomi Livne	ffdd0f6392	release: prepare for 2.1.0 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-02-14 15:17:43 +02:00
Paweł Dziepak	3ab1c8abff	cql3/select_statement: do not capture stack variables by reference Default capture by reference considered harmful in async code. (cherry picked from commit `b635fec9bf`)	2018-02-08 17:54:00 +02:00
Amnon Heiman	d306c40507	database: correct the label creation for database reads The labels in database active_reads metrics were not defined correctly. Labels should be created so that it is possible to select based on their value. The current implementation defines a label "class" with three instances: user, streaming, system. Fixes: #2770 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20180123125206.23660-1-amnon@scylladb.com> (cherry picked from commit `a0a1961b6d`)	2018-02-08 15:20:55 +02:00
Paweł Dziepak	b98d5b30de	Merge "Do not evict from memtable snapshots" from Tomasz "When moving whole partition entries from memtable to cache, we move snapshots as well. It is incorrect to evict from such snapshots though, because associated readers would miss data. Solution is to record evictability of partition version references (snapshots) and avoiding eviction from non-evictable snapshots. Could affect scanning reads, if the reader uses partition entry from memtable, and the partition is too large to fit in reader's buffer, and that entry gets moved to cache (was absent in cache), and then gets evicted (memory pressure). The reader will not see the remainder of that entry. Found during code review. Introduced in `ca8e3c4`, so affects 2.1+ Fixes #3186. Tests: unit (release)" * 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla: tests: mvcc: Add test for eviction with non-evictable snapshots mutation_partition: Define + operator on tombstones tests: mvcc: Check that partition is fully discontinuous after eviction tests: row_cache: Add test for memtable readers surviving flush and eviction memtable: Make printable mvcc: Take partition_entry by const ref in operator<<() mvcc: Do not evict from non-evictable snapshots mvcc: Drop unnecessary assignment to partition_snapshot::_version tests: Use partition_entry::make_evictable() where appropriate mvcc: Encapsulate construction of evictable entries (cherry picked from commit `6ccd317c38`)	2018-02-06 19:29:56 +01:00
Tomasz Grabiec	85f5e57502	tests: Introduce mutation_partition_assertions mutation_assertions are now delegating to mutation_partition_assertions. (cherry picked from commit `c7539f2ed0`)	2018-02-06 19:29:56 +01:00
Tomasz Grabiec	19158f3401	mutation_partition: Make check_continuity() const-qualified (cherry picked from commit `bde050835f`)	2018-02-06 19:29:56 +01:00
Tomasz Grabiec	a7e40d6acb	mutation_partition: Make check_continuity() public (cherry picked from commit `f9257886cb`)	2018-02-06 19:29:56 +01:00
Tomasz Grabiec	eedcfedd5a	mutation_partition: Extract sliced() from mutation into mutation_partition So that we can call it on mutation_partition. (cherry picked from commit `b3709047b0`)	2018-02-06 19:29:56 +01:00
Tomasz Grabiec	b655fe262b	mvcc: Add const-qualified partition_version_ref::operator*() (cherry picked from commit `a6e083ef6f`)	2018-02-06 19:29:56 +01:00
Shlomi Livne	cbb3b959e3	release: prepare for 2.1.rc3 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-02-06 12:12:31 +02:00
Raphael S. Carvalho	3dd282f7f0	sstables/compress: Fix race condition in segmented offset reading of shared sstable Race condition was introduced by commit `028c7a0888`, which introduces chunk offset compression, because a reading state is kept in the compress structure which is supposed to be immutable and can be shared among shards owning the same sstable. So it may happen that shard A updates state while shard B relies on information previously set which leads to incorrect decompression, which in turn leads to read misbehaving. We could serialize access to at() which would only lead to contention issues for shared sstables, but that can be avoided by moving state out of compress structure which is expected to be immutable after sstable is loaded and fed to shards that own it. Sequential accessor (wraps state and reference to segmented_offset) is added to prevent at() and push_back() interfaces from being polluted. Tests: release mode. Fixes #3148. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com> (cherry picked from commit `09f4ee808f`)	2018-02-06 12:10:29 +02:00
Tomasz Grabiec	574548e50f	Merge 'Fixes for exception safety in memtable range reads' from Paweł These patches deal with the remaining exception safety issues in the memtable partition range readers. That includes moving the assignment to iterator_reader::_last outside of allocating section to avoid problems caused by exception-unsafe assignment operator. Memory accounting code is also moved out of the retryable context to improve the code robustness and avoid potential problems in the future. Fixes #3172. * https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety-2.1/v1: memtable: do not update iterator_reader::_last in alloc section memtable: do not change accounting state in alloc section tests/memtable: add more reader exception safety tests	2018-02-05 20:51:26 +01:00
Paweł Dziepak	688d58f54a	tests/memtable: add more reader exception safety tests	2018-02-05 15:11:55 +00:00
Paweł Dziepak	ea9b0bb4b0	memtable: do not change accounting state in alloc section Allocating sections can be retried so code that has side effects (like updating flushed bytes accounting) has no place there.	2018-02-05 15:11:54 +00:00
Paweł Dziepak	6a9b026601	memtable: do not update iterator_reader::_last in alloc section iterator_reader::_last is a part of the state that survives allocating section retries, therefore, it should not be modified in the retryable context.	2018-02-05 15:11:53 +00:00
Amnon Heiman	adc1523aaa	scylla_setup support private repo on debian during setup Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170917145248.19677-1-amnon@scylladb.com> (cherry picked from commit `bc356a3c15`)	2018-02-01 15:58:00 +02:00
Tomasz Grabiec	5444eead08	Merge "Make memtable reads exception safe" from Paweł These patches change the memtable reader implementation (in particular partition_snapshot_reader) so that the existing exception safety problems are fixed, but also in a way that, hopefully, would make it easier to reason about the error handling and avoid future bugs in that area. The main difficulty related to exception safety is that when an exception is thrown out of an allocating section that code is run again with increased memory reserved. If the retryable code has side effects it is very easy to get incorrect behaviour. In addition to that, entering an allocating section is not exactly cheap which encourages doing so rarely and having large sections. The approach taken by this series is to, first, make entering allocating sections cheaper and then reducing the amount of logic that runs inside of them to a minimum. This means that instead of entering a section once per a call to flat_mutation_reader::fill_buffer() the allocation section is entered once for each emitted row. The only state modified from within the section are cached iterators to the current row, which are dropped on retry. Hopefully, this would make the reader code easier to reason about. The optimisations to the allocating sections and managed_bytes linearised context has successfully eliminated any penalty caused by much more fine grained allocating sections. Fixes #3123. Fixes #3133. 
Tests: unit-tests (release) BEFORE test iterations median mad min max memtable.one_partition_one_row 1155362 869.139ns 0.282ns 868.465ns 873.253ns memtable.one_partition_many_rows 127252 7.871us 15.252ns 7.851us 7.886us memtable.many_partitions_one_row 58715 17.109us 2.765ns 17.013us 17.112us memtable.many_partitions_many_rows 4839 206.717us 212.385ns 206.505us 207.448us AFTER test iterations median mad min max memtable.one_partition_one_row 1194453 839.223ns 0.503ns 834.952ns 842.841ns memtable.one_partition_many_rows 133785 7.477us 4.492ns 7.473us 7.507us memtable.many_partitions_one_row 60267 16.680us 18.027ns 16.592us 16.700us memtable.many_partitions_many_rows 4975 201.048us 144.929ns 200.822us 201.699us ./before_sq ./after_sq diff read 337373.86 353694.24 4.8% write 388759.99 394135.78 1.4% * https://github.com/pdziepak/scylla.git memtable-exception-safety-2.1/v1: flat_mutation_reader: add allocation point in push_mutation_fragment linearization_context: remove non-trivial operations from fast path lsa: split alloc section into reserving and reclamation-disabled parts lsa: optimise disabling reclamation and invalidation counter mutation_fragment: allow creating clustering row in place partition_snapshot_reader: minimise amount of retryable code memtable: drop memtable_entry::read() tests/memtable: add test for reader exception safety	2018-02-01 10:54:35 +01:00
Paweł Dziepak	1e74362ec9	tests/memtable: add test for reader exception safety	2018-02-01 10:54:34 +01:00
Paweł Dziepak	72e52dafba	memtable: drop memtable_entry::read()	2018-02-01 10:54:34 +01:00
Paweł Dziepak	29746e1e7b	partition_snapshot_reader: minimise amount of retryable code Retryable code that has side effects is a recipe for bugs. This patch reworks the snapshot reader so that the amount of logic run with reclamation disabled is minimal and has very limited side effects.	2018-02-01 10:54:34 +01:00
Paweł Dziepak	13cd56774f	mutation_fragment: allow creating clustering row in place Moving clustering_row is expensive due to amount of data stored internally. Adding a mutation_fragment constructor that builds a clustering_row in-place saves some of that moving.	2018-02-01 10:54:34 +01:00
Paweł Dziepak	812018479b	lsa: optimise disabling reclamation and invalidation counter Most of the lsa gory details are hidden in utils/logalloc.cc. That includes the actual implementation of a lsa region: region_impl. However, there is code in the hot path that often accesses the _reclaiming_enabled member as well as its base class allocation_strategy. In order to optimise those accesses another class is introduced: basic_region_impl that inherits from allocation_strategy and is a base of region_impl. It is defined in utils/logalloc.hh so that it is publicly visible and its member functions are inlineable from anywhere in the code. This class is supposed to be as small as possible, but contain all members and functions that are accessed from the fast path and should be inlined.	2018-02-01 10:54:34 +01:00
Paweł Dziepak	0ee2462811	lsa: split alloc section into reserving and reclamation-disabled parts An allocating section reserves a certain amount of memory, then disables reclamation and attempts to perform the given operation. If that fails due to std::bad_alloc the reserve is increased and the operation is retried. Reserving memory is expensive while just disabling reclamation isn't. Moreover, the code that runs inside the section needs to be safely retryable. This means that we want the amount of logic running with reclamation disabled to be as small as possible, even if it means entering and leaving the section multiple times. In order to reduce the performance penalty of such a solution the memory reserving and reclamation disabling parts of the allocating sections are separated.	2018-02-01 10:54:34 +01:00
Paweł Dziepak	c8bc3a7053	linearization_context: remove non-trivial operations from fast path Since linearization_context is thread_local every time it is accessed the compiler needs to emit code that checks if it was already constructed and does so if it wasn't. Moreover, upon leaving the context from the outermost scope the map needs to be cleared. All these operations impose some performance overhead and aren't really necessary if no buffers were linearised (the expected case). This patch rearranges the code so that linearization_context is trivially constructible and the map is cleared only if it was modified.	2018-02-01 10:54:34 +01:00
Paweł Dziepak	9f78799e80	flat_mutation_reader: add allocation point in push_mutation_fragment Exception safety tests inject a failure at every allocation and verify whether the error is handled properly. push_mutation_fragment() adds a mutation fragment to a circular_buffer, in theory any call to that function can result in a memory allocation, but in practice that depends on the implementation details. In order to improve the effectiveness of the exception safety tests this patch adds an explicit allocation point in push_mutation_fragment().	2018-02-01 10:54:33 +01:00
Calle Wilund	5bba3856ca	auth: Fix transitional auth for non-valid credentials Fixes #3096 The credentials processing for transitional auth was broken in `ba6a41d`, "auth: Switch to sharded service", which effectively removed the "virtualization" of underlying auth in the SASL challenge. As a quick workaround, add the permissive exception handling to sasl object as well. Message-Id: <20180103102724.1083-1-calle@scylladb.com> (cherry picked from commit `35b9ec868a`)	2018-02-01 11:36:46 +02:00
Avi Kivity	63e92418dd	Update seastar submodule * seastar 8d254a1...af1b789 (3): > tls_test: Fix echo test not setting server trust store > tls: Do not restrict re-handshake to client > tls: Actually verify client certificate if requested Fixes #3072	2018-01-28 13:59:04 +02:00
Paweł Dziepak	9eaa6f233e	Update scylla-ami submodule * scylla-ami 3366c93...c5d9e96 (1): > Update Amazon kernel packages release stream to 2017.09	2018-01-24 13:27:52 +00:00
Raphael S. Carvalho	6600317b2c	sstables: fix wildly inaccurate sstable key estimation after dynamic index sampling The reason sstable key estimation is inaccurate is that it doesn't account for the fact that index sampling is now dynamic. The estimation is done as follows: uint64_t get_estimated_key_count() const { return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) * _components->summary.header.min_index_interval; } The biggest problem is that _components->summary.header.min_index_interval isn't actually the minimum interval, but instead the default interval value set in the schema. So the estimation gets worse the larger the average partition, because the larger the average partition the lower the index sampling interval. One of the problems is that estimation has a big influence on bloom filter size, and so for large partitions we were generating bigger filters than we had to. From now on, size at full sampling is calculated as if sampling were static (which was the case until commit `8726ee937d` which introduced size-based sampling), using minimum index as a strict sampling interval. Tests: units (release) Fixes #3113. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com> (cherry picked from commit `2c181b69c9`)	2018-01-24 11:42:28 +02:00
Vladimir Krivopalov	807acb2dd9	main: Fix warnings when running "scylla --version" Print Scylla version, if requested, before running Seastar application. Fixes #3124 Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <bbd0f303f612327446ce1f10ebd17ebed8d76048.1516144651.git.vladimir@scylladb.com> (cherry picked from commit `73b6e9fbb1`)	2018-01-17 16:59:28 +02:00
Takuya ASADA	5e44bf97f0	dist/debian: follow gcc-7.2 package naming changes on 3rdparty repo for Debian 9 Switch to renamed gcc-7.2 package on Debian 9, too. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1516191853-2562-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `f3c8574135`)	2018-01-17 14:38:55 +02:00
Takuya ASADA	4003be40b3	dist/debian: fix package name typo on Debian 8 Correct package name is scylla-gcc72-g++-7, not scylla-g++-7. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1516189354-5880-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `15e266eea4`)	2018-01-17 13:45:39 +02:00
Takuya ASADA	cf059b6ee2	dist/debian: follow renaming of gcc-7.2 packages on Ubuntu 14.04/16.04 Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2, so switch to it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1516103292-26942-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `912a14eb9b`)	2018-01-17 13:38:56 +02:00
Shlomi Livne	d96c31ee4d	release: prepare for 2.1.rc2 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-01-16 16:22:56 +02:00
Avi Kivity	680ce234b0	Merge "Fix memory leak on zone reclaim" from Tomek "_free_segments_in_zones is not adjusted by segment_pool::reclaim_segments() for empty zones on reclaim under some conditions. For instance when some zone becomes empty due to regular free() and then reclaiming is called from the std allocator, and it is satisfied from a zone after the one which is empty. This would result in free memory in such zone to appear as being leaked due to corrupted free segment count, which may cause a later reclaim to fail. This could result in bad_allocs. The fix is to always collect such zones. Fixes #3129 Refs #3119 Refs #3120" * 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev: tests: lsa: Test _free_segments_in_zones is kept correct on reclaim lsa: Expose max_zone_segments for tests lsa: Expose tracker::non_lsa_used_space() lsa: Fix memory leak on zone reclaim (cherry picked from commit `4ad212dc01`)	2018-01-16 15:54:40 +02:00
Asias He	ad656b2c55	storage_service: Do not wait for restore_replica_count in handle_state_removing The call chain is: storage_service::on_change() -> storage_service::handle_state_removing() -> storage_service::restore_replica_count() -> streamer->stream_async() Listeners run as part of gossip message processing, which is serialized. This means we won't be processing any gossip messages until streaming completes. In fact, there is no need to wait for restore_replica_count to complete which can take a long time, since when it completes, this node will send notification to tell the removal_coordinator that the restore process is finished on this node. This node will be removed from _replicating_nodes on the removal_coordinator. Tested with update_cluster_layout_tests.py Fixes #2886 Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com> (cherry picked from commit `5107b6ad16`)	2018-01-16 11:37:55 +02:00
Tomasz Grabiec	43101b6bff	database: Invalidate only affected ranges from flush_streaming_mutations() Invalidating whole range causes larger latency spikes. Regression from 2.0 introduced in `d22fdf4261`. Refs #3119 Tests: units (release) Message-Id: <1516046938-26855-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `b5d5bf5bc4`)	2018-01-16 11:18:36 +02:00
Asias He	492a5c8886	storage_service: Set NORMAL status after token_metadata is replicated Commit `2d5fb9d109` (gms/gossiper: Replicate changes incrementally to other shards) changes the way we replicate _token_metadata and endpoint_state_map. Before that commit they were replicated at the same time; after it they no longer are. As a result, a shard in NORMAL status can still have an empty _token_metadata. We saw errors: [shard 12] token_metadata - sorted_tokens is empty in first_token_index! during CorruptThenRepairNemesis. Fix by setting the gossip status to NORMAL after replication of _token_metadata, so that once a node is in NORMAL, we can do repair. The commit `69c81bcc87` (repair: Do not allow repair until node is in NORMAL status) prevents the early repair operation by checking if a node is in NORMAL status. Fixes #3121 Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com> (cherry picked from commit `3c8ed255ac`)	2018-01-16 09:41:34 +02:00
Raphael S. Carvalho	152747b8fd	mutation_reader: Fix use-after-move Problem introduced in `375ed938b4` Also remove redefinition of schema in dummy incremental selector which is supposed to use the one in base class instead. Following tests are fixed: ./build/release/tests/mutation_reader_test ./build/release/tests/sstable_test -- -c1 ./build/release/tests/row_cache_test ./build/release/tests/cache_flat_mutation_reader_test ./build/release/tests/row_cache_stress_test Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180111153831.17462-1-raphaelsc@scylladb.com>	2018-01-11 17:43:41 +02:00
Takuya ASADA	00c08519a7	dist/debian: make pbuilder work on Debian 9 On Debian 9, 'pbuilder create' fails because of lack of GPG key for 3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and 'pbuilder update'. Also, apt-key adv --fetch-keys does not work correctly on it, but we can use "curl <URL> | apt-key add -" as a workaround. Fixes #3088 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `b68ee98310`)	2018-01-11 15:03:49 +02:00
Takuya ASADA	5d47a39b7b	dist/debian: follow renaming of gcc-7.2 packages on Debian 8 Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2, so switch to it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1515522920-8266-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `420b61b466`)	2018-01-11 15:03:47 +02:00
Takuya ASADA	4f8e8bdc04	dist/debian: rename boost1.63 to scylla-boost163 on Debian 8 We provided "boost1.63" package for Debian 8 since we couldn't build "scylla-boost163" package which is available on Ubuntu14/16, but I fixed the problem and now we have it for Debian 8 too, so switch to it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `51013f561d`)	2018-01-11 15:03:44 +02:00
Paweł Dziepak	ef1dab4565	combined_reader: optimise for disjoint partition streams The legacy mutation_reader/streamed_mutation design allowed very easily to skip the partition merging logic if there was only one underlying reader that has emitted it. That optimisation was lost after conversion to flat mutation readers which has impacted the performance. This patch mostly recovers it by bypassing most of mutation_reader_merger logic if there is only a single active reader for a given partition. The performance regression was introduced in `8731c1bc66` "Flatten the implementation of combined_mutation_reader". perf_simple_query -c4 read results (medians of 60): original regression before 8731c1 after 8731c1 diff read 326241.02 300244.09 -8.0% this patch before after diff read 313882.59 325148.05 3.6% Message-Id: <20180103121019.764-1-pdziepak@scylladb.com> (cherry picked from commit `b4a4c04bab`)	2018-01-11 10:33:31 +01:00
Tomasz Grabiec	3f602814ba	mutation_reader: Move definition of combining mutation reader to source file So that the whole world doesn't recompile when it changes. (cherry picked from commit `60ed5d29c0`)	2018-01-11 10:33:08 +01:00
Tomasz Grabiec	83d4e85e00	mutation_reader: Use make_combined_reader() to create combined reader So that we can hide the definition of combined_mutation_reader. It's also less verbose. (cherry picked from commit `52285a9e73`)	2018-01-11 10:33:06 +01:00
Asias He	857ffeefce	streaming: Do not send failed message for uninitialized session The uninitialized session has no peer associated with it yet. There is no point sending the failed message when aborting the session. Sending the failed message in this case will send to a peer with uninitialized dst_cpu_id which will cause the receiver to pass a bogus shard id to smp::submit_to which causes a segfault. In addition, to be safe, initialize the dst_cpu_id to zero. So that uninitialized session will send message to shard zero instead of random bogus shard id. Fixes the segfault issue found by repair_additional_test.py:RepairAdditionalTest.repair_abort_test Fixes #3115 Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com> (cherry picked from commit `774307b3a7`)	2018-01-09 16:32:12 +02:00
Piotr Jastrzebski	a845e23702	Fix fast_forward_to(partition_range&) in forwardable flat reader. Making sure fast_forward_to(const partition_range&) sets _current correctly. Fixes #3089 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <6c29cf273f191da0e21035bcbe1592042ecffc70.1515490058.git.piotr@scylladb.com> (cherry picked from commit `945f45f490`)	2018-01-09 14:57:53 +02:00
George Tavares	f9b14df3a3	db/view: Consume updated rows regardless of static row Using Materialized Views, if the base table has static columns, and the update in base table mutates static and non-static rows, the streamed_mutation is stopped before the non-static row is processed. The patch avoids stopping the streamed_mutation and adds a test case. Message-Id: <20171220173434.25091-1-tavares.george@gmail.com> (cherry picked from commit `ceecd542cd`)	2018-01-08 15:39:57 +01:00
Raphael S. Carvalho	ae47dfde7d	sstables: cure our blindness on sstable read failure After `611774b`, we're blind again on which sstable caused a compaction to fail, leaving us with a cryptic message as follows: compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum) After this change, both a read failure in compaction and a regular read will report the guilty sstable, see: compaction_manager - compaction failed: std::runtime_error (SSTable reader found an exception when reading sstable ./data/.../keyspace1-standard1 ka-1-Data.db : std::runtime_error(compressed chunk failed checksum)) Fixes #3006. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com> (cherry picked from commit `4610e994e1`)	2018-01-08 13:43:32 +02:00
Vladimir Krivopalov	cc15a13365	Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp. In version 1.8.3 of JsonCpp shipped with Fedora 27, old FastWriter and Reader classes from JsonCpp have been deprecated in favour of newer/better ones: CharReaderBuilder/CharReader and StreamWriterBuilder/StreamWriter. This fix uses the new classes where available or resorts to old ones for older versions of the library. Fixes #2989 Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> (cherry picked from commit `76775ddf26`)	2018-01-07 14:48:54 +02:00
Avi Kivity	6e14dcb84c	Merge "Fix potential infinite recursion in leveled compaction" from Raphael '"The issue is triggered by compaction of sstables of level higher than 0. The problem happens when the interval map of a partitioned sstable set stores intervals such as the following: [-9223362900961284625 : -3695961740249769322 ] (-3695961740249769322 : -3695961103022958562 ] When the selector is called for the first interval above, the exclusive lower bound of the second interval is returned as the next token, but the inclusiveness info is not returned. So reader_selector was returning that there were new readers when the current token was -3695961740249769322 because it was stored in the selector position field as inclusive, but it's actually exclusive. This false positive was leading to infinite recursion in the combined reader because the sstable set's incremental selector itself knew that there were actually no new readers, and therefore no progress could be made." Fixes #2908.' * 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla: tests: test for infinite recursion bug when doing high-level compaction Fix potential infinite recursion when combining mutations for leveled compaction dht: make it easier to create ring_position_view from token dht: introduce is_min/max for ring_position (cherry picked from commit `375ed938b4`)	2018-01-07 14:47:18 +02:00
Pekka Enberg	9ed64cc11c	dist/docker: Switch to Scylla 2.1 repository	2018-01-05 10:43:29 +02:00
Shlomi Livne	d4c46afc50	release: prepare for 2.1.rc1 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2018-01-03 10:48:35 +02:00
Paweł Dziepak	f371d17884	db/schema_tables: do not use moved from shared pointer Shared pointer view is captured by two continuations, one of which is moving it away. Using do_with() solves the problem. Fixes #3092. Message-Id: <20171221111614.16208-1-pdziepak@scylladb.com> (cherry picked from commit `4dfddc97c7`)	2017-12-21 15:13:53 +01:00
Tomasz Grabiec	0a82a885a4	Merge "Remove memtable::make_reader" from Piotr Migrate all the places that used memtable::make_reader to use memtable::make_flat_reader and remove memtable::make_reader. * seastar-dev.git haaawk/remove_memtable_make_reader_v2_rebased: Remove memtable::make_reader Stop using memtable::make_reader in row_cache_stress_test Stop using memtable::make_reader in row_cache_test Stop using memtable::make_reader in mutation_test Stop using memtable::make_reader in streamed_mutation_test Stop using memtable::make_reader in memtable_snapshot_source.hh Stop using memtable::make_reader in memtable::apply Add consume_partitions(flat_mutation_reader& reader, Consumer consumer) Add default parameter values in make_combined_reader Migrate test_virtual_dirty_accounting_on_flush to flat reader Migrate test_adding_a_column_during_reading_doesnt_affect_read_result Simplify flat_reader_assertions& produces(const mutation& m) Migrate test_partition_version_consistency_after_lsa_compaction_happens flat_mutation_reader: Allow setting buffer capacity Add next_mutation() to flat_mutation_reader_assertions cf::for_all_partitions::iteration_state: don't store schema_ptr read_mutation_from_flat_mutation_reader: don't take schema_ptr Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader (cherry picked from commit `b0a56a91c2`)	2017-12-21 14:10:31 +01:00
Tomasz Grabiec	17febfdb0e	database: Move operator<<() overloads to appropriate source files (cherry picked from commit `fd7ab5fe99`)	2017-12-21 14:10:24 +01:00
Vlad Zolotarov	830bf99528	tests: sstable_datafile_test: fix the compilation error on Power 'char' and int8_t ('unsigned char') are different types. 'bytes' base type is int8_t - use the correct type for casting. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> (cherry picked from commit `22ca5d2596`)	2017-12-21 14:09:47 +01:00
Tomasz Grabiec	90000d9861	Merge "Fixes for multi_range_reader" from Paweł The following patches contain fixes for skipping to the next partition in multi_range_reader and completely disable support for fast forwarding inside a single partition, which is not needed and would only add unnecessary complexity. * https://github.com/pdziepak/scylla.git fix-multi_range_reader/v1: flat_multi_range_mutation_reader: disallow streamed_mutation::forwarding flat_multi_range_mutation_reader: clear buffer on next_partition() tests/flat_multi_range_mutation_reader: test skipping to next partition (cherry picked from commit `71cc63dfa6`)	2017-12-21 14:07:15 +01:00
Asias He	46dae42dcd	streaming: One cf per time on sender In the case there are large number of column families, the sender will send all the column families in parallel. We allow 20% of shard memory for streaming on the receiver, so each column family will have 1/N, N is the number of in-flight column families, memory for memtable. Large N causes a lot of small sstables to be generated. It is possible there are multiple senders to a single receiver, e.g., when a new node joins the cluster, the maximum in-flight column families is number of peer node. The column families are sent in the order of cf_id. It is not guaranteed that all peers has the same speed so they are sending the same cf_id at the same time, though. We still have chance some of the peers are sending the same cf_id. Fixes #3065 Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com> (cherry picked from commit `a9dab60b6c`)	2017-12-20 17:07:39 +01:00
Tomasz Grabiec	d6395634ad	range_tombstone_list: Fix insert_from() end_bound was not updated in one of the cases in which end and end_kind was changed, as a result later merging decision using end_bound were incorrect. end_bound was using the new key, but the old end_kind. Fixes #3083. Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `dfe48bbbc7`)	2017-12-20 15:31:51 +01:00
Avi Kivity	d886b3def4	Merge "Fix read amplification in sstable reads" from Paweł "4b9a34a85425d1279b471b2ff0b0f2462328929c "Merge sstable_data_source into sstable_mutation_reader" has introduced unintentional changes, some of them causing excessive read amplification during empty range reads. The following patches restore the previous behaviour." * tag 'fix-read-amplification/v1' of https://github.com/pdziepak/scylla: sstables: set _read_enabled to false if possible sstables: set _single_partition_read for single partition reads (cherry picked from commit `772d1f47d7`)	2017-12-19 18:18:06 +02:00
Tomasz Grabiec	bcb06bb043	flat_mutation_reader: Fix make_nonforwardable() It emitted end-of-stream prematurely if buffer was full. Message-Id: <1513697716-32634-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `6a6bf58b98`)	2017-12-19 16:01:21 +00:00
Tomasz Grabiec	4606300b25	row_cache: Fix single_partition_populating_reader not waiting on create_underlying() to resolve Results in undefined behavior. Message-Id: <1513691679-27081-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `7b36c8423c`)	2017-12-19 16:12:37 +02:00
Piotr Jastrzebski	282d93de99	Use row_cache::make_flat_reader in column_family::make_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <ba1659ceed8676f45942ce6e7506158026947345.1513687259.git.piotr@scylladb.com> (cherry picked from commit `570fc5afed`)	2017-12-19 14:42:52 +02:00
Avi Kivity	52d3403cb0	Update scylla-ami submodule * dist/ami/files/scylla-ami be90a3f...3366c93 (1): > scylla_install_ami: skip ec2_check while building AMI Still tracking master.	2017-12-19 10:12:05 +02:00
Tomasz Grabiec	97f6073699	Merge "Migrate cache to use flat_mutation_reader" from Piotr (cherry picked from commit `37b19ae6ba`)	2017-12-18 20:51:09 +01:00
Glauber Costa	5454e6e168	conf: document listen_on_broadcast_address That's a supported feature that is listed in our help message, but it is not present in the yaml file. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20171215011240.16027-1-glauber@scylladb.com> (cherry picked from commit `b8f49fcc14`)	2017-12-18 17:00:46 +02:00
Vlad Zolotarov	498fb11c70	messaging_service: fix multi-NIC support Don't enforce the outgoing connections from the 'listen_address' interface only. If 'local_address' is given to connect() it will enforce it to use a particular interface to connect from, even if the destination address should be accessed from a different interface. If we don't specify the 'local_address' the source interface will be chosen according to the routing configuration. Fixes #3066 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com> (cherry picked from commit `be6f8be9cb`)	2017-12-17 10:51:37 +02:00
Avi Kivity	a6b4881994	Merge "SSTable summary regeneration fixes" from Raphael "Fixes #3057." * 'summary_recreation_fixes_v2' of github.com:raphaelsc/scylla: tests: sstable summary recreation sanity test sstables: make loading of sstable without summary to work again sstables: fix summary generation with dynamic index sampling (cherry picked from commit `11de20fc33`)	2017-12-17 09:39:16 +02:00
Takuya ASADA	9848df6667	dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian Currently scylla-housekeeping-daily.service/-restart.service hardcoded "--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify CentOS .repo file, but we use same .service for Ubuntu/Debian. It doesn't work correctly, we need to specify .list file for Debian variants. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `c2e87f4677`)	2017-12-16 22:03:42 +02:00
Piotr Jastrzebski	2090a5f8f6	Fix build by removing semicolon after concept Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <4504cf47be0a451c58052476bc8cc4f9cba59472.1513248094.git.piotr@scylladb.com> (cherry picked from commit `ac1d2f98e4`)	2017-12-14 12:48:29 +02:00
Amos Kong	7634ed39eb	Reset default cluster_name back to 'Test Cluster' for compatibility Some users rely on the original default cluster_name 'Test Cluster'; their nodes will fail to start because of the cluster_name change if they use the new scylla.yaml. 'ScyllaDB Cluster' isn't more beautiful than 'Test Cluster', so reset to the original default to avoid problems for users. Fixes #3060 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <8c9dab8a64d0f4ab3a5d6910b87af696c60e5076.1513072453.git.amos@scylladb.com> (cherry picked from commit `b07de93636`)	2017-12-13 16:58:10 +02:00
Avi Kivity	fb9b15904a	Merge "Convert sstable readers to flat streams" from Paweł "While `aa8c2cbc16` 'Merge "Migrate sstables to flat_mutation_reader" from Piotr' has converted the low-level sstable reader to the new flat_mutation_reader interface there were still multiple readers related to sstables that required converting, including: - restricted reader - filtering reader - single partition sstable reader This series completes their conversion to the flat stream interface." * tag 'flat_mutation_reader-sstable-readers/v2' of https://github.com/pdziepak/scylla: db: convert single_key_sstalbe_reader to flat streams db: fully convert incremental_reader_selector to flat readers db: make make_range_sstable_reader() return flat reader db: make column_family::make_reader() return flat reader db: make column_family::make_sstable_reader() return a flat reader filtering_reader: switch to flat mutation fragment streams filtering_reader: pass a const dht::decorated_key& to the callback mutation_reader: drop make_restricted_reader() db: use make_restricted_flat_reader mutation_reader: convert restricted reader to flat streams (cherry picked from commit `6cb3b29168`)	2017-12-13 15:38:22 +02:00
Glauber Costa	4e11f05aa7	database: delete created SSTables if streaming writes fail We have had an issue recently where failed SSTable writes left the generated SSTables dangling in a potentially invalid state. If the write had, for instance, started and generated tmp TOCs but not finished, those files would be left for dead. We had fixed this in commit `b7e1575ad4`, but streaming memtables still have the same issue. Note that we can't fix this in the common function write_memtable_to_sstable because different flushers have different retry policies. Fixes #3062 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20171213011741.8156-1-glauber@scylladb.com> (cherry picked from commit `1aabbc75ab`)	2017-12-13 10:09:43 +02:00
Jesse Haber-Kucharsky	516a1ae834	cql3: Add missing `return` Since `return` is missing, the "else" branch is also taken and this results in a user being created from scratch. Fixes #3058. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <bf3ca5907b046586d9bfe00f3b61b3ac695ba9c5.1512951084.git.jhaberku@scylladb.com> (cherry picked from commit `7e3a344460`)	2017-12-11 09:55:27 +02:00
Paweł Dziepak	be5127388d	Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz "Fixes cache reader to not skip over data in some cases involving overlapping range tombstones in different partition versions and discontinuous cache. Introduced in 2.0 Fixes #3053." * tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev: tests: row_cache: Add reproducer for issue #3053 tests: mvcc: Add test for partition_snapshot::range_tombstones() mvcc: Optimize partition_snapshot::range_tombstones() for single version case mvcc: Fix partition_snapshot::range_tombstones() tests: random_mutation_generator: Do not emit dummy entries at clustering row positions (cherry picked from commit `051cbbc9af`)	2017-12-08 13:03:32 +01:00
Tomasz Grabiec	6d0679ca72	mvcc: Extract partition_entry::add_version() (cherry picked from commit `52cabe343c`)	2017-12-08 12:33:49 +01:00
Avi Kivity	eb67b427b2	Merge "SSTable resharding fixes" from Raphael "Didn't affect any release. Regression introduced in `301358e`. Fixes #3041" * 'resharding_fix_v4' of github.com:raphaelsc/scylla: tests: add sstable resharding test to test.py tests: fix sstable resharding test sstables: Fix resharding by not filtering out mutation that belongs to other shard db: introduce make_range_sstable_reader rename make_range_sstable_reader to make_local_shard_sstable_reader db: extract sstable reader creation from incremental_reader_selector db: reuse make_range_sstable_reader in make_sstable_reader (cherry picked from commit `d934ca55a7`)	2017-12-07 16:43:28 +02:00
Amos Kong	2931324b34	dist/debian: add scylla-tools-core to depends list Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com> (cherry picked from commit `8fd5d27508`)	2017-12-07 13:40:46 +02:00
Amos Kong	614519c4be	dist/redhat: add scylla-tools-core to requires list Fixes #3051 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com> (cherry picked from commit `eb3b138ee2`)	2017-12-07 13:40:46 +02:00
Botond Dénes	203b924c76	mutation_reader_merger: don't query the kind of moved-from fragment Call mutation_fragment_kind() on the fragment before it's moved as there are not guarantees for the state of a moved-from object (apart from that it's in a valid one). Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <c47b1e22877bb9499f1fbb9d513093c29ef1901b.1512635422.git.bdenes@scylladb.com> (cherry picked from commit `1ff65f41fd`)	2017-12-07 11:41:04 +01:00
Botond Dénes	f4f957fa53	Add streamed mutation fast-forwarding unit test for the flat combined-reader Test for the bug fixed by `9661769`. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <fc917bae8e9c99f026bf7b366e6e9d39faf466af.1512630741.git.bdenes@scylladb.com> (cherry picked from commit `9fce51f8a0`)	2017-12-07 11:40:53 +01:00
Botond Dénes	39e614a444	combined_mutation_reader: fix fast-forwarding related row-skipping bug When fast forwarding is enabled and all readers positioned inside the current partition return EOS, return EOS from the combined-reader too, instead of skipping to the next partition if there are idle readers (positioned at some later partition) available. Skipping to the next partition causes rows to be skipped in some cases. The fix is to distinguish EOS'd readers that are only halted (waiting for a fast-forward) from those really out of data. To achieve this we track the last fragment-kind the reader emitted. If that was a partition-end then the reader is out of data, otherwise it might emit more fragments after a fast-forward. Without this additional information it is impossible to determine why a reader reached EOS and the code later may make the wrong decision about whether the combined-reader as a whole is at EOS or not. Also when fast-forwarding between partition-ranges or calling next_partition() we set the last fragment-kind of forwarded readers because they should emit a partition-start, otherwise they are out of data. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <6f0b21b1ec62e1197de6b46510d5508cdb4a6977.1512569218.git.bdenes@scylladb.com> (cherry picked from commit `9661769313`)	2017-12-06 16:42:06 +02:00
Paweł Dziepak	d8521d0fa2	Merge "Flatten combined_mutation_reader" from Botond "Convert combined_mutation_reader into a flat_mutation_reader impl. For now - in the name of incremental progress - all consumers are updated to use the combined reader through the mutation_reader_from_flat_mutation_reader adaptor. The combined reader also uses all it's sub mutation_readers through the flat_mutation_reader_from_mutation_reader adaptor." * 'bdenes/flatten-combined-reader-v8' of https://github.com/denesb/scylla: Add unit tests for the combined reader - selector interactions Add flat_mutation_reader overload of make_combined_reader Flatten the implementation of combined_mutation_reader Add mutation_fragment_merger mutation_fragment::apply(): handle partition start and end too Add non-const overload of partition_start::partition_tombstone() Make combined_mutation_reader a flat_mutation_reader Move the mutation merging logic to combined_mutation_reader Remove the unnecessary indirection of mutation_reader_merger::next() Move the implementation of combined_mutation_reader into mutation_reader_merger Remove unused mutation_and_reader::less_compare and operator< (cherry picked from commit `046991b0b7`)	2017-12-06 16:41:42 +02:00
Takuya ASADA	f60696b55f	dist/debian: need apt-get update after installing GPG key for 3rdparty repo We need apt-get update after install GPG key, otherwise we still get unauthenticated package error on Debian package build. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1512556948-29398-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `aeb6ebce5a`)	2017-12-06 12:43:42 +02:00
Avi Kivity	1b15a0926a	Merge "Make sstable tests use flat_mutation_reader" from Paweł "This series makes sstable tests use flat stream interface. The main motivation is to allow eventual removal of mutation_reader and streamed_mutation and ensuring that the conversion between the interfaces doesn't hide any bugs that would be otherwise found." * tag 'flat_mutation_reader-sstable-tests/v1' of https://github.com/pdziepak/scylla: sstables: drop read_range_rows() tests/mutation_reader: stop using read_range_rows() incremental_reader_selector: do not use read_range_rows() tests/sstable: stop using read_range_rows() sstables: drop read_row() tests/sstables: use read_row_flat() instead of read_row() database: use read_row_flat() instead of read_row() tests/sstable_mutation_test: get flat_mutation_readers from mutation sources tests/sstables: make sstable_reader return flat_mutation_reader sstable: drop read_row() overload accepting sstable::key tests/sstable: stop using read_row() with sstable::key tests/flat_mutation_reader_assertions: add has_monotonic_positions() tests/flat_mutation_reader_assertions: add produces(Range) tests/flat_mutation_reader_assertions: add produces(mutation) tests/flat_mutation_reader_assertions: add produces(dht::decorated_key) tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind) tests/flat_mutation_reader_assertions: fix fast forwarding (cherry picked from commit `601a03dda7`)	2017-12-06 10:12:36 +02:00
Takuya ASADA	32efd3902c	dist/debian: install CA certificates before install repo GPG key Since pbuilder chroot environment does not install CA certificates by default, accessing https://download.opensuse.org will cause certificate verification error. So we need to install it before installing 3rdparty repo GPG key. Also, checking existence of gpgkeys_curl is not needed, since it is never installed because we are running the script in a clean chroot environment. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1512517001-27524-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `8f02967a3b`)	2017-12-06 10:12:17 +02:00
Avi Kivity	6b2f7f8c39	Merge "enable secure-apt for Ubuntu/Debian pbuilder" from Takuya * 'debian-secure-apt-3rdparty-v3' of https://github.com/syuu1228/scylla: dist/debian: support Ubuntu 18.04LTS dist/debian: disable ALLOWUNTRUSTED dist/debian: enable secure-apt for Debian dist/debian: enable secure-apt for Ubuntu (cherry picked from commit `a25b5e30f8`)	2017-12-04 14:47:23 +02:00
Takuya ASADA	370a6482e3	dist/debian: disable entire pybuild actions Even after `25bc18b` was committed, we still see a build error similar to #3036 on some environments, not on dh_auto_install but on dh_auto_test (see #3039). So we need to disable entire pybuild actions, not just dh_auto_install. Fixes #3039 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1512185097-23828-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `8c403ea4e0`)	2017-12-02 19:37:01 +02:00
Takuya ASADA	981644167b	dist/debian: skip running dh_auto_install on pybuild We are getting package build error on dh_auto_install which is invoked by pybuild. But since we handle all installation on debian/scylla-server.install, we can simply skip running dh_auto_install. Fixes #3036 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1512065117-15708-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `25bc18b8ff`)	2017-12-01 16:06:44 +02:00
Avi Kivity	6f669da227	Update seastar submodule * seastar 78cd87f...8d254a1 (2): > fstream: do not ignore dma_write return value > Update dpdk submodule Fixes dpdk build and missing file write error check.	2017-11-30 10:43:22 +02:00
Avi Kivity	bdf1173075	Point seastar submodule at scylla-seastar.git This allows fixes to seastar to be cherry-picked into scylla-seastar.git branch-2.1.	2017-11-30 10:40:51 +02:00
Duarte Nunes	106c69ad45	compound_compact: Change universal reference to const reference The universal reference was introduced so we could bind an rvalue to the argument, but it would have sufficed to make the argument a const reference. This is also more consistent with the function's other overload. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171129132758.19654-1-duarte@scylladb.com> (cherry picked from commit `cda3ddd146`)	2017-11-29 14:42:08 +01:00
Tomasz Grabiec	740fcc73b8	Merge "compact_storage serialization fixes" from Duarte Fix two issues with serializing non-compound range tombstones as compound: convert a non-compound clustering element to compound and actually advertise the issue to other nodes. * git@github.com:duarten/scylla.git rt-compact-fixes/v1: compound_compact: Allow rvalues in size() sstables/sstables: Convert non-compound clustering element to compound tests/sstable_mutation_test: Verify we can write/read non-correct RTs service/storage_service: Export non-compound RT feature (cherry picked from commit `e9cce59b85`)	2017-11-29 14:18:21 +01:00
Raphael S. Carvalho	cefbb0b999	sstables: fix data_consume_context's move operator and ctor after `7f8b62bc0b`, its move operator and ctor broke. That potentially leads to error because data_consume_context dtor moves sstable ref to continuation when waiting for in-flight reads from input stream. Otherwise, sstable can be destroyed meanwhile and file descriptor would be invalid, leading to EBADF. Fixes #3020. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171129014917.11841-1-raphaelsc@scylladb.com> (cherry picked from commit `f699cf17ae`)	2017-11-29 09:54:27 +01:00
Tomasz Grabiec	02f43f5e4c	Merge "Convert memtable flush reader to flat streams" from Paweł This series converts memtable flush reader to the new flat mutation readers. Just like the scanning reader, flush reader concatenates multiple partition snapshot readers in order to provide a stream of all partitions in the memtable. * https://github.com/pdziepak/scylla.git flat_mutation_reader-memtable-flush/v1 tests/flat_mutation_reader_assertion: add produces_partition() memtable: make make_flush_reader() return flat_mutation_reader flat_mutation_reader: add optimised flat_mutation_reader_opt memtable: switch flush reader implementation to flat streams tests/memtable: add test for flush reader (cherry picked from commit `04106b4c96`)	2017-11-27 20:29:25 +01:00
Duarte Nunes	8850ef7c59	tests/sstable_mutation_test: Change make_reader to make_flat_reader A merge conflict between `596ebaed1f` and `bd1efbc25c` caused the test to fail to build. Signed-off-by: Duarte Nunes <duarte@scylladb.com> (cherry picked from commit `4a6ffa3f5c`)	2017-11-27 09:59:56 +01:00
Duarte Nunes	8567723a7b	tests: Initialize storage service for some tests These tests now require having the storage service initialize, which is needed to decide whether correct non-compound range tombstones should be emitted or not. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171126152921.5199-1-duarte@scylladb.com> (cherry picked from commit `922f095f22`)	2017-11-26 17:41:20 +02:00
Duarte Nunes	b0b7c73acd	cql3/delete_statement: Allow non-range deletions on non-compound schemas This patch fixes a regression introduced in `1c872e2ddc`. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171126102333.3736-1-duarte@scylladb.com> (cherry picked from commit `15fbb8e1ca`)	2017-11-26 12:29:27 +02:00
Takuya ASADA	eb82d66849	dist/debian: link libgcc dynamically As we discussed on the thread (https://github.com/scylladb/scylla/issues/2941), since we override symbols on libgcc, we need to link libgcc dynamically for Ubuntu/Debian too (CentOS already do it). Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1511542866-21486-2-git-send-email-syuu@scylladb.com> (cherry picked from commit `7380a6088b`)	2017-11-25 20:10:15 +02:00
Takuya ASADA	eb12fb3733	dist/debian: switch to our PPA version of gcc-72 Now we have gcc-7.2 on our PPA for Ubuntu 16.04/14.04, let's switch to it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1511542866-21486-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `df6546d151`)	2017-11-25 20:10:14 +02:00
Tomasz Grabiec	60d011c9c0	Merge "Convert sstable writers to flat mutation readers" from Paweł The following patches convert sstable writers to use flat mutation readers instead of the legacy mutation_reader interface. Writers were already using flat consumer interface and used consume_flattened_in_thread(), so most of the work was limited to providing an appropriate equivalent for flat mutation readers. * https://github.com/pdziepak/scylla.git flat_mutation_reader-sstable-write/v1: flat_mutation_reader: move consumer_adapter out of consume() flat_mutation_reader: introduce consume_in_thread() tests/flat_mutation_reader: test consume_in_thread() sstables: switch write_components() to flat_mutation_reader streamed_mutation: drop streamed_mutation_returning() sstables: convert compaction to flat_mutation_reader mutation_reader: drop consume_flattened_in_thread() (cherry picked from commit `596ebaed1f`)	2017-11-24 18:49:32 +01:00
Tomasz Grabiec	7c3390bde8	Merge "Fixes to sstable files for non-compound schemas" from Duarte This series mainly fixes issues with the serialization of promoted index entries for non-compound schemas and with the serialization of range tombstones, also for non-compound schemas. We lift the correct cell name writing code into its own function, and direct all users to it. We also ensure backward compatibility with incorrectly generated promoted indexes and range tombstones. Fixes #2995 Fixes #2986 Fixes #2979 Fixes #2992 Fixes #2993 * git@github.com:duarten/scylla.git promoted-index-serialization/v3: sstables/sstables: Unify column name writers sstables/sstables: Don't write index entry for a missing row maker sstables/sstables: Reuse write_range_tombstone() for row tombstones sstables/sstables: Lift index writing for row tombstones sstables/sstables: Leverage index code upon range tombstone consume sstables/sstables: Move out tombstone check in write_range_tombstone() sstables/sstables: A schema with static columns is always compound sstables/sstables: Lift column name writing logic sstables/sstables: Use schema-aware write_column_name() for collections sstables/sstables: Use schema-aware write_column_name() for row marker sstables/sstables: Use schema-aware write_column_name() for static row sstables/sstables: Writing promoted index entry leverages column_name_writer sstables/sstables: Add supported feature list to sstables sstables/sstables: Don't use incorrectly serialized promoted index cql3/single_column_primary_key_restrictions: Implement is_inclusive() cql3/delete_statement: Constrain range deletions for non-compound schemas tests/cql_query_test: Verify range deletion constraints sstables/sstables: Correctly deserialize range tombstones service/storage_service: Add feature for correct non-compound RTs tests/sstable_*: Start the storage service for some cases sstables/sstable_writer: Prepare to control range tombstone serialization sstables/sstables: Correctly 
serialize range tombstones tests/sstable_assertions: Fix monotonicity check for promoted indexes tests/sstable_assertions: Assert a promoted index is empty tests/sstable_mutation_test: Verify promoted index serializes correctly tests/sstable_mutation_test: Verify promoted index repeats tombstones tests/sstable_mutation_test: Ensure range tombstone serializes correctly tests/sstable_datafile_test: Add test for incorrect promoted index tests/sstable_datafile_test: Verify reading of incorrect range tombstones sstables/sstable: Rename schema-oblivious write_column_name() function sstables/sstables: No promoted index without clustering keys tests/sstable_mutation_test: Verify promoted index is not generated sstables/sstables: Optimize column name writing and indexing compound_compat: Don't assume compoundness (cherry picked from commit `bd1efbc25c`)	2017-11-24 18:49:19 +01:00
Tomasz Grabiec	95b55a0e9d	tests: sstable: Make tombstone_purge_test more reliable TTL of 1 second may cause the cell to expire right after we write it, if the second component of current time changes right after it. Use larger ttl to avoid spurious failures due to this. Message-Id: <1511463392-1451-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `35e404b1a2`)	2017-11-24 18:49:16 +01:00
Amnon Heiman	7785d8f396	estimated_histogram: update the sum and count when merging When merging histograms the count and the sum should be updated. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20171122154822.23855-1-amnon@scylladb.com> (cherry picked from commit `3f8d9a87ee`)	2017-11-22 16:57:08 +01:00
Glauber Costa	b805e37d30	estimated_histogram: also fill up sum metric Prometheus histograms have 3 embedded metrics: count, buckets, and sum. Currently we fill up count and buckets but sum is left at 0. This is particularly bad, since according to the prometheus documentation, the best way to calculate histogram averages is to write: rate(metric_sum[5m]) / rate(metric_count[5m]) One way of keeping track of the sum is adding the value we sampled, every time we sample. However, the interface for the estimated histogram has a method that allows adding a metric while adjusting the count for missing metrics (add_nano()). That makes accumulating a sum inaccurate, as we will have no values for the points that were added. To overcome that, when we call add_nano(), we pretend we are introducing new_count - _count metrics, all with the same value. Long term, doing away with sampling may help us provide more accurate results. After this patch, we are able to correctly calculate latency averages through the data exported in prometheus. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20171122144558.7575-1-glauber@scylladb.com> (cherry picked from commit `6c4e8049a0`)	2017-11-22 16:57:07 +01:00
Tomasz Grabiec	a790b8cd20	Merge "Remove sstable::read_rows" from Piotr * seastar-dev.git haaawk/flat_reader_remove_read_rows: sstable_mutation_test: use read_rows_flat instead of read_rows perf_sstable: use read_rows_flat instead of read_rows Remove sstable::read_rows (cherry picked from commit `e9ffe36d65`)	2017-11-22 16:11:31 +01:00
Tomasz Grabiec	a10ea80a63	Merge "Migrate sstables to flat_mutation_reader" from Piotr Introduce sstable::read_row_flat and sstable::read_range_rows_flat methods and use them in sstable::as_mutation_source. * https://github.com/scylladb/seastar-dev/tree/haaawk/flat_reader_sstables_v3: Introduce conversion from flat_mutation_reader to streamed_mutation Add sstables::read_rows_flat and sstables::read_range_rows_flat Turn sstable_mutation_reader into a flat_mutation_reader sstable: add getter for filter_tracker Move mp_row_consumer methods implementations to the bottom Remove unused sstable_mutation_reader constructor Replace "sm" with "partition" in get_next_sm and on_sm_finished Move advance_to_upper_bound above sstable_mutation_reader Store sstable_mutation_reader pointer in mp_row_consumer Stop using streamed_mutation in consumer and reader Stop using streamed_mutation in sstable_data_source Delete sstable_streamed_mutation Introduce sstable::read_row_flat Migrate sstable::as_mutation_source to flat_mutation_reader Remove single_partition_reader_adaptor Merge data_consume_context::impl into data_consume_context Create data_consume_context_opt. Merge on_partition_finished into mark_partition_finished Check _partition_finished instead of _current_partition_key Merge sstable_data_source into sstable_mutation_reader Remove sstable_data_source Remove get_next_partition and partition_header (cherry picked from commit `aa8c2cbc16`)	2017-11-22 15:49:22 +01:00
Takuya ASADA	91a5c9d20c	dist/redhat: avoid hardcoding GPG key file path on scylla-epel-7-x86_64.cfg Since we want to support cross building, we shouldn't hardcode the GPG file path, even though these files are provided by recent versions of mock. This fixes a build error on some older build environments such as CentOS-7.2. Fixes #3002 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1511277722-22917-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `c1b97d11ea`)	2017-11-21 17:26:53 +02:00
Takuya ASADA	f846b897bf	configure.py: suppress 'nonnull-compare' warning on antlr3 We get following warning from antlr3 header when we compile Scylla with gcc-7.2: /opt/scylladb/include/antlr3bitset.inl: In member function 'antlr3::BitsetList<AllocatorType>::BitsetType* antlr3::BitsetList<AllocatorType>::bitsetLoad() [with ImplTraits = antlr3::TraitsBase<antlr3::CustomTraitsBase>]': /opt/scylladb/include/antlr3bitset.inl:54:2: error: nonnull argument 'this' compared to NULL [-Werror=nonnull-compare] To make it compilable we need to specify '-Wno-nonnull-compare' on cflags. Message-Id: <1510952411-20722-2-git-send-email-syuu@scylladb.com> (cherry picked from commit `f26cde582f`)	2017-11-21 17:26:53 +02:00
Takuya ASADA	8d7c34bf68	dist/debian: switch Debian 3rdparty packages to external build service Switch Debian 3rdparty packages to our OBS repo (https://build.opensuse.org/project/subprojects/home:scylladb). We don't use 3rdparty packages on dist/debian/dep, so dropped them. Also we switch Debian to gcc-7.2/boost-1.63 at the same time. Due to packaging issues, the following packages don't follow our 3rdparty package naming rule for now: - gcc-7: renamed as 'xxx-scylla72', instead of scylla-xxx-72. - boost1.63: not renamed, and its prefix is not changed to /opt/scylladb Message-Id: <1510952411-20722-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `ab9d7cdc65`)	2017-11-21 17:26:53 +02:00
Duarte Nunes	7449586a26	thrift/server: Handle exception within gate The exception handling code inspects server state, which could be destroyed before the handle_exception() task runs since it runs after exiting the gate. Move the exception handling inside the gate and avoid scheduling another accept if the server has been stopped. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171116122921.21273-1-duarte@scylladb.com> (cherry picked from commit `34a0b85982`)	2017-11-21 15:52:38 +02:00
Daniel Fiala	b601b9f078	utils/big_decimal: Fix compilation issue with conversion of cpp_int to uint64_t. Signed-off-by: Daniel Fiala <daniel@scylladb.com> Message-Id: <20171121134854.16278-1-daniel@scylladb.com> (cherry picked from commit `21ea05ada1`)	2017-11-21 15:52:01 +02:00
Tomasz Grabiec	1ec81cda37	Merge "Convert queries to flat mutation readers" from Paweł These patches convert queries (data, mutation and counter) to flat mutation readers. All of them already use consume_flattened() to consume a flat stream of data, so the only major missing thing was adding support for reversed partitions to flat_mutation_reader::consume(). * pdziepak flat_mutation_reader-queries/v3-rebased: flat_mutation_reader: keep reference to decorated key valid flat_muation_reader: support consuming reversed partitions tests/flat_mutation_reader: add test for flat_mutation_reader::consume() mutation_partition: convert queries to flat_mutation_readers tests/row_cache_stress_test: do not use consume_flattened() mutation_reader: drop consume_flattened() streamed_mutation: drop reverse_streamed_mutation() (cherry picked from commit `6969a235f3`)	2017-11-21 12:58:41 +01:00
Paweł Dziepak	e87a2bc9c0	streamed_mutation: make emit_range_tombstone() exception safe For a time, a range tombstone that was already removed from a tree is owned by a raw pointer. This doesn't end well if creation of a mutation fragment or a call to push_mutation_fragment() throws. Message-Id: <20171121105749.16559-1-pdziepak@scylladb.com> (cherry picked from commit `1b936876b7`)	2017-11-21 12:35:47 +01:00
Tomasz Grabiec	b84d13d325	Merge "Fix reversed queries with range tombstones" from Paweł This series reworks handling of range tombstones in reversed queries so that they are applied to correct rows. Additionally, the concept of flipped range tombstones is removed, since it only made it harder to reason about the code. Fixes #2982. * https://github.com/pdziepak/scylla fix-reverse-query-range-tombstone/v2: streamed_mutation: fix reversing range tombstones range_tombstone: drop flip() tests/cql_query_test: test range tombstones and reverse queries tests/range_tombstone_list: add test for range_tombstone_accumulator (cherry picked from commit `cec5b0a5b8`)	2017-11-21 12:35:37 +01:00
Botond Dénes	b5abf6541d	Add fast-forwarding with no data test to mutation_source_test Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <9cb630bf9441e178b2040709f92767d4a740a875.1511180262.git.bdenes@scylladb.com> (cherry picked from commit `f059e71056`)	2017-11-21 12:34:46 +01:00
Botond Dénes	8cf869cb37	flat_mutation_reader_assertions: add fast_forward_to(position_range) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <7b530909cf188887377aec3985f9f8c0e3b9b1e8.1511180262.git.bdenes@scylladb.com> (cherry picked from commit `a1a0d445d6`)	2017-11-21 12:34:43 +01:00
Botond Dénes	df509761b0	flat_mutation_reader_from_mutation_reader(): make ff more resilient Currently flat_mutation_reader_from_mutation_reader()'s converting_reader will throw std::runtime_error if fast_forward_to() is called when its internal streamed_mutation_opt is disengaged. This can create problems if this reader is a sub-reader of a combined reader as the latter has no way to determine the source of a sub-reader EOS. A reader can be in EOS either because it reached the end of the current position_range or because it doesn't have any more data. To avoid this, instead of throwing we just silently ignore the fact that the streamed_mutation_opt is disengaged and set _end_of_stream to true which is still correct. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <83d309b225950bdbbd931f1c5e7fb91c9929ba1c.1511180262.git.bdenes@scylladb.com> (cherry picked from commit `8065dca4a1`)	2017-11-21 12:34:40 +01:00
Vlad Zolotarov	b90e11264e	cql_transport::cql_server: fix the distributed prepared statements cache population Don't std::move() the "query" string inside the parallel_for_each() lambda. parallel_for_each is going to invoke the given callback object for each element of the range and as a result the first call of lambda that std::move()s the "query" is going to destroy it for all other calls. Fixes #2998 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1511225744-1159-1-git-send-email-vladz@scylladb.com> (cherry picked from commit `941aa20252`)	2017-11-21 10:53:50 +02:00
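The pitfall fixed here can be reproduced without Seastar. The sketch below uses `std::for_each` in place of `parallel_for_each`, and `shard_cache`, `populate_moving()` and `populate_copying()` are hypothetical names chosen for illustration; it only assumes that the callback is invoked once per element of the range.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-shard prepared statements cache.
using shard_cache = std::vector<std::string>;

// Buggy pattern: the callback runs once per shard, but the first run moves
// `query` away, leaving it in a valid but unspecified (typically empty)
// state for every later invocation.
inline void populate_moving(std::vector<shard_cache>& shards, std::string query) {
    std::for_each(shards.begin(), shards.end(), [&query](shard_cache& cache) {
        cache.push_back(std::move(query));
    });
}

// Fixed pattern: every invocation copies, so each shard sees the full text.
inline void populate_copying(std::vector<shard_cache>& shards, const std::string& query) {
    std::for_each(shards.begin(), shards.end(), [&query](shard_cache& cache) {
        cache.push_back(query);
    });
}
```

Calling `populate_copying()` on three shards leaves the same query string in each cache, whereas `populate_moving()` only guarantees the content of the first one.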
Shlomi Livne	84b2bff0a6	release: prepare for 2.1.rc0 Signed-off-by: Shlomi Livne <shlomi@scylladb.com>	2017-11-19 18:53:20 +02:00
Tomasz Grabiec	2113299b61	sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition() _current_pi_idx was not reset from advance_to_next_partition(), which is used when we skip to the next partition before fully consuming it. As a result, if we try to skip to a clustering position which is before the index block used by the last skip in the previous partition, we would not skip assuming that the new position is in the current block. This may result in more data being read from the sstable than necessary. Fixes #2984 Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>	2017-11-17 11:00:26 +00:00
Avi Kivity	6950389a3f	Update seastar submodule * seastar 11ad0b1...78cd87f (3): > Merge "http: Use output stream for files" from Amnon > tutorial: a section about when_all() and when_all_succeed() > Merge "Power8 related changes (what's left of them)" from Vlad	2017-11-16 16:31:46 +02:00
Avi Kivity	f18b3928d0	Merge	2017-11-16 14:55:45 +02:00
Avi Kivity	beffe469af	index_entry: add move constructor, assignment operators As can be seen in one of the traces in #2958, the copy constructor of index_entry is called in response to std::vector<index_entry>::push_back(index_entry&&). This is wasteful. Fix by providing the full suite of constructors/assignment operators. Message-Id: <20171116121608.5580-1-avi@scylladb.com>	2017-11-16 13:54:05 +01:00
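One common way to end up with a copy on `push_back(T&&)` is a type that declares a copy constructor but no move constructor, so the rvalue binds to the copy constructor. Whether that is exactly what happened in index_entry is an assumption; the sketch below (C++17, with hypothetical types `copy_only_entry` and `movable_entry`) only demonstrates the general effect by counting constructor calls.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Declares a copy constructor only: the implicit move constructor is not
// generated, so rvalues are copied.
struct copy_only_entry {
    static inline int copies = 0;
    std::string key;

    explicit copy_only_entry(std::string k) : key(std::move(k)) {}
    copy_only_entry(const copy_only_entry& o) : key(o.key) { ++copies; }
};

// Provides the full suite of constructors and assignment operators.
struct movable_entry {
    static inline int copies = 0;
    static inline int moves = 0;
    std::string key;

    explicit movable_entry(std::string k) : key(std::move(k)) {}
    movable_entry(const movable_entry& o) : key(o.key) { ++copies; }
    movable_entry(movable_entry&& o) noexcept : key(std::move(o.key)) { ++moves; }
    movable_entry& operator=(const movable_entry&) = default;
    movable_entry& operator=(movable_entry&&) noexcept = default;
};
```

Pushing a temporary into a `std::vector<copy_only_entry>` increments `copies`, while the same operation on `std::vector<movable_entry>` increments only `moves`. Marking the move operations `noexcept` also lets the vector move elements instead of copying them when it reallocates.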
Avi Kivity	bbcfc57cb4	Merge "Free auth and its use from global variables" from Jesse "This patch series addresses #2929. The objective is to eliminate global state from the implementation and use of all access-control functionality. I've made every effort to make these patches logically independent and incremental, but the final patch is big: this was necessary because eliminating the global instances themselves is an atomic change." * 'jhk/non_global_auth/v2' of https://github.com/hakuch/scylla: auth: Switch to sharded service tracing/trace_keyspace_helper: Use internal `client_state` auth: Make the QP an explicit dependency auth: Unify Java class name attributes auth: Make life-time control more consistent auth: Move metadata constants auth: Don't expose internal constant auth: Extract `permissions_cache` utils/loading_cache: Include necessary dependency auth: Fix static constant initialization auth: Extract `delayed_tasks` from `auth.cc`	2017-11-16 14:52:34 +02:00
Jesse Haber-Kucharsky	ba6a41d397	auth: Switch to sharded service This change appears quite large, but is logically fairly simple. Previously, the `auth` module was structured around global state in a number of ways: - There existed global instances for the authenticator and the authorizer, which were accessed pervasively throughout the system through `auth::authenticator::get()` and `auth::authorizer::get()`, respectively. These instances needed to be initialized before they could be used with `auth::authenticator::setup(sstring type_name)` and `auth::authorizer::setup(sstring type_name)`. - The implementation of the `auth::auth` functions and the authenticator and authorizer depended on resources accessed globally through `cql3::get_local_query_processor()` and `service::get_local_migration_manager()`. - CQL statements would check for access and manage users through static functions in `auth::auth`. These functions would access the global authenticator and authorizer instances and depended on the necessary systems being started before they were used. This change eliminates global state from all of these. The specific changes are: - Move out `allow_all_authenticator` and `allow_all_authorizer` into their own files so that they're constructed like any other authenticator or authorizer. - Delete `auth.hh` and `auth.cc`. Constants and helper functions useful for implementing functionality in the `auth` module have moved to `common.hh`. - Remove silent global dependency in `auth::authenticated_user::is_super()` on the auth* service in favour of a new function `auth::is_super_user()` with an explicit auth* service argument. - Remove global authenticator and authorizer instances, as well as the `setup()` functions. - Expose dependency on the auth* service in `auth::authorizer::authorize()` and `auth::authorizer::list()`, which is necessary to check for superuser status. 
- Add an explicit `service::migration_manager` argument to the authenticators and authorizers so they can announce metadata tables. - The permissions cache now requires an auth* service reference instead of just an authorizer since authorizing also requires this. - The permissions cache configuration can now easily be created from the DB configuration. - Move the static functions in `auth::auth` to the new `auth::service`. Where possible, previously static resources like the `delayed_tasks` are now members. - Validating `cql3::user_options` requires an authenticator, which was previously accessed globally. - Instances of the auth* service are accessed through `external` instances of `client_state` instead of globally. This includes several CQL statements including `alter_user_statement`, `create_user_statement`, `drop_user_statement`, `grant_statement`, `list_permissions_statement`, `permissions_altering_statement`, and `revoke_statement`. For `internal` `client_state`, this is `nullptr`. - Since the `cql_server` is responsible for instantiating connections and each connection gets a new `client_state`, the `cql_server` is instantiated with a reference to the auth* service. - Similarly, the Thrift server is now also instantiated with a reference to the auth* service. - Since the storage service is responsible for instantiating and starting the sharded servers, it is instantiated with the sharded auth* service which it threads through. All relevant factory functions have been updated. - The storage service is still responsible for starting the auth* service it has been provided, and shutting it down. - The `cql_test_env` is now instantiated with an instance of the auth* service, and can be accessed through a member function. - All unit tests have been updated and pass. Fixes #2929.	2017-11-15 23:22:42 -05:00
Jesse Haber-Kucharsky	1dd181bd7b	tracing/trace_keyspace_helper: Use internal `client_state`	2017-11-15 23:19:18 -05:00
Jesse Haber-Kucharsky	41612ee577	auth: Make the QP an explicit dependency Rather than have all uses of the QP in auth reference global variables, we supply a QP reference to both the authenticator and authorizer on construction. The caller still references a global variable when constructing the instances, but fixing this problem is a much larger task that is out of scope of this change.	2017-11-15 23:19:13 -05:00
Jesse Haber-Kucharsky	157e22a4f0	auth: Unify Java class name attributes	2017-11-15 23:19:00 -05:00
Jesse Haber-Kucharsky	9aff5d9a77	auth: Make life-time control more consistent	2017-11-15 23:18:44 -05:00
Jesse Haber-Kucharsky	5825e37310	auth: Move metadata constants This change is motivated partly by aesthetics, but more significantly due to the future work to refactor `auth` into a sharded service. Since doing so will require writing `auth::auth` from scratch, these constants (and other common functionality) need a new home.	2017-11-15 23:18:42 -05:00
Jesse Haber-Kucharsky	22670cae82	auth: Don't expose internal constant	2017-11-15 23:17:52 -05:00
Jesse Haber-Kucharsky	20b7f92b9c	auth: Extract `permissions_cache` In addition to improving clarity, this makes the cache testable. There shouldn't be any functional changes.	2017-11-15 23:17:41 -05:00
Jesse Haber-Kucharsky	6f4241574c	utils/loading_cache: Include necessary dependency	2017-11-15 23:17:05 -05:00
Jesse Haber-Kucharsky	5c39a2cc15	auth: Fix static constant initialization Using "Meyer's singletons" eliminates the problem of static constant initialization order because static variables inside functions are initialized only the first time control flow passes over their declaration. Fixes #2966.	2017-11-15 23:16:52 -05:00
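The technique named in this commit can be sketched in a few lines. The function names and string values below are hypothetical and are not taken from the auth module; the point is only that a function-local static is initialized on first use, so one constant can safely depend on another regardless of translation unit order.

```cpp
#include <cassert>
#include <string>

// Function-local static: initialized the first time control flow passes
// over the declaration, and only once (thread-safe since C++11).
inline const std::string& example_keyspace_name() {
    static const std::string name = "example_auth";
    return name;
}

// Depends on another constant. Because the dependency is reached through a
// function call, it is guaranteed to be initialized before it is used here.
inline const std::string& example_users_table() {
    static const std::string qualified = example_keyspace_name() + ".users";
    return qualified;
}
```

Had both constants been namespace-scope objects defined in different source files, the order of their initialization would be unspecified, and the second could be built from a not yet constructed first.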
Jesse Haber-Kucharsky	507e1ef8d5	auth: Extract `delayed_tasks` from `auth.cc` This simple task scheduler is used by the auth module to delay metadata creation until the system is settled. Extracting it out allows the `auth` module to be refactored into a sharded service and for other components of `auth` to make use of it. Fixes #2965.	2017-11-15 23:16:46 -05:00
Botond Dénes	7f60fb19b4	flat_mutation_reader::fast_forward_buffer_to: remove schema parameter `e7a0732f72` added the schema to flat_mutation_reader::impl so the schema doesn't need to be provided externally anymore. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <04933512d3485d85629a9945b8ecb211aa2aab50.1510732121.git.bdenes@scylladb.com>	2017-11-15 10:40:02 +01:00
Tomasz Grabiec	a061be688d	Merge "Prepare sstables read path for flat_mutation_reader" from Piotr This patchset prepares sstables read path for flat_mutation_reader. It cuts some dependencies between classes and replaces sstables::mutation_reader with ::mutation_reader. This will make it possible to gradually convert the code to flat_mutation_reader because we have converters between flat_mutation_reader and ::mutation_reader. * seastar-dev.git haaawk/flat_reader_prepare_sstables_rebased Reduce dependencies from mp_row_consumer to sstable_streamed_mutation Replace sstables::mutation_reader with ::mutation_reader Remove range_reader_adaptor Remove sstable_range_wrapping_reader	2017-11-15 10:40:02 +01:00
Piotr Jastrzebski	6cd4b6b09c	Remove sstable_range_wrapping_reader The wrapper is no longer needed because read_range_rows returns ::mutation_reader instead of sstables::mutation_reader and the reader returned from it keeps the pointer to shared_sstable that was used to create the reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:02 +01:00
Piotr Jastrzebski	6d85e4fb0c	Remove range_reader_adaptor The wrapper is no longer needed because read_range_rows returns ::mutation_reader instead of sstables::mutation_reader and the reader returned from it keeps the pointer to shared_sstable that was used to create the reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:02 +01:00
Piotr Jastrzebski	ea449c9cce	Replace sstables::mutation_reader with ::mutation_reader This will make migration to flat_mutation_reader much easier and sstables::mutation_reader is going away with this migration anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:01 +01:00
Piotr Jastrzebski	228f0737f4	Reduce dependencies from mp_row_consumer to sstable_streamed_mutation Before this patch mp_row_consumer was using sstable_streamed_mutation in two ways: 1. Populate sstable_streamed_mutation's buffer with mutation_fragments 2. Advance sstable_streamed_mutation's sstable_data_source to new position. We can easily reduce those dependencies only to the first one. This will reduce the coupling between those classes and simplify the flow of execution. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:01 +01:00
Takuya ASADA	07c191af41	dist/common/scripts/scylla_dev_mode_setup: include scylla_lib.sh To use the verify_args function we require scylla_lib.sh, so include it. Fixes #2945 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1510154173-18017-1-git-send-email-syuu@scylladb.com>	2017-11-15 11:31:14 +02:00
Avi Kivity	7caf3a543e	Merge "Respect size-tiered options in strategies that rely on its functionality" from Raphael "Otherwise, such strategies couldn't behave as expected when it needs to do STCS." * 'respecting_stcs_options_v2' of github.com:raphaelsc/scylla: tests: enable twcs test that relied on size-tiered properties twcs: respect stcs options by forwarding them to stcs method lcs: forward stcs options to respect them stcs: make most_interesting_bucket respect size-tiered options stcs: make most_interesting_bucket respect thresholds compaction: make size_tiered_most_interesting_bucket static method of stcs class stcs: introduce new ctor stcs: make header self contained stcs: inline function definition so as not to break one definition rule	2017-11-14 17:57:57 +02:00
Raphael S. Carvalho	1f478d5daa	tests: enable twcs test that relied on size-tiered properties Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:27:27 -02:00
Raphael S. Carvalho	8165af1d08	twcs: respect stcs options by forwarding them to stcs method Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:27:27 -02:00
Raphael S. Carvalho	9cdc047a4c	lcs: forward stcs options to respect them Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:27:27 -02:00
Raphael S. Carvalho	2b7f87474b	stcs: make most_interesting_bucket respect size-tiered options Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:27:25 -02:00
Raphael S. Carvalho	d8ec913c34	stcs: make most_interesting_bucket respect thresholds Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:26:04 -02:00
Raphael S. Carvalho	cb6d060d8e	compaction: make size_tiered_most_interesting_bucket static method of stcs class Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:24:03 -02:00
Raphael S. Carvalho	b69dbf8b99	stcs: introduce new ctor Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-14 13:21:59 -02:00
Avi Kivity	0dc888f963	Merge	2017-11-14 15:59:52 +02:00
Tomasz Grabiec	7323fe76db	gossiper: Replicate endpoint_state::is_alive() Broken in `f570e41d18`. Not replicating this may cause coordinator to treat a node which is down as alive, or vice versa. Fixes regression in dtest: consistency_test.py:TestAvailability.test_simple_strategy which was expected to get "unavailable" exception but it was getting a timeout. Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>	2017-11-14 15:58:00 +02:00
Vlad Zolotarov	c6c41aa877	tests: loading_cache_test: make it more robust Make sure loading_cache::stop() is always called where appropriate: regardless whether the test failed or there was an exception during the test. Otherwise a false-alarm use-after-free error may occur. Fixes #2955 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1510625736-3109-1-git-send-email-vladz@scylladb.com>	2017-11-14 11:35:49 +00:00
Avi Kivity	09e730f9f2	Merge "Fix bugs in cache related to handling of bad_alloc" from Tomasz "Fixes #2944." * tag 'tgrabiec/cache-exception-safety-fixes-v2' of github.com:scylladb/seastar-dev: tests: row_cache: Add test for exception safety of multi-partition scans tests: row_cache: Add test for exception safety of single-partition reads tests: mutation_source_tests: Always print the seed tests: Disable alloc failure injection in test assertions tests: Avoid needless copies row_cache: Fix exception safety of cache_entry::read() row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh() cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors cache_streamed_mutation: Make advancing to the next range exception-safe cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe cache_streamed_mutation: Make drain_tombstones() exception-safe cache_streamed_mutation: Return void from start_reading_from_underlying() cache_streamed_mutation: Document invariants related to exception-safety streamed_mutation: Add reserve_one() lsa: Guarantee invalidated references on allocating section retry mvcc: partition_snapshot_row_cursor: Mark allocation points	2017-11-14 11:42:13 +02:00
Raphael S. Carvalho	f6574412a3	stcs: make header self contained Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-13 18:07:31 -02:00
Raphael S. Carvalho	2b45aa3593	stcs: inline function definition so as not to break one definition rule goal is to allow multiple definitions of header Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-11-13 18:07:30 -02:00
Tomasz Grabiec	638d23025b	tests: row_cache: Add test for exception safety of multi-partition scans	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	084e1861c8	tests: row_cache: Add test for exception safety of single-partition reads	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	a968a84ec5	tests: mutation_source_tests: Always print the seed BOOST_TEST_MESSAGE() is not logged by default, and for some tests we don't want to enable that because it's too noisy. But we need to know the seed to reproduce a failure, so we had better always print it.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	e868929faf	tests: Disable alloc failure injection in test assertions Injecting failures to assertions doesn't add much value but slows down test execution by adding extra iterations.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	5cf7f9d1bb	tests: Avoid needless copies	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	1971332195	row_cache: Fix exception safety of cache_entry::read() When we fail, we need to return streamed_mutation back, so that the operation can be retried. Causes SIGSEGV on nullptr otherwise on bad_alloc.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	11a195c403	row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data If assignment to _lower_bound in the "_secondary_in_progress = true;" case in do_read_from_primary() throws due to allocation failure, the update section will be retried and we will take the not_moved path, skipping the range which was discontinuous and was supposed to be read from underlying. Fix by redoing lookup using _lower_bound in case the section is retried. When we retry, _primary.valid() will be false. We need to ensure now that _lower_bound is always valid. Fixes #2944.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	5dc1ee41e4	row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh()	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	09c49b2db3	cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	f60cfa34f4	mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors Assignment to _pos (position_in_partition) may throw. noexcept is a remnant from the version which didn't have _pos.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	bd7b68f877	cache_streamed_mutation: Make advancing to the next range exception-safe Changing _ck_ranges_curr and _lower_bound should be atomic, either both fail or both succeed. Currently it could happen that if position_in_partition::for_range_start() fails, _ck_ranges_curr would be advanced but _lower_bound not.	2017-11-13 20:55:14 +01:00
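The "either both fail or both succeed" requirement is commonly met by computing everything that may throw first and then committing with operations that cannot throw. The sketch below illustrates that pattern only; `range_cursor`, `make_bound()` and the `inject_failure` flag are hypothetical and do not mirror the real cache_streamed_mutation members.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cursor with two fields that must change together.
struct range_cursor {
    std::vector<std::string> ranges;
    std::size_t current = 0;
    std::string lower_bound;
    bool inject_failure = false; // test hook simulating an allocation failure

    // May throw, like a position computation that allocates.
    std::string make_bound(const std::string& range) const {
        if (inject_failure) {
            throw std::bad_alloc();
        }
        return "start:" + range;
    }

    void advance() {
        const std::size_t next = current + 1;
        // Step 1: do all throwing work on temporaries.
        std::string new_bound = make_bound(ranges.at(next));
        // Step 2: commit. Neither statement can throw, so the two fields
        // are updated together or not at all.
        lower_bound = std::move(new_bound);
        current = next;
    }
};
```

If `make_bound()` throws, neither `current` nor `lower_bound` has been touched, so the caller can retry `advance()` from a consistent state.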
Tomasz Grabiec	081deec731	cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe We need to maintain the following invariants: (1) no fragment with position >= _lower_bound was pushed yet (2) If _lower_bound > mf.position(), mf was emitted Before this patch (1) could be violated if drain_tombstones() failed in the middle. (2) could be violated if push_mutation_fragment() failed.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	d1b844737a	cache_streamed_mutation: Make drain_tombstones() exception-safe If push_mutation_fragment() failed, mfo which we got from get_next() would be lost. Fix by making sure push_mutation_fragment() won't fail.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	875fc93956	cache_streamed_mutation: Return void from start_reading_from_underlying() The return value is no longer used.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	5fb319bbb9	cache_streamed_mutation: Document invariants related to exception-safety	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	53f4452b47	streamed_mutation: Add reserve_one()	2017-11-13 20:55:13 +01:00
Tomasz Grabiec	8d69d217af	lsa: Guarantee invalidated references on allocating section retry There is existing code (e.g. use of partition_snapshot_row_cursor in cache_streamed_mutation) which assumes that references will be invalidated when bad_alloc is thrown from allocating_section. That is currently the case because on retry we will attempt memory reclamation which will invalidate references either through compaction or eviction. Make this guarantee explicit.	2017-11-13 20:55:13 +01:00
Tomasz Grabiec	6bf1c6014f	mvcc: partition_snapshot_row_cursor: Mark allocation points This marks places which may allocate but not always do as allocation points to increase effectiveness of testing.	2017-11-13 20:55:13 +01:00
Raphael S. Carvalho	cfd2343689	sstables: fix report in integrity check file interposer Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171113185842.11018-1-raphaelsc@scylladb.com>	2017-11-13 20:07:09 +01:00
Raphael S. Carvalho	cf8e12c760	checked_file_impl: remove unneeded variant of open_checked_file_dma like in integrity_checked_file_impl, we don't need a variant of open for default file open options. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171113185412.10880-1-raphaelsc@scylladb.com>	2017-11-13 20:06:58 +01:00
Raphael S. Carvalho	564046a135	thrift: fix compilation error thrift/server.cc:237:6: required from here thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object maybe_retry_accept(which, keepalive, std::move(ex)); gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>	2017-11-13 20:05:33 +01:00
Avi Kivity	9cd5bc4eb8	Merge "Convert streaming to flat mutation readers" from Paweł "The following patches convert streaming and repair code to the new flat mutation reader interface. In particular this involves changing: - fragment_and_freeze() -- a consumer that fragments and freezes mutations - checksum computation for repair which until now was using two-level mutation_reader/streamed_mutation interface - multi_range_reader -- a mutation reader that automatically fast forwards between the given partition ranges" * tag 'flat_mutation_reader-streaming/v2' of https://github.com/pdziepak/scylla: (24 commits) mutation_reader: drop multi_range_reader db: convert make_streaming_reader() to flat_mutation_reader tests/flat_mutation_reader: add test for multi range reader tests/flat_mutation_reader_assertions: add fast_forward_to() tests/simple_schema: add to_ring_positions() helper flat_mutation_reader: convert flat_multi_range_mutation_reader flat_mutation_reader: add partition_range_forwarding flat_mutation_reader: make pop_mutation_fragment() public flat_mutation_reader: copy multi_range_mutation_reader streamed_mutation: drop mutation_hasher tests/flat_mutation_reader: add test for partition checksum repair: convert partition_checksum::compute_streamed() to flat streams repair: make partition_hasher consume flat mutation streams mutation_hasher: copy mutation_hasher to repair.cc partition_start: make partition_tombstone() const partition_checksum: introduce compute() for flat_mutation_reader db: drop single-range make_streaming_reader() fragment_and_freeze: drop streamed_mutation overload stream_transfer_task: switch to flat_mutation_reader tests/flat_mutation_reader: add test for fragment_and_freeze ...	2017-11-13 18:56:59 +02:00
Paweł Dziepak	97767963a0	mutation_reader: drop multi_range_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	dca93bea23	db: convert make_streaming_reader() to flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	98965add5b	tests/flat_mutation_reader: add test for multi range reader Based on mutation_reader.cc:test_multi_range_reader.	2017-11-13 16:49:52 +00:00
Paweł Dziepak	d23813cd41	tests/flat_mutation_reader_assertions: add fast_forward_to()	2017-11-13 16:49:52 +00:00
Paweł Dziepak	8fc9d250c5	tests/simple_schema: add to_ring_positions() helper Based on mutation_reader_test.cc:to_ring_position()	2017-11-13 16:49:52 +00:00
Paweł Dziepak	a9ec01d5a5	flat_mutation_reader: convert flat_multi_range_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	11e8866aee	flat_mutation_reader: add partition_range_forwarding flat_mutation_reader::partition_range_forwarding and mutation_reader::forwarding are aliases of the same type. The change was necessary in order to make mutation_reader::forwarding available in flat_mutation_reader.hh even though it is included by mutation_reader.hh	2017-11-13 16:49:52 +00:00
Paweł Dziepak	009785a178	flat_mutation_reader: make pop_mutation_fragment() public flat_mutation_reader public interface already exposes low-level is_buffer_empty() and is_buffer_full(); adding pop_mutation_fragment() will make implementation of intermediate readers more straightforward.	2017-11-13 16:49:52 +00:00
Paweł Dziepak	d9a2b00d4a	flat_mutation_reader: copy multi_range_mutation_reader multi_range_mutation_reader for flat mutation readers is going to be based on the original one.	2017-11-13 16:49:52 +00:00
Paweł Dziepak	7866e5b4a9	streamed_mutation: drop mutation_hasher	2017-11-13 16:49:52 +00:00
Paweł Dziepak	aa64b711d1	tests/flat_mutation_reader: add test for partition checksum Based on streamed_mutation_test:test_mutation_hash	2017-11-13 16:49:52 +00:00
Paweł Dziepak	f690e2e80b	repair: convert partition_checksum::compute_streamed() to flat streams	2017-11-13 16:49:52 +00:00
Paweł Dziepak	d71a14b943	repair: make partition_hasher consume flat mutation streams	2017-11-13 16:49:52 +00:00
Paweł Dziepak	2b774119a1	mutation_hasher: copy mutation_hasher to repair.cc Repair is the exclusive user of mutation_hasher. Moving it there will make integration with partition_checksum easier.	2017-11-13 16:49:52 +00:00
Paweł Dziepak	af4fa6152b	partition_start: make partition_tombstone() const	2017-11-13 16:49:52 +00:00
Paweł Dziepak	f648f94464	partition_checksum: introduce compute() for flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	37640f223b	db: drop single-range make_streaming_reader()	2017-11-13 16:49:52 +00:00
Paweł Dziepak	e2481a89e1	fragment_and_freeze: drop streamed_mutation overload	2017-11-13 16:49:52 +00:00
Paweł Dziepak	6f1e0d3ed8	stream_transfer_task: switch to flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	50a1d76c1f	tests/flat_mutation_reader: add test for fragment_and_freeze Based on streamed_mutation_test:test_fragmenting_and_freezing_streamed_mutations	2017-11-13 16:49:52 +00:00
Paweł Dziepak	f5c40e0861	flat_mutation_reader_from_mutations: take vector by value	2017-11-13 16:49:51 +00:00
Paweł Dziepak	9854b8a450	fragment_and_freeze: work on flat_mutation_readers	2017-11-13 16:49:47 +00:00
Paweł Dziepak	8bb672502d	fragment_and_freeze: allow callback to stop iteration There is a user of fragment_and_freeze() (streaming) that will need to be able to break the loop. Right now, it does that between streamed_mutations, but that won't be possible after we switch to flat readers.	2017-11-13 16:44:33 +00:00
Paweł Dziepak	73b8f54cf4	test/mutation_source_test: generate sets of mutations	2017-11-13 16:42:56 +00:00
Tomasz Grabiec	3536d2156c	tests: row_cache: Add reproducer for issue #2948 Message-Id: <1510229584-14398-2-git-send-email-tgrabiec@scylladb.com>	2017-11-13 15:20:21 +00:00
Tomasz Grabiec	8402728747	row_cache: Call open_version() under region's allocator partition_entry::read() calls open_version() under standard allocator, but it may allocate a new partition version if a snapshot already exists which was created in an earlier phase. Versions are supposed to be allocated using region's allocator, they will be freed using region's allocator. LSA will delegate free() to the standard allocator correctly in this case, but it will subtract from its _non_lsa_occupancy, assuming the allocation was done through it. This will corrupt occupancy() for cache region. Fixes #2948. Message-Id: <1510229584-14398-1-git-send-email-tgrabiec@scylladb.com>	2017-11-13 15:20:08 +00:00
Avi Kivity	061f6830fa	Merge "thrift/server: Ensure stop() waits for accepts" from Duarte "Ensure stop() waits for the accept loop to complete to avoid crashes during shutdown." * 'thrift-server-stop/v4' of https://github.com/duarten/scylla: thrift/server: Restore code format thrift/server: Stopping the server waits for connection shutdown thrift/server: Abort listeners on stop() thrift/server: Avoid manual memory management thrift/server: Add move ctor for connection thrift/server: Extract retry logic thrift/server: Retry with backoff for some error types thrift/server: Retry accept in case of error	2017-11-13 12:48:05 +02:00
Duarte Nunes	049fbb58f3	thrift/server: Restore code format Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:21:54 +01:00
Duarte Nunes	7b25e3200a	thrift/server: Stopping the server waits for connection shutdown This patch ensures the future returned from stop() resolves only when all connections and listeners are no longer in use. Fixes #2657 Fixes #2942 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:21:53 +01:00
Duarte Nunes	f523a0f845	thrift/server: Abort listeners on stop() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:19:44 +01:00
Duarte Nunes	8e0e2363e9	thrift/server: Avoid manual memory management Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:19:44 +01:00
Duarte Nunes	75d04be96f	thrift/server: Add move ctor for connection	2017-11-13 11:19:44 +01:00
Duarte Nunes	9d3322ff1a	thrift/server: Extract retry logic Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:19:43 +01:00
Duarte Nunes	b5cf1a152f	thrift/server: Retry with backoff for some error types Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:19:19 +01:00
Duarte Nunes	f367dbe1ed	thrift/server: Retry accept in case of error In case of errors like ECONNABORTED, we want to retry accepting connections. Right now we immediately retry the accept, but in subsequent patches we introduce a backoff for other types of errors. We also consider fatal errors like EBADFD, which should not trigger a retry. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-13 11:19:03 +01:00
Avi Kivity	d57395dce9	cql: prevent overflow when computing averages Currently, we use the type of the column as the accumulator when we average it. This can easily overflow, e.g. (2^31-1)+(3) = overflow. Fix by using __int128 for the accumulator. It's not standard, but it's way more efficient and simpler than the alternatives. Inspired by CASSANDRA-12417, but much simpler due to the availability of __int128. Message-Id: <20171112173529.30764-1-avi@scylladb.com>	2017-11-13 08:53:59 +01:00
Piotr Jastrzebski	acfc6fef55	Simplify flat_mutation_reader wrappers If a wrapper takes a flat_mutation_reader in a constructor then it does not have to take schema_ptr because it can obtain it from the inner flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <88c3672df08d2ac465711e9138d426e43ae9c62b.1510331382.git.piotr@scylladb.com>	2017-11-13 08:53:34 +01:00
Avi Kivity	f8af4f507b	Merge "Support for varint and decimal in aggregate functions" from Daniel "This patch adds support for varint and decimal to aggregate functions. Some other types (like byte or smallint) weren't supported and they are supported by C. So their aggregate functions were added as well. To allow aggregate functions for big_decimal, the following methods were added to the big_decimal type: * Division by int64_t that preserves the number of decimal digits. * Operator += . * Comparison operators. Fixes #2842." * 'danfiala/scylla-2842-send-002' of https://github.com/hagrid-the-developer/scylla: tests: Add tests for aggregate functions. tests: Add tests for big_decimal type. cql3/functions: Add aggregate functions for big_decimal. utils/big_decimal: Added necessary operators and methods for aggregate functions. cql3/functions: Add aggregate functions for types for which it is trivial.	2017-11-12 17:11:33 +02:00
Daniel Fiala	bc20484c47	tests: Add tests for aggregate functions. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-11-12 15:53:22 +01:00
Daniel Fiala	ee1d69502b	tests: Add tests for big_decimal type. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-11-12 15:53:22 +01:00
Daniel Fiala	74c5f70b0a	cql3/functions: Add aggregate functions for big_decimal. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-11-12 15:53:13 +01:00
Daniel Fiala	ce2f010859	utils/big_decimal: Added necessary operators and methods for aggregate functions. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-11-12 15:51:29 +01:00
Daniel Fiala	115668fe70	cql3/functions: Add aggregate functions for types for which it is trivial. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-11-12 13:56:20 +01:00
Tomasz Grabiec	484dde692f	Merge "make sure that cache updates don't overflow dirty memory" from Glauber Since we started accounting virtual dirty memory we no longer have a cap on real dirty memory. In most situations that is not needed, since real dirty will just be at most twice as much as virtual dirty (current flushing memtable plus new memtable). However, due to things like cache updates and component flushing we can end up having a lot of memtables that are virtually freed but not yet fully released, leading real dirty memory to explode using all the box' memory. This patch adds a cap on real dirty memory as well. Because of the hierarchical nature of region_group, if the parent blocks due to memory depletion, so will the child (virtual dirty region group). After that is done, we need to make sure that dirty memory is not seen as freed until the cache update is done. Until a particular partition is moved to the cache it is not evictable. As a result we can OOM the system if we have a lot of pending cache updates as the writes will not be throttled and memory won't be made available. This patch pins the memory used by the region as real dirty before the cache update starts, and unpins it when it is over. In the mean time it gradually releases memory of the partitions that are being moved to cache. I have verified in a couple of workloads that the amount of memory accounted through this is the same amount of memory accounted through the memtable flush procedure. 
Fixes #1942 * git@github.com:glommer/scylla.git glommer/update-cache-v4: row_cache: modernize use of seastar threads mutation_partition: estimate size of partition memtable: factor out calculation of memtable_entry memory size memtable: add a method to export memtable's dirty memory manager dirty_memory_manager: block if we hit the real dirty limit dirty_memory_manager: add functions to manipulate real dirty partition: add method to calculate memory size of a partition row cache: pin real dirty during cache updates.	2017-11-10 13:55:12 +01:00
Piotr Jastrzebski	e7a0732f72	Add schema_ptr to flat_mutation_reader It is useful to have a schema inside a flat reader the same way we had schema inside a streamed_mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <b37e0dbf38810c00bd27fb876b69e1754c16a89f.1510312137.git.piotr@scylladb.com>	2017-11-10 13:54:55 +01:00
Pekka Enberg	0c192c835c	cql3: Fix 'DROP INDEX' to also drop index view This patch fixes 'DROP INDEX' CQL statement to also drop the underlying index view automatically so that we don't leave unused materialized views behind. Message-Id: <1510303421-15945-1-git-send-email-penberg@scylladb.com>	2017-11-10 10:52:08 +01:00
Duarte Nunes	73f6c9a612	Merge seastar upstream * seastar 8040cab...11ad0b1 (7): > alloc_failure_injector: Fix compilation error with gcc 7.1 > core/gate: Add is_closed() function > doc: code formatting and fix function call > doc: tutoral code formatting > build: adjust -Wno-error=cpp for clang > build: don't error out on preprocessor #warning > Merge 'Enhancements of allocation failure injector' from Tomasz Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-11-09 14:42:06 +01:00
Takuya ASADA	f607a01cc5	dist/debian: link boost statically Since we switched scylla-boost163 which isn't provided by distribution repo, we need to link them statically. Fixes #2946 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1510229553-29801-1-git-send-email-syuu@scylladb.com>	2017-11-09 14:51:00 +02:00
Glauber Costa	1d7617723d	row cache: pin real dirty during cache updates. Right now, once a region is moved to the cache it is no longer visible to the dirty memory system, neither as real dirty nor as virtual dirty. The problem is that until a particular partition is moved to the cache it is not evictable. As a result we can OOM the system if we have a lot of pending cache updates as the writes will not be throttled and memory won't be made available. This patch pins the memory used by the region as real dirty before the cache update starts, and unpins it when it is over. In the mean time it gradually releases memory of the partitions that are being moved to cache. I have verified in a couple of workloads that the amount of memory accounted through this is the same amount of memory accounted through the memtable flush procedure. Fixes #1942 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 19:46:36 -05:00
Glauber Costa	c2f49da609	partition: add method to calculate memory size of a partition Once that is added, also add a method to a memtable entry to calculate the entire size of a memtable entry. Right now we only have one method to calculate the size minus rows. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Glauber Costa	b02ab991b9	dirty_memory_manager: add functions to manipulate real dirty There are times in which we want to add and remove real dirty memory without impacting virtual dirty. One such example is the cache update process, where real dirty is the limiting factor. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Glauber Costa	a6b2226562	dirty_memory_manager: block if we hit the real dirty limit Since we started accounting virtual dirty memory we no longer have a cap on real dirty memory. In most situations that is not needed, since real dirty will just be at most twice as much as virtual dirty (current flushing memtable plus new memtable). However, due to things like cache updates and component flushing we can end up having a lot of memtables that are virtually freed but not yet fully released, leading real dirty memory to explode using all the box' memory. This patch adds a cap on real dirty memory as well. Because of the hierarchical nature of region_group, if the parent blocks due to memory depletion, so will the child (virtual dirty region group). A next step is to add a controller that will increase the priority of the tasks involved in releasing real dirty memory if we get dangerously close to the threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Glauber Costa	b98a48657e	memtable: add a method to export memtable's dirty memory manager It will be used by the cache update process to gradually return real dirty memory to the manager. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Glauber Costa	ec36b9eddc	memtable: factor out calculation of memtable_entry memory size The total size is the sum of two components. Add a method that does that sum so this code gets easier to reuse. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Glauber Costa	d49ecae201	mutation_partition: estimate size of partition In the memtable flusher, we account for the size of a partition as we read them. However, there are other points in the architecture where we would like to calculate the size of a partition in a point in which we are not reading it. One such example is the cache update process. This patch enhances the mutation_partition adding a method that returns the total size for this partition. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Glauber Costa	b836005555	row_cache: modernize use of seastar threads For a while now we have an async() function, that simplifies the code by not needing to issue an explicit join. This patch converts the row cache to use async() as well, which most of our code already does. Doing so will make it easier to make changes to update_cache. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Paweł Dziepak	b69f94fece	Merge "Implement flat_mutation_reader::consume" from Piotr "Implement flat_mutation_reader::consume and add tests for it. For that implement flat_mutation_reader_from_mutations and read_mutation_from_flat_mutation_reader." * 'haaawk/flat_reader_consume_v3' of github.com:scylladb/seastar-dev: Add tests for flat_mutation_reader::consume Add tests for flat_mutation_reader utils Introduce read_mutation_from_flat_mutation_reader Make mutation_rebuilder streamed_mutation independent flat_mutation_reader_from_mutation: support multiple mutations Introduce flat_mutation_reader::consume Move FlattenedConsumer concept to flat_mutation_reader.hh	2017-11-08 15:08:47 +00:00
Paweł Dziepak	0373f357a8	Merge "Make memtable::make_reader return flat_mutation_reader" from Piotr "This patchset introduces memtable::make_flat_reader that returns flat_mutation_reader and converts internal memtable readers into flat_mutation_readers. It also introduces some utility methods like make_forwardable and make_partition_snapshot_flat_reader." * 'haaawk/flat_reader_memtable_v4' of github.com:scylladb/seastar-dev: Turn scanning_reader into flat_mutation_reader Change memtable_entry::read to return flat_mutation_reader Make iterator_reader independent from mutation_reader Introduce make_partition_snapshot_flat_reader Prepare partition_snapshot_flat_reader Introduce flat_mutation_reader_from_mutation Prepare flat_mutation_reader_from_mutation Introduce make_forwardable Prepare make_forwardable for flat_mutation_reader Introduce empty_flat_reader memtable: Introduce make_flat_reader	2017-11-08 14:24:26 +00:00
Piotr Jastrzebski	29d409de2f	Add tests for flat_mutation_reader::consume Make sure that flat_mutation_reader::consume stops as it's asked by the consumer. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	d42e53982d	Add tests for flat_mutation_reader utils Test flat_mutation_reader_from_mutations and read_mutation_from_flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	4b58a05053	Introduce read_mutation_from_flat_mutation_reader This helper method reads a single mutation from a flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	6718ecab82	Make mutation_rebuilder streamed_mutation independent mutation_rebuilder will be used not only with streamed_mutations but also with flat_mutation_readers so it's better for it to be independent from streamed_mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	aa16cd7eef	flat_mutation_reader_from_mutation: support multiple mutations Rename flat_mutation_reader_from_mutation to flat_mutation_reader_from_mutations. Make it work with std::vector<mutation> instead of a single mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	bcd5415413	Introduce flat_mutation_reader::consume This is equivalent to consume_flattened for old mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:25:28 +01:00
Piotr Jastrzebski	9233ee7309	Move FlattenedConsumer concept to flat_mutation_reader.hh This concept will be used both in flat_mutation_reader.hh and mutation_reader.hh. mutation_reader.hh includes flat_mutation_reader.hh so we have to move the concept to make it accessible in both files. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:14:51 +01:00
Piotr Jastrzebski	864d02e795	Turn scanning_reader into flat_mutation_reader This will make memtable::make_reader more efficient. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:08:53 +01:00
Tomasz Grabiec	9e115fb7e2	Merge addition of mutation_source::make_flat_mutation_reader() from Piotr Make it possible for a mutation_source to be created both for sources that use old mutation_reader and new flat_mutation_reader. Add tests for flat_mutation_reader::next_partition to run_mutation_source_tests. * seastar-dev.git 'dev/haaawk/flat_reader_mutation_source_v3': Remove mutation_reader.hh dependency from flat_mutation_reader.hh Prepare mutation_source for more than one implementation Add flat reader mutation source implementation Add mutation_source::make_flat_mutation_reader Use mutation_source::make_flat_mutation_reader in tests Add flat_mutation_reader_assertions Add test for flat_mutation_reader::next_partition	2017-11-08 14:05:25 +01:00
Piotr Jastrzebski	68505a5065	Change memtable_entry::read to return flat_mutation_reader This is the first step to move scanning_reader to be flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	7b016527bf	Make iterator_reader independent from mutation_reader iterator_reader will be used also in flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	f499949645	Introduce make_partition_snapshot_flat_reader This allows creation of flat_mutation_reader from MVCC. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	ced64d7571	Prepare partition_snapshot_flat_reader This commit creates a copy of partition_snapshot_reader and names it partition_snapshot_flat_reader. This new class will be turned into a flat_mutation_reader in the next commit. The purpose of this commit is to make it easier to review the next commit. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	ed074a4f56	Introduce flat_mutation_reader_from_mutation This is a utility method that will be handy in conversion from mutation_reader to flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	c3a4ce842a	Prepare flat_mutation_reader_from_mutation This commit copies streamed_mutation_from_mutation from streamed_mutation to flat_mutation_reader and renames it to streamed_mutation_from_mutation_copy. This copy will be used as a base for flat_mutation_reader_from_mutation. The purpose of this commit is to make it easier to review the next commit. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	decefe6eaa	Introduce make_forwardable It will add the ability to fast_forward_to on position_range to flat_mutation_reader that does not have this ability. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	6da8caf26f	Prepare make_forwardable for flat_mutation_reader This commit copies make_forwardable from streamed_mutation to flat_mutation_reader and renames it to make_forwardable_copy. This copy will be used as a base for make_forwardable implementation for flat_mutation_reader. The purpose of this commit is to make it easier to review the next commit. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	647dd7f86a	Introduce empty_flat_reader This is an implementation of flat_mutation_reader that returns nothing. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	0a9ab7ff80	memtable: Introduce make_flat_reader This method creates a flat_mutation_reader instead of mutation_reader. All users will be gradually converted to the new interface. make_reader is implemented using make_flat_reader and will be removed once all users are migrated. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	3661aca7ee	Add test for flat_mutation_reader::next_partition This is added to the run_mutation_source_tests suite. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:58:31 +01:00
Piotr Jastrzebski	1c9e4ba04c	Add flat_mutation_reader_assertions This will be useful in tests. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:58:31 +01:00
Piotr Jastrzebski	4bca2210bf	Use mutation_source::make_flat_mutation_reader in tests Use the new call in run_conversion_to_mutation_reader_tests. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:58:31 +01:00
Piotr Jastrzebski	6efda10790	Add mutation_source::make_flat_mutation_reader This will be used as an intermediate state of migration from mutation_reader to flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:58:31 +01:00
Piotr Jastrzebski	93e8b43e7b	Add flat reader mutation source implementation This will be used by sources that are migrated to flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:41:12 +01:00
Piotr Jastrzebski	1a7936561e	Prepare mutation_source for more than one implementation There will be a second implementation that will be used by sources that are converted to flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:41:12 +01:00
Piotr Jastrzebski	e80007559b	Remove mutation_reader.hh dependency from flat_mutation_reader.hh It's not needed and causes cyclic dependency when we need flat_mutation_reader in mutation_source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 12:41:12 +01:00
Duarte Nunes	f50b7c240f	tests/view_schema_test: Wrap view queries in eventually() ...instead of wrapping the base table queries, since those will immediately succeed. This fixes occasional failures. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171107170221.4309-1-duarte@scylladb.com>	2017-11-08 09:20:43 +02:00
Duarte Nunes	328b908574	tests/view_schema_test: Avoid non-pk restrictions We don't support non-PK restrictions correctly as explained in commit `3c90607` ("tests/cql_query_test: Fix view creation in test_duration_restrictions()") and Apache Cassandra doesn't support them for MVs either. Change some test cases to not rely on them. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171107165138.3176-1-duarte@scylladb.com>	2017-11-08 09:20:11 +02:00
Duarte Nunes	cb9daec8fd	thrift: Preserve query order for some verbs `f44131226a` introduced a regression where for some verbs we would return partitions in their natural sort order, but since thrift partition ranges can wrap-around, what we need to preserve is query order. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171103201118.18175-1-duarte@scylladb.com>	2017-11-07 17:00:48 +00:00
Paweł Dziepak	5a4b46f555	Merge "Fix exception safety related to range tombstones in cache" from Tomasz Fixes #2938. * 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev: tests: range_tombstone_list: Add test for exception safety of apply() tests: Introduce range_tombstone_list assertions cache: Make range tombstone merging exception-safe range_tombstone_list: Introduce apply_monotonically() range_tombstone_list: Make reverter::erase() exception-safe range_tombstone_list: Fix memory leaks in case of bad_alloc mutation_partition: Fix abort in case range tombstone copying fails managed_bytes: Declare copy constructor as allocation point Integrate with allocation failure injection framework	2017-11-07 15:30:52 +00:00
Pekka Enberg	b515cca5e2	tests/view_schema_test: Disable non-PK restriction tests We don't support non-PK restrictions correctly as explained in commit `3c90607` ("tests/cql_query_test: Fix view creation in test_duration_restrictions()") and Apache Cassandra doesn't support them for MVs either. Disable the tests, but don't remove them because they will be resurrected once CASSANDRA-13832 is fixed. Message-Id: <1510052422-3478-1-git-send-email-penberg@scylladb.com>	2017-11-07 16:42:00 +02:00
Tomasz Grabiec	0123eb876c	tests: range_tombstone_list: Add test for exception safety of apply()	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	dedb8a6a15	tests: Introduce range_tombstone_list assertions	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	bbca83d4c0	cache: Make range tombstone merging exception-safe range_tombstone_list::apply() has no exception safety guarantees about the logical state. The target mutation_partition in cache should be assumed to be left in unspecified state. In particular, some of the preexisting overlapping tombstones may be removed and not reinserted, so the cache would be missing some of the range tombstone information in case the whole allocating section fails. Use apply_monotonically() which provides the needed guarantees. Fixes #2938.	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	9c620e0246	range_tombstone_list: Introduce apply_monotonically()	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	2fe53ac617	range_tombstone_list: Make reverter::erase() exception-safe erase_undo_op() constructor takes ownership of it, and destroys it when it goes out of scope. If emplace_back() fails, it would be destroyed before being removed from its container (_dst._tombstones). Fix by making sure _ops.emplace_back() won't fail.	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	6190f9fc63	range_tombstone_list: Fix memory leaks in case of bad_alloc If insert() fails, the allocated range_tombstone would not be freed. Use alloc_strategy_unique_ptr.	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	ca3e72266f	mutation_partition: Fix abort in case range tombstone copying fails If an exception is thrown from _row_tombstones.apply(), _rows will be left uncleared. This will trigger an assertion in the bi::set_member_hook destructor, which asserts that the hook is not linked. Always clear _rows.	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	5348d9f596	managed_bytes: Declare copy constructor as allocation point Because of the small size optimization, not all copies will call the allocator, so allocation failure injection may miss this site if the value is not large enough. Make the testing more effective by marking this place explicitly as an allocation point.	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	34ccf234ea	Integrate with allocation failure injection framework	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	2d3f3ab2b8	Update seastar submodule * seastar d71922c...8040cab (4): > util: Introduce support for allocation failure injection > Adding dpdk-port-index as a command line option with default value of 0 > core/sharded: Introduce invoke_on_others() > noncopyable_function: improve support for capturing mutable lambdas	2017-11-07 15:29:45 +01:00
Paweł Dziepak	80cfcc357f	Merge "Config fixes" from Calle "Fixes #2933 Fixes regressions introduced by config restructuring. Allows "base" config to handle errors by warning, while other uses can opt otherwise." [pdziepak: resolved merge conflict] * 'calle/cfgfix' of github.com:scylladb/seastar-dev: config_test: Use error handler (ignore errors) + add error test config: Resurrect command line aliases that were lost main: Use error handler for config parse config_file: Add optional "error_handler" to yaml parse functions	2017-11-06 11:40:37 +00:00
Amos Kong	76ab8bf292	scylla_setup: parse --no-ec2-check option The option was introduced by commit `e645b0f` ("dist/common/scripts: move EC2 configuration verification to 'scylla_ec2_check'"), but scylla_setup doesn't parse the option at all. Fixes #2934 Signed-off-by: Amos Kong <amos@scylladb.com> Acked-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <df21356e528c6e161f73f4408a201fedef8f52d9.1509744954.git.amos@scylladb.com>	2017-11-06 12:06:54 +02:00
Calle Wilund	21b2c2e310	config_test: Use error handler (ignore errors) + add error test Fixes #2933 Uses handler on main test, ignoring the invalid option present. Also adds test to verify error handling works as expected.	2017-11-06 09:58:16 +00:00
Calle Wilund	959d729428	config: Resurrect command line aliases that were lost	2017-11-06 09:54:46 +00:00
Calle Wilund	f1dd698600	main: Use error handler for config parse Treat all errors as loggable errors/warnings. Preserving previous behaviour.	2017-11-06 09:54:09 +00:00
Calle Wilund	287b6fd8bd	config_file: Add optional "error_handler" to yaml parse functions Allowing parse errors / unknown options to be ignored.	2017-11-06 09:53:05 +00:00
Amos Kong	d9f16cf23c	trivial: fix a typo in warning message > std::invalid_argument: Option memtable_allocation_typeis not applicable Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <37d293a4eadfb8a58acaf96f80b1d2e943530c6b.1509947604.git.amos@scylladb.com>	2017-11-06 09:41:07 +02:00
Duarte Nunes	d8e0b47e75	Merge 'CQL secondary index queries' from Pekka "This patch series adds support for secondary index queries using the backing index view that's created when CREATE INDEX statement is executed. Example: -- Create keyspace and table: CREATE KEYSPACE ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1}; CREATE TABLE ks.users ( userid uuid, name text, email text, country text, PRIMARY KEY (userid) ); -- Create secondary indexes: CREATE INDEX ON ks.users (email); CREATE INDEX ON ks.users (country); -- Insert some data: INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Bondie Easseby', 'beassebyv@house.gov', 'France'); INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Demetri Curror', 'dcurrorw@techcrunch.com', 'France'); INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Langston Paulisch', 'lpaulischm@reverbnation.com', 'United States'); INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Channa Devote', 'cdevote14@marriott.com', 'Denmark'); -- Query on the secondary-index backed non-primary keys: SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov'; userid \| country \| email \| name --------+---------+-------+------ 022238c8-5213-44b5-959e-4e3e1b032f85 \| France \| beassebyv@house.gov \| Bondie Easseby (1 rows) SELECT * FROM ks.users WHERE country = 'France'; userid \| country \| email \| name --------------------------------------+---------+-------------------------+---------------- 2152d85a-61f6-4eab-af4d-e7e7d0872319 \| France \| beassebyv@house.gov \| Bondie Easseby 59fddb6d-bfc9-4636-a9a0-85383fd815ee \| France \| dcurrorw@techcrunch.com \| Demetri Curror Known limitations: - Only regular column indexes return results. Indexing primary key components like clustering keys returns an empty result set because of index view query partition key serialization issues that will be fixed in subsequent patches. 
- Secondary index queries are not paginated, which can cause problems for queries that return a large number of rows. - Multiple restrictions don't work correctly if one of them is backed by a secondary-index. - Only one secondary-indexed restriction per query is supported -- other restrictions are ignored. - Compound partition keys are not supported. - ALLOW FILTERING on non-primary key columns does not work correctly without secondary index (see issue #2200)." * 'penberg/cql-2i-queries/v2' of github.com:penberg/scylla: tests/cql_query_test: Add test case for secondary index queries cql3: Secondary-index backed select statements index: Fix index view schema when primary key component is indexed tests/cql_query_test: Fix view creation in test_duration_restrictions() cql3/restrictions: Add statement_restrictions::index_restrictions() helper index: Implement index::supports_expression() for EQ operator cql3: Make operator_type class non-copyable index: Fix index::supports_expression() operator parameter type cql3: Implement statement_restriction index validation	2017-11-04 01:51:55 +01:00
Tomasz Grabiec	2e96069f2f	tests: perf_cache_eviction: Switch to time-series like workload Before the patch we appended and queried at the front. Insert at the front instead, so that writes and reads overlap. Stresses eviction and population more. Message-Id: <1506369562-14892-1-git-send-email-tgrabiec@scylladb.com>	2017-11-03 13:45:41 +00:00
Tomasz Grabiec	92e3449d59	mutation_reader: Do not call fast_forward_to() on a reference to a capture The range reference is supposed to be valid as long as the reader is used, not just around fast_forward_to(). Introduced in `a6b9186cab` Message-Id: <1509710642-12713-1-git-send-email-tgrabiec@scylladb.com>	2017-11-03 12:09:42 +00:00
Amos Kong	4762326f35	dist/redhat: Fix dependence version issue The spec file requires two different version, it causes conflict. The problem was introduced in commit `6893ad46b8` ("dist/redhat: Switch to g++-7/boost-1.63 on CentOS7"). Fixes #2931 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <f0d74c4ae0d325d7e2bd827f56a36330b9ef19eb.1509703504.git.amos@scylladb.com>	2017-11-03 13:40:22 +02:00
Paweł Dziepak	0de651d617	Merge "Mark whole query range continuous in cache" from Tomasz "We currently can't insert row entries at any position_in_partition, but only at full keys and after all keys. If a query range has bounds such that we have to insert a dummy entry at non-representable position then information about range continuity will not be fully populated. In particular, single-row queries of a row which is not present in sstables will miss when repeated again. The series fixes the problem by marking the whole query range as continuous by inserting dummy entries at boundaries when necessary. Refs #2579." * tag 'tgrabiec/cache-range-continuity-v2' of github.com:scylladb/seastar-dev: tests: row_cache: Add test for population of single rows tests: Add test for population of continuity tests: mutation_reader_assertions: Introduce produces_compacted() mutation: Introduce apply(mutation_fragment) cache: Document invariants of cache_streamed_mutation::_lower_bound cache_streamed_mutation: Special-case population for singular ranges query: Introduce is_single_row() cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction cache_streamed_mutation: Override continuity of older versions when populating cache_streamed_mutation: Mark whole query range as continuous tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases row_cache: Make read_context::key() valid before reading from underlying starts mutation_partition: Allow creating rows_entry at any clustered position_in_partition position_in_partition: Do not use -2 and +2 weights clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range mutation_partition: Remove delegating_compare() mvcc: Print iterators in operator<< for partition_snapshot_row_cursor 
mvcc: Introduce partition_snapshot_row_weakref mvcc: Make the null state of partition_snapshot::change_mark explicit mvcc: Add partition_snapshot::region() getter mvcc: Add partition_snapshot::schema() getter position_in_partition: Introduce before_key() position_in_partition: Introduce min() position_in_partition: Introduce for_static_row()	2017-11-03 11:05:49 +00:00
Paweł Dziepak	ab12981491	test.py: make sure that tests/memory_footprint is being run While not a real unit test, memory_footprint can be quite a useful tool, and running it among the other tests will ensure that we notice when it gets broken. Message-Id: <20171102160233.6756-2-pdziepak@scylladb.com>	2017-11-03 11:46:30 +01:00
Paweł Dziepak	4cda3170d6	tests/memory_footprint: do not create two cache instances When created, the cache registers several metrics. Since an attempt to create an already existing metric results in an exception being thrown, it is no longer possible to have two cache instances at the same time. This is exactly what happens in memory_footprint: one (useless) cache object is created through a call to do_with_cql_env() and then memory_footprint explicitly creates another one (not a useless one). The test itself doesn't really need a full cql environment; the only reason it was added is so that storage_service is initialised and various code paths can query for the available cluster features. This can be done in a much more lightweight way using storage_service_for_tests. Fixes memory_footprint failure (until next time we decide there is nothing wrong with globals). Message-Id: <20171102160233.6756-1-pdziepak@scylladb.com>	2017-11-03 11:46:30 +01:00
Amos Kong	f2ff431b75	dist/redhat: Fix baseurl of 3rdparty repo $basearch isn't parsed as expected, so the final baseurl is wrong. We only have the x86_64 arch in the external 3rdparty repository, and the conf file is only for x86_64, so it's fine to hardcode x86_64. The problem was introduced by commit `b5e83ebd94` ("dist/redhat: switch 3rdparty packages to external build service"). Fixes #2930 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <708f46a7c36623e86fee278462c80db1eff3b820.1509700430.git.amos@scylladb.com>	2017-11-03 11:51:34 +02:00
Pekka Enberg	aeea83172b	tests/cql_query_test: Add test case for secondary index queries	2017-11-03 10:12:58 +02:00
Pekka Enberg	9048f741ad	cql3: Secondary-index backed select statements This patch adds support for secondary-index backed select statements. Current select_statement class is split into two separate classes: primary_key_select_statement that retains regular query behavior and indexed_table_select_statement that introduces the new secondary-index backed query logic. One of the two behaviors is selected at query preparation time to minimize overhead for non-indexed queries.	2017-11-03 10:12:58 +02:00
Pekka Enberg	3150962cb7	index: Fix index view schema when primary key component is indexed This fixes index view schema to exclude indexed column when a primary key component like clustering key is indexed. This fixes a server crash when CREATE INDEX statement is executed on a clustering key column.	2017-11-03 10:12:58 +02:00
Pekka Enberg	3c90607988	tests/cql_query_test: Fix view creation in test_duration_restrictions() The materialized view created in test_duration_restriction() restricts on a non-PK column. Since Scylla's ALLOW FILTERING and secondary index validation path is broken, once we start to do secondary index queries, query processor thinks there's a secondary index backing that non-PK column and fails because it's unable to find such column. Fix up the view to only trigger the duration type validation error we're interested in here.	2017-11-03 10:12:58 +02:00
Pekka Enberg	c243a0c8fc	cql3/restrictions: Add statement_restrictions::index_restrictions() helper	2017-11-03 09:10:43 +02:00
Pekka Enberg	678a6f6e2f	index: Implement index::supports_expression() for EQ operator	2017-11-03 09:10:43 +02:00
Pekka Enberg	04b482146c	cql3: Make operator_type class non-copyable The operator_type class is really an enumeration, which is not supposed to be copied.	2017-11-03 09:10:43 +02:00
Pekka Enberg	1ae9343f68	index: Fix index::supports_expression() operator parameter type The cql3::operator_type is supposed to be passed around as const reference, not by value; otherwise equality won't work.	2017-11-03 09:10:43 +02:00
Pekka Enberg	3e3c580f74	cql3: Implement statement_restriction index validation	2017-11-03 09:10:43 +02:00
Botond Dénes	ce03a4d2c7	test.py: print failed test summary if there are failed tests Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <3a913b111552276ab94dfb83738244699550f929.1507894597.git.bdenes@scylladb.com>	2017-11-02 11:49:14 +00:00
Tomasz Grabiec	16d6222a96	tests: row_cache: Add test for population of single rows	2017-11-02 12:16:17 +01:00
Tomasz Grabiec	bbf8ccb709	tests: Add test for population of continuity	2017-11-02 12:16:17 +01:00
Tomasz Grabiec	3ad5666098	tests: mutation_reader_assertions: Introduce produces_compacted()	2017-11-02 12:16:17 +01:00
Tomasz Grabiec	749f5770df	mutation: Introduce apply(mutation_fragment)	2017-11-02 12:16:17 +01:00
Tomasz Grabiec	a76202df4f	cache: Document invariants of cache_streamed_mutation::_lower_bound (cherry picked from commit b52813279d30782270ac83856233f18787b28b7e)	2017-11-02 12:16:17 +01:00
Tomasz Grabiec	328faf695e	cache_streamed_mutation: Special-case population for singular ranges This is an optimization which avoids creating dummy entries around row entry when populating a singular range.	2017-11-02 12:16:09 +01:00
Tomasz Grabiec	90796893ee	query: Introduce is_single_row()	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	0fd57cdff5	cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	8c41a3eb43	cache_streamed_mutation: Override continuity of older versions when populating Fixes the case of continuity not being populated when the row which is the upper bound of the population range belongs to a non-latest version. In such case we wouldn't mark the range as continuous, because we can't modify rows of non-latest versions. To fix this, create an empty entry in latest version which will just override the continuity flag of the old entry.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	65ed490e1c	cache_streamed_mutation: Mark whole query range as continuous Before this patch only ranges between returned row fragments were marked as continuous. In the extreme case, there could be no such fragments, in which case next read would miss as well. To avoid this, mark whole query range as continuous by inserting dummy entries when necessary. Refs #2579.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	552d7a683a	tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	d4928eb1b7	cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows Current code will not mark the range as continuous if the previous entry does not come in the latest version. Fix that by switching to partition_snapshot_row_pointer, which is capable of checking in older versions as necessary. Also, we avoid the key comparison if we know that the iterator is still valid.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	835d17ee37	cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	af4a9a4a30	row_cache: Make read_context::key() valid before reading from underlying starts So that we can call cache_streamed_mutation::can_populate() before we start reading from underlying. Will be needed in upcoming changes which insert dummy entries when falling back to underlying.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	72028bb048	mutation_partition: Allow creating rows_entry at any clustered position_in_partition In preparation for supporting setting continuity of arbitrary clustering range.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	9ac1b515e1	position_in_partition: Do not use -2 and +2 weights ::weight() is using those values for excl_end and excl_start in order to be able to represent non-overlapping ranges. In their model the end bound is inclusive. We don't need this, since position_range has end bound exclusive. This change ensures that: position_in_partition::after_key(y) == position_in_partition::for_range_end(clustering_range::make({x}, {y}))	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	4b25fa1130	clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range position_range is end-exclusive. The reader might have returned a tombstone which is not really relevant for the range.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	409adc045a	mutation_partition: Remove delegating_compare() It can't work with rows_entry at any position_in_partition, so we need to drop it.	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	b4954f55b9	mvcc: Print iterators in operator<< for partition_snapshot_row_cursor	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	ad156b5986	mvcc: Introduce partition_snapshot_row_weakref	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	967cabcaf2	mvcc: Make the null state of partition_snapshot::change_mark explicit	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	4b7933543d	mvcc: Add partition_snapshot::region() getter	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	9cf30f19ae	mvcc: Add partition_snapshot::schema() getter	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	34cb13939f	position_in_partition: Introduce before_key()	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	cc06c328ef	position_in_partition: Introduce min()	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	d05f130b09	position_in_partition: Introduce for_static_row()	2017-11-02 11:05:19 +01:00
Calle Wilund	8c257c40b4	storage_service: Only replicate token metadata iff modified in on_change Fixes #2869 Message-Id: <20171101105629.22104-1-calle@scylladb.com>	2017-11-01 14:56:55 +02:00
Jesse Haber-Kucharsky	da5c486e49	Add `coding-style.md` referencing Seastar While it would be nice if we could reference the file corresponding to the exact version of Seastar pinned as a Scylla submodule, GitHub does not support this. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <b7565acd1ccb8bce42b0edf00221922e78e1c9ef.1508274655.git.jhaberku@scylladb.com>	2017-10-30 16:52:29 -07:00
Duarte Nunes	74a4cf8bb1	thrift/handler: multiget_{slice, count} always returns queried keys This patch changes the way the multiget_{slice, count} verbs return their results, by ensuring a queried key that produced no results is still present in the returned map, associated with an empty list. This is not required by the thrift interface, and it is a performance step back, but matches the behavior of Apache Cassandra. Said behavior is relied upon by projects like JanusGraph, whose integration with Scylla motivated this patch. Fixes #2900 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171019161104.22797-2-duarte@scylladb.com>	2017-10-30 16:48:58 -07:00
Duarte Nunes	f44131226a	thrift/handler: Use map for column_visitor aggregation Most common operations, like multiget_count and multiget_slice, return maps. So, instead of keeping a vector internally in column_visitor that we later transform into a map, keep a map that we transform into a vector for the uncommon operations. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171019161104.22797-1-duarte@scylladb.com>	2017-10-30 16:48:55 -07:00
Takuya ASADA	a56dd79d69	dist/redhat: moving gcc-7.1 to gcc-7.2 We mistakenly merged the patch which compiles Scylla using gcc-7.1; it needs to be fixed to use the correct version (gcc-7.2). Message-Id: <1508875618-31659-1-git-send-email-syuu@scylladb.com>	2017-10-30 14:26:43 -07:00
Tomasz Grabiec	b4e3c0946a	cache_streamed_mutation: Avoid copy of decorated_key Message-Id: <1509060503-17483-1-git-send-email-tgrabiec@scylladb.com>	2017-10-26 16:51:27 -07:00
Pekka Enberg	9ba3920fd7	Merge "cql3/query_processor: Clean-up" from Jesse "This series cleans-up the query processor header and source file, including deleting dead Java code. There are no functional or interface changes. I've run all unit tests and observed no failures." * 'jhk/clean_up_qp/v2' of github.com:hakuch/scylla: cql3/query_processor: Fix formatting cql3/query_processor: Organize headers cql3/query_processor.hh: Consolidate `public` and `private` sections cql3/query_processor: Remove dead Java code	2017-10-21 21:48:06 +03:00
Jesse Haber-Kucharsky	66c4abe4fb	cql3/query_processor: Fix formatting Lines are now less than 120 columns and formatting conforms to the Seastar coding standards document.	2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky	edb83c0014	cql3/query_processor: Organize headers	2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky	ed6a3179a1	cql3/query_processor.hh: Consolidate `public` and `private` sections	2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky	50cfa8a7b8	cql3/query_processor: Remove dead Java code	2017-10-21 13:53:03 -04:00
Avi Kivity	ef8587a910	Merge seastar upstream * seastar 8babd1f...d71922c (11): > configure.py: add -Wno-sign-compare to compile Boost.Test with gcc-7 > log: Print nested exceptions > reactor: do not account non idle activity for total idle time calculation > execution_stage: defer execution less aggressively > Fix -Wreturn-type warnings > cpu scheduler: make _reciprocal_shares_times_2_32 wider to avoid overflow problems > noncopyable_function add bool operator > execution_stage: make make_execution_stage return a named type > memory: support overriding the default allocator page size > memory: fix crash during startup with large page_size > core: io_destroy is missing when destructing reactor, which causes io_context leak	2017-10-21 16:38:32 +03:00
Takuya ASADA	bc76d34e34	dist/debian: handle python scripts correctly on package builder We are failing to build the .deb package on pbuilder due to a lack of build-time dependencies, so we need to add those packages to Build-Depends. We also need to follow the Debian packaging style for packages that contain python scripts. Fixes #2918 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1508457215-11552-1-git-send-email-syuu@scylladb.com>	2017-10-21 16:38:20 +03:00
Takuya ASADA	6893ad46b8	dist/redhat: Switch to g++-7/boost-1.63 on CentOS7 Switch to g++-7/boost-1.63 on CentOS7, too. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1508457509-13122-2-git-send-email-syuu@scylladb.com>	2017-10-21 12:28:40 +03:00
Takuya ASADA	7f38634080	dist/debian: Switch to g++-7/boost-1.63 on Ubuntu 14.04/16.04 Switch to g++-7/boost-1.63 for Ubuntu 14.04/16.04, which are newly provided via our 3rdparty PPA. To make Scylla compilable with boost-1.63/g++-7, we need to disable the following warnings: - misleading-indentation - overflow - noexcept-type Compile error message: https://gist.github.com/syuu1228/96acc640c56c3316df5ce6911d60beea Seastar has a similar problem: it needs to disable 'sign-compare'; details are in a patch for Seastar. This update also fixes the current Ubuntu 14.04/16.04 compilation errors, since the errors came from a too-old g++/boost. Fixes #2902 Fixes #2903 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1508457509-13122-1-git-send-email-syuu@scylladb.com>	2017-10-21 12:28:38 +03:00
Avi Kivity	d6cd44a725	Revert "Merge 'Single key sstable reader optimization' from Botond" This reverts commit `5e9cd128ad`, reversing changes made to `1f4e6759a7`. Tomek found some serious issues.	2017-10-19 12:47:21 +03:00
Botond Dénes	9bd4d7cbb2	Readd x right to configure.py (removed by `05db87e06`) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <41d505277fe29b275aba65477cafc9275b393a64.1508395961.git.bdenes@scylladb.com>	2017-10-19 11:30:06 +03:00
Duarte Nunes	5e9cd128ad	Merge 'Single key sstable reader optimization' from Botond "When reading a single row it is possible that the read will be satisfied by just reading from one of the data source candidates. To exploit this an optimization is employed which sorts data source candidates by their timestamp and reads mutations from the most recent to the oldest. When all needed cells are present and their earliest timestamp is still later than the latest one of the remaining data sources, the read can be terminated early. However, this optimization can also backfire because the data sources are read sequentially, so if all of them have to be read eventually then we will end up worse than without it. Thus the optimization can be disabled up-front or enabled to only run until its efficiency degrades below a certain threshold. Also counters are added to column-families to make it possible to observe how well it performs. Benchmarking Benchmarking was done with disabled cache and at a constant op rate of 4k (1/3 of the max op rate on my box), against 3 sstables containing the same 10000 rows. 
1) Optimization turned off (all sstables read in parallel) latency mean : 1.3 [simple:1.3] latency median : 1.0 [simple:1.0] latency 95th percentile : 2.4 [simple:2.4] latency 99th percentile : 2.9 [simple:2.9] latency 99.9th percentile : 8.0 [simple:8.0] latency max : 13.5 [simple:13.5] 2) Optimization turned on, best case (1 of 3 sstables read) latency mean : 0.6 [simple:0.6] latency median : 0.6 [simple:0.6] latency 95th percentile : 1.0 [simple:1.0] latency 99th percentile : 1.2 [simple:1.2] latency 99.9th percentile : 4.4 [simple:4.4] latency max : 13.4 [simple:13.4] 3) Optimization turned on, best case, IN query (1 of 3 sstables read) latency mean : 0.7 [simple_in:0.7] latency median : 0.6 [simple_in:0.6] latency 95th percentile : 1.1 [simple_in:1.1] latency 99th percentile : 1.4 [simple_in:1.4] latency 99.9th percentile : 5.4 [simple_in:5.4] latency max : 16.8 [simple_in:16.8] 4) Optimization turned on, worst case (3 of 3 sstables read sequentially) latency mean : 2.8 [simple:2.8] latency median : 2.3 [simple:2.3] latency 95th percentile : 5.4 [simple:5.4] latency 99th percentile : 6.5 [simple:6.5] latency 99.9th percentile : 13.5 [simple:13.5] latency max : 19.2 [simple:19.2] 5) Optimization turned on, mid case (2 of 3 sstables read sequentially) latency mean : 1.4 [simple:1.4] latency median : 1.1 [simple:1.1] latency 95th percentile : 2.7 [simple:2.7] latency 99th percentile : 3.2 [simple:3.2] latency 99.9th percentile : 7.7 [simple:7.7] latency max : 15.1 [simple:15.1]" Ref #324 * 'bdenes/optimize_single_row_read_v6' of github.com:denesb/scylla: Add unit tests for single_key_sstable_reader Add counters for the single-key reader optimization Add single_key_parallel_scan_threshold option single_key_sstable_reader: optimize single-row queries single_key_sstable_reader: move reading code into its own method Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()	2017-10-18 16:38:53 +01:00
Duarte Nunes	1f4e6759a7	tests: Fix compile errors introduced in `c468e5981` Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1508337315-8224-1-git-send-email-duarte@scylladb.com>	2017-10-18 16:38:18 +01:00
Botond Dénes	c3bd89ad63	Add unit tests for single_key_sstable_reader	2017-10-18 17:24:03 +03:00
Botond Dénes	dfe312ca3a	Add counters for the single-key reader optimization Add two counters, one to determine how many of the reads fall into the optimization, and a second one to determine its effectiveness. The first one is single_key_reader_optimization_hit_rate. It contains the rate of reads that the optimization applies to out of all the reads that go into the single_key_sstable_reader. The second one, single_key_reader_optimization_extra_read_proportion, is a histogram of the efficiency of the optimization. It contains the proportion of extra data-sources read. It's a number between 0 and 1, where 0 is the best case (only one data-source was read) and 1 is the worst case (all data-sources were read eventually). This is the same number that is used for the threshold option (see previous patch). Each of the histogram's buckets covers a chunk of 0.1 from the [0, 1] range. Note that single_key_parallel_scan_threshold effectively provides an upper bound for the proportion as the optimization is turned off as soon as it goes above that number. The counters are disabled if single_key_parallel_scan_threshold is set to 0, disabling the optimization entirely.	2017-10-18 17:24:03 +03:00
Botond Dénes	08502f2d48	Add single_key_parallel_scan_threshold option This option regulates when exactly the single-key optimization is considered ineffective and turned off. The threshold is the proportion of the extra data source candidates that can be read before the optimization is considered ineffective and disabled. The proportion is calculated as follows: (read_data_sources - 1) / (total_data_sources - 1) We subtract 1 from the read_data_sources and total_data_sources to effectively measure the rate of extra data sources we read. This makes sure that the proportion is meaningful even if e.g. we only have a total of 2 data-sources and we read only 1 (best case). Whenever this number goes above the threshold the optimization is disabled. The threshold is a number between 0 and 1; 0 forces the optimization off and 1 forces it on. Increase the threshold to favor throughput over latency for single-row reads, decrease the threshold to improve latency at the expense of throughput. If the threshold is > 0 (it's not force disabled) and the optimization is disabled due to a read crossing the threshold, we will issue "probing" reads (every 100th read) to determine if the optimization is worth re-enabling. Probing reads are allowed to run through the optimization path and if they go below the threshold the optimization is re-enabled.	2017-10-18 17:24:03 +03:00
Botond Dénes	3c1fa3ecc1	single_key_sstable_reader: optimize single-row queries For single-row queries that only query atomic cells one can put a lower bound on the timestamps which may affect the query results and thus rule out entire data sources. This allows the query to read only those sstables that actually contribute to the result. To do this we incrementally move through the sstables overlapping with the query range, checking after each read mutation whether we already have a value for all required cells and whether the lower-bound of their timestamps is higher than the upper-bound of the timestamps of all the remaining data-sources. When this condition is met we terminate the read.	2017-10-18 17:24:03 +03:00
Botond Dénes	5fc44c4307	single_key_sstable_reader: move reading code into its own method	2017-10-18 17:24:03 +03:00
Botond Dénes	6cdeca1846	Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()	2017-10-18 17:24:03 +03:00
Botond Dénes	7aceb14395	Fix compile errors in tests/config_test.cc introduced by `c468e5981` Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <2700ac3987c3a229eb7083ce6f5d390012a3b66c.1508336217.git.bdenes@scylladb.com>	2017-10-18 15:20:45 +01:00
Paweł Dziepak	c28e31eac4	database: fix build (auto shards&)	2017-10-18 13:10:00 +01:00
Duarte Nunes	446e5f53db	database: Avoid superfluous shards_for_this_sstable vector copies Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171018112643.40411-1-duarte@scylladb.com>	2017-10-18 15:00:52 +03:00
Duarte Nunes	044b8deae4	Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz "The main problem fixed is slow processing of application state changes. This may lead to a bootstrapping node not having an up-to-date view of the ring and serving incorrect data. Fixes #2855." * tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev: gms/gossiper: Remove periodic replication of endpoint state map gossiper: Check for features in the change listener gms/gossiper: Replicate changes incrementally to other shards gms/gossiper: Document validity of endpoint_state properties storage_service: Update token_metadata after changing endpoint_state gms/gossiper: Process endpoints in parallel gms/gossiper: Serialize state changes and notifications for given node utils/loading_shared_values: Allow Loader to return non-future result gms/gossiper: Encapsulate lookup of endpoint_state storage_service: Batch token metadata and endpoint state replication utils/serialized_action: Introduce trigger_later() gossiper: Add and improve logging gms/gossiper: Don't fire change listeners when there is no change gms/gossiper: Allow parallel apply_state_locally() gms/gossiper: Avoid copies in endpoint_state::add_application_state() gms/failure_detector: Ignore short update intervals	2017-10-18 10:13:25 +01:00
Duarte Nunes	c468e59817	Merge 'Extract config file mechanism + allow additional' from Calle "Extracts the yaml/boost-po aspects of the "self-describing" db::config into an abstract type. db::config is then reimplemented in said type, removing some of the slightly cumbersome entanglement with seastar opts (log). Adds a main hook for additional configuration files (options + file)" * 'calle/config' of github.com:scylladb/seastar-dev: main/init: Add registerable configuration objects db::config: Re-implement on utils/config_file. utils::config_file: Abstract out config file to external type	2017-10-18 09:50:53 +01:00
Tomasz Grabiec	f570e41d18	gms/gossiper: Remove periodic replication of endpoint state map For large clusters the map can be big and cause latency problems. Since we now actively replicate changes, this is no longer needed.	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	84c7b63c51	gossiper: Check for features in the change listener In preparation for removal of periodic replication	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	2d5fb9d109	gms/gossiper: Replicate changes incrementally to other shards storage_service depends on endpoint states to be replicated to all shards before token metadata is replicated. Currently this is taken care of by storage_service::replicate_to_all_cores(), invoked from storage_service's change listener. It copies whole endpoint state map, which is expensive in large clusters. It's more efficient to replicate only incremental changes, and only once, rather than for each application state.	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	28c9609370	gms/gossiper: Document validity of endpoint_state properties	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	cf113ed295	storage_service: Update token_metadata after changing endpoint_state There is a requirement that whatever is present in token_metadata, should also be present in endpoint_state. Because of that, we should update endpoint_state first (set_gossip_tokens). Apache Cassandra switched to this order as well in commit b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	5cc83b9b3c	gms/gossiper: Process endpoints in parallel Makes state application faster due to increased parallelism. Refs #2855. Bootstrap of 11th node, ignoring apply_state_locally() calls which complete instantly: Before: DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms After: DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	8f01e08690	gms/gossiper: Serialize state changes and notifications for given node It's possible that a change listener for a later state will run before the change listener for the previous state completes, in which case the node's state can be corrupted. For example, the previous change listener may override system.peers with an old value. This patch fixes the problem by serializing state changes and listeners for each node. The implementation uses loading_shared_values so that the lock remains alive as long as there is anyone holding it. Using endpoint_state_map for that doesn't seem appropriate, because entries can be removed from it while listeners are still running. There is code in the gossiper which anticipates that an entry may be gone across deferring points in some places.	2017-10-18 08:49:53 +02:00
Tomasz Grabiec	f7a7e97095	utils/loading_shared_values: Allow Loader to return non-future result	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	6fccf7f4d0	gms/gossiper: Encapsulate lookup of endpoint_state	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	6263b0ebb6	storage_service: Batch token metadata and endpoint state replication Replication needs to be serialized. We can batch replication requests which are waiting to start. Use serialized_action, which does this.	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	2e2ae4671e	utils/serialized_action: Introduce trigger_later() Can be used instead of trigger() to improve batching.	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	41ffefd194	gossiper: Add and improve logging	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	0ed84710d9	gms/gossiper: Don't fire change listeners when there is no change apply_new_states() always fires change listeners for received values, even if we already processed the state earlier. Some change listeners are heavy-weight, e.g. storage_service::handle_state_normal(). We should avoid calling them more than necessary. Make sure that we always run the change listeners by putting them in a defer() block. Otherwise, if an exception is thrown in the middle of state application, change listeners would not be run. Later we would not detect the change for states which were already applied, and not run the change listeners. Fixes #2867	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	c780a74b58	gms/gossiper: Allow parallel apply_state_locally() It is serialized since `e428d06f40`. This causes regression in performance of application state propagation due to reduced parallelism. Processing states for each node has high latency due to memtable flushes triggered by update_tokens() and commitlog syncs done by system.peers updates, if commitlog sync mode is set to "batch". We have high internal concurrency for these, so increasing parallelism significantly reduces time to process all states. Fixes #2855.	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	f20a805eca	gms/gossiper: Avoid copies in endpoint_state::add_application_state()	2017-10-18 08:49:52 +02:00
Tomasz Grabiec	a71624d58d	gms/failure_detector: Ignore short update intervals Failure detector decides that a node is down if it hasn't received a change of its heartbeat for longer than ~11 times the average of past intervals between updates. If there are multiple incoming ACKs containing information about the same node, we may detect and report a change for each of them. This will cause failure_detector to establish that the average report period is in milliseconds. After the update storm is over, it will claim the node failure very soon, because report period will now be a large multiple of the average. Fix by not counting short updates into the calculation of average arrival time. Fixes #2861.	2017-10-18 08:49:52 +02:00
Calle Wilund	12a54805ea	main/init: Add registerable configuration objects Allowing plugging in command line arguments + "parse-points" for configs outside db/config	2017-10-18 00:52:04 +00:00
Calle Wilund	4bd98f7296	db::config: Re-implement on utils/config_file. Re-use config abstraction, and de-couple the seastar logging parts a little bit more.	2017-10-18 00:51:54 +00:00
Calle Wilund	05db87e068	utils::config_file: Abstract out config file to external type Handling all the boost::commandline + YAML stuff. This patch only provides an external version of these functions, it does not modify the db::config object. That is for a follow-up patch.	2017-10-18 00:51:41 +00:00
Pekka Enberg	ae92055b52	Merge "Bring histogram closer to what Prometheus expects" from Glauber "Histograms are a native prometheus type, and there are many functions available that operate on them. There is extensive documentation about them at https://prometheus.io/docs/practices/histograms/ One example is the function histogram_quantile(), that can extract useful quantiles from the histograms. Currently, those functions don't work well. The reasons are twofold: 1) We are only exporting 16 metrics, starting from 1usec. That means that the highest latency we can differentiate is 4ms. After that, everything falls into the same bin. 2) The format that prometheus expects is that each bin will contain the total number of points seen up until that bin, while we currently export the total number of points that fall between bins. IOW, it is a cumulative histogram. About point two, granted it is a bit hidden in their website, but it is there. The following phrase about a caveat makes it clear: "Note that we divide the sum of both buckets. The reason is that the histogram buckets are cumulative. The le="0.3" bucket is also contained in the le="1.2" bucket; dividing it by 2 corrects for that." It is also not needed to accumulate things that fall over the last bin: the _count component of the histogram will already account for that." Acked-by: Amnon Heiman <amnon@scylladb.com> Acked-by: Gleb Natapov <gleb@scylladb.com> * 'prometheus-histograms' of github.com:glommer/scylla: storage_proxy: change reporting of estimated histograms estimated_histogram: bring histogram closer to what prometheus expects.	2017-10-17 20:23:10 +03:00
Takuya ASADA	3cab5557e5	dist/debian: fix Debian not to use new dependency package names We moved to new dependency package names like antlr3-c++-dev to scylla-antlr35-c++-dev when we moved to ppa on Ubuntu, but Debian still uses old dist/debian/dep packages. So keep using old style package names. Fixes #2831 Message-Id: <1508245175-2184-1-git-send-email-syuu@scylladb.com>	2017-10-17 16:39:35 +03:00
Daniel Fiala	f5629b3a23	types: Use std::pair instead of std::tuple to avoid compile-time error with explicit constructor. Fixes #2895. Signed-off-by: Daniel Fiala <daniel@scylladb.com> Message-Id: <20171017071316.2836-1-daniel@scylladb.com>	2017-10-17 12:32:43 +01:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Duarte Nunes	fbb4c9edda	schema: Provide all-selecting partition slice This patch introduces schema::full_slice(), which returns a partition_slice selecting the full clustering range, as well as all static and regular columns. No options aside from the default are set in that partition_slice. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-1-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:35 +02:00
Tomasz Grabiec	6d5a0f8a98	db: Add debug-level logging related to streaming Message-Id: <1505896395-30203-1-git-send-email-tgrabiec@scylladb.com>	2017-10-16 18:49:10 +01:00
Paweł Dziepak	d9abb75bfa	tests/perf_simple_query: fix counter update query Message-Id: <20171016125334.4423-1-pdziepak@scylladb.com>	2017-10-16 19:41:31 +02:00
Calle Wilund	cc28cf838c	password_auth: Return actual generated salt from gensalt Fixes: 2898 Typo error in gensalt(). Only returned selected hash method, not the random salt bytes. Does not prevent the hash function from operating, but strength is ever so reduced. Message-Id: <20171016130505.25593-2-calle@scylladb.com>	2017-10-16 14:07:46 +01:00
Calle Wilund	57c5f13166	password_auth: Keep crypt_data as thread local Fixes: 2887 Speeds up password hashing ever so slightly. Message-Id: <20171016130505.25593-1-calle@scylladb.com>	2017-10-16 14:07:42 +01:00
Paweł Dziepak	8c3b7fea81	Merge "Introduce new API and converters from/to old mutation_reader" from Piotr "This changeset is the first step to flatten mutation_reader. Then it introduces new mutation_fragment types for partition header and end of partition. Using those a new flat_mutation_reader is defined. Finally it introduces converters between new flat_mutation_reader and old mutation_reader." * 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev: Add tests for flat_mutation_reader Introduce conversion from flat_mutation_reader to mutation_reader Introduce conversion from mutation_reader to flat_mutation_reader Introduce flat_mutation_reader Extract FlattenedConsumer concept using GCC6_CONCEPT Introduce partition_end mutation_fragment Introduce a position for end of partition Introduce partition_start mutation_fragment Introduce FragmentConsumer Introduce a position for partition start streamed_mutation: Extract concepts using GCC6_CONCEPT macro	2017-10-16 12:14:23 +01:00
Gleb Natapov	bd09ce7cd4	gdb: Add new command task_histogram The command scans random set of objects in a small pool (or, optionally only objects of a certain size) for vptrs and builds a histogram, so that most often used vptrs can be easily found. The command is useful to find "memory leaks" caused by creating of too many tasks of a certain type which is usually a result of unlimited parallelism somewhere. Message-Id: <20171015081634.GB21092@scylladb.com>	2017-10-15 12:12:42 +03:00
Piotr Jastrzebski	5f34559b78	Add tests for flat_mutation_reader Those tests run mutation source test for all sources using conversion to and from flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-13 16:08:59 +02:00
Piotr Jastrzebski	31733a7eeb	Introduce conversion from flat_mutation_reader to mutation_reader This will be used in transition from mutation_reader to flat_mutation_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-13 16:08:59 +02:00
Piotr Jastrzebski	6a66bee788	Introduce conversion from mutation_reader to flat_mutation_reader This will be used in transition from mutation_reader to flat_mutation_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-13 16:08:59 +02:00
Piotr Jastrzebski	748205ca75	Introduce flat_mutation_reader This reader operates on mutation_fragments instead of streamed_mutations. Each partition starts with a partition_header fragment and ends with end_of_partition fragment. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-13 16:08:40 +02:00
Tomasz Grabiec	b74b06808e	tests: row_cache: Add test for concurrent population of partition entries Message-Id: <1507815478-20269-2-git-send-email-tgrabiec@scylladb.com>	2017-10-12 15:55:33 +01:00
Tomasz Grabiec	083b9cddef	row_cache: Fix handling of concurrent partition population This fixes a regression introduced in `27a3b4bca9` (master only). partition_range_cursor assumes that as long as references are valid, _end is valid as well. But if new entries were inserted before _end, it may not, if the new entries fall after the query range. This may result in reads returning partitions from outside the query range. Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>	2017-10-12 15:55:20 +01:00
Tomasz Grabiec	68fe1a5bee	utils/loading_cache: Fix compilation on older compilers Message-Id: <1507728312-10585-1-git-send-email-tgrabiec@scylladb.com>	2017-10-12 14:55:34 +03:00
Raphael S. Carvalho	25a4f152cd	sstables: remove dead sstable method Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171012072905.12737-1-raphaelsc@scylladb.com>	2017-10-12 11:58:39 +02:00
Raphael S. Carvalho	16dd0d15fc	sstables: make get_shards_for_this_sstable return const ref Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171012072850.12681-1-raphaelsc@scylladb.com>	2017-10-12 11:58:23 +02:00
Pekka Enberg	1701fc2e50	Merge "gms/gossiper: Multiple cleanups" from Duarte "Based on the functions get_endpoint_state_for_endpoint_ptr(), get_application_state_ptr() and endpoint_state::get_application_state_ptr(), this series cleans up miscellaneous functions related to the gossiper. It not only removes duplicated code, but also omits many copies. All pointer usages have been audited for safety." Acked-by: Asias He <asias@scylladb.com> Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com> * 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits) gms/endpoint_state: Remove get_application_state() service/storage_service: Avoid copies in prepare_replacement_info() service/storage_service: Cleanup get_application_state_value() service/storage_service: Cleanup handle_state_removing() service/storage_service: Cleanup get_rpc_address() locator/reconnectable_snitch_helper: Avoid versioned_value copies locator/production_snitch_base: Cleanup get_endpoint_info() service/migration_manager: Avoid copies in is_ready_for_bootstrap() service/migration_manager: Cleanup has_compatible_schema_tables_version() service/migration_manager: Fix usages of get_application_state() cache_hit_rate: Avoid copies in get_hit_rate() gms/endpoint_state: Avoid copies in is_shutdown() service/load_broadcaster: Avoid copy in on_join() gms/gossiper: Cleanup get_supported_features() gms/gossiper: Cleanup get_gossip_status() gms/gossiper: Cleanup seen_any_seed() gms/gossiper: Cleanup get_host_id() gms/gossiper: Removed dead uses_vnodes() function gms/gossiper: Cleanup uses_host_id() gms/gossiper: Add get_application_state_ptr() ...	2017-10-11 13:45:36 +03:00
Duarte Nunes	f67a553b96	gms/endpoint_state: Remove get_application_state() It is no longer used, as all callsites have moved to get_application_state_ptr(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	e9358c1c83	service/storage_service: Avoid copies in prepare_replacement_info() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	674f5d8eaf	service/storage_service: Cleanup get_application_state_value() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	0ccb9211d7	service/storage_service: Cleanup handle_state_removing() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	bdee795876	service/storage_service: Cleanup get_rpc_address() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	2f05d7423a	locator/reconnectable_snitch_helper: Avoid versioned_value copies Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	28d63a76df	locator/production_snitch_base: Cleanup get_endpoint_info() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	03e6fc95ba	service/migration_manager: Avoid copies in is_ready_for_bootstrap() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	72ca6b34ef	service/migration_manager: Cleanup has_compatible_schema_tables_version() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	976324bbb8	service/migration_manager: Fix usages of get_application_state() We were taking a reference to a temporary value in different places. Fix them by using get_application_state_ptr(), which also avoids a copy. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	bb89b97cbb	cache_hit_rate: Avoid copies in get_hit_rate() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	9d5c6e0c72	gms/endpoint_state: Avoid copies in is_shutdown() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	25b0654312	service/load_broadcaster: Avoid copy in on_join() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	92df519b91	gms/gossiper: Cleanup get_supported_features() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	39f71f7d12	gms/gossiper: Cleanup get_gossip_status() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	db660f1e08	gms/gossiper: Cleanup seen_any_seed() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	88dd97fe8e	gms/gossiper: Cleanup get_host_id() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	95079795ce	gms/gossiper: Removed dead uses_vnodes() function Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	7db7704edc	gms/gossiper: Cleanup uses_host_id() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	2984bdab29	gms/gossiper: Add get_application_state_ptr() This patch introduces the get_application_state_ptr() function, which allows access to a versioned_value of a particular endpoint. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	f41748af81	gms/gossiper: Cleanup notify_failure_detector() Now that we have get_endpoint_state_for_endpoint_ptr(), which does not return a copy and allows mutating the actual state, we can use it instead of repeating the lookup code. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	2210d10552	gms/gossiper: Cleanup is_alive() Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is enabled, mark it as const, and have some callers use it instead of open coding the logic. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	ceef45a6fe	gms/gossiper: Const-qualify functions Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:31 +01:00
Duarte Nunes	955aee1588	gms/gossiper: Cleanup convict() Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify logging, and also protect expensive operations by checking the log level. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:31 +01:00
Duarte Nunes	cf99a41226	gms/gossiper: Add non-const get_endpoint_state_for_endpoint_ptr() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:31 +01:00
Duarte Nunes	d0fba1a113	gms/failure_detector: Simplify alive/dead endpoint count Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:31 +01:00
Duarte Nunes	dc65cda1a3	gms/failure_detector: Fix if/else style to include braces Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:31 +01:00
Raphael S. Carvalho	67c5c8dc67	sstables: do not recompute shards for all tables after each compaction For every finished compaction, we were calculating shards for all existing tables. With ignore_msb set to 0, it's probably not a big deal, but if ignore_msb is like 12 and LCS is used (meaning thousands of tables possibly), the operation may stall the reactor for a considerable amount of time. That's fixed by caching shards. Fixes #2875. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20171011053424.22308-1-raphaelsc@scylladb.com>	2017-10-11 11:45:01 +03:00
Tomasz Grabiec	66a15ccd18	gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr() Message-Id: <1507642411-28680-3-git-send-email-tgrabiec@scylladb.com>	2017-10-10 18:27:43 +01:00
Gleb Natapov	36d9225e40	scylla-gdb: print number of allocated objects as an integer instead of float Message-Id: <20171010151835.GT23527@scylladb.com>	2017-10-10 18:19:44 +03:00
Avi Kivity	4ad3900d8d	Merge "gossiper: Optimize endpoint_state lookup" from Duarte "gossiper::get_endpoint_state_for_endpoint() returns a copy of endpoint_state, which we've seen can be very expensive. This series introduces a function that returns a pointer and avoids the copy. Fixes #764" * 'endpoint-state/v2' of https://github.com/duarten/scylla: gossiper: Avoid endpoint_state copies endpoint_state: const-qualify functions storage_service: Remove duplicate endpoint state check	2017-10-10 17:29:22 +03:00
Piotr Jastrzebski	f325fef362	Extract FlattenedConsumer concept using GCC6_CONCEPT This concept will be used in flat_mutation_reader::consume Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	46727f12e0	Introduce partition_end mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the end of the current partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	adffc80619	Introduce a position for end of partition This position will be used for mutation fragment that represents the end of partition. This position sorts after all other mutation fragments. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	2516b42752	Introduce partition_start mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the beginning of the next partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	1f4fb6dd4a	Introduce FragmentConsumer This concept helps define StreamedMutationConsumer. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:05:44 +02:00
Duarte Nunes	ceebbe14cc	gossiper: Avoid endpoint_state copies gossiper::get_endpoint_state_for_endpoint() returns a copy of endpoint_state, which we've seen can be very expensive. This patch adds a similar function which returns a pointer instead, and changes the call sites where using the pointer-returning variant is deemed safe (the pointer neither escapes the function, nor crosses any defer point). Fixes #764 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:48:02 +01:00
Duarte Nunes	bc976b4773	endpoint_state: const-qualify functions Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:30:28 +01:00
Duarte Nunes	198b1b76b5	storage_service: Remove duplicate endpoint state check We already performed the check, so we don't need to do it again. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:25:34 +01:00
Avi Kivity	c0687a9761	sstables: replace naked new with make_lw_shared Fallout from the sstables dependency reduction patches. Message-Id: <20171010121134.26342-1-avi@scylladb.com>	2017-10-10 13:21:46 +01:00
Tomasz Grabiec	46c7e06e56	locator: Optimize token_metadata::is_member() Currently it's linear in the number of tokens in the system in the worst case. We could use the knowledge which _topology has to make it O(1). Fixes #2873. Message-Id: <1507630182-13410-1-git-send-email-tgrabiec@scylladb.com>	2017-10-10 14:27:54 +03:00
Tomasz Grabiec	44faaafc29	cache_streamed_mutation: Read static row with cache region locked _snp->static_row() allocates and needs reference stability. Message-Id: <1507555031-11567-1-git-send-email-tgrabiec@scylladb.com>	2017-10-09 15:55:53 +01:00
Avi Kivity	8d81ec92f6	gdb: adjust 'scylla memory' command for fallback small pools Seastar small pools can now fall back to smaller spans. Adjust the 'scylla memory' command accordingly. Message-Id: <20171005123935.13503-1-avi@scylladb.com>	2017-10-09 11:44:03 +02:00
Botond Dénes	dead2617ce	mp_row_consumer: remove unnecessary _reasource_tracker member Leftovers from `a43901f84`. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <88237d9cd97feeca47e12ec4af89c90f1a3a6bb5.1507535176.git.bdenes@scylladb.com>	2017-10-09 10:59:40 +03:00
Botond Dénes	af083d6507	Merge mutation_reader related test cases into mutation_reader_test The following tests were merged: * combined_mutation_reader_test * restricted_reader_test Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <db6b5b3c2d30cfaa720fff07c859649a180cff95.1507299293.git.bdenes@scylladb.com>	2017-10-08 17:33:55 +03:00
Avi Kivity	fd1d35d4af	Update seastar submodule * seastar c62bbf9...8babd1f (9): > Enhanced support for Travis CI: build with and without DPDK support, use various compilers (GCC 5/6/7) > backtrace: Allow whitespace after the backtrace addresses > test.py: fix typo in noncopyable_function_test > utils: introduce noncopyable_function > Revert "utils: introduce noncopyable_function" > utils: introduce noncopyable_function > Add seastar-addr2line helper script to decode backtraces > execution_stage: pass scheduling_group to constructor > reactor: preempt tasks when a signal is received	2017-10-08 16:36:10 +03:00
Avi Kivity	98e69482bf	Merge "Add support for CAST AS functions" from Daniel "This series implements CAST AS functions in scylla. It allows expressions of the form CAST(x AS type) to be used in select statements. The primary motivation for these functions came from aggregate functions, because the function avg(.) gives rounded results for integer columns. Now it is possible to convert such a column to float/double and obtain floating point results: SELECT ... avg(cast(x as double)), ... Fixes #2280." * 'danfiala/2280-patch-series-v2' of https://github.com/hagrid-the-developer/scylla: tests: Add test for CAST AS functions. cql3: Add support for CAST AS functions to ANTLR grammar. cql3/selectable: Add selectable::with_cast for CAST AS functions. cql3/functions: Add support for CAST AS functions. types:: Add support for CAST AS functions. types: Moved code that implements conversion of types' values to string.	2017-10-08 12:55:07 +03:00
Botond Dénes	046a1f9b05	sstables: Get rid of [[deprecated]] index_reader::get_index_entries() Change test code (the only consumers) to read index by partitions. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <b6111e92b5e0729bfa2e76fd848215804174067a.1507297154.git.bdenes@scylladb.com>	2017-10-08 12:18:52 +03:00
Daniel Fiala	9e11bfe8fa	tests: Add test for CAST AS functions. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-10-07 21:05:53 +02:00
Daniel Fiala	4dd504b9ac	cql3: Add support for CAST AS functions to ANTLR grammar. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-10-07 21:04:40 +02:00
Daniel Fiala	7fe653f08c	cql3/selectable: Add selectable::with_cast for CAST AS functions. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-10-07 21:04:40 +02:00
Daniel Fiala	ca092a0b7d	cql3/functions: Add support for CAST AS functions. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-10-07 21:04:40 +02:00
Daniel Fiala	61570e4a73	types:: Add support for CAST AS functions. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-10-07 21:04:40 +02:00
Daniel Fiala	e2c0a57ecf	types: Moved code that implements conversion of types' values to string. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2017-10-07 21:04:40 +02:00
Botond Dénes	a43901f842	row_consumer: de-virtualize io_priority() and resource_tracker() Fixes #2830 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <448a1f739ab8c88a7a5562bce8dce5ae6efdf934.1507302530.git.bdenes@scylladb.com>	2017-10-06 18:50:12 +01:00
Botond Dénes	d2b294dc06	loading_cache: prepend this-> to method calls on captured this To make gcc 6.3 happy. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <849402e20a1ffa6f603eff4fe295981a94b9ca79.1507282527.git.bdenes@scylladb.com>	2017-10-06 12:09:34 +02:00
Vlad Zolotarov	bc9d17963f	test.py: add loading_cache_test Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1507137724-2408-3-git-send-email-vladz@scylladb.com>	2017-10-05 15:30:07 +01:00
Vlad Zolotarov	1394e781be	utils + cql3: use a functor class instead of std::function Define value_extractor_fn as a functor class instead of std::function. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1507137724-2408-2-git-send-email-vladz@scylladb.com>	2017-10-05 15:29:51 +01:00
Duarte Nunes	a011eb72c2	Merge branch 'CQL secondary index backing views' from Pekka "This patch series adds backing materialized view for secondary indices. When a new index is created with the 'CREATE INDEX' statement, a backing materialized view is created automatically. For example, assuming the following table: CREATE TABLE ks1.users ( userid uuid, email text, PRIMARY KEY (userid) ); When the following index is created: CREATE INDEX user_email ON ks1.users (email); The following materialized view is also created: cqlsh> DESCRIBE ks1.users; <snip> CREATE MATERIALIZED VIEW ks1.user_email_index AS SELECT email, userid FROM ks1.users WHERE email IS NOT NULL PRIMARY KEY (email, userid) WITH CLUSTERING ORDER BY (userid ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE'; CQL queries will use the backing materialized view as part of queries on indexed columns to fetch the primary keys." 
* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev: schema_tables: Create backing view for indices database: Kill obsolete secondary index manager stub cql3: Wire up secondary index manager cql3/restrictions: Add term_slice::is_supported_by() function index: Add secondary_index_manager::create_view_for_index() index: Add target_parser::parse() helper cql3/statements: Add index_target::from_sstring() helper index: Add secondary_index_manager::get_dependent_indices() index: Add secondary_index_manager::reload() index: Add secondary_index_manager::list_indexes() index: Add index class index: Pass column_family to secondary_index_manager constructor database: Make secondary index manager per-column family	2017-10-05 12:08:14 +01:00
Pekka Enberg	4045e1ec09	schema_tables: Create backing view for indices This patch wires calls to secondary index manager reload() in merge_tables_and_views() and changes make_update_indices_mutations() to also create mutations for the backing materialized view. After this patch, "CREATE INDEX" CQL statement also creates a materialized view.	2017-10-05 10:07:44 +03:00
Pekka Enberg	5d30ad5e1a	database: Kill obsolete secondary index manager stub	2017-10-05 10:07:44 +03:00
Pekka Enberg	3a27f2e812	cql3: Wire up secondary index manager	2017-10-05 10:07:44 +03:00
Pekka Enberg	feae924c8c	cql3/restrictions: Add term_slice::is_supported_by() function	2017-10-05 10:07:44 +03:00
Pekka Enberg	ed4c96c025	index: Add secondary_index_manager::create_view_for_index() This patch adds a create_view_for_index() function, which creates a view_ptr for index_metadata.	2017-10-05 10:07:44 +03:00
Pekka Enberg	a809ea902e	index: Add target_parser::parse() helper	2017-10-05 10:07:44 +03:00
Pekka Enberg	9f07af8224	cql3/statements: Add index_target::from_sstring() helper	2017-10-05 10:07:44 +03:00
Pekka Enberg	50943ce592	index: Add secondary_index_manager::get_dependent_indices()	2017-10-05 10:07:44 +03:00
Glauber Costa	189ef02596	storage_proxy: change reporting of estimated histograms We are currently collapsing the histograms in 16 points, exponentially increasing in value, starting from 1. While reducing the number of points is a worthy goal, the current configuration caps us at 4ms. Our latencies tend to be higher than this. Starting from 1 is also a bit of an exaggeration: rarely are our latencies in that range. This patch changes reporting so that we report 20 points starting from 32. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-10-04 20:01:15 -04:00
Glauber Costa	fc4416abcc	estimated_histogram: bring histogram closer to what prometheus expects. Histograms are a native prometheus type, and there are many functions available that operate on them. There is extensive documentation about them at https://prometheus.io/docs/practices/histograms/ One example is the function histogram_quantile(), that can extract useful quantiles from the histograms. Currently, those functions don't work well. The reasons are twofold: 1) We are only exporting 16 metrics, starting from 1usec. That means that the highest latency we can differentiate is 4ms. After that, everything falls into the same bin. 2) The format that prometheus expects is that each bin will contain the total number of points seen up until that bin, while we currently export the total number of points that fall between bins. IOW, it is a cumulative histogram. About point two, granted it is a bit hidden in their website, but it is there. The following phrase about a caveat makes it clear: "Note that we divide the sum of both buckets. The reason is that the histogram buckets are cumulative. The le="0.3" bucket is also contained in the le="1.2" bucket; dividing it by 2 corrects for that." It is also not needed to accumulate things that fall over the last bin: the _count component of the histogram will already account for that. This patch changes the histogram format to be more in line with what prometheus expects. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-10-04 20:01:13 -04:00
Avi Kivity	ab65b42bb6	size_estimates: remove ambiguity in call to std::ref() The call to std::ref() is not namespace-qualified, and so can conflict with seastar::ref(). Fix by naming std::ref() explicitly. Message-Id: <20171004155250.4960-1-avi@scylladb.com>	2017-10-04 18:31:40 +02:00
Duarte Nunes	953888d0d0	Merge "Auth: pluggable auth + "transitional" auth-objects" from Calle "Makes authorizer/authenticator actually pluggable (by class name) and adds a "Transitional" type for both, conforming to the DSE definition of the types. The idea is to allow a rolling upgrade of a cluster to authentication op by first making all clients provide credentials (ignored by non-auth), then node by node enable auth with transitional handlers, then ensure user DB is populated and distributed, and finally rollingly enable strict auth for each node. Pfew." Fixes #2836 * auth: Transitional auth wrappers auth: Make authenticator/authorizer use actual name based lookup	2017-10-04 12:46:16 +02:00
Calle Wilund	611e00646b	auth: Transitional auth wrappers Similar to DSE objects with similar name. Basically ignores all authentication/authorization except "superuser" login. All others sessions are treated as anonymous. Note: like DSE counterparts, a client session must still _use_ authentication to be able to connect, even though the actual content of the auth is mostly ignored.	2017-10-04 12:44:44 +02:00
Calle Wilund	b96a7ae656	auth: Make authenticator/authorizer use actual name based lookup Allowing for pluggable auth objects. Note: requires "class_registrator: Fix qualified name matching + provider helpers" patch previously sent.	2017-10-04 12:44:44 +02:00
Calle Wilund	801ee44cb8	class_registrator: Fix qualified name matching + provider helpers Should not assume namespace "org", nor should we allow "loose" substring matching.	2017-10-04 12:43:42 +02:00
Calle Wilund	3c509e0333	class_registrator: Allow different return types Allows registry to give back, for example, shared_ptr etc instead of solely unique_ptr. If a registry is defined with seastar/std shared/lw_shared/unique_ptr as "BaseType", the type will assume this is the intended result type.	2017-10-04 12:43:42 +02:00
Avi Kivity	bdbbfe9390	Merge "Make restricting_mutation_reader more accurate" from Botond "Currently restricting_mutation_reader restricts mutation_readers on a count basis. This is inaccurate on multiple levels. The reader might be a combined_mutation_reader, which might be composed of multiple individual readers, whose number might change during the lifetime of the reader. The memory consumption of the readers can vary and may change during the lifetime of the reader as well. To remedy this, make the restriction memory-consumption based. The restricting semaphore is now configured with the amount of memory (bytes) that its readers are allowed to consume in total. New readers consume 128k units up-front to account for read-ahead buffers, and then consume additional units for any buffer (returned from input_stream<>::read()) they keep around. Like before, readers already allowed to read will not be blocked, instead new readers will be blocked on their first read if all the units are consumed. Fixes #2692." * 'bdenes/restricting_mutation_reader-v5' of https://github.com/denesb/scylla: Update reader restriction related metrics Add restricted_reader_test unit test restricted_mutation_reader: restrict based-on memory consumption mutation_reader.hh: Move restricted_reader related code	2017-10-04 12:43:58 +03:00
Paweł Dziepak	fdfa6703c3	Merge "loading_shared_values and size limited and evicting prepared statements cache" from Vlad " The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map). It turned out that we already have the classes that do almost what I needed: - utils::loading_cache - sstables::shared_index_lists Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the new hinted handoff code. This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition of an entry to the set (PATCH1). Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3). PATCH4 optimizes the loading_cache for the long timer period use case. But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been used long enough. This seems like a perfect match for a utils::loading_cache except that prepared statements don't need to be reloaded after they are created. Patches starting from PATCH5 are dealing with adding the missing functionality to utils::loading_cache (like making the "reloading" conditional and adding the synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements caches to utils::loading_cache. This also fixes #2474." 
* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla: tests: loading_cache_test: initial commit cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache cql3: prepared statements cache on top of loading_cache utils::loading_cache: make the size limitation more strict utils::loading_cache: added static_asserts for checking the callbacks signatures utils::loading_cache: add a bunch of standard synchronous methods utils::loading_cache: add the ability to create a cache that would not reload the values utils::loading_cache: add the ability to work with not-copy-constructable values utils::loading_cache: add EntrySize template parameter utils::loading_cache: rework on top of utils::loading_shared_values sstables::shared_index_list: use utils::loading_shared_values utils: introduce loading_shared_values	2017-10-04 09:13:32 +01:00
Daniel Fiala	1133838b9f	types: Add data_type_for for varint and decimal, data_value constructor for simple_date_type. Signed-off-by: Daniel Fiala <daniel@scylladb.com> Message-Id: <20171004044040.21631-1-daniel@scylladb.com>	2017-10-04 10:52:57 +03:00
Tomasz Grabiec	f506339582	tests: perf_fast_forward: Auto-create test directory To avoid exception due to missing directory. Message-Id: <1506081627-12933-1-git-send-email-tgrabiec@scylladb.com>	2017-10-03 15:36:37 +03:00
Botond Dénes	fea6214a0a	Update reader restriction related metrics Update description of existing reader count metrics, add memory consumption metrics. Use labels to distinguish between system, user and streaming reads related metrics.	2017-10-03 12:44:17 +03:00
Botond Dénes	3280fbc4d4	Add restricted_reader_test unit test	2017-10-03 12:44:17 +03:00
Botond Dénes	47e07b787e	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account for their own memory-consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-10-03 12:44:12 +03:00
Botond Dénes	0a07e9e7c7	mutation_reader.hh: Move restricted_reader related code In preparation of make_restricted_reader taking a mutation_source as its argument.	2017-10-03 12:39:22 +03:00
Avi Kivity	78eae8bf48	Revert "Merge "Make restricting_mutation_reader more accurate" from Botond" This reverts commit `c6e5dcc556`, reversing changes made to `19b21a0ab2`. Fails to build, plus author has more changes.	2017-10-03 11:58:59 +03:00
Pekka Enberg	641f28da02	cql3/statements: Clean up select_statement class definition We have some historical #ifdef'd code that really ought to be removed by now... Message-Id: <1507015932-8165-1-git-send-email-penberg@scylladb.com>	2017-10-03 11:17:32 +03:00
Avi Kivity	c6e5dcc556	Merge "Make restricting_mutation_reader more accurate" from Botond "Currently restricting_mutation_reader restricts mutation_readers on a count basis. This is inaccurate on multiple levels. The reader might be a combined_mutation_reader, which might be composed of multiple individual readers, whose number might change during the lifetime of the reader. The memory consumption of the readers can vary and may change during the lifetime of the reader as well. To remedy this, make the restriction memory-consumption based. The restricting semaphore is now configured with the amount of memory (bytes) that its readers are allowed to consume in total. New readers consume 128k units up-front to account for read-ahead buffers, and then consume additional units for any buffer (returned from input_stream<>::read()) they keep around. Like before, readers already allowed to read will not be blocked, instead new readers will be blocked on their first read if all the units are consumed." Fixes #2692. * 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla: Update reader restriction related metrics Add restricted_reader_test unit test restricted_mutation_reader: restrict based-on memory consumption mutation_reader.hh: Move restricted_reader related code	2017-10-03 11:15:34 +03:00
Daniel Fiala	19b21a0ab2	types: Allow 'T' as a date-time separator in timestamps. * Letter 'T' is specified in ISO 8601 and also in Cassandra documentation. Signed-off-by: Daniel Fiala <daniel@scylladb.com> Message-Id: <20171003073558.19257-1-daniel@scylladb.com>	2017-10-03 11:10:11 +03:00
Avi Kivity	3cc1c2c387	Merge seastar upstream * seastar 899fc4e...c62bbf9 (6): > Merge "CPU Scheduler for seastar" from Avi > reactor: set SCHED_FIFO policy for timer thread > future: mark future::wait() as noexcept > shared_promise: Make get_shared_future() const-qualified > Remove pessimizing and redundant std::move()-s reported by Clang-tidy utility > Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339	2017-10-02 20:47:32 +03:00
Avi Kivity	dd5ab75d04	range: add missing include Message-Id: <20171002144608.5032-1-avi@scylladb.com>	2017-10-02 16:49:24 +02:00
Avi Kivity	5ed6d1b176	dist: enable CAP_SYS_NICE Allow scylla to use SCHED_FIFO for the timer thread for more accurate scheduling. Message-Id: <20171001121500.28318-1-avi@scylladb.com>	2017-10-02 16:32:00 +02:00
Avi Kivity	dbce5158a3	Update ami submodule * dist/ami/files/scylla-ami 5ffa449...be90a3f (1): > amazon kernel: enable updates	2017-10-02 17:07:09 +03:00
Piotr Jastrzebski	83fd22face	Add test to reproduce #2854 When memtable gets flushed, existing mutation_readers created for it stop handling fast_forward_to correctly. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <f580ac59f3fcec53e7c78ad7a8b6374eb36958c6.1506690042.git.piotr@scylladb.com>	2017-09-29 15:17:53 +02:00
Piotr Jastrzebski	2583207d9d	Fix memtable scanning_reader::fast_forward_to If memtable is flushed then call fast_forward_to on _delegate. Otherwise call iterator_reader::fast_forward_to. Fixes #2854 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <6bf1c8bafce845ef945698ce4d722c3c8606e632.1506690042.git.piotr@scylladb.com>	2017-09-29 15:17:39 +02:00
Asias He	c0b965ee56	gossip: Better check for gossip stabilization on startup This is a backport of Apache CASSANDRA-9401 (2b1e6aba405002ce86d5badf4223de9751bf867d) It is better to check the number of nodes in the endpoint_state_map is not changing for gossip stabilization. Fixes #2853 Message-Id: <e9f901ac9cadf5935c9c473433dd93e9d02cb748.1506666004.git.asias@scylladb.com>	2017-09-29 08:57:25 +02:00
Tomasz Grabiec	d75f243a8b	Update seastar submodule Fixes #2770. Fixes #2819. * seastar 92fdce2...899fc4e (14): > scollectd: increment the metadata iterator with the values > Enable Travis CI builds for Seastar. > tests: Fix httpd test compilation error caused by unconditionally explicit tuple constructor in GCC5: scylladb/seastar#326 > core::shared_future: add available() and failed() methods > rpc: make sure that _write_buf stream is always properly closed > log: Fail on attempt to register logger with the same name twice > Merge "Make backtraces useful on ASLR-enabled machines as well" from Botond > reactor: add option to bypass fsync > future-util: modernize do_until() implementation > future-util: fix do_until() API to not have forwarding references > input_stream: add rvalue variant of input_stream::consume() > logger: remove extra spaces after timestamp > tutorial: lifetime management > Fix broken link for fsqual failure message	2017-09-28 15:27:34 +02:00
Piotr Jastrzebski	6069bab755	Cache single queries to non-existing partitions This way we don't need to query sstables again when the query is repeated. Fixes #1533 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <8f8559ff19c534dbbb7c9ef6c28271cec607ba20.1506521461.git.piotr@scylladb.com>	2017-09-27 16:15:18 +02:00
Tomasz Grabiec	b704710954	migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled We don't pull schema during rolling upgrade, that is until schema_tables_v3 feature is enabled on all nodes. Because features are enabled from gossiper timer, there is a race between feature enablement and processing of endpoint states which may trigger schema pull. It can happen that we first try to pull, but only later enable the feature. In that case the schema pull will not happen until the next schema change. The fix is to ensure that pulls abandoned due to feature not being enabled will be retried when it is enabled. Fixes sporadic failure in dtest: repair_additional_test.py:RepairAdditionalTest.repair_schema_test Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>	2017-09-27 12:00:07 +01:00
Tomasz Grabiec	7a58fb5767	gossiper: Allow waiting for feature to be enabled Message-Id: <1506428715-8182-1-git-send-email-tgrabiec@scylladb.com>	2017-09-27 11:57:06 +01:00
Raphael S. Carvalho	63eb9f61c0	db: use correct dirty memory manager for system column families Dirty memory manager for non-system column families was being used when applying mutations to system cfs. That previously led to deadlock when updating history. Basically, write disable waits on compaction, and compaction waits on a write that would release dirty memory for updating compaction history. Only using the correct dirty manager wouldn't solve this problem if write is disabled for system cf, but the problem is completely solved in addition to previous change which updates history outside the sstable lock. Refs #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>	2017-09-26 19:51:31 +02:00
Raphael S. Carvalho	e34c1db642	db: update compaction history outside the sstable write lock The reason to do that is because compaction can deadlock if refresh disables write which waits for compaction, and compaction in turn waits for dirty memory[1] that would be released by memtable write. Dirty memory manager for non-system cfs was being used for system cfs, which was useful for exposing this problem. [1]: when updating compaction history. Fixes #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>	2017-09-26 19:51:12 +02:00
Asias He	4b1034b9cd	storage_service: Remove the stream_hints Our hinted handoff implementation will not use the db::system_keyspace::HINTS system table to store hints. No need to stream them. Acked-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <3b9190e250b54321ceb87767f4722c7458d41797.1506391500.git.asias@scylladb.com>	2017-09-26 19:05:21 +03:00
Paweł Dziepak	af1976bc30	Merge "Fix cache reader skipping rows in some cases" from Tomasz "Fixes the problem of concurrent populations of clustering row ranges leading to some readers skipping over some of the rows. Spotted during code review. Fixes #2834." * tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev: tests: mvcc: Add test for partition_snapshot_row_cursor tests: row_cache: Add test for concurrent population tests: row_cache: Make populate_range() accept partition_range tests: Add simple_schema::make_ckey_range() cache_streamed_mutation: Add missing _next_row.maybe_refresh() call mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid() mvcc: Keep track of all iterators in partition_snapshot_row_cursor mvcc: Make partition_snapshot_row_cursor printable	2017-09-26 15:09:58 +01:00
Tomasz Grabiec	3eb251e3a4	tests: perf_fast_forward: Fail if ran with more than one shard The test reads only from local shard, if ran with more shards, current shard will miss some of the data. Message-Id: <1506081609-12811-1-git-send-email-tgrabiec@scylladb.com>	2017-09-26 15:23:10 +03:00
Calle Wilund	dd2b8821a4	everywhere_strategy: Make get_natural_endpoints handle non-init state Make get_natural_endpoints return local address iff token metadata is not yet setup (since that is the one address we already know of). If a request has a consistency level requiring more endpoints, it will still fail, but for calls with, for example, CL=ONE, at startup we will succeed, and more or less act like local strategy. Yet, further down the line, have data distributed as desired. Acked-by: Gleb Natapov <gleb@scylladb.com> Message-Id: <20170926113512.15707-1-calle@scylladb.com>	2017-09-26 15:21:30 +03:00
Asias He	98e9049820	gossip: Print SCHEMA_TABLES_VERSION correctly Found this when debugging gossip with debug print. The application state SCHEMA_TABLES_VERSION was printed as UNKNOWN. Message-Id: <d7616920d2e6516b5470a758bcf9c88f3d857381.1506391495.git.asias@scylladb.com>	2017-09-26 08:38:28 +02:00
Tomasz Grabiec	e5e9886014	tests: mvcc: Add test for partition_snapshot_row_cursor	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	e4adc9c600	tests: row_cache: Add test for concurrent population	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	a3fb7ce660	tests: row_cache: Make populate_range() accept partition_range	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	dd7af02251	tests: Add simple_schema::make_ckey_range()	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	e83cd508f6	cache_streamed_mutation: Add missing _next_row.maybe_refresh() call We were checking if the cursor is up_to_date(), but this is not enough to guarantee that the cursor is valid, merely that its iterators are valid. The cursor may be invalidated even if its iterators are valid if there was an insertion after cursor's position. Fixes #2834.	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	2f8d91043d	mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position The cursor maintains a heap of iterators in all versions. If rows were inserted before the latest version's iterator, cursor would not see them. Fix by redoing the lookup for iterators not in the current row in maybe_refresh(). Refs #2834.	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	09d99b0358	mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	4ee11641c0	mvcc: Keep track of all iterators in partition_snapshot_row_cursor Will be needed when updating the iterator for latest version. Before this change, such iterator could be neither in _current_row nor in _heap. Besides that, this will allow user to always access the iterator of latest version, which enables some optimizations in the future of avoiding unnecessary lookups. get_iterator_in_latest_version() is now always valid.	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	a8cbd34dde	mvcc: Make partition_snapshot_row_cursor printable	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	8e46d15f91	storage_service: Register features before joining Since commit `8378fe190`, we disable schema sync in a mixed cluster. The detection is done using gossiper features. We need to make sure the features are registered, and thus can be enabled, before the bootstrapping of a non-seed node happens. Otherwise the bootstrap will hang waiting on schema sync which will not happen. Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>	2017-09-25 09:13:02 +01:00
Tomasz Grabiec	b92dcb0284	storage_service: Extract register_features() Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>	2017-09-25 09:12:46 +01:00
Tomasz Grabiec	d11d696072	tests: mutation_source_tests: Fix use-after-scope on partition range Message-Id: <1506096881-3076-1-git-send-email-tgrabiec@scylladb.com>	2017-09-22 19:13:47 +02:00
Botond Dénes	015ac042a8	combined_mutation_reader_test: remove unneeded includes Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <a388efa6fc93049f4d69c049764cc9225a04bce4.1506098363.git.bdenes@scylladb.com>	2017-09-22 18:45:04 +02:00
Botond Dénes	a7984a9908	combined_mutation_reader_test: remove leftover debug logging Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <96e61fcd2543ec84921f1b2188d7248e55e7efe0.1506097635.git.bdenes@scylladb.com>	2017-09-22 18:44:47 +02:00
Tomasz Grabiec	5def901a92	sstables: Don't register logger with the same name twice There can be one logger with given name. This was causing --logger-log-level sstable=trace to not work for the majority of log points. Message-Id: <1505902259-4561-1-git-send-email-tgrabiec@scylladb.com>	2017-09-20 16:40:06 +03:00
Piotr Jastrzebski	98c359d7de	Introduce a position for partition start This position will be used for mutation fragment that represents the start of a partition. This position sorts before static row. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-09-20 11:34:03 +02:00
Piotr Jastrzebski	e1f7d1f25d	streamed_mutation: Extract concepts using GCC6_CONCEPT macro It makes it easier to actually use those concepts. Lambdas passed to mutation_fragment::visit have to declare return type otherwise compiler fails with: internal compiler error: Segmentation fault Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-09-20 11:34:03 +02:00
Tomasz Grabiec	02d41864af	Merge "Fix miss opportunity to update gossiper features" from Asias The gossiper checks if features should be enabled from its timer callback when it detects that endpoint_state_map changed, that is different than shadow_endpoint_state_map. shadow_endpoint_state_map is also assigned from endpoint_state_map in storage_service::replicate_tm_and_ep_map(), called from storage_service::on_change() Call gossiper::maybe_enable_features() in replicate_tm_and_ep_map so that we won't miss gossip feature update. Fixes #2824 * git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1: gossip: Move the _features_condvar signal code to maybe_enable_features gossip: Make maybe_enable_features public storage_service: Check gossip feature update in replicate_tm_and_ep_map	2017-09-20 11:16:37 +02:00
Asias He	ebc3bada12	storage_service: Check gossip feature update in replicate_tm_and_ep_map This is another place we can update endpoint_state_map in addition to gossiper::run(). Call gossiper::maybe_enable_features() so that we won't miss gossip feature update.	2017-09-20 16:58:33 +08:00
Asias He	6022b7423a	gossip: Make maybe_enable_features public It will be needed by storage_service.	2017-09-20 16:58:33 +08:00
Asias He	68c7a391b5	gossip: Move the _features_condvar signal code to maybe_enable_features It is easier to call to features update logic outside gossiper.	2017-09-20 16:58:32 +08:00
Asias He	173cba67ba	storage_service: Remove rpc client on all shards in on_dead We should close connections to nodes that are down on all shards instead of the shard which runs the on_dead gossip callback. Found by Gleb. Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>	2017-09-20 10:23:31 +02:00
Botond Dénes	43dba8f173	Update reader restriction related metrics Update description of existing reader count metrics, add memory consumption metrics.	2017-09-20 11:16:21 +03:00
Botond Dénes	b2db29dc65	Add restricted_reader_test unit test	2017-09-20 11:15:45 +03:00
Botond Dénes	33e97e7457	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account for their own memory-consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-09-20 11:14:35 +03:00
Botond Dénes	e4a9e55e0d	mutation_reader.hh: Move restricted_reader related code In preparation of make_restricted_reader taking a mutation_source as its argument.	2017-09-20 11:12:57 +03:00
Tomasz Grabiec	741ec61269	streaming: Fix streaming not streaming all ranges It skipped one sub-range in each batch of 10 ranges, and tried to access the range vector using end() iterator. Fixes sporadic failures of update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test. Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>	2017-09-20 10:33:59 +03:00
Avi Kivity	5b0cb28af9	Merge "row_cache: Call fast_forward_to() outside allocating section" from Tomasz "On bad_alloc the section is retried. If the exception happened inside fast_forward_to() on the underlying reader, that call will be retried. However, the reader should not be used after exception is thrown, since it is in unspecified state. Also, calling fast_forward_to() with cache region locked increases the chances of it failing to allocate. We shouldn't call fast_forward_to() with the cache region locked. Fixes #2791." * 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev: cache_streamed_mutation: De-futurize cursor movement cache_streamed_mutation: Call fast_forward_to() outside allocating section cache_streamed_mutation: Switch from flags to explicit state machine	2017-09-19 17:11:22 +03:00
Botond Dénes	96c6d54a5c	incremental_reader_selector: Remove unnecessary check for duplicated next_token The next_token will never be the same as the current _selector_position, unless they are both maximum_token, which is already handled. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <9c54ae07a18d201185027c9b533bcb5256bead8a.1505826102.git.bdenes@scylladb.com>	2017-09-19 16:42:02 +03:00
Avi Kivity	12c393dd16	build: default to gold linker Better, faster. Message-Id: <20170919115737.12084-1-avi@scylladb.com>	2017-09-19 14:02:31 +02:00
Avi Kivity	a31ade54e0	streamed_mutation: optimize merge_mutations() if only one mutation If we read a partition from a single sstable (a fairly common case), we can bypass mutation_merger and just return the input. Message-Id: <20170918181418.14021-1-avi@scylladb.com>	2017-09-19 11:00:59 +01:00
Botond Dénes	8cb953b58b	incremental_reader_selector: don't create readers unconditionally on ff When fast-forwarding check that the new position is past the selector before attempting to create new readers. Also don't clear the set of already created readers and don't overwrite the selector position. Fixes #2807 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <514f69005eb29c2a3359f098d40abf588900b76f.1505811064.git.bdenes@scylladb.com>	2017-09-19 11:27:47 +02:00
Asias He	8f8273969d	gossip: Do not wait for echo message in mark_alive gossiper::apply_state_locally() calls handle_major_state_change() for each endpoint, in a seastar thread, which calls mark_alive() for new nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously waits for each node to respond before it moves on to the next entry. As a result it may take a while before whole state is processed. Apache (tm) Cassandra (tm) sends echos in the background. In a large cluster, we see that at the time the joining node starts streaming, it hasn't managed to apply all the endpoint_state for peer nodes, so the joining node does not know some of the nodes yet, and as a result it does not stream from some of the existing nodes. Fixes #2787 Fixes #2797 Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>	2017-09-19 10:49:00 +03:00
Raphael S. Carvalho	1524426deb	sstables: Fix compaction correctness of higher-level tables When incremental_reader_selector is used for compaction, it will first call incremental selector of partitioned sstable set with minimum token that will result in first interval being skipped, which means not everything being compacted. The interval is skipped because iterator is incorrectly advanced when token lies before it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918021446.15920-1-raphaelsc@scylladb.com>	2017-09-19 09:59:30 +03:00
Avi Kivity	55e0b63e65	storage_proxy: scan more nodes exponentially to achieve target result set size The current sequential scan can take a long time on a small or empty table with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to exponential scan reduces the time. Fixes #1230. Message-Id: <20170912173803.8277-1-avi@scylladb.com>	2017-09-18 15:15:15 +02:00
Avi Kivity	e44517851e	untyped_result_set: reduce dependencies Forward-declare untyped_result_set and untyped_result_set_row, and remove the include from query_processor.hh. Message-Id: <20170916170859.27612-3-avi@scylladb.com>	2017-09-18 15:15:15 +02:00
Avi Kivity	0317746822	untyped_result_set: make untyped_result_set::row a namespace scope class Makes it possible to forward-declare, with the aim of reducing dependencies. Message-Id: <20170916170859.27612-2-avi@scylladb.com>	2017-09-18 15:15:15 +02:00
Pekka Enberg	9ebd8be82b	index: Add secondary_index_manager::reload() This patch adds a reload() function, which updates the secondary index manager index map to match underlying column family indices.	2017-09-18 14:31:35 +03:00
Duarte Nunes	16d2e4e81b	Merge 'reduce sstables.hh coupling' from Glauber "sstables.hh is already too big, and it is soon to become bigger with the inclusion of the read_monitor, to pair it with the write_monitor. It's a good opportunity for us to reduce sstables.hh dependencies by moving the write monitor to its own header. One obvious caller is already changed so we don't need to include sstables.hh anymore." * 'progress-monitor' of https://github.com/glommer/scylla: sstables: do not include sstables.hh from memtable glue sstables: move write_monitor to its own header	2017-09-18 13:31:32 +02:00
Pekka Enberg	2ae6b141e5	index: Add secondary_index_manager::list_indexes()	2017-09-18 14:27:35 +03:00
Avi Kivity	a2f26f7b29	log_histogram: rename to log_heap log_histogram is not really a histogram, it is a heap-like container. Rename to log_heap in case we do want a log_histogram one day. Message-Id: <20170916172137.30941-1-avi@scylladb.com>	2017-09-18 12:44:05 +02:00
Amnon Heiman	8d668a9dc0	API: storage_service repair_async_status to return proper error code This patch changes the implementation of storage_service repair_async_status to throw an exception, so that a 400 return code is returned. Fixes #2794 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170917080533.6612-1-amnon@scylladb.com>	2017-09-18 09:08:26 +03:00
Vlad Zolotarov	cea15486c4	tests: loading_cache_test: initial commit Test utils::loading_shared_values and utils::loading_cache. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 22:19:15 -04:00
Vlad Zolotarov	66568be969	cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache - Transition the prepared statements caches for both CQL and Thrift to the cql3::prepared_statements_cache class. - Add the corresponding metrics to the query_processor: - Evictions count. - Current entries count. - Current memory footprint. Fixes #2474 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 22:19:15 -04:00
Vlad Zolotarov	8f912b46b1	cql3: prepared statements cache on top of loading_cache This is a template class that implements caching of prepared statements for a given ID type: - Each cache instance is given 1/256 of the total shard memory. If the new entry is going to overflow this memory limit - the less recently used entries are going to be evicted so that the new entry could be added. - The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size functor class that returns a number of bytes for a given prepared statement (currently returns 10000 bytes for any statement). - The cache entry is going to be evicted if not used for 60 minutes or more. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 22:19:11 -04:00
Vlad Zolotarov	9a43398d6a	utils::loading_cache: make the size limitation more strict Ensure that the size of the cache is never bigger than the "max_size". Before this patch the size of the cache could have been indefinitely bigger than the requested value during the refresh time period which is clearly an undesirable behaviour. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	4e72a56310	utils::loading_cache: added static_asserts for checking the callbacks signatures Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	a13362e74b	utils::loading_cache: add a bunch of standard synchronous methods Add a few standard synchronous methods to the cache, e.g. find(), remove_if(), etc. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	fa2f8162a5	utils::loading_cache: add the ability to create a cache that would not reload the values Sometimes we don't want the cached values to be periodically reloaded. This patch adds the ability to control this using a ReloadEnabled template parameter. In case the reloading is not needed the "loading" function is not given to the constructor but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add the corresponding get(key, loader) method in the future when needed). Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	a60a77dfc8	utils::loading_cache: add the ability to work with not-copy-constructable values Current get(...) interface restricts the cache to work only with copy-constructable values (it returns future<Tp>). To make it able to work with non-copyable values we need to introduce an interface that would return something like a reference to the cached value (like regular containers do). We can't return future<Tp&> since the caller would have to ensure somehow that the underlying value is still alive. A much safer and easier-to-use way would be to return a shared_ptr-like pointer to that value. "Luckily" for us, the value we actually store in a cache is already wrapped into the lw_shared_ptr and we may simply return an object that impersonates a smart_pointer<Tp> value while it keeps a "reference" to an object stored in the cache. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	c24d85f632	utils::loading_cache: add EntrySize template parameter Allow a variable entry size parameter. Provide an EntrySize functor that would return a size for a specific entry. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	6024014f92	utils::loading_cache: rework on top of utils::loading_shared_values Get rid of the "proprietary" solution for asynchronous values on-demand loading. Use utils::loading_shared_values instead. We would still need to maintain intrusive set and list for efficient shrink and invalidate operations but their entry is not going to contain the actual key and value anymore but rather a loading_shared_values::entry_ptr which is essentially a shared pointer to a key-value pair value. In general, we added another level of dereferencing in order to get the key value but since we use the bi::store_hash<true> in the hook and the bi::compare_hash<true> in the bi::unordered_set this should not translate into an additional set lookup latency. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	d56684b1a5	sstables::shared_index_list: use utils::loading_shared_values Since utils::loading_shared_values API is based on the original shared_index_list this change is mostly a drop-in replacement of the corresponding parts. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:11 -04:00
Vlad Zolotarov	ec3fed5c4d	utils: introduce loading_shared_values This class implements a key-value container that is populated using the provided asynchronous callback. The value is loaded when there are active references to the value for the given key. Container ensures that only one entry is loaded per key at any given time. The returned value is a lw_shared_ptr to the actual value. The value for a specific key is immediately evicted when there are no more references to it. The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed every time a new value is added (asynchronously loaded). The container has a rehash() method that would grow or shrink the container as needed in order to get the load factor into the [0.25, 0.75] range. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-09-15 20:53:06 -04:00
Glauber Costa	2227ae3f19	sstables: do not include sstables.hh from memtable glue There is no need to include the whole sstables.hh file in memtable-sstable.hh anymore. All we need is the shared_sstable definition and the progress monitor. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-09-15 14:16:35 -04:00
Glauber Costa	51829f528d	sstables: move write_monitor to its own header Soon I am about to introduce a read monitor, and pairing infrastructure to manage it. Having it all living in sstables.hh forces us to include it every time, even in places that don't really need it. Move to its own header. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-09-15 14:09:07 -04:00
Tomasz Grabiec	804722b6c8	tests: perf_fast_forward: Fix use-after-scope on partition range Message-Id: <1505489249-16806-1-git-send-email-tgrabiec@scylladb.com>	2017-09-15 16:34:41 +01:00
Tomasz Grabiec	7b5b461067	cache_streamed_mutation: De-futurize cursor movement start_reading_from_underlying() doesn't return future<> any more, so we can simplify this.	2017-09-15 15:41:55 +02:00
Tomasz Grabiec	22019577cc	cache_streamed_mutation: Call fast_forward_to() outside allocating section On bad_alloc the section is retried. If the exception happened inside fast_forward_to() on the underlying reader, that call will be retried. However, the reader should not be used after exception is thrown, since it is in unspecified state. Also, calling fast_forward_to() with cache region locked increases the chances of it failing to allocate. We shouldn't call fast_forward_to() with the cache region locked. Fixes #2791.	2017-09-15 15:41:55 +02:00
Tomasz Grabiec	3b790a1e80	cache_streamed_mutation: Switch from flags to explicit state machine We're in one state at a time, so it's better to express it as a single variable rather than N independent flags. In preparation before adding more states.	2017-09-15 15:41:55 +02:00
Glauber Costa	eb93d5f8ad	database: pass a monitor as a parameter to memtable writer Right now we pass a permit to the memtable writer and that permit is used inside write_memtable_to_sstable to compose a write_monitor. We would like to extend the write_monitor to include other things, that right now are not available as parameters to write_memtable_to_sstable - and which are possibly too specialized to be. The solution for that is to pass the write_monitor instead of the permit to the writer. Conceptually, that also makes sense because the write_monitor is something the sstable writer is aware of. Permits, on the other hand, are a database concept that is alien to the sstable writer. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170915032836.21154-1-glauber@scylladb.com>	2017-09-15 12:26:56 +02:00
Duarte Nunes	8378fe190a	Merge 'Fix schema version mismatch during rolling upgrade from 1.7' from Tomasz "When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema for some reason, reads or writes which involve both 1.7 and 2.0 nodes may start to fail with the following error logged: storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d The situation should heal after whole cluster is upgraded. Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes due to change in the schema tables format. Mismatch is meant to be avoided by having 2.0 nodes calculate the old digest on schema migration during upgrade, and use that version until next time the table is altered. It is thus not allowed to alter tables during the rolling upgrade. Two 2.0 nodes may exchange schema, if they detect through gossip that their schema versions don't match. They may not match temporarily during boot, until the upgraded node completes the bootstrap and propagates its new schema through gossip. One source of such temporary mismatch is construction of new tracing tables, which didn't exist on 1.7. Such schema pull will result in a schema merge, which cause all tables to be altered and their schema version to be recalculated. The new schema will not match the one used by 1.7 nodes, causing reads and writes to fail, because schema requesting won't work during rolling upgrade from 1.7 to 2.0. The main fix employed here is to hold schema pulls, even among 2.0 nodes, until rolling upgrade is complete." 
* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev: tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case tests: cql_test_env: Enable all features in tests schema_tables: Make make_scylla_tables_mutation() visible migration_manager: Disable pulls during rolling upgrade from 1.7 storage_service: Introduce SCHEMA_TABLES_V3 feature schema_tables: Don't alter tables which differ only in version schema_mutations: Use mutation_opt instead of stdx::optional<mutation>	2017-09-15 10:27:47 +02:00
Tomasz Grabiec	c657eec4cf	tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	f0fdf75e7c	tests: cql_test_env: Enable all features in tests	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	571cac95ed	schema_tables: Make make_scylla_tables_mutation() visible For tests.	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	5a92c18e63	migration_manager: Disable pulls during rolling upgrade from 1.7 If there is a schema pull between two 2.0 nodes during a rolling upgrade, then schema merge will delete the persisted schema version. When the node loads that table again, e.g. on restart, it will generate a version which is different from the one which 1.7 nodes use. This will cause reads and writes to fail. To avoid this, disable pulls until all nodes are upgraded. Fixes #2802.	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	713d75fd51	storage_service: Introduce SCHEMA_TABLES_V3 feature	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	f943d2efbf	schema_tables: Don't alter tables which differ only in version We apply deletion of scylla_tables.version to the incoming schema mutations so that table schema version is recalculated after merge. The mutations which we read from local schema tables may not have it deleted in which case all tables would be considered as differing on the presence of the version field. Avoid this by deleting the field from old mutations as well.	2017-09-14 20:26:31 +02:00
Tomasz Grabiec	99272087e6	schema_mutations: Use mutation_opt instead of stdx::optional<mutation>	2017-09-14 20:26:31 +02:00
Takuya ASADA	7662271fc9	dist/ami: show correct message when scylla-ami-setup.service failed After the service has started, its state may become "failed", "active" or "activating". But our script does not handle scylla-ami-setup.service entering the "failed" state, and as a result the script shows the wrong message. So we handle these three states correctly. Fixes #2759 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1504589079-1986-1-git-send-email-syuu@scylladb.com>	2017-09-14 12:40:05 +03:00
Tomasz Grabiec	0911fbbdef	row_cache: Fix row_cache::update_invalidating() evict() doesn't guarantee that the whole partition is discontinuous. In particular, partition tombstone cannot be marked as discontinuous. The parts which are still continuous must be updated. Broken after `c78047fa5b`. Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>	2017-09-14 10:58:25 +03:00
Asias He	0ec574610d	locator: Get rid of assert in token_metadata In commit `69c81bcc87` (repair: Do not allow repair until node is in NORMAL status), we saw a coredump due to an assert in token_metadata::first_token_index. Throw an exception instead of aborting the whole scylla process. Message-Id: <c110645cee1ee3897e30a3ae1b7ab3f49c97412c.1504752890.git.asias@scylladb.com>	2017-09-14 10:33:02 +03:00
Gleb Natapov	31e803a36c	storage_proxy: wire up percentile speculative read properly Collect coordinator side read statistic per CF and use them in percentile speculative read executor. Getting percentile from estimated_histogram object is rather expensive, so cache it and recalculate only once per second (or if requested percentile changes). Fixes #2757 Message-Id: <20170911131752.27369-3-gleb@scylladb.com>	2017-09-14 10:31:26 +03:00
Gleb Natapov	0842faecef	estimated_histogram: fix overflow error handling Currently overflow values are stored in the incorrect bucket (the last one instead of the special "overflow" one) and the percentile() function throws if there is an overflow value. The patch fixes the code to store the overflow value in the corresponding bucket and makes percentile() take it into account instead of throwing. Message-Id: <20170911131752.27369-2-gleb@scylladb.com>	2017-09-14 10:31:21 +03:00
Asias He	5ff0b113c9	gossip: Fix indentation in apply_state_locally Message-Id: <2bdefa8d982ad8da7452b41e894f41d865b83b0b.1505356245.git.asias@scylladb.com>	2017-09-14 10:09:50 +03:00
Tomasz Grabiec	65ca8eebb8	mutation_partition: Print rows_entry's position instead of key For dummy rows, _key doesn't reflect the right position. Message-Id: <1505317040-6783-1-git-send-email-tgrabiec@scylladb.com>	2017-09-13 20:49:28 +03:00
Avi Kivity	ca8e3c4a78	Merge "Evict from partition snapshots in cache" from Tomasz "This series fixes the problem of active reads causing OOM due to the fact that partition snapshots they hold are not evictable. In particular, a single scan of a partition larger than memory will bad_alloc due to itself. After this, when partition entry is evicted from cache, data in all the snapshots is also evicted. We still don't have row-level eviction, but this series lays some grounds for it by making cache readers prepared for the possibility of rows being evicted. Fixes #2775. Fixes #2730." * tag 'tgrabiec/snapshot-evicition-in-cache-v1' of github.com:scylladb/seastar-dev: tests: Add test for partition_entry::evict() mutation_partition: Introduce range continuity checking methods mutation_partition: Enable rows_entry::compare() on position_in_partition_views tests: Extract mvcc tests to separate file tests: row_cache: Add evicition tests tests: simple_schema: Add new_tombstone() helper tests: streamed_mutation_assertions: Introduce produces(mutation&) streamed_mutation: Allow setting buffer capacity row_cache: Evict partition snapshots mvcc: Introduce partition_entry::evict() row_cache: Handle eviction in partition reader tests: row_cache_test: Don't assume mvcc snapshots are not evictable row_cache: Reuse allocation_strategy::invalidate_references() row_cache: Don't invalidate references on insertion lsa: Move reclaim counter concept to allocation_strategy level mvcc: Ensure partition_snapshot always destroys versions using proper allocator mvcc: Encapsulate reference stability check in partition_snapshot mvcc: Store LSA region reference in partition_snapshot	2017-09-13 20:48:33 +03:00
Tomasz Grabiec	a45b1ef4bc	sstables: Make atomic_deletion_manager logger static So that it's visible to the framework at boot and --logger-log-level can be used on it. Message-Id: <1505286578-21904-1-git-send-email-tgrabiec@scylladb.com>	2017-09-13 20:35:41 +03:00
Tomasz Grabiec	b8f62e86de	tests: Add test for partition_entry::evict()	2017-09-13 17:47:04 +02:00
Tomasz Grabiec	455a1b0d24	mutation_partition: Introduce range continuity checking methods	2017-09-13 17:47:04 +02:00
Tomasz Grabiec	abc489e99d	mutation_partition: Enable rows_entry::compare() on position_in_partition_views For full symmetry with existing overloads.	2017-09-13 17:47:04 +02:00
Tomasz Grabiec	d76b141b34	tests: Extract mvcc tests to separate file	2017-09-13 17:47:04 +02:00
Tomasz Grabiec	2dfb3b95a5	tests: row_cache: Add evicition tests	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	204ec9c673	tests: simple_schema: Add new_tombstone() helper	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	5b1adfa542	tests: streamed_mutation_assertions: Introduce produces(mutation&)	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	cb16b038ef	streamed_mutation: Allow setting buffer capacity Needed in tests to limit amount of prefetching done by readers, so that it's easier to test interleaving of various events.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	c78047fa5b	row_cache: Evict partition snapshots If snapshots are not evicted, they may pin an unbounded amount of memory for a long time in cache, which may lead to OOM. Evict snapshots together with the entry. Fixes #2775. Fixes #2730.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	b6ae5783cd	mvcc: Introduce partition_entry::evict() The operation frees as much memory as possible, marking affected mutation elements as discontinuous.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	fa2c26342c	row_cache: Handle eviction in partition reader	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	99aa3d1964	tests: row_cache_test: Don't assume mvcc snapshots are not evictable The test was not updating the underlying mutation source but still expecting to get the right data after calling invalidate(). If snapshots are evictable, that's not guaranteed. Apply to underlying as well, so data is read from underlying if necessary.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	adb159d51b	row_cache: Reuse allocation_strategy::invalidate_references() Modification count in the tracker is redundant, we can rely on allocator's invalidation counter.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	27a3b4bca9	row_cache: Don't invalidate references on insertion modification_count is currently only used to detect invalidation of references, intended to be incremented on erasure. Insertion into intrusive set doesn't invalidate references, so no need to increment the counter.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	87be474c19	lsa: Move reclaim counter concept to allocation_strategy level So that generic code can detect invalidation of references. Also, to allow reusing the same mechanism for signalling external reference invalidation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	4053c801e2	mvcc: Ensure partition_snapshot always destroys versions using proper allocator partition_snapshot is managed by lw_shared_ptr. Currently it is assumed that before it dies, maybe_merge_versions() is called on it, which destroys it in the right allocator context. It's not very safe. This patch improves safety by using the right allocator in snapshot's destructor.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	cda86abdbc	mvcc: Encapsulate reference stability check in partition_snapshot	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	2df6f356b1	mvcc: Store LSA region reference in partition_snapshot Will be useful for improving encapsulation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	4c920c9891	tests: cql_test_env: Use cancel_prior_atomic_deletions() This fixes a failure in view_schema_test, which starts many instances of single_node_cql_env. cancel_atomic_deletions() causes later deletions to fail, which causes some of the test cases to fail. Message-Id: <1505311250-3118-2-git-send-email-tgrabiec@scylladb.com>	2017-09-13 17:11:34 +03:00
Tomasz Grabiec	dc0860ac70	sstables: Introduce cancel_prior_atomic_deletions() Like cancel_atomic_deletions() but doesn't fail later deletions. Message-Id: <1505311250-3118-1-git-send-email-tgrabiec@scylladb.com>	2017-09-13 17:11:33 +03:00
Tomasz Grabiec	8a425cedc6	tests: cql_test_env: Cancel pending sstable deletions on shutdown Fixes a hang on shutdown with --smp 2 in perf_fast_forward. The hang is in sstables::await_background_jobs_on_all_shards(), which is waiting on sstable deletions. Not all shards agree to delete certain sstables, because e.g. not all shards decide to compact them yet. Cancel those deletes after database is stopped on all shards, like we do in main.cc Fixes #2796. Message-Id: <1505292239-26032-1-git-send-email-tgrabiec@scylladb.com>	2017-09-13 11:56:48 +03:00
Asias He	c84dcabb8f	gossip: Use boost::copy_range in apply_state_locally boost::copy_range is better because the vector is allocated with the correct size instead of growing when the inserter is called. [avi: also crashes less] Message-Id: <b19ca92d56ad070fca1e848daa67c00c024e3a4d.1505291199.git.asias@scylladb.com>	2017-09-13 11:33:15 +03:00
Tomasz Grabiec	b3a8ba5af6	gdb: Introduce "scylla find" command Finds live objects on seastar heap of current shard which contain given value. Prints results in 'scylla ptr' format. Example: (gdb) scylla find 0x600005321900 thread 1, small (size <= 512), live (0x6000000f3800 +48) thread 1, small (size <= 56), live (0x6000008a1230 +32) Message-Id: <1505284614-19577-1-git-send-email-tgrabiec@scylladb.com>	2017-09-13 11:22:23 +03:00
Asias He	fa9d47c7f3	gossip: Fix a log message typo in compare_endpoint_startup Message-Id: <c4958950e1108082b63e08ab81ee2177edc9b232.1505286843.git.asias@scylladb.com>	2017-09-13 09:54:56 +02:00
Glauber Costa	ecad1be161	compaction_strategy: add missing header compaction_strategy.hh throws an exception, but it doesn't add the exception header. It is working in-tree because of inclusion order, but it broke one of my yet-out-of-tree changes. In any case, it is best to add the headers we will need to the files, and that is what this patch does. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170912233326.26114-1-glauber@scylladb.com>	2017-09-13 08:40:15 +02:00
Raphael S. Carvalho	ef18b1162b	sstables/compaction_manager: rename and better explain reshard function The name "submit" doesn't properly describe the function. Also improve the explanation of the relationship between the function itself and its job parameter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170912032034.23043-1-raphaelsc@scylladb.com>	2017-09-12 12:25:17 +03:00
Avi Kivity	1bd207a306	sstables: merge filter.cc into sstables.cc filter.cc has just two smallish functions, which are part of the sstable class. Move them to sstables.cc where the rest of the class members are defined. Message-Id: <20170912080541.7836-1-avi@scylladb.com>	2017-09-12 10:06:52 +02:00
Tomasz Grabiec	423142ec81	tests: row_cache_test: Fix abort in debug mode The test used an apply() variant which assumed that it was invoked in a seastar thread, which is no longer the case after commit `d22fdf4`. Fix by copying outside the cache update, and use the non-deferring apply() variant for the cache update. Message-Id: <1505200142-3650-1-git-send-email-tgrabiec@scylladb.com>	2017-09-12 10:57:36 +03:00
Tomasz Grabiec	3f527e028d	Merge "Reduce dependencies on sstables.hh" from Avi This patchset reduces includes of sstables.hh, reducing compile time by both reducing the amount of code compiled, and the amount of needless recompiles caused by false dependencies. It does so by replacing lw_shared_ptr<sstable>, which requires a complete class, with a new custom type shared_sstable, which allows an incomplete sstable class definition. * https://github.com/avikivity/scylla deps2/v2.1 database: change truncate() to flush while compaction is disabled database: make run_with_compaction_disabled() a non-template database: add indirection to compaction_manager instance database: remove dependency on compaction.hh and compaction_manager.hh size_estimates_virtual_reader.hh: add missing include system_keyspace: add missing include main: add missing include storage_service: add missing include repair: add missing include compaction.hh: add missig include and forward declaration compaction_manager: add missing include shared_index_lists.hh: add missing include perf_fast_forward: add missing include sstable_mutation_test: add missing include sstables: extract version and format enum into a separate header file database.hh: add missing forward declaration for foreign_sstable_open_info cql_test_env: add forward declaration database: make column_family::disable_sstable_write() out-of-line sstables: introduce make_sstable() as a shortcut for make_lw_shared<sstable> treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable> sstables: use support for lw_shared_ptr with incomplete type for shared_sstable sstables: reduce dependencies streaming: remove unneeded includes	2017-09-12 09:56:46 +02:00
Tomasz Grabiec	ee1e7732a6	database: Create tables with continuous cache When table is created, it doesn't contain any data, so we can mark the whole data range as continuous in cache. This way reads will immediately hit, and flushes will populate. If sstables are later attached, the attaching process is supposed to invalidate affected ranges (and it does). Fixes #2536. Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>	2017-09-12 10:53:07 +03:00
Avi Kivity	85a6a2b3cb	streaming: remove unneeded includes	2017-09-12 10:43:39 +03:00
Avi Kivity	578bf55371	sstables: reduce dependencies Use shared_sstable.hh instead of sstables.hh.	2017-09-12 10:43:36 +03:00
Avi Kivity	07feaf9c4c	sstables: use support for lw_shared_ptr with incomplete type for shared_sstable Use the lw_shared_ptr deleter support to define shared_sstable without pulling the definition of class sstable, reducing compile time and dependencies if only shared_sstable is needed.	2017-09-12 10:43:05 +03:00
Avi Kivity	f7023501d6	treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable> Since shared_sstable is going to be its own type soon, we can't use the old alias.	2017-09-12 10:43:05 +03:00
Avi Kivity	1a3cdffbc1	sstables: introduce make_sstable() as a shortcut for make_lw_shared<sstable> shared_sstable will soon not be an alias for lw_shared_ptr<sstable>, so we need another factory function.	2017-09-12 10:43:05 +03:00
Avi Kivity	88b91c84a1	database: make column_family::disable_sstable_write() out-of-line Reduces dependencies.	2017-09-12 10:43:05 +03:00
Avi Kivity	02028df9b1	cql_test_env: add forward declaration Not worthwhile to add a new #include for this.	2017-09-12 10:43:05 +03:00
Avi Kivity	02e5bf1c20	database.hh: add missing forward declaration for foreign_sstable_open_info Supplied by an incidental include now, but it will be gone soon.	2017-09-12 10:43:05 +03:00
Avi Kivity	c4bafd912c	sstables: extract version and format enum into a separate header file This allows removing a dependency on sstables.hh later on.	2017-09-12 10:43:05 +03:00
Avi Kivity	5ebb15b9d4	sstable_mutation_test: add missing include	2017-09-12 10:43:05 +03:00
Avi Kivity	fdab47ab32	perf_fast_forward: add missing include	2017-09-12 10:43:05 +03:00
Avi Kivity	ca2d0b4efb	shared_index_lists.hh: add missing include	2017-09-12 10:43:05 +03:00
Avi Kivity	eb62b2c00d	compaction_manager: add missing include	2017-09-12 10:43:05 +03:00
Avi Kivity	0efa444a56	compaction.hh: add missing includes	2017-09-12 10:42:45 +03:00
Avi Kivity	7ca029c8f1	database_fwd.hh: add column_family forward declaration	2017-09-12 10:41:28 +03:00
Avi Kivity	4751402709	build: disable -fsanitize-address-use-after-scope on CqlParser.o The parser generator somehow confuses the use-after-scope sanitizer, causing it to use large amounts of stack space. Disable that sanitizer on that file. Message-Id: <20170905110628.18047-1-avi@scylladb.com>	2017-09-11 19:42:26 +02:00
Avi Kivity	43a72254ff	repair: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	aebab377d9	storage_service: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	a3b8089bd4	main: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	0aaefe665b	system_keyspace: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	d3cde2e2be	size_estimates_virtual_reader.hh: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	9b540eccb0	database: remove dependency on compaction.hh and compaction_manager.hh	2017-09-11 20:09:45 +03:00
Avi Kivity	f9c8c1ddc2	database: add indirection to compaction_manager instance Allows making it forward-declared later on, reducing dependencies.	2017-09-11 20:09:45 +03:00
Avi Kivity	9d0aaa941a	database: make run_with_compaction_disabled() a non-template Allows reducing dependencies down the line, and un-templating non-performance-critical functions is a good thing.	2017-09-11 20:09:45 +03:00
Avi Kivity	6b5514a3df	database: change truncate() to flush while compaction is disabled In preparation to make run_with_compaction_disabled() a non-template, we want to remove any non-copyable captures (so the function can be an std::function, which requires copyability). Move the flush within the compaction disabled region. This changes the behavior, but it shouldn't matter.	2017-09-11 20:09:45 +03:00
Avi Kivity	14fd4168dc	Merge seastar upstream * seastar 31b925d...92fdce2 (3): > shared_ptr: allow incomplete classes in lw_shared_ptr<> > Update DPDK to 17.05 > future: pass func as mutable to lambda arg of handle_exception[_type]	2017-09-11 20:09:04 +03:00
Tomasz Grabiec	95b3eaac97	debug: Allow running scylla_row_cache_report.stp script against a running process Message-Id: <1504776359-16424-1-git-send-email-tgrabiec@scylladb.com>	2017-09-11 14:17:30 +03:00
Avi Kivity	fe019ad84d	Merge "Refuse to load non-Scylla counter sstables" from Paweł "These patches make Scylla refuse to load counter sstables that may contain unsupported counter shards. They are recognised by the lack of the Scylla component. Fixes #2766." * tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla: db: reject non-Scylla counter sstables in flush_upload_dir db: disallow loading non-Scylla counter sstables sstable: add has_scylla_component()	2017-09-11 13:28:44 +03:00
Tzach Livyatan	83eab5c8d7	Remove comment about Too high number of concurrent compactions from scylla_compaction_manager_compactions help It should never happen and it's not clear what too high stands for Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170911085645.21222-1-tzach@scylladb.com>	2017-09-11 13:27:35 +03:00
Gleb Natapov	d0d8bdf615	storage_proxy: remove unused parameter from get_restricted_ranges() function Message-Id: <20170911084653.GH24167@scylladb.com>	2017-09-11 11:58:44 +02:00
Gleb Natapov	f66e9377d4	storage_proxy: do not keep reference to a keyspace during write A keyspace can be deleted while write is ongoing, so the object cannot be used after defer point. The keyspace reference is only used to check how many replies a write operation should wait for and this can be precalculated during write handler creation. Fixes #2777 Message-Id: <20170911084436.GG24167@scylladb.com>	2017-09-11 11:57:00 +02:00
Asias He	bb9dbc5ade	storage_service: Do not use c_str() in the logger Use logger.info("{}", msg) instead. Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>	2017-09-10 18:10:24 +03:00
Botond Dénes	9ebeb9d5ce	Fix -Wreturn-type warnings in tests: use abort() instead of assert(0) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <95927f933411302e84d57d169ee0147def7bc643.1504890922.git.bdenes@scylladb.com>	2017-09-10 17:09:53 +03:00
Gleb Natapov	9137446109	api: uses correct statistics for storage proxy range histograms. Message-Id: <20170910073458.GB1870@scylladb.com>	2017-09-10 16:18:36 +03:00
Pekka Enberg	d2632ddf1d	Merge "gossip: optimize apply_state_locally for large cluster" from Asias "This series tries to improve the bootstrap of a node in a large cluster by improving how gossip applies the gossip node state. In #2404, the joining node failed to bootstrap, because it did not see the seed node when storage_service::bootstrap ran. After this series, we apply the whole gossip state contained in the gossip ack/ack2 message before applying the next one, and we apply the state of the seed node earlier than non-seed node so we can have the seed node's state faster. We also add some randomness to the order of applying gossip node state to prevent some of the nodes' state are always applied earlier than the others. This series improves apply_state_locally for large cluster: - Tune the order of applying endpoint_state - Serialize apply_state_locally - Avoid copying of the gossip state map Fixes #2404" * tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev: gossip: Avoid copying with apply_state_locally gossip: Serialize apply_state_locally gossip: Tune the order of applying endpoint_state in apply_state_locally gossip: Introduce is_seed helper gossip: Pass const endpoint_state& in notify_failure_detector gossip: Pass reference in notify_failure_detector	2017-09-08 11:41:43 +03:00
Asias He	57dd3cb2c5	gossip: Do not use c_str() in the logger Use logger.info("{}", msg) instead. Message-Id: <52c24d7dfe082ee926f065a6268d83fcb31ddc28.1504832289.git.asias@scylladb.com>	2017-09-08 10:59:42 +03:00
Asias He	e98ce7887b	gossip: Avoid copying with apply_state_locally Move the std::map<inet_address, endpoint_state> map from the gossip ack/ack2 message directly and move it around in apply_state_locally to avoid copying the map.	2017-09-08 15:19:48 +08:00
Asias He	fd879b4e09	gossip: Serialize apply_state_locally apply_state_locally will be called when gossip ack/ack2 message is received. It will use the std::map<inet_address, endpoint_state>& map to update the endpoint state. However, we can receive multiple such gossip ack/ack2 messages from multiple peer nodes in parallel. Currently, we process them in parallel. It is better to apply all the states from one node then move to apply all the states from another node than interleaving. Because it is more important to have the state of the whole cluster than to have a bit newer state from another peer (if it is newer), especially when the node boots up and runs its first round of gossip exchange. After this patch, we apply the whole gossip state contained in the gossip ack/ack2 message before applying the next one.	2017-09-08 15:19:47 +08:00
Asias He	9ccba950ba	gossip: Tune the order of applying endpoint_state in apply_state_locally We currently always apply the endpoint_state in the order of the endpoint ip address. This is not good because some of the endpoints' state is always applied earlier than the others. In a large cluster, the number of endpoints can be large, and it takes time to apply all of them. To make it more fair, we apply the endpoint_state randomly. Apply the seed node's state earlier because in bootstrap, we will check if we have seen the seed node in storage_service::bootstrap. In #2404, the bootstrap failed because the joining node hadn't applied the seed node's state when storage_service::bootstrap ran.	2017-09-08 15:19:47 +08:00
Asias He	c5456ed38f	gossip: Introduce is_seed helper To check if an endpoint is a seed node.	2017-09-08 15:19:47 +08:00
Asias He	32edd95241	gossip: Pass const endpoint_state& in notify_failure_detector	2017-09-08 15:19:47 +08:00
Asias He	46e562cbfa	gossip: Pass reference in notify_failure_detector In large cluster, the map can be large. Pass reference to avoid copying.	2017-09-08 15:19:47 +08:00
Glauber Costa	db846326f8	compaction: remove dead code This code has no more users. Bury it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170908005305.29925-1-glauber@scylladb.com>	2017-09-08 08:17:15 +02:00
Tomasz Grabiec	57dc988475	Update seastar submodule * seastar 85ca12d...31b925d (19): > net/byteorder: fix 64 bit ntohq and htonq on big endian machines > core, util: fix compilation on non-x86 processors > core/memory: Fix SIGSEGV in small_pool::add_more_objects() > log: remove debug leftovers > Merge "TLS state machine fixes" from Calle > logger: allow adjusting the timestamp style for stdout logs > thread: make thread_context::s_main portable > core: add seastar::cache_line_size constant > Add detach() to input_stream and output_stream > Install dependencies for Arch Linux. > tls: Guard non-established sockets in sesrefs + more explicit close + states > tls: Make vec_push fully exception safe > basic_sstring: resize uses sstring > Merge "Add and correct unit tests" from Jesse > tcp: enforce 1-byte maximum segment invariant with zero window > tcp: verify 1-byte maximum segment invariant during send with zero window > memory: reduce small_pool vulnerability to fragmentation further > Prometheus: avoid merging all metrics family > net: Fix possible NULL pointer dereference.	2017-09-07 10:34:27 +02:00
Avi Kivity	d9ee2ad9f0	chunked_vector: avoid boost::small_vector with old boost versions Apparently older boost versions have a bug resulting in a double-free in boost::container::small_vector. Use std::vector instead. Fixes #2748. Tested-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20170903170207.21635-1-avi@scylladb.com>	2017-09-07 09:32:51 +03:00
Tomasz Grabiec	121cd8cb6c	tests: Fix cql_query_test.cc::test_duration_restrictions validate_request_failure() assumed that the future returned by execute_cql() is always ready, which doesn't have to be the case, and caused aborts in debug mode build. Message-Id: <1504701342-13300-1-git-send-email-tgrabiec@scylladb.com>	2017-09-06 15:49:03 +03:00
Tomasz Grabiec	3986486cb3	tests: cql_test_env: Avoid exceptions to make debugging easier Message-Id: <1504701375-13491-1-git-send-email-tgrabiec@scylladb.com>	2017-09-06 15:48:59 +03:00
Paweł Dziepak	e401d2d50b	db: reject non-Scylla counter sstables in flush_upload_dir Scylla already refuses to load counter sstables that do not have the Scylla component. However, if this happens because of the 'nodetool refresh' command, the existing protection will trigger after sstables have been moved to the data directory. This is too late, so an additional check is added when the upload directory is scanned.	2017-09-06 12:04:26 +01:00
Paweł Dziepak	6a5e8bace1	db: disallow loading non-Scylla counter sstables Scylla does not support local and remote counter shards. This means that it is unsafe to directly load sstables that may contain them.	2017-09-06 12:03:58 +01:00
Paweł Dziepak	ebc538f4a3	sstable: add has_scylla_component() has_scylla_component() is going to be used to verify that an sstable has been generated by a recent version of Scylla. This would make it possible to reject sstables that may be unsafe to load (e.g. sstables containing legacy counter shards).	2017-09-06 12:03:45 +01:00
Avi Kivity	a59e375aad	Merge "Support termination of repair jobs" from Asias "This series implements the missing API to terminate all repairs. For example: $ curl -X POST --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/force_terminate_repair" With the new stream_plan::abort() api we can now abort the stream session associated with the repair as well. On top of this, we can support termination of a single repair job instead of all jobs. Fixes #2105" * tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev: repair: Support termination of repair jobs repair: Track repair_info repair: Introduce repair id to repair_info map api: Add force_terminate_repair API streaming: Add abort to stream_plan streaming: Add abort_all_stream_sessions for stream_coordinator streaming: Introduce streaming::abort() streaming: Make stream_manager and coordinator message debug level streaming: Check if _stream_result is valid streaming: Log peer address in on_error streaming: Introduce received_failed_complete_message	2017-09-06 12:58:05 +03:00
Avi Kivity	31706ba989	Merge "Fix Scylla upgrades when counters are used" from Paweł "Scylla 1.7.4 and older use incorrect ordering of counter shards, this was fixed in `0d87f3dd7d` ("utils::UUID: operator< should behave as comparison of hex strings/bytes"). However, that patch was not backported to 1.7 branch until very recently. This means that versions 1.7.4 and older emit counter shards in an incorrect order and expect them to be so. This is particularly bad when dealing with imported correct sstables in which case some shards may become duplicated. The solution implemented in this patch is to allow any order of counter shards and automatically merge all duplicates. The code is written in a way so that the correct ordering is expected in the fast path in order not to excessively punish unaffected deployments. A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless upgrade from 1.7.4 to later Scylla versions. If that feature is not available Scylla still writes sstables and sends on-wire counters using the old ordering so that it can be correctly understood by 1.7.4, once the flag becomes available Scylla switches to the correct order. Fixes #2752." * tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla: tests/counter: verify counter_id ordering counter: check that utils::UUID uses int64_t mutation_partition_serializer: use old counter ordering if necessary mutation_partition_view: do not expect counter shards to be sorted sstables: write counter shards in the order expected by the cluster tests/sstables: add storage_service_for_tests to counter write test tests/sstables: add test for reading wrong-order counter cells sstables: do not expect counter shards to be sorted storage_service: introduce CORRECT_COUNTER_ORDER feature tests/counter: test 1.7.4 compatible shard ordering counters: add helper for retrieving shards in 1.7.4 order tests/counter: add tests for 1.7.4 counter shard order counters: add counter id comparator compatible with Scylla 1.7.4 tests/counter: verify order of counter shards tests/counter: add test for sorting and deduplicating shards counters: add function for sorting and deduplicating counter cells counters: add counter_id::operator>	2017-09-05 14:20:55 +03:00
Paweł Dziepak	ed68a75b75	tests/counter: verify counter_id ordering	2017-09-05 10:52:54 +01:00
Paweł Dziepak	cdf7ba76f1	counter: check that utils::UUID uses int64_t	2017-09-05 10:46:03 +01:00
Paweł Dziepak	4aa72c6454	mutation_partition_serializer: use old counter ordering if necessary Until the cluster is fully upgraded from a version that uses the incorrect counter shard ordering it is essential to keep using it lest the old nodes corrupt the data upon receiving mutations with a counter shard ordering they do not expect.	2017-09-05 10:32:48 +01:00
Paweł Dziepak	b540516e5e	mutation_partition_view: do not expect counter shards to be sorted	2017-09-05 10:32:48 +01:00
Paweł Dziepak	84edb5a1f2	sstables: write counter shards in the order expected by the cluster If the feature signaling that we have switched to the correct ordering of counter shards is not enabled it means that the user still can do a rollback to a version that expects wrong ordering. In order to avoid any disasters when that happens write sstables using the 1.7.4 order until we know for sure that it is no longer needed.	2017-09-05 10:32:48 +01:00
Paweł Dziepak	2b614201a7	tests/sstables: add storage_service_for_tests to counter write test Writing counters to an sstable is going to require cluster feature information, which requires accessing some singletons.	2017-09-05 10:32:48 +01:00
Paweł Dziepak	5007c9290a	tests/sstables: add test for reading wrong-order counter cells	2017-09-05 10:32:48 +01:00
Paweł Dziepak	3e1d09e71d	sstables: do not expect counter shards to be sorted	2017-09-05 10:32:48 +01:00
Paweł Dziepak	ecd2bf128b	storage_service: introduce CORRECT_COUNTER_ORDER feature Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix this problem a new feature is introduced that will be used to determine when nodes with that bug fixed can start sending counter shard in the correct order.	2017-09-05 10:32:48 +01:00
Paweł Dziepak	1e03c4acbe	tests/counter: test 1.7.4 compatible shard ordering	2017-09-05 10:32:47 +01:00
Paweł Dziepak	067e429881	counters: add helper for retrieving shards in 1.7.4 order	2017-09-05 10:32:47 +01:00
Paweł Dziepak	fd25a09db2	tests/counter: add tests for 1.7.4 counter shard order	2017-09-05 10:32:47 +01:00
Paweł Dziepak	a93e8ce185	counters: add counter id comparator compatible with Scylla 1.7.4	2017-09-05 10:32:47 +01:00
Paweł Dziepak	b0f67c1680	tests/counter: verify order of counter shards	2017-09-05 10:32:47 +01:00
Paweł Dziepak	27397b5dad	tests/counter: add test for sorting and deduplicating shards	2017-09-05 10:32:47 +01:00
Paweł Dziepak	e0c2379f26	counters: add function for sorting and deduplicating counter cells Due to a bug in an implementation of UUID less compare some Scylla versions sort counter shards in an incorrect order. Moreover, when dealing with imported correct data the inconsistencies in ordering caused some counter shards to become duplicated.	2017-09-05 10:32:39 +01:00
Paweł Dziepak	74af818eaf	counters: add counter_id::operator>	2017-09-04 18:25:47 +01:00
Avi Kivity	4b06a2e95d	Merge "Fix exception safety in cache update related paths" from Tomasz * 'tgrabiec/make-row-cache-update-exception-safe' of github.com:scylladb/seastar-dev: row_cache: Improve safety of cache updates row_cache: Extract invalidate_sync() memtable: Mark mark_flushed() as noexcept database: Add non-throwing try_trigger_compaction() database: Make add_sstable() have strong exception guarantees row_cache: Don't require presence checker to be supplied externally database: Supply presence checker in sstable snapshots mutation_source: Introduce mutation_source::make_partition_presence_checker() mutation_reader: Move definitions up in the header mutation_reader: Use constructor delegation to reduce code duplication row_cache: Make populate() preserve continuity row_cache: Allow marking as fully continuous on construction database: Add missing serialization of sstable set update and cache invalidation	2017-09-04 18:37:42 +03:00
Tomasz Grabiec	d22fdf4261	row_cache: Improve safety of cache updates Cache imposes requirements on how updates to the on-disk mutation source are made: 1) each change to the on-disk mutation source must be followed by cache synchronization reflecting that change 2) The two must be serialized with other synchronizations 3) must have strong failure guarantees (atomicity) Because of that, sstable list update and cache synchronization must be done under a lock, and cache synchronization cannot fail to synchronize. Normally cache synchronization achieves the no-failure guarantee by wiping the cache (which is noexcept) in case a failure is detected. There are some setup steps however which cannot be skipped, e.g. taking a lock followed by switching cache to use the new snapshot. That truly cannot fail. The lock inside cache synchronizers is redundant, since the user needs to take it anyway around the combined operation. In order to make ensuring strong exception guarantees easier, and to make the cache interface easier to use correctly, this patch moves the control of the combined update into the cache. This is done by having cache::update() et al accept a callback (external_updater) which is supposed to perform modification of the underlying mutation source when invoked. This is in line with the layering. Cache is layered on top of the on-disk mutation source (it wraps it) and reading has to go through cache. After the patch, modification also goes through cache. This way more of cache's requirements can be confined to its implementation. The failure semantics of update() and other synchronizers needed to change due to strong exception guarantees. Now if it fails, it means the update was not performed, neither to the cache nor to the underlying mutation source. The database::_cache_update_sem goes away, serialization is done internally by the cache. The external_updater needs to have strong exception guarantees. This requirement is not new. It is however currently violated in some places. This patch marks those callbacks as noexcept and leaves a FIXME. Those should be fixed, but that's not in the scope of this patch. Aborting is still better than corrupting the state. Fixes #2754. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Thread stack allocation may fail, in which case we did not do the necessary invalidation.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	b0f3efa577	row_cache: Extract invalidate_sync()	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	673a22f8e1	memtable: Mark mark_flushed() as noexcept Callers rely on that.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	bf75b882ae	database: Add non-throwing try_trigger_compaction()	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	116d4ae02b	database: Make add_sstable() have strong exception guarantees If insert() fails, we left the database with stats updated, but sstable not being attached.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	56e3ce05db	row_cache: Don't require presence checker to be supplied externally The API is simpler and safer this way.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	df787afe6a	database: Supply presence checker in sstable snapshots	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	8a9f0f86e7	mutation_source: Introduce mutation_source::make_partition_presence_checker() Every mutation source can have a presence checker. By default all answer "maybe contains". Having this on mutation_source level will be useful for simplifying cache update flow. The cache can ask the right snapshot for a presence checker rather than relying on database to know when and how to make the right one which preserves all invariants. This will be especially useful once all updates of the underlying mutation source of cache (e.g. sstable list) will have to go through cache for safety reasons.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	065feb1b7b	mutation_reader: Move definitions up in the header	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	4e4839082b	mutation_reader: Use constructor delegation to reduce code duplication	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	1a2f17d42c	row_cache: Make populate() preserve continuity	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	bc3112a187	row_cache: Allow marking as fully continuous on construction Will be needed in tests.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	ab8632b225	database: Add missing serialization of sstable set update and cache invalidation Commit `e3ad676433` missed a few places. It is required to serialize sstable list update and cache synchronization in order to preserve partition update isolation. Fixes #2746.	2017-09-04 10:04:29 +02:00
Piotr Jastrzebski	dd5dc75605	Stop calling _local_cache.stop in at_exit. This removes a race condition that was causing #2721 Fixes #2721 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <ad060fab43d63c17db9f811c421d7ab26e5e57c8.1503933021.git.piotr@scylladb.com>	2017-09-03 15:55:48 +03:00
Asias He	e14bb7b1d5	repair: Remove #if'ed code in repair_ranges It is unlikely we will use parallel_for_each version in repair_ranges. Get rid of the dead code. Message-Id: <31a9366adfe0262512a326ef9703aa0bba05e1fb.1503996138.git.asias@scylladb.com>	2017-09-03 11:13:02 +03:00
Avi Kivity	0524cbbd72	Merge db/config.cc cleanups from Jesse * 'jhk/config_hygiene/v1' of https://github.com/hakuch/scylla: db/config.cc: Clarify documentation for `typed_value_ex` db/config.cc: Fix formatting and warnings db/config.cc: Remove unnecessary `mutable` on lambdas db/config.cc: Remove unused variables	2017-09-03 11:08:53 +03:00
Botond Dénes	a980ff6463	Use abort() instead of assert + throw in unreachable code Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <393c3730111dfe090c44d8fc2e31602956a7d008.1504022425.git.bdenes@scylladb.com>	2017-09-03 11:07:27 +03:00
Raphael S. Carvalho	22701346de	sstables/stcs: avoid needless copy of bucket in get_buckets() In addition, remove bucket by iterator which is faster. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170903000315.16338-1-raphaelsc@scylladb.com>	2017-09-03 10:46:48 +03:00
Avi Kivity	551eb75eb0	Update AMI submodule * dist/ami/files/scylla-ami b41e5eb...5ffa449 (3): > amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x) > Prevent dependency error on 'yum update' > scylla_create_devices: don't raise error when no disks found	2017-08-31 15:13:52 +03:00
Glauber Costa	e642aee3f7	database: wait for asynchronous operations to end before closing CF This was part of "add gate for generic async operations to column family" but somehow didn't make it into the final patch. Add the missing piece. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170830164205.4497-1-glauber@scylladb.com>	2017-08-31 11:16:30 +03:00
Avi Kivity	23d3ca56a1	Merge "optional integrity checker of sstable component writes" from Raphael "optional interposer that will check integrity of writes to sstable components. The option name is enable_sstable_data_integrity_check, it's disabled by default and can be enabled via config file. It will provide enough details that will help to find the root of the issue. if disk failed for example, we would've something like the following reported: ERROR 2017-08-17 09:18:11,577 [shard 0] sstable - integrity check failed for ./data/data/system_schema/aggregates-924c55872e3a345bb10c12f37c1ba895/system_schema-aggregates-ka-111-Scylla.db, stage: read after write verification, write: 4096 bytes to offset 0, reason: data read from underlying storage isn't the same as written, mismatch at byte 0: data written sample: 10000001000000010000001a00000001 date read sample: 00000000000000000000000000000000" * 'integrity_check_interposer_v3' of github.com:raphaelsc/scylla: sstables: optionally check integrity of sstable component writes sstables: remove unneeded new_sstable_component_file variant db/config: add sstable_data_integrity_check option sstables: introduce file interposer for integrity check	2017-08-31 11:08:12 +03:00
Raphael S. Carvalho	a84fbde8c8	sstables: optionally check integrity of sstable component writes If config file's sstable_data_integrity_check option is enabled, new integrity check interposer will be used in addition to the existing one. Performance is expected to drop because of all the integrity checks for every write. This new interposer will provide detailed info when integrity check fails. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-31 02:27:50 -03:00
Raphael S. Carvalho	04ea4daa7e	sstables: remove unneeded new_sstable_component_file variant can get rid of it because file_open_options is optional in reactor::open_file_dma() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-31 02:17:34 -03:00
Raphael S. Carvalho	0218d6fd8f	db/config: add sstable_data_integrity_check option If enabled, interposer for checking integrity of sstable component writes will be used. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-30 13:57:08 -03:00
Raphael S. Carvalho	f76b609cf5	sstables: introduce file interposer for integrity check optional interposer that will check integrity when writing to sstable components. It will provide enough details that will help to find the root of the issue, which may come from lower level layers. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-30 11:55:36 -03:00
Jesse Haber-Kucharsky	eddf34d005	test.py: Add missing tests Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <6fc5e810495801e646ccc41c16b581c8eceeda22.1504030666.git.jhaberku@scylladb.com>	2017-08-30 09:58:12 +01:00
Asias He	471e8b341f	repair: Support termination of repair jobs This patch implements the missing API to terminate all repairs. For example: $ curl -X POST --header "Accept: application/json" "http://127.0.0.2:10000/storage_service/force_terminate_repair" With the new stream_plan::abort() api we can now abort the stream session associated with the repair as well. Fixes #2105	2017-08-30 15:19:52 +08:00
Asias He	07d9dc03ec	repair: Track repair_info Make repair_info a shared pointer and store them in _repairs map so we can find by the repair id and access them later.	2017-08-30 15:19:52 +08:00
Asias He	5c9732c645	repair: Introduce repair id to repair_info map The maps are stored in a vector. The vector has smp::count elements, each element will be accessed by only one shard. The add_repair_info, remove_repair_info and get_repair_info helpers are added.	2017-08-30 15:19:51 +08:00
Asias He	6dc62c6215	api: Add force_terminate_repair API The api /storage_service/force_terminate is supposed to be /storage_service/force_terminate_repair. scylla-jmx uses /storage_service/force_terminate api. So instead of renaming it, it is better to add a new name for it.	2017-08-30 15:19:51 +08:00
Asias He	9c8da2cc56	streaming: Add abort to stream_plan It can be used by the user of stream_plan to abort the stream sessions. Repair will be the first user when aborting the repair.	2017-08-30 15:19:51 +08:00
Asias He	475b7a7f1c	streaming: Add abort_all_stream_sessions for stream_coordinator It will abort all the sessions within the stream_coordinator. It will be used by stream_plan soon.	2017-08-30 15:19:51 +08:00
Asias He	fad34801bf	streaming: Introduce streaming::abort() It will be used soon by stream_plan::abort() to abort a stream session.	2017-08-30 15:19:50 +08:00
Asias He	7fba7cca01	streaming: Make stream_manager and coordinator message debug level When we abort a session, it is possible that: 1) node 1 aborts the session by user request 2) node 1 sends the complete_message to node 2 3) node 2 aborts the session upon receipt of the complete_message 4) node 1 sends one more stream message to node 2 and the stream_manager for the session cannot be found. It is fine for node 2 to not be able to find the stream_manager, so make the log on node 2 less verbose to confuse users less.	2017-08-30 15:19:50 +08:00
Asias He	be573bcafb	streaming: Check if _stream_result is valid If on_error() was called before init() was executed, the _stream_result can be invalid.	2017-08-30 15:19:44 +08:00
Asias He	8a3f6acdd2	streaming: Log peer address in on_error	2017-08-30 15:18:27 +08:00
Asias He	eace5fc6e8	streaming: Introduce received_failed_complete_message It is the handler for the failed complete message. Add a flag to remember if we received such a message from the peer; if so, do not send the failed complete message back to the peer when running close_session with failed status.	2017-08-30 15:18:27 +08:00
Asias He	cc18da5640	Revert "gossip: Make bootstrap more robust" This reverts commit `b56ba02335`. After commit `8fa35d6ddf` (messaging_service: Get rid of timeout and retry logic for streaming verb), streaming verb in rpc does not check if a node is in gossip membership since all the retry logic is removed. Remove the extra wait before removing the joining node from gossip membership. Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>	2017-08-30 10:14:11 +03:00
Avi Kivity	48b9e47f7d	Revert "row_cache: Add missing handling for failures happening outside the updating thread" This reverts commit `f9feb310ab` (requested by author).	2017-08-29 19:26:02 +03:00
Tomasz Grabiec	f9feb310ab	row_cache: Add missing handling for failures happening outside the updating thread Thread stack allocation may fail, in which case we did not do the necessary invalidation. Fix by hoisting the scope of the cleanup function. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:22 +03:00
Tomasz Grabiec	5d2f2bc90b	lsa: Mark region::merge() as noexcept It seems to satisfy this, and row_cache::do_update() will rely on it to simplify error handling. Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:17 +03:00
Asias He	8fa35d6ddf	messaging_service: Get rid of timeout and retry logic for streaming verb With the "Use range_streamer everywhere" (`7217b7ab36`) series, all users of streaming now stream relatively small ranges and can retry streaming at a higher level. There are problems with timeout and retry at RPC verb level in streaming: 1) Timeout can be false negative. 2) We can not cancel the send operations which are already called. When user aborts the streaming, the retry logic keeps running for a long time. This patch removes all the timeout and retry logic for streaming verbs. After this, the timeout is the job of TCP, the retry is the job of the upper layer. Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>	2017-08-29 17:20:00 +03:00
Botond Dénes	d1209c548a	Fix -Wreturn-type warnings Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <99f7a006daaa78eb87720ac51c394093398bc868.1504013915.git.bdenes@scylladb.com>	2017-08-29 16:41:09 +03:00
Tomer Sandler	f1eb6a8de3	node_health_check: Various updates - Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore). - Removed curl command (no longer using the api_address), instead using scylla --version - Added -v flag in iptables command, for more verbosity - Added support for OEL (Oracle Enterprise Linux) - minor fix - Some text changes - minor - OEL support indentation fix + collecting all files under /etc/scylla - Added line separation under cp output message Signed-off-by: Tomer Sandler <tomer@scylladb.com> Message-Id: <20170828131429.4212-1-tomer@scylladb.com>	2017-08-29 15:15:10 +03:00
Paweł Dziepak	90c77c89ae	test.py: add missing compress_test Message-Id: <20170829105331.27078-1-pdziepak@scylladb.com>	2017-08-29 13:05:11 +02:00
Paweł Dziepak	d5fa07f6df	Merge "sstables: switch from deque<> to a custom container" from Avi Large deques require contiguous storage, which may not be available (or may be expensive to obtain). Switch to new custom container instead, which allocates less contiguous storage. Allocation problems were observed with the summary and compression info. While there is work to reduce compression info contiguous space use, this solves all std::deque problems (and should not conflict with that work). Fixes #2708 * tag '2708/v6' of https://github.com/avikivity/scylla: sstables: switch std::deque to chunked_vector tests: add test for chunked_vector utils: add a new container type chunked_vector	2017-08-29 11:11:01 +01:00
Avi Kivity	5224ab9c92	Merge "Fix sstable reader not working for empty set of clustering ranges" from Tomasz "Fixes #2734." * 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev: tests: Introduce clustering_ranges_walker_test tests: simple_schema: Add missing include sstables: reader: Make clustering_ranges_walker work with empty range set clustering_ranges_walker: Make adjacency more accurate	2017-08-29 10:28:49 +03:00
Asias He	a36141843a	gossip: Switch to seastar::lowres_system_clock The newly added lowres_system_clock is good enough for gossip resolution. Switch to use it. Message-Id: <fe0e7a9ef1ea0caffaa8364afe5c78b6988613bf.1503971833.git.asias@scylladb.com>	2017-08-29 10:16:25 +03:00
Asias He	2701bfd1f8	gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints The _unreachable_endpoints will be accessed in fast path soon by the hinted hand off code. Message-Id: <500d9cbb2117ab7b070fd1bd111c5590f46c3c3a.1503971826.git.asias@scylladb.com>	2017-08-29 10:15:55 +03:00
Tomasz Grabiec	05e0ca6546	tests: Introduce clustering_ranges_walker_test	2017-08-28 21:08:55 +02:00
Tomasz Grabiec	dcbc1282a9	tests: simple_schema: Add missing include	2017-08-28 21:00:06 +02:00
Tomasz Grabiec	48dabc8262	sstables: reader: Make clustering_ranges_walker work with empty range set Such queries can be issued by counter updates which involve only static row. Causes failure in test_query_only_static_row invoked from sstable_mutation_test. See commit `6572f38`, which fixed the problem in cache reader. Fixes #2734.	2017-08-28 21:00:06 +02:00
Tomasz Grabiec	071badce3b	clustering_ranges_walker: Make adjacency more accurate Current check considered some adjacent range tombstones as overlapping with the ranges. Making this more accurate will become more important after we will rely on putting p_i_p::after_all_clustered_rows() in _current_start in out-of-range state.	2017-08-28 21:00:06 +02:00
Jesse Haber-Kucharsky	abf4c1688d	db/config.cc: Clarify documentation for `typed_value_ex`	2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky	7374f9d86f	db/config.cc: Fix formatting and warnings	2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky	90666e5744	db/config.cc: Remove unnecessary `mutable` on lambdas	2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky	449bd60480	db/config.cc: Remove unused variables	2017-08-28 10:08:29 -04:00
Botond Dénes	eec451bcf8	segmented_offsets: use _current_bucket_segment_index consistently Previously _current_bucket_segment_index was used differently depending on whether update_position_trackers() is used in a random or sequential access. In the former case it was used as the absolute index of the segment (independent of the buckets) and in the latter as the relative index of the segment within its bucket. This caused problems when there was a switch between random and sequential access, meaning one could get different results for an at() call depending on what the previous at() call was. Fix this by consistently using _current_bucket_segment_index as - like its name suggests - the bucket relative segment index. Ref #1946. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <7f68ac1d32c80e8dea6dfa11be02acaa961bce2a.1503924927.git.bdenes@scylladb.com>	2017-08-28 16:14:25 +03:00
Avi Kivity	fa8d0fe4d0	Revert "Revert "Revert "Revert "Merge "Compress in-memory compression-info" from Botond"""" This reverts commit `238877a0c6`. A fix was found and will be committed shortly.	2017-08-28 16:14:13 +03:00
Tomer Sandler	83f249c15d	node_health_check: added line separation under cp output message Signed-off-by: Tomer Sandler <tomer@scylladb.com> Message-Id: <20170828124307.2564-1-tomer@scylladb.com>	2017-08-28 15:44:13 +03:00
Tomasz Grabiec	16c1b0fb6b	Merge "Reduce dependencies on types.hh" from Avi * 'deps1/v1' of https://github.com/avikivity/scylla: types.hh: extract marshal_exception from types.hh into a new file utils: remove dependency on types.hh locator: add missing include "log.hh" supervisor: remove dependency on init.hh tracing: add missing include "log.hh" gms: remove unneeded #include "types.hh"	2017-08-28 13:58:46 +02:00
Avi Kivity	4e67bc9573	Merge "Fixes for skipping in sstable reader" from Tomasz * 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev: tests: mutation_source_test: Add more tests for fast forwarding across partitions sstables: Fix abort in mutation reader for certain skip pattern sstables: Fix reader returning partition past the query range in some cases sstables: Introduce data_consume_context::eof()	2017-08-28 12:48:02 +03:00
Tomasz Grabiec	3241018c79	tests: mutation_source_test: Add more tests for fast forwarding across partitions	2017-08-28 10:30:08 +02:00
Tomasz Grabiec	65e488c150	sstables: Fix abort in mutation reader for certain skip pattern The problem happens for the following sequence of events: 1) reader stops in the middle of some partition before it skips to another partition range 2) reader is fast forwarded to a partition range which has no data in the sstable. There are some partitions between the previous partition range and the one we skip to 3) the reader is asked for next partition The problem was that mutation_reader::fast_forward_to() was putting the reader in _read_enabled == false state in step 2, but data_consume_context was not fast forwarded to the range. When in step 3 we were asked for the next partition, we attempted to skip using index (because of 1). The result of the skip was some position which is outside of the current range of data_consume_context, which causes it to abort. To fix, add a check for _read_enabled before we try to skip.	2017-08-28 10:28:15 +02:00
Tomasz Grabiec	dc3c8863f3	sstables: Fix reader returning partition past the query range in some cases If index was used to skip to the next partition (because the current partition wasn't consumed in full) and reader's partition range ends before the data file ends, we did not detect that we're out of range before returning a streamed_mutation. Fix by checking _context.eof() before doing that. Refs #2733.	2017-08-28 10:16:27 +02:00
Tzach Livyatan	12fb975282	Fix typos in metrics description Fixes #2658 Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170803121732.19640-1-tzach@scylladb.com>	2017-08-28 10:48:28 +03:00
Takuya ASADA	437931f499	dist/redhat: fix dependency package name typo scylla-libstdc++-static53 -> scylla-libstdc++53-static Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1503306027-7316-1-git-send-email-syuu@scylladb.com>	2017-08-28 10:44:40 +03:00
Tomasz Grabiec	6baad2c2e6	sstables: Introduce data_consume_context::eof()	2017-08-28 09:19:43 +02:00
Avi Kivity	171fe67a64	gms: remove unneeded #include "types.hh"	2017-08-27 15:18:57 +03:00
Avi Kivity	a9f19e37b5	tracing: add missing include "log.hh" It's currently made available via another include, which is going away.	2017-08-27 15:18:41 +03:00
Avi Kivity	471ae5b22b	supervisor: remove dependency on init.hh Replace with a simpler dependency on log.hh	2017-08-27 15:17:55 +03:00
Avi Kivity	27d3ab20a9	locator: add missing include "log.hh" It's currently made available via another include, which is going away.	2017-08-27 15:17:05 +03:00
Avi Kivity	7234f0f0a0	utils: remove dependency on types.hh Replace with dependency on much smaller marshal_exception.hh.	2017-08-27 15:16:21 +03:00
Avi Kivity	93317e2f4a	types.hh: extract marshal_exception from types.hh into a new file For better or worse, marshal_exception is used from utils/, and it's not good to have utils/ depend on types.hh. Extract marshal_exception to make it possible to remove the dependency.	2017-08-27 15:14:55 +03:00
Avi Kivity	238877a0c6	Revert "Revert "Revert "Merge "Compress in-memory compression-info" from Botond""" This reverts commit `9d27455744`. It's still broken. To reproduce: ./tools/bin/cassandra-stress write -schema compression=LZ4Compressor (on a clean database) .0 0x00007ffff32aa69b in raise () from /lib64/libc.so.6 .1 0x00007ffff32ac4a0 in abort () from /lib64/libc.so.6 .2 0x000000000054a0e8 in seastar::memory::abort_on_underflow (size=<optimized out>) at core/memory.cc:1189 .3 seastar::memory::allocate_large (size=<optimized out>) at core/memory.cc:1194 .4 0x000000000054b305 in seastar::memory::allocate (size=size@entry=18446744073702885265) at core/memory.cc:1227 .5 0x000000000054b45e in malloc (n=n@entry=18446744073702885265) at core/memory.cc:1452 .6 0x00000000006013e4 in seastar::temporary_buffer<char>::temporary_buffer (this=0x6010195fc800, size=18446744073702885265) at /home/avi/urchin/seastar/core/temporary_buffer.hh:72 .7 0x0000000000a3908b in seastar::input_stream<char>::read_exactly (this=0x6010053d0248, n=18446744073702885265) at /home/avi/urchin/seastar/core/iostream-impl.hh:189 .8 0x0000000000a9c77f in compressed_file_data_source_impl::get (this=0x6010053d0240) at sstables/compress.cc:499 .9 0x0000000000aa1b01 in seastar::data_source::get (this=<optimized out>) at /home/avi/urchin/seastar/core/iostream.hh:63 .10 seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}::operator()() const (__closure=__closure@entry=0x6010195fcab0) at /home/avi/urchin/seastar/core/iostream-impl.hh:204 .11 0x0000000000aa22f0 in seastar::futurize<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >::apply<seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}&>(sstables::data_consume_rows_context&&) (func=...) 
at /home/avi/urchin/seastar/core/future.hh:1312 .12 seastar::repeat<seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}>(sstables::data_consume_rows_context&&) (action=...) at /home/avi/urchin/seastar/core/future-util.hh:203 .13 0x0000000000a9e730 in seastar::input_stream<char>::consume<sstables::data_consume_rows_context> (consumer=..., this=<optimized out>) at /home/avi/urchin/seastar/core/iostream-impl.hh:237 .14 data_consumer::continuous_data_consumer<sstables::data_consume_rows_context>::consume_input<sstables::data_consume_rows_context> (c=..., this=<optimized out>) at sstables/consumer.hh:226 .15 sstables::data_consume_context::impl::read (this=<optimized out>) at sstables/row.cc:411 .16 sstables::data_consume_context::read (this=<optimized out>) at sstables/row.cc:437 .17 0x0000000000aafbae in sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const (__closure=<optimized out>) at sstables/partition.cc:843 .18 seastar::apply_helper<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}, std::tuple<>&&, std::integer_sequence<unsigned long> >::apply({lambda()#2}&&, std::tuple) (args=..., func=...) at ./seastar/core/apply.hh:36 .19 seastar::apply<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&, std::tuple<>&&) (args=..., func=...) at ./seastar/core/apply.hh:44 .20 seastar::futurize<seastar::future<> >::apply<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&, std::tuple<>&&) (args=..., func=...) 
at ./seastar/core/future.hh:1302 .21 seastar::future<>::then<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}, seastar::future<> >(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&) ( this=this@entry=0x6010195fcbb0, func=...) at ./seastar/core/future.hh:890 .22 0x0000000000ac273f in sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const (__closure=0x6010195fcc28) at sstables/partition.cc:843 .23 seastar::do_until_continued<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}&&, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}&&, seastar::promise<>) (stop_cond=..., action=..., p=...) at /home/avi/urchin/seastar/core/future-util.hh:155 .24 0x0000000000ac29c3 in seastar::do_until<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}&&, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}&&) (action=..., stop_cond=..., this=<optimized out>) at /home/avi/urchin/seastar/core/future-util.hh:330 .25 sstables::sstable_streamed_mutation::fill_buffer (this=<optimized out>) at sstables/partition.cc:844 .26 0x0000000000ad3d2b in streamed_mutation::fill_buffer (this=0x6010195fcd10) at ./streamed_mutation.hh:489 .27 consume_flattened_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer> >, std::function<bool (streamed_mutation const&)> >(mutation_reader&, stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer> >&, std::function<bool (streamed_mutation const&)>&&) ( (gdb) p addr $1 = { chunk_start = 13330037, chunk_len = 18446744073702885265, offset = 0 
}	2017-08-27 13:32:37 +03:00
Avi Kivity	576e33149f	Merge seastar upstream * seastar 0083ee8...85ca12d (1): > Merge "Run-time logging configuration" from Jesse Includes patch from Jesse: "Switch to Seastar for logging option handling In addition to updating the abstraction layer for Seastar logging in `log.hh`, the configuration system (`db/config.{hh,cc}`) has been updated in two ways: - The string-map type for Boost.program_options is now defined in Seastar. - A configuration value can be marked as `UsedFromSeastar`. This is like `Used`, except the option is expected to be defined in the Boost.Program_options description for Seastar. If the option is not defined in Seastar, or it is defined with a different type, then a run-time exception is thrown early in Scylla's initialization. This is necessary because logging options which are now defined in Seastar were previously defined in Scylla and support for these options in the YAML file cannot be dropped. In order to be able to verify that options marked `UsedFromSeastar` are actually defined in Seastar, the interface for adding options to `db::config` has changed from taking a `boost::program_options::options_description_easy_init` (which is a handle into a `boost::program_options::options_description` that only allows adding options) to taking a `boost::program_options::options_description` directly (which also allows querying existing options). Scylla also fully defers to Seastar's support for run-time logging configuration." Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <ef26cffb91bef1ae95d508187a6dd861a6c4fc84.1503344007.git.jhaberku@scylladb.com>	2017-08-27 13:11:33 +03:00
Avi Kivity	4f5b5bc8e6	Merge seastar upstream * seastar b9f1eb7...0083ee8 (1): > http: Add MIME type support for JSON	2017-08-27 13:09:04 +03:00
Jesse Haber-Kucharsky	af95d3baa7	db/config.cc: Remove unused function Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <5a4e4e153c2d87e838d1cf6def7a494a92a72f63.1503344007.git.jhaberku@scylladb.com>	2017-08-27 13:08:19 +03:00
Vlad Zolotarov	9b9f19606f	scylla_cpuset_setup: add the description near the perftune.yaml removing Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1503600250-25169-1-git-send-email-vladz@scylladb.com>	2017-08-27 12:51:12 +03:00
Asias He	68346f7e53	repair: Use with_semaphore for sp_parallelism_semaphore Instead of calling semaphore.signal() manually. Message-Id: <51b7ecdebac91763a2340fe00959742810614845.1503648936.git.asias@scylladb.com>	2017-08-27 12:50:38 +03:00
Avi Kivity	2b3ee4b0a7	Merge "make cf drop more robust" from Glauber "We have recently found two problems with the drop_column_family code that need addressing. The first is that exceptions in truncate() may lead to stop() being skipped, which can cause Scylla to crash. The other is that a truncate() issued before drop_column_family may get the chance to execute only after the column family is already dropped and also crash (that is issue 2726). The second problem is the classic problem of asynchronous execution on an object that may terminate, which we have been traditionally solving with a gate. We add a gate to the column family that will be closed during CF stop(), and we will require all asynchronous operations to enter it. The immediate fix is for truncate(), where we have seen a real, concrete problem. But it would be good to audit other code paths to make sure that they are sane. The most obvious ones, flush, compaction and sstable deletion are already sane, since they are waited on explicitly during stop()." Fixes #2726. * 'issue-2726-v2-master' of github.com:glommer/scylla: database: add gate for generic async operations to column family database: make sure that column family is always stopped when dropped	2017-08-27 12:42:20 +03:00
Avi Kivity	1f66940134	sstables: switch std::deque to chunked_vector Reduce susceptibility to memory fragmentation.	2017-08-26 16:44:47 +03:00
Avi Kivity	204659ef40	tests: add test for chunked_vector	2017-08-26 16:44:47 +03:00
Avi Kivity	3ba2c0652d	utils: add a new container type chunked_vector We currently use std::deque<> for when we need large random-access containers, but deque<> requires nr_items * sizeof(T) / 64 bytes of contiguous memory, which can exceed our 256k fragmentation unit with large sstables. The new container, which is a cross between deque and vector, has much lower limitations. Like deque, we allocate chunks of contiguous items, but they are 128k in size instead of 512. The last chunk can be smaller to avoid allocating 128k for a really small vector.	2017-08-26 16:44:45 +03:00
Tomasz Grabiec	2ca99be27d	ring_position_view: Print token instead of token pointer Broken in `e989d65539`. Message-Id: <1503667158-7544-1-git-send-email-tgrabiec@scylladb.com>	2017-08-25 14:25:21 +01:00
Glauber Costa	83323e155e	database: add gate for generic async operations to column family run_with_compaction_disabled(), which is called by truncate, has a pretty large defer point in remove(). When the code gets to finally execute, we can't guarantee that the column family will still be alive. That is true in particular if we issued a drop table command following truncate: by the time truncate gets to resume, the CF will be gone. Before the column family is dropped, it will always call its stop() method, which means we have an opportunity to do some waiting there. We already wait for flushes and current compactions to end. Traditionally, we have been solving similar problems by adding a gate that will catch asynchronous operations and making sure that potentially asynchronous operations will enter the gate before executing. Let's do the same thing here. We will close() the gate during stop(). Fixes #2726 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-08-24 13:12:57 -04:00
Glauber Costa	d090e7be35	database: make sure that column family is always stopped when dropped truncate can throw exceptions. If it does, cf->stop() will never be called because it is contained in a .then clause instead of finally. One of the things that truncate does - in a finally block of its own - is initiate a final compaction. If it returns an exception nobody will wait for that compaction to finish (since cf->stop() is the one doing that) and we'll crash. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-08-24 13:01:47 -04:00
Avi Kivity	40aeb00151	Merge "consider the pre-existing cpuset.conf when configuring networking mode" from Vlad "Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml file and using it." Fixes #2725. * 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla: scylla_prepare: respect the cpuset.conf when configuring the networking scylla_cpuset_setup: rm perftune.yaml scylla_cpuset_setup: add a missing "include" of scylla_lib.sh	2017-08-24 18:53:22 +03:00
Vlad Zolotarov	c72eb34b89	scylla_prepare: respect the cpuset.conf when configuring the networking Choose the networking configuration mode according to the current contents of /etc/scylla.d/cpuset.conf. If it doesn't exist - use the default mode. If it exists - use the mode that has been used for generation of the CPU set. Store the configuration into the /etc/scylla.d/perftune.yaml Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-08-24 09:09:40 -04:00
Vlad Zolotarov	89285a13ac	scylla_cpuset_setup: rm perftune.yaml scylla_setup resets our configuration and perftune.yaml is a part of it. perftune.yaml is generated based on the contents of cpuset.conf therefore we should reset these together. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-08-24 09:09:40 -04:00
Vlad Zolotarov	d0ccfe34b9	scylla_cpuset_setup: add a missing "include" of scylla_lib.sh The scylla_cpuset_setup uses a verify_args() function that is defined in the scylla_lib.sh. Fixes #2716 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-08-24 09:09:40 -04:00
Paweł Dziepak	1006a946e8	mvcc: allow invoking maybe_merge_versions() inside allocating section Message-Id: <20170823083544.4225-1-pdziepak@scylladb.com>	2017-08-24 14:30:38 +02:00
Pekka Enberg	870de26e35	index: Add index class Add a simple index class, which represents an instantiated index.	2017-08-24 14:00:02 +03:00
Pekka Enberg	d63a650b3f	index: Pass column_family to secondary_index_manager constructor We need column family for various secondary index manager operations.	2017-08-24 14:00:02 +03:00
Pekka Enberg	981e320d54	database: Make secondary index manager per-column family Make the secondary index manager per-column family like in Apache Cassandra to keep CQL front-end similar between the two codebases.	2017-08-24 14:00:02 +03:00
Botond Dénes	839d1db4d3	parse(compression): add missing reinterpret_cast<char> std::copy_n was using value as uint64_t, smashing the stack. Also remove unused variable. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4e2d71fc74326965dfd98bed2347100fb6ebe43b.1503568210.git.bdenes@scylladb.com>	2017-08-24 13:38:03 +03:00
Avi Kivity	9d27455744	Revert "Revert "Merge "Compress in-memory compression-info" from Botond"" This reverts commit `9656fd79a0`. A fix is now available.	2017-08-24 13:37:35 +03:00
Tomasz Grabiec	9656fd79a0	Revert "Merge "Compress in-memory compression-info" from Botond" This reverts commit `ef85cf1cb3`, reversing changes made to `de011ece52`. Vlad reports that this causes SIGSEGV on cluster restarts. seastar::backtrace_buffer::append_backtrace() at /home/vladz/work/urchin/seastar/core/reactor.cc:274 (inlined by) print_with_backtrace at /home/vladz/work/urchin/seastar/core/reactor.cc:289 seastar::print_with_backtrace(char const) at /home/vladz/work/urchin/seastar/core/reactor.cc:296 sigsegv_action at /home/vladz/work/urchin/seastar/core/reactor.cc:3512 (inlined by) operator() at /home/vladz/work/urchin/seastar/core/reactor.cc:3498 (inlined by) _FUN at /home/vladz/work/urchin/seastar/core/reactor.cc:3494 ?? ??:0 operator()<seastar::temporary_buffer<char> > at /home/vladz/work/urchin/sstables/sstables.cc:870 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/apply.hh:44 (inlined by) do_void_futurize_apply_tuple<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/future.hh:1270 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/future.hh:1290 (inlined by) then<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)> > at /home/vladz/work/urchin/seastar/core/future.hh:890 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:873 (inlined by) 
do_until_continued<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>, sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>&> at /home/vladz/work/urchin/seastar/core/future-util.hh:155 do_until<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>, sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>&> at /home/vladz/work/urchin/seastar/core/future-util.hh:330 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:874 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/apply.hh:44 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:1302 then<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:890 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:875 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()> > at /home/vladz/work/urchin/seastar/core/apply.hh:44 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:1302 operator()<seastar::future_state<> > at /home/vladz/work/urchin/seastar/core/future.hh:900 (inlined by) run at /home/vladz/work/urchin/seastar/core/future.hh:395 seastar::reactor::run_tasks(seastar::circular_buffer<std::unique_ptr<seastar::task, std::default_delete<seastar::task> >, 
std::allocator<std::unique_ptr<seastar::task, std::default_delete<seastar::task> > > >&) at /home/vladz/work/urchin/seastar/core/reactor.cc:2317 seastar::reactor::run() at /home/vladz/work/urchin/seastar/core/reactor.cc:2775 seastar::app_template::run_deprecated(int, char*, std::function<void ()>&&) at /home/vladz/work/urchin/seastar/core/app-template.cc:142	2017-08-24 11:44:14 +02:00
Alexys Jacob	a133290694	scylla_io_setup: migrate away from deprecated string.atoi Python 2.0 deprecated string.atoi and we should move away from it as stated here: https://docs.python.org/2/library/string.html#string.atoi Signed-off-by: Alexys Jacob <ultrabug@gentoo.org> Message-Id: <20170817134002.28124-1-ultrabug@gentoo.org>	2017-08-24 12:36:34 +03:00
Avi Kivity	dcac7125fe	Merge seastar upstream * seastar e96881a...b9f1eb7 (9): > httpd: indentation patch > httpd: handle exception when shutting down > stall-detector: Allow backtrace throttling to be configured > stall-detector: Fix messages about suppression not appearing > scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter > scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration > scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads > core: make current_backtrace() noexcept > memory: add large allocation detector stubs for default allocator	2017-08-24 11:35:28 +03:00
Piotr Jastrzebski	477068d2c3	Make streamed_mutation more exception safe Make sure that push_mutation_fragment leaves _buffer_size with a correct value if exception is thrown from emplace_back. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <83398412aa78332d88d91336b79140aecc988602.1503474403.git.piotr@scylladb.com>	2017-08-23 09:37:04 +01:00
Avi Kivity	2f41ed8493	Merge "repair: Do not allow repair until node is in NORMAL status" from Asias Fixes #2723. * tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev: repair: Do not allow repair until node is in NORMAL status gossip: Add is_normal helper	2017-08-23 09:44:45 +03:00
Asias He	69c81bcc87	repair: Do not allow repair until node is in NORMAL status The following backtrace was reported by user when running repair and keeping restarting the node at the same time. #0 0x00007eff077281d7 in raise () from /lib64/libc.so.6 #1 0x00007eff07729a08 in abort () from /lib64/libc.so.6 #2 0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6 #3 0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6 #4 0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133 #5 0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143 #6 0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...) at locator/abstract_replication_strategy.cc:66 #7 0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0, range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196 #8 repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781 #9 <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005 #10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>:: It is reproduced with 1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done 2) start node 127.0.0.1, stop node 127.0.0.1 in a loop The problem is, during boot up, the token_metadata is not replicated to all shards until the node goes into NORMAL status. To fix, check until node is in NORMAL status before allowing repair. Fixes #2723	2017-08-23 14:40:04 +08:00
Asias He	65912dd1ac	gossip: Add is_normal helper It will be used by repair to check if a node is in NORMAL status.	2017-08-23 14:40:04 +08:00
Amnon Heiman	abbd78367c	Add configuration to disable per keyspace and column family metrics The number of keyspace and column family metrics reported is proportional to the number of shards times the number of keyspaces/column families. This can cause a performance issue both on the reporting system and on the collecting system. This patch adds a configuration flag (set to false by default) to enable or disable those metrics. Fixes #2701 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170821113843.1036-1-amnon@scylladb.com>	2017-08-22 19:19:54 +03:00
Botond Dénes	4f42acc956	abstract_marker::raw::prepare: add missing return statement The function doesn't return a value in the all-false branch. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <3c1976682ffc190d741c066d942b83be4463cae8.1503402721.git.bdenes@scylladb.com>	2017-08-22 15:06:18 +03:00
Paweł Dziepak	9d82a1ebfd	abstract_read_executor: make make_requests() exception safe Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>	2017-08-22 12:09:42 +02:00
Paweł Dziepak	31afc2f242	shared_index_lists: restore indentation Message-Id: <20170821162934.25386-4-pdziepak@scylladb.com>	2017-08-22 12:09:42 +02:00
Paweł Dziepak	93eaa95378	sstables: make shared_index_lists::get_or_load exception safe Message-Id: <20170821162934.25386-3-pdziepak@scylladb.com>	2017-08-22 12:09:42 +02:00
Avi Kivity	ef85cf1cb3	Merge "Compress in-memory compression-info" from Botond "Overly large metadata can hog memory which especially hurts in setups with a bad disk/memory ratio. To ease the pain compress the in-memory compression-info. The compression is implemented based on Avi's idea which is to group n offsets together into segments, where each segment stores a base absolute offset into the file, the other offsets in the segments being relative offsets (and thus of reduced size). Also offsets are allocated only just enough bits to store their maximum value. The offsets are thus packed in a buffer like so: arrrarrrarrr... where n is 4, a is an absolute offset and r are offsets relative to a. This of course means that stored offsets will not be aligned, not even on a byte boundary, but the size reduction is pretty convincing. In addition, segments are stored in buckets, where each bucket has its own base offset. In addition, segments in a bucket are optimized to address as large a chunk of the data as possible for a given chunk size." Ref #1946. * 'bdenes/compress-compression-v3' of https://github.com/denesb/scylla: Add unit test for compress::offsets Optimise the storage of compression chunk offsets Add script to precompute segmented compression parameters	2017-08-22 10:30:58 +03:00
Botond Dénes	62c18da35c	Add unit test for compress::offsets	2017-08-21 17:06:20 +03:00
Botond Dénes	028c7a0888	Optimise the storage of compression chunk offsets To reduce the memory footprint of compression-info, n offsets are grouped together into segments, where each segment stores a base absolute offset into the file, the other offsets in the segments being relative offsets (and thus of reduced size). Also offsets are allocated only just enough bits to store their maximum value. The offsets are thus packed in a buffer like so: arrrarrrarrr... where n is 4, a is an absolute offset and r are offsets relative to a. The optimal value of n can be calculated for a given file_size (f) and chunk_size (c), by finding the minima of the following function: f(n) = (f/c)/n * (log2(f) + (n - 1)log2((n-1)(c + 64))) This is done in an empirical way, using a script (see below). Furthermore segments are stored in buckets, where each bucket has its own base offset. Each bucket therefore can address an equal chunk of the file and furthermore each segment in a bucket can address an equal sub-chunk of this area. The value of a given offset i is thus: bucket_base_offset_for(i) + segment_base_offset_for(i) + offset(i) To account for the bucketed storage we calculate a local_f, which is optimized so that a bucketful of segmented offsets can address the largest possible chunk of f. As value of this local_f only depends on the bucket_size (b) and c the value of n can be made independent of f and therefore only depend on one dynamic value, c. This makes life much simpler as we don't need to know the size of the file up-front, we can just append buckets to the storage on demand, while the required storage is still less than a third [1] of the original storage requirements (std::deque<uint64>). The table with the minima(f(n)) for different f and c values is pre-computed by gen_segmented_compress_params.py and stored in sstables/segmented_compress_params.hh. This script also creates a table with the best values of local_f for the given bucket_size. 
At runtime we only select the best params based on c. [1] This was calculated for c=4K and b=4K	2017-08-21 17:06:12 +03:00
Avi Kivity	de011ece52	main: deprecate non-murmur3 partitioners more forcefully Some (most?) users don't read logs or release notes, so they won't notice that the ByteOrdered and Random partitioners were deprecated in 2.0. Make them notice by refusing to start with a deprecated partitioner, unless a switch is explicitly enabled. Message-Id: <20170820073424.8331-1-avi@scylladb.com>	2017-08-21 14:32:22 +02:00
Avi Kivity	9f415ef870	sstables: accurate summary entry size calculation Calculate the summary entry size correctly, so we don't end up with oversize summaries. Message-Id: <20170819184255.14181-2-avi@scylladb.com>	2017-08-21 14:28:57 +02:00
Avi Kivity	17c372bf0e	sstables: get rid of 64kB minimum index advance to generate summary Limiting summary entry generation to at most one summary entry per 64k of index data can lead to large index pages, with thousands of index entries per summary entry. These are slow to parse, and there is no real gain from the limit, since we already enforce a size limit on the summary. Remove the limit and allow summary entry generation based solely on spanned data size. Fixes #2711. Message-Id: <20170819184255.14181-1-avi@scylladb.com>	2017-08-21 14:26:44 +02:00
Avi Kivity	81a33df25d	dht: reduce split_range_to_single_shard contiguous memory demand split_range_to_single_shard() returns a vector of size 4096, with each element (a partition_range) of size 100. The total of 400k can cause defragmentation if memory is fragmented. Fix by using a deque. Fixes #2707. Message-Id: <20170819141017.28287-1-avi@scylladb.com>	2017-08-21 14:25:45 +02:00
Piotr Jastrzebski	c602ffd610	Make Scylla ttl expiration behave like in Cassandra Fixes #2497 [tgrabiec: reworked the title] Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2f5a99dce6ef11fe0ef135c9fa0592078fc9a056.1502886874.git.piotr@scylladb.com>	2017-08-21 14:25:45 +02:00
Botond Dénes	eae33a1f19	Add script to precompute segmented compression parameters The script generates sstables/segmented_compress_params.hh which contains a list with the optimal number of grouped offsets for different data and chunk sizes as well as a list with the best nominal data sizes for different chunk sizes, given a bucket size. Data sizes are in the range of [2^4, 2^50] and chunks in the range of [2^4, 2^30]. Data sizes that are not used with the current bucket_size are omitted. See next commit for details of how the calculated values are used.	2017-08-21 10:44:08 +03:00
Avi Kivity	5a2439e702	main: check for large allocations Large allocations can require cache evictions to be satisfied, and can therefore induce long latencies. Enable the seastar large allocation warning so we can hunt them down and fix them. Message-Id: <20170819135212.25230-1-avi@scylladb.com>	2017-08-21 10:25:40 +03:00
Pekka Enberg	318423d50b	Merge seastar upstream * seastar 2d16aca...e96881a (4): > memory: add detector for large allocations > memory: reduce large allocations for small pools > net: Fix potential NULL pointer dereference in udp.cc > Update dpdk submodule	2017-08-21 10:24:08 +03:00
Tomasz Grabiec	8f2ca52740	tests: Run test_query_only_static_row test case on all mutation sources The test checks behavior common to all mutation readers, so it's better to run it against all mutation sources rather than only for cache reader. Message-Id: <1503072333-17995-1-git-send-email-tgrabiec@scylladb.com>	2017-08-20 12:23:28 +03:00
Raphael S. Carvalho	10eaa2339e	compaction: Make resharding go through compaction manager Two reasons for this change: 1) every compaction should be multiplexed to manager which in turn will make decision when to schedule. improvements on it will immediately benefit every existing compaction type. 2) active tasks metric will now track ongoing reshard jobs. Fixes #2671. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>	2017-08-20 11:35:14 +03:00
Takuya ASADA	38b2ff617f	dist/redhat: follow the change on libgcc/libstdc++ package name Since we moved to external 3rdparty repository, we added '53' suffix on gcc packages, so follow the change. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170819092039.1090-2-syuu@scylladb.com>	2017-08-19 16:01:28 +03:00
Takuya ASADA	f1b5401d1f	dist/redhat: Change g++ command name on CentOS We have added '-5.3' suffix on g++ command from scylla-gcc53-c++-5.3.1-2.2, follow the change on scylla build script. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170819092039.1090-1-syuu@scylladb.com>	2017-08-19 16:01:27 +03:00
Avi Kivity	e428805ba5	Merge "Optimize query result partition and row counts" from Duarte "Now that range queries go through the normal digest path, we rely on query::result::calculate_counts() to count the amount of partitions and rows returned. This series optimizes it, in case it is needed, and also changes the result message to include the partition and row counts, avoiding the calculation altogether." * 'calculate-counts/v3' of github.com:duarten/scylla: query-result: Send row and partition count over the wire query::result: Optimize calculate_counts()	2017-08-17 13:41:21 +03:00
Alexys Jacob	e5ff8efea3	dist: Fix Gentoo Linux scylla-jmx and scylla-tools packages detection These two admin related packages will be packaged under the "app-admin" category and not the "dev-db" one. This fixes the detection path of the packages for scylla_setup. Signed-off-by: Alexys Jacob <ultrabug@gentoo.org> Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>	2017-08-17 13:20:43 +03:00
Nadav Har'El	7832d8a883	get rid of unused part in configure.py Scylla's configure.py contains stuff we copied from Seastar's configure.py, but is no longer used. Let's get rid of some of it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170813150842.12603-1-nyh@scylladb.com>	2017-08-17 12:05:44 +03:00
Duarte Nunes	1e7f0eab82	memtable: Created readers should be fast forwardable by default mutation_reader::forwarding defaults to yes. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170816180304.2121-1-duarte@scylladb.com>	2017-08-17 10:21:01 +03:00
Botond Dénes	e70cfc8f36	incremental_reader_selector: account for possibly disengaged lower bound In addition to the constructor (fixed previously) the check for no sstables on the first call to select() also has to be prepared for the lower bound of the range being disengaged. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4ab1296c71814fcd492996fa36fd00fd7bbbbc7f.1502949875.git.bdenes@scylladb.com>	2017-08-17 10:07:26 +03:00
Botond Dénes	af83b7f57b	incremental_reader_selector: use lazy_deref instead of ternary operator Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4f4b884c6a1f517bd654f3b27608d854b17a66e1.1502948635.git.bdenes@scylladb.com>	2017-08-17 08:45:46 +03:00
Botond Dénes	eb7eee510d	combined_mutation_reader_test: use the global const objects directly Instead of local ones. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <3ec1a70e4c0198c0563dff9688bbaa7fcfcace71.1502891190.git.bdenes@scylladb.com>	2017-08-16 16:56:42 +03:00
Paweł Dziepak	784dcbf1ca	sstables: initialise index metrics on all shards Fixes #2702. Message-Id: <20170816085454.21554-1-pdziepak@scylladb.com>	2017-08-16 15:44:26 +03:00
Avi Kivity	d7e3fbc6fe	Merge seastar upstream * seastar 2a43102...2d16aca (1): > fstream: do not ignore unresolved future Fixes #2697.	2017-08-16 15:09:59 +03:00
Botond Dénes	611774b1d9	Use the incremental reader for compaction As leveled compaction strategy stands to gain the most from incrementally opening sstables. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <292648d3fa4ea97376c0b4360754a20132194f63.1502822066.git.bdenes@scylladb.com>	2017-08-15 21:38:04 +03:00
Takuya ASADA	0f9b095867	dist/common/scripts: prevent ignoring a flag passed after another flag that requires a parameter When a user mistakenly forgets to pass the parameter for a flag, our scripts misparse the next flag as that parameter. For example, the correct usage is '--ntp-domain <domain> --setup-nic', but the user passed '--ntp-domain --setup-nic'. As a result, the next flag is ignored by the scripts. To prevent such behavior, reject any parameter that starts with '--'. Fixes #2609 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170815114751.6223-1-syuu@scylladb.com>	2017-08-15 18:27:32 +03:00
Duarte Nunes	c7aa3ea069	mutation_partition: Remove obsolete short read detection When compacting a partition for querying we would read an extra row, to include any tombstones between that one and the previous row. This is no longer needed since we have a general mechanism to detect short reads in the storage_proxy. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811103031.22866-1-duarte@scylladb.com>	2017-08-15 12:01:55 +01:00
Avi Kivity	8df6dd1fa0	database: make incremental_reader_selector robust vs. full-range partition_range incremental_reader_selector assumes the partition_range it receives has a lower bound, but it was seen in mutation_test that this is not so. Fix by checking whether the bound exists or not. Message-Id: <20170815095852.14149-1-avi@scylladb.com>	2017-08-15 11:03:22 +01:00
Avi Kivity	a35bfb3ea9	Merge seastar upstream * seastar 47b31f6...2a43102 (1): > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb Fixes #2690	2017-08-14 16:23:02 +03:00
Avi Kivity	e892a0082a	Merge "Drop exhausted mutation_readers when possible" from Duarte "Exhausted readers belonging to a combined_mutation_reader can be fast forwarded, so we have to keep them around. However, if the reader is not fast forwardable, then we can drop the contained readers and their buffers." * 'ff-reader/v2' of github.com:duarten/scylla: combined_mutation_reader: Drop exhausted readers if not in FF mode combined_mutation_reader: Remove superfluous mutation_readers list memtable_snapshot_source: Created readers should be fast forwardable	2017-08-14 16:20:38 +03:00
Duarte Nunes	7fb6a74302	combined_mutation_reader: Drop exhausted readers if not in FF mode Exhausted readers can be fast forwarded, so we have to keep them around. However, if the current reader is not fast forwardable, then we can drop those readers and their buffers. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 14:37:27 +02:00
Duarte Nunes	0b53f88a42	combined_mutation_reader: Remove superfluous mutation_readers list The _all_readers variable can do the same job. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 14:37:27 +02:00
Duarte Nunes	77477605c1	memtable_snapshot_source: Created readers should be fast forwardable As they're used by the cache tests. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 14:37:27 +02:00
Avi Kivity	afff29bdb9	Merge seastar upstream * seastar edb73ab...47b31f6 (1): > tls: Only recurse once in shutdown code Fixes #2691.	2017-08-14 15:09:42 +03:00
Duarte Nunes	a17cef76b2	query-result-writer: Remove unneeded field Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811102940.22747-1-duarte@scylladb.com>	2017-08-14 12:33:33 +01:00
Duarte Nunes	ec75eac37d	ring_position_exponential_vector_sharder: Take ranges by rvalue Avoids some copies. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170814093310.29200-1-duarte@scylladb.com>	2017-08-14 12:55:43 +03:00
Duarte Nunes	3b9a9b7321	query-result: Send row and partition count over the wire To avoid calculating them on the coordinator side. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 10:29:06 +02:00
Duarte Nunes	d7bab684ea	query::result: Optimize calculate_counts() Now that range queries go through the normal digest path, we rely on query::result::calculate_counts() to count the amount of partitions and rows returned. This patch makes it a bit faster. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 10:28:29 +02:00
Avi Kivity	cb2c5016ea	Merge seastar upstream * seastar 7a49ae5...edb73ab (11): > scripts: perftune.py: change the network module mode auto selection heuristic > net/tls: explicitly ignore ready future during shutdown > Use python2 explicitly as an interpreter for Python v2 scripts > peering_sharded_service: prevent over-run the container > Add link to documentation to the README.md > Add guidelines for contributing to Seastar > sharded: fix move constructor for peering_sharded_service services > Provide a convenient way to lazy-convert to string the values of pointers > tutorial: overhaul semaphores section > simple-stream: Make fragmented::write_substream return simple if possible > simple-stream: Make simple/fragmented memory output stream top level	2017-08-14 10:29:27 +03:00
Raphael S. Carvalho	050a7019b8	sstables/index_reader: fix index reader for summary entry spanning lots of keys quantity prevents index_reader from reading all index entries of a summary entry that spans more than min_index_interval entries. That can happen after the introduction of size-based sampling, and consequently, the sstable will not be able to return a key whose logical position in the summary entry is beyond min_index_interval. It's ok to not use quantity because index_reader will read all indexes until either the next summary entry or the end of file is reached. Fixes test_sstable_conforms_to_mutation_source Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>	2017-08-12 09:44:16 +03:00
Duarte Nunes	08e284a07e	combined_mutation_reader: Don't drop mutation readers This patch fixes a regression introduced in `a6b9186ca`. We should keep the readers around in case a subsequent call to fast_forward() will require them. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811160444.12795-1-duarte@scylladb.com>	2017-08-11 19:17:29 +03:00
Duarte Nunes	44b6da2e90	test.py: Add combined_mutation_reader_test Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811155017.9899-1-duarte@scylladb.com>	2017-08-11 18:54:11 +03:00
Avi Kivity	dbf8625ac9	Merge "size-based sampling for sstable summary" from Raphael "Fixes #1842." * 'size_based_sampling_v3' of github.com:raphaelsc/scylla: tests: test summary entry spanning more keys than min interval db/config: introduce sstable_summary_ratio option sstables: introduce size-based sampling for sstable summary sstables: make components_writer::offset const qualified and uint64_t sstables: make writer::offset const qualified and uint64_t	2017-08-11 18:41:45 +03:00
Duarte Nunes	e7d56884c0	list_reader_selector: Prevent infinite loop In case the readers are empty. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811153142.8926-1-duarte@scylladb.com>	2017-08-11 18:34:55 +03:00
Vladimir Krivopalov	003e8cf250	Use python2 explicitly as an interpreter for Python v2 scripts Signed-off-by: Vladimir Krivopalov <vladimir.krivopalov@gmail.com> Message-Id: <20170811032712.4362-1-vladimir.krivopalov@gmail.com>	2017-08-11 18:08:11 +03:00
Duarte Nunes	20337053ad	Don't use literal lambdas These are only available in C++17. Fixes the build after `b5460c2`. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-11 13:08:42 +02:00
Duarte Nunes	b5460c2990	Merge "Support `duration` type" from Jesse "This patch series adds support for the `duration` type in CQL, which was added to Cassandra in 3.10. As part of this work, it was necessary also to add support for the `vint` and `unsigned vint` types to the native protocol implementation, which are part of v5 of the specification. To test interactively, it is necessary to use cqlsh distributed with Cassandra, as the version we distribute does not yet support the duration type." * 'jhk/duration_protocol/v5' of https://github.com/hakuch/scylla: Support `duration` CQL native type CQL native protocol: Add support for `vint` serialization duration_test.cc: Add test for printing zero duration duration.cc: Remove nop `const` qualifier on return type Change `const` qualifier declaration order for `duration` duration.cc: Simplify range checking Rename `duration` to `cql_duration`	2017-08-11 10:56:55 +01:00
Duarte Nunes	bcf21aacc2	storage_proxy: Directly call query_nonsingular_mutations_locally Instead of duplicating the branch. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170811001559.25788-1-duarte@scylladb.com>	2017-08-11 09:06:01 +03:00
Duarte Nunes	a3ee99554b	service/storage_proxy: Remove out of date comment Now that we don't go directly to reconciliation for range queries, the result isn't required to have the row and partition counts calculated (we no longer transform a reconciled_result to a query::result). Furthermore, this line was causing a lot of dtests to fail on account of them not expecting an error line in the logs. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170810225351.12610-1-duarte@scylladb.com>	2017-08-11 09:04:23 +03:00
Raphael S. Carvalho	5124f94358	tests: test summary entry spanning more keys than min interval Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-11 01:37:06 -03:00
Raphael S. Carvalho	872412d31a	db/config: introduce sstable_summary_ratio option Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-11 01:36:21 -03:00
Raphael S. Carvalho	8726ee937d	sstables: introduce size-based sampling for sstable summary Currently, a summary entry is added after min_index_interval index entries were written. Not taking into account the size of index entries becomes a problem with large partitions, which may create big index entries due to promoted indexes. Read performance is affected as a consequence because index entries spanned by a summary entry are all read from disk to serve a request. What we want to do is to also add a summary entry after the index reaches a boundary. To deal with oversampling, we want to write 1 byte to summary for every 2000 bytes written to data file (this will be eventually made into an option in the config file). Both conditions must be met to avoid under or oversampling. That way, the amount of data needed from the index file to satisfy the request is drastically reduced. Fixes #1842. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-11 00:30:12 -03:00
Raphael S. Carvalho	da7489720b	sstables: make components_writer::offset const qualified and uint64_t Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-10 21:48:11 -03:00
Raphael S. Carvalho	881c479be8	sstables: make writer::offset const qualified and uint64_t Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-08-10 21:46:39 -03:00
Jesse Haber-Kucharsky	509626fe08	Support `duration` CQL native type `duration` is a new native type that was introduced in Cassandra 3.10 [1]. Support for parsing and the internal representation of the type was added in `8fa47b74e8`. Important note: The version of cqlsh distributed with Scylla does not have support for durations included (it was added to Cassandra in [2]). To test this change, you can use cqlsh distributed with Cassandra. Duration types are useful when working with time-series tables, because they can be used to manipulate date-time values in relative terms. Two interesting applications are: - Aggregation by time intervals [3]: `SELECT * FROM my_table GROUP BY floor(time, 3h)` - Querying on changes in date-times: `SELECT ... WHERE last_heartbeat_time < now() - 3h` (Note: neither of these is currently supported, though columns with duration values are.) Internally, durations are represented as three signed counters: one for months, for days, and for nanoseconds. Each of these counters is serialized using a variable-length encoding which is described in version 5 of the CQL native protocol specification. The representation of a duration as three counters means that a semantic ordering on durations doesn't exist: Is `1mo` greater than `1mo1d`? We cannot know, because some months have more days than others. Durations can only have a concrete absolute value when they are "attached" to absolute date-time references. For example, `2015-04-31 at 12:00:00 + 1mo`. That duration values are not comparable presents some difficulties for the implementation, because most CQL types are. Like in Cassandra's implementation [2], I adopted a similar strategy to the way restrictions on the `counter` type are checked. A type "references" a duration if it is either a duration or it contains a duration (like a `tuple<..., duration, ...>`, or a UDT with a duration member). The following restrictions apply on durations. 
Note that some of these contexts are either experimental features (materialized views), or not currently supported at run-time (though support exists in the parser and code, so it is prudent to add the restrictions now): - Durations cannot appear in any part of a primary key, either for tables or materialized views. - Durations cannot be directly used as the element type of a `set`, nor can they be used as the key type of a `map`. Because internal ordering on durations is based on a byte-level comparison, this property of Cassandra was intended to help avoid user confusion around ordering of collection elements. - Secondary indexes on durations are not supported. - "Slice" relations (<=, <, >=, >) are not supported on durations with `WHERE` restrictions (like `SELECT ... WHERE span <= 3d`). Multi-column restrictions only work with clustering columns, which cannot be `duration` due to the first rule. - "Slice" relations are not supported on durations with query conditions (like `UPDATE my_table ... IF span > 5us`). Backwards incompatibility note: As described in the documentation [4], duration literals take one of two forms: either ISO 8601 formats (there are three), or a "standard" format. The ISO 8601 formats start with "P" (like "P5W"). Therefore, identifiers that have this form are no longer supported. Fixes #2240. [1] https://issues.apache.org/jira/browse/CASSANDRA-11873 [2] `bfd57d13b7` [3] https://issues.apache.org/jira/browse/CASSANDRA-11871 [4] http://cassandra.apache.org/doc/latest/cql/types.html#working-with-durations	2017-08-10 15:01:10 -04:00
Jesse Haber-Kucharsky	91dab1d998	CQL native protocol: Add support for `vint` serialization Version 5 of the native protocol for CQL [1] adds the `vint` and `unsigned vint` types. An unsigned integer encoded as a `vint` has a variable size based on the magnitude of the value. The first byte indicates the total number of bytes. For signed integers, a "zig-zag" encoding scheme ensures that small negative values are encoded as short-length `vint`s (0 -> 0, -1 -> 1, 1 -> 2, 2 -> 3, -2 -> 4, etc). [1] https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec	2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky	77489f843f	duration_test.cc: Add test for printing zero duration It's somewhat counter-intuitive, but Cassandra also formats zero-valued duration values as an empty string.	2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky	d9c027c2dd	duration.cc: Remove nop `const` qualifier on return type These have no effect according to the Clang static analyzer.	2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky	54c3cf0201	Change `const` qualifier declaration order for `duration` The vast majority of the code-base is written in left-`const` style, and consistency is important.	2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky	1889b036b1	duration.cc: Simplify range checking	2017-08-10 14:11:23 -04:00
Avi Kivity	301358e440	Merge "Optimize combined_mutation_reader for disjoint sstable ranges" from Botond "sstables will sometimes have narrow/disjoint ranges (e.g. LCS L1+). This can be exploited when reading from a range of sstables by opening sstables on-demand thus saving memory, processing and potentially I/O. To achieve this combined_mutation_reader is refactored such that the reader selection logic is moved out into a reader_selector class. combined_mutation_reader now takes a reader_selector instance in its constructor and asks it for new readers for the current ring position on every call to operator()(). At the moment two specializations of reader_selector are provided: * list_reader_selector which implements the current logic, that is using a provided mutation_reader list, and * incremental_reader_selector which implements the on-demand opening logic discussed above. Fixes #1935" * 'bdenes/optimize_combined_reader-v6' of https://github.com/denesb/scylla: Add combined_mutation_reader_test unit test Remove range_sstable_reader Add incremental_reader_selector Add reader_selector to combined_mutation_reader sstable_set::incremental_selector: select() now returns a selection	2017-08-10 15:16:30 +03:00
Botond Dénes	9ee9988097	Add combined_mutation_reader_test unit test	2017-08-10 12:38:10 +03:00
Botond Dénes	3e97a5cd6b	Remove range_sstable_reader range_sstable_reader is replaced with combined_mutation_reader, using the incremental_reader_selector.	2017-08-10 12:38:10 +03:00
Botond Dénes	bfc74f1312	Add incremental_reader_selector incremental_reader_selector is a specialization of reader_selector for the case when sstables have narrow and/or disjoint token ranges. To exploit this it creates new readers on-demand when their sstable's token range intersects with the current ring position.	2017-08-10 12:38:02 +03:00
Botond Dénes	a6b9186cab	Add reader_selector to combined_mutation_reader combined_mutation_reader now accepts as a constructor argument a reader_selector instance whose task is to create new readers on each call to operator()() if needed and possible. This way it is possible to control how readers are created through different specializations of reader_selector. The previous logic is refactored into list_reader_selector which is using a pre-provided mutation_reader list and forwards all of them to combined_mutation_reader at once.	2017-08-10 12:37:40 +03:00
Takuya ASADA	1cb0fff146	dist/common/scripts/scylla_raid_setup: handle '--disks' parameter correctly when the disk list ends with ',' We should handle parameters correctly even if they are malformed. Fixes #2402 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1499266239-27551-1-git-send-email-syuu@scylladb.com>	2017-08-10 11:42:33 +03:00
Takuya ASADA	8e115d69a9	dist/debian: append postfix '~DISTRIBUTION' to scylla package version We are moving to aptly to release .deb packages, which requires changes to the Debian repository structure. After the change, we will share the 'pool' directory between distributions. However, our .deb package name for a specific release is exactly the same between distributions, so we have a file name conflict. To avoid the problem, we need to append the distribution name to the package version. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>	2017-08-10 10:53:56 +03:00
Vlad Zolotarov	1b4594b03a	transport::server::process_prepare() don't ignore errors on other shards If storing of the statement fails on any shard we should fail the whole PREPARE request. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1502325392-31169-13-git-send-email-vladz@scylladb.com>	2017-08-10 10:32:37 +03:00
Jesse Haber-Kucharsky	352e9f60ba	Rename `duration` to `cql_duration` `std::chrono::duration` is a prolific enough name that it's best to disambiguate.	2017-08-09 15:15:20 -04:00
Botond Dénes	94fc550e68	sstable_set::incremental_selector: select() now returns a selection A selection contains - in addition to the list of sstables - a next_token which is a hint as to what is the next best token to call select() with. This should be the smallest token such that at the next call to select() the least number of new sstables will be returned, without skipping any.	2017-08-09 16:27:33 +03:00
Takuya ASADA	3077416ecc	dist/debian: Backport scalability fix of _Unwind_Find_FDE to our gcc for Debian 8 Since we provide a custom-built gcc only for Debian 8, the fix does not apply to Ubuntu/Debian 9. Fixes #2646 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1502239191-12649-1-git-send-email-syuu@scylladb.com>	2017-08-09 12:19:52 +03:00
Avi Kivity	7217b7ab36	Merge "Use range_streamer everywhere" from Asias "With this series, all the following cluster operations: - bootstrap - rebuild - decommission - removenode will use the same code to do the streaming. The range_streamer is now extended to support both fetch from and push to peer node. Another big change is now the range_streamer will stream less ranges at a time, so less data, per stream_plan and range_streamer will remember which ranges are failed to stream and can retry later. The retry policy is very simple at the moment it retries at most 5 times and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes .... Later, we can introduce api for user to decide when to stop retrying and the retry interval. The benefits: - All the cluster operation shares the same code to stream - We can know the operation progress, e.g., we can know total number of ranges need to be streamed and number of ranges finished in bootstrap, decommission and etc. - All the cluster operation can survive peer node down during the operation which usually takes long time to complete, e.g., when adding a new node, currently if any of the existing node which streams data to the new node had issue sending data to the new node, the whole bootstrap process will fail. After this patch, we can fix the problematic node and restart it, the joining node will retry streaming from the node again. - We can fail streaming early and timeout early and retry less because all the operations use stream can survive failure of a single stream_plan. It is not that important for now to have to make a single stream_plan successful. Note, another user of streaming, repair, is now using small stream_plan as well and can rerun the repair for the failed ranges too. This is one step closer to supporting the resumable add/remove node operations."
* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev: storage_service: Use the new range_streamer interface for removenode storage_service: Use the new range_streamer interface for decommission storage_service: Use the new range_streamer interface for rebuild storage_service: Use the new range_streamer interface for bootstrap dht: Extend range_streamer interface	2017-08-09 10:00:25 +03:00
Takuya ASADA	98fc7b376d	dist/redhat: install mdadm/xfsprogs on package install time We found that 'Constructing RAID volume...' takes too much time on some AMIs, because the setup script gets stuck at 'yum -y install mdadm xfsprogs'. We don't have to install these packages at AMI startup time; we should preinstall them at AMI creation time. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1502192796-21040-1-git-send-email-syuu@scylladb.com>	2017-08-09 09:10:34 +03:00
Piotr Jastrzebski	4137517cdc	Check arguments of table_helper::setup_keyspace to make sure all table helpers passed as arguments are for the right keyspace. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <10edacd509880bb18180f13e8c28593d068c5c7b.1501688729.git.piotr@scylladb.com>	2017-08-08 15:55:06 +03:00
Piotr Jastrzebski	2d8a80f211	Make table_helper constructor safer by taking keyspace name by value and storing it inside the object. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <a5dab41647348ae311e023fe5592aec650c6e32a.1501688729.git.piotr@scylladb.com>	2017-08-08 15:55:06 +03:00
Daniel Fiala	06089474c9	Print warning if user uses default cluster_name * Configuration for cluster_name is commented-out in config file. * Default value set to empty string and if not rewritten by user then warning is printed and value is reset to "ScyllaDB Cluster". Fixes #2648. Message-Id: <20170808113322.9313-1-daniel@scylladb.com>	2017-08-08 14:47:17 +03:00
Avi Kivity	a71138fc84	config: mark column_index_size_in_kb as Used Fixes #2681 Message-Id: <20170808100415.16296-1-avi@scylladb.com>	2017-08-08 11:08:00 +01:00
Ultrabug	2022da2405	Add overall python code QA and guidelines with flake8 ScyllaDB loves python & python loves ScyllaDB. It would benefit the project to start enforcing some code guidelines and basic QA with a linter along a PEP8 respect thanks to flake8. This patch adds a tox config to at least start with an assessment of the work to be done on all .py files in the code base. To reduce its noise, tests on long lines (> 80char) are ignored for now. Signed-off-by: Ultrabug <ultrabug@gentoo.org> Message-Id: <20170726134242.8927-1-ultrabug@gentoo.org>	2017-08-08 11:15:45 +03:00
Raphael S. Carvalho	dddbd34b52	sstables: close index file when sstable writer fails index's file output stream uses write behind but it's not closed when sstable write fails and that may lead to crash. It happened before for data file (which is obviously easier to reproduce for it) and was fixed by `0977f4fdf8`. Fixes #2673. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>	2017-08-08 09:53:14 +03:00
Asias He	49360992d9	storage_service: Use the new range_streamer interface for removenode So that removenode operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:48 +08:00
Asias He	6b8dc85f12	storage_service: Use the new range_streamer interface for decommission So that decommission operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:48 +08:00
Asias He	24584b8509	storage_service: Use the new range_streamer interface for rebuild So that rebuild operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:47 +08:00
Asias He	f239b11a84	storage_service: Use the new range_streamer interface for bootstrap So that bootstrap operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:47 +08:00
Asias He	6810031ba7	dht: Extend range_streamer interface After this patch and the following patches to use the new range_streamer interface, all the following cluster operations: - bootstrap - rebuild - decommission - removenode will use the same code to do the streaming. The range_streamer is now extended to support both fetch from and push to peer node. Another big change is now the range_streamer will stream less ranges at a time, so less data, per stream_plan and range_streamer will remember which ranges are failed to stream and can retry later. The retry policy is very simple at the moment it retries at most 5 times and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes .... Later, we can introduce api for user to decide when to stop retrying and the retry interval. The benefits: - All the cluster operation shares the same code to stream - We can know the operation progress, e.g., we can know total number of ranges need to be streamed and number of ranges finished in bootstrap, decommission and etc. - All the cluster operation can survive peer node down during the operation which usually takes long time to complete, e.g., when adding a new node, currently if any of the existing node which streams data to the new node had issue sending data to the new node, the whole bootstrap process will fail. After this patch, we can fix the problematic node and restart it, the joining node will retry streaming from the node again. - We can fail streaming early and timeout early and retry less because all the operations use stream can survive failure of a single stream_plan. It is not that important for now to have to make a single stream_plan successful. Note, another user of streaming, repair, is now using small stream_plan as well and can rerun the repair for the failed ranges too. This is one step closer to supporting the resumable add/remove node operations.	2017-08-07 16:31:47 +08:00
Avi Kivity	86de6cc7fb	Merge seastar upstream * seastar f14d2a3...7a49ae5 (8): > sharded: improve support for cooperating sharded<> services > sharded: support for peer services > semaphore: add a version of with_semaphore that takes a duration timeout > scripts: perftune.py: fix the CPU mask generation for more than 64 CPUs > Revert "future-utils: make when_all() (vector variant) exception safe" > Revert "future-utils: fix gross compilation errors in when_all()" > future-utils: fix gross compilation errors in when_all() > future-utils: make when_all() (vector variant) exception safe Includes change to batchlog_manager constructor to adapt it to seastar::sharded::start() change.	2017-08-06 17:47:47 +03:00
Avi Kivity	3edec66903	Revert "repair: Make send_repair_checksum_range timeout" This reverts commit `98757069a5`. We have the failure detector which will detect an unresponsive node and fail the RPC. Adding a timeout can just introduce false positives.	2017-08-06 13:09:36 +03:00
Avi Kivity	621926d914	dist: debian: escape "$" character for make	2017-08-05 16:51:03 +03:00
Avi Kivity	a471851bf1	dist: debian: add /opt/scylladb/bin to PATH so antlr can be found	2017-08-05 15:46:58 +03:00
Avi Kivity	8bdc0dd471	dist: debian: search for libraries in /opt/scylladb/lib	2017-08-05 13:18:14 +03:00
Takuya ASADA	2ff3bdba5c	dist/debian: switch Ubuntu 3rdparty packages to external build service Switch Ubuntu to launchpad ppa: https://launchpad.net/~scylladb/+archive/ubuntu/ppa/+packages Since switching 3rdparty on Debian is not ready yet, keep them to use scylla 3rdparty repo, also keep --rebuild-dep option and dist/debian/dep. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1501866678-4922-1-git-send-email-syuu@scylladb.com>	2017-08-05 11:29:13 +03:00
Glauber Costa	4a911879a3	add active streaming reads metric In commit `f38e4ff3f`, we have separated streaming reads from normal reads for the purpose of determining the maximum number of reads going on. However, we'll now be totally unaware of how many reads will be happening on behalf of streaming and that can be important information when debugging issues. This patch adds this metric so we don't fly blind. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>	2017-08-05 11:06:37 +03:00
Duarte Nunes	587b6be089	dirty_memory_manager: Add missing include Allows tests/memory_footprint to build on Ubuntu 14.04. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-04 10:15:23 +02:00
Avi Kivity	4f12068e50	dist: re-add --rebuild-dep to build_rpm.sh For compatibility with existing scripts; ignored.	2017-08-04 07:10:18 +03:00
Takuya ASADA	b5e83ebd94	dist/redhat: switch 3rdparty packages to external build service Drop existing 3rdparty build script/3rdparty repo, switch to Fedora Copr https://copr.fedorainfracloud.org/coprs/scylladb/scylla-3rdparty/packages/ Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170803110754.22152-1-syuu@scylladb.com>	2017-08-04 06:40:09 +03:00
Pekka Enberg	90872ffa1f	docker: Disable stall detector Fixes #2162 Message-Id: <1501759957-4380-1-git-send-email-penberg@scylladb.com>	2017-08-03 14:52:49 +03:00
Takuya ASADA	91ade1a660	dist/debian: check scylla user/group existence before adding them To prevent the install from failing on an environment which already has the scylla user/group, an existence check is needed. Fixes #2389 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1495023805-14905-1-git-send-email-syuu@scylladb.com>	2017-08-03 13:01:18 +03:00
Takuya ASADA	6ac254fbcb	dist: change nomerges=1 on block devices during fstrim execution We have a problem running fstrim with nomerges=2, so we need to change the parameter to 1 during fstrim execution. To do this, this fix changes the following: - revert dropping scylla_fstrim on Ubuntu 16.04/CentOS - disable distribution provided fstrim script - enable scylla_fstrim on all distributions - introduce --set-nomerges on scylla-blocktune - scylla_fstrim calls scylla-blocktune in the following order: - 'scylla-blocktune --set-nomerges 1' - 'fstrim' for each device - 'scylla-blocktune --set-nomerges 2' Fixes #2649 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1501531393-21109-1-git-send-email-syuu@scylladb.com>	2017-08-03 13:00:34 +03:00
Botond Dénes	0b7ac01f0f	Add QtCreator project file and .gdbinit to .gitignore Message-Id: <ff662910fe1156cdde2bda4aa5bb9cfc45bddda9.1501752340.git.bdenes@scylladb.com>	2017-08-03 12:58:35 +03:00
Avi Kivity	f38e4ff3f9	database: prevent streaming reads from blocking normal reads Streaming reads and normal reads share a semaphore, so if a bunch of streaming reads use all available slots, no normal reads can proceed. Fix by assigning streaming reads their own semaphore; they will compete with normal reads once issued, and the I/O scheduler will determine the winner. Fixes #2663. Message-Id: <20170802153107.939-1-avi@scylladb.com>	2017-08-03 10:23:01 +01:00
Avi Kivity	911536960a	database: remove streaming read queue length limit If we fail a streaming read due to queue overload, we will fail the entire repair. Remove the limit for streaming, and trust the caller (repair) to have bounded concurrency. Fixes #2659. Message-Id: <20170802143448.28311-1-avi@scylladb.com>	2017-08-03 10:21:07 +01:00
Avi Kivity	e9519ca8e5	Merge "make range selects more efficient by going through digest matching stage" from Gleb "Currently scanning reads go to reconciliation stage directly which requires asking for mutation data from all peers. This series makes it to try matching digests first like a single partition read." Fixes #2666. * 'gleb/digest_scan' of github.com:cloudius-systems/seastar-dev: storage_proxy: make range_slice_read_executor go through digest matching state storage_proxy: add capability to read data/digest for non singular ranges storage_proxy: remove redundant parameter from never_speculating_read_executor constructor	2017-08-03 12:18:11 +03:00
Tzach Livyatan	d3d46a5eac	Add comments on cluster_name in scylla.yaml Fix #2316 Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170730082922.21884-1-tzach@scylladb.com>	2017-08-03 12:12:15 +03:00
Gleb Natapov	d2a2a6d471	storage_proxy: make range_slice_read_executor go through digest matching state Currently scanning reads go to reconciliation stage directly which requires asking for mutation data from all peers. This patch makes it to try matching digests first like a single partition read. The change requires internode protocol changes since currently it is not possible to ask for multi partition data/digest over RPC. It means that the capability has to be guarded by new gossip feature flag which the patch also adds.	2017-08-03 11:37:03 +03:00
Tzach Livyatan	99b2232c5d	docs/docker: Add hostname parameter to examples Using --hostname to give the container a meaningful name is a good practice, and makes the monitoring dashboard easier to understand Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170803081027.6675-1-tzach@scylladb.com>	2017-08-03 11:14:12 +03:00
Gleb Natapov	3b7d8c8767	storage_proxy: add capability to read data/digest for non singular ranges Currently only mutation_data read supports non singular ranges. This patch extends data/digest reads to support them too.	2017-08-03 10:35:09 +03:00
Gleb Natapov	c619ef258b	storage_proxy: remove redundant parameter from never_speculating_read_executor constructor never_speculating_read_executor always waits for all targets so block_for parameter is always equal to targets.size(). No need to pass it explicitly.	2017-08-03 10:08:44 +03:00
Duarte Nunes	4c9206ba2f	tests/sstable_mutation_test: Don't use moved-from object Fix a bug introduced in `dbbb9e93d` and exposed by gcc6 by not using a moved-from object. Twice. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170802161033.4213-1-duarte@scylladb.com>	2017-08-03 09:45:49 +03:00
Asias He	763fa83232	repair: Fix build in repair_cf_range The compiler does not like the mutable. Message-Id: <83c5e8a944b72a095b8e29e9988986e6ca9cefc5.1501690749.git.asias@scylladb.com>	2017-08-02 18:57:32 +02:00
Asias He	5798625d73	repair: Singal parallelism_semaphore in case of error If we throw after we take the semaphore and before the when_all below runs, no one will increase the semaphore. Fixes #2661 Message-Id: <49540ede4c8a6d84004e10e0f63690e3c21d72c7.1501686383.git.asias@scylladb.com>	2017-08-02 18:32:32 +03:00
Avi Kivity	ebff739a84	Merge "use paging for compaction history" from Amnon "This series adds an option to use paging in internal query and use that for the get compaction history function. Internal paging will be done explicitly, to use paging, you first create a state object (that contains the query as well) and use that state to get the first page, the result will contain both the query result and a new state that can be used to get the next page. Fixes #2366" * 'amnon/paged_compaction_history_v5' of github.com:cloudius-systems/seastar-dev: system_keyspace: Use paging for get compaction history Add paging for internal queries query_options: Allows creating query_options from query_options	2017-08-02 18:15:58 +03:00
Avi Kivity	ac31abf6a4	repair: don't lambda-capture repair_tracker It is static, so it need not be captured, and some compilers complain.	2017-08-02 18:07:31 +03:00
Avi Kivity	ce60ef59f3	Revert "repair: Singal parallelism_semaphore in case of error" This reverts commit `a548eee28c`. It releases the semaphore too early (noted by Glauber).	2017-08-02 17:13:46 +03:00
Avi Kivity	b2753b0183	Merge "Fix possible repair stuck" from Asias "This series tries to fix possible repair stuck." Fixes #2660, #2661, #2662. * tag 'asias/repair_stuck_v2.1' of github.com:cloudius-systems/seastar-dev: repair: Make send_repair_checksum_range timeout repair: Singal parallelism_semaphore in case of error repair: Fix repair_tracker done	2017-08-02 16:51:51 +03:00
Asias He	98757069a5	repair: Make send_repair_checksum_range timeout If the verb never returns the repair will hang forever. Make it use the timeout version of the send_message. Fixes #2662	2017-08-02 21:41:50 +08:00
Asias He	a548eee28c	repair: Singal parallelism_semaphore in case of error If we throw after we take the semaphore and before the when_all below runs, no one will increase the semaphore. Fixes #2661	2017-08-02 21:41:45 +08:00
Asias He	abcff4c78e	repair: Fix repair_tracker done If it throws after repair_tracker.start and before the when_all below, the repair_tracker.done will never be called for this repair id. Fixes #2660	2017-08-02 21:40:29 +08:00
Pekka Enberg	78f68613ce	dist/docker: Reduce number of layers One of the best practices for Dockerfiles is to minimize the number of layers because they increase the overall image size: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#minimize-the-number-of-layers Consolidate our "yum install" commands to reduce the number of layers. Suggested by Dean Hamstead. Message-Id: <1501670572-8701-1-git-send-email-penberg@scylladb.com>	2017-08-02 15:21:05 +03:00
Takuya ASADA	ffbdacc1fa	dist/debian: remove ant from prerequisite packages These lines were mistakenly copied from scylla-tools and are not needed for scylla-server. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1498029619-1928-1-git-send-email-syuu@scylladb.com>	2017-08-02 12:12:42 +03:00
Duarte Nunes	cec41f9de6	Merge seastar upstream * seastar fc937b8...f14d2a3 (4): > configure.py: Ensure tmp directory exists when getting dpdk cflags > checked_ptr: fix hash() compilation > net: fix potential use after free in posix_server_socket::accept() > http: removed unneeded lamda captures Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-02 10:05:08 +02:00
Asias He	cf6f4a5185	gossip: Introduce the shadow_round_ms option It specifies the maximum gossip shadow round time. It can be used to reduce the gossip feature check time during node boot up. For instance, when the first node in the cluster, which listed both itself and other node as seed in the yaml config, boots up, it will try to talk to other seed nodes which are not started yet. The gossip shadow round will be used to fetch the feature info of the cluster. Since there is no other seed node in the cluster, the shadow round will fail. User can reduce the default shadow_round_ms option to reduce the boot time. Fixes #2615 Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>	2017-08-02 09:52:35 +03:00
Vlad Zolotarov	4b28ea216d	utils::loading_cache: cancel the timer after closing the gate The timer is armed inside the section guarded by the _timer_reads_gate therefore it has to be canceled after the gate is closed. Otherwise we may end up with the armed timer after stop() method has returned a ready future. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>	2017-08-01 17:21:44 +01:00
Duarte Nunes	569bbf2edd	sstables/sstables: Use per-cpu noop_write_monitor We employ a thread-per-core architecture, so don't go about sharing seastar::shared_ptrs across cpus. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170801144153.17354-1-duarte@scylladb.com>	2017-08-01 18:10:49 +03:00
Avi Kivity	db7329b1cb	Merge "Ensure correct EOC for PI block cell names" from Duarte "This series ensures that we always write correct cell names to promoted index cell blocks, taking into account the eoc of range tombstones. Fixes #2333" * 'pi-cell-name/v1' of github.com:duarten/scylla: tests/sstable_mutation_test: Test promoted index blocks are monotonic sstables: Consider eoc when flushing pi block sstables: Extract out converting bound_kind to eoc	2017-08-01 18:09:07 +03:00
Gleb Natapov	1da4d5c5ee	cql transport: run accept loop in the foreground It was meant to be run in the foreground since it is waited upon during stop(), but as it is now from the stop() perspective it is completed after first connection is accepted. Fixes #2652 Message-Id: <20170801125558.GS20001@scylladb.com>	2017-08-01 17:04:14 +03:00
Avi Kivity	1e8bb972b6	compaction: fix iteration in leveled compaction droppable tombstones loop Since get_level_count() is unsigned, it will never be negative, and the loop may never terminate. Message-Id: <20170719133502.13316-1-avi@scylladb.com>	2017-08-01 13:40:36 +03:00
Avi Kivity	ba2e170e4b	compaction: fix return in leveled compaction droppable tombstones loop If the loop ever terminates, we need to return something. Message-Id: <20170719133508.13374-1-avi@scylladb.com>	2017-08-01 13:33:02 +03:00
Takuya ASADA	a998b7b3eb	dist/ami: follow scylla-tools package name change on RedHat variants Since scylla-tools generates two .rpm packages, we need to copy them to our AMI. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170722090002.9850-1-syuu@scylladb.com>	2017-07-31 18:57:12 +03:00
Avi Kivity	7c8dea088a	Merge seastar upstream * seastar 54e940f...fc937b8 (2): > configure.py: Always ensure tmp directory exists > coding-style.md: introduce	2017-07-31 18:06:09 +03:00
Duarte Nunes	a85232dd82	Fix compilation errors on GCC 6 GCC 6 inconsistently requires explicitly calling a member function through "this->" for lambda functions capturing "this". Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170731143755.21970-1-duarte@scylladb.com>	2017-07-31 17:40:44 +03:00
Benoît Canet	b44ba11e4c	transport: Count the number of unpaged queries Queries with a query page size equal to or smaller than zero are unpaged queries. Count these kinds of queries and expose the count as a metric, since they can ruin the performance of the system. Message-Id: <20170731130004.25807-2-benoit@scylladb.com>	2017-07-31 16:01:45 +03:00
Avi Kivity	3fe6731436	Merge "Reduce the effect of the latency metrics" from Amnon "This series reduces that effect in two ways: 1. Remove the latency counters from the system keyspaces 2. Reduce the histogram size by limiting the maximum number of buckets and stop the last bucket." Fixes #2650. * 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev: database: remove latency from the system table estimated histogram: return a smaller histogram	2017-07-31 15:58:30 +03:00
Paweł Dziepak	402799fcc0	mutation_reader: drop move_and_clear() Since the discovery of std::exchange(x, {}) move_and_clear has become obsolete. Besides, the name was wrong, it did not clear the vector but recreated it meaning that any allocated memory wasn't reused (not that it mattered in the existing usages). Message-Id: <20170731123549.10887-1-pdziepak@scylladb.com>	2017-07-31 15:51:19 +03:00
Gleb Natapov	87bc3f7e7f	configure.py: use user provided compiler flags when checking for features User provided compiler flags may change the outcome of the test. Message-Id: <20170724111520.GA18230@scylladb.com>	2017-07-31 15:33:06 +03:00
Avi Kivity	f4b2a1ef4e	Merge "Optimise combined_mutation_reader" from Paweł "These patches optimise combined_mutation_reader for cases where the majority of mutation_readers is disjoint. perf_fast_forward: Results are medians of 3 of fragments/s as reported by perf_fast_forward. Command: perf_fast_forward -c1 --enable-cache small: small-partition-skips (read=1, skip=0) large: large-partition-skips (read=1, skip=0) before after diff small 195753 238196 +22% large 1244325 1359096 +9% perf_simple_query: Results are medians of 10 of reads/s as reported by perf_simple_query. Command: perf_simple_query -c1 before 98651.40 after 104554.85 diff +6%" * tag 'avoid-merge_mutations/v1' of https://github.com/pdziepak/scylla: combined_mutation_reader: avoid unnecessary merge_mutations() combined_mutation_reader: do not pop mutation with different key	2017-07-31 15:14:42 +03:00
Avi Kivity	178b54e790	Merge "memtable flush: Fixes and improvements" from Duarte "This series ensures that when we retry a memtable flush, we re-acquire the flush permit that was previously released. It also ensures we don't hold the sstable read lock for the duration of the sleep leading to the retry. To achieve that cleanly we refactor the way the permit lifecycle is managed by employing a RAII-based approach. We also improve the latency of writes blocked on virtual dirty by releasing the flush permit before fsyncing the sstables. There are additional avenues for performance improvements on top of this one." * 'memtable-flush-additional-fixes/v4' of github.com:duarten/scylla: column_family: Re-acquire flush permit in case of error column_family: Don't hold sstable read lock when retrying flush sstables: Release the flush permit before fsyncing sstables: Introduce write_monitor database: Extract out dirty_memory_manager dirty_memory_manager: Refactor flush permit lifetime management dirty_memory_manager: Invert permit acquisition order memtable_list: Register different seal functions for each behaviour	2017-07-31 14:57:19 +03:00
Paweł Dziepak	2b53a560c8	combined_mutation_reader: avoid unnecessary merge_mutations() Merging mutations is quite an expensive operation. The creation of streamed mutation merger involves several allocations (mostly coming from various std::vector) and then all mutation_fragments need to go through a heap. All this is completely unnecessary if there is only one mutation, so let's skip a call to merge_mutations() in such cases. This also means that we can reuse memory allocated by _current vector if merge is not required.	2017-07-31 12:35:40 +01:00
Paweł Dziepak	f78f2b3c92	combined_mutation_reader: do not pop mutation with different key Originally, the loop inside combined_mutation_reader::next() was written so that it popped mutations from the heap, and when it encountered one with a different decorated key, it pushed that one back and the ones accumulated so far were merged and emitted. In other words, every time the reader progressed to the next mutation it did needless pop and push operations on the heap. This patch rearranges the code so that the key of the next mutation is compared before it is popped from the heap.	2017-07-31 12:35:40 +01:00
Duarte Nunes	c81431ad16	column_family: Re-acquire flush permit in case of error If we fail to flush an sstable, after creating the flush_reader, then we will have released the flush permit when we retry the flush. Ensure that when retrying, we re-acquire the flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	9162e016da	column_family: Don't hold sstable read lock when retrying flush Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	1a33cc6847	sstables: Release the flush permit before fsyncing This allows a queued flush to start while we fsync the current sstable, which helps reduce the overall time new writes are blocked on dirty memory. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	784a078e72	sstables: Introduce write_monitor The write_monitor provides callbacks to inform an observer of the state of the ongoing sstable write. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	d2b0a5a0a6	database: Extract out dirty_memory_manager Needed so the flush_permit can be propagated to the sstables layer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	a2b732c156	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	f647f5b14a	dirty_memory_manager: Invert permit acquisition order For an upcoming fix it is required to invert the permit acquisition order: first we acquire the background work permit and then the single flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Duarte Nunes	e371accac8	memtable_list: Register different seal functions for each behaviour Instead of passing a flush_behaviour to the seal function, use two different functions for each of the behaviours. This will be important in the forthcoming patches, which will require the signatures of those functions to differ. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Paweł Dziepak	e970630272	tests/serialized_action: add missing forced defers serialized_action_tests depends on the fact that the first part of the serialized_action is executed at certain points (in which it reads a global variable that is later updated by the main thread). This worked well in the release mode before ready continuations were inlined and run immediately, but not in the debug mode since inlining was not happening and the main seastar::thread was missing some yield points. Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>	2017-07-31 11:35:24 +01:00
Duarte Nunes	4e3232fc29	utils/log_histogram: Fix typo when calculating number of buckets We weren't correctly calculating the number of buckets due to returning the wrong variable. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170731094733.7746-1-duarte@scylladb.com>	2017-07-31 12:49:11 +03:00
Avi Kivity	e855a28fae	Revert "Merge "memtable flush: Fixes and improvements" from Duarte" This reverts commit `733a64a1df`, reversing changes made to `e11e66723a`. Breaks sstable_test and perf_fast_forward.	2017-07-31 12:44:28 +03:00
Avi Kivity	85056f3611	log_histogram: fix constexpr-ness of log_histogram_options 1. assert() is not constexpr. 2. can't use static_assert(), because the constructor may be called in a non-constexpr environment; moved to log_histogram 3. pow2_rank() uses count_leading_zeros() which is not constexpr; split into constexpr and non-constexpr versions 4. duplicated number_of_buckets() because bucket_of() can't be constexpr due to pow2_rank Message-Id: <20170726105444.32698-1-avi@scylladb.com>	2017-07-31 09:11:40 +01:00
Avi Kivity	733a64a1df	Merge "memtable flush: Fixes and improvements" from Duarte "This series ensure that when we retry a memtable flush, we re-acquire the flush permit that was previously released. It also ensures we don't hold the sstable read lock for the duration of the sleep leading to the retry. To achieve that cleanly we refactor the way the permit lifecycle is managed by employing a RAII-based approach. We also improve the latency of writes blocked on virtual dirty by releasing the flush permit before fsyncing the sstables. There are additional avenues for performance improvements on top of this one." * 'memtable-flush-additional-fixes/v3' of github.com:duarten/scylla: column_family: Re-acquire flush permit in case of error column_family: Don't hold sstable read lock when retrying flush sstables: Release the flush permit before fsyncing sstables: Introduce write_monitor database: Extract out dirty_memory_manager dirty_memory_manager: Refactor flush permit lifetime management dirty_memory_manager: Invert permit acquisition order memtable_list: Register different seal functions for each behaviour main: Don't catch polymorphic exceptions by value	2017-07-31 10:32:26 +03:00
Duarte Nunes	e11e66723a	main: Don't catch polymorphic exceptions by value GCC trunk complains due to exception slicing. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170727163021.8000-1-duarte@scylladb.com>	2017-07-31 10:12:13 +03:00
Avi Kivity	fc683c3f3e	Merge seastar upstream * seastar a14d667...54e940f (8): > Merge "Prometheus to use output stream" from Amnon > http_test: Fix an http output stream test > build: harden try_compile_and_link output temporary file > configure: disable exception scalability hack on debug build > build: don't perform test compiles to /dev/null > Provide workaround for non scaleable c++ exception runtime > Merge "Add output stream to http message reply" from Amnon > configure.py: use user provided compiler flags when checking for features	2017-07-31 10:09:48 +03:00
Avi Kivity	c1718dd5e3	Update scylla-ami submodule * dist/ami/files/scylla-ami 2bd1481...b41e5eb (1): > Fix incorrect scylla-server sysconfig file edit for i3 memflush controller	2017-07-31 09:41:24 +03:00
Takuya ASADA	714540cd4c	dist/debian: refuse upgrade if current scylla < 1.7.3 && commitlog remains Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse updating package if current scylla < 1.7.3 && commitlog remains. Note: We have the problem on scylla-server package, but to prevent scylla-conf package upgrade, %pretrans should be define on scylla-conf. Fixes #2551 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>	2017-07-31 09:08:40 +03:00
Avi Kivity	e5d2e28df9	Merge "Backport exception scalability fix from gcc-7" from Gleb "This patch series backports scalability fix for _Unwind_Find_FDE and modifies our CentOS package to use our libgcc and libstdc++ which are needed to make use of the fix instead of locally installed ones." Ref #2646 (fixes on RHEL 7 and related only) * 'gleb/exception-gcc-fix-v2' of github.com:cloudius-systems/seastar-dev: dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc	2017-07-30 19:31:03 +03:00
Gleb Natapov	8fe875cc79	dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one	2017-07-30 16:03:25 +03:00
Gleb Natapov	1cf7e72c68	dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc	2017-07-30 16:03:10 +03:00
Paweł Dziepak	e62403190b	Merge "Introduce perf_cache_eviction test" from Tomasz Runs appending writes to a single partition, at full speed, and a reader which selects the head of the partition, with 100ms delay between reads. Prints latency percentiles and some stats. Intended to test performance at the transition from non-evicting to evicting modes. Currently we can see that after the transition, whole partition gets evicted and reads constantly miss. Sample output: rd/s: 10, wr/s: 135947, ev/s: 0, pmerge/s: 1, miss/s: 0, cache: 708/778 [MB], LSA: 820/910 [MB], std free: 82 [MB] reads : min: 149 , 50%: 179 , 90%: 1331 , 99%: 1331 , 99.9%: 1331 , max: 6866 [us] writes: min: 3 , 50%: 4 , 90%: 4 , 99%: 5 , 99.9%: 258 , max: 51012 [us] rd/s: 7, wr/s: 93354, ev/s: 9, pmerge/s: 1, miss/s: 3, cache: 0/0 [MB], LSA: 107/128 [MB], std free: 82 [MB] reads : min: 179 , 50%: 179 , 90%: 73457 , 99%: 73457 , 99.9%: 73457 , max: 105778 [us] writes: min: 3 , 50%: 4 , 90%: 4 , 99%: 5 , 99.9%: 258 , max: 105778 [us] * tag 'tgrabiec/row-eviction-perf-test' of github.com:scylladb/seastar-dev: tests: Introduce perf_cache_eviction tests: simple_schema: Add getter for DDL statement estimated_histogram: Implement percentile() utils: estimated_histogram: Make printable	2017-07-28 09:49:22 +01:00
Duarte Nunes	0f1bd81523	column_family: Re-acquire flush permit in case of error If we fail to flush an sstable, after creating the flush_reader, then we will have released the flush permit when we retry the flush. Ensure that when retrying, we re-acquire the flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	2f4cffc7f6	column_family: Don't hold sstable read lock when retrying flush Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	5e64839e85	sstables: Release the flush permit before fsyncing This allows a queued flush to start while we fsync the current sstable, which helps reduce the overall time new writes are blocked on dirty memory. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	a737577881	sstables: Introduce write_monitor The write_monitor provides callbacks to inform an observer of the state of the ongoing sstable write. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	121f967b30	database: Extract out dirty_memory_manager Needed so the flush_permit can be propagated to the sstables layer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	ef1275e9dd	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	cfc8fae33f	dirty_memory_manager: Invert permit acquisition order For an upcoming fix it is required to invert the permit acquisition order: first we acquire the background work permit and then the single flush permit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	7e68e4677d	memtable_list: Register different seal functions for each behaviour Instead of passing a flush_behaviour to the seal function, use two different functions for each of the behaviours. This will be important in the forthcoming patches, which will require the signatures of those functions to differ. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	7502401652	main: Don't catch polymorphic exceptions by value GCC trunk complains due to exception slicing. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Duarte Nunes	143f4fd861	Merge "Prevent pull requests from accumulating" from Tomasz If schema merging completes at lower rate than incoming pull requests, then merge processes will accumulate and needlessly request and hold schema mutations. In rare cases, when there are constant schema changes, they may even overflow memory. This was seen in dtest: concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test Allowing only one active and one queued pull request per remote endpoint is enough. * tag 'tgrabiec/dont-accumulate-schema-pulls-v2' of github.com:scylladb/seastar-dev: migration_manager: Log schema pulls migration_manager: Prevent pull requests from accumulating utils: Introduce serialized_action	2017-07-27 21:01:38 +02:00
Tomasz Grabiec	e09220dbff	migration_manager: Log schema pulls	2017-07-27 20:08:25 +02:00
Tomasz Grabiec	350d98d4e1	migration_manager: Prevent pull requests from accumulating If schema merging completes at lower rate than incoming pull requests, then merge processes will accumulate and needlessly request and hold schema mutations. In rare cases, when there are constant schema changes, they may even overflow memory. This was seen in dtest: concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test Allowing only one active and one queued pull request per remote endpoint is enough.	2017-07-27 20:08:25 +02:00
Tomasz Grabiec	6a3703944b	utils: Introduce serialized_action	2017-07-27 20:08:21 +02:00
Duarte Nunes	dbbb9e93da	tests/sstable_mutation_test: Test promoted index blocks are monotonic Reproduces #2333 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 18:23:58 +02:00
Duarte Nunes	06728bdfe9	sstables: Consider eoc when flushing pi block When flushing a promoted index block using a range tombstone cell name as a bound, use the right eoc value instead of always writing composite::eoc::none. Fixes #2333 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 18:23:58 +02:00
Duarte Nunes	718517ed91	sstables: Extract out converting bound_kind to eoc Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 18:23:58 +02:00
Paweł Dziepak	f02bef7917	streamed_mutation: do not call fill_buffer() ahead of time consume_mutation_fragments_until() allows consuming mutation fragments until a specified condition happens. This patch reorganises its implementation so that we avoid situations when fill_buffer() is called with stop condition being true. Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>	2017-07-27 17:47:57 +02:00
Tomasz Grabiec	ac7e6ef1bc	tests: Introduce perf_cache_eviction	2017-07-27 17:19:07 +02:00
Tomasz Grabiec	2d2e7ef6fb	tests: simple_schema: Add getter for DDL statement	2017-07-27 17:19:07 +02:00
Tomasz Grabiec	5602be72fa	estimated_histogram: Implement percentile()	2017-07-27 17:19:07 +02:00
Tomasz Grabiec	1bc305ed7b	utils: estimated_histogram: Make printable	2017-07-27 17:19:03 +02:00
Takuya ASADA	91a75f141b	dist/redhat: limit metapackage dependencies to specific version of scylla packages When we install scylla metapackage with version (ex: scylla-1.7.1), it just always install newest scylla-server/-jmx/-tools on the repo, instead of installing specified version of packages. To install same version packages with the metapackage, limited dependencies to current package version. Fixes #2642 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170726193321.7399-1-syuu@scylladb.com>	2017-07-27 14:21:35 +03:00
Takuya ASADA	11870e47ec	dist/redhat: refuse upgrade if current scylla < 1.7.3 && commitlog remains Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse updating package if current scylla < 1.7.3 && commitlog remains. Note: We have the problem on scylla-server package, but to prevent scylla-conf package upgrade, %pretrans should be define on scylla-conf. Fixes #2551 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170727110730.613-1-syuu@scylladb.com>	2017-07-27 14:09:17 +03:00
Tomasz Grabiec	22948238b6	row_cache: Fix potential timeout or deadlock due to sstable read concurrency limit database::make_sstable_reader() creates a reader which will need to obtain a semaphore permit when invoked. Therefore, each read may create at most one such reader in order to be guaranteed to make progress. If the reader tries to create another reader, that may deadlock (or for non-system tables, timeout), if enough number of such readers tries to do the same thing at the same time. Avoid the problem by dropping previous reader before creating a new one. Refs #2644. Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>	2017-07-27 13:58:20 +03:00
Vlad Zolotarov	e98adb13d5	service::storage_service: initialize auth and tracing after we joined the ring Initialize the system_auth and system_traces keyspaces and their tables after the Node joins the token ring because as a part of system_auth initialization there are going to be issued SELECT and possibly INSERT CQL statements. This patch effectively reverts the `d3b8b67` patch and brings the initialization order to how it was before that patch. Fixes #2273 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>	2017-07-27 10:54:36 +02:00
Amnon Heiman	a71b9e498a	database: remove latency from the system table This patch remove the latency histograms from the system table, it also extend the already existing exclusion to all system keyspaces. It also uses the new get_histogram API to set a minimal bucket size to 100 microseconds.	2017-07-27 11:41:15 +03:00
Amnon Heiman	1b05f23d12	estimated histogram: return a smaller histogram The current histogram contains 91 buckets, this is a very high resolution with a high upper limit. To reduce traffic passed, between scylla and the prometheus, this patch generate a smaller histogram. It limit the number of buckets (16 by default), set a lower limit to the lowest bucket, and uses 2 as the bucket coefficient. Highest empty buckets will not be reported. Signed-off-by: Amnon Heiman <amnon@scylladb.com> estimated histogram	2017-07-27 11:41:10 +03:00
Tomasz Grabiec	e9fc0b0491	Merge "Some fixes for performance regressions in perf_fast_forward" from Paweł These patches contain some minor fixes for performance regression reported by perf_fast_forward after partial cache was merged. The solution is still far from perfect, there is one case that still has 30% degradation, but there is some improvement so there is no reason to hold these changes back. Refs #2582. Some numbers: before - before cache changes were merged (`555621b537`) cache - at the commit that introduced the partial cache (`9b21a9bfb6`) after - recent master + this series (based on `e988121dbb`) Differences are shown relative to "before". Testing effectiveness of caching of large partition, single-key slicing reads: Large partitions, range [0, 500000], populating cache before cache after 1636840 1013688 1234606 -38% -25% Large partitions, range [0, 500000], reading from cache before cache after 2012615 3076812 3035423 +53% +51% Testing scanning small partitions with skips. 
reading small partitions (skip 0) before cache after 227060 165261 200639 -27% -11% skipping small partitions (skip 1) before cache after 29813 27312 38210 -8% +28% Testing slicing small partitions: slicing small partitions (offset 0, read 4096) before cache after 195282 149695 180497 -23% -8% * https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3: sstables: make sure that fill_buffer() actually fills buffer mutation_merger: improve handling of non-deferring fill_buffer()s partition_snapshot_row_cursor: avoid apply() in single-version cases sstables: introduce decorated_key_view ring_position_comparator: accept sstables::decorated_key_view sstable: keep a pre-computed token in summary_entry sstables: cache token in index entries index_reader: advance_and_check_if_present() use index_comparator ring_position_comparator: drop unused overloads cache_streamed_mutation: avoid moving clustering_row streamed_mutation: introduce consume_mutation_fragments_until() cache_streamed_mutation: use consumer based read_context reader rows_entry: make position() inlineable mutation_fragment: make destructor always_inline keys: introduce compound_wrapper::from_exploded_view() sstables: avoid copying key components compound_compat: explode: reserve some elements in a vector cache: short-circut static row logic if there are no static columns cache: use equality comparators instead of tri_compare sstables: avoid indirect calls to abstract_type::is_multi_cell()	2017-07-27 10:14:35 +02:00
Pekka Enberg	b80504188a	docs/docker-hub: Mark '--experimental' as 2.0 feature The '--experimental' flag appears in 2.0 so mark it as such in the user documentation on Docker Hub. Message-Id: <1501137703-29706-1-git-send-email-penberg@scylladb.com>	2017-07-27 10:28:25 +03:00
Duarte Nunes	85e85ec72e	Don't catch polymorphic exceptions by value It makes gcc a very sad compiler. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726172053.5639-2-duarte@scylladb.com>	2017-07-27 09:39:58 +03:00
Duarte Nunes	7536659cb5	CqlParser: Don't catch polymorphic exceptions by value Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726172053.5639-1-duarte@scylladb.com>	2017-07-27 09:39:57 +03:00
Tzach Livyatan	ea97b87205	Adding Scylla restart instructions Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170725064719.31109-1-tzach@scylladb.com>	2017-07-27 09:38:49 +03:00
Vlad Zolotarov	9adabd1bc4	utils::loading_cache: add stop() method loading_cache invokes a timer that may issue asynchronous operations (queries) that would end with writing into the internal fields. We have to ensure that these operations are over before we can destroy the loading_cache object. Fixes #2624 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1501096256-10949-1-git-send-email-vladz@scylladb.com>	2017-07-26 21:28:49 +02:00
Duarte Nunes	50ad0003c6	db/schema_tables: Drop dropped columns when dropping tables Fixes #2633 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726150228.2593-2-duarte@scylladb.com>	2017-07-26 18:41:28 +02:00
Duarte Nunes	3425403126	db/schema_tables: Store column_name in text form As does Cassandra. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726150228.2593-1-duarte@scylladb.com>	2017-07-26 18:41:12 +02:00
Duarte Nunes	787308a96c	cql3/tuples: Don't catch polymorphic exception by value Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726155740.3275-1-duarte@scylladb.com>	2017-07-26 19:28:35 +03:00
Asias He	515a744303	gossip: Fix nr_live_nodes calculation We need to consider the _live_endpoints size. The nr_live_nodes should not be larger than _live_endpoints size, otherwise the loop to collect the live node can run forever. It is a regression introduced in commit `437899909d` (gossip: Talk to more live nodes in each gossip round). Fixes #2637 Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>	2017-07-26 16:48:30 +03:00
Paweł Dziepak	7b0f75c0d1	sstables: avoid indirect calls to abstract_type::is_multi_cell()	2017-07-26 14:38:27 +01:00
Paweł Dziepak	b4d1dea4a9	cache: use equality comparators instead of tri_compare Equality comparator may be much cheaper than the fully fledged trichotomic comparator, especially if the component types are byte order equal but not byte order comparable.	2017-07-26 14:38:27 +01:00
Paweł Dziepak	2780555968	cache: short-circut static row logic if there are no static columns	2017-07-26 14:38:27 +01:00
Paweł Dziepak	4a0385e908	compound_compat: explode: reserve some elements in a vector When we are exploding a compound key we know already that there is more than one component, but we have no easy way of determining how many of them are going to be there. Let's reserve space for a few elements so that we avoid an excessive number of reallocations in case of medium-sized keys.	2017-07-26 14:38:27 +01:00
Paweł Dziepak	28c105e4a7	sstables: avoid copying key components	2017-07-26 14:38:27 +01:00
Paweł Dziepak	6031b7e587	keys: introduce compound_wrapper::from_exploded_view()	2017-07-26 14:38:27 +01:00
Paweł Dziepak	c9ccd813ab	mutation_fragment: make destructor always_inline mutation_fragment destructor was already made inline-friendly by moving most of the logic to a separate function. However, the compiler still is quite reluctant to inline it in certain cases, so let's give it a stronger hint.	2017-07-26 14:38:27 +01:00
Paweł Dziepak	43cce6c2f4	rows_entry: make position() inlineable	2017-07-26 14:38:27 +01:00
Paweł Dziepak	c2ec43f70b	cache_streamed_mutation: use consumer based read_context reader	2017-07-26 14:38:21 +01:00
Paweł Dziepak	2066354de3	streamed_mutation: introduce consume_mutation_fragments_until() consume_mutation_fragments_until() is a consumer based interface that avoids indirect calls and continuation overhead present in the naive streamed_mutation::operator() approach.	2017-07-26 14:37:20 +01:00
Paweł Dziepak	9bc6038ff3	cache_streamed_mutation: avoid moving clustering_row clustering_row can store quite a lot of data internally which makes its move constructor not exactly cheap. If possible it is better to move mutation_fragment around as it keeps everything externally. This also avoids some cases when clustering row would be extracted from mutation_fragment only to be made to create another mutation_fragment later.	2017-07-26 14:36:37 +01:00
Paweł Dziepak	68e57a742f	ring_position_comparator: drop unused overloads	2017-07-26 14:36:37 +01:00
Paweł Dziepak	960a140880	index_reader: advance_and_check_if_present() use index_comparator	2017-07-26 14:36:37 +01:00
Paweł Dziepak	dc7bad9a50	sstables: cache token in index entries When a sstable reader is fast forwarded some index entries may be read (and compared) multiple times. This patch makes sure that once a token is computed we keep it around and reuse if the entry is accessed again.	2017-07-26 14:36:37 +01:00
Paweł Dziepak	bfb7b56c74	sstable: keep a pre-computed token in summary_entry Each sstable index lookup involves a binary search in the summary and each time a partition key of summary entry is compared with anything its token needs to be calculated. Since we keep summary in the memory all the time it is better to also keep the tokens around.	2017-07-26 14:36:36 +01:00
Paweł Dziepak	fe7eba7f06	ring_position_comparator: accept sstables::decorated_key_view ring_position_comparator has overloads for comparing ring_positions as well as sstables::key_view. In the case of the latter it needs to compute the token of the key. However, the sstable layer could cache some tokens so let's allow the comparator callers to provide it directly.	2017-07-26 14:36:36 +01:00
Paweł Dziepak	31d7cfdefb	sstables: introduce decorated_key_view	2017-07-26 14:36:36 +01:00
Paweł Dziepak	722c56f3f2	partition_snapshot_row_cursor: avoid apply() in single-version cases	2017-07-26 14:36:36 +01:00
Paweł Dziepak	e145ee6bb8	mutation_merger: improve handling of non-deferring fill_buffer()s It is possible that a call to fill_buffer() will return an immediately ready future. This patch avoids uncontrolled recursion in case when all merged streamed mutations do not defer in fill_buffer() and also optimises for non-deferring case by avoiding some of the logic.	2017-07-26 14:36:36 +01:00
Paweł Dziepak	e0a04cb7fe	sstables: make sure that fill_buffer() actually fills buffer streamed_mutation::impl::fill_buffer() is supposed to either push mutation fragments to the buffer or set EOS flag. However, it was possible that mp_row_consumer would return proceed::no if a skip was needed without satisfying any of these conditions.	2017-07-26 14:36:36 +01:00
Pekka Enberg	e66635a885	Merge "Developer documentation improvements" from Jesse "This patch series addresses some feedback from the preliminary HACKING.md, adds some new content, and updates the README file with some quick-start information." * 'jhk/better_hacking/v3' of github.com:hakuch/scylla: README.md: Add quick-start section and defer to `HACKING.md` HACKING.md: `CMakeLists.txt` for analysis works for other IDEs too HACKING.md: Add details and examples for unit tests HACKING.md: Add section for project dependencies HACKING.md: Describe releases and tags HACKING.md: Re-work "building" section, including memory needs HACKING.md: Update ccache recommendations HACKING.md: Update "Contributing" URL	2017-07-26 16:25:58 +03:00
Duarte Nunes	e988121dbb	schema_builder: Replace type when re-dropping column Fixes #2634 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170725183933.5311-1-duarte@scylladb.com>	2017-07-26 13:26:29 +02:00
Duarte Nunes	64fcf0c642	alter_table_statement: Allow collection columns to replace normal ones Fixes #2632 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170725183811.5155-1-duarte@scylladb.com>	2017-07-26 13:24:03 +02:00
Duarte Nunes	1622847c1d	perf/perf_fast_forward: Don't pass non-pod to varargs function Passing a Non-POD object to variadic functions is unsupported. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170726094756.22867-1-duarte@scylladb.com>	2017-07-26 11:48:22 +01:00
Duarte Nunes	9c831b4e97	schema: Remove unnecessary print Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170725174000.71061-1-duarte@scylladb.com>	2017-07-26 12:01:51 +02:00
Duarte Nunes	472f32fb06	tests/schema_change_test: Add test case for add+drop notification Reproduces #2616 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170725170622.4380-2-duarte@scylladb.com>	2017-07-26 11:59:48 +02:00
Duarte Nunes	33e18a1779	db/schema_tables: Consider differing dropped columns If a node is notified of a schema change where the schema's dropped columns have changes, that node will miss the changes to the dropped columns. A scenario where this can happen is where a column c is dropped, then added as a different typed, and then dropped again, with a node n having seen the first drop and being notified of the subsequent add and drop. Fixes #2616 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170725170622.4380-1-duarte@scylladb.com>	2017-07-26 11:59:34 +02:00
Jesse Haber-Kucharsky	d6c0138576	README.md: Add quick-start section and defer to `HACKING.md`	2017-07-25 17:58:00 -04:00
Jesse Haber-Kucharsky	d06bccf857	HACKING.md: `CMakeLists.txt` for analysis works for other IDEs too	2017-07-25 17:57:55 -04:00
Jesse Haber-Kucharsky	9c2390e1a4	HACKING.md: Add details and examples for unit tests	2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky	488839dd15	HACKING.md: Add section for project dependencies	2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky	14d03d7548	HACKING.md: Describe releases and tags	2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky	6e8bfdbb3f	HACKING.md: Re-work "building" section, including memory needs	2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky	64acb41305	HACKING.md: Update ccache recommendations	2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky	4fe767de31	HACKING.md: Update "Contributing" URL The old page results in a 404 error.	2017-07-25 17:46:45 -04:00
Paweł Dziepak	295689d16f	db: include counter writes on leader in metrics Counters write path on leader is completely different than on any other replica (non-leaders share write path between counters and regular columns). This patch makes sure that counter writes performed on leader are added to appropriate metrics. Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>	2017-07-25 18:31:43 +02:00
Tomasz Grabiec	18be42f71a	Merge fixes related to row cache from Raphael * git@github.com:raphaelsc/scylla.git row_cache_fixes: db: atomically synchronize cache with changes to the snapshot db: refresh row cache's underlying data source after compaction	2017-07-25 15:34:32 +02:00
Paweł Dziepak	79a1ad7a37	tests/row_cache: test queries with no clustering ranges Reproducer for #2604. Message-Id: <20170725131220.17467-3-pdziepak@scylladb.com>	2017-07-25 15:29:17 +02:00
Paweł Dziepak	1ea507d6ae	tests: do not overload the meaning of empty clustering range Empty clustering key range is perfectly valid and signifies that the reader is not interested in anything but the static row. Let's not make it mean anything else. Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>	2017-07-25 15:28:12 +02:00
Paweł Dziepak	6572f38450	cache: fix aborts if no clustering range is specified cache_streamed_mutation assumed that at least one clustering range was specified. That was wrong since the readers are allowed to query just for a static row (e.g. counter update that modifies only static columns). Fixes #2604. Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>	2017-07-25 15:27:48 +02:00
Amnon Heiman	1f5a9ecc40	scylla-housekeeping: support patches releases To support both version and patch release, the version server now returns a patchversion parameter that include the latest minor version's patch release. The housekeeping should return a separate message if the current minor version is not with the latest patch release, and a message if the version was changed. For example, if a user is using version 1.6.1 it should get a warning that he need to update if 1.6.2 is available and in addition a warning it should upgrade if version 1.7 is out. Examples: $ scylla-housekeeping version --version 1.6.2 Your current Scylla release is 1.6.2, while the latest patch release is 1.6.4, and the latest minor release is 1.7.2 (recommended) $ scylla-housekeeping version --version 1.7.1 You current Scylla release is 1.7.1 while the latest patch release is 1.7.2 is available, update for the latest bug fixes $ scylla-housekeeping version --version 1.7.1 You current Scylla release is 1.7.1 while the latest patch release is 1.7.2, update for the latest bug fixes and improvements Fixes #1972 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Acked-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170725095455.6450-1-amnon@scylladb.com>	2017-07-25 13:12:18 +03:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction Underlying data source in row cache holds a reference to sstable set prior to compaction which isn't released until a memtable flush, which means file descriptors of deleted sstables remains opened, wasting disk space. The fix is to refresh underlying data source in row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Raphael S. Carvalho	e3ad676433	db: atomically synchronize cache with changes to the snapshot updates to cache and snapshot (i.e. sstable set) aren't synchronized, so it may happen that cache update for memtable flush will use wrong snapshot version, and that violates cache invariant of each partition entry only reflecting one snapshot. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:45:05 -03:00
Avi Kivity	c21bb5ae05	tests: fix sstable_datafile_test build with boost 1.55 Boost 1.55 accidentally removed support for "range for" on recursive_directory_iterator (previous and latter versions do support it). Use old-style iteration instead. Message-Id: <20170724080128.8824-1-avi@scylladb.com>	2017-07-24 11:20:12 +03:00
Avi Kivity	f75b578607	Update ami submodule * dist/ami/files/scylla-ami 5dfe42f...2bd1481 (1): > Enable support for experimental CPU controller in i3 instances	2017-07-24 10:26:52 +03:00
Tomasz Grabiec	60678f0e8a	ring_position: Optimize contruction from r-value referenceces of decorated_key Message-Id: <1500650171-26291-1-git-send-email-tgrabiec@scylladb.com>	2017-07-24 10:25:14 +03:00
Tomasz Grabiec	136d205855	mutation_partition: Always mark static row as continuous when no static columns To avoid unnecessary cache misses after static columns are added. Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>	2017-07-24 10:23:35 +03:00
Tomasz Grabiec	714d609605	database: Fix reversed order of keyspace and table names in a log message Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 17:10:17 +02:00
Tomasz Grabiec	059779eea6	gdb: Fix 'scylla ptr' reporting large object pages as free The 'free' attribute is not updated for all pages belonging to a large object, so we can't use it to determine if the page is allocated or not. More reliable way is to check if it belongs to any free span. Message-Id: <1500648094-20039-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:41 +02:00
Tomasz Grabiec	29a82f5554	schema_registry: Keep unused entries around for 1 second This is in order to avoid frequent misses which have a relatively high cost. A miss means we need to fetch schema definition from another node and in case of writes do a schema merge. If the schema is kept alive only by the incoming request, then it will be forgotten immediately when the request is done, and the next request using the same schema version will miss again. Refs #2608. Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:37 +02:00
Tomasz Grabiec	ecc85988dd	legacy_schema_migrator: Don't snapshot empty legacy tables Otherwise we will create a new (empty) snapshot each time we boot. Message-Id: <1500573920-31478-2-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:31 +02:00
Tomasz Grabiec	408cea66cd	database: Allow disabling auto snapshots during drop/truncate Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:56:29 +02:00
Duarte Nunes	937fe80a1a	Merge 'Fix possible inconsistency of table schema version' from Tomasz "Fixes issues uncovered in longevity test (#2608). Main problem is that due to time drift scylla_tables.version column may not get deleted on all nodes doing the schema merge, which will make some nodes come up with different table schema version than others. The inconsistency will not heal because scylla_tables doesn't take part in the schema sync. This is fixed by the last patch. This will cause nodes to constantly try to sync the schema, which under some conditions triggers #2617." * tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev: schema_tables: Add scylla_tables to ALL schema: Make schema_mutations equality consistent with digest schema_tables: Extract compact_for_schema_digest() schema_tables: Always drop scylla_tables::version	2017-07-21 16:55:23 +02:00
Tomasz Grabiec	65c64614aa	schema_registry: Ensure schema_ptr is always synced on the other core global_schema_ptr ensures that schema object is replicated to other cores on access. It was replicating the "synced" state as well, but only when the shard didn't know about the schema. It could happen that the other shard has the entry, but it's not yet synced, in which case we would fail to replicate the "synced" state. This will result in exception from mutate(), which rejects attempts to mutate using an unsynced schema. The fix is to always replicate the "synced" state. If the entry is syncing, we will preemptively mark it as synced earlier. The syncing code is already prepared for this. Refs #2617. Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>	2017-07-21 16:54:47 +02:00
Duarte Nunes	7eecda3a61	schema: Support compaction enabled attribute Fixes #2547 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170721132206.3037-1-duarte@scylladb.com>	2017-07-21 15:38:45 +02:00
Amnon Heiman	e345d05ebe	system_keyspace: Use paging for get compaction history there could be a lot of compactions when querying for compaction history. This patch changes the query to use paging. It would collect all results when returning to the caller. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-07-20 18:17:49 +03:00
Vlad Zolotarov	9086c643a6	service::storage_proxy: add a trace points pair in the SELECT replica flow Add two trace points: at the beginning and at the end of the replica flow on the replica shard. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>	2017-07-20 16:44:25 +02:00
Amnon Heiman	08c81427b9	Add paging for internal queries Usually, internal queries are used for short queries. Sometimes though, like in the case of get compaction history, there could be a large amount of results. Without paging it will overload the system. This patch adds the ability to use paging internally. Using paging will be done explicitly; all the relevant information is stored in an internal_query_state, which holds both the paging state and the query so consecutive calls can be made. To use paging use the query method with a function. Besides a statement and its parameters, the query method takes a function that will be called for each of the returned rows. For example if qp is a query_processor: qp.query("SELECT * from system.compaction_history", [] (const cql3::untyped_result_set::row& row) { .... // do something with row ... return stop_iteration::no; // keep on reading }); Will run the function on each of the compaction history table rows. To stop the iteration, the function can return stop_iteration::yes.	2017-07-20 17:43:51 +03:00
Tomasz Grabiec	2bc549f426	Merge perf_fast_forward enhancements from Paweł * https://github.com/pdziepak/scylla.git perf_fast_forward_improvements/v1: perf_fast_forward: move global state to global scope perf_fast_forward: move tests groups to separate functions perf_fast_forward: allow running only selected test groups perf_fast_forward: use consumer interface for reading streamed_mutation	2017-07-20 16:41:29 +02:00
Tomasz Grabiec	ed2388da2c	schema_tables: Add scylla_tables to ALL So that scylla_tables takes part in the digest and in mutations sent as part of schema sync. Otherwise inconsistencies in scylla_tables will not heal. Refs #2608.	2017-07-20 15:47:10 +02:00
Tomasz Grabiec	78ff728795	schema: Make schema_mutations equality consistent with digest Digest only looks like live values, ignoring deletion information. Equality should be consistent with that, so that schemas considered equal do not trigger the alter path unnecessarily.	2017-07-20 15:47:10 +02:00
Tomasz Grabiec	6adbe61e2f	schema_tables: Extract compact_for_schema_digest()	2017-07-20 15:47:10 +02:00
Tomasz Grabiec	1b85c316bf	schema_tables: Always drop scylla_tables::version It can happen that due to time drift between nodes, the incoming "version" cell will have higher timestamp than api::new_timestamp(). In such case the column would not be dropped and would cause version mismatch between nodes. Ensure it's always covered by using max of current time and cell's timestamp. Refs #2608.	2017-07-20 15:47:10 +02:00
Takuya ASADA	2bf16c6e8a	dist/debian: add --no-clean option to skip building pbuilder .tgz image By default build_deb.sh destroys all previous build images to make sure we don't have environment-dependent issues, but it takes time to build the distribution root image (.tgz in pbuilder) from scratch. The --no-clean option skips the .tgz creation stage and uses the previously built image, to make build time shorter. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1500542094-12946-1-git-send-email-syuu@scylladb.com>	2017-07-20 15:37:00 +03:00
Calle Wilund	91f314e54c	duration.cc: Fix static assert static_assert(cond) is C++17 only Message-Id: <1500373227-12025-1-git-send-email-calle@scylladb.com>	2017-07-20 13:14:51 +02:00
Paweł Dziepak	823fb5e9d8	perf_fast_forward: use consumer interface for reading streamed_mutation Using streamed_mutation::operator() is undesirable as it introduces an indirect call and a continuation overhead for each emitted mutation fragment. Consumer interface is the preferred method of reading streamed mutations.	2017-07-20 11:02:53 +01:00
Paweł Dziepak	d184508d7b	perf_fast_forward: allow running only selected test groups	2017-07-20 11:02:31 +01:00
botond	884928c511	install-dependencies.sh: Fix ubuntu dependencies Remove dependencies section from README.md, point to the install-dependencies.sh script instead. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <7f51f17a743a82d68b7d4a279b066ffe55fe0379.1500540523.git.bdenes@scylladb.com>	2017-07-20 12:00:20 +03:00
Paweł Dziepak	a18a36c94b	perf_fast_forward: move tests groups to separate functions	2017-07-20 09:26:42 +01:00
Paweł Dziepak	3fd4f9c1c7	perf_fast_forward: move global state to global scope All test perf_fast_forward test cases currently live in the main function. This patch moves the state they rely on to a global scope so that it will be easier to extract these tests to individual functions.	2017-07-20 09:26:42 +01:00
Avi Kivity	c5ee62a6a4	Merge "restrict background writers with scheduling groups" from Glauber "This patchset restricts background writers - such as compactions, streaming flushes and memtable flushes to a maximum amount of CPU usage through a seastar::thread_scheduling_group. The said maximum is recommended to be set 50 % - it is default disabled, but can be adjusted through a configuration option until we are able to auto-tune this. The second patch in this series provides a preview on how such auto-tune would look like. By implementing a simple controller we automatically adjust the quota for the memtable writer processes, so that the rate at which bytes come in is equal to the rates at which bytes are flushed. Tail latencies are greatly reduced by this series, and heavy spikes that previously appeared on CPU-bound workloads are no more." * 'memtable-controller-v5' of https://github.com/glommer/scylla: simple controller for memtable/streaming writer shares. restrict background writers to 50 % of CPU.	2017-07-20 10:58:53 +03:00
Takuya ASADA	c441b1604a	dist/redhat: use EPEL's ragel for CentOS Since ragel added on EPEL, drop self-built package and use EPEL one. See #2441 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170719170500.18515-1-syuu@scylladb.com>	2017-07-20 10:36:57 +03:00
Calle Wilund	7a583585a2	system_keyspace: Make sure "system" is written to keyspaces (visible) Fixes #2514 Bug in schema version 3 update: We failed to write "system" to the schema tables. Only visible on an empty instance of course. Message-Id: <1500469809-23546-2-git-send-email-calle@scylladb.com>	2017-07-19 16:18:56 +03:00
Calle Wilund	247c36e048	system_schema: Fix remaining places not handling two system keyspaces Some places remained where code looked directly at system_keyspace::NAME to determine iff a ks is considered special/system/protected. Including schema digest calculation. Export "is_system_keyspace" and use accordingly. Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>	2017-07-19 16:18:45 +03:00
Duarte Nunes	1daf1bc4bb	Merge 'Revert back to 1.7 schema layout in memory' from Tomasz "Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555) by reverting back to using the old layout in memory and thus also in across-node requests. We still use the new v3 layout in schema tables (needed by drivers and external tools). Translations happen when converting to/from schema mutations." * tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev: schema: Revert back to the 1.7 layout of static compact tables in memory schema: Use v3 column layout when converting to/from schema mutations schema: Encapsulate column layout translations in the v3_columns class	2017-07-19 12:52:52 +02:00
Avi Kivity	d5aba779d4	Merge "streaming error handling improvement" from Asias "This series improves the streaming error handling so that when one side of the streaming failed, it will propagate the error to the other side and the peer will close the failed session accordingly. This removes the unnecessary wait and timeout time for the peer to discover the failed session and fail eventually. Fix it by: - Use the complete message to notify peer node local session is failed - Listen on shutdown gossip callback so that we can detect that the peer is shut down and close the session with the peer Fixes #1743" * tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev: streaming: Listen on shutdown gossip callback gms: Add is_shutdown helper for endpoint_state class streaming: Send complete message with failed flag when session is failed streaming: Handle failed flag in complete message streaming: Do not fail the session when failed to send complete message streaming: Introduce send_failed_complete_message streaming: Do not send complete message when session is successful streaming: Introduce the failed parameter for complete message streaming: Remove unused session_failed function streaming: Less verbose in logging streaming: Better stats	2017-07-19 11:18:09 +03:00
Amos Kong	2bdcad5bc3	scylla_raid_setup: fix syntax error /usr/lib/scylla/scylla_raid_setup: line 132: syntax error near unexpected token `fi' Fixes #2610 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <af3a5bc77c5ba2b49a8f48a5aaa19afffb787886.1500430021.git.amos@scylladb.com>	2017-07-19 11:10:29 +03:00
Duarte Nunes	ab72132cb1	view_schema_test: Retry failed queries Due to the asynchronous nature of view update propagation, results might still be absent from views when we query them. To be able to deterministically assert on view rows, this patch retries a query a bounded number of times until it succeeds. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170718212646.2958-1-duarte@scylladb.com>	2017-07-19 09:59:44 +02:00
Duarte Nunes	115ff1095e	db/view: Use view schema for view pk operations Instead of base schema. Fixes #2504 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170718190703.12972-1-duarte@scylladb.com>	2017-07-19 09:59:34 +02:00
Tomasz Grabiec	a9237c1666	schema: Revert back to the 1.7 layout of static compact tables in memory We are using C* 3.x compatible layout in schema tables but want to keep using the 1.7 layout in memory for compatibility during rolling upgrade. This patch switches the schema and schema_builder classes back to the old layout. Translation of layout happens when converting to/from schema mutations. Notable changes: 1) Includes a revert of commit `6260f31e08` "thrift: Update CQL mapping of static CFs". 2) Brings back the "default_validation_class" schema attribute. In v3 it can be derived from column definitions, but in v2 it can't, so we have to store it. 3) legacy_schema_migrator and schema_builder don't have to do conversions to v3, this is now handled by the v3_columns class. schema_builder works with the same layout as schema, that is v2. 4) Includes a revert of commit `66991a7ccb` "v3 schema test fixes" Fixes #2555.	2017-07-19 09:52:15 +02:00
Tomasz Grabiec	dc2dc056a4	schema: Use v3 column layout when converting to/from schema mutations	2017-07-19 09:52:15 +02:00
Tomasz Grabiec	dc463ef644	schema: Encapsulate column layout translations in the v3_columns class	2017-07-19 09:52:15 +02:00
Avi Kivity	bfae5c7bac	Merge "Time window compaction strategy support" from Raphael "Time window strategy was introduced to address several limitations of date tiered strategy. In addition, its options are much easier to reason about, basically just window size and window unit. TWCS will work to keep only one sstable in each window. So the only real optimization needed is to align partition key to the window. Size tiered strategy is used to reduce write amplification when compacting the incoming window. For more details: https://issues.apache.org/jira/browse/CASSANDRA-9666 Fixes #1432." * 'twcs_v2' of github.com:raphaelsc/scylla: tests: add tests for time window compaction strategy compaction: wire up time window compaction strategy compaction/twcs: override default values with options in schema sstables: implement time window compaction strategy sstables: import TimeWindowCompactionStrategy.java	2017-07-19 10:22:53 +03:00
Duarte Nunes	3bfcf47cc6	types: Implement hash() for collections This patch provides a rather trivial implementation of hash() for collection types. It is needed for view building, where we hold mutations in a map indexed by partition keys (and frozen collection types can be part of the key). Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170718192107.13746-1-duarte@scylladb.com>	2017-07-19 09:52:56 +03:00
Raphael S. Carvalho	c55c63f213	tests: add tests for time window compaction strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-19 02:58:37 -03:00
Raphael S. Carvalho	7ecedac222	compaction: wire up time window compaction strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-19 02:58:37 -03:00
Raphael S. Carvalho	01886c23a8	compaction/twcs: override default values with options in schema Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-19 02:58:37 -03:00
Raphael S. Carvalho	206d30c52a	sstables: implement time window compaction strategy For more details, https://issues.apache.org/jira/browse/CASSANDRA-9666 Fixes #1432. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-19 02:58:35 -03:00
Glauber Costa	c9a529ebee	simple controller for memtable/streaming writer shares. This patch introduces a simple controller that will adjust memtables CPU shares, trying to keep it around the soft limit: if we start going below it means we're too fast (unless we are idle) and shares are adjusted downwards. If we start going above it means we're too slow and shares are adjusted upwards. I have tested this extensively in a single-CPU setup with various CPU-bound workloads while tracking virtual dirty and the results are good, with virtual dirty fluctuating only slightly, somewhere within the desired range. Exceptions to this are: 1) when the load is very light - the idle system goes faster, and that's ok 2) when the load is very high - as foreground requests dominate we can't flush fast enough and hit the hard limit. However, in such scenarios the memtable shares do hit their maximum, and the results are no worse than they are right now and this will only be fixed by CPU-limiting the actual requests. This feature can be disabled with a config option - that is scheduled to go away as we acquire more confidence in this. When the feature is disabled, all background writers (streaming, compaction, memtables) will share the same scheduling group, with static quotas. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:47 -04:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background process, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as it will force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances as this very moment. Thankfully, those processes live in a seastar::thread, that ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, leaving background processes run wild is not adaptative either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitining in the mean time. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. 
The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Asias He	d6cebd1341	streaming: Listen on shutdown gossip callback When a node shuts itself down, it will send a shutdown status to peer nodes. When peer nodes receive the shutdown status update, they are supposed to close all the sessions with that node: because the node is shut down, there is no need to wait and time out before failing the session. This change can speed up the closing of sessions.	2017-07-19 10:11:06 +08:00
Asias He	ed7e6974d5	gms: Add is_shutdown helper for endpoint_state class It will be used by streaming manager to check if a node is in shutdown status.	2017-07-19 10:11:05 +08:00
Asias He	aa87429e67	streaming: Send complete message with failed flag when session is failed To notify peer node the session is failed.	2017-07-19 10:11:05 +08:00
Asias He	03b838705c	streaming: Handle failed flag in complete message Fail the current session if the failed flag is on in the complete message handler.	2017-07-19 10:11:05 +08:00
Asias He	12d18cfab4	streaming: Do not fail the session when failed to send complete message Since the complete message is not mandatory, there is no point in failing the session if sending the complete message fails.	2017-07-19 10:11:04 +08:00
Asias He	ca5248cd58	streaming: Introduce send_failed_complete_message Currently, send_complete_message is not used. We will use it shortly in case the local session is failed. Send a complete message with failed flag to notify peer node that the session is failed so that peer can close the session. This can speed up the closing of failed session. Also rename it to send_failed_complete_message.	2017-07-19 10:11:04 +08:00
Raphael S. Carvalho	2686e84792	sstables: import TimeWindowCompactionStrategy.java it will be later converted to C++. Imported from latest scylla- tools-java repository. Checked that it doesn't lack anything. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-18 18:26:17 -03:00
Takuya ASADA	49b01e764a	dist/common/scripts/scylla_prepare: stop running hugeadm when it's posix mode A user reported that scylla-server.service is not able to run on their cloud instance because of hugeadm. (hugeadm says the kernel does not support huge pages.) We don't need it for posix mode, so run it only in dpdk mode. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1500367219-8728-1-git-send-email-syuu@scylladb.com>	2017-07-18 16:39:16 +03:00
Tomasz Grabiec	63caa58b70	Merge "Drop mutations that raced with truncate" from Duarte Instead of retrying, just drop mutations that raced with a truncate. * git@github.com:duarten/scylla.git truncate-reorder/v1: database: Rename replay_position_reordered_exception database: Drop mutations that raced with truncate	2017-07-18 12:53:36 +02:00
Asias He	f21cb75cdb	streaming: Do not send complete message when session is successful The complete_message is not needed and the handler of this rpc message does nothing but return a ready future. The patch to remove it did not make it into the Scylla 1.0 release so it was left there.	2017-07-18 15:29:42 +08:00
Duarte Nunes	d9fa3bf322	thrift: Fail when mixed CFs are detected Fixes #2588 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170717222612.7429-1-duarte@scylladb.com>	2017-07-18 10:21:33 +03:00
Asias He	0ba4e73068	streaming: Introduce the failed parameter for complete message Use this flag to notify the peer that the session is failed so that the peer can close the failed session more quickly. The flag is passed as an rpc::optional so it is compatible with old versions of the verb.	2017-07-18 11:24:31 +08:00
Asias He	7599c1524d	streaming: Remove unused session_failed function It is never used. Get rid of it.	2017-07-18 11:22:09 +08:00
Asias He	caad7ced23	streaming: Less verbose in logging Now, we will have large number of small streaming. Make the not very important logging message debug level.	2017-07-18 11:17:09 +08:00
Asias He	d0dffd7346	streaming: Better stats Log the number of bytes streamed and streaming bandwidth summary in the same line with session complete message.	2017-07-18 11:17:09 +08:00
Avi Kivity	64ef7aa5e4	Merge seastar upstream * seastar 867b7c7...a14d667 (1): > tls: remove unneeded lambda captures	2017-07-17 19:30:59 +03:00
Duarte Nunes	6b464da67d	schema: Get rid of regular_columns_by_name They are unused. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170717103635.6473-2-duarte@scylladb.com>	2017-07-17 12:52:41 +02:00
Asias He	adc5f0bd21	gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option It is useful for larger clusters with larger gossip message latency. By default the fd_max_interval_ms is 2 seconds which means the failure_detector will ignore any gossip message update interval larger than 2 seconds. However, in larger clusters, the gossip message update interval can be larger than 2 seconds. Fixes #2603. Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>	2017-07-17 13:29:16 +03:00
Duarte Nunes	13caccf1cf	Merge 'Fixes around migration to v3 schema tables' from Tomasz branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev: schema: Use proper name comparator legacy_schema_migrator: Properly migrate non-UTF8 named columns schema_tables: Store column_name in text form legacy_schema_migrator: Migrate columns like Cassandra schema_builder: Add factory method for default_names legacy_schema_migrator: Simplify logic thrift: Don't set regular_column_name_type schema: Use proper column name type for static columns schema: Fix column_name_type() for static compact tables schema: Introduce clustering_column_at() thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type partition_slice_builder: Use proper column's type instead of regular_column_name_type()	2017-07-17 11:16:52 +02:00
Tomasz Grabiec	34dae0588c	schema: Use proper name comparator This replaces column_definition::name_comparator, which incorrectly assumes that names are always utf8, with name_compare moved from schema::rebuild() and unifies usages.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	7e54290d38	legacy_schema_migrator: Properly migrate non-UTF8 named columns Currently migrator assumed all columns are utf8-named, which doesn't have to be the case for static compact tables. Refs #2597. Due to #2573, we can assume that Scylla wasn't used with non-utf8 column names, and that old names are always in textual form.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	60a76efd37	schema_tables: Store column_name in text form That's how it is stored by Cassandra. Refs #2597.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	61229a7536	legacy_schema_migrator: Migrate columns like Cassandra This fixes generation of synthetic columns for static compact tables. Current code always generates synthetic clustering column with utf8 type and synthetic regular column with bytes type (in schema_builder). That's fine when creating a new CQL table, but not when migrating existing tables created via thrift API. Fixes #2584. This also migrates empty compact value columns like Cassandra does. Such columns are present in compact tables without regular columns, e.g.: create table test (k int, ck int, primary key (k, ck)) with compact storage; They should be migrated to a synthetic regular column with empty_type type and a non-empty name.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	49e21b3b8e	schema_builder: Add factory method for default_names	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	6dc299c27a	legacy_schema_migrator: Simplify logic The expression "is_dense.value_or(true)" is always true inside the if, so drop it. This allows us to drop temporary calulated_is_dense. We can also get rid of one of the if branches by extracting builder.set_is_dense() outside.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	3987e9be31	thrift: Don't set regular_column_name_type Regular columns are always utf8 after `f5dae826ce`.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	b919c50d21	schema: Use proper column name type for static columns After `f5dae826ce`, static columns not always have utf8 column names. For static compact tables it's determined by the cell name comparator type, which is equal to the type of the synthetic clustering column. Caused various errors with static thrift tables with non-utf8 comparator.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	f685f7f8a1	schema: Fix column_name_type() for static compact tables Introduced in `f5dae826ce`.	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	84536a4a75	schema: Introduce clustering_column_at()	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	9ed958a1eb	thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type	2017-07-17 09:40:06 +02:00
Tomasz Grabiec	9768036d61	partition_slice_builder: Use proper column's type instead of regular_column_name_type()	2017-07-17 09:40:06 +02:00
Avi Kivity	c51001b598	Merge seastar upstream * seastar b812cee...867b7c7 (1): > rpc: start server's send loop only after protocol negotiation Fixes #2600.	2017-07-16 19:36:31 +03:00
Avi Kivity	a5bd854019	Merge seastar upstream * seastar 844bcfb...b812cee (1): > Update dpdk submodule Fix #2595 (again).	2017-07-16 17:00:48 +03:00
Avi Kivity	d9c64ef737	tests: move tmpdir to /tmp Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk with write-back cache, and forever on a spinning disk. Message-Id: <20170716081653.10018-1-avi@scylladb.com>	2017-07-16 11:55:08 +02:00
Avi Kivity	9116dd91cb	tests: copy the sstable with an unknown component to the data directory We will be creating links to those sstable's files, and those don't work if the data directory and the test sstable are on different devices. Copying the files to the same directory fixes the problem. Message-Id: <20170716090405.14307-1-avi@scylladb.com>	2017-07-16 11:55:00 +02:00
Duarte Nunes	2c711922cc	database: Drop mutations that raced with truncate Mutations that race with a truncate can just be dropped. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Duarte Nunes	0825c9c805	database: Rename replay_position_reordered_exception Rename replay_position_reordered_exception to mutation_reordered_with_truncate_exception for more precision, since this is the only situation where this exception can be thrown. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-16 00:08:05 +02:00
Avi Kivity	e87ab54bfc	Merge seastar upstream * seastar ff34c42...844bcfb (1): > Update dpdk submodule Fixes #2595.	2017-07-15 19:17:05 +03:00
Tomasz Grabiec	caa62f7f05	Merge "Fixes for memtable flushing and replay positions" from Duarte We don't ensure mutations are applied in memory following the order of their replay positions. A memtable can thus be flushed with replay position rp, with the new one being at replay position rp', where rp' < rp. This breaks an intrinsic assumption in the code, which this series addresses. Fixes #2074 branch memtable-flush/v3 of git@github.com:duarten/scylla.git: commitlog: Always flush latest memtable column_family: More precise count of switched memtables column_family: Fix typo in pending_tasks metric name column_family: More precise count of pending flushes dirty_memory_manager: Remove unnecessary check from flush_one() column_family: Don't rely on flush_queue to guarantee flushes finished column_family: Don't bother closing the flush_queue on stop() column_family: Stop using flush_queue column_family: Remove outdated comment about the flush_queue memtable: Stop tracking the highest flushed rp	2017-07-14 11:39:37 +02:00
Avi Kivity	162d9aa85d	tests: fix view_schema_test with clang Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}. No doubt it is correct, but sheesh. Make the data_value explicit to humor it. Message-Id: <20170713074315.9857-1-avi@scylladb.com>	2017-07-14 12:24:27 +03:00
Duarte Nunes	b8235f2e88	storage_proxy: Preserve replica order across mutations In storage_proxy we arrange the mutations sent by the replicas in a vector of vectors, such that each row corresponds to a partition key and each column contains the mutation, possibly empty, as sent by a particular replica. There is reconciliation-related code that assumes that all the mutations sent by a particular replica can be found in a single column, but that isn't guaranteed by the way we initially arrange the mutations. This patch fixes this and enforces the expected order. Fixes #2531 Fixes #2593 Signed-off-by: Gleb Natapov <gleb@scylladb.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170713162014.15343-1-duarte@scylladb.com>	2017-07-14 12:11:22 +03:00
Duarte Nunes	5f24e9a4a5	memtable: Stop tracking the highest flushed rp Since we no longer enforce that mutations are applied in memory ordered by their replay_positions, the way the highest_flush_rp is being tracked is no longer correct. The invariant it was used to maintain no longer exists, so we can get rid of it together with the assertion on the highest_flush_rp on flush(). Fixes #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:06 +02:00
Duarte Nunes	22a53a52a1	column_family: Remove outdated comment about the flush_queue Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:05 +02:00
Duarte Nunes	003941cd95	column_family: Stop using flush_queue Since commitlog ordering requirements have been relaxed, we now keep the set of replay_positions seen by a memtable in a set, which we then use to clean up relevant segments in the commitlog. This means that the guarantees provided by the flush_queue are no longer necessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:56:00 +02:00
Duarte Nunes	7e6fe5895e	column_family: Don't bother closing the flush_queue on stop() When stopping a column family we issue a flush(), for which we wait. Since writes are supposed to have stopped coming in, and also new flush requests, there's no need to call and wait for the flush_queue to be closed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	a1f4536ffb	column_family: Don't rely on flush_queue to guarantee flushes finished We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. So, use a phased_barrier() to ensure that calling flush() returns a future that completes when all flushes up to that point have finished. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:58 +02:00
Duarte Nunes	1b320496e2	dirty_memory_manager: Remove unnecessary check from flush_one() We don't need to check whether a memtable is empty in flush_one(), as that must be checked later, during the actual sealing. The condition itself is rare and is checked already after the potentially contended semaphore has been acquired. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:57 +02:00
Duarte Nunes	59bdaed02b	column_family: More precise count of pending flushes This patch ensures we update the count of pending flushes in the same place as we update the stats across column families, which is more correct since it only accounts for actual flushes and not those of empty memtables or that have been coalesced together. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	3e27c335a9	column_family: Fix typo in pending_tasks metric name Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	a11724c6e1	column_family: More precise count of switched memtables The memtable_switch_count metric is supposed to count the number of times a flush has resulted in the memtable being switched out, but we were incrementing the count regardless of whether we tried to flush an empty memtable or two or more flushes were coalesced into one. This patch fixes this by moving the metric to where the memtable is actually switched. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Duarte Nunes	bca1b19ce9	commitlog: Always flush latest memtable We now don't ensure mutations are applied in memory following the order of their replay positions, so we can't rely on the replay position to order memtable flushes. When flushing commit log segments, ensure we flush the latest memtable. Refs #2074 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-13 22:51:25 +02:00
Paweł Dziepak	ec689b2fe1	Merge "utils: minor fixes in the loading_cache class" from Vlad "This series aims to fix the "serving invalid (old) values" issue in the loading_cache (issue #2590) by arming the timer with a period that equals min(expire, refresh). We are still trying to optimize the main case where 'expire' is significantly longer than 'refresh' period. We don't want to add any additional logic in the fast path and this series gives the immediate solution for the issue above while not adding any additional CPU cycle to the fast path." * 'loading_cache_short_expired-v2' of https://github.com/vladzcloudius/scylla: utils::loading_cache: arm the timer with a period equal to min(_expire, _update) utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source	2017-07-13 16:58:53 +01:00
Vlad Zolotarov	45e23d8090	db::config: fix the permissions cache related parameters description Make the descriptions of permissions_validity_in_ms, permissions_update_interval_in_ms and permissions_cache_max_entries more readable and more related to what they really do. Mention the non-zero value requirement for the permissions_update_interval_in_ms and the permissions_cache_max_entries when the permissions cache is enabled. Adjust the parameters description in the scylla.yaml too. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1499957053-31792-1-git-send-email-vladz@scylladb.com>	2017-07-13 16:00:40 +01:00
Vlad Zolotarov	76ea74f3fd	utils::loading_cache: arm the timer with a period equal to min(_expire, _update) Arm the timer with a period that is not greater than either the permissions_validity_in_ms or the permissions_update_interval_in_ms in order to ensure that we are not stuck with the values older than permissions_validity_in_ms. Fixes #2590 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-07-13 10:48:59 -04:00
Vlad Zolotarov	121e3c7b8f	utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-07-13 10:42:12 -04:00
Tomasz Grabiec	30ec4af949	legacy_schema_migrator: Fix calculation of is_dense Current algorithm was marking tables with regular columns not named "value" as not dense, which doesn't have to be the case. It can be either way. It should be enough to look at clustering components. If there is a clustering key, then table is dense if and only if all comparator components belong to the clustering key. If there is no clustering key, then if there are any regular columns we're sure it's not dense. Fixes #2587. Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>	2017-07-13 17:28:09 +03:00
Jesse Haber-Kucharsky	8fa47b74e8	cql: Add definition of underlying type for durations Cassandra 3.10 added the `duration` type [1], intended to manipulate date-time values with offsets (for example, `now() - 2y3h`). The full implementation of the `duration` type in Scylla requires support for version 5 of the binary protocol, which is not yet available. In the meantime, this patch adds the implementation of the underlying type for the eventual `duration` type. Included is also the ported test suite from the reference implementation and additional tests. Related to #2240. [1] https://issues.apache.org/jira/browse/CASSANDRA-11873 Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <b1e481da103efee82106bf31f261c5a1f4f8d9ca.1499885803.git.jhaberku@scylladb.com>	2017-07-13 17:26:00 +03:00
Tomasz Grabiec	54953c8d27	gdb: Fix "scylla columnfamilies" command Broken in `0e4d5bc2f3`. Message-Id: <1499951956-26206-1-git-send-email-tgrabiec@scylladb.com>	2017-07-13 16:33:32 +03:00
Amnon Heiman	45b3e8cd11	query_options: Allows creating query_options from query_options query_options object cannot be changed after it was created. For internal uses, like internal query paging, it is needed to create a new object based on some of the data from an existing one with a new paging state. This patch adds a constructor from a unique_ptr and paging state. using unique_ptr behave similar to move modify constructor. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-07-13 14:02:11 +03:00
Duarte Nunes	3df6777b9b	database: Load views after loading tables Since base tables no longer look for their views, we need to parse base tables first so that when we add a view we can fetch and connect it to its base table. When announcing view table mutations to other nodes we always include the base table mutations, so there's no need to expect a view being added before its base table. Found out while testing view building. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170712172115.2960-1-duarte@scylladb.com>	2017-07-13 11:14:02 +02:00
Avi Kivity	4704a78332	tests: remove bad constexpr in sstable_datafile_test std::ceil() is not constexpr. Found by clang.	2017-07-12 17:14:13 +03:00
Avi Kivity	67a5e10218	Merge seastar upstream * seastar a2be7a4...ff34c42 (3): > tls: Wrap all IO in semaphore (Fixes #2575) > tests/lowres_clock_test.cc: Declare helper static > tests/lowres_clock_test.cc: fix compilation error for older GCC	2017-07-12 10:19:55 +03:00
Avi Kivity	a397889c81	Merge "Preserve table schema digest on schema tables migration" from Tomasz "Currently new nodes calculate digests based on v3 schema mutations, which are very different from v2 mutations. As a result they will use schemas with different table_schema_version than the old nodes. The old nodes will not recognize the version and will try to request its definition. That will fail, because old nodes don't understand v3 schema mutations. To fix this problem, let's preserve the digests during migration, so that they're the same on new and old nodes. This will allow requests to proceed as usual. This does not solve the problem of schema being changed during the rolling upgrade. This is not allowed, as it would bring the same problem back. Fixes #2549." * tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev: tests: Add test for concurrent column addition legacy_schema_migrator: Set digest to one compatible with the old nodes schema_tables: Persist table_schema_version schema_tables: Introduce system_schema.scylla_tables schema_tables: Simplify read_table_mutations() schema_tables: Resurrect v2 read_table_mutations() system_keyspace: Forward-declare legacy schemas legacy_schema_migrator: Take storage_proxy as dependency	2017-07-11 17:22:42 +03:00
Raphael S. Carvalho	7dbfebb7dc	lcs: remove conditional limit for partial sort Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170711140241.11023-2-raphaelsc@scylladb.com>	2017-07-11 17:18:32 +03:00
Raphael S. Carvalho	ebb5dafef0	lcs: remove useless filter for demotion procedure there's no way a sstable from a level higher than N+1 will be in set of candidates that can be either level N or level N + 1. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170711140241.11023-1-raphaelsc@scylladb.com>	2017-07-11 17:18:31 +03:00
Botond Dénes	33bc62a9cf	Fix crash in the out-of-order restrictions error msg composition Use name of the existing preceding column with restriction (last_column) instead of assuming that the column right after the current column already has restrictions. This will yield an error message that is different from that of Cassandra, albeit still a correct one. Fixes #2421 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>	2017-07-11 17:15:33 +03:00
Gleb Natapov	f88723e739	storage_proxy: pass pending_endpoints by reference instead of by value This makes lifetime of dead_endpoints object more clear and move() also has its price. Message-Id: <20170710084549.GX2324@scylladb.com>	2017-07-11 16:52:21 +03:00
Gleb Natapov	739dd878e3	consistency_level: report less live endpoints in Unavailable exception if there are pending nodes DowngradingConsistencyRetryPolicy uses live replicas count from Unavailable exception to adjust CL for retry, but when there are pending nodes CL is increased internally by a coordinator and that may prevent retried query from succeeding. Adjust live replica count in case of pending node presence so that retried query will be able to proceed. Fixes #2535 Message-Id: <20170710085238.GY2324@scylladb.com>	2017-07-11 16:51:56 +03:00
Avi Kivity	3b7fde18cf	Merge "improvements for leveled strategy manifest" from Raphael "most of changes are to improve maintainability of the strategy but the ones that are introduced by the following patches: lcs: do not check if level 0 can be promoted twice lcs: remove quadratic behavior from L0 compaction lcs: partially sort candidates that will be trimmed lcs: only demote sstable from level higher than target one" * 'lcs_improvements_2' of github.com:raphaelsc/scylla: lcs: only demote sstable from level higher than target one lcs: improve indentation for get_overlapping_starved_sstables lcs: improve indentation for get_compaction_candidates lcs: partially sort candidates that will be trimmed lcs: remove quadratic behavior from L0 compaction lcs: introduce private interface lcs: make some member functions static lcs: make some functions const qualified lcs: remove add method lcs: extract code for higher levels compaction from get_candidates_for lcs: simplify code to get candidates for higher levels lcs: extract round-robin heuristic for even distribution of keys into function lcs: update outdated comments for level 0 compaction lcs: improve worth_promoting_L0_candidates interface lcs: do not check if level 0 can be promoted twice lcs: extract code for level 0 compaction from get_candidates_for	2017-07-11 16:38:50 +03:00
Paweł Dziepak	5aa523aaf9	transport: send correct type id for counter columns CQL reply may contain metadata that describes columns present in the response including the information about their type. However, Scylla incorrectly reports counter types as bigint. The serialised format of counters and bigint is exactly the same, which could explain why the problem hasn't been noticed earlier but it is a bug nevertheless. Fixes #2569. Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>	2017-07-11 16:21:49 +03:00
Tomasz Grabiec	6d53cb7ab5	tests: Add test for concurrent column addition	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	f5909ec515	legacy_schema_migrator: Set digest to one compatible with the old nodes Calculate and set digest using v2 mutations so that digests are the same before and after migration. This is needed so that no schema definition exchange is required during rolling upgrade. Fixes #2549.	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	5b69d99bf8	schema_tables: Persist table_schema_version When migrating schema tables from v2 to v3, mutations underlying table schema will change, and so will their digest. However, we want the digest to be the same on new nodes as on the old nodes, because schema exchange is not possible between the two nodes, so they must not need to request schema definitions from each other. The solution is to make the digest persistable, so that it sticks to given table schema, surviving both migration and node restarts. On migration from v2, the digest will be calculated from v2 mutations, so it will be the same on new and old nodes.	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	cdf5b67522	schema_tables: Introduce system_schema.scylla_tables It will be used to store Scylla specific table metadata. We cannot store it in the standard "tables" table for compatibility reasons - Cassandra will fail to read schema if it encounters columns it is not expecting.	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	cdcdf4772f	schema_tables: Simplify read_table_mutations()	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	6e62bc77f1	schema_tables: Resurrect v2 read_table_mutations()	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	4b5818a404	system_keyspace: Forward-declare legacy schemas	2017-07-11 14:52:23 +02:00
Tomasz Grabiec	8624edc0fa	legacy_schema_migrator: Take storage_proxy as dependency Will be needed to query for mutations.	2017-07-11 14:52:23 +02:00
Raphael S. Carvalho	6aa2e5be17	lcs: only demote sstable from level higher than target one if we are compacting level 1 into level 2, we only want to demote a sstable from level 3 or higher. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:42 -03:00
Raphael S. Carvalho	53b72b473e	lcs: improve indentation for get_overlapping_starved_sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:40 -03:00
Raphael S. Carvalho	3639b48d7b	lcs: improve indentation for get_compaction_candidates Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:38 -03:00
Raphael S. Carvalho	5a8b8a6ccb	lcs: partially sort candidates that will be trimmed Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:37 -03:00
Raphael S. Carvalho	8334086441	lcs: remove quadratic behavior from L0 compaction L0 compaction triggers quadratic behavior when many newly created sstables are needed for promotion due to their size being low relative to the max sstable size parameter. So until L0 is worth promoting, the strategy will compact every new sstable with all the existing ones in L0. To fix it, let's do STCS on level 0 until it becomes worth promoting. Fixes #2432. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:35 -03:00
Raphael S. Carvalho	80f1dca328	lcs: introduce private interface Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:33 -03:00
Raphael S. Carvalho	bc71f97116	lcs: make some member functions static Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:32 -03:00
Raphael S. Carvalho	f4b733efe4	lcs: make some functions const qualified Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:28 -03:00
Raphael S. Carvalho	ede0ee16b2	lcs: remove add method Its code can be inlined because no one besides create() calls it Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:26 -03:00
Raphael S. Carvalho	00ef528e5b	lcs: extract code for higher levels compaction from get_candidates_for Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:25 -03:00
Raphael S. Carvalho	a46b73c401	lcs: simplify code to get candidates for higher levels get rid of unneeded loop for dealing with suspect sstables and std::advance because vector allows random access. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:19 -03:00
Raphael S. Carvalho	e954af0f0f	lcs: extract round-robin heuristic for even distribution of keys into function Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:15 -03:00
Raphael S. Carvalho	3c0028d921	lcs: update outdated comments for level 0 compaction some comments are no longer relevant, especially the ones that talk about dealing with busy sstables due to parallel compaction, which isn't done by us for lcs. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:07 -03:00
Raphael S. Carvalho	62607ba36a	lcs: improve worth_promoting_L0_candidates interface Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:35:00 -03:00
Raphael S. Carvalho	c1e42f6528	lcs: do not check if level 0 can be promoted twice can_promote flag will be used to carry info about whether or not level 0 can be promoted. That will avoid a single iteration for higher levels too which can contain tens of thousands of sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:34:49 -03:00
Raphael S. Carvalho	887aab4ae7	lcs: extract code for level 0 compaction from get_candidates_for I will split code for higher levels compaction into functions first before putting it into its own function too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-11 09:34:41 -03:00
Pekka Enberg	ed3c62704e	transport/server: Kill unused functions Message-Id: <1499773755-27920-1-git-send-email-penberg@scylladb.com>	2017-07-11 14:57:54 +03:00
Glauber Costa	780a6e4d2e	change task quota's default The default of 2ms is somewhat arbitrary. Now that we have a lot more mileage deploying Scylla applications in production it does sound not only arbitrary, but high. In particular, it is really hard to achieve 1ms latencies in the face of CPU-heavy workloads with it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>	2017-07-11 13:50:39 +03:00
Avi Kivity	7147808797	Merge seastar upstream * seastar 89cc97c...a2be7a4 (3): > configure.py: verifies boost version > pkg-config: Eliminate spaces in include path arguments > allow applications to override task-quota-ms	2017-07-11 13:50:06 +03:00
Botond Dénes	f18f724f1c	Generate an error when CONTAINS is used on a non-collection column Fixes #2255 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <517bb6268ac213aed9a1def231614c2e88f77c9f.1499764183.git.bdenes@scylladb.com>	2017-07-11 11:30:49 +02:00
Tomasz Grabiec	310d2a54d2	legacy_schema_migrator: Use separate joinpoint instance for each table Otherwise we may deadlock, as explained in commit `5e8f0efc8`: Table drop starts with creating a snapshot on all shards. All shards must use the same snapshot timestamp which, among other things, is part of the snapshot name. The timestamp is generated using supplied timestamp generating function (joinpoint object). The joinpoint object will wait for all shards to arrive and then generate and return the timestamp. However, we drop tables in parallel, using the same joinpoint instance. So joinpoint may be contacted by snapshotting shards of tables A and B concurrently, generating timestamp t1 for some shards of table A and some shards of table B. Later the remaining shards of table A will get a different timestamp. As a result, different shards may use different snapshot names for the same table. The snapshot creation will never complete because the sealing fiber waits for all shards to signal it, on the same name. Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>	2017-07-11 11:21:45 +02:00
Avi Kivity	7b4412c3ce	Revert "Merge "improvements for leveled strategy manifest" from Raphael" This reverts commit `43a3e718e6`, reversing changes made to `3813e94b0a`. It contains some unrelated commits.	2017-07-11 11:12:53 +03:00
Avi Kivity	43a3e718e6	Merge "improvements for leveled strategy manifest" from Raphael "most of changes are to improve maintainability of the strategy but the ones that are introduced by the following patches: lcs: do not check if level 0 can be promoted twice lcs: remove quadratic behavior from L0 compaction lcs: partially sort candidates that will be trimmed lcs: only demote sstable from level higher than target one" * 'lcs_improvements' of github.com:raphaelsc/scylla: (21 commits) lcs: only demote sstable from level higher than target one lcs: improve indentation for get_overlapping_starved_sstables lcs: improve indentation for get_compaction_candidates lcs: partially sort candidates that will be trimmed lcs: remove quadratic behavior from L0 compaction lcs: introduce private interface lcs: make some member functions static lcs: make some functions const qualified lcs: remove add method lcs: extract code for higher levels compaction from get_candidates_for lcs: simplify code to get candidates for higher levels lcs: extract round-robin heuristic for even distribution of keys into function lcs: update outdated comments for level 0 compaction lcs: improve worth_promoting_L0_candidates interface lcs: do not check if level 0 can be promoted twice lcs: extract code for level 0 compaction from get_candidates_for dist/offline_installer: add --skip-setup option to offline installer dist/offline_installer/debian: install python-minimal package before installing scylla deps migration_manager: Give empty response to schema pulls from incompatible nodes migration_manager: Don't pull schema from incompatible nodes ...	2017-07-11 11:08:12 +03:00
Botond Dénes	3813e94b0a	Add Cql.tokens and KDevelop project files to .gitignore Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <ae4935d2ac0c92287022f677c3e66757c0861e13.1499753032.git.bdenes@scylladb.com>	2017-07-11 10:21:00 +03:00
Botond Dénes	61c5c2a175	transport: Fix accept typo in debug log message Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <d2f9269f25ace6579a6fbe6b99f4da60a05beac8.1499753306.git.bdenes@scylladb.com>	2017-07-11 09:16:35 +03:00
Raphael S. Carvalho	8b9686e621	lcs: only demote sstable from level higher than target one if we are compacting level 1 into level 2, we only want to demote a sstable from level 3 or higher. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 16:10:42 -03:00
Raphael S. Carvalho	0d0699e06e	lcs: improve indentation for get_overlapping_starved_sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 16:01:31 -03:00
Raphael S. Carvalho	cda2b18f83	lcs: improve indentation for get_compaction_candidates Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:55:43 -03:00
Raphael S. Carvalho	ca1c6fd9ca	lcs: partially sort candidates that will be trimmed Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:45:26 -03:00
Raphael S. Carvalho	28ebe1807f	lcs: remove quadratic behavior from L0 compaction L0 compaction triggers quadratic behavior when many newly created sstables are needed for promotion due to their size being low relative to the max sstable size parameter. So until L0 is worth promoting, the strategy will compact every new sstable with all the existing ones in L0. To fix it, let's do STCS on level 0 until it becomes worth promoting. Fixes #2432. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:42:28 -03:00
Raphael S. Carvalho	0392dc5d23	lcs: introduce private interface Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:13:56 -03:00
Raphael S. Carvalho	dd9c9341be	lcs: make some member functions static Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:13:55 -03:00
Raphael S. Carvalho	408a7f902a	lcs: make some functions const qualified Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:47 -03:00
Raphael S. Carvalho	7cba6548e2	lcs: remove add method Its code can be inlined because no one besides create() calls it Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:47 -03:00
Raphael S. Carvalho	0a9fcc6202	lcs: extract code for higher levels compaction from get_candidates_for Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:47 -03:00
Raphael S. Carvalho	8709365d84	lcs: simplify code to get candidates for higher levels get rid of unneeded loop for dealing with suspect sstables and std::advance because vector allows random access. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:47 -03:00
Raphael S. Carvalho	1a9fc835a0	lcs: extract round-robin heuristic for even distribution of keys into function Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:40 -03:00
Raphael S. Carvalho	258ed0afbd	lcs: update outdated comments for level 0 compaction some comments are no longer relevant, especially the ones that talk about dealing with busy sstables due to parallel compaction, which isn't done by us for lcs. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:30 -03:00
Raphael S. Carvalho	97b5cf94d8	lcs: improve worth_promoting_L0_candidates interface Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:30 -03:00
Raphael S. Carvalho	8f418e9864	lcs: do not check if level 0 can be promoted twice can_promote flag will be used to carry info about whether or not level 0 can be promoted. That will avoid a single iteration for higher levels too which can contain tens of thousands of sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:30 -03:00
Raphael S. Carvalho	6785d83c02	lcs: extract code for level 0 compaction from get_candidates_for I will split code for higher levels compaction into functions first before putting it into its own function too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-10 15:10:30 -03:00
Takuya ASADA	abd2b6bd6f	dist/offline_installer: add --skip-setup option to offline installer To use the offline installer in a non-interactive way, add an option to skip scylla_setup (which runs in interactive mode). Fixes #2533 Signed-off-by: Takuya ASADA <syuu@scylladb.com> [ penberg: clean up diff noise ] Message-Id: <1499387981-11814-1-git-send-email-syuu@scylladb.com>	2017-07-10 15:10:30 -03:00
Takuya ASADA	e1a2be28d2	dist/offline_installer/debian: install python-minimal package before installing scylla deps To prevent dependency error, we need to install python-minimal manually. Fixes #2553 Signed-off-by: Takuya ASADA <syuu@scylladb.com> [ penberg: clean up diff noise ] Message-Id: <1499387587-9032-1-git-send-email-syuu@scylladb.com>	2017-07-10 15:10:30 -03:00
Tomasz Grabiec	fa17c2a59b	migration_manager: Give empty response to schema pulls from incompatible nodes The old nodes which are still using v2 schema tables will fail to apply our response, with error messages complaining about not being able to locate schema of certain versions (new schema tables). This change inhibits such errors by responding with an empty mutation list.	2017-07-10 15:10:30 -03:00
Tomasz Grabiec	8e8a26ef1b	migration_manager: Don't pull schema from incompatible nodes Currently it results in scary error messages in logs about not being able to find schema of given version. It's benign, but may scare users. In the future incompatibilities could result in more subtle errors. Better to inhibit it completely.	2017-07-10 15:10:30 -03:00
Tomasz Grabiec	b2f52454b9	service: Advertise schema tables format version through gossip Will be needed to inhibit schema exchange on per-peer basis.	2017-07-10 15:10:30 -03:00
Takuya ASADA	bf49dd8aa1	dist/offline_installer: add --skip-setup option to offline installer To use the offline installer in a non-interactive way, add an option to skip scylla_setup (which runs in interactive mode). Fixes #2533 Signed-off-by: Takuya ASADA <syuu@scylladb.com> [ penberg: clean up diff noise ] Message-Id: <1499387981-11814-1-git-send-email-syuu@scylladb.com>	2017-07-10 16:02:11 +03:00
Takuya ASADA	1f97e5b3f4	dist/offline_installer/debian: install python-minimal package before installing scylla deps To prevent dependency error, we need to install python-minimal manually. Fixes #2553 Signed-off-by: Takuya ASADA <syuu@scylladb.com> [ penberg: clean up diff noise ] Message-Id: <1499387587-9032-1-git-send-email-syuu@scylladb.com>	2017-07-10 16:01:26 +03:00
Avi Kivity	91221e020b	Merge "Silence schema pull errors during upgrade from 1.7 to 2.0" from Tomasz "Old and new nodes will advertise different schema version because of different format of schema tables. This will result in attempts to sync the schema by each of the nodes. Currently this will result in scary error messages in logs about sync failing due to not being able to find schema of given version. It's benign, but may scare users. In the future incompatibilities could result in more subtle errors. Better to inhibit it completely." * 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev: migration_manager: Give empty response to schema pulls from incompatible nodes migration_manager: Don't pull schema from incompatible nodes service: Advertise schema tables format version through gossip	2017-07-10 14:04:04 +03:00
Pekka Enberg	8112d7c5c0	idl: Fix frozen_schema version numbers The IDL changes will appear in 2.0 so fix up the version numbers. Message-Id: <1499680669-6757-1-git-send-email-penberg@scylladb.com>	2017-07-10 14:02:20 +03:00
Avi Kivity	06b7ec6901	install-dependencies.sh: add snappy	2017-07-10 13:25:57 +03:00
Avi Kivity	7ddd322bce	Add install-dependencies.sh Easier to get started when a script installs all the build dependencies. Message-Id: <20170710101657.12574-1-avi@scylladb.com>	2017-07-10 12:21:02 +02:00
Botond Dénes	e0d0f9f30c	Make the CMakeLists.txt's IDE marker generic To allow some other IDEs (e.g. KDevelop, QtCreator) to use the cmake file in a convenient manner. Keep the existing CLIEN_IDE marker to not break existing workflows. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <5ecf8c0e8a242cc8ebb0d803547bead4dadc38e2.1499667807.git.bdenes@scylladb.com>	2017-07-10 12:21:02 +02:00
Botond Dénes	66cbc45321	Add text(sstring) version of count, max and min functions Fixes #2459 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <b6abb97f21c0caea8e36c7590b92a12d148195db.1499666251.git.bdenes@scylladb.com>	2017-07-10 09:06:15 +03:00
Tomasz Grabiec	72e01b7fe8	tests: commitlog: Check there are no segments left on disk after clean shutdown Reproduces #2550. Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com>	2017-07-09 19:25:27 +03:00
Tomasz Grabiec	6555a2f50b	commitlog: Discard active but unused segments on shutdown So that they are not left on disk even though we did a clean shutdown. First part of the fix is to ensure that closed segments are recognized as not allocating (_closed flag). Not doing this prevents them from being collected by discard_unused_segments(). Second part is to actually call discard_unused_segments() on shutdown after all segments were shut down, so that those whose position are cleared can be removed. Fixes #2550. Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>	2017-07-09 19:25:22 +03:00
Tomasz Grabiec	d33d29ad95	legacy_schema_migrator: Drop tables instead of truncate()+remove() It achieves similar effect, but is safer than non-standard remove() path. The latter was missing unregistration from compaction manager. Fixes 2554. Message-Id: <1499447165-30253-1-git-send-email-tgrabiec@scylladb.com>	2017-07-09 18:36:44 +03:00
Duarte Nunes	136accdbf6	database: Fix typos in metric descriptions Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170709145522.19534-1-duarte@scylladb.com>	2017-07-09 18:35:17 +03:00
Raphael S. Carvalho	7f7758fb6f	tests/sstable: make sstable_expired_data_ratio more robust this change will stress histogram ability to return a good estimation after merging keys such that it doesn't grow beyond size limit. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170708205713.5958-1-raphaelsc@scylladb.com>	2017-07-09 10:33:10 +03:00
Botond Dénes	4f6b2a1ff0	transport: Move "accept failed" message to the debug log Fixes #2518 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <492ea8a916bb3b2427f6cc16a4f6eadadaa30b10.1499418234.git.bdenes@scylladb.com>	2017-07-08 10:59:03 +03:00
Takuya ASADA	09aeb2aabe	dist/debian/pbuilderrc: merge Debian releases Merge duplicated lines to simplify pbuilderrc. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1499454242-3716-1-git-send-email-syuu@scylladb.com>	2017-07-08 10:56:54 +03:00
Tomasz Grabiec	07ed512060	migration_manager: Give empty response to schema pulls from incompatible nodes The old nodes which are still using v2 schema tables will fail to apply our response, with error messages complaining about not being able to locate schema of certain versions (new schema tables). This change inhibits such errors by responding with an empty mutation list.	2017-07-07 19:09:57 +02:00
Tomasz Grabiec	5f613d0527	migration_manager: Don't pull schema from incompatible nodes Currently it results in scary error messages in logs about not being able to find schema of given version. It's benign, but may scare users. In the future incompatibilities could result in more subtle errors. Better to inhibit it completely.	2017-07-07 19:08:59 +02:00
Tomasz Grabiec	18a9e1762c	service: Advertise schema tables format version through gossip Will be needed to inhibit schema exchange on per-peer basis.	2017-07-07 19:07:59 +02:00
Piotr Jastrzebski	a4b6cfe8f0	row_cache: use continuity info in single partition queries If a query requests for a single partition that is inside a range that has already been queried, use the continuity info and don't go to disk when it's not needed. Fixes #2244. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Piotr Jastrzebski	b950c59bbb	row_cache: Fix wrong comment on continuity flag This comment was stating exactly the opposite of the truth. This is very misleading. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <79062a061e22ef4c4add24cbdf723cbfb5cda060.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Piotr Jastrzebski	70f4b23876	row_cache_test: Add test to reproduce issue 2544 This test checks that cache should use continuity information for single partition queries inside a range that has already been queried. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2ebd03ff5366e554d520f86da8054e0b9eff4178.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Avi Kivity	ecda97edeb	Merge seastar upstream * seastar c848486...89cc97c (2): > future-utils: fix do_for_each exception reporting > core/thread: Fix unwind information for seastar threads	2017-07-06 17:28:29 +03:00
Jesse Haber-Kucharsky	4f838a82e2	Add guide for getting started with development ("hacking") This change adds the start of what will hopefully be a continually evolving and improving document for helping developers and contributors to get started with Scylla development. The first part of the document is general advice and information that is broadly applicable. The second part is an opinionated example of a particular work-flow and set of tools. This is intended to serve as a starting point and inspire contributors to develop their own work-flow. The section on branching is marked "TODO" for now, and will be addressed by a subsequent change. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <470a542a92aff20d6205fb94b3fb26168735ae6f.1499319310.git.jhaberku@scylladb.com>	2017-07-06 15:59:16 +03:00
Duarte Nunes	3dd0397700	wrapping_range: Fix lvalue transform() Instead of copying and moving the bound, pass it by reference so the transformer can decide whether it wants to copy or not. The only caller so far doesn't want a copy and takes the value by reference, which would be capturing a temporary value. Caught by the view_schema_test with gcc7. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170705210255.29669-1-duarte@scylladb.com>	2017-07-06 15:47:49 +03:00
Raphael S. Carvalho	ff50b57761	dist: fix spelling mistakes in dev-mode.conf Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170705202054.4614-1-raphaelsc@scylladb.com>	2017-07-06 15:08:17 +03:00
Botond Dénes	b1082641f9	Make sure keyspace strategy class is stored in qualified form Even when it's provided in unqualified (short) form. Fixes #767 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <4379f8864843e64c097d432fd06129ce4025f100.1499322476.git.bdenes@scylladb.com>	2017-07-06 14:50:00 +03:00
Botond Dénes	c4277d6774	cql3: Add K_FROZEN and K_TUPLE to basic_unreserved_keyword To allow the non-reserved keywords "frozen" and "tuple" to be used as column names without double-quotes. Fixes #2507 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <9ae17390662aca90c14ae695c9b4a39531c6cde6.1499329781.git.bdenes@scylladb.com>	2017-07-06 12:25:38 +03:00
Avi Kivity	a6d9cf09a7	build: fix excessive stack usage in CqlParser in debug mode The state machines generated by antlr allocate many local variables per function. In release mode, the stack space occupied by the variables is reused, but in debug build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes a single function to take above 100k of stack space. Fix by hacking the generated code to use just one variable. Fixes #2546 Message-Id: <20170704135824.13225-1-avi@scylladb.com>	2017-07-05 23:05:26 +02:00
Duarte Nunes	d583ef6860	thrift/handler: Remove leftover debug artifacts Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170705161156.2307-1-duarte@scylladb.com>	2017-07-05 19:57:07 +03:00
Takuya ASADA	6d0bd01e0f	dist/offline_installer/redhat: enable EPEL repo before try to install makeself To prevent a yum install error, we need to enable the EPEL repo before installing makeself. Fixes #2508 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1499196715-19710-1-git-send-email-syuu@scylladb.com>	2017-07-05 09:50:03 +03:00
Takuya ASADA	71624d7919	dist/common/scripts/scylla_raid_setup: prevent renaming MDRAID device after reboot On Debian variants, mdadm.conf should be placed at /etc/mdadm instead of /etc. Also it seems we need update-initramfs to fix the renaming issue. Fixes #2502 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1499179912-14125-1-git-send-email-syuu@scylladb.com>	2017-07-04 18:07:20 +03:00
Avi Kivity	b1a0e37fcb	Merge "Adjust row cache metrics for row granularity" from Tomasz * tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev: row_cache: Switch _stats.hits/misses to row granularity row_cache: Rename num_entries() to partitions() for clarity row_cache: Track mispopulations also at row level row_cache: Track row insertions row_cache: Track row hits and misses row_cache: Make mispopulation counter also apply for continuity information row_cache: Add partition_ prefix to current counters misc_services: Switch to using reads_with[_no]_misses counters row_cache: Add metrics for operations on underlying reader row_cache: Add reader-related metrics row_cache: Remove dead code	2017-07-04 15:20:25 +03:00
Tomasz Grabiec	37d2b6b3c6	row_cache: Switch _stats.hits/misses to row granularity Those are exported by the RESTful APIs called "get_row_hits/get_row_misses" and reported by nodetool.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	62c76abf71	row_cache: Rename num_entries() to partitions() for clarity	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	60c2a86192	row_cache: Track mispopulations also at row level	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	94547db620	row_cache: Track row insertions	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a58f2c8640	row_cache: Track row hits and misses	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	77b2a92ece	row_cache: Make mispopulation counter also apply for continuity information	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a5fdff2ac2	row_cache: Add partition_ prefix to current counters In preparation for adding per-row counters.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	ae4b24db06	misc_services: Switch to using reads_with[_no]_misses counters They better approximate the intended meaning than hits/misses, which according to Gleb is whether a read did any I/O or not.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	6a22cbceaf	row_cache: Add metrics for operations on underlying reader	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	5c7b6fc164	row_cache: Add reader-related metrics	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	be2e89d596	row_cache: Remove dead code	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	e720b317c9	row_cache: Restore update of concurrent_misses_same_key It was lost in action in `6f6575f456`. Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>	2017-07-04 14:51:05 +03:00
Avi Kivity	66e56511d6	Merge "Use selective_token_range_sharder in repair" from Asias "This series introduces selective_token_range_sharder and uses it in repair to generate the dht::token_range ranges that belong to a specific shard." * tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev: repair: Use selective_token_range_sharder tests: Add test_selective_token_range_sharder dht: Add selective_token_range_sharder	2017-07-04 14:14:33 +03:00
Asias He	b10e961a64	repair: Use selective_token_range_sharder With this change, we ask all the shards to handle the ranges provided by the user, and we use selective_token_range_sharder to split the ranges and ignore the ranges that do not belong to the current shard.	2017-07-04 18:46:19 +08:00
Asias He	2a794db61b	tests: Add test_selective_token_range_sharder	2017-07-04 18:46:19 +08:00
Asias He	d835cf2748	dht: Add selective_token_range_sharder It is like ring_position_range_sharder but it works with dht::token_range. This sharder will return the ranges that belong to a selected shard.	2017-07-04 18:46:19 +08:00
Tomasz Grabiec	1d6fec0755	row_cache: Drop not very useful prefixes from metric names This drops "total_operations_" and "objects_" prefixes. There is no convention of adding them in other parts of the system, and they don't add much value. Fixes scylladb/scylla-grafana-monitoring#169. Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>	2017-07-04 13:37:12 +03:00
Nadav Har'El	d95f908586	Fix test to use non-wrapping range The test put a wrapping range into a non-wrapping range variable. This was harmless at the time this test was written, but newer code may not be as forgiving so better use a non-wrapping range as intended. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170704103128.29689-1-nyh@scylladb.com>	2017-07-04 13:36:29 +03:00
Avi Kivity	07b8adce0e	sstables: fix use-after-free in read_simple() `r` is moved-from, and later captured in a different lambda. The compiler may choose to move and perform the other capture later, resulting in a use-after-free. Fix by copying `r` instead of moving it. Discovered by sstable_test in debug mode. Message-Id: <20170702082546.20570-1-avi@scylladb.com>	2017-07-04 10:24:07 +02:00
Raphael S. Carvalho	7b777fe2e3	sstables/lcs: choose sstable with highest droppable tombstone ratio Currently, lcs will choose, for tombstone compaction, the sstable with the lowest ratio among those whose ratio is above the threshold (0.2 by default). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170703185633.6644-1-raphaelsc@scylladb.com>	2017-07-04 10:25:10 +03:00
Avi Kivity	bcf7867ac9	Merge "small fixes and cleanup for leveled strategy (part 2)" from Raphael * 'lcs_improvements_part_2' of github.com:raphaelsc/scylla: lcs: Match estimated tasks arithmetic to score in LCS lcs: prevent leveled_compaction_strategy.hh from being included more than once lcs: use vector instead for storing a level of sstables compaction: keep only one variant of size_tiered_most_interesting_bucket lcs: get rid of unused code in leveled_manifest	2017-07-04 10:10:53 +03:00
Raphael S. Carvalho	7606ffd744	lcs: Match estimated tasks arithmetic to score in LCS Contains fix for CASSANDRA-8904. Added TARGET_SCORE to get rid of magic number for target score which is now used more than once. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-04 03:35:02 -03:00
Raphael S. Carvalho	dfb5463478	lcs: prevent leveled_compaction_strategy.hh from being included more than once Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-04 03:35:00 -03:00
Raphael S. Carvalho	db98ab6aaf	lcs: use vector instead for storing a level of sstables list is no longer needed because lcs no longer moves a sstable breaking invariant at its level to level 0. Now lcs incrementally restores invariant by compacting together first set of overlapping tables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-04 03:34:57 -03:00
Raphael S. Carvalho	b350352e6c	compaction: keep only one variant of size_tiered_most_interesting_bucket two variants of size_tiered_most_interesting_bucket existed to avoid copy, but subsequent work will make lcs use vector for each level of sstables, so let's only keep one variant. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-04 03:34:51 -03:00
Raphael S. Carvalho	5921600b95	lcs: get rid of unused code in leveled_manifest Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-04 03:34:34 -03:00
Nadav Har'El	d177ec05cb	repair: further limit parallelism of checksum calculation Repair today has a semaphore limiting the number of ongoing checksum comparisons running in parallel (on one shard) to 100. We needed this number to be fairly high, because a "checksum comparison" can involve high latency operations - namely, sending an RPC request to another node in a remote DC and waiting for it to calculate a checksum there, and while waiting for a response we need to proceed calculating checksums in parallel. But as a consequence, in the current code, we can end up with as many as 100 fibers all at the same stage of reading partitions to checksum from sstables. This requires tons of memory, to hold at least 128K of buffer (even more with read-ahead) for each of these fibers, plus partition data for each. But doing 100 reads in parallel is pointless - one (or very few) should be enough. So this patch adds another semaphore to limit the number of checksum calculations (including the read and checksum calculation) on each shard to just 2. There may still be 100 ongoing checksum comparisons, in other stages of the comparisons (sending the checksum requests to other nodes and waiting for them to return), but only 2 will ever be in the stage of reading from disk and checksumming them. The limit of 2 checksum calculations (per shard) applies on the repair slave, not just to the master: The slave may receive many checksum requests in parallel, but will only actually work on 2 at a time. Because the parallelism=100 now rate-limits operations which use very little memory, in the future we can safely increase it even more, to support situations where the disk is very fast but the link between nodes has very high latency. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170703151329.25716-1-nyh@scylladb.com>	2017-07-03 18:14:57 +03:00
Piotr Jastrzebski	80f08921c4	Make table_helper independent from trace_keyspace_helper table_helper is a generic helper that can easily be used in other places. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <11e46dbc1c90d0273a41c8144e6f6013e21efcdb.1499077818.git.piotr@scylladb.com>	2017-07-03 15:55:00 +03:00
Raphael S. Carvalho	972a0237ef	database: restore indentation for cleanup_sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630035324.19881-2-raphaelsc@scylladb.com>	2017-07-03 12:48:54 +03:00
Raphael S. Carvalho	b9d0645199	database: fix potential use-after-free in sstable cleanup when do_for_each is in its last iteration and with_semaphore defers because there's an ongoing cleanup, sstable object will be used after freed because it was taken by ref and the container it lives in was destroyed prematurely. Let's fix it with a do_with, also making code nicer. Fixes #2537. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>	2017-07-03 12:48:53 +03:00
Avi Kivity	5883e85da3	Merge "improve maintainability of compaction strategies" from Raphael "compaction_strategy.cc keeps the full implementation of size tiered, major, and null strategies, and partial implementation of leveled and date tiered strategies. It's a mess. In the future, we will also need space for time window strategy. The file is hard to read and maintain. My goal here is to improve maintainability of the strategies by putting each of them into its own header. NOTE: No semantic change is introduced here." * 'improve_compaction_strategy_maintainability' of github.com:raphaelsc/scylla: compaction_strategy: move dtcs to its existing header compaction_strategy: move lcs implementation to its own header compaction_strategy: move stcs implementation to its own header compaction_strategy: move compaction_strategy_impl to its own header	2017-07-03 11:39:30 +03:00
Takuya ASADA	0c81974bc4	dist/common/systemd: move scylla-server.service to be after network-online.target instead of network.target To make sure Scylla starts after the network is up, we need to move from network.target to network-online.target. Fixes #2337 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493661832-9545-1-git-send-email-syuu@scylladb.com>	2017-07-03 10:01:21 +03:00
Asias He	b2a2fbcf73	repair: Do not store the failed ranges The number of failed ranges can be large so it can consume a lot of memory. We already logged the failed ranges in the log. No need to store them in memory. Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>	2017-07-03 10:00:25 +03:00
Takuya ASADA	1c35549932	dist/common/scripts/scylla_cpuscaling_setup: skip configuration when cpufreq driver doesn't loaded Configuring the cpufreq service on VMs/IaaS causes an error because they don't support cpufreq. To prevent the error, skip the whole configuration when the driver is not loaded. Fixes #2051 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>	2017-07-03 09:59:56 +03:00
Takuya ASADA	e645b0fb13	dist/common/scripts: move EC2 configuration verification to 'scylla_ec2_check' Currently we only have EC2 configuration verification on AMI, so move it to /usr/lib/scylla and run it from scylla_setup, to make it usable for non-AMI users. Fixes #1997 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1498811107-29135-1-git-send-email-syuu@scylladb.com>	2017-07-03 09:59:28 +03:00
Avi Kivity	6895f6e603	sstable_datafile_test: fix sstable_expired_data_ratio failure A comment states that we want the file to be old enough, but sets a timestamp of max(), which is in the future. This may have passed because the conversion from numeric_limits<time_t>::max() to db_clock::time_point is not well defined (their dynamic range is different), so truncation may have converted the large number to a low one. Message-Id: <20170702082903.20879-1-avi@scylladb.com>	2017-07-02 20:22:51 +02:00
Avi Kivity	51b6066212	cql3: operation: correctly format error messages Error messages incorrectly used the debug representation of the receiver, rather than the text representation of the operation itself. Fixes #113. Message-Id: <20170701101325.3163-1-avi@scylladb.com>	2017-07-02 20:06:50 +02:00
Duarte Nunes	d157e4558a	utils/log_histogram: Remove largest() function It should never have existed in the first place, as there are no legitimate callers and it can be misused. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170630095939.2429-1-duarte@scylladb.com>	2017-07-02 14:29:17 +03:00
Gleb Natapov	d23111312f	main: wait for wait_for_gossip_to_settle() to complete during boot Boot should not continue until a future returned by wait_for_gossip_to_settle() is resolved. Commit `991ec4a16` mistakenly broke that, so restore it back. Also fix calls for supervisor::notify() to be in the right places. Message-Id: <20170702082355.GQ14563@scylladb.com>	2017-07-02 11:32:36 +03:00
Avi Kivity	5bc13e4454	Revert "Make table_helper independent from trace_keyspace_helper" This reverts commit `db5bf363d0`. Causes errors of the sort Exiting on unhandled exception: exceptions::invalid_request_exception (Keyspace 'system_traces' does not exist)	2017-07-02 11:30:51 +03:00
Avi Kivity	7c809917b6	compaction_manager: fix debug mode build (periodic_compaction_submission_interval) Turn static constexpr variable into a function.	2017-07-01 19:34:46 +03:00
Avi Kivity	c2c69e003f	compaction: fix build on debug mode (DEFAULT_TOMBSTONE_COMPACTION_INTERVAL) Debug mode wants to allocate storage for a constexpr variable for some reason. Turn it into a function.	2017-07-01 19:26:22 +03:00
Avi Kivity	59f649e2bc	Revert "cql_server::do_accepts: modernize loop" This reverts commit `37af493f6e`. Connections are not accepted and ^C does not work anymore.	2017-07-01 12:54:23 +03:00
Jesse Haber-Kucharsky	1100bb8a5b	cql: Eagerly throw lexing and parsing exceptions Previously, lexing and parsing errors were aggregated while CQL queries were evaluated. Afterwards, the first collected error (if present) would be thrown as an exception. The problem was that when parsing and lexing errors were aggregated this way, the parser would continue even in spite of errors like "no viable alternative". Semantic actions attached to grammar rules would still execute, though with variables that had not yet been initialized. This would crash Scylla. This change modifies the error-handling strategy of CQL parsing. Rather than aggregate errors, we throw an exception on the first error we encounter. This ensures that grammar actions never execute unless there is a precise match. One possible issue with this approach is that the generated C++ code from the ANTLR grammar may not be exception-safe. I compiled Scylla in debug-mode with ASan support and executed several erroneous CQL queries with `cqlsh`. No memory leaks were reported. Fixes #2466. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <db1f650a2bbb615b506d9015486eece45375a440.1498836703.git.jhaberku@scylladb.com>	2017-07-01 12:13:44 +03:00
Raphael S. Carvalho	69a9ad468c	compaction_strategy: move dtcs to its existing header Goal is to improve maintainability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-30 03:50:09 -03:00
Raphael S. Carvalho	4d387475fe	compaction_strategy: move lcs implementation to its own header Goal is to improve maintainability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-30 03:50:07 -03:00
Raphael S. Carvalho	4b46d286fd	compaction_strategy: move stcs implementation to its own header Goal is to improve maintainability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-30 03:50:06 -03:00
Raphael S. Carvalho	0d9bb0da39	compaction_strategy: move compaction_strategy_impl to its own header compaction_strategy.cc keeps the full implementation of size tiered, major, and null strategies, and partial implementation of leveled and date tiered strategies. It's a mess. In the future, we will also need space for time window strategy. The file is hard to read and maintain. My goal here is to eventually improve maintainability of the strategies by putting each of them into its own header. This is the first step towards that goal. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-30 03:50:04 -03:00
Raphael S. Carvalho	9fa855e105	compaction_strategy: use duration type for default tombstone compaction interval Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170630041838.20604-1-raphaelsc@scylladb.com>	2017-06-30 08:56:22 +03:00
Piotr Jastrzebski	db5bf363d0	Make table_helper independent from trace_keyspace_helper table_helper is a generic helper that can easily be used in other places. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <3e360a963d4a53de6d758ba8bada78fc572f001a.1498745600.git.piotr@scylladb.com>	2017-06-29 17:20:07 +03:00
Tomasz Grabiec	97005825bf	row_cache: Fix compilation errors with gcc 5 Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>	2017-06-29 16:34:46 +03:00
Avi Kivity	6da9b6eb81	cql3: error_listener: add virtual destructor Found by Eclipse. Message-Id: <20170629063324.31309-1-avi@scylladb.com>	2017-06-29 10:51:20 +02:00
Avi Kivity	9298fea27b	Merge seastar upstream * seastar 0ab7ae5...c848486 (2): > build: export full cflags in pkgconfig file (Fixes #2439) > configure: Avoid putting tmp file on /tmp	2017-06-29 11:35:24 +03:00
Avi Kivity	fc966c0c4c	Merge "tombstone removal compaction" from Raphael "This feature is intended to make compaction more efficient at getting rid of droppable tombstones and expired data that waste disk space. So far, people have been dealing with it manually through major compaction. With strategies other than date tiered, large sstables will be left untouched for a long time even though it's all expired. Date tiered suffers from it when mixing data with different TTL because it only includes for compaction an sstable that is fully expired. sstables keep a histogram as metadata, which allows us to easily estimate droppable data ratio from gc_before. sstables whose droppable data ratio is above 20% (default value for tombstone_threshold option) will be considered candidates for the operation. Like in C, we will only do tombstone removal compaction when there's nothing to compact in the standard way. It would be interesting to trigger it too when disk usage is above a given threshold, but I decided to leave this for later. Fixes #2306." * 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla: tests: more testing for tombstone compaction options tests: basic tombstone compaction test for date tiered compaction/dtcs: add support for tombstone compaction tests: basic test of tombstone compaction with lcs compaction/lcs: add support for tombstone compaction tests: basic tombstone compaction test for size tiered compaction/stcs: add support for tombstone compaction tests: add test for estimation of droppable tombstone ratio sstables: introduce function to estimate droppable tombstone ratio compaction_manager: periodically submit cfs for compaction streaming_histogram: fix coding style tests: add streaming_histogram_test streaming_histogram: implement sum tests: add test for sstable with bad tombstone histogram sstables: discard bad streaming histogram for future use tests: add sstable tombstone histogram test streaming_histogram: fix update streaming_histogram: move it to utils streaming_histogram: do not limit it to be used by sstables sstables: update tombstone_histogram for cells with expiration time	2017-06-29 10:19:59 +03:00
Avi Kivity	1317c4a03e	Update ami submodule * dist/ami/files/scylla-ami f10db69...5dfe42f (1): > don't fetch perf from amazon repo	2017-06-29 09:38:48 +03:00
Raphael S. Carvalho	ab335c8085	tests: more testing for tombstone compaction options Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	ce4dc15a20	tests: basic tombstone compaction test for date tiered Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	f76ece5349	compaction/dtcs: add support for tombstone compaction Unlike other strategies, dtcs has tombstone compaction disabled by default due to: - deletion shouldn't be used with DTCS; rather data is deleted through TTL. - with time series workloads, it's usually better to wait for whole sstable to be expired rather than compacting a single sstable when it's more than 20% (default value) expired. See CASSANDRA-9234 for more details. For tombstone compaction, unworthy sstables are filtered out and the oldest one is chosen because it's the one less likely to shadow data and it's also relatively big. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	c400bf97b9	tests: basic test of tombstone compaction with lcs Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	70e54cfe6e	compaction/lcs: add support for tombstone compaction LCS will choose its candidate by starting from the highest level and getting the sstable which has the highest droppable tombstone ratio. Unlike STCS, which needs to choose the oldest sstable from the biggest tier, LCS can choose the one with the highest droppable tombstone ratio because sstables in a given level don't overlap. An sstable picked up for tombstone removal compaction won't be demoted or promoted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	138fda468f	tests: basic tombstone compaction test for size tiered Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	8fd80ac22c	compaction/stcs: add support for tombstone compaction Larger sstables are hard to find sstable peers and therefore are left uncompacted for a long time. Expired data and tombstones which can be purged will waste disk space meanwhile. sstable tracks droppable tombstone from which ratio can be calculated. If ratio is greater than threshold (0.2 by default), sstable will be eligible for compaction. Oldest sstables from biggest tiers are preferable because droppable data in them are more likely to satisfy the conditions for purge, like not shadowing data in another sstable. Subsequent patches will add support in leveled and date tiered strategies. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	ad24470972	tests: add test for estimation of droppable tombstone ratio Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	eb6d17b748	sstables: introduce function to estimate droppable tombstone ratio Function used to estimate ratio of droppable tombstone. A tombstone is considered droppable for cells expired before gc_before and regular tombstones older than gc_before. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:08 -03:00
Raphael S. Carvalho	0d21129cc7	compaction_manager: periodically submit cfs for compaction This is useful for a column family which isn't generating new content and will have lots of expired data later on that can be purged. Compaction submission is NO-OP if there's nothing to do, so I think it's reasonable to do it at an interval of 1 hour. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:43:03 -03:00
Raphael S. Carvalho	719dbf547d	streaming_histogram: fix coding style Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:08:12 -03:00
Raphael S. Carvalho	6fb26d9f0c	tests: add streaming_histogram_test Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:08:12 -03:00
Raphael S. Carvalho	a65b9eb8b4	streaming_histogram: implement sum This function is used to estimate number of points in interval [-inf,b]. It will be useful for estimating droppable tombstone ratio in a given sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:08:12 -03:00
Raphael S. Carvalho	c01c659594	tests: add test for sstable with bad tombstone histogram Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:08:12 -03:00
Raphael S. Carvalho	06fabf9810	sstables: discard bad streaming histogram for future use Find bad histogram which had incorrect elements merged due to use of unordered map. The keys will be unordered. A histogram whose size is less than the max allowed will be correct because no entries needed to be merged, so we can avoid discarding those. This is important because histogram for tombstone will be used to estimate droppable tombstone ratio. If it's incorrectly high for many of existing sstables, we will needlessly compact lots of them. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 02:08:10 -03:00
Raphael S. Carvalho	7b532867ce	tests: add sstable tombstone histogram test Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 01:17:28 -03:00
Raphael S. Carvalho	f35bd66da4	streaming_histogram: fix update This bug was introduced when converting java code. Return value of map::erase() was used as if it were the value of the removed entry, but it's actually the number of removed entries. update() also relies on ordered keys, so map is used instead by histogram. In addition, histograms will be written in sorted order (like C* does) such that we can detect bad histograms, using disk_array. disk_array is also used from now on to read histograms. The conversion from array to map is fine because histograms for sstables are limited to 100 elements. Coming patch will detect bad histograms (generated only by us) and discard them, because we can't rely on their information. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-29 01:17:26 -03:00
Amnon Heiman	644868d816	api: remove reply creation As a preparation for the http stream support, creation of empty reply should be avoided. This patch removes a line that cannot be reached but causes the compiler to complain. It has no effect aside of removing the reply creation. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170628130202.8132-1-amnon@scylladb.com>	2017-06-28 16:30:58 +03:00
Tomasz Grabiec	786e75dbf7	row_cache: Use continuity information to decide whether to populate If cache is missing given key, but the range is marked as continuous, it means sstables don't have that entry and we can insert it without asking the presence checker (bloom filter based). The latter is more expensive and gives false positives. So this improves update performance and hit ratio. Another positive effect is that we don't have to clear continuity now. Fixes #1999. Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>	2017-06-28 13:32:48 +03:00
Tomasz Grabiec	3489c68a68	lsa: Fix performance regression in eviction and compact_on_idle Region comparator, used by the two, calls region_impl::min_occupancy(), which calls log_histogram::largest(). The latter is O(N) in terms of the number of segments, and is supposed to be used only in tests. We should call one_of_largest() instead, which is O(1). This caused compact_on_idle() to take more CPU as the number of segments grew (even when there was nothing to compact). Eviction would see the same kind of slow down as well. Introduced in `11b5076b3c`. Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>	2017-06-28 12:32:43 +03:00
Raphael S. Carvalho	a3a73899bc	database: remove outdated FIXME comments Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170621002253.29660-1-raphaelsc@scylladb.com>	2017-06-28 11:06:02 +02:00
Etienne Kruger	37af493f6e	cql_server::do_accepts: modernize loop Replace recursion in cql_server::do_accepts with more modern repeat() from future-util.hh. Fixes #2467. Signed-off-by: Etienne Kruger <el@loadavg.io> Message-Id: <20170628033130.19824-1-el@loadavg.io>	2017-06-28 10:25:22 +03:00
Raphael S. Carvalho	d90f46000d	streaming_histogram: move it to utils It's not specific to sstables. May be needed somewhere else in the future. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-28 01:07:13 -03:00
Glauber Costa	f3742d1e38	disable defragment-memory-on-idle-by-default It's been linked with various performance issues, either by causing them or making them worse. One example is #1634, and also recently I have investigated continuous performance degradation that was also linked to defrag on idle activity. Until we can figure out how to reduce its impact, we should disable it. Signed-off-by: Glauber Costa <glauber@glauber.scylladb> Message-Id: <20170627201109.10775-1-glauber@scylladb.com>	2017-06-28 00:21:11 +03:00
Raphael S. Carvalho	fb9bc609c6	streaming_histogram: do not limit it to be used by sstables streaming histogram will later be placed in /utils, so we want it to use std::unordered_map<> instead of disk_hash<>. That also requires implementing serialization/deserialization functions for it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-27 16:51:52 -03:00
Raphael S. Carvalho	e224653d70	sstables: update tombstone_histogram for cells with expiration time That tombstone_histogram is used to determine droppable data ratio for a sstable, and unlike C*, we were only updating it for tombstones. We need to update it with expiration time of cells too, if any. Creation time (expiration - ttl) cannot be used because if ttl > gc_grace_seconds, the resulting sstable could be considered worth dropping by tombstone compaction before any data is actually expired. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-27 16:50:38 -03:00
Avi Kivity	08488a75e0	dist: tolerate sysctl failures sysctl may fail in a container environment if /proc is not virtualized properly. Fixes #1990 Message-Id: <20170625145930.31619-1-avi@scylladb.com>	2017-06-27 16:11:48 +02:00
Avi Kivity	ff7be8241f	Merge "Fix compilation issues in older environments" from Tomasz * 'tgrabiec/fix-compilation-issues' of github.com:cloudius-systems/seastar-dev: tests: streamed_mutation_test: Avoid using boost::size() on row ranges tests: row_cache: Remove unused method	2017-06-27 16:30:54 +03:00
Tomasz Grabiec	eb844a10e9	tests: streamed_mutation_test: Avoid using boost::size() on row ranges Fails to compile with libboost 1.55.	2017-06-27 15:27:13 +02:00
Tomasz Grabiec	e68925595c	tests: row_cache: Remove unused method	2017-06-27 14:10:37 +02:00
Vlad Zolotarov	6839a50677	db::commitlog: entry_writer add a virtual destructor Add a virtual destructor for a base class commitlog::entry_writer. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1498511180-18391-1-git-send-email-vladz@scylladb.com>	2017-06-27 10:17:10 +03:00
Takuya ASADA	1e86196ed5	dist/debian: unofficial support of Ubuntu non-LTS versions / Debian non-stable versions Currently our build script only supports Ubuntu 14.04/16.04 and Debian 8, this change extends support to Ubuntu non-LTS versions / Debian non-stable versions. Note that this is unofficial support, users should build the package for these distributions themselves. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1498491473-28691-1-git-send-email-syuu@scylladb.com>	2017-06-26 18:55:55 +03:00
Asias He	cc02a62756	repair: Prefer nodes in local dc when streaming When peer nodes have the same partition data, i.e., with the same checksum, we currently choose to stream from any of them randomly. To improve streaming performance, select the peer within the same DC. This patch is supposed to improve repair performance with multiple DC. Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>	2017-06-26 18:34:21 +03:00
Avi Kivity	1170f56447	Merge "Speed up gossip dissemination in large cluster" from Asias Fixes #2528. * tag 'asias/gossip_talk_to_more_nodes/v3' of github.com:cloudius-systems/seastar-dev: gossip: Use vector for _live_endpoints gossip: Talk to more live nodes in each gossip round	2017-06-26 17:59:43 +03:00
Asias He	e31d4a3940	gossip: Use vector for _live_endpoints To speed up random access in get_random_node, switch to using a vector instead of a set.	2017-06-26 22:49:59 +08:00
Asias He	437899909d	gossip: Talk to more live nodes in each gossip round In large clusters with multiple DC deployment, it is observed that gossip updates take a long time to disseminate in the cluster. To speed up, talk to more live nodes in each gossip round. Fixes #2528	2017-06-26 22:49:59 +08:00
Nadav Har'El	6cf44f6817	Optimize column_family::make_sstable_reader() for one partition This patch does the same thing to column_family::make_sstable_reader() as commit `186f031` did to sstable::as_mutation_source(). Although usually one can fast_forward_to() on the result of a column_family::make_sstable_reader(), earlier we had an optimization where if a single partition was specified, it was read exactly, and fast_forward_to() was NOT allowed. With the mutation_reader::forwarding flag patch, when this flag was on - requesting fast_forward_to() - we disabled this optimization. This makes sense, but is not backward compatible with the code which previously assumes this optimization exists. In particular, column_family::data_query() does a single partition read but does not specify forwarding::no explicitly. So this patch returns this optimization, despite this meaning that we blatantly ignore the fwd_mr flag in that case. Fixes #2524. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170626141121.30322-1-nyh@scylladb.com>	2017-06-26 17:13:03 +03:00
Avi Kivity	9b21a9bfb6	Merge "Implement partial cache" from Tomasz and Piotr "This series enables cache to keep partial partitions. Reads no longer have to read whole partition from sstables in order to cache the result. The 10MB threshold for partition size in cache is lifted. Known issues: - There is no partial eviction yet, whole partitions are still evicted, and partition snapshots held by active reads are not evictable at all - Information about range continuity is not recorded if that would require inserting a dummy entry, or if previous entry doesn't belong to the latest snapshot - Cache update after memtable flush happening concurrently with reads may inhibit that reads' ability to populate cache (new issue) - Cache update from flushed memtables has partition granularity, so may cause latency problems with large partition - Schema is still tracked per-partition, so after schema changes reads may induce high latency due to whole partition needing to be converted atomically - Range tombstones are repeated in the stream for every range between cache entries they cover (new issue) - Populating scans for both small and large partitions (perf_fast_forward) experienced a 40% reduction of throughput, CPU bound How was this tested: - test.py --mode release - row_cache_stress_test -c1 -m1G - perf_fast_forward, passes except for the test case checking range continuity population which would require inserting a dummy entry (mentioned above) - perf_simple_query (-c1 -m1G --duration 32): before: 90k [ops/s] stdev: 4k [ops/s] after: 94k [ops/s] stdev: 2k [ops/s]" * tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits) tests: row_cache: Add test_tombstone_merging_in_partial_partition test case tests: Introduce row_cache_stress_test utils: Add helpers for dealing with nonwrapping_range<int> tests: simple_schema: Allow passing the tombstone to make_range_tombstone() tests: simple_schema: Accept value by reference tests: 
simple_schema: Make add_row() accept optional timestamp tests: simple_schema: Make new_timestamp() public tests: simple_schema: Introduce make_ckeys() tests: simple_schema: Introduce get_value(const clustered_row&) helper tests: simple_schema: Fix comment tests: simple_schema: Add missing include row_cache: Introduce evict() tests: Add cache_streamed_mutation_test tests: mutation_assertions: Allow expecting fragments mutation_fragment: Implement equality check tests: row_cache: Add test for population of random partitions tests: row_cache: Add test for partition tombstone population tests: row_cache: Test reading randomly populated partition tests: row_cache: Add test_single_partition_update() tests: row_cache: Add test_scan_with_partial_partitions ...	2017-06-26 14:54:37 +03:00
Avi Kivity	555621b537	Disentangle memtables from sstables Remove sstable::write_components(memtable), replacing it with a helper. Fixes #2354 Message-Id: <20170624142639.16662-1-avi@scylladb.com>	2017-06-26 09:37:11 +02:00
Avi Kivity	236a8370e4	Remove use of std::random_shuffle() It was removed in C++17. Replace with std::shuffle(). Message-Id: <20170626063809.7563-1-avi@scylladb.com>	2017-06-26 09:36:38 +02:00
Avi Kivity	c4ae2206c7	messaging: respect inter_dc_tcp_nodelay configuration parameter We respect it partially (client side only) for now. Fixes #6. Message-Id: <20170623172048.23103-1-avi@scylladb.com>	2017-06-24 21:49:27 +02:00
Duarte Nunes	2dfd7040eb	CMakeLists.txt: Add boost support Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170623172236.15507-1-duarte@scylladb.com>	2017-06-24 21:49:27 +02:00
Avi Kivity	801b5220d6	Merge seastar upstream * seastar 9e2b7ec...0ab7ae5 (4): > Update fmt submodule > rpc: add options to control tcp_nodelay > core: Fix compilation for older versions of Boost > tests/lowres_clock_test: Fix compilation issues	2017-06-24 20:47:52 +03:00
Tomasz Grabiec	b0bcf2be53	tests: row_cache: Add test_tombstone_merging_in_partial_partition test case	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	23c6f517cb	tests: Introduce row_cache_stress_test Runs readers, updates and eviction concurrently and verifies the following property of reads: - reads see all past writes - reads see no partial writes within a single partition	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	4b4aef789e	utils: Add helpers for dealing with nonwrapping_range<int>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5c9f87fb27	tests: simple_schema: Allow passing the tombstone to make_range_tombstone()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	edf4a3494c	tests: simple_schema: Accept value by reference	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5f70df472f	tests: simple_schema: Make add_row() accept optional timestamp	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	53867c4328	tests: simple_schema: Make new_timestamp() public	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	51b5814ec2	tests: simple_schema: Introduce make_ckeys()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	074c67fe4d	tests: simple_schema: Introduce get_value(const clustered_row&) helper	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ffc776e06	tests: simple_schema: Fix comment	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ecacd2e84a	tests: simple_schema: Add missing include	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	b56232b216	row_cache: Introduce evict()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	c4e8effffa	tests: Add cache_streamed_mutation_test [tgrabiec: - extracted from a larger commit - removed coupling with how cache_streamed_mutation is created (the code went out of sync), used more stable make_reader(). it's simpler too. - replaced false/true literals with is_continuous/is_dummy where appropriate - dropped tests for cache::underlying (class is gone) - reused streamed_mutation_assertions, it has better error messages - fixed the tests to not create tombstones with missing timestamps - relaxed range tombstone assertions to only check information relevant for the query range - print cache on failure for improved debuggability ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	44fdee3f2e	tests: mutation_assertions: Allow expecting fragments	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1f23130b07	mutation_fragment: Implement equality check	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	116bcb8b30	tests: row_cache: Add test for population of random partitions	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	930a1415fe	tests: row_cache: Add test for partition tombstone population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	9bfece6f82	tests: row_cache: Test reading randomly populated partition	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	0358334579	tests: row_cache: Add test_single_partition_update() [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8bb76e2f12	tests: row_cache: Add test_scan_with_partial_partitions	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	896bf2e5de	Remove unused methods from MVCC Some apply methods were replaced by apply_to_incomplete(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6f6575f456	row_cache: Enable partial partition population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5a0ae55f6d	Introduce schema_upgrader	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1828e28bbb	database: Invalidate cache atomically with attaching streaming sstables Not doing so may cause reads to see partial writes, if another update+read happens in between.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	896196b841	database: Invalidate cache from seal_active_streaming_memtable_immediate() Cache must be synchronized atomically with changing the underlying mutation source, otherwise write atomicity may not hold.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7ae40d7045	tests: Add test for update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e792220c3a	row_cache: Introduce update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c29878f49f	row_cache: Extract memtable walking logic from update() into do_update() So that it can be reused in update_invalidating().	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6ebfb730ee	partition_entry: Introduce partition_tombstone() getter	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	fb62dfab02	tests: mvcc: Introduce test_schema_upgrade_preserves_continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	164989a574	tests: mvcc: Add test for partition_entry::apply_to_incomplete()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e433e68610	partition_entry: Make squashed() and upgrade() work with not fully continuous versions Those methods first create a neutral mutation_partition, and left-fold it with the versions. The problem is that there is no neutral element for static row continuity, the flag from the first addend always wins. We have to copy the flag from the first version to preserve the logical value.	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	b680de930c	partition_entry: Introduce apply_to_incomplete() Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - extracted from a larger commit - fix heap comparator in apply_incomplete_target to order versions properly - extracted partition_version detaching into partition_entry::with_detached_versions() - dropped unnecessary rows_iterator::_version field - dropped unnecessary allocation of rows_entry and key copies in rows_iterator - dropped row_pointer - replaced apply_reversibly() with weaker and faster apply() - added handling of dummy entries at any position - fixed exception safety issue in apply_to_incomplete() which may result in data loss. We cannot move data out of applied versions into a new synthetic row and then apply it, because if exception happens in the middle, the data which was moved from the source will be lost. To fix that, row_iterator::consume_row() is introduced which allows in-place consumption of data without construction of temporary deletable_row. ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	b6ce963200	partition_version: Introduce partition_entry::with_detached_versions()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2d8f024e4d	partition_version: Document version merging rules on partition_entry	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	0770845a23	mutation_partition: Introduce r-value accepting deletable_row::apply()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	48a5b1d3ab	converting_mutation_partition_applier: Expose cell upgrade logic	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	04aebaa2cb	streamed_mutation: Introduce transform()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	3755641c4b	row_cache: Introduce cache_streamed_mutation This streamed mutation populates cache with the rows requested by the read. It takes whatever it can find in the cache and fetches the remainder from the underlying source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - fixed maybe_add_to_cache_and_update_continuity() leaking entries if the key already exists in the snapshot - fixed a problem where population race could result in a read missing some rows, because cache_streamed_mutation was advancing the cursor, then deferring, and then checking continuity. We should check continuity atomically with advancing. - fixed rows_handle.maybe_refresh() being accessed outside of update section in read_from_underlying() (undefined behavior) - fixed a problem in start_reading_from_underlying() where we would use incorrect start if lower_bound ended with a range tombstone starting before a key. - range tombstone trimming in add_to_buffer() could create a tombstone which has too low start bound if last_rt.end was a prefix and had inclusive end. invert_kind(end_kind) should be used instead of unconditional inc_start. - range tombstone trimming incorrectly assumed it is fine to trim the tombstone from underlying to the previous fragment's end and emit such tombstone. That would mean the stream can't emit any fragments which start before previous tombstone's end. Solve with range_tombstone_stream. - split add_to_buffer() into overloads for clustering_row, and range_tombstone. Better than wrapping into mutation_fragment before the call and having add_to_buffer() rediscover the information. 
- changed maybe_add_to_cache_and_update_continuity() to not set continuity to false for existing entries, it's not necessary - moved range tombstone trimming to range_tombstone class - moved range tombstone slicing code to range_tombstone_list and partition_snapshot - can_populate::can_use_cache was unused, dropped - dropped assumption that dummy entries are only at the end - renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache() - dropped no longer needed lower_bound class - extracted row_handle to a separate patch - made the copy-from-cache loop preemptable - split maybe_add_next_to_buffer_and_update_continuity(bool) - dropped cache_populator - replaced "underlying" class with use of read_context - replaced can_populate class with a function - simplified lsa_manager methods to avoid moves ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	58a8022462	intrusive_set_external_comparator: Introduce insert_check()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	509a0d8a83	row_cache: Allow reading from underlying through read_context The interaction will be as follows: - Before creating cache_streamed_mutation for given partition, cache mutation reader sets up read_context for current partition (in one of two ways) so that the matching underlying streamed_mutation can be accessed at any time by cached_stream_mutation. - cache_streamed_mutation assumes that read_context is set up for current partition and invokes fast_forward_to() and get_next_fragment() to access the underlying streamed_mutation.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	69ea29131f	row_cache: Allow specifying desired snapshot in autoupdating_underlying_reader When reading from incomplete partition entry, we may discover we need to read something from the underlying mutation source. In such case we will fast forward this reader to that partition. But we must do it using a specific snapshot, the one we obtained when entering the partition, not the latest one.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a1d3e0318c	row_cache: Store autoupdating_underlying_reader in read_context Will be reused for reading of incomplete partition entries.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3f2320c377	row_cache: Store information whether query is a range query in read_context We will need to use this information later in yet another place, when creating a reader for incomplete cache entry. This refactors the code so that there is a single place which determines this fact.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a2207ee9a6	row_cache: Move autoupdating_underlying_reader to read_context.hh	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ca920bd0ef	row_cache: Keep only one streamed_mutation in scanning_and_populating_reader Currently scanning_and_populating_reader asks just_cache_scanning_reader for the next partition from cache, together with information if the range is continuous. If it's not, it saves the partition it got from it and moves on to reading from the underlying reader up to that partition. When that's done, it emits the stored partition. This approach won't work well with upcoming changes for storing partial partitions. We won't have whole partitions any more, so streamed_mutation returned for the entry needs to be prepared for reading from the underlying mutation source. We want to reuse the same underlying reader as much as possible, so all streamed_mutations for given read (read_context) will share the state of the underlying reader. Construction of a streamed_mutation will depend on the fact that the shared state is set up for it, so we cannot have two streamed_mutations prepared at the same time (one for entry from primary, and one for the earlier entry being populated). This change defers the creation of a streamed_mutation for the entry present in cache until the whole reader reaches it to avoid this problem. This will also have another potentially beneficial effect. Since we defer the decision about which snapshot to use until we reach the entry, there is a higher chance that the current snapshot of the entry will match the one used last by the populating read, and that we will be able to reuse the reader. It's implemented by utilizing a stable partition cursor which tracks its current position so that it's possible to revisit the cache entry (if it's still there) after population ends. The functionality of just_cache_scanning_reader was inlined into scanning_and_populating_reader.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1b041298fe	range: Introduce trim_front()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	045888d5f3	row_cache: Introduce partition_range_cursor	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e989d65539	dht: Make ring_position_view copyable dht::token needs to be stored as a pointer now and not a reference so that validity of old pointers is not impacted by in-place object construction which would occur in the copy-assignment operator. [1] says that old pointers can be used to access the new object only if the type "does not contain any non-static data member whose type is const-qualified or a reference type". [1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c3905bf235	row_cache: Print position instead of key of cache_entry	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f0cc86e5db	row_cache: Introduce cache_entry::position()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e3272526a1	row_cache: Allow comparing with ring_position views in row_cache::compare	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5bfecaad99	row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges It's needed before switching cache_entry ordering to rely solely on cache_entry::position() so that invalidate_unwrapped() never removes the dummy entry at the end. Currently if the range has upper bound like this: { ring_position::max(), inclusive=true } The code which selects entries for removal would include the dummy row at the end. It uses upper_bound() to get the end iterator, and the dummy entry has a position which is equal to the position in the bound. ring_position_view ranges are end-exclusive, so it's impossible to create a partition range which would include a dummy entry. The code is also simpler.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	64626b32b0	row_cache: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	639af55a78	partition_version: Add versions() getter [tgrabiec: Use explicit return type]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1d3fec43eb	partition_version: Make return type of versions() explicit	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	22a0e301f1	partition_version: Make is_referenced() const-qualified	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a17fa5726f	Introduce streamed_mutation_from_forwarding_streamed_mutation This will allow conversion from streamed_mutation that supports fast forwarding to streamed_mutation that does not. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	11191b7aef	streamed_mutation: Introduce make_empty_streamed_mutation()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7c9569ec95	Introduce partition_snapshot_row_cursor	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	54b3da1910	row_cache: Introduce find_or_create() helper	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f2d2c221d4	row_cache: Return cache_entry reference from do_find_or_create_entry Will be useful when additional action needs to be done on the entry after it was created or constructed.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	0db6bdc916	row_cache: Introduce cache_entry constructor which constructs incomplete entry	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1c642219c9	row_cache: Ensure there is always a dummy entry after all clustered rows Algorithms will assume that.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. 
So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	81e7b561da	dht: Add ring_position min()/max()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	446bcdb00d	database: Add missing cache invalidation after attaching sstables This violation of the contract is currently benign, because there are no reads from those tables before they are populated. If there were, the cache would mark the whole (empty) range as continuous and the table would appear empty. It will cause similar problem once cache starts using snapshots of the underlying mutation source. Then this lack of invalidate() will also result in cache thinking that the table is still empty.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibility 3) Rename clear() to invalidate() and merge with the range variant, it has the same semantics	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c82c6ec6ed	database: Allow obtaining snapshot_source for sstables	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bd023b6161	tests: Introduce memtable_snapshot_source Snapshottable in-memory mutation source for use in row_cache tests.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ddfcf64966	mutation_source: Make copying cheaper Cache readers will need to take snapshots by copying the mutation_source. That's going to happen quite often, so make copying cheaper.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	58d5e1393b	mutation_reader: Introduce make_combined_mutation_source()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1e2463a382	mutation_reader: Introduce make_empty_*_source()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	289d01c2cc	mutation_reader: Introduce concept of snapshot_source	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2d73c193e7	row_cache: Introduce read_context This object stores all read relevant context required all over the place. This leads to a cleaner code. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - made read_context shareable to allow storing shared mutable state later - added range and cache getters ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a3ff8db323	row_cache: Introduce autoupdating_underlying_reader This is an abstraction that represents a reader to the underlying source and auto updates itself to make sure the reader reflects the latest state of the underlying source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: Add range getter to avoid friendships]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	b6d349728f	range_tombstone_list: Introduce slice() working with position range	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6ce08f2f9a	range_tombstone: Introduce trim_front()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	271dfc2eac	position_in_partition: Introduce for_range_start()/for_range_end()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3b52afa4a3	position_in_partition: Introduce no_clustering_row_between()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	9c2b3e1167	position_in_partition: Introduce as_start_bound_view()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	b47c8f1df7	partition_snapshot: Add const-qualified overload of version() [tgrabiec: Extracted from a different patch]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dd9d35c166	partition_snapshot: Add getter for range tombstones	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	60c3c0a471	partition_entry: Add squashed() overload with a single schema	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	98f7671553	partition_snapshot: Introduce squashed()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	87b0f11be3	partition_snapshot: Add getters for static row and partition tombstone [tgrabiec: - Extracted from a different patch - Renamed concept names to more familiar Map and Reduce - Renamed aggregate() to squashed() to match the existing nomenclature - Uncommented the concepts ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ea59b9475e	partition_version: Add const-qualified variant of operator-> [tgrabiec: Extracted from a different patch]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	f6fe0acea4	partition_version: Make operator bool() const-qualified [tgrabiec: Extracted from a different patch]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	efc75b0bc3	mutation_partition: Add rows_entry constructor which accepts full contents [tgrabiec: Extracted from different patch]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7f8620d4a7	tests: mutation_source: Relax expectations about range tombstones In preparation for having partial cache which trims range tombstones to the lower bound of the query.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3a9212e0f2	tests: mutation_assertions: Add ability to limit verification to given clustering_row_ranges Currently mutation sources are free to return range tombstones covering range which is larger than the query range. The cache mutation source will soon become more eager about trimming such tombstones. To cover up for such differences, allow telling the restrictions to only care about differences relevant for given clustering ranges.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f925b26241	tests: mutation_reader_assertions: Simplify	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1d5d5e26a2	mutation: Introduce sliced()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	92d6456070	range_tombstone_list: Introduce equal()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1594ace4d3	range_tombstone_stream: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	19edb0b535	range_tombstone_list: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2e75595ecf	range_tombstone_list: Introduce trim()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	5a29c70f3e	mutation_fragment: make mutation_fragment copyable This will be needed by implementation of cache_streamed_mutation Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2fdabcaa9b	Track population phase in partition_snapshot This will be used by partial cache in later patches. [tgrabiec: - changed title, - documented meaning of the variable, - renamed the variable, - introduced open_version(), - fixed continuity of the static row not being preserved in case a new version is created] Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a841c77c54	Introduce maybe_merge_versions This will be used in the following patches by partial cache. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	9642f806ab	partition_version: Introduce version() getter	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	9380dd1ee3	mutation_source: make sure we never ignore fast forwarding Mutation sources sometimes ignore the fast forwarding parameter, so this change adds an assertion to check that this parameter can be safely ignored. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ab72241e22	mutation_reader: Accept forwarding flag in make_reader_returning() By default make_reader_returning creates a reader that does not support fast forwarding but the second parameter can be used to make it support fast forwarding. [tgrabiec: Improve title] Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ac03331490	row_cache_test: improve test_sliced_read_row_presence Remove unused parameter and add checks to make sure all expected rows have been received. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	db053ef902	tests: Add test for continuity merging rules	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2edf08d36a	tests: random_mutation_generator: Generate random continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8873a443db	tests: mutation: Generate mutations with continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dce293e11c	tests: row_cache: Apply only fully continuous mutations to underlying mutation source Cache currently assumes that mutations coming from outside are fully continuous.	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	e86f74edd8	tests: row_cache: Add missing apply() to test_mvcc test case [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	95dcfa859b	tests: row_cache: Improve test_mvcc() assert_that().is_equal_to() gives better error message. Also, there is code which can be replaced with assert_that_stream().has_monotonic_positions()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	05b56fcfb0	mutation_partition: Add support for specifying continuity This will allow expressing lack of information about certain ranges of rows (including the static row), which will be used in cache to determine if information in cache is complete or not. Continuity is represented internally using flags on row entries. The key range between two consecutive entries is continuous iff rows_entry::continuous() is true for the later entry. The range starting after the last entry is assumed to be continuous. The range corresponding to the key of the entry is continuous iff rows_entry::dummy() is false. [tgrabiec: - based on the following commits: 4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry 773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry - documented that partition tombstone is always complete - require specifying the partition tombstone when creating an incomplete entry - replaced rows_entry(dummy_tag, ...) constructor with more general rows_entry(position_in_partition, ...) - documented continuity semantics on mutation_partition - fixed _static_row_cached being lost by mutation_partition copy constructors - fixed conversion to streamed_mutation to ignore dummy entries - fixed mutation_partition serializer to drop dummy entries - documented semantics of continuity on mutation_partition level - dropped assumptions that dummy entries can be only at the last position - changed equality to ignore continuity completely, rather than partially (it was not ignoring dummy entries, but ignoring continuity flag) - added printout of continuity information in mutation_partition - fixed handling of empty entries in apply_reversibly() with regards to continuity; we no longer can remove empty entries before merging, since that may affect continuity of the right-hand mutation. Added _erased flag. 
- fixed mutation_partition::clustered_row() with dummy==true to not ignore the key - fixed partition_builder to not ignore continuity - renamed dummy_tag_t to dummy_tag. _t suffix is reserved. - standardized all APIs on is_dummy and is_continuous bool_class:es - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics - dropped unused remove_dummy_entry() - simplified and inlined cache_entry::add_dummy_entry() - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous ]	2017-06-24 18:06:11 +02:00
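The continuity rules stated in commit 05b56fcfb0 above can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_entry` and `is_covered` are hypothetical names, integer positions stand in for clustering positions, and the entries are assumed to be sorted by position. It is not the `rows_entry` implementation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a row entry: a position plus the two flags
// that the commit message describes.
struct toy_entry {
    int position;     // stand-in for a position in the partition
    bool continuous;  // is the range between the previous entry and this one complete?
    bool dummy;       // true if the entry carries no row for its own key
};

// Returns true if information about `key` is complete, following the three
// rules from the commit message. `rows` must be sorted by position.
template <std::size_t N>
constexpr bool is_covered(const std::array<toy_entry, N>& rows, int key) {
    for (std::size_t i = 0; i < N; ++i) {
        if (key == rows[i].position) {
            // The range for the entry's own key is continuous iff it is not a dummy.
            return !rows[i].dummy;
        }
        if (key < rows[i].position) {
            // The gap before an entry is continuous iff the later entry says so.
            return rows[i].continuous;
        }
    }
    // The range after the last entry is assumed to be continuous.
    return true;
}

// Sample data: a row at 10, a row at 20 with a complete gap before it,
// and a dummy at 30 with an incomplete gap before it.
constexpr std::array<toy_entry, 3> sample_rows{{
    {10, false, false},
    {20, true, false},
    {30, false, true},
}};
```

With `sample_rows`, keys 10, 15, 20 and 35 are covered, while keys 5, 25 and 30 are not.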
Tomasz Grabiec	063b37f352	partition_snapshot_reader: Be prepared for skipping some row entries If some row entries may have to be skipped by the reader then it could be that _clustering_rows is not empty, but read_next() will return a disengaged optional because there are no more rows in the current range. The code assumed that it's never the case, and if read_next() returns a disengaged optional then we exhausted all ranges. Before introducing dummy entries this needs to be refactored.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2cfe23a35e	partition_snapshot_reader: Use rows_entry::position() for comparing rows key() will not be valid for dummy entries.	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	50641b2849	partition_snapshot_reader: Reuse rows_entry comparator	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	d1a1fdfd57	partition_snapshot_reader: Encapsulate row walking to simplify read_next()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a77734952d	mutation_partition: Make rows_entry comparable with position_in_partition	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	65b3123516	mutation_partition: Use rows_entry::position() in comparators key() will not be valid for dummy entries, but position() is always valid. [tgrabiec: Extracted from other commits] [tgrabiec: Added missing change to range_tombstone_stream::get_next]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	660f3127a6	mutation_partition: Introduce rows_entry::position() In preparation for enabling dummy entries with position past all clustering rows.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	9d705bc1c6	position_in_partition_view: Add key component getter	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	206a5f2bf5	position_in_partition_view: Add is_clustering_row()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	874b34ac09	position_in_partition_view: Add converting constructor from a key reference	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	794f9b21ef	position_in_partition: Add is_after_all_clustered_rows()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	14fbf4409c	position_in_partition: Introduce after_key() in the view	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	1457be4de8	position_in_partition: Introduce for_key() [tgrabiec: Take the key by reference]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	94c957c2ff	Extract position_in_partition to separate header This will allow its usage in mutation_partition.hh Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	0fd4dedc6a	position_in_partition: Add after_all_clustered_rows() to view This is a position that's always in the end after any other position. It will be used for dummy rows_entry. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	60346a2819	row_cache: remove unused read overload This will simplify the following patches and unused code should be removed anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	fbe8c24ebe	tests: row_cache_alloc_stress: Make eviction detection more reliable It can happen that touch() will trigger eviction on entry to allocating section, and drop in occupancy around insertion will not happen. As a result, we may evict a lot without detecting that. Extend the check to include touch() and use more reliable eviction counters.	2017-06-24 18:06:11 +02:00
Jesse Haber-Kucharsky	0791e90424	Further improve `CMakeLists.txt` for CLion We support situations where `seastar.pc` is available, and when it isn't. When `seastar.pc` is available, we grab the compilation flags from it in addition to the defaults. Some DPDK files are statically available in the source repository. We prefer those to files placed during compilation in case modifications are made during development (that would be lost during a build). We always disable GCC6 concepts for the IDE, even if they are enabled while configuring the project for compilation. CLion's parser doesn't understand them. One final benefit to this revision is that now only target-specific flags are modified rather than global flags. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <75457652201c5ed89d05081ec4b56b4340721cf5.1498237756.git.jhaberku@scylladb.com>	2017-06-23 19:21:28 +02:00
Avi Kivity	aebc77507c	Merge "Clock efficiency and organization" from Jesse "This patch series makes two noteworthy changes: - `gc_clock` now uses `seastar::lowres_system_clock` for better performance. - Code common to all clocks is properly refactored and shared. Otherwise, there are some small improvements that should have no functional impact." * 'jhk/lowres_gc_clock/v1' of https://github.com/hakuch/scylla: Seal clock definitions `timestamp_clock::now()` is not `noexcept` Make `db_clock` `time_t` conversions `constexpr` Move common clock implementation helpers Simplify clock implementations db_clock.hh: Clean preprocessor directives Make `gc_clock` a model of `Clock` Use `lowres_system_clock` to back `gc_clock` Add `time_t` conversions for `gc_clock`	2017-06-23 18:47:56 +03:00
Jesse Haber-Kucharsky	b0ad1ff447	Seal clock definitions	2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky	09954c45f1	`timestamp_clock::now()` is not `noexcept` The problem is that `std::chrono::duration_cast` is not `noexcept`. As a result, `timestamp_clock` is actually a model of `Clock` and not `TrivialClock`.	2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky	28169fabca	Make `db_clock` `time_t` conversions `constexpr`	2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky	e045dddae8	Move common clock implementation helpers This change fixes the dependencies between the clock implementation headers. All the clocks share the common clock offset, but are otherwise independent (though the `db_clock` does depend on `gc_clock` for time point conversions).	2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky	2d184f27af	Simplify clock implementations	2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky	51c767c1c7	db_clock.hh: Clean preprocessor directives	2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky	050ece6f74	Make `gc_clock` a model of `Clock` It was missing `is_steady`.	2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky	00bcd568a6	Use `lowres_system_clock` to back `gc_clock` `seastar::lowres_system_clock` is more efficient than `std::chrono::system_clock` and `gc_clock` has very coarse granularity requirements. Fixes #1957.	2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky	73020685ee	Add `time_t` conversions for `gc_clock` `gc_clock` reports system time, and these conversion functions allow for manipulating time points produced by the clock without making assumptions about its epoch.	2017-06-23 11:35:34 -04:00
Duarte Nunes	4ef25e8e38	db/schema_tables: Add note to make_update_view_mutations Document that a new view schema passed to make_update_view_mutations() might be based on base schema that hasn't yet been loaded. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170618200558.96036-1-duarte@scylladb.com>	2017-06-23 15:24:35 +02:00
Duarte Nunes	bc1f1fa88a	Merge branch "Some fixes for clang++ trunk" from Avi "Fix minor issues found while building with clang trunk." * 'clang' of https://github.com/avikivity/scylla: seastarx: don't make seastar namespace inline seastarx: add missing make_shared forward declaration tests: fix call to seastar::sleep() dht: fix bad to_sstring() call	2017-06-22 17:31:27 +02:00
Avi Kivity	f3366d8ae6	seastarx: don't make seastar namespace inline It's apparently not legal to re-declare an existing namespace inline. Use "using" instead.	2017-06-22 18:16:13 +03:00
Avi Kivity	c423330917	seastarx: add missing make_shared forward declaration Without this, clang (correctly) complains that it can't deduce the type when it is not explicitly mentioned.	2017-06-22 18:16:13 +03:00
Avi Kivity	672de608bf	tests: fix call to seastar::sleep() It's not in the global namespace.	2017-06-22 18:16:13 +03:00
Avi Kivity	f9f2f18145	dht: fix bad to_sstring() call to_sstring() is part of seastar, not the global namespace.	2017-06-22 17:51:27 +03:00
Avi Kivity	6f8dba3fa9	Merge "small fixes and cleanup for leveled strategy" from Raphael * 'lcs_improvements_v1' of github.com:raphaelsc/scylla: lcs: remove useless code for choosing L0 candidates lcs: remove some dead code lcs: make logger static lcs: actually prefer oldest sstables of L0 when it falls behind lcs: remove useless expensive check for overlapping L1 sstables	2017-06-22 15:45:34 +03:00
Raphael S. Carvalho	4351e0a996	compaction: introduce new compaction type for reshard so now user can look at nodetool compactionstats and determine whether or not resharding is running, for example: $ ./bin/nodetool compactionstats pending tasks: 3 id compaction type keyspace table completed total unit progress <none> RESHARD system compaction_history 11 256 keys 4.30% <none> RESHARD system compaction_history 2 256 keys 0.78% <none> RESHARD system compaction_history 10 256 keys 3.91% <none> RESHARD system compaction_history 8 256 keys 3.12% <none> RESHARD system compaction_history 7 256 keys 2.73% Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>	2017-06-22 14:48:38 +03:00
Gleb Natapov	9b8499df0e	cache_hitrate_calculator: filter cfs based on replication strategy instead of a name The code filters CFs by name to not include system keyspace, but v3 schema added yet another system namespace. Better filter according to replication strategy to accommodate for schema v4 adding even more system keyspaces. Fixes: #2516 Message-Id: <20170621073816.GB3944@scylladb.com>	2017-06-22 11:26:34 +03:00
Tzach Livyatan	9e6337f330	Add a comment experimental line to scylla.yaml Making it easier for users to enable experimental features Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <20170621191720.13575-1-tzach@scylladb.com>	2017-06-22 09:06:19 +03:00
Jesse Haber-Kucharsky	8174a40098	Improve `CMakeLists.txt` for CLion This version optionally reads the include paths for Seastar from pkg-config and uses file globbing to register all source and header files. In comparison to the previous version, I see all files in the project explorer view are "active" (rather than just .cc files). I believe there are also fewer errors reported by the editor.	2017-06-21 16:34:47 -04:00
Avi Kivity	f0b20be14d	Revert "system_keyspace: Make sure "system" is written to keyspaces (visible)" This reverts commit `89ef69c4b3`. Prevents nodes from joining the cluster.	2017-06-21 16:58:04 +03:00
Avi Kivity	8585a356eb	Revert "Revert "db: prevent latency spikes during streaming/repair"" This reverts commit `399d219cab`. Turns out it was not the culprit.	2017-06-21 16:58:04 +03:00
Takuya ASADA	aa77ac1138	dist/debian: Debian 9(stretch) support Add support for the new Debian stable release. Also includes the following changes: - update libthrift due to unable to compile on Debian 9 - drop dist/debian/supported_release since distribution check code moved to pbuilderrc - add libssl-dev for build-depends - add sudo for pbuilder extra packages (Debian doesn't have it by default install) Signed-off-by: syuu <syuu@dokukino.com> Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1498047515-19972-1-git-send-email-syuu@scylladb.com>	2017-06-21 15:30:22 +03:00
Avi Kivity	399d219cab	Revert "db: prevent latency spikes during streaming/repair" This reverts commit bdfa2ed923245e236837f58925c797e26df32361; prevents nodes from joining.	2017-06-21 11:28:29 +03:00
Calle Wilund	89ef69c4b3	system_keyspace: Make sure "system" is written to keyspaces (visible) Fixes #2514 Bug in schema version 3 update: We failed to write "system" to the schema tables. Only visible on an empty instance of course. Message-Id: <1497966982-10044-1-git-send-email-calle@scylladb.com>	2017-06-20 20:59:47 +02:00
Avi Kivity	bdfa2ed923	db: prevent latency spikes during streaming/repair The memtable destructor can take a long time if the memtable is full; use clear_gently() to clear it without impacting latency. Fixes #2477. Message-Id: <20170620093550.16121-1-avi@scylladb.com>	2017-06-20 13:03:43 +02:00
Nadav Har'El	186f031187	Optimize sstable::as_mutation_source() for one partition Although usually one can fast_forward_to() on the result of a sstable::as_mutation_source(), earlier we had an optimization where if a single partition was specified, it was read exactly, and fast_forward_to() was NOT allowed. With the mutation_reader::forwarding flag patch, when this flag was on - requesting fast_forward_to() - we disabled this optimization. This makes sense, but is not backward compatible with the code which previously assumes this optimization exists... So this patch returns this optimization, despite this meaning that we blatantly ignore the fwd_mr flag in that case. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170620081107.14335-1-nyh@scylladb.com>	2017-06-20 10:32:10 +02:00
Avi Kivity	ba0ba87bf9	Merge seastar upstream * seastar 621b7ed...9e2b7ec (8): > Merge "Low-resolution clocks" from Jesse > build: disable -Wattributes when gcc -fvisibility=hidden bug strikes > build: work around ragel 7 generated code bug > rpc: make unmarshall_exception() inline > future-utils: make functions global > fix reactor stall detector rate limiting on an mostly idle system > prometheus: Add ability to add /metrics to any http_server > prometheus: fix memory leak in http_server_control	2017-06-20 11:01:40 +03:00
Tomasz Grabiec	358bf88cf8	mutation_reader: Fix abort when streaming more than one range multi_range_mutation_reader uses fast_forward_to() to skip between ranges, so we always need to create the underlying reader with mutation_reader::forwarding::yes if there is more than one range, irrespective of whether multi_range_mutation_reader itself will be forwarded or not. Fixes #2510. Introduced in commit `3018df1`. Message-Id: <1497943032-18696-1-git-send-email-tgrabiec@scylladb.com>	2017-06-20 10:29:45 +03:00
Amos Kong	92731eff4f	common/scripts: fix node_exporter url Commit `ff3d83bc2f` updated node_exporter from 0.12.0 to 0.14.0, and it introduced a bug to download install file. node_exporter started to add 'v' prefix in release tags[1] from 0.13.0, so we need to fix the url. [1] https://github.com/prometheus/node_exporter/tags Fixes #2509 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <42b0a7612539a34034896d404d63a0a31ce79e10.1497919368.git.amos@scylladb.com>	2017-06-20 09:25:39 +03:00
Raphael S. Carvalho	82048e6f77	lcs: remove useless code for choosing L0 candidates The code being removed could be used if parallel compaction were allowed for LCS, but the current code isn't even allowing that. At the moment, it's only wasting cycles. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-19 21:19:29 -03:00
Raphael S. Carvalho	81f20068d6	lcs: remove some dead code Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-19 21:02:46 -03:00
Raphael S. Carvalho	b26dc6db1a	lcs: make logger static otherwise, there will be one instance per shard. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-19 21:01:21 -03:00
Raphael S. Carvalho	4bb27cbd6f	lcs: actually prefer oldest sstables of L0 when it falls behind Strategy prefers promoting oldest sstables in L0. Because sort procedure is incorrectly sorting elements in descending order, newest sstables will be promoted first if and only if L0 falls behind (more than 32 sstables). If L0 doesn't fall behind, we'll have all L0 sstables compacted with overlapping ones in L1. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-19 20:45:39 -03:00
Raphael S. Carvalho	90db2d7eba	lcs: remove useless expensive check for overlapping L1 sstables there's no way a L1 sstable will be in candidates set which was previously built from list of L0 sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-19 19:33:52 -03:00
Takuya ASADA	71600eb298	dist/debian: Use pbuilder for Ubuntu/Debian debs Enable pbuilder for Ubuntu/Debian to prevent build environment dependent issues. Also support cross building by pbuilder. (cross-building from Fedora 25 and Ubuntu 16.04 are tested) closes #629 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1497895661-26376-1-git-send-email-syuu@scylladb.com>	2017-06-19 21:15:06 +03:00
Nadav Har'El	984da1d8d7	Make forwarding_tag local to streamed_mutation As Avi noticed, the "forwarding_tag" which was meant to be local in streamed_mutation, became global. If another class copied the same trick, it would share the same type instead of being distinct types as intended. The problem is that in: using forwarding = bool_class<class forwarding_tag>; Apparently, the "class forwarding_tag" forward-declares a global type - it does not create a local-scope type as intended, which the following apparently does (even though no actual definition is given for that class): class forwarding_tag; using forwarding = bool_class<forwarding_tag>; Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619153933.13116-1-nyh@scylladb.com>	2017-06-19 20:04:47 +03:00
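The scoping pitfall described in commit 984da1d8d7 above can be reproduced in isolation. The sketch below is illustrative only: `toy_bool_class` is a hypothetical stand-in for the `bool_class` template, and the `reader_*` structs are made-up names. It shows that a tag introduced with `class tag_name` inside a template argument list is declared in the enclosing namespace, while a separate member declaration `class tag_name;` creates a distinct nested type per class.

```cpp
#include <type_traits>

// Hypothetical stand-in for a tagged boolean type.
template <typename Tag>
struct toy_bool_class {
    bool value;
};

// Problematic form: `class shared_tag` inside the template argument list
// declares ::shared_tag in the enclosing namespace, not inside the struct.
struct reader_a {
    using forwarding = toy_bool_class<class shared_tag>;
};
struct reader_b {
    using forwarding = toy_bool_class<class shared_tag>;  // names the same ::shared_tag
};

// Fixed form: a member declaration creates a nested class per struct,
// so the two aliases refer to different types.
struct reader_c {
    class local_tag;
    using forwarding = toy_bool_class<local_tag>;
};
struct reader_d {
    class local_tag;
    using forwarding = toy_bool_class<local_tag>;
};

// True: both aliases accidentally name the same type.
constexpr bool tags_collide =
    std::is_same<reader_a::forwarding, reader_b::forwarding>::value;

// True: the nested tags keep the aliases distinct.
constexpr bool tags_distinct =
    !std::is_same<reader_c::forwarding, reader_d::forwarding>::value;
```

Because `reader_a::forwarding` and `reader_b::forwarding` are the same type, a value of one can be passed where the other is expected without a compiler error, which is the loss of type safety the commit fixes.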
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means (a) we will never read-ahead beyond last_end, and (b) fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
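The choice of "last_end" described in commit 3018df11b5 above reduces to a small decision rule. The following is a minimal sketch under stated assumptions: `toy_forwarding` and `choose_last_end` are hypothetical names introduced for illustration, and byte offsets are plain integers. It does not reproduce the signature of `data_consume_rows()`.

```cpp
#include <cstdint>

// Hypothetical stand-in for the mutation_reader::forwarding flag.
enum class toy_forwarding { no, yes };

// Picks the byte position up to which the disk stream would be opened.
// - forwarding::no:  fast_forward_to() is forbidden, so nothing past the
//                    requested range is ever needed; stop at range_end.
// - forwarding::yes: the reader may later be moved to another range, so the
//                    stream stays open until the end of the file.
constexpr std::uint64_t choose_last_end(toy_forwarding fwd_mr,
                                        std::uint64_t range_end,
                                        std::uint64_t file_size) {
    return fwd_mr == toy_forwarding::yes ? file_size : range_end;
}
```

For a 1 MiB file and a range ending at byte 4096, the sketch yields 4096 with `toy_forwarding::no` and 1048576 with `toy_forwarding::yes`, which mirrors why the repair path (using forwarding::no) avoids reading ahead past its range.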
Pekka Enberg	98fb2c0b56	docs: Fix Docker Hub documentation logo The URL got broken when www.scylladb.com changed. Fix it up. Message-Id: <1497360648-19210-1-git-send-email-penberg@scylladb.com>	2017-06-19 13:11:59 +03:00
Takuya ASADA	e1459dc9ef	dist/debian: provide 3rdparty packages for Debian jessie Now that we provide prebuilt 3rdparty packages for jessie, allow running without --rebuild-dep. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1497863340-19726-1-git-send-email-syuu@scylladb.com>	2017-06-19 13:11:20 +03:00
Takuya ASADA	3d671baf3b	dist/debian: define dh_auto_configure task correctly We mistakenly placed ./configure.py in dh_auto_build, but it should be placed in dh_auto_configure. This bug causes issue #2505, since we haven't defined the dh_auto_configure task yet (it seems running cmake at the top of the directory is one of the default behaviors of dh_auto_configure). Fixes #2505 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1497866068-32097-1-git-send-email-syuu@scylladb.com>	2017-06-19 12:58:59 +03:00
Duarte Nunes	7c17eba8e8	cql3/cql3_type: Don't quote tuple types A regression introduced in `08b2ceb28e` quoted tuple type names, which, being of the form tuple<t1, ..., tn>, would always be quoted. The quoted name would then not be found in any internal data structures. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1497813816-39956-1-git-send-email-duarte@scylladb.com>	2017-06-18 22:03:57 +02:00
Duarte Nunes	ffcd4c76c2	ide: Add CMakeLists.txt for cmake-based IDEs This patch adds the CMakeLists.txt file for IDEs based on cmake, like CLion. This file assumes the existence of a build/release/gen directory, containing generated files. Refs #867 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170618151333.94714-1-duarte@scylladb.com>	2017-06-18 18:47:55 +03:00
Avi Kivity	58fd3dd006	Merge "cql3: Quote type name when needed" from Duarte "This patch set ensures we quote the name of a UDT when it contains characters that may cause parsing by the CQL parser to fail. Fixes #2491" * 'cql3-quote-type/v1' of https://github.com/duarten/scylla: cql3/util: Make maybe_quote() take argument by const reference cql3/cql3_type: Quote UDT name if needed schema: Lift maybe_quote() into cql3/util	2017-06-18 17:59:47 +03:00
Gleb Natapov	72a4554dd9	storage_proxy: Fix compilation on older (1.55) boost Boost 1.55 (ubuntu 14) fails to compile because an iterator produced by boost::adaptors::transformed() when std::ref to a lambda is passed to it does not match the iterator concept. It cannot be default constructed because std::reference_wrapper is not default constructible. boost::range::min_element() never actually default constructs it, but the concept is checked anyway. The patch fixes it by providing an explicit functor that is default constructible. Message-Id: <20170618131836.GD3944@scylladb.com>	2017-06-18 16:54:41 +03:00
Duarte Nunes	b2c5aca4cf	db/schema_tables: View mutations shouldn't always include base ones When making the schema mutations for a view update, we should only include the base table schema mutations (in case the target node doesn't contain them) when the view is being directly updated. When it is being updated as a side effect of updating the base table, then including the base schema mutations will hide the actual changes being performed on the base. Fixes #2500 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>	2017-06-18 16:29:59 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Amnon Heiman	ff3d83bc2f	node_exporter_install script update version to 0.14 Fixes #2097 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170612125724.7287-1-amnon@scylladb.com>	2017-06-18 12:25:58 +03:00
Calle Wilund	3464422051	commitlog_test: Fix reader test dropping rp handles Test wants data in live segments to read from, so should not just drop the handles returned from allocate. Message-Id: <1497344532-2616-1-git-send-email-calle@scylladb.com>	2017-06-16 22:45:46 +01:00
Etienne Kruger	be0a947596	tests: perf_simple_query: Add delete perf test Add a performance test for deletion in addition to the existing update and query tests. The deletion performance test is executed using the '--delete' argument to perf_simple_query. Fixes #2417. Signed-off-by: Etienne Kruger <el@loadavg.io> Message-Id: <20170615232500.26987-1-el@loadavg.io>	2017-06-16 14:51:00 +01:00
Duarte Nunes	b993124d94	cql3/util: Make maybe_quote() take argument by const reference Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-15 19:55:52 +00:00
Duarte Nunes	08b2ceb28e	cql3/cql3_type: Quote UDT name if needed This patch ensures we properly quote a UDT name, which may contain characters like ".", which can lead the name to be interpreted as a keyspace qualified name when parsed by the CQL parser. Fixes #2491 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-15 19:55:52 +00:00
Duarte Nunes	4886b7ed5e	schema: Lift maybe_quote() into cql3/util It's a more natural place given its current and future usages. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-15 19:55:52 +00:00
Avi Kivity	2c57ab84b2	mutation_reader: fix typo in forwarding_tag The typo went unnoticed since the compiler picked up the global scope's forwarding_tag. The bug made streamed_mutation::forwarding and mutation_reader::forwarding the same type, but fortunately there were no type mixups due to this.	2017-06-15 20:13:01 +03:00
Avi Kivity	9cf6db3de5	Merge	2017-06-15 19:11:07 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now also accept the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Avi Kivity	da24bd7c34	Merge "Balance read requests according to CF's cache hit ratio" from Gleb "During a read query with CL<ALL, not all replicas are contacted. It is possible for some replicas to cache less data for some CFs (for instance because of a node restart), so the replica choice may have a big impact on a request's completion latency and on the amount of work it generates in a cluster. This patch series keeps track of the per-CF cache hit ratio and uses this information to choose the best replicas for a request. Nodes with lower hit ratios are still contacted in order to populate their cache, but less frequently." * 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev: storage_proxy: load balance read requests according to cache hit rates choose extra replica for speculation in filter_for_query() consistency_level: drop filter_for_query_dc_local function database: reset node's hit rate information on connection drop messaging_service: connection drop notifier Store cluster wide cache hit statistics in CF messaging_service: return cache hit ratio as part of data read Distribute cache temperature over gossiper. periodically calculate avg cache hit rate between all shards database: introduce cache_temperature class Rename load_broadcaster.cc to misc_services.cc storage_proxy: use db::count_local_endpoints function instead open code it	2017-06-15 14:33:08 +03:00
Avi Kivity	7dffe7f933	Merge "parallel repair and more memory usage fix" from Asias "This series reduces repair memory usage and improves repair speed." * tag 'asias/fix-repair-2430-branch-master-v4.1' of github.com:cloudius-systems/seastar-dev: repair: Repair on all shards repair: Allow one stream plan in flight	2017-06-15 14:00:19 +03:00
Duarte Nunes	5736468a71	mutation_partition_serializer: Assume range tombstone support Range tombstones were introduced in version 1.3 and there exists no direct upgrade from 1.2 to vnext, so we can retire the code enforcing backwards compatibility. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170614211654.82501-1-duarte@scylladb.com>	2017-06-15 09:54:05 +03:00
Avi Kivity	e11f1c9cc3	tests: fix partitioner_test build on gcc 5	2017-06-14 17:22:01 +03:00
Gleb Natapov	4fdfa2dbb7	gdb: Fix "scylla heapprof" command Message-Id: <20170612084241.GF21915@scylladb.com>	2017-06-14 15:41:39 +02:00
Gleb Natapov	c7a59ab7ff	do not calculate serialized size of commitlog_entry_writer before final format is known Currently the commitlog_entry_writer constructor calculates the serialized size before it is known whether a schema should be included in the entry. The result is never used since it is recalculated when schema information is supplied. The patch removes the needless calculation. Message-Id: <20170614114607.GA21915@scylladb.com>	2017-06-14 14:53:07 +03:00
Gleb Natapov	a032078410	intern also tuple and user defined types Currently each time a UDT or tuple is parsed a new object is created. If those objects are used to create a container type repeatedly it will cause a memory leak, since container types are interned but lookup in the cache is done using a pointer to the contained type (which will always be different for UDTs and tuples). This patch also interns UDTs and tuples, so each time the same type is parsed the same pointer is returned. Refs #2469 Fixes #2487 Message-Id: <20170612142942.GO21915@scylladb.com>	2017-06-14 14:41:17 +03:00
Asias He	47345078ec	repair: Repair on all shards Currently, shard zero is the coordinator of the repair. All the work of checksumming the local node and sending the repair checksum rpc verb is done on shard zero only. This causes other shards to be underutilized. With this patch, we split the ranges that need to be repaired into at least smp::count ranges, so sizeof(ranges) / smp::count will be assigned to each shard. For example, if we have 8 shards and 256 ranges, each shard will repair 32 ranges. Each shard will repair the 32 ranges sequentially. There will be at most 8 (smp::count) ranges of repair in parallel.	2017-06-14 17:52:49 +08:00
Asias He	54831a344c	repair: Allow one stream plan in flight In "repair: Use more stream_plan" (commit `2043ffc064`), we switched to streaming while doing the checksum instead of streaming only after the checksum phase is completed. We take a parallelism_semaphore before we do the checksum; if there are more than sub_ranges_to_stream (1024) ranges, we start a stream_plan and wait for the streaming to complete (still under the parallelism_semaphore). So at most parallelism_semaphore (100) stream_plans can be in parallel. The parallelism_semaphore limits the parallelism of both the checksum and the streaming plan. However, it is not necessary to have the same parallelism for both checksum and streaming, because 1) a streaming operation itself runs in parallel (handling ranges on all shards in parallel, sending mutations in parallel), and 2) more streaming plans (in the worst case 100) means we can write to 100 memtables at the same time and flush 100 memtables to disk at the same time, which can take a lot of memory. With this patch, we only allow one stream plan in flight.	2017-06-14 17:52:36 +08:00
Calle Wilund	525730e135	database: Fix assert in truncate to handle empty memtables+sstables If we do two truncates in a row, the second will have neither memtable nor sstable data. Thus we will not write/remove sstables, and thus get no resulting truncation replay position. Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>	2017-06-14 11:21:21 +02:00
Gleb Natapov	87094849fa	storage_proxy: load balance read requests according to cache hit rates This patch makes storage proxy choose replicas to read from based on their cache hit rates. Replicas with higher cache hit rates will see more requests while replicas with lower hit rates will see fewer. The local node has a special bonus and will get more requests even if another node has a slightly higher cache hit rate (same goes for local vs remote DC), but after the patch it is no longer guaranteed that a coordinator node will be chosen as a replica for the read (if the feature is enabled).	2017-06-13 09:57:14 +03:00
Gleb Natapov	bc8aa1b4ee	choose extra replica for speculation in filter_for_query() Currently storage proxy has to loop over remaining replicas to search for suitable extra replica, but doing it in filter_for_query() is extremely easy, so do it there instead.	2017-06-13 09:57:14 +03:00
Gleb Natapov	8437ea3b99	consistency_level: drop filter_for_query_dc_local function Merge filter_for_query_dc_local() functionality into filter_for_query(). This is more efficient since filter_for_query_dc_local() partitions endpoints into 'local' and 'remote' set but filter_for_query() already does it for CL=LOCAL so for such queries we needlessly do it twice.	2017-06-13 09:57:14 +03:00
Gleb Natapov	ca812a8ea0	database: reset node's hit rate information on connection drop Node may go down, so after it restarts cache hit rate info will be incorrect and it can be overwhelmed with traffic until new and up-to-date cache hit rate arrives. Solve this by dropping node's information on connection reset, it is more accurate than relying on gossip which may be slow and miss reboot of a node.	2017-06-13 09:57:14 +03:00
Gleb Natapov	23c51b3e57	messaging_service: connection drop notifier Allow registering callbacks that will be called when connection is going down.	2017-06-13 09:57:14 +03:00
Gleb Natapov	0e4d5bc2f3	Store cluster wide cache hit statistics in CF	2017-06-13 09:57:14 +03:00
Gleb Natapov	69c5526301	messaging_service: return cache hit ratio as part of data read	2017-06-13 09:57:14 +03:00
Gleb Natapov	8ca1432b04	Distribute cache temperature over gossiper. When a node starts it does not have any information about the cache temperature of other nodes in the cluster and it is hard (if not impossible) to make the right guess. During cluster startup all nodes have cold caches, so there is no point in redirecting reads to other nodes even though the local cache is cold, but if only that node restarted then other nodes have populated caches and reads should be redirected. The node will get up-to-date information about other nodes' caches, but only after receiving the first reply; until then it does not have the information to make the right decisions, which may cause unwanted spikes immediately after restart. Having cache temperature in gossiper helps to solve the problem.	2017-06-13 09:57:14 +03:00
Gleb Natapov	991ec4a16c	periodically calculate avg cache hit rate between all shards This patch adds new class cache_hitrate_calculator whose responsibility is to periodically calculate average cache hit rates between all shards for each CF.	2017-06-13 09:57:14 +03:00
Gleb Natapov	fab18c0c5a	database: introduce cache_temperature class The class will represent cache hit rate for a column family and is serializable for use with RPC.	2017-06-13 09:57:14 +03:00
Gleb Natapov	f59ecc2687	Rename load_broadcaster.cc to misc_services.cc load_broadcaster is very small class, move it into generic file so that we can put other small services there to save on compilation time.	2017-06-13 09:57:14 +03:00
Gleb Natapov	7bcf4c690f	storage_proxy: use db::count_local_endpoints function instead open code it	2017-06-13 09:57:14 +03:00
Gleb Natapov	21197981a5	Fix use after free in nonwrapping_range::intersection end_bound() returns temporary object (end_bound_ref), so it cannot be taken by reference here and used later. Copy instead. Message-Id: <20170612132328.GJ21915@scylladb.com>	2017-06-12 15:34:36 +01:00
Tomasz Grabiec	20095d7ed6	gdb: Fix "scylla column_families" command Apparently some GDB versions (7.11.1-86.fc24) don't parse double '>' in a type name, so this: std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family>> should be this: std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family> > Message-Id: <1497256644-4335-1-git-send-email-tgrabiec@scylladb.com>	2017-06-12 11:39:50 +03:00
Tomasz Grabiec	9e7a040f0c	gdb: Fix "scylla keyspaces" command The problem is that 'key' is a 'bytes' object now, which doesn't have __format__. Fixes the following error: Traceback (most recent call last): File "~/src/scylla/scylla-gdb.py", line 184, in invoke TypeError: non-empty format string passed to object.__format__ Error occurred in Python command: non-empty format string passed to object.__format__ Message-Id: <1497253433-374-2-git-send-email-tgrabiec@scylladb.com>	2017-06-12 11:22:59 +03:00
Tomasz Grabiec	230683bdfa	gdb: Add missing seastar namespace qualifier Message-Id: <1497253433-374-1-git-send-email-tgrabiec@scylladb.com>	2017-06-12 11:22:53 +03:00
Asias He	2bcb368a13	repair: Fix range use after free Capture it by value. scylla: [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed) scylla: [shard 0] repair - Failed sync of range ==<runtime_exception (runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed) Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>	2017-06-12 11:00:57 +03:00
Avi Kivity	419ad9d6cb	Merge "repair memory usage fix" from Asias "This series switches repair to use more stream plans to stream the mismatched sub ranges and use a range generator to produce sub ranges. Test shows no huge memory is used for repair with large data set. In addition, we now have a progress reporter in the log how many ranges are processed. Jun 06 14:18:22 [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942] Jun 06 14:19:55 [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942] Fixes #2430." * tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev: repair: Remove unused sub_ranges_max repair: Reduce parallelism in repair_ranges repair: Tweak the log a bit repair: Use more stream_plan repair: iterator over subranges instead of list	2017-06-08 14:19:08 +03:00
Tomasz Grabiec	9b7f170121	gdb: Improve error message Message-Id: <1496849069-21750-1-git-send-email-tgrabiec@scylladb.com>	2017-06-07 18:26:31 +03:00
Tomasz Grabiec	0dfe1ad431	Merge "Relax replay position ordering requirement" from Calle From seastar-dev.git calle/concorde Normally, we require that all mutations applied to a column family have replay positions higher than all previously flushed. The main reason for this is to be able to determine when to drop a commit log segment, i.e. determine that all replay positions less than X are now in sstables. This patch series, small as it is, relaxes this by instead of just keeping track of high rp applied, keep a reference count to each segment per CF in memtables, and on flush, release this very count. The only case where we need to keep a water mark for RP is then for table truncation, for which we simply say that the highest RP applied to the column family is the lowest allowed henceforth, and use the old reordering logic for this instead. I.e. very rare. There is of course one (big?) downside to all this, and this is "normal" commit log replay on startup after crash/shutdown. Since we relax RP ordering, we cannot use RP:s in sstables as low marks for replay start, since it is now allowed to exist non-persisted mutations in commitlog with lower RP:s than previously flushed. I.e. we more or less always have to replay the full commit log. It is worth noting though that due to compaction and the non- propagation of RP marks to new sstables, we end up often doing this anyway, so it is hard to say how much of a regression this is.	2017-06-07 14:51:28 +02:00
Calle Wilund	18806989b6	database: remove hard rp ordering requirement, set low rp mark on truncate With commitlog keeping use-count per CF id, we can ease the ordering restriction on replay position. Previously we required that all added mutations have a position > previously flushed. However, if we accept that replay must now be all data, by keeping track instead per CF of highest RP ever entered, we can instead just set a low mark on truncation, since this is the only remaining hard RP divider.	2017-06-07 12:07:01 +00:00
Calle Wilund	d9b8c79eb9	commitlog_replayer: Ignore sstable replay positions With relaxed position ordering, we cannot use existing sstables as water mark for replay. We must replay everything above truncation marks.	2017-06-07 12:07:01 +00:00
Calle Wilund	2913241df1	memtable/commitlog: Change bookkeeping to track individual segments Use per CF-id reference count instead, and use handles as result of add operations. These must either be explicitly released or stored (rp_set), or they will release the corresponding replay_position upon destruction. Note: this does _not_ remove the replay positioning ordering requirement for mutations. It just removes it as a means to track segment liveness.	2017-06-07 12:07:01 +00:00
Calle Wilund	0c598e5645	commitlog_test: Fix test_commitlog_delete_when_over_disk_limit Test should a.) Wait for the flush semaphore b.) Only compare segment sets between start and end, not start, end and in between. I.e. the test sort of assumed we started with < 2 (or so) segments. Not always the case (timing) Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>	2017-06-07 12:44:02 +03:00
Avi Kivity	07ff3f68e0	Merge seastar upstream * seastar b1f69cc...621b7ed (8): > net/api: Remove outdated comments > Merge "Fixes for Clang 5" from Paweł > Merge "Metrics: Safely transfer metadata between shared" from Amnon > posix: add missing #include > build: add cmake dependency > build: add -Wno-maybe-uninitialized > rpc: handle messages larger than memory limit (Fixes #2453) > doxygen: enable macro expansion	2017-06-07 11:04:56 +03:00
Takuya ASADA	7fe63c539a	dist/debian: install gdebi when it does not exist Since we started to use gdebi to install the build-dep metapackage generated by mk-build-deps, we need to install gdebi in build_deb.sh too. Fixes #2451 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496819209-30318-1-git-send-email-syuu@scylladb.com>	2017-06-07 10:24:22 +03:00
Asias He	3fdb8a3d3f	repair: Remove unused sub_ranges_max With the sub range iterator, it is not used anymore. Drop it.	2017-06-07 08:52:45 +08:00
Asias He	ca00c10b35	repair: Reduce parallelism in repair_ranges We currently repair all the ranges in parallel. 1) All the ranges will contend for parallelism_semaphore. Instead of processing multiple ranges in parallel and calculating the sub ranges (which take memory) for each range in parallel, we can handle the ranges one by one. We still have enough parallelism because the checksums are calculated on all the shards. 2) If for some reason the repair failed, if we handle ranges one by one, we can log which ranges were repaired successfully. Next time, we can ignore them. If we start ranges in parallel, there is a high chance that no single range is completed because all the ranges are ongoing. Refs #1912	2017-06-07 08:50:57 +08:00
Asias He	3852665156	repair: Tweak the log a bit - Count n out m ranges the repair is running for (kind of progress report) - Make the 'Found differing range' log debug because it can be millions of such entries - Print the failed ranges	2017-06-07 08:50:57 +08:00
Asias He	2043ffc064	repair: Use more stream_plan In the very beginning, we used a stream_plan for each checksum range. Later, we changed to use a single stream_plan for all the checksum ranges. It pushes memory pressure to streaming, e.g., millions of ranges in a vector to send over RPC. To fix, we do checksum and streaming in parallel, and limit the number of checksum ranges stored in memory. Fixes #2430	2017-06-07 08:50:56 +08:00
Nadav Har'El	b3ff37e67f	repair: iterator over subranges instead of list When starting repair, we divided the large token ranges (vnodes) into small subranges of a desired length (around 100 partitions), and built a huge list of those subranges - to iterate over them later and compare checksums of those chunks. However, building this list up-front is completely unnecessary, and wastes a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was spent on this list. Instead, what we do in this patch is to find the next chunk in a DFS-like splitting algorithm, using only the token range midpoint() function (as before). The amount of memory needed for this is O(logN), instead of O(N) in the previous implementation. Refs #2430. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2017-06-07 08:50:56 +08:00
Raphael S. Carvalho	0ca1e5cca3	sstables: fix report of disk space used by bloom filter After change in boot, read_filter is called by distributed loader, so its update to _filter_file_size is lost. The load variant which receives foreign components that must do it. We were also not updating it for newly created sstables. Fixes #2449. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>	2017-06-06 18:20:28 +03:00
Takuya ASADA	a4c392c113	dist/debian: use gdebi instead of mk-build-deps -i At least on Debian8, mk-build-deps -i silently finishes with return code 0 even if it fails to install dependencies. To prevent this, we should manually install the metapackage generated by mk-build-deps using gdebi. Fixes #2445 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496737502-10737-2-git-send-email-syuu@scylladb.com>	2017-06-06 11:37:34 +03:00
Takuya ASADA	5608842e96	dist/debian/dep: install texlive from jessie-backports to prevent gdb build fail on jessie Installing openjdk-8-jre-headless from jessie-backports breaks texlive on the jessie main repo. It causes an 'Unmet build dependencies' error when building the gdb package. To prevent this, force installing texlive from jessie-backports before starting to build gdb. Fixes #2444 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496737502-10737-1-git-send-email-syuu@scylladb.com>	2017-06-06 11:37:33 +03:00
Paweł Dziepak	b2b78158f6	mutation_partition: restore formatting No functional change. Message-Id: <20170526104119.22075-2-pdziepak@scylladb.com>	2017-06-06 11:20:57 +03:00
Gleb Natapov	f5679e0416	database: remove remnants of no longer existing db::serializer. Message-Id: <20170604100552.GD8248@scylladb.com>	2017-06-04 13:07:17 +03:00
Raphael S. Carvalho	dcbeb42f67	sstables: explicitly close file in fsync_directory or close is called in the reactor thread when destroying the file object. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170602024346.7803-1-raphaelsc@scylladb.com>	2017-06-02 21:09:58 +03:00
Pekka Enberg	a6dc21615b	Merge "Fixes to thrift/server" from Duarte "This series fixes some issues with the thrift_server, namely ensuring that streams and sockets are properly closed. Fixes #499 Fixes #2437" * 'thrift-server-fixes/v1' of github.com:duarten/scylla: thrift/server: Close connections when stopping server thrift/server: Move connection class to header thrift/server: Shutdown connection thrift/server: Close output_stream when connection is done	2017-06-02 08:15:22 +03:00
Duarte Nunes	c525331e60	thrift/server: Close connections when stopping server Fixes #499 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-02 00:15:20 +02:00
Duarte Nunes	315c69b830	thrift/server: Move connection class to header No changes in functionality. Required for an upcoming patch. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-02 00:15:20 +02:00
Duarte Nunes	22fafd5034	thrift/server: Shutdown connection This patch adds the shutdown() function to thrift_server::connection, and calls it after a connection is done. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-02 00:15:20 +02:00
Duarte Nunes	0a5ec97b7f	thrift/server: Close output_stream when connection is done Fixes #2437 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-06-02 00:15:20 +02:00
Jesse Haber-Kucharsky	376c661823	Eliminate duplicate definition of sstable column mask values The column mask identifies the kind of atom in a row in an sstable. Two definitions of these values were present: one as a C-style enumeration and one as a C++11-style enumeration. The C++11-style definition is used elsewhere in `sstables.cc`. It also offers additional type-safety. Therefore, this commit removes the inlined C-style enumeration. Fixes #2214. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <c525b4ae7fad3b54480e133921aa4ffe0dd5d9ce.1496352711.git.jhaberku@scylladb.com>	2017-06-02 00:06:31 +02:00
Michał Matczuk	04da4dbf83	docker support for api-address Message-Id: <1b5fb2bbba1b879aae825094a0f1b77c865be139.1496318996.git.michal@scylladb.com>	2017-06-01 15:31:45 +03:00
Takuya ASADA	22339bba44	dist/debian: depend on collectd-core instead of collectd, to reduce dependencies To reduce unwanted dependencies, we need to replace the dependency on collectd with collectd-core. However, collectd provides /etc/collectd/collectd.conf, so without this package we need to install the configuration file ourselves. So install the file in the .postinst script. Fixes #2426 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496231743-7828-1-git-send-email-syuu@scylladb.com>	2017-06-01 13:20:37 +03:00
Takuya ASADA	909a9ebf97	dist/debian: provide prebuilt 3rdparty packages for Ubuntu 16.04 Currently we only offer prebuilt packages for 14.04, but we have 16.04 ones on s3, so use them. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1496301544-15251-1-git-send-email-syuu@scylladb.com>	2017-06-01 10:37:52 +03:00
Duarte Nunes	15a62701f2	test.py: Ensure view_schema_test runs with only one cpu In the write path we don't wait for view updates, as they happen in the background. The view schema tests can fail when running with more than one cpu due to this inherent race condition: the write to the base table returns while the view updates are still being processed, after which we issue a query to the view table. The shard handling the view data is not guaranteed to finish processing the mutation before handling the query. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170531165726.9212-1-duarte@scylladb.com>	2017-05-31 19:17:51 +01:00
Raphael S. Carvalho	b8091799ca	lcs: fix off-by-one comparison invariant is broken if size of L0 candidates is equal to max sstable size because the overlapping L1 sstables will not be added to compacting set, and they will be promoted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170530143708.3775-1-raphaelsc@scylladb.com>	2017-05-30 17:39:51 +03:00
Avi Kivity	15af6acc8b	dist: redirect stdout/stderr to the journal on systemd systems Fixes #2408. Message-Id: <20170524080729.10085-1-avi@scylladb.com>	2017-05-30 08:47:17 +03:00
Avi Kivity	1c84aae0c1	Merge seastar upstream * seastar 68dbf60...b1f69cc (10): > metrics: fix namespace in documentation > add special logger for memory allocation failures > xen: remove > Merge "sanity checks, fixes and extensions in the perftune.py" from Vlad > tutorial: more "seastar" namespace > execution_stages: fix build errors in comments > tutorial: more "seastar" namespace additions > tutorial: more minor changes > tutorial: minor changes to the introduction > tutorial: start overhauling the examples to use "seastar" namespace	2017-05-29 19:02:02 +03:00
Calle Wilund	3512ed4596	storage_service/config: Add "native_transport_port_ssl" option Mimic origin behaviour, iff TLS encryption is enabled, and native_transport_port_ssl is set and different from native_transport_port, start both tls- and non-tls listeners. Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>	2017-05-29 15:53:56 +03:00
Calle Wilund	1b387a1f56	cql server: Allow multiple listeners on different ports Need to separate "notifiers" to per-port/address and keep life span as such. Message-Id: <1496061600-24454-1-git-send-email-calle@scylladb.com>	2017-05-29 15:53:50 +03:00
Avi Kivity	ef98afa748	build: make swagger generated code depend on the code generator Fixes failures when moving between branches due to the seastar namespace change. Message-Id: <20170528100052.29131-1-avi@scylladb.com>	2017-05-29 13:17:42 +02:00
Avi Kivity	8979d7abf0	Deprecate non-murmur3 partitioners Removing non-murmur3 partitioners will allow us to reduce memory footprint and speed up some code by utilizing the properties of the murmur3 partitioner token. Message-Id: <20170528172536.16079-1-avi@scylladb.com>	2017-05-28 19:35:56 +02:00
Takuya ASADA	36ccbc1539	dist/ami: follow rpm output dir path change CentOS mock support on build_rpm.sh changed rpm output directory, so follow it. Fixes #2406 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1495573343-13912-1-git-send-email-syuu@scylladb.com>	2017-05-28 13:02:36 +03:00
Amos Kong	f655639e5a	scylla_setup: fix deadloop in inputting invalid option example: # scylla_setup --invalid-opt Fixes #2305 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <9a4f631b126d8eaaae479fa99137db7a61a7c869.1493135357.git.amos@scylladb.com>	2017-05-28 13:02:10 +03:00
Takuya ASADA	bdec38d23c	dist/common/scripts/scylla_setup: skip SELinux setup when it's already disabled It doesn't make sense to ask "Do you want to disable SELinux?" when SELinux is already disabled, so skip the whole question. Fixes #2411 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1495652423-20806-1-git-send-email-syuu@scylladb.com>	2017-05-28 13:00:10 +03:00
Avi Kivity	c4faa1e202	Merge "tracing: tracing spans and time series helper table" from Vlad " - Introduce a parent span ID and span ID paradigm. - Introduce time series tables to simplify traces processing. - Add the "How to get traces?" chapter to the tracing.md. " * 'tracing-span-ids-and-time-series-helpers-v4' of github.com:cloudius-systems/seastar-dev: docs: tracing.md: add a "how to get traces" chapter tracing::trace_keyspace_helper: introduce a time series helper tables tracing: cleanup: use nullptr instead of trace_state_ptr() tracing: introduce a span ID and parent span ID	2017-05-28 12:01:35 +03:00
Paweł Dziepak	d9dd798c4f	counter_write_query: avoid use-after-free on partition range Message-Id: <20170526104119.22075-1-pdziepak@scylladb.com>	2017-05-28 11:41:30 +03:00
Raphael S. Carvalho	41137c7fb6	compaction: use sstable::bytes_on_disk for calculating start and end size Currently, start and end size of compaction are calculated using the uncompressed size of data component. bytes_on_disk() returns size used by all components. NOTE: start and end size are written to compaction history, so users who monitor it should be aware of this change. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>	2017-05-28 11:33:24 +03:00
Raphael S. Carvalho	3b5ad23532	db: fix computation of live disk usage stat after compaction sstable::data_size() is used by rebuild_statistics() which only returns uncompressed data size, and the function called by it expects actual disk space used by all components. Boot uses add_sstable() which correctly updates the stat with sstable::bytes_on_disk(). That's what needs to be used by r__s() too. Fixes #1592 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>	2017-05-28 10:38:32 +03:00
Vlad Zolotarov	1ae40ee91a	utils::timestamped_val: fix the touch() comment The current comment was written when the function was not yet a timestamped_val member. Let's adjust it to the current code. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1495555659-10881-1-git-send-email-vladz@scylladb.com>	2017-05-26 19:26:56 +03:00
Vlad Zolotarov	0619c2cb71	utils::serialization: remove not used deserialization_xxx() functions Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1495556124-16672-1-git-send-email-vladz@scylladb.com>	2017-05-26 19:26:20 +03:00
Tomasz Grabiec	de70d942a9	memtable: Decouple from sstable We can make the dependency more abstract by using mutation_source instead of an sstable. Will be useful in some stress tests which want to avoid the disk, but is also good for the sake of decoupling. Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>	2017-05-25 19:30:21 +03:00
Tomasz Grabiec	f3a6d94398	sstables: Introduce sstable::as_mutation_source() Adaptors extracted from existing testing code. Message-Id: <1495729508-30081-1-git-send-email-tgrabiec@scylladb.com>	2017-05-25 19:30:20 +03:00
Glauber Costa	3d3afd8f11	node_exporter: add interrupt information Information about interrupts is invaluable when debugging performance problems with Scylla in the field. node_exporter doesn't include that in the list of collectors enabled by default, so we suggest we do it here. The list that goes into this file is the default list as shown by node_exporter, with "interrupts" added to it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170525153008.26720-1-glauber@scylladb.com>	2017-05-25 19:11:30 +03:00
Avi Kivity	a8a82433e5	Merge "improve lcs promotion decision with compression enabled" from Raphael "lcs' current behavior will make it hard to reduce number of levels by increasing sstable size because it uses uncompressed length when deciding whether or not a level needs promotion. Demotion process is slower because of that." * 'lcs_promotion_improvement_2' of github.com:raphaelsc/scylla: lcs: use sstable compressed length when computing level size sstables: introduce sstable::ondisk_data_size sstables: unconditionally set sstable data file size	2017-05-25 12:37:24 +03:00
Tomasz Grabiec	848ca035a2	gdb: Adjust scylla-gdb.py for the namespace change in seastar Message-Id: <1495700444-29269-1-git-send-email-tgrabiec@scylladb.com>	2017-05-25 11:52:42 +03:00
Raphael S. Carvalho	b7e1575ad4	db: remove partial sstable created by memtable flush which failed partial sstable files aren't being removed after each failed attempt to flush memtable, which happens periodically. If the cause of the failure is ENOSPC, memtable flush will be attempted forever, and as a result, column family may be left with a huge amount of partial files which will overwhelm subsequent boot when removing temporary TOC. In the past, it led to OOM because removal of temporary TOC took place in parallel. Fixes #2407. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>	2017-05-25 11:50:02 +03:00
Raphael S. Carvalho	0a105473df	lcs: use sstable compressed length when computing level size lcs uses uncompressed length of sstables when computing size of a level, and that may result in unnecessary promotion when the goal is to reduce the number of levels after an increase in sstable size, for example. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-05-24 20:10:02 -03:00
Raphael S. Carvalho	b2dc0b2db5	sstables: introduce sstable::ondisk_data_size this new function is an alternative to data_size(), which will return size of data component. data_size() returns uncompressed size of data component if compression is enabled. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-05-24 20:10:00 -03:00
Raphael S. Carvalho	e02bf6da58	sstables: unconditionally set sstable data file size Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-05-24 20:09:58 -03:00
Piotr Jastrzebski	6528f3a963	Make sure mutation_reader for sstables can be fast-forwarded Fixes #2145. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: Extracted from a series, fixed title] Message-Id: <1495639745-19387-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 16:36:24 +01:00
Tomasz Grabiec	6cf2841654	mvcc: Extract partition_snapshot_reader to separate header Right now the whole world includes it transitively, which results in painful recompiles when the code changes. Relax dependencies. Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 12:13:15 +01:00
Asias He	f792c78c96	streaming: Do not abort session too early in idle detection Streaming usually takes a long time to complete. Aborting it on a false positive idle detection can be very wasteful. Increase the abort timeout from 10 minutes to a very large timeout, 300 minutes. A truly idle session will be aborted eventually if other mechanisms, e.g., streaming manager has gossip callback for on_remove and on_restart event to abort, do not abort the session. Fixes #2197 Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>	2017-05-24 12:29:50 +03:00
Paweł Dziepak	3b9c0a6ae2	Merge "loading_cache: fix the known complexity issue in the shrink() method" from Vlad Use the boost::intrusive containers in order to achieve a O(1) complexity for both "LRU list" update and to minimize the memory overhead in the hash table item to "LRU list" item connection: - Make the timestamped_val be both a bi::list and a bi::unordered_set item. - Make a bi::unordered_set be a cache backend instead of the std::unordered_map. As a result dropping k LRU items becomes an O(k) operation instead of O(N log N), where N is a total number of all cached items: - Every time a value is read - move it to the front of the "LRU list" (O(1)). - When we need to remove k LRU items: - Repeat k times: - Take an element from the back of the "LRU list". (O(1)). - Remove it from the bi::unordered_set and dispose. (O(1)). We use an auto-unlink configuration for bi::list, therefore disposing an item is going to auto unlink it from the list. * 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev: utils::loading_cache: cleanup utils/loading_cache.hh: use intrusive list to store the lru entry utils::loading_cache: implement automatic rehashing utils::loading_cache: make the underlying map to be an intrusive unordered_set	2017-05-23 16:18:16 +01:00
Tomasz Grabiec	cec8d7f38c	gdb: Fix error about gdb.Value not being convertible to int by %x format Message-Id: <1495538843-27777-1-git-send-email-tgrabiec@scylladb.com>	2017-05-23 15:38:58 +03:00
Avi Kivity	fd0e1eb1e2	Merge "Fixes for mutation algebra" from Tomasz "Enforces commutativity of addition: m1 + m2 == m2 + m1 and consistency of difference and addition with equality: m1 + (m2 - m1) == m1 + m2" * tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev: mutation: Make compare_*_for_merge() consistent with equals() tests: mutation: Improve assertion failure message tests: Use default equality in test_mutation_diff_with_random_generator mutation: Make counter cell difference consistent with apply tests: range_tombstone_list_test: Improve error message tests: range_tombstone_list: Check adjacent range merging range_tombstone_list: Merge adjacent range tombstones in apply() tests: mutation: Check commutativity of mutation addition range_tombstone_list: Avoid violating set invariant range_tombstone_list: Make tombstone merging commutative range_tombstone_list: Add erase() operation to the reverter range_tombstone_list: Make all undo operations ordered relative to each other utils: Extract to_boost_visitor() to a separate header allocating_strategy: Introduce alloc_strategy_unique_ptr<>	2017-05-23 15:20:38 +03:00
Tomasz Grabiec	804f46f684	mutation: Make compare_*_for_merge() consistent with equals() equals() considers expiring cells to be different from non-expiring cells, but compare_row_marker_for_merge() considers them equal. Fix the latter to pick expiring cells. The choice was arbitrary.	2017-05-23 13:35:03 +02:00
Tomasz Grabiec	c1475a8eb2	tests: mutation: Improve assertion failure message	2017-05-23 13:16:03 +02:00
Tomasz Grabiec	d15880b3b7	tests: Use default equality in test_mutation_diff_with_random_generator	2017-05-23 13:16:03 +02:00
Tomasz Grabiec	9dbae279ad	mutation: Make counter cell difference consistent with apply The case when both cells are dead was not handled properly, the diff was always empty, whereas the cell with higher timestamp should win. Caused test_mutation_diff_with_random_generator to fail.	2017-05-23 13:16:03 +02:00
Tomasz Grabiec	951da421db	tests: range_tombstone_list_test: Improve error message	2017-05-23 13:16:03 +02:00
Tomasz Grabiec	bee40b4628	tests: range_tombstone_list: Check adjacent range merging	2017-05-23 13:16:03 +02:00
Tomasz Grabiec	3c509308ab	range_tombstone_list: Merge adjacent range tombstones in apply() Needed for equivalence to work correctly with difference and addition: m1 + (m2 - m1) = m1 + m2 Fixes #2158.	2017-05-23 13:16:03 +02:00
Tomasz Grabiec	ef4c7c458c	tests: mutation: Check commutativity of mutation addition	2017-05-23 12:11:12 +02:00
Tomasz Grabiec	1dea251ca2	range_tombstone_list: Avoid violating set invariant The code was inserting an entry with the same key as its successor, and only later adjusting the key of the old entry. This is violating set's invariant of unique keys, and insertion may cause rebalancing. I don't know if this violation actually causes problems currently, but it's safer not to. Fix by first updating the existing entry and then inserting the new one.	2017-05-23 12:11:12 +02:00
Tomasz Grabiec	a2a22e5f00	range_tombstone_list: Make tombstone merging commutative Example of non-commutative case: a = [1, 5]@t2 b = {[2, 3]@t1, [4, 5]@t1} a + b = [1, 5]@t2 b + a = [1, 4)@t2, [4, 5]@t2 After this patch, both will yield [1, 5]@t2. The patch also changes the logic so that overlaps of tombstones with equal timestamps are handled symmetrically. They are now merged instead of split on either of the boundaries. Refs #2158.	2017-05-23 12:11:12 +02:00
Tomasz Grabiec	c4dac7c80f	range_tombstone_list: Add erase() operation to the reverter	2017-05-23 12:11:12 +02:00
Tomasz Grabiec	935709cddc	range_tombstone_list: Make all undo operations ordered relative to each other Later operation may depend on the result of previous operation. Same dependency is present when reverting the operations. Fixes assertion failure in update reverter.	2017-05-23 12:11:12 +02:00
Vlad Zolotarov	2d4d198fb9	utils::loading_cache: cleanup - Remove "_" at the beginning of the type names. - s/Pred/EqualPred/ Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-22 23:02:18 -04:00
Vlad Zolotarov	fd59a548c0	utils/loading_cache.hh: use intrusive list to store the lru entry Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive list entry to the head of the list every time the values are read. This will keep the list ordered by the last read time from the most recently read to the least recently read entry. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-22 23:00:18 -04:00
Vlad Zolotarov	0c4e9efce7	utils::loading_cache: implement automatic rehashing - Start the cache with 256 buckets - the minimum number of buckets. - Limit the maximal number of buckets by 1M buckets. - Keep the load factor between 0.25 and 1.0 as long as the number of buckets is between the minimum and the maximum values mentioned above. - Grow and shrink the hash every "refresh" period if needed. - Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options. - Enable bi::unordered_set_base_hook's bi::store_hash option. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-22 22:57:44 -04:00
Vlad Zolotarov	2be3596a4f	utils::loading_cache: make the underlying map to be an intrusive unordered_set Make the underlying map to be a boost::intrusive::unordered_set<timestamped_val> instead of std::unordered_map<Key, timestamped_val>. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-22 18:45:13 -04:00
Tomasz Grabiec	5aeb9eb70c	utils: Extract to_boost_visitor() to a separate header	2017-05-22 19:30:02 +02:00
Tomasz Grabiec	69e2eccf68	allocating_strategy: Introduce alloc_strategy_unique_ptr<>	2017-05-22 19:30:02 +02:00
Raphael S. Carvalho	4b4a1883aa	refresh: do not use default priority for loading new sstables Metadata is read using default priority class, which can significantly slow down the process under high load. Compaction class can be used, and if it turns out to be a problem, we can switch to a special class for it. Fixes #1859. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>	2017-05-22 19:03:17 +03:00
Avi Kivity	ef428d008c	Merge "reduce memory requirement for loading sstables" from Raphael "fixes a problem in which memory requirement for loading in-memory components of sstables is very high due to unlimited parallelism." * 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla: database: fix indentation of distributed_loader::open_sstable database: reduce memory requirement to load sstables sstables: loads components for a sstable in parallel sstables: enable read ahead for read of in-memory components sstables: make random_access_reader work with read ahead	2017-05-22 18:23:03 +03:00
Raphael S. Carvalho	28206993a4	database: fix indentation of distributed_loader::open_sstable Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-05-22 11:52:52 -03:00
Raphael S. Carvalho	a4e414cb3b	database: reduce memory requirement to load sstables SSTable load temporarily uses more space than needed to store metadata, due to: 1) All components are read using read_simple() which uses 128k buffer. file::dma_read_bulk() will allocate 128k, and may potentially allocate another big buffer (128k - read) for file::read_maybe_eof(). 2) read_filter() may use double the space it needs to. Due to the fact that sstable loading parallelism is unlimited, Scylla may require much more memory to load all sstables, and that may lead to OOM. The higher the number of sstables, the higher the memory overhead. To confirm this problem, I wrote a test[1] which loads 30k sstables in parallel and reports the memory usage peak in the end. When loading 30k sstables, each with ~300kb of metadata, memory usage peak was ~18GB. When loading completed, only ~9GB were needed to store all the metadata. [1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d To fix this problem, we need to set a limit on load parallelism (let's start with a small number like 3 and adjust later if needed) and rely on readahead so that the requirement drops considerably without increasing boot time. Actually, boot time is improved by it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com>	2017-05-22 11:52:51 -03:00
Raphael S. Carvalho	043fae2ef5	sstables: loads components for a sstable in parallel Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com>	2017-05-22 11:52:49 -03:00
Raphael S. Carvalho	0ac729fd57	sstables: enable read ahead for read of in-memory components Read ahead 4 is used. Let's adjust it later if needed. File size is used to prevent file_input_stream from issuing useless reads beyond file size with read ahead enabled. We can switch to variant without length once file_input_stream handles it properly. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-05-22 11:52:37 -03:00
Raphael S. Carvalho	77b8870cf3	sstables: make random_access_reader work with read ahead Scylla crashes if read ahead is enabled by file_random_access_reader because a call to seek() destroys the existing input stream without closing it first. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-05-22 11:52:33 -03:00
Duarte Nunes	6ac73b57fb	cql3/statements/select_statement: Remove dead code Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170522100230.17393-1-duarte@scylladb.com>	2017-05-22 14:32:12 +03:00
Avi Kivity	5828ddcca4	Merge seastar upstream * seastar 4af898c...68dbf60 (4): > dpdk: follow namespace changes to fix compile error > perftune.py: fix regression introduced in df5f74ac > doc: typo in README.md > posix_net: load-balance connections	2017-05-22 12:39:48 +03:00
Asias He	b56ba02335	gossip: Make bootstrap more robust The bootstrapping node will be a gossip only member, until the streaming finishes and the node becomes NORMAL state. If during this time, the bootstrapping node is overwhelmed with streaming, it is possible the node will delay updating the gossip heartbeat. Be forgiving for the bootstrapping node and do not remove it from gossip too fast. Otherwise, streaming rpc verbs will not be resent because the node is not in gossip membership anymore. Fixes #2150 Message-Id: <286d7035d854f2a48abf4e1e2e3bfcb8b22b9ca2.1494553580.git.asias@scylladb.com>	2017-05-21 19:25:40 +03:00
Takuya ASADA	7777b558c4	dist/redhat: Use mock for CentOS/RHEL rpms Enable mock for CentOS/RHEL, also support cross building by mock. Fixes #630 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170513171200.14926-1-syuu@scylladb.com>	2017-05-21 19:22:54 +03:00
Avi Kivity	2f23648b9e	Revert "dist: add conflict with Cassandra" This reverts commit `da55aecca3`. Instead of an install-time conflict, we'll add a run-time conflict.	2017-05-21 18:37:59 +03:00
Alexys Jacob	c8116b4252	scylla_raid_setup: fix typo on print_usage Simple typo fix on the usage message output, the script name was not correct. Signed-off-by: Alexys Jacob <ultrabug@gentoo.org> Message-Id: <20170519145851.6205-1-ultrabug@gentoo.org>	2017-05-21 18:01:28 +03:00
Avi Kivity	5b182537db	Merge seastar upstream * seastar 8aef5f5...4af898c (4): > memory: fix debug build > tests: fix slab_test build > xen: fix fallouts from seastar namespace change > build: make swagger generated files depend on the code generator	2017-05-21 13:48:24 +03:00
Alexys Jacob	8dbad4f34a	scylla_sysconfig_setup: fix typo on print_usage Simple typo fix on the usage message output, the script name was not correct. Signed-off-by: Alexys Jacob <ultrabug@gentoo.org> Message-Id: <20170519143227.2741-1-ultrabug@gentoo.org>	2017-05-21 13:41:43 +03:00
Alexys Jacob	c0756d97b8	scylla_setup: fix typos on cpu scaling messages This fixes typos on CPU scaling related messages. Signed-off-by: Alexys Jacob <ultrabug@gentoo.org> Message-Id: <20170519143703.3574-1-ultrabug@gentoo.org>	2017-05-21 13:41:42 +03:00
Glauber Costa	5f99158889	api: return correct values for bloom filter statistics We are currently suspecting that the bloom filter false positive ratio is not being respected. While trying to debug that, I found out that we have a more basic problem: The numbers are all meaningless, because the stats are wrong. We are accumulating by summing the ratios together. It's easy to see how this doesn't work, if we look at an example where the ratio for some CFs is zero: SST1: false = 1, total = 2. ratio = 0.5 SST2: false = 0, total = 98 . ratio = 0. The real ratio in this example is 1 / (98 + 2) = 1 %, but the displayed ratio will be 0.5 + 0 = 0.5. This patch will map reduce all the sstables together keeping both numerator and denominator, yielding the right value at the end. To do that, we'll reuse the existing ratio_holder class, which already does exactly what we want. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170518222333.16307-1-glauber@scylladb.com>	2017-05-21 13:11:22 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Avi Kivity	dab2783b58	Merge seastar upstream * seastar 45b718b...f726938 (2): > memory: add --mbind option to suppress warning message when running Seastar apps on container > Add support for Gentoo Linux irqbalance configuration detection.	2017-05-20 21:15:46 +03:00
Avi Kivity	c8cb3d6ff5	Merge "Materialized views: bug fixes and unit tests" from Duarte "This series fixes bugs related to materialized views, most pertaining to column filtering in the where clause." * 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla: tests/view_schema_test: Add more test cases tests/cql_assertions: Add assertion for row set equality single_column_relation: Correctly print IN relation statement_restrictions: Allow filtering regular columns for views statement_restrictions: Relax clustering restrictions for views statement_restrictions: Relax partition restrictions for views cql3/statements: Prevent setting default ttl on view cql3/restrictions: Complete implementation of is_satisfied_by() db/view: Re-implement clustering_prefix_matches() db/view: Re-implement partition_key_matches() db/view: Generate regular tombstone for base deletions db/view: Consider cell liveness when generating updates db/view: Don't generate view updates for static rows	2017-05-20 13:52:56 +03:00
Tomasz Grabiec	cd4d15672b	utils: estimated_histogram: Fix clear() It was a no-op. It doesn't seem currently used, but I will have a use for it soon. Message-Id: <1495198172-1969-1-git-send-email-tgrabiec@scylladb.com>	2017-05-19 14:34:34 +01:00
Paweł Dziepak	c560cf9d9d	Merge "fixes and improvements in the permissions cache implementation" from Vlad "There are numerous issues in the current implementation of permissions cache starting from the logical errors and bugs and ending with the suboptimal implementation described in the issue #2262." * 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev: utils::loading_cache: avoid the reads storm when the key is not in the cache utils::loading_cache: cleanup utils::loading_cache: align the constraints in the constructor with the parameters description utils::loading_cache: refresh in the background auth::auth: add operator<<() for a permission_cache key auth::auth::permissions_cache: use the values from the configuration - don't try to be smart db::config: define a saner default value for permissions_validity_in_ms	2017-05-18 13:33:05 +01:00
Vlad Zolotarov	6a63c87a9f	utils::loading_cache: avoid the reads storm when the key is not in the cache Use a mutex to serialize producers when the key is not present in the cache. Fixes #2262 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-18 07:55:48 -04:00
Tomasz Grabiec	3fc1703ccf	range: Fix SFINAE rule for picking the best do_lower_bound()/do_upper_bound() overload mutation_partition has a slicing constructor which is supposed to copy only the rows from the query range. The rows are located using nonwrapping_range::lower_bound() and nonwrapping_range::upper_bound(). Those two have two different implementations chosen with SFINAE. One is using std::lower_bound(), and one is using container's built in lower_bound() should it exist. We're using intrusive tree in mutation_partition, so container's lower_bound() is preferred. It's O(log N) whereas std::lower_bound() is O(N), because tree's iterator is not random access. However, the current rule for picking container's lower_bound() never triggers, because lower_bound() has two overloads in the container: ./range.hh:618:14: error: decltype cannot resolve address of overloaded function typename = decltype(&std::remove_reference<Range>::type::upper_bound)> ^~~~~~~~ As a result, the overload which uses std::lower_bound() is used. Spotted when running perf_fast_forward with wide partition limit in cache lifted off. It's so slow that I timed out waiting for the result (> 16 min). Fixes #2395. Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>	2017-05-18 13:28:10 +03:00
Avi Kivity	ba31619594	tests: fix partitioner_test for g++ 5 It can't make the leap from dht::ring_position to stdx::optional<range_bound<dht::ring_position>> for some reason.	2017-05-18 13:09:41 +03:00
Pekka Enberg	30b5933db2	Merge "Add Gentoo Linux support to utility and setup scripts" from Alexys "These patches add support to setup and operate ScyllaDB on Gentoo Linux. * scylla_setup and related scripts * node_health_check I have kept them as simple as possible and tested them by successfully setting up and operating a three-node cluster running on Gentoo Linux." * 'gentoo_linux_support' of github.com:ultrabug/scylla: scylla_setup: add gentoo linux installation detection prometheus node_exporter install: add support for gentoo linux raid setup: add support for gentoo linux ntp setup: add support for gentoo linux kernel check: add support for gentoo linux cpuscaling setup: add support for gentoo linux coredump setup: add support for gentoo linux detect gentoo linux on selinux setup add gentoo_variant detection and SYSCONFIG setup	2017-05-18 09:41:13 +03:00
Vlad Zolotarov	1ef22f84c1	utils::loading_cache: cleanup - Fix a callback signature: receive a const ref. - White spaces. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 15:03:14 -04:00
Vlad Zolotarov	87ce0b2d47	utils::loading_cache: align the constraints in the constructor with the parameters description According to the description of permissions_validity_in_ms, the permissions_cache is enabled if this value is set to a non-zero value. Otherwise the permissions_cache is disabled. According to the permissions_update_interval_in_ms description, it must have a non-zero value if permissions_cache is enabled. The permissions_cache_max_entries description doesn't explicitly state it, but it makes no sense to allow it to be zero if permissions_cache is enabled. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 15:03:14 -04:00
Vlad Zolotarov	e286828472	utils::loading_cache: refresh in the background This patch changes the way a loading_cache works. Before this patch: 1) If a permissions key is not in the cache it's loaded in the foreground and the original query is blocked till the permissions are loaded. 2) Every _period the timer does the following: 1) If a value was loaded more than _expiry time ago it is removed from the cache. 2) If the cache is too big - the less recently loaded values are removed till the cache fits the requested size. After this patch: 1) If a permissions key is not in the cache it's loaded in the foreground and the original query is blocked till the permissions are loaded. 2) Every _period the timer does the following: 1) If a value in the cache was loaded or read for the last time more than _expiry time ago - it's removed from the cache. 2) If the cache is too big - the less recently read values are removed till the cache fits the requested size. 3) The values that were loaded more than _refresh time ago are re-read in the background. The new implementation allows to minimize the amount of the foreground reads for a frequently used value to a single event (when the value is loaded for the first time). It also ensures we do not reload values we no longer need. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 15:03:06 -04:00
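The "after this patch" policy in the commit above can be modelled with a minimal synchronous sketch. Everything here is an assumption made for illustration: the class name tiny_loading_cache, the integer millisecond clock passed in by the caller, and the int values. The real utils::loading_cache is asynchronous and reloads in the background; this model reloads inline and treats "loaded or read for the last time" as a single last_read_at timestamp that the initial load also sets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified model of the policy (not the real class).
class tiny_loading_cache {
    struct entry {
        int value;
        long loaded_at;     // time of the last (re)load
        long last_read_at;  // time of the last foreground access
    };
    std::map<std::string, entry> _entries;
    std::function<int(const std::string&)> _load;
    long _expiry;
    long _refresh;
    std::size_t _max_size;
public:
    tiny_loading_cache(std::size_t max_size, long expiry, long refresh,
                       std::function<int(const std::string&)> load)
        : _load(std::move(load)), _expiry(expiry), _refresh(refresh), _max_size(max_size) {
        assert(max_size > 0);
    }

    // Foreground path: load on a miss, then record the read.
    int get(const std::string& key, long now) {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            it = _entries.emplace(key, entry{_load(key), now, now}).first;
        }
        it->second.last_read_at = now;
        return it->second.value;
    }

    // Periodic timer callback (every _period in the commit's wording).
    void on_timer(long now) {
        // 1) Drop values not read for longer than the expiry time.
        for (auto it = _entries.begin(); it != _entries.end();) {
            if (now - it->second.last_read_at > _expiry) {
                it = _entries.erase(it);
            } else {
                ++it;
            }
        }
        // 2) If still too big, evict the least recently read values.
        while (_entries.size() > _max_size) {
            auto victim = std::min_element(_entries.begin(), _entries.end(),
                [](const auto& a, const auto& b) {
                    return a.second.last_read_at < b.second.last_read_at;
                });
            _entries.erase(victim);
        }
        // 3) Re-read values loaded more than the refresh time ago.
        for (auto& kv : _entries) {
            if (now - kv.second.loaded_at > _refresh) {
                kv.second.value = _load(kv.first);
                kv.second.loaded_at = now;
            }
        }
    }

    std::size_t size() const { return _entries.size(); }
};
```

Because a refresh updates only loaded_at, a value that nobody reads still expires, which matches the commit's statement that values no longer needed are not reloaded.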
Alexys Jacob	fa0944ac19	scylla_setup: add gentoo linux installation detection	2017-05-17 18:06:54 +02:00
Alexys Jacob	9bb1bda466	prometheus node_exporter install: add support for gentoo linux	2017-05-17 18:06:34 +02:00
Alexys Jacob	1d235e5012	raid setup: add support for gentoo linux	2017-05-17 18:06:14 +02:00
Alexys Jacob	fdd5944ab2	ntp setup: add support for gentoo linux	2017-05-17 18:05:59 +02:00
Alexys Jacob	412f96a1bf	kernel check: add support for gentoo linux	2017-05-17 18:05:45 +02:00
Alexys Jacob	a198f2b1af	cpuscaling setup: add support for gentoo linux	2017-05-17 18:05:24 +02:00
Alexys Jacob	6a1807a7d8	coredump setup: add support for gentoo linux	2017-05-17 18:05:08 +02:00
Alexys Jacob	bc63e501db	detect gentoo linux on selinux setup	2017-05-17 18:04:20 +02:00
Vlad Zolotarov	4edb336ac5	auth::auth: add operator<<() for a permission_cache key Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 12:03:56 -04:00
Vlad Zolotarov	d780818cac	auth::auth::permissions_cache: use the values from the configuration - don't try to be smart Our configuration already has the default values for permission cache parameters. Therefore if the user decides to give some bad parameters we'd rather fail the load and inform him/her about the bad parameters instead of trying to silently "fix" them. In addition the original code wasn't passing the parameters correctly: it switched the "expiry" and "refresh" parameters in the utils::loading_cache constructor. Add to this that the original code was doing really strange things in the permission_cache::expiry(cfg) method. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 12:03:56 -04:00
Vlad Zolotarov	ea1cfabe28	db::config: define a saner default value for permissions_validity_in_ms It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms. This may cause the values to be invalidated only because some minor delays in the timer scheduling. It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms. This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling. In addition, 2s seems to be a too small value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s. This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data in the foreground. Setting it to 10s would allow a second read attempt before we enforce the foreground read. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-17 12:03:56 -04:00
Alexys Jacob	2ca0380d06	add gentoo_variant detection and SYSCONFIG setup	2017-05-17 18:03:53 +02:00
Avi Kivity	2aa5b3e20c	Merge "Improve perf_fast_forward test" from Tomasz "Notably: - add validation of the results (e.g. fragment count, expectations about disk activity) - add cache-specific tests" * 'tgrabiec/add-cache-tests-to-perf-fast-forward' of github.com:cloudius-systems/seastar-dev: tests: perf_fast_forward: Report cache stats row_cache: Keep counters in a struct tests: perf_fast_forward: Add cache-specific tests tests: perf_fast_forward: Extract test_reading_all() tests: perf_fast_forward: Add validation of the results tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments tests: perf_fast_forward: Allow the test to be interrupted tests: perf_fast_forward: Allow testing with cache enabled row_cache: Implement mutation_reader::fast_forward_to() for cache scanner	2017-05-17 18:06:02 +03:00
Calle Wilund	29b20d410a	schema_tables: Remove "class" attribute from strategy options Not 100% proper, but in line with how we still store the info. Ensures (helps at least) to keep schema loaded from tables and schema from builder comparable. Fixes schema_changes_test error. Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>	2017-05-17 17:56:11 +03:00
Calle Wilund	6ca07f16c1	scylla: fix compilation errors on gcc 5 Message-Id: <1495030581-2138-1-git-send-email-calle@scylladb.com>	2017-05-17 17:56:06 +03:00
Paweł Dziepak	3ecceaee48	Merge "Fix fast_forward_to() on sstable reader being ignored in some cases" from Tomasz "When mutation reader enters the partition using index, streamed_mutation object is returned to the user before the row start fragment is processed. In that case, when we process the row start, we should ignore it and not call setup_for_partition() again. That may override user's fast_forward_to() request." * 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev: tests: mutation_source_test: Test forwarding in single-key readers sstables: Remove unused code sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases sstables: Fix verify_end_state() to tolerate ATOM_START_2 state	2017-05-17 15:35:30 +01:00
Avi Kivity	eb69fe78a4	Merge "Adding private repository to housekeeping" from Amnon "This series adds private repository support to scylla-housekeeping" * 'amnon/housekeeping_private_repo_v3' of github.com:cloudius-systems/seastar-dev: scylla-housekeeping service: Support private repositories scylla-housekeeping-upstart: Use repository id, when checking for version scylla-housekeeping: support private repositories	2017-05-17 15:56:46 +03:00
Tomasz Grabiec	777ffa3a27	tests: perf_fast_forward: Report cache stats	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	d1bde3036e	row_cache: Keep counters in a struct So that taking a snapshot of all stats is easy.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	7a81f5e980	tests: perf_fast_forward: Add cache-specific tests	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	1a7b03004a	tests: perf_fast_forward: Extract test_reading_all()	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	a38fd16f89	tests: perf_fast_forward: Add validation of the results	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	3c3ea51657	tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments make_pkeys() needs to be invoked with n equal to the number of keys which the table was populated with. Otherwise the extra keys, which are missing in the table, may be placed anywhere in the vector due to ring order sorting, and break the assumption that the table contains all keys from the array up to index n. This resulted in the test reading slightly fewer fragments than would follow from the desired count. Another problem is that we should not skip the fast_forward_to() call for the initial range (workaround for a bug in sstable mutation reader), otherwise we will read slightly less than expected as well.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	49a0bc3847	tests: perf_fast_forward: Allow the test to be interrupted	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	5c7f5643a6	tests: perf_fast_forward: Allow testing with cache enabled	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	35c9dfecc2	row_cache: Implement mutation_reader::fast_forward_to() for cache scanner Needed to make perf_fast_forward work with cache enabled.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	84648f73ef	Merge "Fix performance problems with high shard counts tag" from Avi From http://github.com/avikivity/scylla exponential-sharder/v3. The sharder, which takes a range of tokens and splits it among shards, is slow with large shard count and the default murmur3_partitioner_ignore_msb_bits. This patchset fixes excessive iteration in sstable sharding metadata writer and nonsingular range scans. Without this patchset, sealing a memtable takes > 60 ms on a 48-shard system. With the patchset, it drops below the latency tracker threshold I used (5 ms).	2017-05-17 14:03:33 +02:00
Avi Kivity	68034604e1	dht: murmur3_partitioner: simplify moving to and from the zero-based token range	2017-05-17 13:50:30 +03:00
Avi Kivity	1a99ebaa65	storage_proxy: switch to the exponential sharder for nonsingular queries Nonsingular queries used exponential expansion of the token space to avoid spending too much cpu time on near-empty tables, but the generation of the search space was itself exponential. Switch to the exponential sharder which has linear cost.	2017-05-17 13:50:30 +03:00
Avi Kivity	00f48f96cb	sstables: select just the shard we want when writing sharding metadata On a system with many shards, this saves many useless iterations where we just skip the unwanted shard.	2017-05-17 13:50:30 +03:00
Avi Kivity	44a1a51987	tests: add tests for dht::split_range_to_single_shard()	2017-05-17 13:50:30 +03:00
Avi Kivity	76f12a8842	dht: add split_range_to_single_shard() Intersects a shard's owning range with a ring position range, and return the sorted result.	2017-05-17 13:50:27 +03:00
Tomasz Grabiec	1da3daa4f4	range: Use more standard notation for singular range Reuse notation for a single-element set. Message-Id: <1494923827-10097-1-git-send-email-tgrabiec@scylladb.com>	2017-05-17 13:28:42 +03:00
Avi Kivity	a65e8bd215	dht: add a ring-position-range-vector variant of the exponential sharder The "exponentiality" is not carried over from one range to another, because we expect one or two ranges (two ranges result from a wrapped around thrift token range).	2017-05-17 13:18:52 +03:00
Avi Kivity	6eb6f12909	tests: add test for ring_position_exponential_sharder	2017-05-17 13:18:52 +03:00
Avi Kivity	f671ac13b4	dht: add an exponential ring_position range sharder Like the regular sharder, the exponential sharder divides a range into subranges owned by individual shards. Unlike the regular sharder, it generates ever-increasing subranges, spanning more and more shards, and eventually returns several subranges per shard. To avoid using exponential cpu and memory, subranges belonging to a single shard are merged, and a flag is set to indicate the subranges are not ordered wrt. each other.	2017-05-17 13:18:49 +03:00
Avi Kivity	025c6b45b2	dht: extend i_partitioner::next_token_for_shard() Right now, next_token_for_shard() only allows iterating linearly in shard order. Add the ability to select a specific shard to skip to (in case we're only interested in a single shard), and to select larger ranges (so that exponential increases are not implemented by iteration).	2017-05-17 12:30:03 +03:00
Avi Kivity	7156ea8804	dht: make ring_position_range_sharder more independent of global_partitioner Useful for testing.	2017-05-17 12:30:03 +03:00
Avi Kivity	302fec8293	dht: make i_partitioner::name() const	2017-05-17 12:30:03 +03:00
Avi Kivity	f462c4327e	dht: make i_partitioner keep track of the number of shards it was configured with Useful for testing classes layered on top of the partitioner (the sharders).	2017-05-17 12:30:03 +03:00
Avi Kivity	04b16ae8ec	dht: fix partitioner initialization for tests The partitioners now depend on smp::count to be initialized correctly, but smp::count isn't available at static initialization time. The scylla executable isn't affected because it calls set_global_partitioner() after smp::count has been initialized. Fix by deferring initialization to the first global_partitioner() call.	2017-05-17 12:30:03 +03:00
Avi Kivity	1c6cecd9d0	utils: introduce div_ceil() Divides integrals but rounds up rather than down.	2017-05-17 12:30:03 +03:00
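A rounding-up integer division like the one named in the commit above can be sketched as follows. This is an illustrative version, not the body of the real utils div_ceil(); it assumes a non-negative dividend and a positive divisor, and it uses the remainder instead of the common (a + b - 1) / b form so that the addition cannot overflow.

```cpp
#include <cassert>
#include <type_traits>

// Illustrative sketch: integer division that rounds up instead of down.
// Assumes dividend >= 0 and divisor > 0.
template <typename T>
T div_ceil(T dividend, T divisor) {
    static_assert(std::is_integral<T>::value, "div_ceil() is for integral types");
    assert(divisor > 0);
    return dividend / divisor + (dividend % divisor != 0 ? 1 : 0);
}
```

For example, splitting 7 items into groups of at most 2 needs div_ceil(7, 2) groups, one more than plain 7 / 2.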
Avi Kivity	f1dbb951da	Merge "Materialized views: implement read before write" from Duarte "This patch ensures we read the base table rows that an update is modifying, in order to correctly calculate the set of materialized view updates. The read-before-write is performed on the shard applying the update and attempts to do a precise read of the rows being modified, which can be more than one in case of ranged deletions or a batch update." * 'materialized-views/read-existing/v2' of https://github.com/duarten/scylla: database: Read existing base mutations db/view: Calculate clustering ranges for MV read-before-write query db/view: Replace entry if cells don't match view_info: Store base regular col in the view's PK as column_id compound_view_wrapper: Add tri_compare bound_view: Build range bound from bound_view clustering_bounds_comparator: Enable Range concept range: Add lvalue version of transform() tests: Add test case for nonwrapping_range::intersection() nonwrapping_range: Add intersection() function	2017-05-17 12:26:26 +03:00
Duarte Nunes	ef252036ba	tests/view_schema_test: Add more test cases Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 11:21:58 +02:00
Duarte Nunes	983af595e9	database: Read existing base mutations When generating updates for a materialized view we need to read the existing base row, to be able to determine the primary key of the view row the new base update will supplant, in case the view includes a base non-primary key column in its own primary key. That old view row will be tombstoned or updated, if it exists, depending on the difference between the new base row and the existing one, if any. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	0861a66853	tests/cql_assertions: Add assertion for row set equality For row set equality, the order of the actual rows doesn't need to match the order of the expected rows. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	8a77bfe35b	db/view: Calculate clustering ranges for MV read-before-write query Introduce the calculate_affected_clustering_ranges() function to calculate the smallest subset of affected clustering ranges that we need to query for. The update_requires_read_before_write() function checks whether a view is potentially affected by the base update. The patch also cleans up the may_be_affected_by() function. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	9115862419	single_column_relation: Correctly print IN relation So that the output of a set of relations can be fed back into the CQL parser; useful for materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	ec681060a8	db/view: Replace entry if cells don't match If a base table regular column is part of the view's pk, and if that column changes, we should replace the entry, by deleting the row(s) with the old value and inserting a new one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	3ef1a825c9	statement_restrictions: Allow filtering regular columns for views Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	0170c743d3	statement_restrictions: Relax clustering restrictions for views In process_clustering_columns_restrictions(), don't require all clustering columns to be restricted if we're dealing with a materialized view's where clause. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	4f90b19cc2	statement_restrictions: Relax partition restrictions for views In process_partition_key_restrictions(), don't require all partition key columns to be restricted if we're dealing with a materialized view's where clause. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	99b234d717	cql3/statements: Prevent setting default ttl on view Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	d480daffca	cql3/restrictions: Complete implementation of is_satisfied_by() This patch implements the is_satisfied_by() function for the remaining types of restrictions, lifting the function declaration to abstract_restrictions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	bad0edb23b	db/view: Re-implement clustering_prefix_matches() This patch implements clustering_prefix_matches() in terms of abstract_restriction::is_satisfied_by() instead of ranges, which supports filtering just a subset of the clustering columns. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	b0d1ea76a2	db/view: Re-implement partition_key_matches() This patch implements partition_key_matches() in terms of abstract_restriction::is_satisfied_by() instead of ranges, which supports filtering just a component of a compound partition key. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	38be85a21d	db/view: Generate regular tombstone for base deletions Instead of shadowable tombstones, which only apply to updates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	1fd8b8e723	db/view: Consider cell liveness when generating updates This patch ensures we take into account the liveness of the base's regular column in the view's pk when generating view updates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	c421da6825	db/view: Don't generate view updates for static rows Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:19 +02:00
Duarte Nunes	f41a5e554d	view_info: Store base regular col in the view's PK as column_id This patch stores the base_non_pk_column_in_view column as column_id, which is more convenient, and it also stores a two-level optional to encode both lazy initialization and the absence of such a column. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Duarte Nunes	257eaa0d05	compound_view_wrapper: Add tri_compare Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Duarte Nunes	06a6679826	bound_view: Build range bound from bound_view We introduce the bound_view::to_range_bound() function, which builds a wrapping_range or nonwrapping_range bound from a bound_view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Duarte Nunes	8288e504fb	clustering_bounds_comparator: Enable Range concept Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Duarte Nunes	fb1e966137	range: Add lvalue version of transform() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Duarte Nunes	f365b7f1f7	tests: Add test case for nonwrapping_range::intersection() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
Duarte Nunes	1f9359efba	nonwrapping_range: Add intersection() function intersection() returns an optional range with the intersection of the this range and the other, specified range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-17 10:33:18 +02:00
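The idea of an intersection() that returns an optional range, as in the commit above, can be sketched on a deliberately simplified type. The interval struct below is hypothetical: a half-open [start, end) range of ints, whereas the real nonwrapping_range is a template with optional, inclusive or exclusive bounds and a comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical stand-in for a non-wrapping range: half-open [start, end).
struct interval {
    int start;
    int end;
};

// Returns the overlap of the two intervals, or std::nullopt when they are disjoint.
std::optional<interval> intersection(const interval& a, const interval& b) {
    assert(a.start <= a.end && b.start <= b.end);
    const int lo = std::max(a.start, b.start);
    const int hi = std::min(a.end, b.end);
    if (lo >= hi) {
        return std::nullopt;  // empty overlap, including ranges that only touch
    }
    return interval{lo, hi};
}
```

Returning an optional lets the caller distinguish "no overlap" from any valid range without reserving a sentinel value.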
Avi Kivity	f5dae826ce	Merge "Migrate schema tables to v3 format" from Calle "Defines origin v3-format for system/schema tables, and use them for schema storage/retrieval. Includes a legacy_schema_migrator implementation/port from origin. Note that since we don't support features like triggers, functions and aggregates, it will bail if encountering such a feature used. Note also that this patch set does not convert the "hints" and "backlog" tables, even though these have changed in v3 as well. That will be a separate patch set. Tested against dtests. Note that patches for dtest + ccm will follow." * 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits) legacy_schema_migrator: Actually truncate legacy schema tables on finish database: Extract "remove" from "drop_columnfamily" v3 schema test fixes thrift: Update CQL mapping of static CFs schema_tables: Use v3 schema tables and formats type_parser: Origin expects empty string -> bytes_type cf_prop_defs: Add crc_check_chance as recognized (even if we don't use) types_test: v3 style schemas enforce explicit "frozen" in tuples/ut:s cql3_type: v3 to_string cql_types: Introduce cql3_type::empty and associate with empty data_type schema: rename column accessors to be in line with origin schema: Add "is_static_compact_table" schema_builder: Add helper to generate unique column names akin origin schema: Add utility functions for static columns schema: Use heterogeneous comparator for columns bounds cql3_type_parser: Resolve from cql3 names/expressions cql3_type: Add "prepare_interal" and "references_user_type" cql3::cql3_type: Add prepare_internal path using only "local" holders cql3_type: Add virtual destructor. database/main: encapsulate system CF dir touching ...	2017-05-17 11:25:52 +03:00
Asias He	0abfe39d8f	database: Log compaction strategy setting on shard 0 only The compaction strategy is per node not per shard. Do not duplicate the same log on all shards. Message-Id: <1494835519.git.asias@scylladb.com>	2017-05-17 11:17:41 +03:00
Avi Kivity	f09f056515	Merge seastar upstream * seastar 4a3118c...45b718b (7): > tests: make connect_test use a random port > log: Introduce log.info0 > configure.py: link to DPDK PMD drivers which are already built on build/dpdk and enabled by default on DPDK config > Update fmt submodule > perftune: fix perftune.py IndexError when NIC uses less IRQs than requested. > build: Add more required build dependencies to the Dockerfile > Prometheus: Reserve in protobuf object before iterating	2017-05-17 11:16:58 +03:00
Raphael S. Carvalho	a58699cc92	sstables: kill sstable::mark_for_deletion_on_disk Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170515233021.21223-1-raphaelsc@scylladb.com>	2017-05-17 11:15:59 +03:00
Raphael S. Carvalho	deabf06d49	lcs: log invariant restoration It will be useful for understanding the strategy behavior after invariant is possibly broken by resharding. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170515234925.22793-1-raphaelsc@scylladb.com>	2017-05-17 11:15:41 +03:00
Avi Kivity	2eef7cd395	Merge "compress the tracing session ID when compression is requested" from Vlad "Tested with: - test.py --mode release - debug/test-serialization - c-s with both debug and release compiled scylla with authentication enabled: cassandra-stress write n=10000 no-warmup -rate threads=10 -mode native unprepared cql3 user='cassandra' password='cassandra'" * 'compress_tracing_session_id-v6' of github.com:cloudius-systems/seastar-dev: cql_server::response: rework the tracing session ID insertion utils::UUID: align the UUID serialization API with the similar API of other classes in the project utils: serialization: unify the variety of serialize_XXX(...) cql_server::response: rework the compress(...) method cql_server::response: store the frame flags inside the class	2017-05-17 09:48:49 +03:00
Pekka Enberg	374c3d66ab	Merge "Fixes for CQL regressions" from Duarte "This series fixes a set of regressions introduced by `f7bc88734a`, resulting in two failed tests: testDenseNonCompositeTable(org.apache.cassandra.cql3.validation.operations.CreateTest) and testStaticColumnsWithDistinct(org.apache.cassandra.cql3.validation.entities.StaticColumnsTest)" * 'cql-fixes/v1' of github.com:duarten/scylla: update_statement: Reject empty values for dense clustering key modification_statement: Fix detection of clustering keys cql3/restrictions/statement_restrictions: Consider statement type cql3/statements/modification_statement: Extract statement_type	2017-05-17 09:29:24 +03:00
Vlad Zolotarov	a0737abdc5	cql_server::response: rework the tracing session ID insertion Insert the tracing session ID into the response body in the cql_server::response constructor. Fixes #2356 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-16 15:57:28 -04:00
Vlad Zolotarov	494ea82a88	utils::UUID: align the UUID serialization API with the similar API of other classes in the project The standard serialization API (e.g. in data_value) includes the following methods: size_t serialized_size() const; void serialize(bytes::iterator& it) const; bytes serialize() const; Align the utils::UUID API with the pattern above. The only addition is that we are going to make an output iterator parameter of a second method above a template so that we may serialize into different output sources. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-16 15:56:03 -04:00
Vlad Zolotarov	7706775a63	utils: serialization: unify the variety of serialize_XXX(...) Use the same templated implementation for all different serialize_XXX(...). The chosen implementation is based on the std::copy_n(char*, size, OutputIterator), which is heavily optimized and will be using memcpy/memmove where possible. This patch also removes the not needed specializations that accept signed integer values since we were casting them to unsigned value anyway. The std::ostream based specifications are also removed since they are not used anywhere except for a test-serialization.cc and adjusting the ostream to the iterator is a single-liner. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-16 15:56:03 -04:00
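The commit above describes one templated implementation built on std::copy_n that can write to any output iterator. A minimal sketch of that shape follows. The function name serialize_int32 and the big-endian byte order are assumptions made for the example, not details confirmed by the log.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: writes `value` in big-endian order (an assumption)
// to any output iterator and advances the iterator past the written bytes.
template <typename OutputIterator>
void serialize_int32(OutputIterator& out, std::uint32_t value) {
    const char bytes[4] = {
        static_cast<char>(value >> 24),
        static_cast<char>(value >> 16),
        static_cast<char>(value >> 8),
        static_cast<char>(value),
    };
    // std::copy_n returns the advanced output iterator.
    out = std::copy_n(bytes, sizeof(bytes), out);
}

// Convenience overload returning a fresh buffer, reusing the iterator version.
std::vector<char> serialize_int32(std::uint32_t value) {
    std::vector<char> buf;
    buf.reserve(sizeof(value));
    auto it = std::back_inserter(buf);
    serialize_int32(it, value);
    assert(buf.size() == sizeof(value));
    return buf;
}
```

Adapting a stream to this interface is a one-liner with std::ostream_iterator<char>, which is in line with the commit's remark that the ostream-based variants were no longer needed.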
Vlad Zolotarov	a33fe5b775	cql_server::response: rework the compress(...) method Cleanup the compress(...) method interface: - Encapsulate the technical details inside the method: - Re-write the _body inside the method instead of returning it. - Set the response::_flags inside the method. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-16 15:53:35 -04:00
Vlad Zolotarov	c00814383d	cql_server::response: store the frame flags inside the class It makes a lot more sense to keep the flags mask inside the response and update it each time the corresponding feature is set instead of holding the separate components like tracing state pointer. This patch adds this ability to set the flags. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-05-16 14:31:54 -04:00
Takuya ASADA	da55aecca3	dist: add conflict with Cassandra Cassandra and Scylla cannot be installed on the same instance, so add cassandra to 'Conflicts'. Fixes #2157 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1494856314-9322-1-git-send-email-syuu@scylladb.com>	2017-05-16 19:18:27 +03:00
Gleb Natapov	c7ad3b9959	database: remove temporary sstables sequentially The code that removes each sstable runs in a thread. Removing a lot of sstables in parallel may start a lot of threads, each of which takes 128k for its stack. There is not much benefit in running deletion in parallel anyway, so fix it by deleting sstables sequentially. Fixes #2384 Message-Id: <20170516103018.GQ3874@scylladb.com>	2017-05-16 15:06:10 +03:00
Tomasz Grabiec	bdf3c536aa	tests: mutation_source_test: Test forwarding in single-key readers	2017-05-16 13:36:10 +02:00
Tomasz Grabiec	e07cc44af2	sstables: Remove unused code	2017-05-16 13:31:01 +02:00
Tomasz Grabiec	0e23f8aa9b	sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases When the mutation reader enters the partition using the index, the streamed_mutation object is returned to the user before the row start fragment is processed. In that case, when we process the row start, we should ignore it and not call setup_for_partition() again. That may override the user's fast_forward_to() request.	2017-05-16 13:31:01 +02:00
Tomasz Grabiec	a1dea3c4fc	sstables: Fix verify_end_state() to tolerate ATOM_START_2 state We would be in that state if consume_row_start() returns proceed::yes and the stream ends after that. This can happen if slicing using the promoted index determined that there are no cells in the partition in the range.	2017-05-16 13:31:01 +02:00
Raphael S. Carvalho	706ce5a27b	sstables: do not swallow system error exception in read_simple If error code is different than ENOENT, exception is swallowed. That can lead to a variety of problems down the road. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170515225309.19185-1-raphaelsc@scylladb.com>	2017-05-16 08:47:34 +02:00
Alexys Jacob	9ddc05899d	Fix scylla-housekeeping version detection to work with newer setuptools Newer setuptools parse_version() don't like dashed version strings, so we should trim it to avoid false negative version_compare() checks. Signed-off-by: Alexys Jacob <ultrabug@gentoo.org> Message-Id: <20170511162646.22129-1-ultrabug@gentoo.org>	2017-05-15 12:41:49 +03:00
Gleb Natapov	385645e8df	storage_proxy: Fix mutation logging Log mutation type only if mutation set is not empty. Message-Id: <20170510142406.GA30426@scylladb.com>	2017-05-11 15:49:52 +01:00
Tomasz Grabiec	7b6be7e188	row_cache: Add missing propagation of the forwarding flag in handle_large_partition() Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>	2017-05-11 15:47:19 +01:00
Vlad Zolotarov	a855e82eff	service::client_state: don't allow dropping the system_auth and system_traces objects Prevent the accidental dropping of system_auth and system_traces objects (keyspaces and tables) but allow their modification (including tables). We need to be able to modify keyspaces in order to set/modify the replication strategy and its parameters. We need to be able to ALTER the tables in order to allow rolling upgrades when some of the tables have changed. Fixes #2346 Fixes #2338 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1494363335-20424-1-git-send-email-vladz@scylladb.com>	2017-05-11 13:03:30 +01:00
Tomasz Grabiec	0351ab8bc6	row_cache: Fix undefined behavior in read_wide() _underlying is created with _range, which is captured by reference. But range_and_underlyig_reader is moved after being constructed by do_with(), so _range reference is invalidated. Fixes #2377. Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>	2017-05-11 09:43:43 +01:00
Duarte Nunes	a69039df03	tests/batchlog_manager_test: Fix failure Since `a9f6e5f8da`, metrics can't be duplicated. This patch works around that by avoiding to create a new batchlog_manager (one is already created by the cql_test_env). Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170510191047.6154-1-duarte@scylladb.com>	2017-05-11 08:28:08 +02:00
Duarte Nunes	ec35cc33f1	update_statement: Reject empty values for dense clustering key Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 19:54:42 +02:00
Duarte Nunes	03f765c468	modification_statement: Fix detection of clustering keys Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 19:54:42 +02:00
Duarte Nunes	d7701087af	cql3/restrictions/statement_restrictions: Consider statement type Now that update_statement uses statement_restrictions, we need our validation logic to take the statement type into account, in particular to deal with insertion statements which only set static columns but specify clustering values. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 19:54:42 +02:00
Duarte Nunes	c2041753c9	cql3/statements/modification_statement: Extract statement_type This patch extracts the statement_type into its own file. The type will be later passed to statement_restrictions for validation purposes. Further along, we could add methods to it that currently live in other statements so we can move more validation into statement_restrictions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 19:54:42 +02:00
Calle Wilund	c8f92536c1	legacy_schema_migrator: Actually truncate legacy schema tables on finish	2017-05-10 16:44:48 +00:00
Calle Wilund	3514123677	database: Extract "remove" from "drop_columnfamily"	2017-05-10 16:44:48 +00:00
Calle Wilund	66991a7ccb	v3 schema test fixes	2017-05-10 16:44:48 +00:00
Duarte Nunes	6260f31e08	thrift: Update CQL mapping of static CFs This patch updates the mapping of static CFs so that their CQL representation is a non-compound, non-dense schema with static columns, instead of regular ones. This matches the representation of static CFs in Cassandra 3.x. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 16:44:48 +00:00
Calle Wilund	6c8b5fc09d	schema_tables: Use v3 schema tables and formats Switches system/schema_* for system_schema/*, updates schema/schema builder and uses to hold/expect v3 style info (i.e. types & dropped).	2017-05-10 16:44:48 +00:00
Calle Wilund	f9b83e299e	type_parser: Origin expects empty string -> bytes_type	2017-05-10 16:44:48 +00:00
Calle Wilund	97c54d254b	cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)	2017-05-10 16:44:48 +00:00
Calle Wilund	3d90152dc5	types_test: v3 style schemas enforce explicit "frozen" in tuples/ut:s	2017-05-10 16:44:48 +00:00
Calle Wilund	7969a156d5	cql3_type: v3 to_string	2017-05-10 16:44:48 +00:00
Calle Wilund	c572a8c83c	cql_types: Introduce cql3_type::empty and associate with empty data_type	2017-05-10 16:44:48 +00:00
Calle Wilund	0e6ae8dec2	schema: rename column accessors to be in line with origin More pointedly: Expose columns as is (currently all_columns_in_select_order), expose name->column mapping more appropriately named. Renaming like this is not strictly necessary, but there is a point to trying to keep nomenclature similar-ish with origin, esp. when select order column need to become filtered (spoiler alert).	2017-05-10 16:44:48 +00:00
Calle Wilund	d2dc7898aa	schema: Add "is_static_compact_table"	2017-05-10 16:44:48 +00:00
Calle Wilund	1c328a4166	schema_builder: Add helper to generate unique column names akin origin	2017-05-10 16:44:48 +00:00
Duarte Nunes	5387ac98f8	schema: Add utility functions for static columns Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 16:44:48 +00:00
Duarte Nunes	0439d83d1e	schema: Use heterogeneous comparator for columns bounds Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-10 16:44:47 +00:00
Calle Wilund	b1c5447ab5	cql3_type_parser: Resolve from cql3 names/expressions Cassandra 3 uses cql names for column/field types, thus we need to parse these out-of-line, and resolve more akin to the cql parser. Also wrap building user types similarly to origin, using a "builder" wrapper, and usage graph resolving.	2017-05-10 16:44:47 +00:00
Calle Wilund	fcfea4c121	cql3_type: Add "prepare_interal" and "references_user_type" Allows localized use of cql type + parsing + resolving	2017-05-10 16:44:47 +00:00
Calle Wilund	8b3e7bbe05	cql3::cql3_type: Add prepare_internal path using only "local" holders	2017-05-10 16:44:47 +00:00
Calle Wilund	2f791f5c3d	cql3_type: Add virtual destructor. It should be there.	2017-05-10 16:44:47 +00:00
Calle Wilund	48ddcbb77b	database/main: encapsulate system CF dir touching	2017-05-10 16:44:47 +00:00
Calle Wilund	9eb91bc30b	main: Add legacy schema migration to startup	2017-05-10 16:44:47 +00:00
Calle Wilund	3964055d98	legacy_schema_migrator: Add schema table converter Initial. Does not actually write anything.	2017-05-10 16:44:47 +00:00
Paweł Dziepak	ba6b74e305	storage_service: counters are no longer experimental Message-Id: <20170510124552.23558-1-pdziepak@scylladb.com>	2017-05-10 17:18:32 +03:00
Gleb Natapov	ab92406585	storage_proxy: optimize reconcile logic for CL=ONE Regular single key query will never reconcile with CL=ONE since there will be no digest mismatch, but range queries do not have a digest stage, so they always go through the reconcile code. For CL=ONE there will be only one result though, so no need to run complicated reconciliation logic and the only result can be returned directly. Message-Id: <20170509100334.GQ28272@scylladb.com>	2017-05-10 17:09:34 +03:00
Pekka Enberg	b63d33526d	cql3: Fix variable_specifications class get_partition_key_bind_indexes() The "_specs" array contains column specifications that have the bind marker name if there is one. That results in get_partition_key_bind_indices() not being able to look up a column definition for such columns. Fix the issue by keeping track of the actual column specifications passed to add() like Cassandra does. Fixes #2369 Message-Id: <1494397358-24795-1-git-send-email-penberg@scylladb.com>	2017-05-10 12:38:18 +03:00
Calle Wilund	8066efb710	system_keyspace: Add getter/setter for built index status Even though we have none.	2017-05-09 13:48:55 +00:00
Calle Wilund	061ef16562	system_tables/schema_tables: Remove special format case of "execute_cql" Having a variadic parameter being used in implicit sprint is not very readable + makes it less intuitive when suddenly system keyspace becomes more than one -> multiple sprints in the chain -> more confusion or more execution paths. It's not that horrible with some spread out sprint:s	2017-05-09 13:48:55 +00:00
Calle Wilund	f5fcadf0b1	schema: Add "as_cql_string" for column_def + quote-wrapper	2017-05-09 13:48:55 +00:00
Calle Wilund	cb7ee98217	json: Add convenience ability to generate unordered_maps	2017-05-09 13:48:55 +00:00
Calle Wilund	e960724724	caching_options: Add from/to map methods	2017-05-09 13:48:55 +00:00
Calle Wilund	2e1c23f2f2	database: Relax rp ordering check to allow non-commitlog mutations Allow replay to come post certain operations. Such as schema migration	2017-05-09 13:48:55 +00:00
Calle Wilund	27fdc5cfef	schema_tables/system_tables: Add v3 tables to "ALL" and handle in init I.e. deal with more than one keyspace in system_keyspace::make	2017-05-09 13:48:55 +00:00
Calle Wilund	afcf0372df	cql3::untyped_result_set: Add more getter methods	2017-05-09 13:48:55 +00:00
Calle Wilund	539b65fc90	client_state: Make "has_access" auth check schema ks name independent	2017-05-09 13:48:55 +00:00
Calle Wilund	815aa8ba9f	schema_tables: Add schema definitions for v3 tables	2017-05-09 13:48:55 +00:00
Calle Wilund	4378dca6e1	schema_tables: Hide/abstract schema keyspace name	2017-05-09 13:48:55 +00:00
Calle Wilund	2fb36e3bf8	system_keyspace: Add query overloads with named keyspace	2017-05-09 13:48:55 +00:00
Calle Wilund	32909d4c84	system_keyspace: Add v3+legacy schema definitions	2017-05-09 13:48:55 +00:00
Calle Wilund	b522b2bf22	Merge branch 'master' of https://github.com/scylladb/scylla	2017-05-09 13:48:47 +00:00
Avi Kivity	8af2b7c418	transport: honor the skip_metadata flag Reduces processing overhead and network traffic. We can't use the NO_METADATA flag in the metadata object, because this is a request attribute; different executions of the same prepared statement can have different settings for skip_metadata. Message-Id: <20170419175145.19766-1-avi@scylladb.com>	2017-05-09 14:52:03 +03:00
Tomasz Grabiec	e56711a54d	sstables: mutation_reader: Avoid reading index when restrictions cover whole partition The check for is_static_row() used to be enough, but it no longer is after optimization made in commit `3e06065`, which avoids reading the static row. Message-Id: <1494241164-25810-1-git-send-email-tgrabiec@scylladb.com>	2017-05-09 11:03:18 +01:00
Pekka Enberg	5b931268d4	cql3: Move variable_specifications implementation to source file Move the class implementation to source file to reduce the need to recompile everything when the implementation changes... Message-Id: <1494312003-8428-1-git-send-email-penberg@scylladb.com>	2017-05-09 12:44:18 +03:00
Calle Wilund	c30e515c70	Merge branch 'master' of https://github.com/scylladb/scylla	2017-05-09 09:29:27 +00:00
Duarte Nunes	65d96421da	tests/sstable_datafile_test: Fix regression This patch fixes a regression introduced in `9e88b60`, where the wrong clustering key was being specified. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170509091621.2682-1-duarte@scylladb.com>	2017-05-09 12:18:47 +03:00
Gleb Natapov	2d5a7c8058	storage_proxy: make read repair stats accessible through Prometheus Currently they can be read only through JMX. Message-Id: <20170509075546.GN28272@scylladb.com>	2017-05-09 11:23:38 +03:00
Calle Wilund	780a7c8641	Merge branch 'master' of https://github.com/scylladb/scylla	2017-05-08 15:39:46 +00:00
Avi Kivity	8c5c5d3004	Merge "CQL front-end for secondary indices" from Pekka "This patch series adds CQL front-end support for secondary indices. You can now execute CREATE INDEX and DROP INDEX statements, which will update the newly added "Indexes" system table. However, the indexes are not actually backed up by anything nor are they available for CQL queries. The feature is hidden behind a new cluster feature flag and enabled only with the "--experimental" flag." * 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits) schema: Kill index_type enum schema: Kill index_info class cql3/statements/create_index_statement: Use database::existing_index_names() in validation cql3/statements: Use secondary index manager in alter_table_statement class index: Add secondary_index_manager thrift/handler: Use index_metadata db/schema_tables: Index persistence schema: Add all_indices() to schema class schema: Remove add_default_index_names() from schema_builder class db/schema_tables: Add system table for indices cql3/Cgl.g: DROP INDEX cql3/statements: Add drop_index_statement class database: Add find_indexed_table() to database class cql3: Return change event from announce_migration() cql3/statements: Multiple index targets for CREATE INDEX cql3/statements: Use index_metadata in create_index_statement class cql3/statements: Use feature flag in create_index_statement class service/storage_service: Add feature flag for secondary indices database: Add get_available_index_name() to database class schema: Add get_default_index_name() to index_metadata class ...	2017-05-08 17:04:40 +03:00
Calle Wilund	2049303399	query_pagers: bugfix: must count pk only/pk + static rows as 1 Previously only counted clustered/regular Message-Id: <1494249013-4069-1-git-send-email-calle@scylladb.com>	2017-05-08 16:35:27 +03:00
Pekka Enberg	dfee4d2bb0	cql3: Fix partition key bind indices for prepared statements Fix the CQL front-end to populate the partition key bind index array in result message prepared metadata, which is needed for CQL binary protocol v4 to function correctly. Fixes #2355. Message-Id: <1494247871-3148-1-git-send-email-penberg@scylladb.com>	2017-05-08 16:33:17 +03:00
Calle Wilund	a03d54d9f8	Merge branch 'master' of https://github.com/scylladb/scylla	2017-05-08 11:28:26 +00:00
Pekka Enberg	35bb6dedd8	schema: Kill index_type enum	2017-05-08 10:19:34 +03:00
Pekka Enberg	06564afedb	schema: Kill index_info class It's no longer used. Indices are managed by the index_metadata class.	2017-05-08 10:19:34 +03:00
Pekka Enberg	3f27d12e99	cql3/statements/create_index_statement: Use database::existing_index_names() in validation	2017-05-08 10:19:34 +03:00
Pekka Enberg	b87679821c	cql3/statements: Use secondary index manager in alter_table_statement class	2017-05-08 10:03:28 +03:00
Pekka Enberg	4b4e4e6878	index: Add secondary_index_manager	2017-05-08 10:03:28 +03:00
Pekka Enberg	94bc031ca7	thrift/handler: Use index_metadata	2017-05-08 10:03:28 +03:00
Pekka Enberg	11474ed4c6	db/schema_tables: Index persistence	2017-05-08 10:03:28 +03:00
Avi Kivity	9e67bd5aac	Merge " Add partial range deletion support" from Duarte "This series introduces partial support for range deletions. This allows deletion operations such as delete from cf where p=1 and c > 0 and c <= 3. This series only adds support for single-column range restrictions. We enforce that both range bounds be specified, because we can't represent infinite bounds in the current sstable format. Such bounds are represented as a prefix with no components, with the bound_kind informing whether they are a bottom or top bound. We're currently unable to serialize an infinite bound in such a way that it would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a composite with a (<length><value><EOC>)+ format. While we could technically represent the bottom bound, the top bound, if written as a single component with 0 bytes in size and some EOC, would always sort before other values. The same would happen if represented as an empty (no components) composite, because in Cassandra 2.2.x those always have EOC = NONE. This limitation should stay in place until we can properly represent range tombstones in the storage format." * 'range-deletions/v2' of https://github.com/duarten/scylla: mutation: Set cell using clustering_key_prefix mutation_partition: Harmonize apply_delete overloads prefix_compound_view_wrapper: Add is_full and is_empty functions tests/cql_query_test: Add range deletion tests cql3: Partially support ranged deletions single_column_primary_key_restrictions: Implement has_bound() modification_statement: Use statement_restrictions for where clause statement_restrictions: Expose primary key restrictions to_string: Add missing include	2017-05-07 19:27:09 +03:00
Takuya ASADA	b574100075	dist/common/scripts/scylla_selinux_setup: keep symlink on /etc/sysconfig/selinux The current script has a bug that overwrites the symlink at /etc/sysconfig/selinux with a real file, so the script is not able to disable SELinux. Keep the symlink after modifying the file. Fixes #2279 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493663263-10573-1-git-send-email-syuu@scylladb.com>	2017-05-07 17:30:05 +03:00
Tomer Sandler	9a1aa6c1d3	node_health_check: Major rework This is a folded version of the following rework on the node health check script: - Added support for non-default cql + nodetool ports - Script will not exit if either Scylla-server / Scylla-jmx / Both services are not up and running. It will alert the user about it and which output cannot be collected, but continue collecting everything else. - Removed lshw installation and non-needed use in commands - Script supports RHEL/CentOS/Ubuntu14/Ubuntu16/Debian (tested on all beside Debian, should behave the same as Ubuntu14/16) - All Indentation issues fixed -> using only tab (no spaces) consistently. - >> vs. > was fixed as well in the needed places. - Changes the ${VAR_NAME} instances to $VAR_NAME, and kept the {} only where needed. - Check Scylla service as Vlad recommended using 'ps -C' - Fixed the CQL not listening error message. - Added Sanity check if script is attempted to run on non-Fedora and non-Debian OS -> alert the user and exit. - Removed the MANUAL CHECK LIST section (moved to Google Forms) - Added date in head of the report. - Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore). [ penberg: Fold into a single commit and add proper license. ] Acked-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1493900076-29170-1-git-send-email-penberg@scylladb.com>	2017-05-06 08:38:12 +03:00
Paweł Dziepak	798cfbc68f	Merge "Fixes for gcc 7" from Avi "gcc 7 doesn't like some of our code, so adjust to make it happy." * 'gcc7' of http://github.com/avikivity/scylla: Remove exception specifications commitlog: handle noexcept conflict between unlink and function object thrift: change generated code namespace	2017-05-05 15:42:56 +01:00
Avi Kivity	a592573491	Remove exception specifications C++17 removed exception specifications from the language, and gcc 7 warns about them even in C++14 mode. Remove them from the code base.	2017-05-05 17:02:31 +03:00
Avi Kivity	5278e1a14d	commitlog: handle noexcept conflict between unlink and function object ::unlink is declared as noexcept, but the function object it is passed into is not. gcc 7 warns, so wrap ::unlink in a lambda to make it happy.	2017-05-05 17:02:30 +03:00
Avi Kivity	d542cdddf6	thrift: change generated code namespace org::apache::cassandra (the generated namespace name) gets confused with apache::cassandra (the thrift runtime library namespace), either due to changes in gcc 7 or in thrift 0.10. Either way, the problem is fixed by changing the generated namespace to plain cassandra.	2017-05-05 05:26:20 +03:00
Paweł Dziepak	c9470b5c94	Merge "Fix abort in advance_and_check_if_present()" from Tomasz "Fixes abort which happens when making a single-key query for a key which is after all keys present in the sstable." * 'tgrabiec/fix-abort-in-index-reader' of github.com:scylladb/seastar-dev: tests: mutation_source_test: Add test cases for single-key out of range reads sstables: index_reader: Remove redundant function sstables: index_reader: Fix abort in advance_and_check_if_present()	2017-05-04 17:19:47 +01:00
Duarte Nunes	9e88b60ef5	mutation: Set cell using clustering_key_prefix Change the clustering key argument in mutation::set_cell from exploded_clustering_prefix to clustering_key_prefix, which allows for some overall code simplification and fewer copies. This mostly affects the cql3 layer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:50 +02:00
Duarte Nunes	db63ffdbb4	mutation_partition: Harmonize apply_delete overloads This patch ensures the different mutation_partition::apply_delete() overloads behave similarly, so that, for example, an empty clustering key is treated the same way as an empty exploded_clustering_key_prefix. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:50 +02:00
Duarte Nunes	07e648251b	prefix_compound_view_wrapper: Add is_full and is_empty functions Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:50 +02:00
Duarte Nunes	ef138bdd2c	tests/cql_query_test: Add range deletion tests Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:50 +02:00
Duarte Nunes	42873189d4	cql3: Partially support ranged deletions This patch introduces partial support for range deletions. This allows deletion operations such as delete from cf where p=1 and c > 0 and c <= 3. We enforce that both range bounds be specified, because we can't represent infinite bounds in the current sstable format. Such bounds are represented as a prefix with no components, with the bound_kind informing whether they are a bottom or top bound. We're currently unable to serialize an infinite bound in such a way that it would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a composite with a (<length><value><EOC>)+ format. While we could technically represent the bottom bound, the top bound, if written as a single component with 0 bytes in size and some EOC, would always sort before other values. The same would happen if represented as an empty (no components) composite, because in Cassandra 2.2.x those always have EOC = NONE. This limitation should stay in place until we can properly represent range tombstones in the storage format. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:50 +02:00
Duarte Nunes	169cc41251	single_column_primary_key_restrictions: Implement has_bound() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:49 +02:00
Duarte Nunes	f7bc88734a	modification_statement: Use statement_restrictions for where clause This patch replaces the custom where clause processing by adding and using a statement_restrictions field to modification_statement. This improves code reuse and also moves some checks to prepare-time. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:49 +02:00
Duarte Nunes	aff23f93b4	statement_restrictions: Expose primary key restrictions Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:49 +02:00
Duarte Nunes	8b7d7c4e6d	to_string: Add missing include Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-05-04 15:59:49 +02:00
Tomasz Grabiec	e71771d019	tests: mutation_source_test: Add test cases for single-key out of range reads	2017-05-04 14:59:08 +02:00
Tomasz Grabiec	297b4b0cf5	sstables: index_reader: Remove redundant function	2017-05-04 14:59:08 +02:00
Tomasz Grabiec	ec45f1e51d	sstables: index_reader: Fix abort in advance_and_check_if_present() Happens when the key is missing and after all keys in the sstables. Fixes #2345.	2017-05-04 14:59:08 +02:00
Pekka Enberg	25e2777344	schema: Add all_indices() to schema class	2017-05-04 14:59:12 +03:00
Pekka Enberg	830591b092	schema: Remove add_default_index_names() from schema_builder class The add_default_index_names() is part of the old and incomplete secondary index implementation in Scylla. Drop it as it's no longer used.	2017-05-04 14:59:12 +03:00
Pekka Enberg	8b943c0ceb	db/schema_tables: Add system table for indices	2017-05-04 14:59:12 +03:00
Pekka Enberg	af9015d0d0	cql3/Cgl.g: DROP INDEX	2017-05-04 14:59:12 +03:00
Pekka Enberg	a4ee9f9fa1	cql3/statements: Add drop_index_statement class	2017-05-04 14:59:12 +03:00
Pekka Enberg	f26b8d7afb	database: Add find_indexed_table() to database class	2017-05-04 14:59:12 +03:00
Pekka Enberg	14391a8ec8	cql3: Return change event from announce_migration() This changes announce_migration() to return a change event directly in schema_altering_statement base class. It's needed for drop index statement, which does not know the keyspace or column family until it looks them up based on the index. Two stage approach of announcing a migration and then creating the change event won't work because in the latter stage, the lookup will fail. The same change in announce_migration() has been applied to Apache Cassandra.	2017-05-04 14:59:12 +03:00
Pekka Enberg	82394debe6	cql3/statements: Multiple index targets for CREATE INDEX	2017-05-04 14:59:12 +03:00
Pekka Enberg	fe315bd31a	cql3/statements: Use index_metadata in create_index_statement class	2017-05-04 14:59:12 +03:00
Pekka Enberg	651af0f45a	cql3/statements: Use feature flag in create_index_statement class	2017-05-04 14:59:12 +03:00
Pekka Enberg	815c91a1b8	service/storage_service: Add feature flag for secondary indices	2017-05-04 14:59:11 +03:00
Pekka Enberg	930fa79aff	database: Add get_available_index_name() to database class	2017-05-04 14:59:11 +03:00
Pekka Enberg	ef29520c8e	schema: Add get_default_index_name() to index_metadata class	2017-05-04 14:59:11 +03:00
Pekka Enberg	c6e7d4484a	database: Make existing_index_names() per-keyspace operation	2017-05-04 14:59:11 +03:00
Pekka Enberg	8c729f0f5f	database: Rewrite existing_index_names() to use new index metadata	2017-05-04 14:59:11 +03:00
Pekka Enberg	4391faaf45	cql3/statements: Add constants to index_target	2017-05-04 14:59:11 +03:00
Pekka Enberg	546d1e47dd	cql3/statements: Add as_cql_string() to index_target class	2017-05-04 14:59:11 +03:00
Pekka Enberg	56cca3b0d6	cql3: Add to_cql_string() to column_identifier class	2017-05-04 14:59:11 +03:00
Pekka Enberg	58b90655d2	cql3/statements: Add to_string(target_type)	2017-05-04 14:59:11 +03:00
Pekka Enberg	1f5a52d03f	cql3/statements: Use namespaces in index_target.cc file	2017-05-04 14:59:11 +03:00
Pekka Enberg	5e8f2f49c3	schema: Add indices() to schema class	2017-05-04 14:59:11 +03:00
Pekka Enberg	5abd4b8041	schema: Add has_index() to schema class	2017-05-04 14:59:11 +03:00
Pekka Enberg	1fb1828aa2	schema: Add index_names() to schema class	2017-05-04 14:59:11 +03:00
Pekka Enberg	1c1a767408	schema: Add find_index_noname() to schema class This adds a find_index_noname() helper to the schema class, which searches for an index that is otherwise equal but ignores the name of the index in comparison. This is needed for CREATE INDEX to reject duplicate index creation.	2017-05-04 14:59:11 +03:00
Pekka Enberg	62fba73a05	schema_builder: Add index_metadata support	2017-05-04 14:59:11 +03:00
Pekka Enberg	05e12a1d2b	schema: Add index_metadata maps to raw_schema class	2017-05-04 13:22:12 +03:00
Raphael S. Carvalho	ddc1d80c28	compaction: remove dead function declaration Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-2-raphaelsc@scylladb.com>	2017-05-04 11:48:51 +03:00
Raphael S. Carvalho	61229ab88c	compaction: fix type for cleanup After compaction revamp, compaction type set by cleanup at its ctor is being overwritten at compaction::setup(). Consequently, cleanup would not be stopped by 'nodetool stop cleanup' and cleanup would be listed as regular compaction in 'nodetool compactionstats'. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>	2017-05-04 11:48:50 +03:00
Avi Kivity	211a337883	Merge seastar upstream * seastar 194d80f...4a3118c (4): > execution_stage: fix wrong exception thrown for non-unique stages > metrics: add missing move assignment operators for metric_group, metric_groups > Remove unused lambda captures > core: lw_shared_ptr::get() should return nullptr for null pointer	2017-05-04 11:47:05 +03:00
Asias He	66e3b73b9c	repair: Fix partition estimation We estimate the number of partitions for a given range of a column family and split the range into sub ranges containing fewer partitions as a checksum unit. The estimation is wrong, because we need to count the partitions on all the shards, instead of only counting the local shard. Fixes #2299 Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>	2017-05-03 16:25:45 +03:00
Pekka Enberg	1e04731fa0	Merge "gossip mark alive fixes" from Asias "This series fixes the use after free issue in gossip and eliminates the duplicated / unnecessary mark alive operations. Fixes #2341" * tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev: gossip: Ignore callbacks and mark alive operation in shadow round gossip: Ignore the duplicated mark alive operation gossip: Fix use after free in mark_alive	2017-05-03 12:19:16 +03:00
Jacob Johansen	9616956c16	dist/docker: Add support for experimental flag Fixes #2188 Message-Id: <20170502180047.24071-1-jacob.johansen@virginpulse.com>	2017-05-03 10:29:55 +03:00
Asias He	3bd9840c01	gossip: Ignore callbacks and mark alive operation in shadow round In shadow round, we are only interested in the peer's endpoint_state, e.g., gossip features, host_id, tokens. No need to call the on_restart or on_join callbacks or to go through the mark alive procedure with EchoMessage gossip message. We will do them during normal gossip runs anyway.	2017-05-03 07:24:21 +08:00
Asias He	1441ae5cac	gossip: Ignore the duplicated mark alive operation If a node is being marked as alive with EchoMessage, ignore the future duplicated mark alive operation.	2017-05-03 07:24:21 +08:00
Asias He	d682fbfa28	gossip: Fix use after free in mark_alive After sending the echo message, the node might not be in the endpoint_state_map anymore, so using the reference to local_state might cause a use after free. Fixes #2341	2017-05-03 07:24:20 +08:00
Raphael S. Carvalho	8b0e358d73	tests/sstable_test: fix release-mode compaction_manager_test in release mode, compaction task is active after submitting request because ready future may be scheduled immediately. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170502171925.9893-1-raphaelsc@scylladb.com>	2017-05-02 20:48:30 +03:00
Calle Wilund	a37d03cd1d	transport::server: ignore socket shutdown future results These will as of-late always be ready. Removing the future usage in preparation for changing api signature to void(*)() (i.e. prevent breakage on seastar update)	2017-05-02 15:08:47 +00:00
Avi Kivity	7e29dd7066	managed_bytes: improve alignment hygiene While blob_storage is marked as an unaligned type, the back references also point to an unaligned type (a pointer to blob_storage), since a back reference can live in a blob_storage. This triggers errors from zapcc/clang 4. Fix by creating a type for the reference, which is marked as unaligned. Message-Id: <20170502071404.507-1-avi@scylladb.com>	2017-05-02 10:04:13 +01:00
Pekka Enberg	2f83232a02	schema: Add index_metadata class	2017-05-02 10:29:18 +03:00
Avi Kivity	b46f6a4124	build: ignore unused lambda capture warnings from clang Worthwhile to revisit later.	2017-05-02 10:09:58 +03:00
Raphael S. Carvalho	8dfb5f9c33	tests/sstable_test: fix compaction_manager_test after 'compaction: make major compaction go through compaction manager', the test fails because task is preempted in debug mode before it reaches instruction to increase stat. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170501183255.6191-1-raphaelsc@scylladb.com>	2017-05-02 09:06:41 +03:00
Avi Kivity	1d12d69881	logalloc: define segment_zone::maximum_size Yield build errors with some compilers, if missing.	2017-05-01 16:31:29 +03:00
Amnon Heiman	b59c95359d	scylla_setup: Fix conditional when checking for newer version During the changes to the way housekeeping checks for a newer version and warns about it in the installation, the UUID part was removed but kept in the surrounding if. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170426075724.7132-1-amnon@scylladb.com>	2017-05-01 12:13:35 +03:00
Raphael S. Carvalho	3071b9052a	compaction: make cleanup_compaction inherit from regular_compaction Some fields that belong to regular and cleanup aren't needed for resharding_compaction, such as incremental selector (which is used for determining max purgeable timestamp for a given decorated key) Better move those fields to regular and make cleanup inherit from regular compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>	2017-04-30 19:37:09 +03:00
Raphael S. Carvalho	687a4bb0c2	dtcs: do not compact fully expired sstable whose ancestor is not deleted yet Currently, fully expired sstable[1] is unconditionally chosen for compaction by DTCS, but that may lead to a compaction loop under certain conditions. Let's consider that an almost expired sstable is compacted, and it's not deleted yet, and that the new sstable becomes expired before its ancestor is deleted. Because this new sstable is expired, it will be chosen by DTCS, but it will not be purged because 'compacted undeleted' sstables are taken into account by calculation of max purgeable timestamp and prevents expired data from being purged. The problem is that this sequence of events can keep happening forever as reported by issue #2260. NOTE: This problem was easier to reproduce before improvement on compaction of expired cells, because fully expired sstable was being converted into a sstable full of tombstones, which is also considered fully expired. Fixes #2260. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>	2017-04-30 19:35:46 +03:00
Paweł Dziepak	24f4dcf9e4	db: make virtual dirty soft limit configurable Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>	2017-04-30 19:17:22 +03:00
Avi Kivity	248aa4fc23	Merge "Fix update of counter in static rows" from Paweł "The logic responsible for converting counter updates to counter shards was not covered by unit tests and didn't transform counter cells inside static rows. This series fixes the problem and makes sure that the tests cover both static rows and transformation logic." * tag 'pdziepak/static-counter-updates/v1' of github.com:cloudius-systems/seastar-dev: tests/counter: test transform_counter_updates_to_shards tests/counter: test static columns counters: transform static rows from updates to shards	2017-04-30 19:13:44 +03:00
Avi Kivity	339322517e	Merge "sstables: index_reader: Fix advance_to() to include relevant range tombstones" from Tomasz "Fixes #2326." * 'tgrabiec/fix-range-tombstones-missing-when-slicing' of github.com:cloudius-systems/seastar-dev: tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones() tests: mutation_source_test: Add test for slicing of clustered rows tests: mutation_reader_assertions: Log expectations tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation() tests: sstables: Use read_row() for single-key reads tests: sstables: Test more configutaions of sstable writer in test_sstable_conforms_to_mutation_source() sstables: Improve logging sstables: index_reader: Fix advance_to() to include relevant range tombstones	2017-04-30 14:40:41 +03:00
Avi Kivity	831ee80c3c	tests: workaround older boost::apply_visitor requiring a result_type member Older versions of boost::apply_visitor require a result_type member for the visitor; supply it to make them happy. Fixes #2312.	2017-04-30 13:56:44 +03:00
Takuya ASADA	a19c1b7f86	dist/redhat: add missing dependencies for Fedora We only have "%{?rhel:Requires}" for scylla-server, need fedora one. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493314367-419-1-git-send-email-syuu@scylladb.com>	2017-04-30 11:06:27 +03:00
Takuya ASADA	fe9f72d2c0	dist/debian: add python3-pyudev to dependencies pyudev is required for seastar/scripts/perftune.py. Fixes #2315 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493309116-18074-1-git-send-email-syuu@scylladb.com>	2017-04-30 11:05:15 +03:00
Paweł Dziepak	f5cf86484e	lsa: introduce upper bound on zone size Attempting to create huge zones may introduce significant latency. This patch introduces the maximum allowed zone size so that the time spent trying to allocate and initialise a zone is bounded. Fixes #2335. Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>	2017-04-30 10:58:11 +03:00
Paweł Dziepak	5c302cf67b	tests/counter: test transform_counter_updates_to_shards	2017-04-28 16:29:34 +01:00
Paweł Dziepak	0473750056	tests/counter: test static columns	2017-04-28 16:29:34 +01:00
Paweł Dziepak	0ffdd8d3d0	counters: transform static rows from updates to shards	2017-04-28 16:29:34 +01:00
Tomasz Grabiec	d4df6e214e	tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones()	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	22cce52dff	tests: mutation_source_test: Add test for slicing of clustered rows	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	86b693f562	tests: mutation_reader_assertions: Log expectations	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	ece6e107cc	tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation()	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	6354acc1a2	tests: sstables: Use read_row() for single-key reads So that as_mutation_reader() will create the same kind of reader which database::make_sstable_reader() does. Before this change, all readers were range readers.	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	fd5dbe04b5	tests: sstables: Test more configutaions of sstable writer in test_sstable_conforms_to_mutation_source() Test different versions of the format, and different promoted index block sizes. The size of 1 is especially important, it will put each fragment in a separate block, exposing various issues with promoted index handling.	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	c5baeed6d2	sstables: Improve logging	2017-04-27 18:43:49 +02:00
Tomasz Grabiec	b523815ac1	sstables: index_reader: Fix advance_to() to include relevant range tombstones Fixes #2326.	2017-04-27 18:43:49 +02:00
Glauber Costa	14b9aa2285	reduce kernel scheduler wakeup granularity We set the scheduler wakeup granularity to 500usec, because that is the difference in runtime we want to see from a waking task before it preempts the running task (which will usually be Scylla). Scheduling other processes less often is usually good for Scylla, but in this case, one of the "other processes" is also a Scylla thread, the one we have been using for marking ticks after we have abandoned signals. However, there is an artifact from the Linux scheduler that causes those preemptions to be missed if the wakeup granularity is exactly twice as small as the sched_latency. Our sched_latency is set to 1ms, which represents the maximum time period in which we will run all runnable tasks. We want to keep the sched_latency at 1ms, so we will reduce the wakeup granularity to something slightly lower than 500usec, to make sure that such artifact won't affect the scheduler calculations. 499.99usec will do - according to my tests, but we will reduce it to a round number. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170427135039.8350-1-glauber@scylladb.com>	2017-04-27 18:11:35 +03:00
Pekka Enberg	9cfb94510f	Merge "Fix issues found by PVS-Studio static analyzer" from Vlad Fix issues found by PVS-Studio as reported by Phillip Khandeliants. Merge branch 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev * 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev: type_parser: catch exceptions by reference and not by value token_metadata::get_host_id(ep): add a missing 'throw'	2017-04-27 11:39:49 +03:00
Vlad Zolotarov	d5b76d5198	type_parser: catch exceptions by reference and not by value Found by PVS-Studio static analyzer: Type slicing. An exception should be caught by reference rather than by value. Fixes #2288 Reported-by: Phillip Khandeliants Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-26 15:12:15 -04:00
Vlad Zolotarov	181c68e97d	token_metadata::get_host_id(ep): add a missing 'throw' Caught by PVS-Studio static analyzer: The object was created but it is not being used. The 'throw' keyword could be missing: throw runtime_error(FOO); Reported-by: Phillip Khandeliants Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-26 14:54:34 -04:00
Takuya ASADA	7a59336b8a	main.cc: drop FS type check Since we added support for ext4, we don't need to limit the filesystem to XFS anymore. Fixes #1933 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493212525-26264-1-git-send-email-syuu@scylladb.com>	2017-04-26 17:35:55 +03:00
Raphael S. Carvalho	8bae413bcf	database: fix format msg for sprint Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170425224920.16607-1-raphaelsc@scylladb.com>	2017-04-26 17:18:58 +03:00
Raphael S. Carvalho	f49bdb6839	compaction_manager: dont go on with major compaction if task was stopped A column family which was truncated will remove itself from compaction manager. Any task running a compaction should be interrupted and a task waiting to run should bail out when it wakes up. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170425224350.15965-3-raphaelsc@scylladb.com>	2017-04-26 17:18:37 +03:00
Takuya ASADA	abf65cb485	dist/debian: skip tunables when kernel = 3.13.0--generic, to prevent kernel panic bug There is kernel panic bug on kernel = 3.13.0--generic(Ubuntu 14.04), we have to skip tunables. Fixes #1724 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493196636-25645-1-git-send-email-syuu@scylladb.com>	2017-04-26 11:54:11 +03:00
Vlad Zolotarov	a9ad762f47	docs: tracing.md: add a "how to get traces" chapter This chapter describes how to get tracing information. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-25 21:52:29 -04:00
Vlad Zolotarov	f993e85b5f	tracing::trace_keyspace_helper: introduce a time series helper tables Introduce two time series helper tables that will simplify the querying of traces. One for querying regular traces: CREATE TABLE system_traces.sessions_time_idx ( minute timestamp, started_at timestamp, session_id uuid, PRIMARY KEY (minute, started_at, session_id)) and one for querying slow query records: CREATE TABLE system_traces.node_slow_log_time_idx ( minute timestamp, started_at timestamp, session_id uuid, start_time timeuuid, node_ip inet, shard int, PRIMARY KEY (minute, started_at, session_id)) With these tables one may get the relevant traces like in an example below: SELECT * from system_traces.sessions_time_idx where minute in ('2016-09-07 16:56:00-0700') and started_at > '2016-09-07 16:56:30-0700' Fixes #2243 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-25 21:52:28 -04:00
Vlad Zolotarov	81bcc36b16	tracing: cleanup: use nullptr instead of trace_state_ptr() Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-25 21:52:28 -04:00
Vlad Zolotarov	b0f660331a	tracing: introduce a span ID and parent span ID This patch makes the tracing framework follow the general idea of Google's Dapper paper: traces generated in a context of the same query are forming a single-rooted acyclic tree where in a ScyllaDB case vertexes are spans running on each involved replica Node and edges are RPCs sent from one Node to another. - Each vertex in the tree above has an ID - "span ID". - In order to be able to build the tree from the sessions traces we need to know the parent "span ID" - the ID of a span that sent an RPC that created the current span. - Each span of a tracing session is given a 64-bit random span ID. - The root span has a span_id::illegal_id value. This patch adds: - The described above parent span ID and a span ID to the one_session_records object. - The current span ID is passed in the trace_info struct to the remote replica. - Add parent_id and span_id columns to system_traces.events table for the parent ID and span ID. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-25 21:52:23 -04:00
Tomasz Grabiec	92dba05f0d	sstables: Fix malformed_sstable_exception from single-key reads After `4742008b70`, _read_partial_row is never set, and we will fail here in case the consumer will exhaust the range. That would be the case if the end bound of the slice aligns with the end of the index page. Fix by assuming that if we're out of range in the middle of partition, we sliced. Message-Id: <1493121249-18847-1-git-send-email-tgrabiec@scylladb.com>	2017-04-25 14:59:08 +03:00
Avi Kivity	628b3092e4	Merge "Reify shadowable tombstones" from Duarte "This series introduces the row_tombstone class, which represents a tombstone applied to a clustering row. It distinguishes itself from a normal tombstone by the fact that it contains a regular tombstone and a shadowable one, which can be erased by a row marker. The intent of the series is thus to reify the idea of shadowable tombstones, that up until now we considered all materialized view row tombstones to be, leading to incorrect results." * 'materialized-views/shadowable/v5' of https://github.com/duarten/scylla: sstables: Read and write shadowable tombstones mutation_partion: Use row_tombstone mutation_partion: Introduce row_tombstone mutation_partition: Introduce shadowable tombstones idl-compiler: Support optional fields in views tombstone: Extract out relational operators row_marker: Mark constructors explicit	2017-04-25 13:05:27 +03:00
Duarte Nunes	d45596ae8e	sstables: Read and write shadowable tombstones This patch serializes shadowable tombstones to sstables by adding a new, incompatible atom's mask. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Duarte Nunes	4e693383f7	mutation_partion: Use row_tombstone This patch replaces the current row tombstone representation by a row_tombstone. The intent of the patch is thus to reify the idea of shadowable tombstones, that up until now we considered all materialized view row tombstones to be. We need to distinguish shadowable from non-shadowable row tombstones to support scenarios such as, when inserting to a table with a materialized view: 1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1 2. delete from base using timestamp 2 where p = 3 3. insert into base (p, v1) values (3, 1) using timestamp 3 These should yield a view row where v2 is definitely null, but with the current implementation, v2 will pop back with its value v2=3@TS=1, even though it's dead in the base row. This is because the row tombstone inserted at 2) is a shadowable one. This patch only addresses the memory representation of such row_tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Duarte Nunes	6a2bccd4ae	mutation_partion: Introduce row_tombstone This patch introduces the row_tombstone class, which represents a tombstone made up of a regular tombstone and a shadowable one. The rules for row_tombstones are as follows: - The shadowable tombstone is always >= the regular one; - The regular tombstone works as expected; - The shadowable tombstone doesn't erase or compact away the regular row tombstone, nor dead cells; - The shadowable tombstone can erase live cells, but only provided they can be recovered (e.g., by including all cells in a MV update, both updated cells and pre-existing ones); - The shadowable tombstone can be erased or compacted away by a newer row marker. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:28 +02:00
Duarte Nunes	3d49c1da01	mutation_partition: Introduce shadowable tombstones A shadowable tombstone is a tombstone that can be replaced by a smaller one if provided a row_marker with a bigger timestamp than the shadowable tombstone. In the context of a row, it is only valid as long as no newer insert is done (thus setting a live row marker; note that if the row timestamp set is lower than the tombstone's, then the tombstone remains in effect as usual). If a row has a shadowable tombstone with timestamp Ti and that row is updated with a timestamp Tj, such that Tj > Ti (and that update sets the row marker), then the shadowable tombstone is shadowed by that update. A concrete consequence is that if the update has cells with timestamp lower than Ti, then those cells are preserved (since the deletion is removed), and this is contrary to a regular, non-shadowable row tombstone where the tombstone is preserved and such cells are removed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:22 +02:00
Duarte Nunes	8cc29f84fb	idl-compiler: Support optional fields in views When generating view code, the compiler was ignoring optional fields. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:43:04 +02:00
Duarte Nunes	d216c3dbd2	tombstone: Extract out relational operators This patch extracts out the relational operators in struct tombstone to a class capable of generating them from a tri-compare function. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:43:04 +02:00
Duarte Nunes	392403b5b3	row_marker: Mark constructors explicit Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:43:04 +02:00
Tomasz Grabiec	f3609fc813	tests: log_historgram_test: Fix compiation on Ubuntu Some gcc versions incorrectly complain: tests/log_histogram_test.cc:87:22: error: ‘opts1’ is not a valid template argument for type ‘const log_histogram_options&’ because object ‘opts1’ has not external linkage size_t hist_key<node<opts1>>(const node<opts1>& n) { return n.v; } Apparently this is a bug in gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52036 Fixes #2307. Message-Id: <1493108791-11247-1-git-send-email-tgrabiec@scylladb.com>	2017-04-25 12:15:28 +03:00
Pekka Enberg	940c3f4330	Merge "Clang fixes (part 2)" from Avi "This series fixes some more errors found by clang, with the aim of enabling clang/zapcc as a supported compiler. A single issue remains, but it's probably in std::experimental::optional::swap(); not in our code." * tag 'clang/2/v1' of https://github.com/avikivity/scylla: sstable_test: avoid passing negative non-type template arguments to unsigned parameters UUID: add more comparison operators sstable_datafile_test: avoid string_view user-defined literal conversion operator mutation_source_test: avoid template function without template keyword cql_query_test: define static variable cql_query_test: add braces for single-item collection initializers storage_service: don't use typeid(temporary) logalloc: remove unused max_occupancy_for_compaction storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic storage_proxy: drop unused member access from return value storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare read_repair_decision: fix operator<<(std::ostream&, ...)	2017-04-24 20:32:16 +03:00
Tomasz Grabiec	dfbb9fd8f1	gdb: Workaround for gdb.Value being not accepted by %x Fixes the following error in "scylla segment-descs" and a similar one in "scylla lsa-segment": Traceback (most recent call last): File "scylla-gdb.py", line 530, in invoke gdb.write('0x%x: lsa free=%d region=0x%x zone=0x%x\n' % (addr, desc['_free_space'], desc['_region'], desc['_zone'])) TypeError: %x format: an integer is required, not gdb.Value Message-Id: <1493029465-6482-1-git-send-email-tgrabiec@scylladb.com>	2017-04-24 13:27:25 +03:00
Avi Kivity	6d9e18fd61	logalloc: reduce descriptor overhead Every lsa-allocated object is prefixed by a header that contains information needed to free or migrate it. This includes its size (for freeing) and an 8-byte migrator (for migrating). Together with some flags, the overhead is 14 bytes (16 bytes if the default alignment is used). This patch reduces the header size to 1 byte (8 bytes if the default alignment is used). It uses the following techniques: - ULEB128-like encoding (actually more like ULEB64) so a live object's header can typically be stored using 1 byte - indirection, so that migrators can be encoded in a small index pointing to a migrator table, rather than using an 8-byte pointer; this exploits the fact that only a small number of types are stored in LSA - moving the responsibility for determining an object's size to its migrator, rather than storing it in the header; this exploits the fact that the migrator stores type information, and object size is in fact information about the type The patch improves the results of memory_footprint_test as follows: Before: - in cache: 976 - in memtable: 947 After: mutation footprint: - in cache: 880 - in memtable: 858 A reduction of about 10%. Further reductions are possible by reducing the alignment of lsa objects. logalloc_test was adjusted to free more objects, since with the lower footprint, rounding errors (to full segments) are different and caused false errors to be detected. Missing: adjustments to scylla-gdb.py; will be done after we agree on the new descriptor's format.	2017-04-24 12:23:12 +02:00
Avi Kivity	b4e897a66d	cql3::metadata: fix undefined evaluation order in constructor We both move names_ to its destination, and call names_.size() in the same expression; this has undefined evaluation order, and fails with clang. With this patch as well as the clang build fixes, Scylla starts and is able to serve requests (light cassandra-stress load). Message-Id: <20170423121727.1948-1-avi@scylladb.com>	2017-04-24 10:40:12 +03:00
Duarte Nunes	cddf2f4d74	tests: Fix failure virtual_reader_test This patch fixes a failure of virtual_reader_test, where both the test itself and the cql_test_env initialize the messaging_service to listen on the same address and port, triggering an assert in posix_ap_server_socket_impl::accept(). Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170423104240.21275-1-duarte@scylladb.com>	2017-04-23 14:06:35 +03:00
Avi Kivity	566c094764	sstable_test: avoid passing negative non-type template arguments to unsigned parameters Clang complains. The test looks somewhat bogus, but that's for another patch.	2017-04-22 22:13:55 +03:00
Avi Kivity	dc6ea51ffa	UUID: add more comparison operators Clang wanted them for some unit test; not sure how gcc was able to synthesize them, but they're clearly needed.	2017-04-22 22:12:33 +03:00
Avi Kivity	5424aca745	sstable_datafile_test: avoid string_view user-defined literal conversion operator Clang doesn't like it, perhaps because it isn't in the std namespace (it's still in std::experimental).	2017-04-22 22:11:30 +03:00
Avi Kivity	705ac957a2	mutation_source_test: avoid template function without template keyword This isn't (yet?) standard C++, and clang rejects it.	2017-04-22 22:10:21 +03:00
Avi Kivity	551fb03476	cql_query_test: define static variable single_node_cql_env is declared but not defined; define it to make clang happy.	2017-04-22 22:01:44 +03:00
Avi Kivity	eb700752d8	cql_query_test: add braces for single-item collection initializers Clang complains that braces are missing; I didn't verify it but I'm sure it's right. Add braces to make it happy.	2017-04-22 22:00:49 +03:00
Avi Kivity	6bb8ae7788	storage_service: don't use typeid(temporary) Clang warns that the expression will be evaluated (doh). While the warning seems dubious, keep it and change the code to call the function outside typeid(), in case it does help someone one day.	2017-04-22 21:09:41 +03:00
Avi Kivity	9303b09a64	logalloc: remove unused max_occupancy_for_compaction Noticed by clang.	2017-04-22 21:09:41 +03:00
Avi Kivity	6d0811711f	storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic Clang's std::abs() doesn't support __int128_t, so use __int64_t instead. With this change, it's possible that a read repair 252,700 years after a write will be interpreted as a recent write and the read repair will incorrectly be skipped; hopefully by that time __int128_t will be standardized.	2017-04-22 21:09:41 +03:00
Avi Kivity	5ec1742b9a	storage_proxy: drop unused member access from return value Noticed by clang.	2017-04-22 21:09:41 +03:00
Avi Kivity	e4bae0df51	storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare Noticed by clang.	2017-04-22 21:09:41 +03:00
Avi Kivity	944047f039	read_repair_decision: fix operator<<(std::ostream&, ...) Argument-dependent lookup requires that the operator be declared in the same namespace as the class; move it there. While at it, de-static it, it only causes bloat.	2017-04-22 21:09:41 +03:00
Raphael S. Carvalho	4a86dd473d	tests: add tests/sstable_resharding_test.cc Forgot to add file after resolving conflict. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170422172053.3734-1-raphaelsc@scylladb.com>	2017-04-22 21:09:29 +03:00
Benoît Canet	f68049ef5d	tests: Fix clang auto universal reference type deduction Replace it by regular template type deduction. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <20170421204150.4626-2-benoit@scylladb.com>	2017-04-22 20:04:00 +03:00
Benoit Canet	b902f3b81b	tests: Remove parenthesis in variable declaration They prevent clang from compiling the tests. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <20170421204150.4626-1-benoit@scylladb.com>	2017-04-22 20:04:00 +03:00
Avi Kivity	54ab13eb8e	Merge "sstable resharding revamp" from Raphael "Currently, a shared sstable is rewritten at all shards it belongs to, and only after that, it's deleted. This new algorithm adds the ability to reshard a set of sstables together at a single shard and produce unshared sstable for all shards involved. That's important for the leveled compaction strategy issue, in which the number of sstables grows considerably after resharding. What happened is that every sstable was being split into N ones, so we could end up with tons of small sstables. Now, we will reshard together a set of adjacent sstables." * 'sstable_resharding_revamp_v9' of github.com:raphaelsc/scylla: tests: add test for new sstable resharding database: kill column_family::start_rewrite database: wire up new resharding algorithm database: implement new sstable resharding algorithm database: introduce function to replace new sstables by their ancestors prevent regular compaction from choosing shared sstables compaction_strategy: implement resharding strategy for compaction strategies sstables: store more info in foreign_sstable_open_info sstables: make it possible to get open info from loaded sstable database: export column family dir database: inform if column family has shared tables sstables: add method to export ancestors lcs: implement get_level_count compaction_manager: introduce method to check if manager stopped lcs: restore invariant instead of sending overlapping sst to L0 sstables: extend compaction for new resharding sstables: allow shard A to correctly create sstable for shard B compaction: rework compacting_sstable_writer to work with multiple writers compaction: prepare compacting_sstable_writer to work with writers sstables: rework compaction to make it easy to extend	2017-04-22 13:31:54 +03:00
Raphael S. Carvalho	8a37b279ed	tests: add test for new sstable resharding Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:34 -03:00
Raphael S. Carvalho	662fe77c11	database: kill column_family::start_rewrite Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:33 -03:00
Raphael S. Carvalho	43ac19eb52	database: wire up new resharding algorithm Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:31 -03:00
Raphael S. Carvalho	cf45333588	database: implement new sstable resharding algorithm NOTE: it's not wired yet. Currently, a shared sstable is rewritten at all shards it belongs to and only after that, it's deleted. With this new algorithm, a shared sstable will be read only once and N unshared sstables will be created, each of them with 1/N of the data. After it's done, each owner shard will receive its new unshared sstable replacing its ancestors. Another benefit is that we'll no longer have resharding resulting in number of sstables growing considerably after resharding. A full-sized leveled sstable is usually 160MB, so after resharding, we could have N files of 160MB/N. Now, leveled strategy will help resharding. N adjacent sstables of same level will be resharded together, so we'll end up with N files of N*160MB/N. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:30 -03:00
Raphael S. Carvalho	6513252e91	database: introduce function to replace new sstables by their ancestors When resharding, we're working with sstables from all shards. So let's say we're done with resharding of sstable A that belongs to shard 0 and 1 and sstable B that belongs to shard 1 and 2. SStables were generated for shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables and remove the ancestors. Shard 1 for example will remove sstables A and B (ancestors) and add the new one. This is where the new function comes in. We'll forward new sstables to their target shards using foreign sstable open info. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:27 -03:00
Raphael S. Carvalho	c44a2319e6	prevent regular compaction from choosing shared sstables For new resharding, it's important to exclude resharding sstables from the list of candidates for regular compaction. That doesn't affect current resharding because it marks the sstables as compacting. That won't work with new resharding which will work with sstables from multiple shards. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:26 -03:00
Raphael S. Carvalho	13477075e2	compaction_strategy: implement resharding strategy for compaction strategies Strategies other than leveled will reshard one shared sstable at a time, and the target shard, shard at which job will run, for each job will be chosen in a round-robin fashion. For leveled strategy, we will reshard together smp::count adjacent sstables that belong to same level. The reason for that is because resharding one sstable at a time may result in creation of file for each shard, meaning after resharding we could end up with NO_SSTABLES*NO_SHARDS. These resharding strategies will be used for our new resharding algorithm. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:24 -03:00
Raphael S. Carvalho	bf930476b3	sstables: store more info in foreign_sstable_open_info We need that info for opening a sstable at different shard, unlike sstable loader which has everything in entry_descriptor, obtained from components in sstable filename. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:22 -03:00
Raphael S. Carvalho	e5e7037aa4	sstables: make it possible to get open info from loaded sstable It will be useful for resharding which will need to move a sstable across shards, and to do that without reloading the sstable at target shard, we need to be able to get the open info and move it to the target shard instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:21 -03:00
Raphael S. Carvalho	405e41e9a8	database: export column family dir Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:19 -03:00
Raphael S. Carvalho	2b774c5bc3	database: inform if column family has shared tables That's gonna be useful to quickly determine if it's worth resharding a column family. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:17 -03:00
Raphael S. Carvalho	2d119287b7	sstables: add method to export ancestors Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:16 -03:00
Raphael S. Carvalho	f2f8a2f5c7	lcs: implement get_level_count Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:14 -03:00
Raphael S. Carvalho	585596cede	compaction_manager: introduce method to check if manager stopped Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:12 -03:00
Raphael S. Carvalho	d82a8dfae0	lcs: restore invariant instead of sending overlapping sst to L0 A large token span sstable may find its way into high level due to resharding, which means the strategy invariant is broken. The invariant is restored by compacting first set of overlapping sstables, meaning that the restoration is done incrementally for multiple overlapping sets. Invariant is restored by regular compaction after resharding puts new unshared sstables into their original level, where level > 0. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:09 -03:00
Raphael S. Carvalho	0127309820	sstables: extend compaction for new resharding Extends compaction for new resharding algorithm. Not wired yet. New resharding will compact shared sstable(s) and create one sstable for each owner. It's up to the caller to open these new unshared sstables at their respective column families. This new approach will save a lot of bandwidth because we'll no longer read the entire shared sstable #smp::count times. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:08 -03:00
Raphael S. Carvalho	758bc38e7a	sstables: allow shard A to correctly create sstable for shard B That's possible by shard A explicitly saying that sstable is created for shard B. If we don't do that, sharding metadata isn't correct, and consequently sstable will report wrong owners. We'll need this for resharding which will create sstables for all shards that own the shared sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:06 -03:00
Raphael S. Carvalho	2a437ab427	compaction: rework compacting_sstable_writer to work with multiple writers compacting_sstable_writer only allowed one writer so far, but we will need multiple ones for resharding. It's done by moving writer management to compaction. finish_sstable_writer() is added for compaction impl to stop all writers, whereas stop_sstable_writer() will only stop current writer (needed when current sstable reaches max limit size for example). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:05 -03:00
Raphael S. Carvalho	a35a3a9647	compaction: prepare compacting_sstable_writer to work with writers No need for compacting_sstable_writer to store items that are available in compaction class. Also, that's a step towards supporting multiple writers for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:03 -03:00
Raphael S. Carvalho	38ed83e2f7	sstables: rework compaction to make it easy to extend compact_sstables() supported both regular and cleanup compaction, but with lots of conditions that made it ugly and hard to extend. In the future, we want to introduce a new type of compaction for resharding that will create one sstable for every shard owning the sstable(s) given as input. That will be easier now. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:02 -03:00
Avi Kivity	fdcf64520d	Merge seastar upstream * seastar 2eec212...194d80f (4): > removing the collectd tests > fix fstream metrics reporting. > do_for_each: Make it check for need preempt > core/sharded: introduce copy method to foreign_ptr	2017-04-21 22:14:01 +03:00
Avi Kivity	fccbf2c51f	Merge "Reduce memory reclamation latency" from Tomasz "Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. Statistics for reclamation pauses for a read workload over larger-than-memory data set: Before: avg = 865.796362 stdev = 10253.498038 min = 93.891000 max = 264078.000000 sum = 574022.988000 samples = 663 After: avg = 513.685650 stdev = 275.270157 min = 212.286000 max = 1089.670000 sum = 340573.586000 samples = 663 Refs #1634." * tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev: lsa: Reduce reclamation latency tests: Add test for log_histogram log_histogram: Allow non-power-of-two minimum values lsa: Use regular compaction threshold in on-idle compaction tests: row_cache_test: Induce update failure more reliably lsa: Add getter for region's eviction function	2017-04-21 17:47:06 +03:00
Tomasz Grabiec	20f4c9bf23	lsa: Reduce reclamation latency Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. Statistics for reclamation pauses for a read workload over larger-than-memory data set: Before: avg = 865.796362 stdev = 10253.498038 min = 93.891000 max = 264078.000000 sum = 574022.988000 samples = 663 After: avg = 513.685650 stdev = 275.270157 min = 212.286000 max = 1089.670000 sum = 340573.586000 samples = 663 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-04-21 12:52:31 +02:00
Tomasz Grabiec	4313641c03	tests: Add test for log_histogram	2017-04-21 12:52:31 +02:00
Tomasz Grabiec	c83768d6bb	log_histogram: Allow non-power-of-two minimum values We will want to reuse the min_size mechanism for the whole compaction threshold, including the occupancy threshold. That threshold is close to the segment size and we cannot pick a power of two which would be close enough to what we need. Therefore, change log_histogram to support arbitrary minimum base. bucket_of() was moved into log_histogram_options so that it can be used in number_of_buckets(), which makes for a simple and much less error-prone implementation.	2017-04-21 10:54:50 +02:00
Tomasz Grabiec	7a800c54bf	lsa: Use regular compaction threshold in on-idle compaction Idle-time compaction should not produce not-compactible segments because that means we would have to evict a lot when we finally need to reclaim some memory, so that occupancy falls below the regular compaction threshold. This may cause latency spikes. Refs #1634.	2017-04-20 15:00:15 +02:00
Tomasz Grabiec	e054ccc037	tests: row_cache_test: Induce update failure more reliably After changing region evictability condition to be less strict, cache update stopped failing because reclamation was able to compact dense region. Induce failure by installing evictor which refuses to evict from cache beyond a few elements.	2017-04-20 14:51:47 +02:00
Tomasz Grabiec	7aa286439f	lsa: Add getter for region's eviction function	2017-04-20 14:51:42 +02:00
Vlad Zolotarov	9c1d803157	fix_system_distributed_tables.py: add --node and --port parameters Allow giving a non-default IP address and a port to connect to the cluster. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1491316458-18420-1-git-send-email-vladz@scylladb.com>	2017-04-20 14:49:26 +03:00
Avi Kivity	68f0df12ee	Merge "Optimize reads with clustering restrictions" from Tomasz "This series makes several optimizations to sstable mutation reader relevant for large partitions. Some highlights: One optimization is to use the index for skipping across clustering restrictions. Currently we read whole partition in such cases. That includes the case when we need to read a static row and then jump to some clustering row in the middle of the partition. Another case is having more than one clustering restriction, e.g. selecting multiple single rows from the same partition. Another optimization is using information from the index for creation of streamed_mutation. That can save us the cost of reading the partition header from the data file in case we would not continue reading, but skip to the middle of that partition. Or we may not even attempt to read anything from that partition, if after we determine the key that reader will be put behind other readers, which will exhaust the query limit first. Another optimization is switching single-partition queries to use the index_reader infrastructure. Index lookups via index_reader are faster than find_disk_ranges(). This is also a cleanup, a step towards converting all code to use the index_reader."
* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits) sstables: Remove unused code sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition sstables: mutation_reader: Use index_reader for single-partition reads sstables: mutation_reader: Add trace-level logging sstables: mutation_reader: Move partition reading code to sstable_data_source sstables: mutation_reader: Move definitions out of the class body sstables: Move binary_search() to a header database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating sstables: index_reader: Introduce advance_to_next_partition() sstables: index_reader: Introduce advance_and_check_if_present() sstables: index_reader: Introduce advance_past() sstables: index_reader: Make copyable sstables: index_reader: Optimize advancing to extreme positions sstables: index_reader: Keep two last pages alive dht: ring_position_view: Add key getter dht: ring_position_view: Add constructor and factory from ring_position_view sstables: mutation_reader: Advance to next partition using index in some cases sstables: index_reader: Expose access to partition key and tombstone sstables: index_reader: Introduce promoted_index_view sstables: mutation_reader: Move _index_in_current to sstable_data_source ...	2017-04-20 13:58:37 +03:00
Tomasz Grabiec	3472a74de4	sstables: Remove unused code	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	c1059ca8e4	sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition It's cheaper than a key-based lookup, so use it when we can.	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	4742008b70	sstables: mutation_reader: Use index_reader for single-partition reads This switches single-partition query to use the index_reader infrastructure. Index lookups via index_reader are faster than find_disk_ranges(). perf_fast_forward, rows: 1000000, value size: 100 Before: Testing forwarding with clustering restriction in a large partition: pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu no 0.002182 2 916 3 152 2 0 0 1 1 88.1% After: Testing forwarding with clustering restriction in a large partition: pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu no 0.000758 2 2639 3 152 2 0 0 1 1 48.6% This is also a cleanup, a step towards converting all code to use the index_reader.	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	9d8795089d	sstables: mutation_reader: Add trace-level logging	2017-04-20 11:18:55 +02:00
Tomasz Grabiec	b198c31c46	sstables: mutation_reader: Move partition reading code to sstable_data_source It will be reused for read_row(), which doesn't create a mutation_reader instance, only sstable_data_source.	2017-04-20 11:18:26 +02:00
Tomasz Grabiec	6e4bca0be6	sstables: mutation_reader: Move definitions out of the class body To make further refactoring easier to review. No functional changes here.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	4ed7e529db	sstables: Move binary_search() to a header There are instantiations of binary_search() used in sstables.cc, but defined in partition.cc. The instantiations are explicitly declared in partition.cc, but the types changed and they became obsolete. The thing worked because partition.cc also instantiated it with the right type. But after that code will be removed, it no longer would, and we would get a linker error. To avoid such problems, define binary_search() in a header.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	bedd0ab6f9	database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	0b5ba13230	sstables: index_reader: Introduce advance_to_next_partition()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	4b81844d2e	sstables: index_reader: Introduce advance_and_check_if_present()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	b92f095bf0	sstables: index_reader: Introduce advance_past()	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	6780756258	sstables: index_reader: Make copyable	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	7db83fa3fe	sstables: index_reader: Optimize advancing to extreme positions	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	f66443c01c	sstables: index_reader: Keep two last pages alive The idea behind caching is that when we have two index readers where one is catching up with the other, each page will be read only once. Currently that's not always the case. There is a case when advance_to() may need to read two pages. That's when the target position is not found in the first page as determined by the summary index. The second reader which catches up would have to read the first page as well, but it would not be in cache any more. To avoid this extra I/O let's keep a reference to the two last pages touched by the index.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	c7b9c5dfd3	dht: ring_position_view: Add key getter	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	5b71e0b9ab	dht: ring_position_view: Add constructor and factory from ring_position_view	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	3e8795494e	sstables: mutation_reader: Advance to next partition using index in some cases To produce a streamed_mutation for the next partition, we need to read its key and the tombstone. Currently we always do that by consuming the partition header from the data file. In some cases that may cause unnecessary IO. It's better to obtain partition information from the index if we already have it. We can save on IO if the user will skip past the front of partition immediately after. It is also better to pay the cost of reading the index if we know that we will need to use the index anyway soon. This patch predicts that by checking if there are any clustering restrictions. If there are any, we will almost surely need_skip() and use the index anyway. This change also lays the ground for unification of multi and single partition queries without loss of performance.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	e35fe7492c	sstables: index_reader: Expose access to partition key and tombstone	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	ae72c159b1	sstables: index_reader: Introduce promoted_index_view So that we have a nice way of extracting tombstone out of it. We don't always need a fully parsed index.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	0ef33b7f29	sstables: mutation_reader: Move _index_in_current to sstable_data_source sstable_data_source holds a shared state between mutation_reader and streamed_mutation for sstables. The information whether index is in current partition will have to be accessed by both in the following patches.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	885f53d905	sstables: mutation_reader: Avoid resetting the walker Before the change, the following scenario was happening: 1) we try to skip based on clustering restrictions 2) we find the page and fast forward to it, recording walker's lower bound counter 3) we read the first fragment, it's not a tombstone, so we reset the walker, and its lower bound counter too 4) the fragment is not in range (the range starts in the middle of the page) 5) needs_skip() is true, we redo the index lookup, which wastes some CPU This change fixes the problem by avoiding resetting the walker. We can do that because leading tombstones are checked with a non-mutable contains_tombstone()	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	bf21aa3a1f	clustering_ranges_walker: Introduce contains_tombstone()	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	b030ce693d	sstables: mutation_reader: Don't try to read index to skip to static row Static row is always at the beginning, there's no point in doing index lookups.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	3e060659f1	sstables: mutation_reader: Don't try to read static row if table doesn't have any	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	b1860a8a24	clustering_ranges_walker: Allow excluding the static row	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	77d3e30239	sstables: mutation_reader: Use index to skip across clustering restrictions Improves scans with clustering restrictions. Before the change such scans would scan whole partition. Below are results of a test case from perf_fast_forward which selects few rows from a large partition using query restrictions (not fast forwarding). Before: stride rows time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu 1000000 1 0.000609 1 1642 3 152 2 1 0 1 1 38.0% 500000 2 0.242255 2 8 511 64152 398 4 0 1 1 98.6% 250000 4 0.281592 4 14 749 95832 564 4 0 1 1 98.4% 125000 8 0.328056 8 24 873 111704 657 4 0 1 1 98.4% 62500 16 0.306700 16 52 935 119640 751 4 0 1 1 99.4% After: stride rows time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu 1000000 1 0.000711 1 1406 3 152 2 1 0 1 1 42.1% 500000 2 0.000910 2 2197 5 216 3 2 0 1 1 39.2% 250000 4 0.001384 4 2891 9 344 5 4 0 1 1 35.3% 125000 8 0.003197 8 2502 21 728 13 8 0 1 1 53.1% 62500 16 0.006664 16 2401 41 1368 25 16 0 1 1 58.2%	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	05a1f92cbc	clustering_ranges_walker: Introduce lower_bound_change_counter() Allows detecting changes of lower_bound(). Result of advance_to() is not enough. When we get false from advance_to() twice in a row, lower bound may or may not have changed.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	461f2af0a1	sstables: mutation_reader: Avoid index lookups when out of range	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	10c92d37d1	sstables: mutation_reader: Simplify fast_forward_to()	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	bfb6858e55	sstables: mutation_reader: Let clustering_ranges_walker handle the _fwd_range start Simplifies the code a bit, but also will make it easier to calculate the next position we should skip to after forwarding, taking into consideration both the position forwarded to as well as clustering ranges of the query. That will be just calling _ck_ranges_walker->lower_bound() after it is trimmed.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	d056d9c31b	sstables: mutation_reader: Let mp_row_consumer decide about position passed to the index In general mp_row_consumer has better information about the next position to read. It could be after the position we forward to if there are clustering restrictions. This will be exploited in the following patches.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	a37712e9ae	sstables: mutation_reader: Move mp_row_consumer::fast_forward_to() out of line	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	bb3e683783	clustering_ranges_walker: Support trimming Makes implementing fast_forward_to() easier. mp_row_consumer emulates this currently. This change will allow simplifying this.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	652d04e78a	clustering_ranges_walker: Generalize to work on position ranges It will include the static row by default. This will allow simplifying users, which work with position ranges already.	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	c85fe3183c	position_range: Allow stealing of bounds	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	503c68de44	position_in_partition: Add more factory methods	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	6c1dc642ee	sstables: mutation_reader: Create index on-demand	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	434fda3577	sstables: mutation_reader: Keep priority_class by reference To indicate that it is not optional.	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	a8c126c82a	sstables: Expose get_index_reader()	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	e1af5a406d	sstables: Make sstable::get_index_reader() return unique_ptr<> Makes callers a bit simpler	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	7dc3fe7d3f	tests: perf_fast_forward: Add test case for forwarding with clustering restrictions in a large partition	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	eed864690b	tests: perf_fast_forward: Add test case for slicing of large partition using a single-partition reader	2017-04-20 10:54:36 +02:00
Tomasz Grabiec	81fc7977a4	tests: perf_fast_forward: Add test for selecting few rows from large partition	2017-04-20 10:54:36 +02:00
Raphael S. Carvalho	3286f7aaa6	compaction: make major compaction go through compaction manager From now on, major compaction will go through compaction manager. Major compaction is serialized to reduce disk space requirement. Each column family will be running either minor or major compaction at a given time. The only issue is number of small sstables growing while major compaction is running, but major compaction itself will reduce the number of tables considerably. If this turns out to be an issue, we can allow minor to start in parallel to major, but not the other way around. Fixes #1156. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>	2017-04-19 15:44:21 +03:00
Duarte Nunes	e06bafdc6c	alter_type_statement: Fix signed to unsigned conversion This could allow us to alter a non-existing field of an UDT. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170419114254.5582-1-duarte@scylladb.com>	2017-04-19 14:48:12 +03:00
Tomasz Grabiec	02da3ba316	tests: perf_fast_forward: Fix use-after-free in scan_with_stride_partitions() partition_range must live as long as the reader is used.	2017-04-19 08:37:56 +02:00
Raphael S. Carvalho	e78db43b79	compaction_manager: fix crash when dropping a resharding column family Problem is that column family field of task wasn't being set for resharding, so column family wasn't being properly removed from compaction manager. In addition to fixing this issue, we'll also interrupt ongoing compactions when dropping a column family, exactly like we do with shutdown. Fixes #2291. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>	2017-04-18 17:39:27 +03:00
Duarte Nunes	af37a3fdbf	logalloc: Fix compilation error This patch moves a function using the region_impl type after the type has been defined. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170418124551.25369-1-duarte@scylladb.com>	2017-04-18 15:56:26 +03:00
Raphael S. Carvalho	11b74050a1	partitioned_sstable_set: fix quadratic space complexity streaming generates lots of small sstables with large token range, which triggers O(N^2) in space in interval map. level 0 sstables will now be stored in a structure that has O(N) in space complexity and which will be included for every read. Fixes #2287. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>	2017-04-18 13:04:38 +03:00
Takuya ASADA	86e464ab26	dist/offline_installer: support Ubuntu/Debian moved existing script to dist/offline_installer/redhat, added .deb version into dist/offline_installer/debian. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1492474821-9907-1-git-send-email-syuu@scylladb.com>	2017-04-18 10:56:50 +03:00
Pekka Enberg	b31c45d8af	Merge "clang fixes (part 1)" from Avi "This series fixes some errors found by clang, with the aim of enabling clang/zapcc as a supported compiler. A few more fixes are needed to produce a binary." * tag 'clang/1/v1' of https://github.com/avikivity/scylla: logalloc: avoid auto in function argument declaration thrift: avoid auto in function argument declaration streamed_mutation: fix non-POD argument to C-style variadic function mutation_partition_serializer: avoid auto in function argument declaration date: use correct casts for years streaming: avoid auto in function argument declaration repair: avoid auto in function argument declaration gms: expose gms::inet_address streaming operator murmur3_partitioner: fix build on clang i_partitioner: remove unused function byte_ordered_partitioner: fix bad operator precedence result_set: pass comparator by reference to std::sort() to_string: move standard container overloads of to_string to std:: namespace cql_type: fix bad enum syntax on clang build: disable more warnings for clang build: fix detection of unsupported warnings on clang	2017-04-18 08:49:25 +03:00
Avi Kivity	844529fe33	logalloc: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type. Since the right type is private, add some friendship.	2017-04-17 23:18:44 +03:00
Avi Kivity	54add19ca2	thrift: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type.	2017-04-17 23:18:44 +03:00
Avi Kivity	f0c25fc20f	streamed_mutation: fix non-POD argument to C-style variadic function Clang warns that passing a non-POD to a C-style variadic function will result in an abort(). That happens to be exactly what we want, but to silence the warning, use a template instead. Since templates aren't allowed in local classes, move the containing class to namespace scope.	2017-04-17 23:18:44 +03:00
Avi Kivity	635c32eb32	mutation_partition_serializer: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type.	2017-04-17 23:18:44 +03:00
Avi Kivity	a0858dda3e	date: use correct casts for years Our date implementation uses int64_t for years, but some of the code was not changed; clang complains, so use the correct casts to make it happy.	2017-04-17 23:03:15 +03:00
Avi Kivity	ca69a04969	streaming: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type.	2017-04-17 23:03:15 +03:00
Avi Kivity	ae7d7ae20f	repair: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type.	2017-04-17 23:03:15 +03:00
Avi Kivity	c885c468a9	gms: expose gms::inet_address streaming operator The standard says, and clang enforces, that declaring a function via a friend declaration is not sufficient for ADL to kick in. Add a namespace level declaration so ADL works.	2017-04-17 23:03:15 +03:00
Avi Kivity	af118ab52b	murmur3_partitioner: fix build on clang Don't know what the root cause is, but the fix is harmless.	2017-04-17 23:03:15 +03:00
Avi Kivity	c05f60387b	i_partitioner: remove unused function Found by clang.	2017-04-17 23:03:15 +03:00
Avi Kivity	a496ec7f5b	byte_ordered_partitioner: fix bad operator precedence Found by clang.	2017-04-17 23:03:15 +03:00
Avi Kivity	d9aaa95b29	result_set: pass comparator by reference to std::sort() Clang complains about some error without it, I could not understand it, but I'm not going to argue with it. Since std::sort() will copy the comparator, it's better to pass using an std::ref(), and everyone is happy.	2017-04-17 23:03:15 +03:00
Avi Kivity	a83a24268d	to_string: move standard container overloads of to_string to std:: namespace Argument-dependent lookup will not find to_string() overloads in the global namespace if the argument and the caller are in other namespaces. Move these to_string() overloads to std:: so ADL will find them. Found by clang.	2017-04-17 23:03:15 +03:00
Avi Kivity	a7fe7aedbf	cql_type: fix bad enum syntax on clang cql3::type used some gcc extension that is not recognized on clang; use the standard syntax instead.	2017-04-17 22:35:41 +03:00
Avi Kivity	1faef017e3	build: disable more warnings for clang We should fix the source and re-enable the warnings, but this will do for now.	2017-04-17 22:34:59 +03:00
Avi Kivity	78e9b0265b	build: fix detection of unsupported warnings on clang The diagnostic that clang spits out when it sees an unrecognized warning is itself a warning, so the test compilation succeeds and we don't notice the warning is not supported. Adding -Werror turns the warning about the unrecognized warning into an error, allowing the detection machinery to work.	2017-04-17 22:33:01 +03:00
Takuya ASADA	b8f40a2dff	dist/ami/files/.bash_profile: warn user when enhanced networking is not enabled Show warnings on following conditions: - VPC is not used - Driver is not enhanced networking one Fixes #1984 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1488844756-14935-1-git-send-email-syuu@scylladb.com>	2017-04-15 15:16:55 +03:00
Benoît Canet	8f793905a3	perf_sstable: Change busy loop to futurized loop The blocked task detector introduced in `113ed9e963` was seeing the initialization phase of perf_sstable as a blocked task. Transform this part of the code into a futurized loop to make the blocked task detector happy. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <20170413132506.17806-1-benoit@scylladb.com>	2017-04-13 18:17:28 +03:00
Amnon Heiman	1dfd32f070	scylla-housekeeping service: Support private repositories This patch adds support for private repositories for scylla-housekeeping. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-04-13 18:13:57 +03:00
Amnon Heiman	5839dc1f20	scylla-housekeeping-upstart: Use repository id, when checking for version This patch allows the check version to use private repository when checking for version. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-04-13 18:12:52 +03:00
Amnon Heiman	622502de7a	scylla-housekeeping: support private repositories This patch allows the check version to support private repositories. If a repository file is passed as a parameter, the repository id will be passed as a parameter when checking version. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-04-13 18:11:15 +03:00
Avi Kivity	7d16cfa5f0	Merge branch 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev "The version of create_index_statement class that was translated to C++ is pretty old by now. This series of cleanups brings it closer to Apache Cassandra trunk to make it easier to bring over more secondary index code to Scylla." * 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev: cql3/statements/create_index_statement: Move target validation cql3/statements/create_index_statement: Remove static column validation cql3/statements/create_index_statement: Extract validations cql3/statements/create_index_statement: Kill bogus custom validation cql3/statements/create_index_statement: Add materialized view to validate() cql3/statements/create_index_statement: Remove validation	2017-04-13 13:27:53 +03:00
Asias He	d27b47595b	gossip: Fix possible use-after-free of entry in endpoint_state_map We take a reference of endpoint_state entry in endpoint_state_map. We access it again after code which defers; the reference can be invalid after the defer if someone deletes the entry during the defer. Fix this by taking the reference again after the deferring code. I also audited the code to remove unsafe reference to endpoint_state_map entry as much as possible. Fixes the following SIGSEGV: Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --'. Program terminated with signal SIGSEGV, Segmentation fault. (this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127 127 in /usr/include/c++/5/bits/stl_pair.h [Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))] Fixes #2271 Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>	2017-04-13 13:18:17 +03:00
Takuya ASADA	81c1b07bac	dist: add offline installer This introduces an offline installer generator. It will generate a self-extractable archive which contains Scylla packages and dependency packages. Package installation automatically starts when the archive is executed. Limitation: Only CentOS is supported at this point. Fixes #2268 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491997091-15323-1-git-send-email-syuu@scylladb.com>	2017-04-13 13:16:09 +03:00
Avi Kivity	ac48767146	Merge "tracing and cql3 patches" from Vlad "This series was initially meant to only transition the keyspace based backend to work on top of prepared statements but there were a few potential issues found on the way. In addition the original Tracing series has been expanded with a few patches in the cql3 layer that are improving the generic cql3 layer but are not obvious without the context of the following Tracing patches. The "main" patch contains a heavy rework of trace_keyspace_helper: - Use prepared statements for updating tables instead of manually constructing mutations: - We intentionally decrease the level of code robustness from "paranoid" to "normal". - The code gets a lot simpler, e.g. we don't need to cache columns definitions any more. - We are losing some performance here but: - Tracing write is not in the fast path. - Tracing write events should be rare. - Currently the performance loss (for the actual write time of all trace records) for a "SELECT" query with a specific key is about 45%: 144us vs 99us." * 'tracing_rework_using_prepared-v6' of github.com:cloudius-systems/seastar-dev: tracing: use prepared statment for updating tables tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter tracing::trace_keyspace_helper: introduce a table_helper class tracing::trace_keyspace_helper: add static qualifier to make_monotonic_UUID_tp() and elapsed_to_micros() methods tracing::tracing: allow slow query TTL only in the signed 32-bit integer range cql3::query_processor::prepare(): futurize the error case cql3::query_options: add a factory method for creation of options for a BATCH statement cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value cql3::query_processor: use weak_ptr for passing the prepared statements around	2017-04-13 11:07:49 +03:00
Raphael S. Carvalho	a6f8f4fe24	compaction: do not write expired cell as dead cell if it can be purged right away When compacting a fully expired sstable, we're not allowing that sstable to be purged because expired cell is unconditionally converted into a dead cell. Why not check if the expired cell can be purged instead using gc before and max purgeable timestamp? Currently, we need two compactions to get rid of a fully expired sstable which cells could have always been purged. look at this sstable with expired cell: { "partition" : { "key" : [ "2" ], "position" : 0 }, "rows" : [ { "type" : "row", "position" : 120, "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z", "ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true }, "cells" : [ { "name" : "country", "value" : "1" }, ] now this sstable data after first compaction: [shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79 (~65% of original) in 229ms = 0.000328997MB/s. { ... "rows" : [ { "type" : "row", "position" : 79, "cells" : [ { "name" : "country", "deletion_info" : { "local_delete_time" : "2017-04-09T17:07:12Z" }, "tstamp" : "2017-04-09T17:07:12.702597Z" }, ] now another compaction will actually get rid of data: compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original) in 1ms = 0MB/s. ~2 total partitions merged to 0 NOTE: It's a waste of time to wait for second compaction because the expired cell could have been purged at first compaction because it satisfied gc_before and max purgeable timestamp. Fixes #2249, #2253 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>	2017-04-13 10:59:19 +03:00
Vlad Zolotarov	f8956ba01a	tracing: use prepared statment for updating tables In addition to actually moving to using the prepared statements the changes also include: - Kill the cache_xxx() methods - the schema is going to be checked during the prepared statement creation and during its execution. - Move the caching of table ID and the prepared statement to the get_schema_ptr_or_create(). - Rename: get_schema_ptr_or_create() -> cache_table_info(). After these changes we are less strict in our demands to system_traces tables schemas, e.g. if some column's type is not exactly as we expect but rather only "compatible" in the CQL sense we will tolerate this and will continue to write into that table. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 17:12:42 -04:00
Vlad Zolotarov	baf0289951	tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter An object built with this constructor will use the what() message from the given exception in the final error message. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 17:04:54 -04:00
Vlad Zolotarov	98864c6c30	tracing::trace_keyspace_helper: introduce a table_helper class This class contains a general table info and implements standard operations on this table: - Creation. - Info caching. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 17:04:48 -04:00
Pekka Enberg	8c45038729	cql3/statements/create_index_statement: Move target validation Move index target validation to preserve the same code structure as Apache Cassandra and simplify support for multiple index targets.	2017-04-12 20:50:09 +03:00
Pekka Enberg	528c33a05b	cql3/statements/create_index_statement: Remove static column validation Apache Cassandra supports secondary indices on static columns since commit 9e74891 ("Add support for secondary indexes on static columns").	2017-04-12 20:46:29 +03:00
Pekka Enberg	975e1c8fc6	cql3/statements/create_index_statement: Extract validations Extract specific validations to separate functions to preserve the same structure as Apache Cassandra code and make it easier to add support for multiple index targets.	2017-04-12 20:44:15 +03:00
Pekka Enberg	940d6de1b8	cql3/statements/create_index_statement: Kill bogus custom validation Rejecting custom indices is bogus because it's just a configuration mechanism like replication strategy, for example. Furthermore, it's needed for SASI indices, which we likely need to be compatible with.	2017-04-12 20:17:09 +03:00
Pekka Enberg	cfadb70565	cql3/statements/create_index_statement: Add materialized view to validate() Apache Cassandra does not support secondary indices on materialized views so neither should we.	2017-04-12 20:15:20 +03:00
Pekka Enberg	1706f346f2	cql3/statements/create_index_statement: Remove validation The validation was removed in Apache Cassandra commit 0626be8 ("New 2i API and implementations for built in indexes"). Let's also remove it from our code so that we remove one dependency to the obsolete db/index/ code.	2017-04-12 20:14:52 +03:00
Vlad Zolotarov	b4bf0735b8	tracing::trace_keyspace_helper: add static qualifier to make_monotonic_UUID_tp() and elapsed_to_micros() methods Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 12:24:08 -04:00
Vlad Zolotarov	aa1f8ccea4	tracing::tracing: allow slow query TTL only in the signed 32-bit integer range Any TTL is eventually converted into the gc_clock::duration value, which is based on int32_t type. Limit the node_slow_log TTL user configurable value to the same values range for consistency. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 12:24:08 -04:00
Vlad Zolotarov	1685cefa57	cql3::query_processor::prepare(): futurize the error case Make sure that errors are reported in a form of an exceptional future and not by a direct exception throwing. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 12:24:08 -04:00
Vlad Zolotarov	fcef9d3b05	cql3::query_options: add a factory method for creation of options for a BATCH statement Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 12:24:08 -04:00
Vlad Zolotarov	75fbc7c558	cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value This constructor should be used when we know that there are no bound terms in the current batch statement. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 12:24:08 -04:00
Vlad Zolotarov	ff55b76562	cql3::query_processor: use weak_ptr for passing the prepared statements around Use seastar::checked_ptr<weak_ptr<prepared_statement>> instead of shared_ptr for passing prepared statements around. This allows easy tracking and handling of statement invalidation. This implementation will throw an exception every time an invalidated statement reference is dereferenced. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-12 12:24:03 -04:00
Pekka Enberg	ecb8ee4efd	cql3/statements: Cleanup create_index_statement.cc Use namespaces and fix formatting issues to make the source file easier on the eyes. Message-Id: <1491977964-26629-1-git-send-email-penberg@scylladb.com>	2017-04-12 16:48:45 +03:00
Avi Kivity	db73cc045f	Merge seastar upstream * seastar e899c0b...2eec212 (4): > Merge "add seastar::checked_ptr class" from Vlad > resource: reduce default_reserve_memory size to fit low memory environment > scripts: posix_net_conf.sh: add --tune net parameter to perftune.py invocation > core: weak_ptr: add a weak_ptr(std::nullptr_t) constructor	2017-04-12 13:49:21 +03:00
Paweł Dziepak	0318dccafd	lsa: avoid unnecessary segment migrations during reclaim segment_zone::migrate_all_segments() was trying to migrate all segments inside a zone to the other one hoping that the original one could be completely freed. This was an attempt to optimise for throughput. However, this may unnecessarily hurt latency if the zone is large, but only a few segments are required to satisfy reclaimer's demands. Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>	2017-04-11 08:55:29 +02:00
Gleb Natapov	b4c368a6bc	storage_proxy: update correct statistics on range reads Fixes #2167 Message-Id: <20170405094119.GM8197@scylladb.com>	2017-04-09 18:16:06 +03:00
Glauber Costa	a808c32676	scylla_util: fix issues with cpuset handling While cpuset.conf is supposed to be set before this is used, not having a cpuset.conf at all is a valid configuration. The current code will raise an exception in this case, but it shouldn't. Also, as noted by Amos, atoi() is not available as a global symbol. Most invocations were safe, calling string.atoi(), but one of them wasn't. This patch replaces all usages of atoi() with int(), which is more portable anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170407172937.17562-1-glauber@scylladb.com>	2017-04-09 16:04:02 +03:00
Glauber Costa	f842aeb07a	mark i3 as a supported instance during login We have recently fixed the ami init scripts to mark i3 as a supported instance. However, the code to detect whether or not the instance is supported is duplicated, and called from multiple locations. That means that when the user logs in, it will see the instance as not supported - as the test is coming from a different source. This patch moves it to the scylla_lib.sh utilities script, so we can share it, and make sure it is right for all locations. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170406201137.8921-1-glauber@scylladb.com>	2017-04-09 12:35:35 +03:00
Glauber Costa	7ef8c6aaec	centos: add runtime dependency for perftune script Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170406172949.13264-1-glauber@scylladb.com>	2017-04-09 11:36:26 +03:00
Glauber Costa	ca8ca3b823	scylla_io_setup: change permissions The script ended up with the default permissions, which lack the executable bit. Change it, so it can be executed directly. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170407173109.17870-1-glauber@scylladb.com>	2017-04-09 11:36:13 +03:00
Pekka Enberg	a9ad5cc560	dist/redhat: Add node_health_check to package	2017-04-08 08:12:57 +03:00
Tomer Sandler	c49f944483	dist: Add node_health_check script This patch adds a script for running an automated node health check. [ penberg: Fold into single commit ]	2017-04-08 08:06:30 +03:00
Avi Kivity	3b19ea8796	Update scylla-ami submodule * dist/ami/files/scylla-ami 5d73a71...f10db69 (2): > use latest repo for amzn kernel > do not use duplicated code to test for instance support status	2017-04-07 12:28:45 +03:00
Avi Kivity	f579e4cc46	Merge seastar upstream * seastar c5dd395...e899c0b (5): > perftune: remove psutil dependency > metrics: change push_back({...}) to emplace_back(...) > prometheus: return the metric prefix > Merge "scripts/perftune.py: add disks tuning" from Vlad > Use safe name for metric family	2017-04-06 17:10:19 +03:00
Glauber Costa	f4502bfb79	ami: update packer to 1.0.0 so ENA gets enabled Older versions of packer do not support ENA, and when faced with the option "enhanced_networking": true, will only actually enable it for the older 82599 VF instances. Fortunately, packer 1.0 already supports it, and all we have to do is update it. While we are at it, let's check if the file is legit before using a random file we have downloaded from the internet, to avoid breaching our building process. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170406123952.14708-1-glauber@scylladb.com>	2017-04-06 16:39:19 +03:00
Glauber Costa	039f8d5994	dist/redhat: install the new scylla_util.py file For the debian package, the files don't have to be listed individually, but for RPMs they do. The rpm builds are currently failing with: error: Installed (but unpackaged) file(s) found: /usr/lib/scylla/scylla_util.py Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170406012336.21081-1-glauber@scylladb.com>	2017-04-06 10:20:32 +03:00
Avi Kivity	ecf5781597	Update scylla-ami submodule * dist/ami/files/scylla-ami 9e8c36d...5d73a71 (1): > support i3 instances	2017-04-05 18:52:46 +03:00
Amnon Heiman	6c1858b275	API:storage_service should support metrics load Following C* API there are two APIs for getting the load from storage_service: /storage_service/metrics/load /storage_service/load This patch adds the implementation for /storage_service/metrics/load The alternative is to drop one of the APIs and modify the JMX implementation to use the same API. Fixes #2245 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170401181520.19506-1-amnon@scylladb.com>	2017-04-05 18:14:19 +03:00
Tomasz Grabiec	d523c60629	sstables: Push fragments from mp_row_consumer so that parser is interrupted less Currently we return proceed::no after every mutation_fragment which is to be consumed. This forces the parser to save and reload its state often. This can be avoided if we pushed the fragments directly from mp_row_consumer, then we would return proceed::no only when the buffer fills up. tests/perf/perf_fast_forward shows 15% increase in throughput of a large partition scan, from 1.34M frag/s to 1.55M frag/s. Message-Id: <1490882700-22684-1-git-send-email-tgrabiec@scylladb.com>	2017-04-05 18:10:54 +03:00
Takuya ASADA	da75ce694a	dist: migrate python2 scripts to python3 Now we can package python3 .py script on CentOS, no need to keep using python2 so migrate them to python3. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491402479-7703-1-git-send-email-syuu@scylladb.com>	2017-04-05 17:35:35 +03:00
Takuya ASADA	610bc31b04	dist/redhat/scylla.spec.in: stop compiling .py on rpm rpmbuild tries to compile any *.py by default, but it causes compilation error on python3 code when python2 is system default (CentOS, RHEL). So skip compiling it, drop .pyc / .pyo from the package. Fixes #2235 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1490859768-18900-1-git-send-email-syuu@scylladb.com>	2017-04-05 12:15:22 +03:00
Avi Kivity	58318db4bb	Merge "Rewrite scylla_io_setup in python" from Glauber "Also, this new version supports i3. Number of requests for i3 is obtained similarly as to i2: I have run tests for a single disk, and then we'll take the amount of disks into account. Other possible limits are also taken into account, like the max per-shard seastar limit of 128 in-flight requests, and the per-disk limit obtained by sysfs." * 'python-io-setup' of https://github.com/glommer/scylla: rewrite scylla_io_setup in python scripts: add python module with common utilities	2017-04-05 10:39:54 +03:00
Avi Kivity	88637c86c2	Update ami submodule * dist/ami/files/scylla-ami 407e8f3...9e8c36d (1): > Switch to Amazon Linux's kernel	2017-04-04 21:55:36 +03:00
Glauber Costa	2fa698ee95	rewrite scylla_io_setup in python We do it using the new scylla_util.py library. As we do it, we also enable i3 support. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-04-04 14:44:52 -04:00
Glauber Costa	ba7010b7a5	scripts: add python module with common utilities As we convert more stuff to python, we'll have more opportunities for sharing code between them. We already do that for the bash scripts with a file "scylla_lib.sh". We'll do the same for python. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-04-04 14:42:19 -04:00
Vlad Zolotarov	c26799c9b0	config: enforce the 'stop' value for commit_failure_policy/disk_failure_policy Fixes #2246 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1491246164-26612-1-git-send-email-vladz@scylladb.com>	2017-04-04 16:46:36 +03:00
Takuya ASADA	4262edd843	dist: use distribution standard fstrim script instead of our custom one We recently introduced fstrim cronjob / systemd timer unit, but some distributions already have their own fstrim cronjob / systemd timer unit. So let's use them when it's possible. Fixes #2233 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491226727-10507-1-git-send-email-syuu@scylladb.com>	2017-04-04 16:44:45 +03:00
Takuya ASADA	72696eff22	dist/common/scripts/scylla_setup: add 'cancel' feature on RAID disk selector prompt Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491217320-10211-2-git-send-email-syuu@scylladb.com>	2017-04-04 15:35:23 +03:00
Takuya ASADA	4eee3dd778	dist/common/scripts/scylla_setup: skip list DVD drive on block device list for RAID Fixes #2230 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491217320-10211-1-git-send-email-syuu@scylladb.com>	2017-04-04 15:35:22 +03:00
Takuya ASADA	b087616a6c	dist/debian/debian/scylla-server.upstart: export SCYLLA_CONF, SCYLLA_HOME We are sourcing sysconfig file on upstart, but forgot to load them as environment variables. So export them. Fixes #2236 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491209505-32293-1-git-send-email-syuu@scylladb.com>	2017-04-03 16:34:20 +03:00
Pekka Enberg	57c4bed420	Revert "dist: add --options-file /etc/scylla/scylla.yaml on sysconfig" This reverts commit `58e628eb3d`. Takuya says it does not fix issue #2236.	2017-04-03 16:33:58 +03:00
Takuya ASADA	58e628eb3d	dist: add --options-file /etc/scylla/scylla.yaml on sysconfig After RAID devices are mounted to /var/lib/scylla, scylla-server isn't able to find the ./conf/ directory since we haven't created symlinks on the volume. Instead of creating a symlink in scylla_raid_setup, let's specify the scylla.yaml path as a program argument. Fixes #2236 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1491056110-1078-1-git-send-email-syuu@scylladb.com>	2017-04-02 11:28:27 +03:00
Vlad Zolotarov	2d8fcde695	init: add a proper message when there is a bad 'seeds' configuration Fixes #2193 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1490912678-32004-1-git-send-email-vladz@scylladb.com>	2017-04-02 10:41:52 +03:00
Avi Kivity	d03207e939	Merge seastar upstream * seastar 2ebe842...c5dd395 (3): > configure.py: Fix unrecognized option error > Merge fixes for fstream slow start from Paweł > Merge "A cleaner safer metrics layer" from Amnon	2017-04-02 10:17:40 +03:00
Tzach Livyatan	4efee3432b	dist/ami: run nodetool status on each login Running nodetool status on each login to Scylla AMI helps in three ways: - give the user a quick view of the node and cluster status beyond the current "scylla is active" - hint to the user about the nodetool and how to use it - move the first, slow, run of nodetool to the login phase, making the second interactive run much faster on the down side, it does slow the login in a few sec Signed-off-by: Tzach Livyatan <tzach@scylladb.com> [ penberg: fix formatting ] Message-Id: <20170330081711.22038-1-tzach@scylladb.com>	2017-03-30 13:45:48 +03:00
Vlad Zolotarov	761a5eb72f	scripts: add fix_system_distributed_tables.py This script validates schemas of distributed system keyspaces: system_traces and system_auth. It tries to add the missing columns and checks that the existing columns have expected types. In case of any problem the corresponding info message is printed and non-zero exit status is returned. Extra columns are ignored. The validation function may also be called from another Python program. It returns True in case of success and False otherwise. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1490195645-29905-1-git-send-email-vladz@scylladb.com>	2017-03-29 19:15:05 +03:00
Avi Kivity	5b530aa464	Merge "Use promoted index for skipping in sstable mutation readers" from Tomasz "sstable_streamed_mutation::fast_forward_to() is changed to use promoted index (via index_reader) to optimize skipping in large partitions. In addition to that, sstable mutation_reader is changed to use the index to skip to the next partition. Performance impact was evaluated using newly added tests/perf/perf_fast_forward What's beyond this series: - Using index_reader for single-partition reads as well - Using index_reader for skipping across ranges in clustering restrictions" * tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits) tests: Add performance test for fast forwarding of sstable readers tests: Allow starting cql_test_env on pre-existing data config: Allow specifying source when setting value tests: sstable: Add test for fast forwarding within partition using index sstables: sstable_streamed_mutation: use index in fast_forward_to() sstables: Store parsed promoted index in index_entry sstables: Add trace-level logging for sstable consumption sstables: Define deletion_time earlier sstables: Make parsing throw exception on malformed promoted index tests: Add tests for ordering of position_in_partition relative to composites position_range: Introduce all_clustered_rows() factory method position_in_partition: Introduce for_key()/after_key() factory methods position_in_partition: Add factory methods for positions around all rows position_in_partition: Introduce for_range_start()/for_range_end() position_in_partition: Fix friendship declaration keys: Introduce is_empty() for prefixes position_in_partition: Make comparable with composites types: Enhance lexicographical comparators compound_compat: Accept marker value in serialize_value() compound_compat: Add trichotomic comparator ...	2017-03-29 19:01:12 +03:00
Raphael S. Carvalho	023031b0c8	compaction: lcs: fix functionality to feed starved levels quick introduction to level starvation: high levels may be left uncompacted (thus starved) for a long time if the user does something that makes them contain little data, such as cleanup or a change of max sstable size (default 160M). Leveled strategy handles this problem as follows: consider we're compacting L1 to L2. If L3 is starved, we look for one of its sstables that is fully contained in the token range of candidates L1->L2, so that we won't end up with an overlap in L2. now the problem: the functionality isn't working properly now because the range of candidates is being incorrectly calculated due to an accident when converting the code to C++. It won't cause an overlap because it's actually being more restrictive about which sstable from the starved level can be used. A test case was added to confirm the problem. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>	2017-03-29 18:59:46 +03:00
Takuya ASADA	5e0cb39db6	dist: follow DPDK tools directory name change Fixes #2234 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1490776630-6626-1-git-send-email-syuu@scylladb.com>	2017-03-29 11:38:39 +03:00
Tomasz Grabiec	97742fd4c2	test.py: Enable stack trace on UBSAN errors Message-Id: <1490769716-10217-1-git-send-email-tgrabiec@scylladb.com>	2017-03-29 11:08:05 +03:00
Tomasz Grabiec	7fd724821b	tests: Add performance test for fast forwarding of sstable readers	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	543a484d78	tests: Allow starting cql_test_env on pre-existing data	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	2c775bbb6e	config: Allow specifying source when setting value So that is_set() will be true for that option. Needed in tests which set some config options in a higher layer and then lower layers detect if the option was set or not before applying its default.	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	f1aca6d116	tests: sstable: Add test for fast forwarding within partition using index	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	3fbc0bed6e	sstables: sstable_streamed_mutation: use index in fast_forward_to()	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	5b36976bf0	sstables: Store parsed promoted index in index_entry	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	a2a8312c78	sstables: Add trace-level logging for sstable consumption	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	5af815bf20	sstables: Define deletion_time earlier	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	5e34743882	sstables: Make parsing throw exception on malformed promoted index Will be easier to propagate failure to upper layers once parsing is reused in the index_reader. The old behavior of ignoring parsing failures is preserved, but the error is logged now.	2017-03-28 18:34:55 +02:00
Tomasz Grabiec	b40b20387a	tests: Add tests for ordering of position_in_partition relative to composites	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	5b813898bc	position_range: Introduce all_clustered_rows() factory method	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	622713be60	position_in_partition: Introduce for_key()/after_key() factory methods	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	e27fa712f5	position_in_partition: Add factory methods for positions around all rows	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	b90275f8e3	position_in_partition: Introduce for_range_start()/for_range_end()	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	5c29f4dd04	position_in_partition: Fix friendship declaration	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	212a021fc6	keys: Introduce is_empty() for prefixes	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	2ff6c1705b	position_in_partition: Make comparable with composites	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	4e7fe40a70	types: Enhance lexicographical comparators They now accept optional lexicographical_relation which can be used to alter position of the element relative to elements prefixed by it. Example. Let's consider lexicographical ordering on strings. The position of "bc" in a sample sequence is affected by lexicographical_relation as follows: aa aaa b ba --> before_all_prefixed bc --> before_all_strictly_prefixed bca bcd --> after_all_prefixed bd bda c ca	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	6c6be0f7e4	compound_compat: Accept marker value in serialize_value()	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	1331d10811	compound_compat: Add trichotomic comparator	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	2121c2ee8a	compound_compat: Make composites printable	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	38e14ab3c8	compound_compat: Introduce composite::serialize_static() Generalized from static_prefix().	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	bef677b57d	compound_compat: Introduce composite_view::last_eoc()	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	18a057aa81	compound_compat: Return composite from serialize_value() To make the code more type-safe. Also, mark constructor from bytes explicit.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	680ffd20f5	compound: Use const bytes_view as iterator's value type The iterator doesn't really allow modifying the underlying component. This change enables using the iterator with boost::make_iterator_range() and boost::range::join(), which get utterly confused otherwise.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	123b102dd6	sstables: Skip to next partition using index Slicing front of a very large partition: Before: offset read time [s] frags frag/s aio [KiB] blocked dropped cpu 0 1 0.110960 1 9 992 126956 924 0 92.4% After: offset read time [s] frags frag/s aio [KiB] blocked dropped cpu 0 1 0.000784 1 1276 3 344 2 1 37.3%	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	a9252dfc58	sstables: Use separate index readers for lower and upper bounds So that lower bound can be advanced within the range.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	27d86dfe18	sstables: Enable skipping to cells at data_consume_context level	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	aad943523a	sstables: index_reader: Add trace-level logging	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	388315c1ff	sstables: Expose index metrics	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	1dbd2e239e	sstables: index_reader: Share index lists among other index readers Direct motivation for this is to be able to use two index readers from a single mutation reader, one for lower bound of the range and one for the upper bound of the range, without sacrificing optimization of avoiding index reads when forwarding to partition ranges which are close by. After the change, all index readers of given sstable will share index buffers, so lower bound reader can reuse the page read by the upper bound reader. The reason for using two readers will be so that we are able to skip inside the partition range, not only outside of it. This is not possible if we use the same index reader to locate the upper bound of the range, because we may only advance the cursor.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	0635d74e17	sstables: Make index_entry copyable Needed to make the index_list copyable, which is going to be needed to implement legacy get_index_entries() which returns by value, after index sharing is implemented.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	e36979da47	sstables: index_reader: Use sstable's schema Makes for a simpler interface.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	e3e2f037bb	sstables: index_reader: Refactor around the concept of a cursor Index reader already can be queried only with monotonic positions, so the concept of a cursor is ingrained. Making it explicit will make it easier to define behavior for forwarding within the partition. After the change: - lower_bound() is renamed to advance_to() and doesn't return the position, only advances the cursor - data file position for partition under cursor can be obtained at any time with data_file_position()	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	27862fa8f6	sstables: index_reader: Narrow down summary range during lookup Positions passed to lower_bound() must be non-decreasing, so summary indexes as well.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	02ace99798	sstables: index_reader: Change lookup to work on ring_position_view In preparation for changing the interface to work not only with ranges.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	cd295e9926	sstables: Avoid moving an sstable In preparation for adding non-movable members.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	5edb427873	sstables: Remove private constructor To reduce duplication.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	705bd6da1a	sstables: Remove unused method	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	a7301a702f	tests: Add missing blocking on fast_forward_to()	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	5fe14735e8	tests: dht: Test ring_position_comparator	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	ff6cca6e9e	tests: Add utility for checking total orders	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	d4b6e430ed	dht: Introduce ring_position_view	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	55a7cceef5	dht: Move comparison logic from ring_position::tri_compare() to ring_position_comparator It will soon define common ordering for many objects, not just ring_position.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	d5e704ca1e	sstables: Make key_view constructor from bytes_view explicit	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	65a8920b25	dht: Make min/max tokens capturable by reference So that they can be later used in views.	2017-03-28 18:10:39 +02:00
Tomasz Grabiec	f6d2a07422	config: Warn on use of [[deprecated]] instead of failing	2017-03-28 18:10:39 +02:00
Calle Wilund	b12b65db92	commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too The previous fix removed the additional insertion of "min rp" per source shard based on whether we had processed existing CF:s or not (i.e. if a CF does not exist as sstable at all, we must tag it as zero-rp, and make whole shard for it start at same zero). This is bad in itself, because it can cause data loss. It does not cause crashing however. But it did uncover another, old old lingering bug, namely the commitlog reader initiating its stream wrongly when reading from an actual offset (i.e. not processing the whole file). We opened the file stream from the file offset, then tried to read the file header and magic number from there -> boom, error. Also, rp-to-file mapping was potentially suboptimal due to using bucket iterator instead of actual range. I.e. three fixes: * Reinstate min position guarding for unencountered CF:s * Fix stream creation in CL reader * Fix segment map iterator use. v2: * Fix typo Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>	2017-03-28 10:32:28 +02:00
Duarte Nunes	53014bd762	mutation_source_test: Ensure unique collection elements Duplicate elements are illegal in collections, so we ensure they only contain unique ones. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170327161149.8938-4-duarte@scylladb.com>	2017-03-27 18:44:11 +02:00
Duarte Nunes	94d568924d	mutation_source_test: Sort collection elements Ref #1607 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170327161149.8938-3-duarte@scylladb.com>	2017-03-27 18:43:58 +02:00
Duarte Nunes	4963902922	mutation_source_test: Remove extra randomness source This patch ensures we generate UUIDs using the same randomness source as all the other values we randomly generate, so that we can get a deterministic run from the seeds we print. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170327161149.8938-2-duarte@scylladb.com>	2017-03-27 18:43:44 +02:00
Takuya ASADA	b84828b487	dist/common/scripts/scylla_fstrim: don't abort the program when a disk doesn't support TRIM Do not abort the program until fstrim has run on all directories. Fixes #2220 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1490595397-19130-1-git-send-email-syuu@scylladb.com>	2017-03-27 11:43:19 +03:00
Avi Kivity	27c42359bc	Merge seastar upstream * seastar 6b21197...2ebe842 (6): > Merge "Various improvements to execution stages" from Paweł > app-template: allow apps to specify a name for help message > bool_class: avoid initializing object of incomplete type > app-template: make sure we can still get help with required options > prometheus: Http handler that returns prometheus 0.4 protobuf or text format > Update DPDK to 17.02 Includes patch from Pawel to adjust to updated execution_stage interface.	2017-03-26 10:50:21 +03:00
Takuya ASADA	e7697c37b2	dist/common/scripts/scylla_setup: warn twice before constructing RAID volume Since RAID construction has possibility to destroy user data, warn twice before executing scylla_raid_setup. Fixes #1346 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1487891797-13109-1-git-send-email-syuu@scylladb.com>	2017-03-26 10:36:23 +03:00
Glauber Costa	f7f187a7f3	docker: do not touch yaml during startup Users sometimes need to run their own yaml configuration files, and it is currently annoying to deploy modified files on docker. One possible solution is to bind mount the file into the docker container using the -v switch, just like we already do for the data volume. The problem with the aforementioned approach is that we have to change the yaml file to insert the addresses, and that will change the file in the host (or fail to happen, if we bind mount it read-only). The solution I am proposing is to avoid touching the yaml file inside the container altogether. Instead, we can deploy the address-related arguments that we currently write to the yaml file as Scylla options. Fixes #2113 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1490195141-19940-1-git-send-email-glauber@scylladb.com>	2017-03-23 16:52:40 +02:00
Pekka Enberg	c5b1908e03	dist/docker: Use stdout as logging output If early startup fails in docker-entrypoint.py, the container does not start. It's therefore not very helpful to log to a file _within_ the container... Message-Id: <1490275943-23590-1-git-send-email-penberg@scylladb.com>	2017-03-23 16:48:34 +02:00
Calle Wilund	c3a510a08d	commitlog_replayer: Do proper const-lookup of min positions for shards Fixes #2173 Per-shard min positions can be unset if we never collected any sstable/truncation info for it, yet replay segments of that id. Wrap the lookups to handle "missing data -> default", which should have been there in the first place. Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>	2017-03-22 17:57:09 +02:00
Amnon Heiman	064f5e1b63	row_cache: switch to the metrics layer registration This patch moves the row_cache metrics registration from collectd to the metric layer. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170321143812.785-3-amnon@scylladb.com>	2017-03-21 16:42:58 +02:00
Amnon Heiman	a6a13865bf	API: remove unneeded references to collectd This patch removes leftover references to collectd from the API. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170321143812.785-2-amnon@scylladb.com>	2017-03-21 16:42:57 +02:00
Vlad Zolotarov	79978c156e	transport::server: don't report a Tracing session ID unless requested Don't report a Tracing session ID unless the current query had a Tracing bit in its flags. Although the current master's behaviour is legal it's suboptimal and some Clients are sensitive to that. Let's fix that. Fixes #2179 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1490063752-8915-1-git-send-email-vladz@scylladb.com>	2017-03-21 13:57:08 +00:00
Vlad Zolotarov	9dd5b5762d	dist: install seastar/scripts/perftune.py together with posix_net_conf.sh posix_net_conf.sh is currently a wrapper for perftune.py script and perftune.py has to be at the same directory as posix_net_conf.sh. Fixes #2176. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1490020882-32361-1-git-send-email-vladz@scylladb.com>	2017-03-21 12:31:50 +02:00
Amnon Heiman	7572addfbf	column_family: metrics should be registered once The column_family constructor uses delegation; as such, only the actual constructor implementation should contain a call to register the metrics. The current implementation ends up re-registering the metrics. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170320140817.22214-1-amnon@scylladb.com>	2017-03-21 11:20:29 +02:00
Pekka Enberg	85a127bc78	dist/docker: Expose Prometheus port by default This patch exposes Scylla's Prometheus port by default. You can now use the Scylla Monitoring project with the Docker image: https://github.com/scylladb/scylla-grafana-monitoring To configure the IP addresses, use the 'docker inspect' command to determine Scylla's IP address (assuming your running container is called 'some-scylla'): docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla and then use that IP address in the prometheus/scylla_servers.yml configuration file. Fixes #1827 Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>	2017-03-20 15:29:52 +02:00
Amos Kong	468df7dd5f	scylla_setup: match '-p' option of lsblk with strict pattern On Ubuntu 14.04, lsblk doesn't have the '-p' option, but `scylla_setup` tries to get the block list with `lsblk -pnr` and triggers an error. The current simple pattern matches all help content, so it might match the wrong options. scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p -m, --perms output info about permissions -P, --pairs use key="value" output format Let's use a strict pattern to only match the option at the head. Example: scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D' -D, --discard print discard capabilities Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>	2017-03-20 08:10:35 +02:00
Raphael S. Carvalho	7deeffc953	database: serialize sstable cleanup We're cleaning up sstables in parallel. That means cleanup may need almost twice the disk space used by all sstables being cleaned up, if almost all sstables need cleanup and every one will discard an insignificant portion of its whole data. Given that cleanup is frequently issued when node is running out of disk space, we should serialize cleanups in every shard to decrease the disk space requirement. Fixes #192. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>	2017-03-19 12:33:03 +02:00
Tomasz Grabiec	bb0ce5d8fe	Merge "Ensure base and view schema versions match" from Duarte The mapping between a base table update and a view update is schema dependent, so we need to ensure the view schema versions match the base schema version. For example, we match base columns to view columns by name, so we need to ensure the base and view schemas we're using for writing are isolated with respect to a previous alter table statement. We thus need to match base schema versions with view schema versions, and we need to do so atomically to ensure that when one fiber sees a schema, it also sees the complete set of corresponding view schemas. This series ensures the schemas modified as a result of an alter table statement are published atomically, under the schema lock. This way, all the schemas referenced by the database are consistent with each other when they are observed by other fibers. Finally, we upgrade the mutation schema before generating the view updates, to ensure it matches the most recent view schemas the base replica knows about, registered in the database. The db::view::view class was replaced by a set of non-member functions, with its state, which used to reflect only the most recent schema version, being moved to a new view_info class.	2017-03-17 12:40:00 +01:00
Duarte Nunes	b27da688f9	mutation: Remove dead get_cell() function Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170316234843.23130-1-duarte@scylladb.com>	2017-03-17 11:18:23 +02:00
Pekka Enberg	91b9e0d914	Update scylla-ami submodule * dist/ami/files/scylla-ami eedd12f...407e8f3 (1): > scylla_create_devices: check block device is exists Fixes #2171	2017-03-17 11:13:07 +02:00
Tomasz Grabiec	3609665b19	lsa: Fix debug-mode compilation error By moving definitions of setters out of #ifdef	2017-03-16 18:23:05 +01:00
Tomasz Grabiec	88e7b3ff79	lsa: Ensure can_allocate_more_memory() always leaves a gap above seastar's min_free_memory() One of the goals of can_allocate_more_memory() is to prevent depleting seastar's free memory close to its minimum, leaving headroom above that minimum so that standard allocations will not cause reclamation immediately. Currently the function doesn't take into account the actual threshold used by the seastar allocator, so there could be no gap, or free memory could even go below the minimum. Fix that by ensuring there's always a gap above min_free_memory(). min_gap was reduced to 1 MiB so that low memory setups are not impacted significantly by the change. Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>	2017-03-16 12:42:50 +00:00
Tomasz Grabiec	17ede24a77	Update seastar submodule * seastar 4d25b85...6b21197 (3): > core: memory: Expose control of the free memory low water mark > scripts: add perftune.py > tutorial: make network examples work on multi-core	2017-03-16 13:32:45 +01:00
Pekka Enberg	3afd7f39b5	cql3: Wire up functions for floating-point types Fixes #2168 Message-Id: <1489661748-13924-1-git-send-email-penberg@scylladb.com>	2017-03-16 11:04:59 +00:00
Avi Kivity	434a4fee28	Merge "tests: Use allocating_section in lsa_async_eviction_test" from Tomasz "The test allocates objects in batches (allocation is always under a reclaim lock) of ~3MiB and assumes that it will always succeed because if we cross the low water mark for free memory (20MiB) in seastar, reclamation will be performed between the batches, asynchronously. Unfortunately that's prevented by can_allocate_more_memory(), which fails segment allocation when we're below the low water mark. LSA currently doesn't allow allocating below the low water mark. The solution which is employed across the code base is to use allocating_section, so use it here as well. Exposed by recent consistent failures on branch-1.7." * 'tgrabiec/fix-lsa-async-eviction-test' of github.com:cloudius-systems/seastar-dev: tests: lsa_async_eviction_test: Allocate objects under allocating section lsa: Allow adjusting reserves in allocating_section	2017-03-16 12:44:14 +02:00
Tomasz Grabiec	cefb6b604a	tests: lsa_async_eviction_test: Allocate objects under allocating section	2017-03-16 10:21:10 +01:00
Tomasz Grabiec	4ab8b255da	lsa: Allow adjusting reserves in allocating_section	2017-03-16 10:21:10 +01:00
Raphael S. Carvalho	6b6bb38f38	compaction_manager: stop manager after storage io error Manager will stop itself if a compaction fails due to storage io error, which unconditionally results in stop of transportation services. Fixes #2147. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170316054538.23423-1-raphaelsc@scylladb.com>	2017-03-16 10:37:47 +02:00
Duarte Nunes	876a514743	database: Upgrade mutation to current schema to push view updates This patch ensures we upgrade the mutation to the current schema when generating and pushing view updates, so that it matches the most up to date views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 18:15:27 +01:00
Duarte Nunes	be12a2bf0a	db/schema_tables: Atomically publish base and view changes This patch ensures that the schema merging atomically publishes schema changes. In particular, it ensures that when a base schema and a subset of its views are modified together (i.e., upon an alter table or alter type statement), then they are published together as well, without any deferring in-between. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 16:35:07 +01:00
Duarte Nunes	e215f25b11	migration_manager: Atomically migrate table and views This patch changes the migration path for table updates such that the base table mutations are sent and applied atomically with the view schema mutations. This ensures that after schema merging, we have a consistent mapping of base table versions to view table versions, which will be used in later patches. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 16:03:56 +01:00
Duarte Nunes	bfb8a3c172	materialized views: Replace db::view::view class The write path uses a base schema at a particular version, and we want it to use the materialized views at the corresponding version. To achieve this, we need to map the state currently in db::view::view to a particular schema version, which this patch does by introducing the view_info class to hold the state previously in db::view::view, and by having a view schema directly point to it. The changes in the patch are thus: 1) Introduce view_info to hold the extra view state; 2) Point to the view_info from the schema; 3) Make the functions in the now stateless db::view::view non-member; 4) Remove the db::view::view class. All changes are structural and don't affect current behavior. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 15:50:05 +01:00
Duarte Nunes	a64c47f315	schema: Move raw_view_info outside of raw_schema In preparation of an upcoming patch, where the schema won't directly store the raw_view_info. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 15:38:31 +01:00
Duarte Nunes	4b209be8b8	view_info: Rename to raw_view_info In preparation for upcoming patches, which will deal with moving the state in db::view::view to view_info. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 15:38:31 +01:00
Paweł Dziepak	dc99197318	Merge "Correctly handle tombstoned collections" from Duarte "The current implementations of collection_type_impl::is_empty() and collection_type_impl::difference() don't handle tombstoned collection mutations correctly. In particular: - is_empty() considers a collection mutation with a tombstone (and no entries) as empty; - difference() doesn't do set difference between the cells' tombstones, and always returns the highest. Fixes #2152" * 'collection-diff/v4' of github.com:duarten/scylla: mutation_test: Add more test cases for difference() mutation_source_test: Randomly generate collection cells collection_type_impl: Use set difference for tombstones collection_type_impl: A mutation with a tombstone is not empty	2017-03-15 13:39:55 +00:00
Duarte Nunes	143136647a	mutation_test: Add more test cases for difference() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 14:34:01 +01:00
Duarte Nunes	005e4741e3	mutation_source_test: Randomly generate collection cells Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 14:34:01 +01:00
Duarte Nunes	61741a69b6	collection_type_impl: Use set difference for tombstones This patch fixes collection_type_impl::difference() so it does set difference for tombstones instead of just returning the larger one, as difference() is supposed to return only the information in mutation A that supersedes that in B, given difference(A, B). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 14:34:01 +01:00
Duarte Nunes	19fcd2d140	collection_type_impl: A mutation with a tombstone is not empty This patch changes the collection_type_impl::is_empty() function so that it doesn't consider empty a collection_mutation which has a tombstone. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-15 14:34:01 +01:00
Takuya ASADA	b65d58e90e	dist/common/scripts/scylla_raid_setup: don't discard blocks at mkfs time Discarding blocks on a large RAID volume takes too much time, and the user may suspect the script doesn't work correctly, so it's better to skip it and discard directly on each volume instead. Fixes #1896 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1489533460-30127-1-git-send-email-syuu@scylladb.com>	2017-03-15 13:13:57 +02:00
Calle Wilund	078589c508	commitlog_replayer: Make replay parallel per shard Fixes #2098 Replay previously did all segments in parallel on shard 0, which caused heavy memory load. To reduce this and spread footprint across shards, instead do X segments per shard, sequential per shard. v2: * Fixed whitespace errors Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>	2017-03-15 13:07:17 +02:00
Amnon Heiman	0a2eba1b94	database: requests_blocked_memory metric should be unique Metrics name should be unique per type. requests_blocked_memory was registered twice, one as a gauge and one as derived. This is not allowed. Fixes #2165 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170314162826.25521-1-amnon@scylladb.com>	2017-03-14 19:36:45 +02:00
Avi Kivity	ed4b5f5a18	Merge seastar upstream * seastar fd29fd0...4d25b85 (2): > core/file: fix EOF detection for file with custom impl > tutorial: fix echo server example Includes patch from Raphael updating checked_file_impl: "Now file_impl requires dma_read_bulk to be implemented, and for checked_file_impl, it's only about calling dma_read_bulk from the posix file it wraps."	2017-03-14 13:38:38 +02:00
Takuya ASADA	d016dd4b74	dist: schedule daily fstrim for data directory and commitlog directory Schedule daily fstrim for data directory and commitlog directory, which is recommended by Scylla doc: http://www.scylladb.com/doc/admin/#schedule-fstrim Fixes #1347 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1489447472-2981-1-git-send-email-syuu@scylladb.com>	2017-03-14 11:51:53 +02:00
Amnon Heiman	295a981c61	storage_proxy: metrics should have unique name Metrics should have their unique name. This patch changes throttled_writes of the queue length to current_throttled_writes. Without it, metrics will be reported twice under the same name, which may cause errors in the prometheus server. This could be related to scylladb/seastar#250 Fixes #2163. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170314081456.6392-1-amnon@scylladb.com>	2017-03-14 11:19:39 +02:00
Tomasz Grabiec	ed530dfb3a	tests: sstables: Add test for skipping within a compressed stream Refs #2143.	2017-03-13 13:08:24 +01:00
Tomasz Grabiec	1e0af2efc3	Update seastar submodule * seastar 84a0b70...fd29fd0 (4): > Fix smp::submit_to() with function reference > execution_stage: add concept restraint for operator() > core/temporary_buffer: Add operator==() > map_reduce: allow reducer to take accumulated value by rref	2017-03-13 10:13:03 +01:00
Paweł Dziepak	60c6b9a240	Merge "Implement sstable_streamed_mutation::fast_forward_to()" from Tomasz "This replaces use of a generic forwarding wrapper in sstable reader with specialized implementation. Forwarding doesn't yet utilize indexes in this series, only integrates it with mp_row_consumer, which is a prerequisite. It's still an optimization, since mp_row_consumer will not try to consume past the range as it used to. Sending early for easier consumption." * tag 'tgrabiec/forwarding-of-mp-row-consumer-v2' of github.com:scylladb/seastar-dev: sstables: Remove use of forwarding wrapper sstables: Implement sstable_streamed_mutation::fast_forward_to() sstables: Extract and use clustering_ranges_walker tests: sstables: Add test for handling of repeated tombstones sstables: Extract writer parameters into config objects tests: Move as_mutation_source() helper to header tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions streamed_mutation: Add streamed_mutation_returning() helper tests: mutation_source_test: Add test case for forwarding to a full range tests: simple_schema: Add fragment factories tests: Extract simple_schema sstables: Move workaround for out-of-order range tombstones to mp_row_consumer sstables: Drop default mp_row_consumer constructor sstables: Swap order of values in "proceed" so that "no" is assigned 0 util/optimized_optional: Make printable position_in_partition: Add is_static_row() in the view range_tombstone_stream: Add reset() range_tombstone_stream: Add get_next(position_in_partition_view) sstables: streamed_mutation: Stop reading when end of slice reached sstables: Switch is_in_range() to position_in_partition	2017-03-10 13:55:46 +00:00
Tomasz Grabiec	1f1b516b31	sstables: Remove use of forwarding wrapper	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	d7afab21e7	sstables: Implement sstable_streamed_mutation::fast_forward_to() Handling of forwarding is done inside mp_row_consumer, because it allows us to filter out irrelevant data sooner and thus more efficiently. Because the static row can now be skipped as well, _skip_clustering_row was renamed to the more generic _skip_in_progress.	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	4750216387	sstables: Extract and use clustering_ranges_walker Extracted from mp_row_consumer.	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	88ccc99017	tests: sstables: Add test for handling of repeated tombstones	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	124dde30db	sstables: Extract writer parameters into config objects Also enables users to change the default promoted index block size.	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	ad1e69c4c5	tests: Move as_mutation_source() helper to header	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	6f409d367b	tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	dc7b93a326	streamed_mutation: Add streamed_mutation_returning() helper	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	06a964b3a0	tests: mutation_source_test: Add test case for forwarding to a full range	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	929842ad3f	tests: simple_schema: Add fragment factories	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	d98f013b07	tests: Extract simple_schema	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	01374c41f2	sstables: Move workaround for out-of-order range tombstones to mp_row_consumer This is a preliminary step before adding support for fast-forwarding to mp_row_consumer, so that range handling can be solely in mp_row_consumer rather than split between it and sstable_streamed_mutation. This also alleviates #2080 by reading all tombstones only up to the first row, after that range tombstones are treated like other fragments.	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	d41a7c5eb4	sstables: Drop default mp_row_consumer constructor	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	56f1ad7841	sstables: Swap order of values in "proceed" so that "no" is assigned 0	2017-03-10 14:42:22 +01:00
Tomasz Grabiec	58c29be45c	util/optimized_optional: Make printable	2017-03-10 14:42:21 +01:00
Tomasz Grabiec	a32cf6c4cc	position_in_partition: Add is_static_row() in the view	2017-03-10 14:42:21 +01:00
Tomasz Grabiec	e4db643730	range_tombstone_stream: Add reset()	2017-03-10 14:42:21 +01:00
Tomasz Grabiec	48ad2e2d64	range_tombstone_stream: Add get_next(position_in_partition_view)	2017-03-10 14:42:21 +01:00
Tomasz Grabiec	084747b1ee	sstables: streamed_mutation: Stop reading when end of slice reached As part of this change, skip detection is refactored. This simplifies reasoning about mp_row_consumer's state a bit because now is_mutation() is not reset externally and only depends on current position of the reader. It will prove useful when we extend mutation reader to decide if it should skip to the next partition up front before calling _context.read(), so that we can for instance skip using index instead. Fixes #2088.	2017-03-10 14:42:19 +01:00
Duarte Nunes	16bcf8d085	db/schema_tables: Avoid copying keyspace name This patch changes a lambda argument type so the keyspace name is passed by reference instead of copying it, in read_schema_for_keyspaces(). Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170309213134.10331-1-duarte@scylladb.com>	2017-03-10 11:03:56 +02:00
Duarte Nunes	d32c848d73	utils/logalloc: Change linkage of hist_options to external Change linkage of segment_descriptor_hist_options to external to keep good old GCC5 happy, despite C++11 allowing static linkage of non-type template arguments. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170309213206.10383-1-duarte@scylladb.com>	2017-03-10 11:02:51 +02:00
Tomasz Grabiec	55358cacc5	sstables: Switch is_in_range() to position_in_partition Makes it immune to #1446 and is a prerequisite for implementing forwarding in mp_row_consumer.	2017-03-09 21:15:11 +01:00
Paweł Dziepak	aaae8db033	loggers should not have external linkage Message-Id: <20170309111034.20929-1-pdziepak@scylladb.com>	2017-03-09 12:27:20 +01:00
Gleb Natapov	d34f3a0440	batchlog: introduce batch_size_fail_threshold_in_kb option Add batch_size_fail_threshold_in_kb to prevent a huge batch from being applied and causing trouble. Also do not warn or fail if only one partition is affected. Fixes: #2128 Message-Id: <20170309111247.GE8197@scylladb.com>	2017-03-09 12:20:17 +01:00
Amnon Heiman	7b04841dda	main: Name the http servers In main there are two http servers that start, the API and prometheus. This patch names them accordingly so their metrics will have more meaning. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1489055282-10887-1-git-send-email-amnon@scylladb.com>	2017-03-09 12:30:49 +02:00
Glauber Costa	a7b0a899a3	dist: don't execute dpdk scripts if not in dpdk mode The scripts are not liking very much being executed inside docker. Since we don't really need those variables set outside DPDK scenarios, just don't set them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1488823691-9014-1-git-send-email-glauber@scylladb.com>	2017-03-09 11:40:08 +02:00
Avi Kivity	efd96a448c	Merge "Add execution stages" from Paweł "These patches introduce execution stages to Scylla in order to improve icache friendliness. The places where stages are added are not chosen very carefully and are rather introduced between different subsystems: cql, storage proxy and database. This already results in a rather significant improvement and can be tuned later if necessary. Performance results: perf_simple_query -c4 --duration 60 (medians) before after diff write 83017.75 242876.04 +192.6% read 61709.16 168258.26 +172.7% The real life improvements aren't as good because it is much harder to collect sufficiently high number of operations in a batch." Additional benchmarking from Paweł: "I did some tests on my local setup. * Latency at light loads Scylla running on 16 logical CPUs (8 cores) with 64 GB of RAM. cassandra-stress -rate threads=32 write latency master seda median 1.2 0.6 95th 1.6 0.8 99th 1.7 0.9 99.9th 2.5 1.3 max 26.4 24.2 Flags '--poll-mode' and '--defragment-memory-on-idle false' didn't improve situation for master. See also attached graph write_99.svg and write_999.svg. read latency master seda median 0.8 0.6 95th 1.0 0.9 99th 1.1 1.0 99.9th 1.4 1.2 max 18.5 18.0 See also attached graph read_99.svg and read_999.svg. * Server 100% loaded, dataset fitting in memory (throughput) Scylla running on 2 cores with 64 GB of RAM. 4x scylla-bench with the uniform workload (concurrency of each s-b: 512 for writes, 256 for reads). There were no cache misses during reads. master seda diff writes 107722.4 168482.26 +56.4% reads 51049.48 76158.19 +49.2% * Server 100% loaded, writes being flushed and compacted (throughput) Scylla running on 2 cores with 4 GB of RAM. 4x scylla-bench with the uniform workload, concurrency 256 each. master seda diff writes 79575.77 114206.11 +43.5% See attached graph: writes_with_flushes_and_compaction.png (first run: master, second: seda)." 
* tag 'pdziepak/scylla-execution-stages/v1-rebased' of github.com:cloudius-systems/seastar-dev: transport: make process_request_one() an execution stage mutation_query: add an execution stage db: make database::query() an execution stage db: make apply an execution stage storage_proxy: make mutate() an execution stage cql3: make batch statement an execution stage cql3: make modification statement an execution stage cql3: make select statement an execution stage mutation_reader: make mutation_source nothrow movable	2017-03-09 11:29:43 +02:00
Paweł Dziepak	74f35864ef	transport: make process_request_one() an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	a78501c206	mutation_query: add an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	b5f0e590be	db: make database::query() an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	38c1501f4d	db: make apply an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	cfde2ad5b4	storage_proxy: make mutate() an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	827357cb08	cql3: make batch statement an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	dce785089a	cql3: make modification statement an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	d005b20071	cql3: make select statement an execution stage	2017-03-09 09:27:43 +00:00
Paweł Dziepak	12135dbe21	mutation_reader: make mutation_source nothrow movable	2017-03-09 09:27:43 +00:00
Amnon Heiman	4e8d73098f	main: Prometheus should start as early as possible There is no need to wait when starting the prometheus server, as it is up to each of the modules to register its metrics when it is ready. This is especially important when debugging boot issues. This patch moves the prometheus initialization to be done at an early stage of the boot sequence. Fixes #2144 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1489041986-28974-1-git-send-email-amnon@scylladb.com>	2017-03-09 11:26:51 +02:00
Asias He	39d2e59e7e	repair: Fix midpoint is not contained in the split range assertion in split_and_add We have: auto halves = range.split(midpoint, dht::token_comparator()); We saw a case where midpoint == range.start, as a result, range.split will assert because the range.start is marked non-inclusive, so the midpoint doesn't appear to be contain()ed in the range - hence the assertion failure. Fixes #2148 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Asias He <asias@scylladb.com> Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>	2017-03-09 09:09:17 +01:00
Avi Kivity	b8e4113dba	Merge seastar upstream * seastar 5861f99...84a0b70 (13): > build: don't error out on [[deprecated]] APIs > Merge "Introduce execution stages" from Paweł > Remove unused include statement > http: catch and count errors in read and respond > Merge "Adding metrics configuration" from Amnon > future: add concepts for map_reduce(), when_all_succeed() > doxygen: exclude c-ares directory > scripts/posix_net_conf.sh: add --use-cpu-mask option > file: take flush into account when calculating size for truncate in optimize_queue() > Fixing the prometheus cleanup patch > Merge "posix_net_conf.sh: better distribute ingress processing" from Vlad > prometheus: code clean up > future: relax finally() constraints even more	2017-03-08 20:02:05 +02:00
Tomasz Grabiec	abf8e83c8d	gdb: Cast gdb.Values to int Fails with newer GDB with: TypeError: %x format: an integer is required, not gdb.Value Message-Id: <1488981412-22279-1-git-send-email-tgrabiec@scylladb.com>	2017-03-08 19:43:48 +02:00
Paweł Dziepak	6db6d25f66	Merge "Avoid losing changes to keyspace parameters of system_auth and tracing keyspaces" from Tomek "If a node is bootstrapped with auto_bootstrap disabled, it will not wait for schema sync before creating global keyspaces for auth and tracing. When such schema changes are then reconciled with schema on other nodes, they may overwrite changes made by the user before the node was started, because they will have higher timestamp. To prevent that, let's use minimum timestamp so that default schema always loses to manual modifications. This is what Cassandra does. Fixes #2129." * tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev: db: Create default auth and tracing keyspaces using lowest timestamp migration_manager: Append actual keyspace mutations with schema notifications	2017-03-08 10:59:47 +00:00
Nadav Har'El	506e074ba4	sstable decompression: fix skip() to end of file The skip() implementation for the compressed file input stream incorrectly handled the case of skipping to the end of file: In that case we just need to update the file pointer, but not skip anywhere in the compressed disk file; In particular, we must NOT call locate() to find the relevant on-disk compressed chunk, because there is none - locate() can only be called on actual positions of bytes, not on the one-past-end-of-file position. Fixes #2143 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170308100057.23316-1-nyh@scylladb.com>	2017-03-08 12:35:05 +02:00
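The end-of-file rule described in `506e074ba4` can be illustrated with a minimal sketch. The class `chunked_stream`, its fixed-size chunks and all member names are hypothetical stand-ins, not the sstable code; the point shown is only that a skip landing on the one-past-end position updates the position and never calls `locate()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a compressed input stream: the uncompressed data
// is cut into fixed-size chunks and locate() maps a byte position to a chunk.
class chunked_stream {
    uint64_t _size;        // total uncompressed length in bytes
    uint64_t _chunk_size;  // uncompressed bytes per chunk (assumed fixed)
    uint64_t _pos = 0;     // current uncompressed position
    uint64_t _chunk = 0;   // chunk holding _pos; not meaningful at end of file
public:
    chunked_stream(uint64_t size, uint64_t chunk_size)
        : _size(size), _chunk_size(chunk_size) {}

    // Only valid for positions of actual bytes, i.e. pos < _size.
    uint64_t locate(uint64_t pos) const {
        assert(pos < _size);
        return pos / _chunk_size;
    }

    void skip(uint64_t n) {
        assert(n <= _size - _pos);
        _pos += n;
        if (_pos == _size) {
            return;  // one-past-end: update the position only, no chunk lookup
        }
        _chunk = locate(_pos);
    }

    bool eof() const { return _pos == _size; }
    uint64_t pos() const { return _pos; }
    uint64_t chunk() const { return _chunk; }
};
```

Without the early return, `skip()` to the end would hit the assertion in `locate()`, which mirrors the failure mode the commit message describes.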
Tomasz Grabiec	d6425e7646	db: Create default auth and tracing keyspaces using lowest timestamp If the node is bootstrapped with auto_bootstrap disabled, it will not wait for schema sync before creating global keyspaces for auth and tracing. When such schema changes are then reconciled with schema on other nodes, they may overwrite changes made by the user before the node was started, because they will have a higher timestamp. To prevent that, let's use the minimum timestamp so that the default schema always loses to manual modifications. This is what Cassandra does. Fixes #2129.	2017-03-07 19:19:15 +01:00
Tomasz Grabiec	06d4ad1bdd	migration_manager: Append actual keyspace mutations with schema notifications There is a workaround for notification race, which attaches keyspace mutations to other schema changes in case the target node missed the keyspace creation. Currently that generated keyspace mutations on the spot instead of using the ones stored in schema tables. Those mutations would have current timestamp, as if the keyspace has been just modified. This is problematic because this may generate an overwrite of keyspace parameters with newer timestamp but with stale values, if the node is not up to date with keyspace metadata. That's especially the case when booting up a node without enabling auto_bootstrap. In such case the node will not wait for schema sync before creating auth tables. Such table creation will attach potentially out of date mutations for keyspace metadata, which may overwrite changes to keyspace parameters made earlier in the cluster. Refs #2129.	2017-03-07 19:19:15 +01:00
Avi Kivity	1b5ba63676	sstable: fix unhandled exception in atomic_deletion_manager::delete_atomically() The current code is asymmetric: the first N-1 shards to delete a set receive a synthetic future to wait on, while the last deletion receives the result of the delete operation (which also broadcasts completion to the first N-1 operations). This results, in case of an error, in the Nth future being reported as an unhandled error. Fix by making everything symmetric: all N callers receive a synthetic future. Nobody waits for the deletion operation (which still broadcasts its completion to all waiters, so errors are not lost). Message-Id: <20170305151607.14264-1-avi@scylladb.com>	2017-03-07 12:41:12 +02:00
Avi Kivity	439b38f5ab	Merge "Improvements to counter implementation" from Paweł "This series adds various optimisations to counter implementation (nothing extreme, mostly just avoiding unnecessary operations) as well as some missing features such as tracing and dropping timed out queries. Performance was tested using: perf-simple-query -c4 --counters --duration 60 The following results are medians. before after diff write 18640.41 33156.81 +77.9% read 58002.32 62733.93 +8.2%" * tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits) cell_locker: add metrics for lock acquisition storage_proxy: count counter updates for which the node was a leader storage_proxy: use counter-specific timeout for writes storage_proxy: transform counter timeouts to mutation_write_timeout_exception db: avoid allocations in do_apply_counter_update() tests/counters: add test for apply reversability counters: attempt to apply in place atomic_cell: add COUNTER_IN_PLACE_REVERT flag counters: add equality operators counters: implement decrement operators for shard_iterator counters: allow using both views and mutable_views atomic_cell: introduce atomic_cell_mutable_view managed_bytes: add cast to mutable_view bytes: add bytes_mutable_view utils: introduce mutable_view db: add more tracing events for counter writes db: propagate tracing state for counter writes tests/cell_locker: add test for timing out lock acquisition counter_cell_locker: allow setting timeouts db: propagate timeout for counter writes ...	2017-03-07 11:48:13 +02:00
Tomasz Grabiec	ecfa9e40de	Merge 'duarte/lsa/hist-cleanup/v2' from github.com:duarten/scylla histogram cleanups from Duarte.	2017-03-07 10:33:50 +01:00
Gleb Natapov	5c4158daac	memtable: do not yield while holding reclaim_lock Holding reclaim_lock while yielding may cause memory allocations to fail. Fixes #2139 Message-Id: <20170306153151.GA5902@scylladb.com>	2017-03-06 17:24:22 +01:00
Gleb Natapov	d7bdf16a16	memtable: do not open code logalloc::reclaim_lock use logalloc::reclaim_lock prevents reclaim from running, which may cause regular allocation to fail although there is enough free memory. To solve that there is an allocation_section which acquires reclaim_lock and, if allocation fails, runs the reclaimer outside of the lock and retries the allocation. The patch makes use of allocation_section instead of direct use of reclaim_lock in memtable code. Fixes #2138. Message-Id: <20170306160050.GC5902@scylladb.com>	2017-03-06 17:24:22 +01:00
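The retry pattern attributed to allocation_section in `d7bdf16a16` can be sketched as follows. This is an illustration under stated assumptions rather than the logalloc code: `reclaim_lock_guard`, `with_reclaim_retry` and the plain `bool` standing in for the lock state are all hypothetical names.

```cpp
#include <cassert>
#include <new>

// Hypothetical RAII guard: while it is alive, reclaim is considered locked out.
class reclaim_lock_guard {
    bool& _locked;
public:
    explicit reclaim_lock_guard(bool& locked) : _locked(locked) { _locked = true; }
    ~reclaim_lock_guard() { _locked = false; }
    reclaim_lock_guard(const reclaim_lock_guard&) = delete;
    reclaim_lock_guard& operator=(const reclaim_lock_guard&) = delete;
};

// Runs func() with the lock held. If func() throws std::bad_alloc, the lock is
// released by stack unwinding, reclaim() runs without the lock, and func() is
// retried. Gives up when reclaim() reports that it could not free anything.
template <typename Func, typename Reclaim>
void with_reclaim_retry(bool& reclaim_locked, Func func, Reclaim reclaim) {
    for (;;) {
        try {
            reclaim_lock_guard guard(reclaim_locked);
            func();
            return;
        } catch (const std::bad_alloc&) {
            // The guard has already been destroyed at this point.
        }
        assert(!reclaim_locked);
        if (!reclaim()) {
            throw std::bad_alloc();
        }
    }
}
```

The key property is that the reclaimer never runs while the lock is held, so a failed allocation gets a chance to succeed on the next attempt.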
Avi Kivity	1af9e3a5cb	Merge "database: fix the 'nodetool clearsnapshot'" from Vlad "Work on this series started with fixing the 'nodetool clearsnapshot'. The current master code ignores the snapshots in deleted keyspaces (issue #2045). I noticed that in many places our code has to build the path to some directory/file it simply had the sstring(<path1>) + "/" + sstring(<path2>) constructs which may cause us issues if somebody decides to compile/run scylla on not-Unix-based OS, like Microsoft Windows. I understand that this is a long shot but if we can make it right now - why not to. The answer is boost::filesystem::path class - its synchronous parts, of course. I decided to take an initiative and fix the issues above and then use the fixed code for fixing the issue #2045: - Fix some minor issues in the existing code. - Extend the lister class and move it into the separate files outside database.cc. On the way I've found an issue in the existing code (issue #2071). This series fixes this one too (PATCH2)."	2017-03-06 16:45:31 +02:00
Glauber Costa	2d620a25fb	raid script: improve test for mounted filesystem The current test for whether or not the filesystem is mounted is weak and will fail if multiple pieces of the hierarchy are mounted. util-linux ships with a mountpoint command that does exactly that, so we'll use that instead. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1488742801-4907-1-git-send-email-glauber@scylladb.com>	2017-03-06 15:59:29 +02:00
Gleb Natapov	7f5923f510	storage_service: handle empty token list correctly boost::split() returns one empty string if called on an empty input. Trying to cast an empty string to a token value results in a bad_lexical_cast exception. Fix it by handling empty token list explicitly. Message-Id: <20170302125405.GU11471@scylladb.com>	2017-03-06 15:31:33 +02:00
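The explicit empty-list handling mentioned in `7f5923f510` might look like the sketch below. It uses a small standard-library splitter instead of boost::split() so that it is self-contained; `split_tokens` and the comma delimiter are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits a delimiter-separated token list.
// An empty input yields an empty vector, not a vector with one empty string,
// so no caller ever tries to parse "" as a token value.
std::vector<std::string> split_tokens(const std::string& input, char delim = ',') {
    std::vector<std::string> out;
    if (input.empty()) {
        return out;  // explicit empty-list case
    }
    std::string::size_type start = 0;
    for (;;) {
        const auto end = input.find(delim, start);
        if (end == std::string::npos) {
            out.push_back(input.substr(start));
            break;
        }
        out.push_back(input.substr(start, end - start));
        start = end + 1;
    }
    return out;
}
```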
Takuya ASADA	6602221442	dist/redhat: enables discard on CentOS/RHEL RAID0 Since the CentOS/RHEL raid module disables discard by default, we need to enable it again to use it. Fixes #2033 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1488407037-4795-1-git-send-email-syuu@scylladb.com>	2017-03-06 12:21:42 +02:00
Duarte Nunes	ca4f5cabd4	lsa: Extract log_histogram class Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-04 14:47:19 +01:00
Avi Kivity	24d6560fbc	Update scylla-ami submodule * dist/ami/files/scylla-ami d5a4397...eedd12f (3): > Rewrite disk discovery to handle EBS and NVMEs. > add --developer-mode option > trivial cleanup: replace tab in indent	2017-03-04 13:29:32 +02:00
Duarte Nunes	5c73978b68	thrift/handler: Enable Aggregator concept with GCC6_CONCEPT Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170303172509.16844-1-duarte@scylladb.com>	2017-03-04 13:27:16 +02:00
Duarte Nunes	2b6abd5a91	lsa: Make log_histogram more generic Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-03 17:59:17 +01:00
Duarte Nunes	3819e6d55f	lsa: log_histogram cleanups Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-03 17:09:07 +01:00
Tomasz Grabiec	22199abf50	gc_clock: Remove orphaned comment Message-Id: <1488381379-8618-1-git-send-email-tgrabiec@scylladb.com>	2017-03-02 12:56:09 +02:00
Tomasz Grabiec	6a83fe5534	Merge 'pdziepak/optimise-commitlog-entry-writer/v1' from seastar-dev.git From Paweł: These patches optimise commitlog_entry_writer so that it avoids copying column mapping, which is a particularly expensive operation. perf_simple_query -c4 --write --duration 60 (medians) before after diff write 79434.35 89247.54 +12.3% Tested with: commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table commitlog_test.py:TestCommitLog.test_commitlog_replay_with_counters	2017-03-02 11:37:42 +01:00
Paweł Dziepak	04b80272f2	cell_locker: add metrics for lock acquisition	2017-03-02 09:05:12 +00:00
Paweł Dziepak	00b42c477f	storage_proxy: count counter updates for which the node was a leader	2017-03-02 09:05:12 +00:00
Paweł Dziepak	cf193f4b41	storage_proxy: use counter-specific timeout for writes	2017-03-02 09:05:12 +00:00
Paweł Dziepak	d177160f90	storage_proxy: transform counter timeouts to mutation_write_timeout_exception	2017-03-02 09:05:12 +00:00
Paweł Dziepak	f93a766db4	db: avoid allocations in do_apply_counter_update()	2017-03-02 09:05:12 +00:00
Paweł Dziepak	8457f407ef	tests/counters: add test for apply reversability	2017-03-02 09:05:11 +00:00
Paweł Dziepak	3bccca67b9	counters: attempt to apply in place It is expected that most counter updates just modify the values of existing shards and can be done in place.	2017-03-02 09:05:11 +00:00
Paweł Dziepak	7604b926e1	atomic_cell: add COUNTER_IN_PLACE_REVERT flag The general algorithm for merging counter cells involves allocating a new buffer for the shards. However, it is expected that most of the applies are just updating the values of existing shards and not adding new ones, therefore can be done in place. However, reverting the general and in-place applies requires different logic, hence the need for an additional flag to differentiate between them.	2017-03-02 09:05:11 +00:00
Paweł Dziepak	1e3fbddb3a	counters: add equality operators	2017-03-02 09:05:11 +00:00
Paweł Dziepak	772c9078d0	counters: implement decrement operators for shard_iterator	2017-03-02 09:05:11 +00:00
Paweł Dziepak	edad5202f3	counters: allow using both views and mutable_views	2017-03-02 09:05:11 +00:00
Paweł Dziepak	2db92e92b2	atomic_cell: introduce atomic_cell_mutable_view	2017-03-02 09:05:11 +00:00
Paweł Dziepak	1293073019	managed_bytes: add cast to mutable_view	2017-03-02 09:05:11 +00:00
Paweł Dziepak	29430ba970	bytes: add bytes_mutable_view	2017-03-02 09:05:11 +00:00
Paweł Dziepak	0ed2352ade	utils: introduce mutable_view std::basic_string_view does not allow modifying the underlying buffer. This patch introduces a mutable_view which permits that.	2017-03-02 09:05:10 +00:00
Paweł Dziepak	774241648d	db: add more tracing events for counter writes	2017-03-02 09:05:10 +00:00
Paweł Dziepak	277501f42f	db: propagate tracing state for counter writes	2017-03-02 09:05:10 +00:00
Paweł Dziepak	2b5c4386b5	tests/cell_locker: add test for timing out lock acquisition	2017-03-02 09:05:10 +00:00
Paweł Dziepak	5af780360f	counter_cell_locker: allow setting timeouts	2017-03-02 09:05:10 +00:00
Paweł Dziepak	25173f8095	db: propagate timeout for counter writes	2017-03-02 09:05:10 +00:00
Paweł Dziepak	c122f3b2f8	cell_locker: use internal storage for hashtable	2017-03-02 09:05:10 +00:00
Paweł Dziepak	4702ebf80f	counters: use c_c_builder::from_single_shard() when possible	2017-03-02 09:05:10 +00:00
Paweł Dziepak	13ec22ad9a	counters: drop tombstone handling in transform update to shards Encountering tombstones while transforming counter update from deltas to shards is expected to be rare due to the fact that counter cells cannot be recreated once removed. This assumption makes it unnecessary to care much about removed cells during delta->shard transformation as it adds complexity to the code and is not required to produce correct results.	2017-03-02 09:05:10 +00:00
Paweł Dziepak	37485e5b29	counters: optimise counter_cell_builder This patch attempts to avoid excessive allocations and copies when constructing counter cells using counter_cell_builder. That involves adding serializer interface to atomic_cell so that the counter cell can be directly serialized to the buffer allocated for atomic cell. counter_cell_builder::from_single_shard() is added as well to avoid std::vector<> overhead when creating a counter cell from a single shard.	2017-03-02 09:05:10 +00:00
Glauber Costa	9e61a73654	setup: support mount points in raid script By default behavior is kept the same. There are deployments in which we would like to mount data and commitlog to different places - as much as we have avoided this up until this moment. One example is EC2, where users may want to have the commitlog mounted in the SSD drives for faster writes but keep the data in larger, less expensive and durable EBS volumes. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1488258215-2592-1-git-send-email-glauber@scylladb.com>	2017-03-01 19:23:59 +02:00
Tomasz Grabiec	4b6e77e97e	db: Fix overflow of gc_clock time point If query_time is time_point::min(), which is used by to_data_query_result(), the result of subtraction of gc_grace_seconds() from query_time will overflow. I don't think this bug would currently have user-perceivable effects. This affects which tombstones are dropped, but in case of to_data_query_result() uses, tombstones are not present in the final data query result, and mutation_partition::do_compact() takes tombstones into consideration while compacting before expiring them. Fixes the following UBSAN report: /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int' Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>	2017-03-01 18:49:56 +02:00
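The UBSAN report in `4b6e77e97e` comes from subtracting 604800 from the minimum value of a 32-bit signed representation. The commit message does not say how the subtraction was made safe; the sketch below shows one common approach, a saturating subtraction, with `saturating_sub` as a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Subtracts b from a in a wider type and clamps the result to the int32_t
// range, so the operation never overflows the 32-bit representation.
int32_t saturating_sub(int32_t a, int32_t b) {
    const int64_t wide = static_cast<int64_t>(a) - static_cast<int64_t>(b);
    if (wide < std::numeric_limits<int32_t>::min()) {
        return std::numeric_limits<int32_t>::min();
    }
    if (wide > std::numeric_limits<int32_t>::max()) {
        return std::numeric_limits<int32_t>::max();
    }
    return static_cast<int32_t>(wide);
}
```

With this helper, the case from the report (minimum time point minus a grace period of 604800 seconds) stays at the minimum instead of invoking undefined behaviour.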
Paweł Dziepak	bdac487b5a	do not use long_type for counter update	2017-03-01 16:33:37 +00:00
Paweł Dziepak	f25fa6566f	db: avoid deserialization when applying counter mutation In the later stages of counter write path a mutation is produced that already has all cells transformed to counter shards and can be applied to the memtable and written to the commitlog. The current interface expects a frozen mutation, which is suboptimal for counters. The freeze itself is unavoidable -- it is required by commitlog, but we can avoid later deserialization of frozen_mutation when it is applied to the memtable if we pass the unfrozen mutation along.	2017-03-01 16:33:37 +00:00
Paweł Dziepak	582d397c41	introduce counter_write_query() Counter write path involves read-modify-write. That read is guaranteed to query only a single partition, does not care about dead cells and expects to receive an unserialized mutation as a result. Standard mutation queries are able to produce results fit for counter updates, but the logic involved is much more general (i.e. slower), hence the addition of new, counter-specific kind of query.	2017-03-01 16:33:36 +00:00
Paweł Dziepak	426345e1d4	storage_proxy: avoid excessive mutation freezes	2017-03-01 16:33:36 +00:00
Paweł Dziepak	f10eb952d0	coordinator: do not apply counter write twice on leader	2017-03-01 16:33:36 +00:00
Paweł Dziepak	910bff297a	to_string: add operator<< overload for std::array<>	2017-03-01 16:33:36 +00:00
Takuya ASADA	ba323e2074	dist/debian/dep: fix broken link of gcc-5, update it to 5.4.1-5 Since gcc-5/stretch=5.4.1-2 was removed from the apt repository, we are no longer able to build gcc-5. To avoid the dead link, use launchpad.net archives instead of using apt-get source. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>	2017-03-01 17:13:14 +02:00
Tomasz Grabiec	0c84f00b16	query: Fix invalid initialization of _memory_tracker by moving-from-self Fixes the following UBSAN warning: core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment Since the field was not initialized properly, probably also fixes some user-visible bug. Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>	2017-03-01 11:38:28 +00:00
Duarte Nunes	c0e5964462	database: Explicitly use discard_result() Values returned from the lambda passed to finally() are immediately destroyed, so make that explicit by using discard_result(). Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170227235541.28330-1-duarte@scylladb.com>	2017-02-28 18:41:19 +02:00
Duarte Nunes	11b5076b3c	lsa: Use log histogram for closed segments This patch replaces the current heap with a logarithmic histogram to hold the closed segment descriptors. This histogram stores elements in different buckets according to their size. Values are mapped to a sequence of power-of-two ranges that are split in N sub-buckets. Values less than a minimum value are placed in bucket 0, whereas values bigger than a maximum value are not admitted. There is some loss of precision as segments are now not totally ordered, and precision decreases the more sparse a segment is. This allows to reduce the cost of the computations needed when freeing from a closed segment. Performance results for perf_simple_query -c4 --duration 60 before after diff read 43954.27 45246.10 +2.9% write 48911.54 52807.76 +7.9% Fixes #1442 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170227235328.27937-1-duarte@scylladb.com>	2017-02-28 18:40:38 +02:00
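The bucket mapping described in `11b5076b3c` (power-of-two ranges split into N sub-buckets, small values in bucket 0, values above a maximum not admitted) can be sketched as below. The constants and the function `bucket_of` are illustrative assumptions and are not taken from the log_histogram implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative parameters (assumptions, not the values used by the patch).
constexpr unsigned min_bits = 4;   // values below 2^4 all land in bucket 0
constexpr unsigned max_bits = 10;  // values of 2^10 or more are not admitted
constexpr unsigned sub_bits = 2;   // each power-of-two range has 2^2 sub-buckets
static_assert(min_bits >= sub_bits, "sub-bucket bits must fit below the minimum");

// Index of the most significant set bit; v must be non-zero.
unsigned floor_log2(uint64_t v) {
    assert(v != 0);
    unsigned r = 0;
    while (v >>= 1) {
        ++r;
    }
    return r;
}

// Maps a value to its bucket: bucket 0 for small values, then 2^sub_bits
// consecutive buckets for each power-of-two range [2^k, 2^(k+1)).
size_t bucket_of(uint64_t value) {
    assert(value < (uint64_t(1) << max_bits));
    if (value < (uint64_t(1) << min_bits)) {
        return 0;
    }
    const unsigned msb = floor_log2(value);
    const unsigned range = msb - min_bits;
    // The sub_bits bits right below the most significant bit pick the sub-bucket.
    const uint64_t sub = (value >> (msb - sub_bits)) & ((uint64_t(1) << sub_bits) - 1);
    return 1 + (static_cast<size_t>(range) << sub_bits) + static_cast<size_t>(sub);
}
```

Values in the same sub-bucket are not ordered relative to each other, which is the loss of precision the commit message accepts in exchange for cheaper updates.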
Avi Kivity	359fc68283	Merge seastar upstream * seastar 4d4a58d...5861f99 (9): > future: adjust finally constraint to allow any future to be returned from the continuation > build: allow specifying the C compiler > socket: Change signature (and impls) of socket shutdown to void > reactor: give names to OS threads > Concepts support > core/file: Fix short-read in read_maybe_eof() > core/fstream: Avoid issuing read requests beyond _remain > tests: Improve assertion failure message > reactor: Expose IO stats in a public API	2017-02-28 13:13:35 +02:00
Avi Kivity	c1aac6fa87	build: accept and pass seastar's --c-compiler option	2017-02-28 13:13:02 +02:00
Duarte Nunes	a3873423d6	configure.py: Enable concepts support This patch enables conditional concept support by propagating seastar's --enable-gcc6-concepts flag. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170227235028.27490-1-duarte@scylladb.com>	2017-02-28 11:56:22 +02:00
Paweł Dziepak	5d66031b7a	sstable: make input_stream_history initializers in-class sstable has two constructors but only one of them was creating input stream history objects. Message-Id: <20170227151734.16928-1-pdziepak@scylladb.com>	2017-02-28 09:22:11 +01:00
Paweł Dziepak	374c8a56ac	commitlog: avoid copying column_mapping It is safe to copy column_mapping across shards. Such a guarantee comes at the cost of performance. This patch makes commitlog_entry_writer use IDL generated writer to serialise commitlog_entry so that column_mapping is not copied. This also simplifies commitlog_entry itself. Performance difference tested with: perf_simple_query -c4 --write --duration 60 (medians) before after diff write 79434.35 89247.54 +12.3%	2017-02-27 17:05:58 +00:00
Paweł Dziepak	4df4994b71	idl: fix generated writers when member functions are used When using a member name in an identifier of a generated class or method, the idl compiler should strip the trailing '()'.	2017-02-27 17:05:58 +00:00
Paweł Dziepak	018d16d315	idl: add start_frame() overload for seastar::simple_output_stream	2017-02-27 17:05:58 +00:00
Paweł Dziepak	0198d8e470	Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz "This introduces an API which allows forward navigation in a stream of mutation fragments. It allows one to consume only a subset of the stream by iteratively specifying sub-ranges from which fragments should be returned. API outline: When in forwarding mode, the stream does not return all fragments right away, but only those belonging to the current range. Initially current range only covers the static row. The stream can be forwarded, even before reaching end- of-stream for current range, to a later range with fast_forward_to(). Forwarding doesn't change initial restrictions of the stream, it can only be used to skip over data. Monotonicity of positions is preserved by forwarding. That is fragments emitted after forwarding will have greater positions than any fragments emitted before forwarding. For any range, all range tombstones relevant for that range which are present in the original stream will be emitted. Range tombstones emitted before forwarding which overlap with the new range are not necessarily re-emitted. When not in forwarding mode, the stream acts as if the current range was equal to the full range. This implies that fast_forward_to() cannot be used. Whether stream is in forwarding mode or not is specified when the stream is created, typically via mutation_source interface. What's left for later series: Optimization by providing specialized implementations. This series implements forwarding support in all mutation sources via generic wrapper which simply drops fragments." 
* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev: tests: mutation_source_tests: Verify monotonicty of positions tests: random_mutation_generator: Spread the keys more tests: mutation_source_test: Make blobs more easily distinguishable tests: streamed_mutation: Test that merged stream passes mutation source tests tests: mutation_source_test: Add tests for forwarding of streamed_mutation tests: streamed_mutation_assertions: Add methods for navigating the stream tests: Add range generators to random_mutation_generator partition_slice_builder: Add with_ranges() query: Introduce full_clustering_range streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation() db: Enable creating forwardable readers via mutation_source mutation_source: Document liveness requirements mutation_source: Cleanup db: Replace virtual_reader_type with mutation_source_opt partition_version: Refactor make_partition_snapshot_reader() overloads database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr memtable: Accept all mutation_source parameters streamed_mutation: Implement fast_forward_to() in stream merger streamed_mutation: Add generic implementation of forwardable streamed_mutation streamed_mutation: Add fast_forward_to() API position_in_partition: Introduce position_range position_in_partition: Introduce position constructor for right after the static row streamed_mutation: Make cast to view non-explicit streamed_mutation: Make schema() getter non-copying	2017-02-24 10:37:51 +00:00
Tomasz Grabiec	0798ea22c8	tests: mutation_source_tests: Verify monotonicty of positions	2017-02-23 18:50:54 +01:00
Tomasz Grabiec	d0421ba545	tests: random_mutation_generator: Spread the keys more The deviation was very low so most ranges were very close. Spread them to test more cases.	2017-02-23 18:50:54 +01:00
Tomasz Grabiec	27ff169b6b	tests: mutation_source_test: Make blobs more easily distinguishable It's easier to compare them if they differ only by a few most significant bits, than by all bits.	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	182e3f981b	tests: streamed_mutation: Test that merged stream passes mutation source tests	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	122562c1cc	tests: mutation_source_test: Add tests for forwarding of streamed_mutation	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	1d7e84f770	tests: streamed_mutation_assertions: Add methods for navigating the stream	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	f2feb54fb0	tests: Add range generators to random_mutation_generator	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	f56308597c	partition_slice_builder: Add with_ranges()	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	0073df30aa	query: Introduce full_clustering_range	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	cbf4601e31	streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()	2017-02-23 18:50:53 +01:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Tomasz Grabiec	b1d1091906	mutation_source: Document liveness requirements	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	15db80188b	mutation_source: Cleanup - Combine telescopic overloads into one method with default parameters. - Introduce func_type for a full handler to avoid some duplication.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	586dbaa8d3	db: Replace virtual_reader_type with mutation_source_opt Virtual reader is a mutation_source.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	acfad565f0	partition_version: Refactor make_partition_snapshot_reader() overloads So that streamed_mutation is created in only one of the overloads and others delegate to that one. Later there will be common logic added to the construction and doing this will help avoid a duplication.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	f46ae8128d	database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr It was using the state passed via as_mutation_source() instead. Let's respect mutation_source contract instead, and use the state passed via mutation_source invocation. Technically just a cleanup. Also a prerequisite for more cleanup.	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	2cc27f72ca	memtable: Accept all mutation_source parameters	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	53b1a257cc	streamed_mutation: Implement fast_forward_to() in stream merger	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	e0a7ed48b0	streamed_mutation: Add generic implementation of forwardable streamed_mutation Generic but not very efficient wrapper which simply drops fragments from the original stream.	2017-02-23 18:23:51 +01:00
Tomasz Grabiec	301cd4912b	streamed_mutation: Add fast_forward_to() API	2017-02-23 18:23:28 +01:00
Gleb Natapov	2dc56013f8	commitlog: handle cycle() error Do not ignore a future<> returned by cycle() since it will produce a warning in case of an error. Log it instead. Message-Id: <20170219151811.GN11471@scylladb.com>	2017-02-22 19:15:14 +01:00
Calle Wilund	d5f57bd047	messaging_service: Move log printout to actual listen start Fixes #1845 Log printout was before we actually had evaluated endpoint to create, thus never included SSL info. Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>	2017-02-22 17:08:21 +01:00
Avi Kivity	9b113ffd3e	config: enable new sharding algorithm for new deployments Set murmur3_partitioner_ignore_msb_bits to 12 (enabling the new sharding algorithm), but do this in scylla.yaml rather than the built-in defaults. This avoids changing the configuration for existing clusters, as their scylla.yaml file will not be updated during the upgrade. Message-Id: <20170214123253.3933-1-avi@scylladb.com>	2017-02-22 11:23:12 +01:00
Calle Wilund	0a4edca756	counters/cql: allow wormholing actual counter values (with shards) via cql Adds yet another magic function "SCYLLA_COUNTER_SHARD_LIST", indicating that argument value, which must be a list of tuples <int, UUID, long, long>, should be inserted as an actual counter value, not update. This of course to allow counters to be read from sstable loader. Note that we also need to allow timestamps for counter mutations, as well as convince the counter code itself to treat the data as already baked. So ugly wormhole galore. v2: * Changed flag names * More explicit wormholing, bypassing normal counter path, to avoid read-before-write etc * throw exceptions on unhandled shard types in marshalling v3: * Added counter id ordering check * Added batch statement check for mixing normal and raw counter updates Message-Id: <1487683665-23426-2-git-send-email-calle@scylladb.com>	2017-02-22 09:19:46 +00:00
Calle Wilund	0d87f3dd7d	utils::UUID: operator< should behave as comparison of hex strings/bytes I.e. need to be unsigned comparison. Message-Id: <1487683665-23426-1-git-send-email-calle@scylladb.com>	2017-02-22 09:19:22 +00:00
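The ordering required by `0d87f3dd7d` can be shown with a small sketch. It assumes the UUID is stored as two signed 64-bit halves; the `uuid` struct here is a hypothetical stand-in for utils::UUID, and the comparison casts each half to unsigned so the result matches a comparison of the hex strings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for utils::UUID: two signed 64-bit halves (assumption).
struct uuid {
    int64_t most_sig_bits;
    int64_t least_sig_bits;
};

// Compares as unsigned values, so a half with the top bit set (negative when
// read as signed) sorts after small positive values, as it does in hex form.
bool operator<(const uuid& a, const uuid& b) {
    const auto a_hi = static_cast<uint64_t>(a.most_sig_bits);
    const auto b_hi = static_cast<uint64_t>(b.most_sig_bits);
    if (a_hi != b_hi) {
        return a_hi < b_hi;
    }
    return static_cast<uint64_t>(a.least_sig_bits) < static_cast<uint64_t>(b.least_sig_bits);
}
```

A signed comparison would place `uuid{-1, 0}` (hex `ffffffffffffffff...`) before `uuid{1, 0}`, which is the opposite of the byte-wise order.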
Tomasz Grabiec	2b2d5c4c7a	Update seastar submodule * seastar 5088065...4d4a58d (3): > reactor utilization should return the utilization in 0-1 range > collectd should ignore type label in name creation > fix append_challenged_posix_file_impl::process_queue() to handle recursion	2017-02-22 09:40:25 +01:00
Calle Wilund	e20b804a65	commitlog/database: Add "release" method to ensure we free segments On database stop, we do flush memtables and clean up commit log segment usage. However, since we never actually destroy the distributed<database>, we don't actually free the commitlog either, and thus never clear out the remaining (clean) segments. Thus we leave perfectly clean segments on disk. This just adds a "release" method to commitlog, and calls it from database::stop, after flushing CF:s. Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>	2017-02-21 18:17:47 +01:00
Gleb Natapov	0977f4fdf8	sstable: close sstable_writer's file if writing of sstable fails. Failing to close a file properly before destroying file's object causes crashes. [tgrabiec: fixed typo] Message-Id: <20170221144858.GG11471@scylladb.com>	2017-02-21 18:17:47 +01:00
Tomasz Grabiec	8fd19a71ff	position_in_partition: Introduce position_range	2017-02-21 16:49:36 +01:00
Tomasz Grabiec	78c563ea6a	position_in_partition: Introduce position constructor for right after the static row	2017-02-21 16:43:09 +01:00
Tomasz Grabiec	ce58706b50	streamed_mutation: Make cast to view non-explicit	2017-02-21 16:43:09 +01:00
Paweł Dziepak	274bcd415a	tests/cql_test_env: wait for storage service initialization Message-Id: <20170221121130.14064-1-pdziepak@scylladb.com>	2017-02-21 17:05:45 +02:00
Paweł Dziepak	359c617821	db: restore call to check_valid_rp() `5a0955e89d` "db: add operations for applying counter updates" merged two column_family::apply() overloads into do_apply() in order to reduce code duplication. Unfortunately, a call to check_valid_rp() didn't survive that change. Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>	2017-02-21 15:26:04 +01:00
Tomasz Grabiec	b4fd3a08e6	streamed_mutation: Make schema() getter non-copying	2017-02-21 14:18:57 +01:00
Duarte Nunes	65b21e3a99	schema_registry: Don't leak schemas When loading a schema asynchronously, we're leaving a strong reference to the loaded schema in the entry's shared future. This patch fixes this by storing a shared_promise, which is reset when the schema is loaded. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170220193654.17439-1-duarte@scylladb.com>	2017-02-21 09:56:21 +01:00
Tomasz Grabiec	33457cc9a9	sstables: Fix detection of repeated tombstones The check was not catching range tombstone repeated immediately after itself. Message-Id: <1487596098-17409-1-git-send-email-tgrabiec@scylladb.com>	2017-02-20 15:35:15 +00:00
Tomasz Grabiec	cc439df542	Revert "sstables: Simplify sstable_streamed_mutation::read_next()" This reverts commit `1e2c01ff49`. We do not detect repeated tombstone if it follows an in-range tombstone following a skipped clustering row, because _in_progress will be disengaged after such tombstone is emitted. Message-Id: <1487596080-21480-1-git-send-email-tgrabiec@scylladb.com>	2017-02-20 15:34:58 +00:00
Vlad Zolotarov	978241d473	database: move lister class into separate files Move lister class away from database.cc. This is a preparation for moving it to the seastar library. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:40 -05:00
Vlad Zolotarov	34cafa71c3	database: make 'clearsnapshot' to delete the snapshots of deleted keyspaces if requested The current implementation of 'nodetool clearsnapshot' command only deletes the snapshots of the keyspaces that are alive at the time the command is issued (issue #2045). This, besides not implementing the spec, prevents users from being able to clear the disk space occupied by snapshots of deleted keyspaces that are no longer needed (e.g. snapshots created when KS is deleted). This patch fixes this issue by making the database::clear_snapshot() scan the data directories looking for the snapshots to be deleted instead of relying on in-memory data structures. This patch makes column_family::clear_snapshot() method not needed any more. Fixes #2045 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:40 -05:00
Vlad Zolotarov	e1ee669aff	database: lister: add the rmdir() static method Removes the directory with all its contents (like 'rm -rf <dir name>' shell command). Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:40 -05:00
Vlad Zolotarov	53532ba5ff	database: lister: pass the parent path object to callbacks Pass a parent directory boost::filesystem::path object to the walker and filter callbacks. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:50:37 -05:00
Vlad Zolotarov	b4c970dfc6	database: lister: make the "filter" callback receive directory_entry instead of sstring Filter should get all information that the caller has in hand that may be used for filtering. directory_entry has the following information: - Type of the entry - Its name For the code that used lister filters so far this would be enough, however it's not hard to imagine a filter that may need the parent directory as well. We will add the parent directory path in the follow up patches to make the interface complete. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:59 -05:00
Vlad Zolotarov	6f9f0e1b3f	database: lister: add "show_hidden" parameter If show_hidden parameter is set to show_hidden::yes - list hidden entries, otherwise skip them. By default set to show_hidden::no. This patch also completely removes default parameters in lister::scan_dir() and replaces them with a few lister::scan_dir() overloads that ensure that lambdas are always going to be the last parameter in the parameters list. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:58 -05:00
Vlad Zolotarov	9aedb191f6	database: lister: if entries' types set is empty - list everything Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:58 -05:00
Vlad Zolotarov	cb614f9be4	database: lister::guarantee_type: handle the case when entry type may not be read There is a possibility that the type of the given entry may not be available that would manifest in the ENOENT or ENOTDIR value set in the errno by the fstat() call for this entry. In this case engine().file_type() will return a not engaged optional<directory_entry_type> value. Return the future with the std::runtime_error exception in this case. This will prevent any further usage of the not engaged optional value by the code in the normal flow. The exception is going to be propagated to the caller and it's the caller's responsibility to handle it. Fixes #2071 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-17 17:46:55 -05:00
Tomasz Grabiec	9f63e172fb	tests: compaction_manager_test: Fix abort on exception Message-Id: <1487343901-12745-1-git-send-email-tgrabiec@scylladb.com>	2017-02-17 15:53:55 +00:00
Avi Kivity	113ed9e963	Merge seastar upstream * seastar 28a143a...5088065 (8): > configure.py: switch cmake to build c-ares to do out-of-source-tree build > iotune: make sure help is working > collectd: send double correctly for gauge > tls: make shutdown/close do "clean" handshake shutdown in background > tls: Make sink/source (i.e. streams) first class channel owners > native-stack: Make sink/source (i.e. streams) first class channel owners > posix-stack: Make sink/source (i.e. streams) first class channel owners > Merge "Detector for tasks blocked" from Glauber Fixes #2085. Packaging updated to require cmake, drop libtool and automake.	2017-02-16 19:34:28 +02:00
Vlad Zolotarov	25502149cf	database: lister::scan_dir(): std::move() all that needs to be moved Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-02-16 11:56:44 -05:00
Raphael S. Carvalho	53d9008052	sstables/deletion_manager: kill dead code Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <2b24d9e622238030a737fbbe12b8439853d5d075.1487095059.git.raphaelsc@scylladb.com>	2017-02-16 18:38:54 +02:00
Vlad Zolotarov	f2e4629254	main.cc: expose scylla version as a gauge metrics Add a new metric that exposes the current ScyllaDB version as a gauge metrics. The version is exposed as a label with the "version" key. Fixes #1979 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1487083703-27929-1-git-send-email-vladz@scylladb.com>	2017-02-16 16:57:55 +02:00
Piotr Jastrzebski	2b8e340761	Replace deprecated BOOST_MESSAGE with BOOST_TEST_MESSAGE BOOST Unit test deprecated BOOST_MESSAGE as early as 1.34 and it has since been permanently removed. This patch replaces all uses of BOOST_MESSAGE with BOOST_TEST_MESSAGE. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <f1732018912a864cea229b0f7cd48170fd927dc2.1487238426.git.piotr@scylladb.com>	2017-02-16 10:49:03 +01:00
Avi Kivity	b8c4b35b57	Merge "Fixes for counter cell locking" from Paweł "This series contains some fixes and a unit test for the logic responsible for locking counter cells." * 'pdziepak/cell-locking-fixes/v1' of github.com:cloudius-systems/seastar-dev: tests: add test for counter cell locker cell_locking: fix schema upgrades cell_locker: make locker non-movable cell_locking: allow to be included by anyone	2017-02-15 17:36:48 +02:00
Paweł Dziepak	f7f89df782	tests: add test for counter cell locker	2017-02-15 15:09:40 +00:00
Paweł Dziepak	2eb3e35815	cell_locking: fix schema upgrades * cell_entry destructor may be called when the former is unlinked * update pointer to schema in partition_entry on schema upgrade * use correct bucket count when creating a new hash table	2017-02-15 15:09:40 +00:00
Paweł Dziepak	fa9c712263	cell_locker: make locker non-movable The locker keeps iterators to potentially internally stored data, and moving the object would invalidate them.	2017-02-15 13:48:47 +00:00
Paweł Dziepak	fc9145671d	cell_locking: allow to be included by anyone	2017-02-15 13:48:47 +00:00
Tomasz Grabiec	9da078a18a	tests: logalloc_test: Print debugging info and abort on failure The test started to fail sporadically on jenkins after `7a00dd6985` due to quiesce() timing out. It's not clear, though, whether this is a regression, because before the series such timeouts would not cause a test failure if the future resolved eventually; the timeout was only logged. I was not able to reproduce it on my setup nor on jenkins, so let's add more debugging output and trigger a coredump next time the test fails. Message-Id: <1487089576-27147-1-git-send-email-tgrabiec@scylladb.com>	2017-02-15 14:41:49 +02:00
Takuya ASADA	9c8515eeed	dist/redhat: stop backporting ninja-build from Fedora, install it from EPEL instead ninja-build-1.6.0-2.fc23.src.rpm was deleted from the Fedora web site for some reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to backport from Fedora anymore. Fixes #2087 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>	2017-02-15 12:58:00 +02:00
Paweł Dziepak	a5476b4e7d	Merge "Emit only range tombstones relevant for query restrictions" from Tomasz "Immediate reason to do this is to ensure that forwarding of streamed_mutation will give the same mutations as slicing would, and have unit tests which verify that those two access methods are consistent with each other. Secondary reason is performance, to avoid processing unnecessary data. Note that this should not cause digest mismatch of data queries during rolling upgrade, because data queries are checksumming only tombstones affecting rows in the results, so only relevant tombstones. Fixes #1254." * tag 'tgrabiec/only-relevant-range-tombstones-v2' of github.com:scylladb/seastar-dev: tests: mutation_source_test: Test that slicing returns only relevant range tombstones tests: Pass all mutation source parameters tests: mutation_source_tests: Ensure timestamps are strictly monotonic tests: streamed_mutation_assertions: Add more expectation methods tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages sstables: Simplify sstable_streamed_mutation::read_next() sstables: Emit only relevant range tombstones range_tombstone: Introduce end_position() position_in_partition: Print position when printing fragment position_in_partition: Make printable position_in_partition: Add cast to view position_in_partition: Generalize from-bound_view constructor bound_view: Extract converters for range start and end bounds mutation_partition: Drop unneeded range tombstones mutation_partition: Simplify row removal range_tombstone_list: Introduce erase() partition_snapshot_reader: Emit only relevant tombstones range_tombstone_stream: Add slicing apply() overload range_tombstone_list: Introduce slice()	2017-02-14 11:18:51 +00:00
Tomasz Grabiec	7ec8c4cf54	tests: mutation_source_test: Test that slicing returns only relevant range tombstones	2017-02-13 20:52:50 +01:00
Tomasz Grabiec	2b8bd10dca	tests: Pass all mutation source parameters	2017-02-13 20:52:49 +01:00
Tomasz Grabiec	25dffef6ae	tests: mutation_source_tests: Ensure timestamps are strictly monotonic	2017-02-13 16:19:32 +01:00
Tomasz Grabiec	e6a95fd8cc	tests: streamed_mutation_assertions: Add more expectation methods	2017-02-13 16:19:32 +01:00
Tomasz Grabiec	62843175ea	tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages	2017-02-13 16:19:32 +01:00
Tomasz Grabiec	1e2c01ff49	sstables: Simplify sstable_streamed_mutation::read_next() mp_row_consumer doesn't split row fragments on repeated range tombstones any more.	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	6324876f24	sstables: Emit only relevant range tombstones	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	72d74b7b40	range_tombstone: Introduce end_position()	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	79d982cd86	position_in_partition: Print position when printing fragment	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	f2e1f2938b	position_in_partition: Make printable	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	fb67dab548	position_in_partition: Add cast to view	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	9ddb5a5173	position_in_partition: Generalize from-bound_view constructor We will need to create positions corresponding to general range bounds, not only corresponding to range tombstones.	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	69911a87d3	bound_view: Extract converters for range start and end bounds	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	2489a0f82e	mutation_partition: Drop unneeded range tombstones Fixes #1254.	2017-02-13 16:12:16 +01:00
Tomasz Grabiec	884858078a	mutation_partition: Simplify row removal	2017-02-13 16:12:15 +01:00
Tomasz Grabiec	fb42366552	range_tombstone_list: Introduce erase()	2017-02-13 16:12:15 +01:00
Tomasz Grabiec	fcf3391785	partition_snapshot_reader: Emit only relevant tombstones Refs #1254.	2017-02-13 16:12:15 +01:00
Tomasz Grabiec	440e50b76a	range_tombstone_stream: Add slicing apply() overload	2017-02-13 16:12:15 +01:00
Tomasz Grabiec	8b7f93175c	range_tombstone_list: Introduce slice()	2017-02-13 16:12:15 +01:00
Paweł Dziepak	8b1d34f39d	mutation_partition_serializer: avoid creating atomic_cell object write_{live, counter, expiring, dead}_cell() take a const reference to an atomic cell as argument. However, their caller (which is write_row_cells) passes to them an atomic_cell_view. There is an appropriate implicit constructor so instead of compiler complaints we get atomic_cell objects being constructed from views which involves an allocation and a copy. Message-Id: <20170213100106.9071-1-pdziepak@scylladb.com>	2017-02-13 11:23:23 +01:00
Avi Kivity	bcf34b9a58	Merge seastar upstream * seastar 83a41c8...28a143a (5): > prometheus: send one MetricFamily per unique metric name > tests: Add test for circular_buffer::erase() > circular_buffer: Introduce erase() > protect against infinite do_until loop > metrics: alternative metrics creation with labels	2017-02-12 21:56:11 +02:00
Gleb Natapov	bb72425b61	storage_proxy: fix send_to_endpoint() to use correct create_write_response_handler() overload There are several problems with storage_proxy::send_to_endpoint right now. It uses create_write_response_handler() overload that is specific to read repair which is suboptimal and creates incorrect logs, it does not process errors and it does not hold storage_proxy object until write is complete. The patch fixes all of the problems. Message-Id: <20170208101949.GA19474@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com>	2017-02-12 10:46:13 +02:00
Tomasz Grabiec	c70ebc7ca5	lsa: Make reclaim_timer enclose segment_pool::reclaim_segments() LSA timing did not include segment migration. It does after this change. Message-Id: <1486657046-9378-1-git-send-email-tgrabiec@scylladb.com>	2017-02-09 17:07:59 +00:00
Avi Kivity	5f15388e7a	Merge "Size-based buffering of mutation_fragments" from Paweł "This series changes buffering of mutation fragments in streamed mutations so that the size of the fragments is taken into account. The original implementation buffered up to 16 fragments, which was pretty much meaningless since it could be far too much if the fragments were large or not nearly enough in case they were small. Fixes #2036." * 'pdziepak/buffer-mfs-by-size/v1' of github.com:cloudius-systems/seastar-dev: streamed_mutation: size-based mutation_fragment buffer limit mutation_fragment: cache size in memory mutation_fragment: make write access more explicit	2017-02-09 16:42:42 +02:00
Paweł Dziepak	3079b1661e	streamed_mutation: size-based mutation_fragment buffer limit Currently, streamed mutations buffer up to 16 mutation fragments. This may be too much, not enough or a perfect choice depending on the mutation fragment size. This patch makes streamed mutation choose how many mutation fragments to keep in the buffer depending on their size, so that we avoid using too much memory in case of large mutation fragments and are able to buffer a lot of fragments if they are small.	2017-02-09 10:51:11 +00:00
Paweł Dziepak	cd0dc7734a	mutation_fragment: cache size in memory	2017-02-09 10:50:51 +00:00
Paweł Dziepak	354ce0b2c7	mutation_fragment: make write access more explicit mutation_fragments are going to be caching their size in memory. In order to be able to invalidate that correctly, they need to know when that size may change (but avoid invalidation when it is not necessary).	2017-02-09 10:49:46 +00:00
Avi Kivity	9530bac2d6	Merge "Adding metrics using histogram and labels" from Amnon "This series uses the newly added histogram and label support to add metrics to the storage_proxy and to the column_family. This adds latency histograms and the metrics that were missing from the column family." * 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev: database: add metrics registration for the coloumn family storage_proxy: add read and write latency histogram estimated_histogram: returns a metrics histogram	2017-02-09 11:40:57 +02:00
Avi Kivity	9e4ae0763d	Merge "Disallow mixed schemas" from Paweł "This series makes sure that schemas containing both counter and non-counter regular or static columns are not allowed." * 'pdziepak/disallow-mixed-schemas/v1' of github.com:cloudius-systems/seastar-dev: schema: verify that there are no both counter and non-counter columns test/mutation_source: specify whether to generate counter mutations tests/canonical_mutation: don't try to upgrade incompatible schemas	2017-02-07 18:03:28 +02:00
Paweł Dziepak	4cbbbc67f0	schema: verify that there are no both counter and non-counter columns	2017-02-07 15:17:14 +00:00
Paweł Dziepak	4ffe0401ee	test/mutation_source: specify whether to generate counter mutations Tests using random mutation generator should be provided with both counter and non-counter mutations to ensure that both cases are sufficiently covered. However, mixed schemas (with both counter and non-counter columns) are not allowed so the RMG has to be explicitly told whether to use counter or non-counter schema.	2017-02-07 15:17:14 +00:00
Paweł Dziepak	294bf0bb7a	tests/canonical_mutation: don't try to upgrade incompatible schemas Test case test_reading_with_different_schemas uses randomly generated pairs of mutations and tries to upgrade one to the schema of the other. However, there are cases when one schema cannot be upgraded to another, for example, counter and non-counter schemas.	2017-02-07 15:17:14 +00:00
Amnon Heiman	292c08f598	database: add metrics registration for the coloumn family This patch adds a metrics registration to the column_family. Using labels, each column family metric is labeled with its keyspace and column family name. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-02-06 18:27:01 +02:00
Amnon Heiman	2cf13c26e2	storage_proxy: add read and write latency histogram Register the read and write latency histogram on the metrics layer. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-02-06 17:54:47 +02:00
Amnon Heiman	1e3cfe7396	estimated_histogram: returns a metrics histogram The metrics histogram is a struct that describes a histogram. This patch adds a getter method that lets the estimated_histogram return a metrics::histogram, which allows registering it as a histogram metric. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2017-02-06 17:34:43 +02:00
Avi Kivity	afe98f8572	Merge "Materialized views: support new entries" from Duarte "This patchset adds bits of the MV write-path, enough to support new entries to be added. Note that this is still limited, as only adding new rows to a base table will work correctly." * 'materialized-views/insert-path/v4' of https://github.com/duarten/scylla: (30 commits) database: Apply mutation to views column_family: Push view replica update materialized views: partial mutate_MV materialized views: function to send a mutation to endpoint materialized views: add VIEW write type database: Ensure new write_type is correctly printed materialized views: match base and view replicas column_family: Generate view updates column_family: Adds affected_views() function view: Add view_update_builder class range_tombstone_accumulator: Expose current tombstone range_tombstone_accumulator: apply() takes value view_updates: Generate updates view_updates: Adds function to replace row view_updates: Update view entry view_updates: Delete old view entry mutation_partition: Introduce shadowable tombstone view_updates: Create view entry view_updates: Compute row marker view: Introduce view_updates class ...	2017-02-06 15:10:38 +02:00
Duarte Nunes	0eca6301d3	database: Apply mutation to views This patch changes the database apply path so that it also generates the mutations for the column family's views and sends them to the paired view replicas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:37:33 +01:00
Duarte Nunes	4777172348	column_family: Push view replica update This patch adds a function to push updates to the view replicas of a particular base table.	2017-02-06 13:36:45 +01:00
Nadav Har'El	3ae73164a4	materialized views: partial mutate_MV This adds a function mutate_MV() which takes view mutations and sends them to the appropriate nodes (this may be the current node, or a remote node). This is only a partial implementation - we still don't do the local batch log (to survive reboots and failures) and some other stuff which is left commented out. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Nadav Har'El	f2fd81ece0	materialized views: function to send a mutation to endpoint Add a function for sending one mutation to one remote replica owning this mutation. This is needed for materialized views, where each base replica sends each view mutation to one particular view replica. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2017-02-06 13:36:45 +01:00
Nadav Har'El	92fc7386f6	materialized views: add VIEW write type This adds to the "write_type" enum also the "VIEW" write type. To be honest, I don't understand why the "write_type" distinction is important. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	11bd3bd29f	database: Ensure new write_type is correctly printed By removing the default case in the switch statement over a write_type variable, we ensure the compiler warns us about lack of exhaustiveness in case we add a value to the enum but forget to change the corresponding operator<<(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Nadav Har'El	365df8f900	materialized views: match base and view replicas A function to find the appropriate replica to send a view update to. This patch creates a new source file db/view/view.cc. We should eventually move a lot more of the materialized views code there. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	16206e9f15	column_family: Generate view updates This patch adds the generate_view_updates() function to the column_family class, which will use the view_update_builder to generate updates to the column_family's materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	90cb35db04	column_family: Adds affected_views() function This patch adds the affected_views() function to determine which of the column family's views a given update affects. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	d5a61a8c48	view: Add view_update_builder class This patch adds the view_update_builder class, which is responsible for calculating the mutations to apply to a column family's materialized views, given a streamed_mutation representing an update to the base table and a streamed_mutation representing the pre-existing rows which the update covers. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	2ab9ba995a	range_tombstone_accumulator: Expose current tombstone Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	f3c5ea392a	range_tombstone_accumulator: apply() takes value range_tombstone_accumulator::apply() now takes a value so the caller can decide whether to move or copy the argument. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	3991a58f08	view_updates: Generate updates This patch adds the view_updates::generate_update() function to generate view updates given a base row update and the corresponding, pre-existing row. This function will decide which of the previously introduced functions to call based on whether there is a pre-existing row and whether there exists a regular base column that's part of the view's PK. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	861d2dfb61	view_updates: Adds function to replace row This patch adds a function to replace a view row given a base table update and the pre-existing row, which simply deletes the old view entry and adds a new one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	7901ce7de4	view_updates: Update view entry This patch introduces the view_updates::update_entry function, which creates the updates to apply to the existing view entry given the base table row before and after the update. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	b34ae6d6da	view_updates: Delete old view entry This patch introduces the view_updates::delete_old_entry function, which creates a view row mutation to delete an entry given an updated base table row. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	7e150a18eb	mutation_partition: Introduce shadowable tombstone This patch introduces shadowable row tombstones. A shadowable row tombstone is valid only if the row has no live marker. In other words, the row tombstone is only valid as long as no newer insert is done (thus setting a live row marker; note that if the row timestamp set is lower than the tombstone's, then the tombstone remains in effect as usual). If a row has a shadowable tombstone with timestamp Ti and that row is updated with a timestamp Tj, such that Tj > Ti (and that update sets the row marker), then the shadowable tombstone is shadowed by that update. A concrete consequence is that if the update has cells with timestamp lower than Ti, then those cells are preserved (since the deletion is removed), and this is contrary to a regular, non-shadowable row tombstone where the tombstone is preserved and such cells are removed. Currently, only Materialized Views require shadowable row tombstones, which solve a problem with view row deletions. Consider a base row with columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations: 1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0; 2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0; 3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0; Without shadowable tombstones, the view contains: At 1), pk = (0, 0), row_marker@T0, v2=1@T0 At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0 pk = (0, 1), row_marker@T1, v2=1@T0 At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0 pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0 Notice how, if we read row (0, 0), the value of v2 will be shadowed by the row tombstone we previously inserted. 
With a view's row tombstone becoming shadowable, at 3) the row (0, 0) will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0, which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0. Since the shadowable tombstone is shadowed by the new row marker (T1 < T2), now v2 would be taken into account. Finally, note that this patch doesn't generalize the idea of shadowable tombstone, instead taking advantage of the fact that they are only needed by Materialized Views. This saves changing the tombstone representation to account for an extra flag, the bits such representation would require, and also avoids changes to the storage format. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:36:45 +01:00
Duarte Nunes	e0f642180f	view_updates: Create view entry This patch introduces the view_updates::create_entry function, which creates a view row mutation given a new base table row. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:31 +01:00
Duarte Nunes	b8b8a8099c	view_updates: Compute row marker This patch adds a function to compute the row marker of a view row given the base row. There are two cases to consider when building the row marker: 1) there is a column C that is a regular base column but is in the view PK; and 2) the columns for the base and the view PKs are the same. For 1), the view row marker timestamp will be the biggest between the base's row marker and C. The TTL will be that of C. This means that if C expires, the view row marker will expire as well (and the row, if no other column is keeping it alive). Note that if the base row marker expires but not C, then the base row will still be live due to C and we shouldn't expire the view row. For 2), the view row timestamp will be the same as the base row timestamp. The TTL should be set in such a way that both base and view rows live for the same time. We thus set the view row TTL to be the max of any other TTL in the base row. This is particularly important in the case where the base row marker has a TTL, but a column absent from the view holds a greater one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:31 +01:00
Duarte Nunes	7321938bcf	view: Introduce view_updates class This patch introduces the view_updates class, which is responsible for generating and storing updates to a particular materialized view. The updates will be generated from an updated base row and the pre-existing one (if any), in later patches. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:31 +01:00
Duarte Nunes	0f8dbc9243	collection_type_impl: Iterate over collection cells This patch introduces the collection_type_impl::for_each_cell() function, which allows the caller to iterate over the cells of a particular collection_mutation_view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:31 +01:00
Duarte Nunes	082ef56df1	view: Store pk view column that's non-pk in the base To help calculate the view mutations from a base update, we store in the view class the column that's part of the view's primary key but not part of the base's, if such column exists. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	734ad80390	view: Add matches_view_filter() function This patch adds the matches_view_filter() function which specifies whether a given base row matches the view filter. Unlike may_be_affected_by(), this function has no false positives. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	7be0f319d4	single_column_restriction: Filter clustering rows This patch adds the is_satisfied_by() function to single_column_restriction, which, given a clustering row, returns whether the restriction applies or not. This is useful for secondary indexing such as materialized views, where filters on regular columns precisely select which base table rows to denormalize. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	3b52440ff3	statement_restrictions: Expose non-pk restrictions This patch exposes the non-primary key column restrictions in a given select statement, exposing them as single_column_restrictions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	e987d87ab1	collection_type_impl: Identify concrete types This patch adds the is_set() and is_list() functions to collection_type_impl, which identify the concrete collection type. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	71faa4a4eb	abstract_restriction: Rename uses_function() This patch renames abstract_restriction::uses_function() to term_uses_function(), as it was previously hiding a function with the same name in the restriction base class. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	21d1bbb527	view: Add may_be_affected_by function This patch adds the may_be_affected_by() function to the view class, which is responsible for determining whether an update to a base table affects one of its views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	c35d14e285	column_family: Store a pointer to view Instead of storing the view in the column_family's map of materialized views, store a lw_shared_ptr so that the view can be removed while it is being updated. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Duarte Nunes	69171c28f0	cql3/util: Fix use-after-free This patch fixes a use-after-free error in rename_column_in_where_clause(), where we were creating a boost adaptor on an rvalue. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-02-06 13:35:30 +01:00
Avi Kivity	3896c27e5f	Merge "DNS use in scylla" from Calle "Fixes #1531 Adds lookup to gms::inet_address and uses it in (hopefully all) the salient places where configured symbolic names are interpreted. Removes the dummy dns module in scylla in favour of the seastar one." * 'calle/use-dns' of github.com:cloudius-systems/seastar-dev: remove scylla dns code service::storage_service: Remove depedency on scylla dns main.cc: remove scylla dns dependency main/init: Lookup inet addresses from config by dns lookup db::system_keyspace: Find rpc_address by lookup gms::inet_address: Add lookup functionality. scylla tls: Add option support for client auth and tls opts	2017-02-06 13:50:42 +02:00
Avi Kivity	da8d00199e	Merge	2017-02-06 13:43:07 +02:00
Avi Kivity	fdfabbf8bb	Merge seastar upstream * seastar f07f8ed...83a41c8 (8): > Cleaning the metrics API > tutorial: pick the name "asynchronous function". > tutorial: explain the difference between exception and exception future > tutorial: abstract > ninja: don't bother building c-ares shared libraries > ninja: unbreak build ordering > ninja: unbreak "ninja -t clean" > Add libtool to dependencies	2017-02-06 13:42:38 +02:00
Calle Wilund	44503f8253	remove scylla dns code Use seastar facilities instead.	2017-02-06 11:36:57 +00:00
Calle Wilund	ab800c225a	service::storage_service: Remove dependency on scylla dns Use seastar facilities instead	2017-02-06 11:36:57 +00:00
Calle Wilund	c4c4eb06c4	main.cc: remove scylla dns dependency Use seastar facilities instead.	2017-02-06 11:36:57 +00:00
Avi Kivity	b18e54307f	tests: add --operations-per-shard option to perf_simple_query This helps achieve more repeatable runs that can then be compared via the Linux perf tool. The option overrides duration-based testing and runs the test for a specific number of iterations. Message-Id: <20170204172937.8462-1-avi@scylladb.com>	2017-02-06 12:08:04 +01:00
Gleb Natapov	3c372525ed	storage_proxy: use storage_proxy clock instead of explicit lowres_clock Merge commit `45b6070832` used a butchered version of the storage_proxy patch to adjust to the rpc timer change instead of the one I sent. This patch fixes the differences. Message-Id: <20170206095237.GA7691@scylladb.com>	2017-02-06 12:51:36 +02:00
Calle Wilund	feffc2bbe1	main/init: Lookup inet addresses from config by dns lookup I.e. allow symbolic names in addition to ip addresses.	2017-02-06 09:45:37 +00:00
Calle Wilund	ef26ab0e1b	db::system_keyspace: Find rpc_address by lookup	2017-02-06 09:45:37 +00:00
Calle Wilund	0a740b5ccf	gms::inet_address: Add lookup functionality. To find addresses by name.	2017-02-06 09:45:37 +00:00
Calle Wilund	ff8f82f21c	scylla tls: Add option support for client auth and tls opts Refs #1813 (fixes scylla part) Added require_client_auth and priority_string options to server_encryption_options/client_encryption_options an process them. Allows TLS method/algo specification. Also enabled enforcing known cert authentication for both node-to-node and client communication.	2017-02-06 09:45:09 +00:00
Avi Kivity	6e9e28d5a3	cell_locking: work around for missing boost::container::small_vector small_vector doesn't exist on Ubuntu 14.04's boost, use std::vector instead.	2017-02-05 20:48:36 +02:00
Avi Kivity	2510b756fc	dist: add build dependency on automake Needed by seastar's c-ares.	2017-02-05 20:16:27 +02:00
Takuya ASADA	e82932b774	dist/common/systemd: introduce scylla-housekeeping restart mode scylla-housekeeping needs to run in 'restart mode' to check the version during scylla-server restart, but that mode was not invoked from the systemd timer, so add it. The existing scylla-housekeeping.timer is renamed to scylla-housekeeping-daily.timer, since it runs 'daily mode'. Fixes #1953 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1486180031-18093-1-git-send-email-syuu@scylladb.com>	2017-02-05 10:46:04 +02:00
Avi Kivity	4175f40da1	dist: add libtool build dependency for seastar/c-ares	2017-02-05 10:42:53 +02:00
Takuya ASADA	12b5e7288d	dist/common/scripts/scylla_setup: show restart message when SELinux was disabled on the script Disabling SELinux requires server restart, so warn user to restart before running Scylla. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1485817393-25919-2-git-send-email-syuu@scylladb.com>	2017-02-05 10:10:18 +02:00
Takuya ASADA	c28a574b9e	dist/common/scripts: stop setting hugepages boot parameter Stop setting hugepages boot parameter since we don't use it on default configuration (posix mode), but keep scylla_bootparam_setup to setup clocksource on AMI. Fixes #1758 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1485817393-25919-1-git-send-email-syuu@scylladb.com>	2017-02-05 10:10:18 +02:00
Paweł Dziepak	37b0c71f1d	cell_locking: fix parititon_entry::equal_compare The comparator constructor took schema by value instead of const l-ref and, consequently, later tried to access object that has been destroyed long time ago. Message-Id: <20170202135853.8190-1-pdziepak@scylladb.com>	2017-02-03 19:49:18 +01:00
Avi Kivity	7a00dd6985	Merge "Avoid avalanche of tasks after memtable flush" from Tomasz "Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. 
Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce()	2017-02-02 17:49:31 +02:00
Paweł Dziepak	788892e931	counters: fix build failure on gcc5 Message-Id: <20170202132049.4497-1-pdziepak@scylladb.com>	2017-02-02 14:23:49 +01:00
Piotr Jastrzebski	36b2c4df19	row_cache_test: extend test_mvcc Make the test execute with and without an active reader to memtable that's flushed to cache. This improves the code coverage of MVCC with tests. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <007b6cd1ba7a84ea5675ea82e454bf1adf3b3330.1485954941.git.piotr@scylladb.com>	2017-02-02 13:51:32 +01:00
Tomasz Grabiec	5458a32f13	gdb: Introduce commands for inspecting pending task queue Message-Id: <1485426236-6627-1-git-send-email-tgrabiec@scylladb.com>	2017-02-02 13:15:17 +02:00
Avi Kivity	000edc36c4	Merge "Counters" from Paweł "This series introduces support for counters. The implementation of counters more or less follows the design described on our wiki page [1]. Counter cells contain many shards with replicas being able to modify and announce new versions only of the shards that they own. Historically, there were three types of shards: local, remote and global. In these patches only support for the global ones is added. [1] https://github.com/scylladb/scylla/wiki/Counters Currently, counters are only enabled as experimental features as there are still several things that need to be done before they become production ready. Namely, the performance is expected to be quite poor (especially for writes), there is no proper tracing support and timed out counter requests may not be recognized and dropped early. There are also no counter-related metrics. However, apart from these problems there are no other missing parts of counter implementation and they are expected to work correctly. Fixes #577." 
* 'pdziepak/counters/v3-rebased' of github.com:cloudius-systems/seastar-dev: (38 commits) perf_simple_query: add counter tables tests thrift: add support for counter operations cql3: allow counters in CREATE TABLE statements cql3: selection: do not panic when seeing counters storage_proxy: support counter updates storage_proxy: add get_live_endpoints() cql3: add counter increment and decrement operations db: add operations for applying counter updates counters: implement transforming counter deltas to shards add infrastructure for locking counter cells add fnv1a hasher position_in_partition: add feed_hash() position_in_partition: add functions for querying object type types: make counter_type_impl report its cql3_type transport: encode counters as long_type mutation_partition: make for_each_cell() accessible outside source file messaging_service: add COUNTER_MUTATION verb storage_service: add COUNTERS feature idl: add idl description of consistency level schema: make is_counter() return correct value ...	2017-02-02 12:40:09 +02:00
Paweł Dziepak	8671d8329d	perf_simple_query: add counter tables tests	2017-02-02 10:35:14 +00:00
Paweł Dziepak	4ca7f0a491	thrift: add support for counter operations	2017-02-02 10:35:14 +00:00
Paweł Dziepak	fa29ef3cc0	cql3: allow counters in CREATE TABLE statements	2017-02-02 10:35:14 +00:00
Paweł Dziepak	fce6e0987f	cql3: selection: do not panic when seeing counters At this stage counters cells are already long_type values, so no special handling is necessary.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	1e8814f5ce	storage_proxy: support counter updates	2017-02-02 10:35:14 +00:00
Paweł Dziepak	c14c6b753b	storage_proxy: add get_live_endpoints()	2017-02-02 10:35:14 +00:00
Paweł Dziepak	d6ebf84edf	cql3: add counter increment and decrement operations	2017-02-02 10:35:14 +00:00
Paweł Dziepak	5a0955e89d	db: add operations for applying counter updates	2017-02-02 10:35:14 +00:00
Paweł Dziepak	8d889082bf	counters: implement transforming counter deltas to shards The leader receives counter updates as deltas which have to be transformed to counter shards. In order to do that, current local shard of the modified counter cell needs to be read, logical clock incremented and the value modified by the specified delta.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	55277b3182	add infrastructure for locking counter cells The leader receives counter updates in the form of deltas which need to be transformed to counter shards. In order to do that the node needs to read its current state of the modified counter cells. Since this is essentially a read-modify-write operation an appropriate locking mechanism is needed. Counter cell locker introduced in this patch uses a hashtable of partition entries, each containing a hashtable of cell entries. Inside a cell entry there is a semaphore used for synchronization. Once no longer needed cell entries and partition entries are removed. In order to avoid deadlocks cell entries are always locked in the same order which is the lexicographical order of (clustering key, column id) pairs. Note that schema changes are not a difficulty since they do not make it possible to change ordering of such pairs.	2017-02-02 10:35:14 +00:00
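The deadlock-avoidance rule in the cell locking commit above (always lock cells in lexicographical order of (clustering key, column id) pairs) can be illustrated with a small sketch. This is not the Scylla implementation: it uses std::mutex where the commit describes semaphores inside hashtable entries, and the names cell_id, cell_lock_table, lock_order and lock_all are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a (clustering key, column id) pair.
// std::pair compares lexicographically, which is the order we need.
using cell_id = std::pair<std::string, int>;

// Hypothetical lock table: one mutex per cell. The real code uses
// semaphores stored in per-partition hashtables; this is only a sketch.
class cell_lock_table {
    std::mutex _table_mutex;               // protects the map itself
    std::map<cell_id, std::mutex> _locks;  // node-based, so mutex addresses are stable
public:
    // The order in which a set of cells is locked: sorted and deduplicated.
    static std::vector<cell_id> lock_order(std::vector<cell_id> cells) {
        std::sort(cells.begin(), cells.end());
        cells.erase(std::unique(cells.begin(), cells.end()), cells.end());
        return cells;
    }

    // Locks every requested cell in the canonical order. Two callers with
    // overlapping cell sets always contend on their lowest shared cell first,
    // so neither can hold a lock the other needs while waiting on it.
    std::vector<std::unique_lock<std::mutex>> lock_all(std::vector<cell_id> cells) {
        std::vector<std::mutex*> targets;
        {
            std::lock_guard<std::mutex> guard(_table_mutex);
            for (const auto& c : lock_order(std::move(cells))) {
                targets.push_back(&_locks[c]);
            }
        }
        std::vector<std::unique_lock<std::mutex>> held;
        for (std::mutex* m : targets) {
            held.emplace_back(*m);
        }
        return held;
    }
};
```

The locks are released when the returned vector goes out of scope. Unlike the commit, this sketch never removes unused entries from the table.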
Paweł Dziepak	22fbb11f90	add fnv1a hasher	2017-02-02 10:35:14 +00:00
Paweł Dziepak	a16761dcb4	position_in_partition: add feed_hash()	2017-02-02 10:35:14 +00:00
Paweł Dziepak	f4fce93807	position_in_partition: add functions for querying object type	2017-02-02 10:35:14 +00:00
Paweł Dziepak	53d9a6f220	types: make counter_type_impl report its cql3_type	2017-02-02 10:35:14 +00:00
Paweł Dziepak	a805bea97a	transport: encode counters as long_type For the purposes of CQL counters are long values (either a delta in case of writes or the final value for reads).	2017-02-02 10:35:14 +00:00
Paweł Dziepak	b6564651e4	mutation_partition: make for_each_cell() accessible outside source file for_each_cell() const already can be used from any place in the code, allow the same with non-const version.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	bf60b7844b	messaging_service: add COUNTER_MUTATION verb This verb is going to be used for coordinator<->leader communication during counter updates.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	67ca6959bd	storage_service: add COUNTERS feature	2017-02-02 10:35:14 +00:00
Paweł Dziepak	9989239c97	idl: add idl description of consistency level	2017-02-02 10:35:14 +00:00
Paweł Dziepak	4b3c0db5cc	schema: make is_counter() return correct value	2017-02-02 10:35:14 +00:00
Paweł Dziepak	99b21fbb86	tests: random_mutation_generator: generate counter cells	2017-02-02 10:35:14 +00:00
Paweł Dziepak	de2acd47c9	tests/sstables: test reading and writing counters	2017-02-02 10:35:14 +00:00
Paweł Dziepak	83c6fc1114	sstables: write counter cells	2017-02-02 10:35:14 +00:00
Paweł Dziepak	5905729c4a	sstables: read counter cells	2017-02-02 10:35:14 +00:00
Paweł Dziepak	de698105e4	tests/counter: test apply, difference and freeze	2017-02-02 10:35:14 +00:00
Paweł Dziepak	0c93d01232	atomic_cell: make sure upper level tombstones cover counters Support for deletion of counters is limited in a way that once deleted they cannot be used again (i.e. tombstone always wins, regardless of the timestamp). Logic responsible for merging two counter cells already makes sure that tombstones are handled properly, but it is also necessary to ensure that higher level tombstones always cover counters.	2017-02-02 10:35:14 +00:00
Paweł Dziepak	9f1ebd4f7c	idl/mutation: add counter serialisation logic	2017-02-02 10:35:14 +00:00
Paweł Dziepak	47d14906e6	mutation_partition: support querying counter cells	2017-02-02 10:35:14 +00:00
Paweł Dziepak	63f25eb12c	mutation_hasher: handle counter cells properly	2017-02-02 10:35:14 +00:00
Paweł Dziepak	25c8ed1c71	feed_hash: allow additional arguments	2017-02-02 10:35:14 +00:00
Paweł Dziepak	a57e86cc37	mutation_partition: compute counter difference	2017-02-02 10:35:13 +00:00
Paweł Dziepak	2725a4945d	mutation_partition: apply counter cells properly	2017-02-02 10:35:13 +00:00
Paweł Dziepak	496b42fcc7	tests: add test for counters	2017-02-02 10:35:13 +00:00
Paweł Dziepak	7bb5b49799	add in memory representation of counters Live counter cells are collections of shards, each one representing the sum of all operations performed by a particular replica. This commit introduces an in-memory representation of counters as well as basic operations such as merge, difference and hashing.	2017-02-02 10:35:13 +00:00
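A minimal sketch of the shard model described in the counter commits above, under stated assumptions: each shard carries a value and a logical clock, the counter value is the sum over shards, and merging keeps the shard with the higher logical clock for each replica. The merge rule and all names here (counter_shard, counter_cell, apply_delta, merge, total) are illustrative assumptions, not the actual Scylla types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// One shard: the sum of all operations performed by a single replica,
// plus a logical clock that grows with every update (assumed layout).
struct counter_shard {
    std::int64_t value;
    std::int64_t logical_clock;
};

// A live counter cell as a map from (hypothetical) replica id to its shard.
using counter_cell = std::map<std::string, counter_shard>;

// Leader-side transformation of a delta into a new shard version:
// bump the logical clock and add the delta to the current value.
counter_shard apply_delta(counter_shard current, std::int64_t delta) {
    current.value += delta;
    current.logical_clock += 1;
    return current;
}

// Merge two cells shard by shard, keeping the newer version of each shard.
// Assumption: "newer" means a strictly higher logical clock.
counter_cell merge(const counter_cell& a, const counter_cell& b) {
    counter_cell out = a;
    for (const auto& entry : b) {
        auto it = out.find(entry.first);
        if (it == out.end()) {
            out.insert(entry);
        } else if (entry.second.logical_clock > it->second.logical_clock) {
            it->second = entry.second;
        }
    }
    return out;
}

// The value a reader sees: the sum of all shard values.
std::int64_t total(const counter_cell& cell) {
    std::int64_t sum = 0;
    for (const auto& entry : cell) {
        sum += entry.second.value;
    }
    return sum;
}
```

Because each replica only ever produces new versions of its own shard, a merge of this shape is order-independent and can be applied repeatedly without double counting.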
Paweł Dziepak	c66db213d3	storage_service: allow getting local host id without futures<>	2017-02-02 10:35:13 +00:00
Paweł Dziepak	0a8f00c159	atomic_cell: add flag for recognizing counter updates A counter cell may be either a collection of shards or just a delta. The former can only appear in certain places on coordinator and leader.	2017-02-02 10:35:13 +00:00
Paweł Dziepak	ab344c5aa3	mutation_partition_view: extract atomic_cell variant	2017-02-02 10:35:13 +00:00
Paweł Dziepak	83f6018ea2	schema: keep counter information in column definition	2017-02-02 10:35:13 +00:00
Avi Kivity	aec419da13	Merge seastar upstream * seastar c1dbd89...f07f8ed (3): > Merge "Introduce when_all_succeed()" from Paweł > tests: adjust collectd test for metric API change > Merge "DNS query support" from Calle	2017-02-02 12:30:10 +02:00
Piotr Jastrzebski	15cc8460bd	mutation_partition: make rows_entry constructors explicit All converting constructors should be explicit otherwise they can create a confusion. I got myself in such a situation when clustering key got implicitly converted into rows_entry when I was not expecting it. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c3f19719760f6dc7cf5e858b9c452506faedf521.1485950529.git.piotr@scylladb.com>	2017-02-01 17:57:50 +01:00
Tomasz Grabiec	2fd339787b	tests: lsa: Add test for reclaimer starting and stopping	2017-02-01 17:41:56 +01:00
Tomasz Grabiec	f943296da0	tests: lsa: Add request releasing stress test	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	e40fb438f5	lsa: Avoid avalanche releasing of requests Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single thread of execution. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. Refs #2021. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. 
It executes blocked requests until pressure occurs. The logic for notification across hierarchy was replaced by calling region_group::notify_relief() from region_group::update() on the broadest relieved group.	2017-02-01 17:41:55 +01:00
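The fix described in the commit above (one releaser per region group, woken on relief, running blocked requests until pressure returns) can be sketched synchronously as follows. This is a simplified model, not the LSA code: the real releaser is a Seastar fiber, and the class region_group_sketch with its members is hypothetical. Only the names update(), notify_relief(), under_pressure() and run_when_memory_available() are taken from the commit messages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical, single-threaded model of a region group with blocked requests.
class region_group_sketch {
    std::int64_t _size = 0;
    std::int64_t _threshold;
    std::deque<std::function<void()>> _blocked;
    bool _releasing = false;           // true while the single releaser is running
    std::size_t _releaser_wakeups = 0; // how many times a releaser was started
public:
    explicit region_group_sketch(std::int64_t threshold) : _threshold(threshold) {}

    bool under_pressure() const { return _size >= _threshold; }
    std::size_t blocked() const { return _blocked.size(); }
    std::size_t releaser_wakeups() const { return _releaser_wakeups; }

    // Run the request now if possible, otherwise queue it behind earlier ones.
    void run_when_memory_available(std::function<void()> req) {
        if (!under_pressure() && _blocked.empty()) {
            req();
        } else {
            _blocked.push_back(std::move(req));
        }
    }

    // Called whenever the group size changes, including from inside requests.
    void update(std::int64_t delta) {
        _size += delta;
        if (!under_pressure()) {
            notify_relief();
        }
    }

private:
    // Wake the releaser. A re-entrant call made while the releaser is already
    // draining the queue returns immediately, so size changes caused by the
    // released requests never start additional releasers.
    void notify_relief() {
        if (_releasing || _blocked.empty()) {
            return;
        }
        _releasing = true;
        ++_releaser_wakeups;
        while (!under_pressure() && !_blocked.empty()) {
            std::function<void()> req = std::move(_blocked.front());
            _blocked.pop_front();
            req();
        }
        _releasing = false;
    }
};
```

In the old scheme, every update() made by a released request would have scheduled another releasing task. Here the guard flag keeps the number of active releasers at one regardless of how many size changes occur.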
Tomasz Grabiec	d55baa0cd1	lsa: Move definitions to .cc	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	8f8b111b33	lsa: Simplify hard pressure notification management The hard pressure was only signalled on region group when run_when_memory_available() was called after the pressure condition was met. So the following loop is always an infinite loop rather than stopping when enough is allocated to cause pressure: while (!gr.under_pressure()) { region.allocate(...); } It's cleaner if pressure notification works not only if run_when_memory_available() is used but whenever the condition changes, like we do for the soft pressure. There is a comment in run_when_memory_available() which gives reasons why notifications are called from there, but I think those reasons no longer hold: - we already notify on soft pressure conditions from update(), and if that is safe, notifying about hard pressure should also be safe. I checked and it looks safe to me. - avoiding notification in the rare case when we stopped writing right after crossing the threshold doesn't seem beneficial. It's unlikely in the first place, and one could argue it's better to actually flush now so that when writes resume they will not block.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	9aa1be5d08	lsa: Do not start or stop reclaiming on hard pressure We already call these when crossing the soft threshold. We shouldn't stop reclaiming when hard pressure is gone because soft pressure may still be present. Calling start_reclaiming() on hard pressure is unnecessary because soft pressure also starts it, and when there is hard pressure there is also soft pressure.	2017-02-01 17:40:15 +01:00
Amnon Heiman	45b6070832	Merge seastar upstream * seastar 397685c...c1dbd89 (13): > lowres_clock: drop cache-line alignment for _timer > net/packet: add missing include > Merge "Adding histogram and description support" from Amnon > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&' > Set the option '--server' of tests/tcp_sctp_client to be required > core/memory: Remove superfluous assignment > core/memory: Remove dead code > core/reactor: Use logger instead of cerr > fix inverted logic in overprovision parameter > rpc: fix timeout checking condition > rpc: use lowres_clock instead of high resolution one > semaphore: make semaphore's clock configurable > rpc: detect timedout outgoing packets earlier Includes treewide change to accommodate rpc changing its timeout clock to lowres_clock. Includes fixup from Amnon: collectd api should use the metrics getters As part of the preparation for the change in the metrics layer, this changes the way the collectd api uses the metrics value to use the getters instead of calling the member directly. This will be important when the internal implementation changes from union to variant. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>	2017-02-01 14:39:08 +02:00
Glauber Costa	facb0aa6d9	row_cache: rewrite loop so that debug mode doesn't become a noop need_preempt() is always true in debug mode. Because of that, this loop will never be executed. Rewrite it as a do-while loop so we are sure that it is executed at least once - or exactly once in debug mode. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>	2017-02-01 10:02:13 +02:00
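The debug-mode problem in the row_cache commit above comes down to where the preemption check sits. The sketch below contrasts the two loop shapes. The actual loop body in row_cache is not shown in the log, so the loop structure, the function names process_with_while and process_with_do_while, and the preemption callback (a stand-in for need_preempt()) are all assumptions.

```cpp
#include <cassert>

// Pre-checked loop: if the preemption predicate is already true on entry
// (as need_preempt() always is in debug mode, per the commit message),
// no item is ever processed.
template <typename Preempt>
int process_with_while(Preempt need_preempt_stub, int items) {
    int done = 0;
    while (!need_preempt_stub() && done < items) {
        ++done;  // stands in for updating one partition
    }
    return done;
}

// Post-checked loop: at least one item is processed before the predicate
// is consulted, so progress is guaranteed even when it is always true.
template <typename Preempt>
int process_with_do_while(Preempt need_preempt_stub, int items) {
    int done = 0;
    if (items <= 0) {
        return done;  // nothing to do; avoid running the body on empty input
    }
    do {
        ++done;
    } while (!need_preempt_stub() && done < items);
    return done;
}
```

With a predicate that always returns true, the first function processes nothing and the second processes exactly one item per call, matching the "at least once, or exactly once in debug mode" behaviour the commit describes.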
Tomasz Grabiec	634761dbba	commitlog: Fix default limit for size on disk The per-node limit will be total memory divided by number of shards instead of just total memory. For example, when Scylla is started with -c16 -m16G, the commit log will induce flushes on a given shard when unflushed data on that shard exceeds 62MB instead of 1GB. Fixes #2046. Message-Id: <1485874534-10939-1-git-send-email-tgrabiec@scylladb.com>	2017-01-31 17:12:59 +02:00
Piotr Jastrzebski	c7e95af0b0	row_cache_test: fix test_mvcc Currently the test does not wait for cache update to finish before carrying on with the checks. This makes the test nondeterministic and purely wrong because checks expect update to be finished. This patch changes the test to wait for update to finish. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2a99bba24b1628466d3495332b48ef3ccdb43c26.1485862389.git.piotr@scylladb.com>	2017-01-31 11:37:29 +00:00
Avi Kivity	aedb5e5cfa	mutation_fragment: add std::ostream support Helps poor debuggers. Message-Id: <20170130163605.4858-1-avi@scylladb.com>	2017-01-31 10:37:42 +01:00
Tomasz Grabiec	0d40b86546	Merge "bail sooner from cache update if need_preempt()" from Glauber An earlier patch of mine was using should_yield to do the same. That is a better direction, but should_yield() was demonstrably more expensive so for now we'll go with need_preempt() - since this is hurting pretty much every latency-dependent workload. I am also including the scripts that I have used to measure and compare the various versions of this patch.	2017-01-31 09:51:34 +01:00
Tomasz Grabiec	f053b48f7c	tests: lsa: Adjust to take into account that reclaimers are run synchronously	2017-01-30 19:18:07 +01:00
Tomasz Grabiec	ed9ff19467	lsa: Document and annotate reclaimer notification callbacks They are called from region_group::update(), so must be alloc-free and noexcept.	2017-01-30 19:18:07 +01:00
Tomasz Grabiec	2ec6fe415e	tests: lsa: Use with_timeout() in quiesce() The current construct doesn't interrupt the test; the timeout failure is only logged.	2017-01-30 19:18:07 +01:00
Pekka Enberg	a625aae489	cql3/values.hh: Fix to_bytes_opt(raw_value) The data() method already returns a bytes_opt so there's no need to call to_bytes_opt() again. Fixes compilation failure on CentOS: In file included from ./cql3/query_options.hh:51:0, from ./cql3/cql_statement.hh:47, from ./cql3/statements/raw/select_statement.hh:45, from build/release/gen/cql3/CqlParser.hpp:65, from build/release/gen/cql3/CqlParser.cpp:44: ./cql3/values.hh: In function 'bytes_opt to_bytes_opt(const cql3::raw_value&)': ./cql3/values.hh:184:37: error: no matching function for call to 'to_bytes_opt(bytes_opt)' return to_bytes_opt(value.data()); Message-Id: <1485761863-28236-1-git-send-email-penberg@scylladb.com>	2017-01-30 10:49:31 +02:00
Gleb Natapov	6e4817137e	storage_proxy: report foreground reads instead of reads The reason is the same as why foreground writes are reported instead of total writes (`049ae37d08`): It is much easier to see what is going on this way. Also fixes a typo in a counter's description. Fixes #1217 Message-Id: <20170129093412.GS11469@scylladb.com>	2017-01-29 12:40:56 +02:00
Avi Kivity	9fb2f31616	Merge "CQL binary protocol unset value support" from Pekka This patch series adds support for "unset values" that were introduced in CQL binary protocol v4. They allow bound statements to skip updates to some or all of the bound variables. Unset values are specified using the BoundStatement.unset() method in the Java driver: http://docs.datastax.com/en/drivers/java/3.1/com/datastax/driver/core/BoundStatement.html#unset-int- and using the UNSET_VALUE constant in the Python driver: https://datastax.github.io/python-driver/api/cassandra/query.html#cassandra.query.UNSET_VALUE Fixes #2039. * 'penberg/cql-unset-values/v2' of github.com:cloudius-systems/seastar-dev: transport/server: CQL unset value support cql3/statements/select_statement: Unset value support cql3/user_types: Unset value support cql3/tuples: Unset value support cql3/maps: Unset value support cql3/sets: Unset value support cql3/lists: Unset value support cql3/constants: UNSET_VALUE constant cql3/constants: Unset value support cql3/attributes: Unset value support types.hh: Add field_name_as_string() to user_type_impl type cql3: Introduce raw_value and raw_value_view types	2017-01-29 10:59:01 +02:00
Pekka Enberg	533c8d3949	transport/server: CQL unset value support This patch implements support for CQL unset values at the protocol level. Fixes #2039	2017-01-27 09:24:36 +02:00
Pekka Enberg	2bd560118e	cql3/statements/select_statement: Unset value support	2017-01-27 09:24:36 +02:00
Pekka Enberg	baaf1779c5	cql3/user_types: Unset value support	2017-01-27 09:24:36 +02:00
Pekka Enberg	99c7dabd2a	cql3/tuples: Unset value support	2017-01-27 09:24:36 +02:00
Pekka Enberg	a0e6f6f371	cql3/maps: Unset value support	2017-01-27 09:24:36 +02:00
Pekka Enberg	f883e64d70	cql3/sets: Unset value support	2017-01-27 09:24:36 +02:00
Pekka Enberg	50ec81ee67	cql3/lists: Unset value support	2017-01-27 09:24:36 +02:00
Pekka Enberg	c4cd0a6541	cql3/constants: UNSET_VALUE constant	2017-01-27 09:24:36 +02:00
Pekka Enberg	063be3ed44	cql3/constants: Unset value support	2017-01-27 09:24:36 +02:00
Glauber Costa	b4ac2c1d60	debug: add systemtap script to measure interesting latencies during cache updates. Example output: Measuring Scylla row cache update times ^C Total update time, (usec) value \|-------------------------------------------------- count 2 \| 0 4 \| 0 8 \|@@ 2 16 \|@@@ 3 32 \| 0 64 \| 0 128 \|@@@@ 4 256 \|@@ 2 512 \| 0 1024 \| 0 Time spent per partition batch (nsec) value \|-------------------------------------------------- count 128 \| 0 256 \| 0 512 \| 43 1024 \| 2 2048 \| 2 4096 \| 45 8192 \| 349 16384 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 61494 32768 \|@@@@@@@@@@@@@@@@@ 21497 65536 \| 0 131072 \| 0 Partitions updated per batch: value \|-------------------------------------------------- count 0 \| 57 1 \| 46 2 \| 76 4 \| 134 8 \| 324 16 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 82795 32 \| 0 64 \| 0 Total partitions updated: 2485000 Average time spent per partition batch (nsec): 28816 Average time per partition per partition (nsec): 967 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-01-26 22:15:16 -05:00
Glauber Costa	69dbb3e108	row_cache: yield if need_preempt(), even if there is quota left. The quota check is quite old at the moment, and dates back to a time in which the infrastructure in seastar threads was lacking a lot. It is a bad check since it will not take into consideration the size of the partition or the time it takes to merge them. A better check would at least take need_preempt() into account, so that we would respect the task quota. That check is now embedded into should_yield(), so there would be no need to check anything else. Although should_yield() does the job, it is still currently quite expensive. And because we are in a seastar thread with a computationally intensive loop, it can hurt latency a lot. So as a temporary measure, let's at least check for need_preempt() - as it is hurting real users at the moment - and soon work on making should_yield() cheaper. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-01-26 22:10:54 -05:00
Glauber Costa	0e1f64b163	row_cache: add systemtap markers for the update process update is one of our biggest sources of performance issues as far as the cache is concerned. systemtap can be useful in helping tracking some of them down. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-01-26 21:56:32 -05:00
Duarte Nunes	937ed1bacb	bound_view: Simplify copy ctor By using default generation. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Reviewed-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1485355007-1913-1-git-send-email-duarte@scylladb.com>	2017-01-26 19:29:29 +02:00
Avi Kivity	b91b9b351a	Revert "Merge seastar upstream" This reverts commit f301c678bfe5eb5df71f71fd20e08b422b1023bb; the rpc changes don't compile due to rpc timeout type change.	2017-01-26 18:30:56 +02:00
Avi Kivity	f301c678bf	Merge seastar upstream * seastar 397685c...f5fa2e3 (3): > rpc: use lowres_clock instead of high resolution one > semaphore: make semaphore's clock configurable > rpc: detect timedout outgoing packets earlier	2017-01-26 18:16:14 +02:00
Pekka Enberg	3385144860	cql3/attributes: Unset value support	2017-01-26 13:50:04 +02:00
Pekka Enberg	630aba32ff	types.hh: Add field_name_as_string() to user_type_impl type This is needed to construct validation error messages when user types encounter unset values.	2017-01-26 13:50:04 +02:00
Pekka Enberg	be0351b49c	cql3: Introduce raw_value and raw_value_view types Currently, the code is using bytes_opt and bytes_view_opt to represent CQL values, which can hold a value or null. In preparation for supporting a third state, unset value introduced in CQL v4, introduce new raw_value and raw_value_view types and use them instead. The new types are based on boost::variant<> and are capable of holding null, unset values, and blobs that represent a value.	2017-01-26 13:50:04 +02:00
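The three-state value described in the raw_value commit above (null, unset, or a blob) can be sketched with a variant. The commit says the real types are based on boost::variant<>; this sketch swaps in std::variant (C++17) to stay dependency-free, and the names raw_value_sketch, null_value, unset_value and the accessor set are hypothetical rather than the cql3 API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

struct null_value {};   // the bound variable is explicitly null
struct unset_value {};  // CQL v4 "unset": skip the update for this variable
using bytes = std::string;  // stand-in for the serialized blob type

class raw_value_sketch {
    using storage = std::variant<null_value, unset_value, bytes>;
    storage _data;
    explicit raw_value_sketch(storage data) : _data(std::move(data)) {}
public:
    static raw_value_sketch make_null() { return raw_value_sketch(storage(null_value{})); }
    static raw_value_sketch make_unset() { return raw_value_sketch(storage(unset_value{})); }
    static raw_value_sketch make_value(bytes b) { return raw_value_sketch(storage(std::move(b))); }

    bool is_null() const { return std::holds_alternative<null_value>(_data); }
    bool is_unset() const { return std::holds_alternative<unset_value>(_data); }
    bool is_value() const { return std::holds_alternative<bytes>(_data); }

    // Returns the blob, or nullptr for null and unset. Callers that must
    // distinguish null from unset check is_unset() first; an optional-of-bytes
    // representation alone cannot express that difference.
    const bytes* data() const { return std::get_if<bytes>(&_data); }
};
```

A statement that receives an unset value can then skip the corresponding update, while a null still overwrites with null, which is the distinction bytes_opt could not carry.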
Gleb Natapov	64660397fc	storage_proxy: move operation type information from counter's name to a label Makes it much more flexible to view the data in various ways in Grafana. Message-Id: <20170126102746.GL11469@scylladb.com>	2017-01-26 12:38:29 +02:00
Tomasz Grabiec	2c7902fb2b	Revert "lsa: Reduce reclamation latency" This reverts commit `d61002cc33`. Introduced a regression in row_cache_alloc_stress. The problem is that reclaim_from_evictable() evicts way too much after the refactor due to the stop condition not taking into account how much data was evicted so far and only looking at occupancy of the minimal segment. This may lead to eviction of the whole region.	2017-01-26 10:43:18 +01:00
Paweł Dziepak	8cdffd7c57	time_type_impl: value initialize result parse_time() adds hours, minutes, etc. to a final value 'result'. However, it is of type std::chrono::nanoseconds, which means it is not zeroed at initialization unless explicitly asked to do so. Fixes debug mode failures in types_test and cql_query_test. Message-Id: <20170125155239.1253-1-pdziepak@scylladb.com>	2017-01-25 17:56:31 +02:00
Paweł Dziepak	034d028329	Merge "range_tombstone_list: Properly implement difference()" from Duarte "This patchset properly implements range_tombstone_list::difference(), which was very broken. We add unit tests for the function and ensure we always randomly generate range_tombstones in other unit tests so other problems aren't hidden."	2017-01-25 12:08:19 +00:00
Duarte Nunes	8c65b98ea7	mutation_merger: Emit deferred tombstones This patch ensures the mutation_merger emits any deferred tombstones that it still may be holding before closing the stream. Together with the range_tombstone_list: Properly implement difference() patch set, this fixes breakage of streamed_mutation_test and row_cache_test. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170123195643.9876-1-duarte@scylladb.com>	2017-01-25 12:02:03 +00:00
Takuya ASADA	bce0fb3fa2	dist: add lspci to dependencies, since it is used by dpdk-devbind.py On a minimal setup environment scylla_sysconfig_setup will fail because the lspci command is not installed. So install it at package installation time. Fixes #2035 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1485327435-20543-1-git-send-email-syuu@scylladb.com>	2017-01-25 10:22:57 +02:00
Avi Kivity	d2fc98270e	Merge seastar upstream * seastar 6d80c6a...397685c (4): > Merge "add label to the io_queue" from Amnon > rpc: Modify the shutdown code to wait and handle exceptions > tls.cc: Fix shutdown_input/output to conform with expected socket behaviour > core: Add counter for polls	2017-01-24 18:36:25 +02:00
Gleb Natapov	ccee01f352	storage_proxy: put datacenter name into a label instead of counter's name Having datacenter name as a label makes it possible to create Prometheus board for the counters. Message-Id: <20170124132051.GX11469@scylladb.com>	2017-01-24 15:27:34 +02:00
Duarte Nunes	54a464ae27	random_mutation_generator: Always generate range tombstones Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-01-23 19:02:23 +01:00
Duarte Nunes	a01aa91c82	range_tombstone_list: Add unit tests for difference() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-01-23 18:14:33 +01:00
Duarte Nunes	85315d1760	range_tombstone_list: Correctly implement difference() The difference method wasn't properly implemented. The version in this patch correctly computes the difference and returns a range tombstone list containing those range tombstones in "this" but absent from the other, specified range tombstone list. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-01-23 18:14:33 +01:00
Duarte Nunes	e7d20ea900	range_tombstone_list: Add apply() convenience overload Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-01-23 18:14:33 +01:00
Duarte Nunes	0847954d92	bound_view: Add copy ctor and assignment operator Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-01-23 18:14:33 +01:00
Avi Kivity	1758361640	Merge seastar upstream * seastar 38aaa4a...6d80c6a (2): > DPDK: Change the metrics registration with label support > metric: Fix the error: could not convert {...} from <brace-enclosed initializer list> to struct metric_definition_impl	2017-01-23 11:55:21 +02:00
Takuya ASADA	f6d7a76223	dist: rename dist/ubuntu to dist/debian We now support both Ubuntu and Debian in dist/ubuntu, and Ubuntu is a Debian variant, so dist/debian is a better name. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1485161896-21851-1-git-send-email-syuu@scylladb.com>	2017-01-23 10:59:52 +02:00
Avi Kivity	31c8e6885b	build: improve support for custom builds Add a counter field to RELEASE, just before the date, and fix it at zero. This allows custom package builds to override it in a way that sorts before the official packages. Example: Official release: 1.6.0-0.20160120.<githash> Custom release 1: 1.6.0-1.avi.20160121.<githash> Custom release 2: 1.6.0-2.avi.20160122.<githash> The counter (0/1/2) ensures that the build number dominates over the date when sorting. Message-Id: <20170122102814.19649-1-avi@scylladb.com>	2017-01-22 14:56:52 +02:00
Avi Kivity	1be9c232b6	Merge seastar upstream * seastar ff098c8...38aaa4a (1): > metrics: equal operator should use ==	2017-01-22 14:41:59 +02:00
Tomasz Grabiec	834df74df0	Merge batch statement optimization from github.com/avikivity/scylla/1689/v2 From Avi: In many cases, batch statements are used to mutate a single partition, or a number of partitions that is smaller than the number of statements within the batch. We can detect this case and reduce the numbers of mutations applied, and in some cases, convert a logged batch into an unlogged batch. Ref #1689.	2017-01-20 13:44:05 +01:00
Tomasz Grabiec	6c75614d19	sstables: Fix input_stream not being closed by index_reader Fixes #2022 Message-Id: <1484912679-5729-1-git-send-email-tgrabiec@scylladb.com>	2017-01-20 11:58:33 +00:00
Paweł Dziepak	19ad35610b	sstables: do not discard future returned by fast_forward_to() continuous_data_consumer::fast_forward_to() returns a future which was later ignored by data_consume_context::fast_forward_to(). With the current implementation, the future in question is always ready and that's why the problem didn't manifest itself in the form of crashes or invalid results. Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>	2017-01-20 12:22:17 +01:00
Avi Kivity	a9403877e4	cql3: add more metrics for batch statements - how many statements are in a batch - different types of batches - whether we were able to convert a logged batch to an unlogged batch	2017-01-20 13:19:00 +02:00
Avi Kivity	e3c003544d	cql3: optimize batch_statement when the same partition is mutated multiple times Batch statements are often used to insert multiple rows into the same partition. Recognize this case and merge mutations to the same partition. If the result is a single mutation, there is an additional win (already present in the code), where a logged batch can be converted into an unlogged batch. Ref #1689.	2017-01-20 13:18:56 +02:00
Benoît Canet	bcc826cc34	mutation_reader: Short circuit the read path on empty range Add a boolean to short circuit the read path on empty range hoping for some speedup. tested in read write with cs using: cl=QUORUM duration=1m -mode native cql3 -rate threads=700 -node localhost Will do some additional benchmark. Fixes #1056 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <20170118194451.16836-1-benoit@scylladb.com>	2017-01-20 10:05:40 +00:00
Avi Kivity	54b8acdd9f	dht: add hashing and comparison helpers to dht::decorated_key An std::hash specialization, and an equality comparator.	2017-01-20 11:24:14 +02:00
Avi Kivity	141048e0e5	dht: improve token hash function For a small token, we can just return it, since it already is a hash. We hash large tokens using murmur3, which is supposedly a good hash.	2017-01-20 11:24:14 +02:00
Raphael S. Carvalho	1857ba0abc	db: fix bad resource usage distribution when resharding due to refresh That's because a single shard is used to calculate generation for new sstables in upload directory, and that will result in that single shard sharing all the resources with other shards. For refresh without upload dir, it currently works fine because we reshuffle column family dir instead. flush_upload_dir() is now a free function, takes a distributed database object, and uses calculate_shard_from_sstable_generation() to decide which shard will move sstable using its own generation namespace. Fixes #2008. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>	2017-01-19 18:55:21 +02:00
Duarte Nunes	d53f96e0da	column_family: Only update stats once for a shared sstable This patch ensures that when adding a shared sstable, we select only one cpu to update that column family's stats. This is important so we don't overestimate the on-disk size of sstables when resharding. This fixes only a temporary miscount of the current load, since shared sstables are eventually re-written, but it fixes a permanent miscount of the total load. Refs #1592 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170119144823.31041-1-duarte@scylladb.com>	2017-01-19 17:40:35 +02:00
Tomasz Grabiec	d61002cc33	lsa: Reduce reclamation latency Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. The change improves worst case latency. Reclamation time statistics over 30 second period after cache fills up, in microseconds: Before: avg = 1524.283148 stdev = 11021.021118 min = 12.934000 max = 144356.000000 sum = 257603.852000 samples = 169 After: avg = 1317.362414 stdev = 1913.542802 min = 263.935000 max = 19244.600000 sum = 175209.201000 samples = 133 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-01-19 17:35:36 +02:00
Amos Kong	b880bdccef	dist/redhat: fix path of housekeeping.cfg scylla-housekeeping[3857]: Config file /etc/scylla.d/housekeeping.cfg is missing, terminating Housekeeping failed to execute for missing the config file, the config file should be in /etc/scylla.d/. Fixes #2020 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <e63f2f8cb94410a6dca4e6193932f0079755ad47.1484724328.git.amos@scylladb.com>	2017-01-19 11:08:46 +02:00
Avi Kivity	3c05a81ef9	Merge seastar upstream * seastar 240b0bf...ff098c8 (15): > metrics::impl::shard(): check if reactor is initialized before using it > reactor: introduce engine_is_ready() > fix metric name > Merge "Add label support to the metric layer" from Amnon > core: Avoid memory leak when submission to syscall_work_queue fails > core: Avoid memory leak when submission to smp_message_queue fails > core: append_challenged_posix_file_impl: Make exception-safe > Merge "Log backtrace in report_failed_future" from Tomasz > install-dependencies.sh: add systemtap-sdt-dev to Ubuntu/Debian dependencies > core: add fsqual.cc/.hh to core > dpdk: Fix compile error with rte_pci.h > fstream_test: fix spurious failures due to BOOST_REQUIRE_EQUAL thread-unsafety > reactor: unregister metrics of queue on shard 0 > build: track system header changes too > Prometheus: do not rely on collectd for the hostname	2017-01-19 11:00:12 +02:00
Tomasz Grabiec	dd0fb48564	sstables: Close _file even if random_access_reader::close() reports errors close() operation is like a destructor, it cannot fail. It just reports errors, but close itself succeeds. So we should proceed with the closing even if it fails. Message-Id: <1484245886-7269-1-git-send-email-tgrabiec@scylladb.com>	2017-01-18 12:41:55 +00:00
Tomasz Grabiec	d048eec254	row_cache: Fix stats handling for uncached wide partitions Report hitting wide partition dummy as a cache miss instead of a hit. Refs #2011 Message-Id: <1484302266-3828-1-git-send-email-tgrabiec@scylladb.com>	2017-01-18 09:58:04 +00:00
Tomasz Grabiec	87f15624f4	row_cache: Add counter for wide partition mispopulations Message-Id: <1484733250-14470-1-git-send-email-tgrabiec@scylladb.com>	2017-01-18 09:57:51 +00:00
Calle Wilund	5da92db432	cell_comparator: Better fix (i.e. potentially correct) for compound/clustered desc. As Tomek pointed out, previous code, regardless of version mismatch, of generating comparator description string was not correct (as in: in sync with origin). This modifies it to look at 1.) Actual clustering size 2.) Compound-ness 3.) Dense-ness to determine whether we should generate a compound desc, and whether it should contain a trailing utf8-desc type. v2: Simplify non-dense base column addition and ensure it handles thrift non-utf8 (as per comments from tomek) Message-Id: <1484670171-18362-1-git-send-email-calle@scylladb.com>	2017-01-17 18:03:11 +01:00
Amnon Heiman	e19fa02a17	remove scollectd from headers As the metrics migration progressed, some includes of scollectd.hh were left behind. Because of the nature of the scollectd implementation, those includes bring a lot of code with them into the header files and eventually into many source files. This patch removes those includes and adds a missing include to storage_proxy.cc. That the compiler didn't complain is an indication of the problematic nature of those includes in the first place. Before this patch, a change in metrics.hh would cause 169 files to compile; after this change, 17. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>	2017-01-17 17:39:47 +02:00
Calle Wilund	7d2a4defcf	schema: Fix version check for comparator desc string formatting Fixes #2019 According to the Java driver and cassandra, all versions < 3 include the PK in the comparator descriptor string. This broke for us when bumping the cassandra version 2.1 -> 2.2 Message-Id: <1484657580-14411-1-git-send-email-calle@scylladb.com>	2017-01-17 14:59:47 +02:00
Tomasz Grabiec	ddfee57c97	Replace iostream include with iosfwd in headers Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>	2017-01-17 14:52:44 +02:00
Tomasz Grabiec	50e3e3af08	db: Add missing include Message-Id: <1484656119-8386-3-git-send-email-tgrabiec@scylladb.com>	2017-01-17 14:52:44 +02:00
Tomasz Grabiec	ea9ab36ad5	db: Move operator<<() definition to .cc Message-Id: <1484656119-8386-2-git-send-email-tgrabiec@scylladb.com>	2017-01-17 14:52:43 +02:00
Duarte Nunes	c8cbfb7919	storage_service: Make MV feature experimental This patch ensures that the host only announces and registers the MATERIALIZED_VIEWS feature if it was started with the experimental flag. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170116123412.21365-1-duarte@scylladb.com>	2017-01-16 15:45:25 +02:00
Tomasz Grabiec	a559a7ae19	streamed_mutation: Fix memory corruption when reader constructor throws After we call unlink_leftmost_without_rebalance(), we must unlink all elements before the mutation is destroyed. We did this properly from ~reader, but it would not be called if reader construction failed, which it may. Message-Id: <1484572581-6537-1-git-send-email-tgrabiec@scylladb.com>	2017-01-16 13:26:30 +00:00
Paweł Dziepak	e03868c226	tests: run with all features enabled Since `ce083308a1` "random_mutation_generator: Generate RTs by default" random mutation generator produces range tombstones. However, so far the tests were run with all features disabled (because of incomplete initialization of all services) which meant that RANGE_TOMBSTONE feature was not enabled and the code couldn't handle range tombstones that weren't just prefixes. This patch solves the problem by forcing all features to be enabled when tests are run. Message-Id: <20170116103324.22956-1-pdziepak@scylladb.com>	2017-01-16 11:38:45 +01:00
Tomasz Grabiec	3c3a4358ae	storage_proxy: Fix capturing of on-stack variable by reference partition_range_count was accepted by do_with callback by value and then captured by reference by async code, thus invoking use after destroy. Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>	2017-01-16 11:49:11 +02:00
Avi Kivity	c314047b6c	config: disable new sharding algorithm It still has problems: - while resharding a very large leveled compaction strategy table, a huge amount of tiny sstables are generated, overwhelming the file descriptor limits - there is a large impact on read latency while resharding is going on (cherry picked from commit `cf27d44412`) (forward-ported from branch-1.6)	2017-01-15 10:48:53 +02:00
Tomasz Grabiec	66547e7d7c	storage_proxy: Add missing initialization of _short_read_allowed Dropped by `a1cafed370` ("storage_proxy: handle range scans of sparsely populated tables"). Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test. Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>	2017-01-13 16:47:54 +02:00
Takuya ASADA	bee7f549a9	scylla-housekeeping: move uuid file to /var/lib/scylla-housekeeping Since scylla-housekeeping runs as the scylla user, it doesn't have permission to create a file in /etc/scylla.d. So introduce /var/lib/scylla-housekeeping, which is owned by the scylla user, and place the uuid file in that directory. Fixes #2009 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>	2017-01-13 16:27:53 +02:00
Avi Kivity	c227e3e706	Merge "move a few files in the ScyllaDB project to use the new metrics registration API" from Vlad * 'rearrange-scylla-collectd-stats-registration-v3' of github.com:cloudius-systems/seastar-dev: thrift::server: move collectd counters registration to the metrics registration layer gms::gossiper: move collectd counters registration to the metrics registration layer utils::logalloc: move collectd counters registration to metrics registration layer streaming::stream_manager: move a collectd counters registration to the metrics registration layer db::commitlog::commitlog: move collectd counters registration to the metrics registration layer sstables::compaction_manager: move collectd metrics registration to the metrics registration layer db::batchlog_manager: move collectd registration to the metrics registration layer transport::server: move collectd metrics registration to the metrics registration layer cql3::query_processor: move collectd metrics registration to the metrics registration layer database: move collectd registrations to metrics registration layer tracing::trace_keyspace_helper: move collectd metrics registration to a metric registration layer tracing::trace_keyspace_helper: fix alignment tracing::tracing: move collectd metrics registration to metrics registration layer	2017-01-12 17:13:08 +02:00
Tomasz Grabiec	1e8151b4f2	storage_proxy: Fix use-after-free on one_or_two_partition_ranges query_mutations_locally() takes one_or_two_partition_ranges by reference and requires, indirectly, that it is kept alive until operation resolves. However, we were passing expiring value to it, the result of unwrap(). Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test Another potential problem was that we were dereferencing "s" in the same expression which move-constructs an argument out of it. Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>	2017-01-12 15:10:51 +02:00
Takuya ASADA	c07d703d0d	dist/redhat/scylla.spec.in: fix typo of scylla_cpuscaling_setup Fix packaging error Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1484191955-28006-2-git-send-email-syuu@scylladb.com>	2017-01-12 12:13:33 +02:00
Takuya ASADA	0e6df2a82e	dist: follow DPDK script renaming In DPDK 16.11 dpdk_nic_bind.py is renamed to dpdk-devbind.py, so we are getting "file not found" both in packaging and in scripts; fixed that. Also fixed inconsistent packaging. Since Seastar copied dpdk_nic_bind.py to its scripts/ directory, there are two different versions of the script, and the .rpm/.deb packages ship different ones: dist/redhat: seastar/dpdk/tools/dpdk_nic_bind.py dist/ubuntu: seastar/scripts/dpdk_nic_bind.py That won't work because we share setup scripts between the two distributions, so I changed the dist/ubuntu package to use the DPDK one. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1484191955-28006-1-git-send-email-syuu@scylladb.com>	2017-01-12 12:13:33 +02:00
Gleb Natapov	76aed548e3	storage_proxy: add replica side counters for data read Message-Id: <20170112085907.GN11469@scylladb.com>	2017-01-12 11:41:04 +02:00
Vlad Zolotarov	ca0a0f1458	tracing::trace_keyspace_helper: use generate_legacy_id() for CF IDs generation Explicitly generate tables' IDs of tables from the system_traces KS using generate_legacy_id() in order to ensure all Nodes create these tables with the same IDs. This is going to prevent hitting issue #420. Fixes #1976 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>	2017-01-12 11:36:35 +02:00
Tomasz Grabiec	33e1f9af6b	sstables: Close input_stream from random_access_reader Spotted by destroy-without-close detector. Message-Id: <1484072527-13058-1-git-send-email-tgrabiec@scylladb.com>	2017-01-11 09:40:00 +00:00
Duarte Nunes	ce083308a1	random_mutation_generator: Generate RTs by default This patch changes the random_mutation_generator so it generates range tombstones by default. This fixes a bug where reversibly applying range tombstones wasn't being tested. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170110164822.28747-1-duarte@scylladb.com>	2017-01-11 09:24:37 +00:00
Vlad Zolotarov	7fb0bab7d7	thrift::server: move collectd counters registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:55 -05:00
Vlad Zolotarov	eb4fbb3949	gms::gossiper: move collectd counters registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:55 -05:00
Vlad Zolotarov	022bca16bf	utils::logalloc: move collectd counters registration to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:55 -05:00
Vlad Zolotarov	a850bea820	streaming::stream_manager: move a collectd counters registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	dcdd98ccc1	db::commitlog::commitlog: move collectd counters registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	00e37c389b	sstables::compaction_manager: move collectd metrics registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	a9f6e5f8da	db::batchlog_manager: move collectd registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	3b41d589f8	transport::server: move collectd metrics registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	8d0a2e3883	cql3::query_processor: move collectd metrics registration to the metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	cda382e8d6	database: move collectd registrations to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	af29c3506b	tracing::trace_keyspace_helper: move collectd metrics registration to a metric registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	0df37c04f6	tracing::trace_keyspace_helper: fix alignment Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Vlad Zolotarov	6267bb63f4	tracing::tracing: move collectd metrics registration to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Avi Kivity	1ff0eef0a8	intrusive_set_external_comparator: avoid using boost::intrusive::value_traits_pointers boost::intrusive::value_traits_pointers was introduced in boost 1.56, while we also support boost 1.55. Replace with an equivalent expression. (with additions by Asias) Message-Id: <20170110084700.19994-1-avi@scylladb.com>	2017-01-10 18:16:56 +02:00
Pekka Enberg	3d0217ec43	db/schema_tables: Fix system keyspace table list Commit `f0c28e1` ("db/schema_tables: Add schema_functions and schema_aggregates tables") forgot to add the newly added tables to the db::schema_tables::ALL list, which is used for authorization checks, for example. Fixes the following auth_test.py dtest failures: ('Unable to connect to any servers', {'127.0.0.1': Unauthorized('Error from server: code=2100 [Unauthorized] message="User cathy has no SELECT permission on <table system.schema_functions> or any of its parents"',)}) Message-Id: <1484045277-4997-1-git-send-email-penberg@scylladb.com>	2017-01-10 13:55:04 +01:00
Avi Kivity	0591303b72	Merge "avoid excessive memory usage during resharding" from Raphael "Intended to reduce memory usage when resharding by sharing sstable components among shards. File descriptors are also shared from now on, meaning that a much smaller number of file descriptors will be used during resharding. Fixes #1951." branch 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla * 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla: db: avoid excessive memory usage during resharding checked_file_impl: add support to dup sstables: group sstable components that can be shared among shards sstables: rename sstable member	2017-01-09 20:43:50 +02:00
Raphael S. Carvalho	68dfcf5256	db: avoid excessive memory usage during resharding After resharding, sstables may be owned by all shards, which means that file descriptors and memory usage for metadata will increase by a factor equal to number of shards. That can easily lead to OOM. SSTable components are immutable, so they can be stored in one shard and shared with others that need it. We use the following formula to decide which shard will open the sstable and share it with the others: (generation % smp::count), which is the inverse of how we calculate generation for new sstables. So if no resharding is performed, everything is shard-local. With this approach, resource usage due to loaded sstables will be evenly distributed among shards. For this approach to work, we now only populate keyspaces from shard 0. It's now the sole responsible for iterating through column family dirs. In addition, most of population functions are now free and take distributed database object as parameter. Fixes #1951. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-09 15:24:36 -02:00
Raphael S. Carvalho	9200e389c2	checked_file_impl: add support to dup That's needed for sstable fd sharing to work. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-09 13:33:30 -02:00
Avi Kivity	77cb2b452f	Merge "CQL 3.3.1 support" from Pekka "This patch series adds support for CQL 3.3.1. The changes to CQL are listed here: https://github.com/apache/cassandra/blob/cassandra-2.2/doc/cql3/CQL.textile#changes The following CQL features are already supported by Scylla: - TRUNCATE TABLE alias - Double-dollar string literals - Aggregate functions: MIN, MAX, SUM, and AVG This series adds the following CQL features: - New data types: tinyint, smallint, date, and time - CQL binary protocol v4 (required by the new data types) - Advertise Cassandra 2.2.8 version from Scylla so that drivers correctly detect the presence of CQL 3.3.1 The following CQL features are not supported by Scylla: - Role-based access control (issue #1941) - JSON data type - User-defined functions (UDFs) - User-defined aggregates (UDAs) The following CQL binary protocol v4 changes are not implemented by this series: - Read_failure and Write_failure error codes are not implemented. These error codes are not used by the smart drivers, but as they are propagated to application code, we eventually need to wire them up to our storage proxy implementation. - Function_failure error code is only used by user-defined functions and the fromJson function, which are not implemented by Scylla. Fixes #1284." 
* 'penberg/cql-3.3.1/v5' of github.com:cloudius-systems/seastar-dev: version: Bump Cassandra version to 2.2.8 db/schema_tables: Add schema_functions and schema_aggregates tables tests/type_tests: TIME type test cases tests/cql_query_test: TIME type test cases cql3: TIME data type support tests/type_tests: DATE type test cases tests/cql_query_test: DATE type test cases cql3: DATE type support date.h: 64-bit year and days representation licenses: Add utils/date.h license utils/date.h: Import date and time library sources tests/type_tests: TINYINT and SMALLINT type test cases tests/cql_query_test: TINYINT and SMALLINT type test cases cql3: TINYINT and SMALLINT data type support types: Fix integer_type_impl::parse_int() for bytes	2017-01-09 11:54:45 +02:00
Avi Kivity	8f36dca6f1	storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan mutation_result_merger::get() assumes that the merged result may be a short read if at least one of the partial results is a short read (in other words, if none of the partial results is a short read, then the merged result is also not a short read). However this is not true; because we update the memory accounter incrementally, we may stop scanning early. All the partial results are full; but we did not scan the entire range. Fix by changing the short_read variable initialization from `no` (which assumes we'll encounter a short read indication when processing one of the batches) to `this->short_read()`, which also takes into account the memory accounter. Fixes #2001. Message-Id: <20170108111315.17877-1-avi@scylladb.com>	2017-01-09 09:21:43 +00:00
Pekka Enberg	856d0e40fb	version: Bump Cassandra version to 2.2.8 Advertise Cassandra 2.2.8 version to the drivers: CQL 3.3.1 language version and CQL binary protocol version 4 support.	2017-01-09 10:42:21 +02:00
Pekka Enberg	f0c28e1b2d	db/schema_tables: Add schema_functions and schema_aggregates tables The 3.0.3 Java driver, for example, searches for the tables and fails when we advertise Cassandra 2.2 version from Scylla.	2017-01-09 10:42:21 +02:00
Pekka Enberg	10facd7db8	tests/type_tests: TIME type test cases	2017-01-09 10:42:21 +02:00
Pekka Enberg	a49ee9387e	tests/cql_query_test: TIME type test cases	2017-01-09 10:42:20 +02:00
Pekka Enberg	93e6592296	cql3: TIME data type support This adds support for the TIME data type introduced in CQL 3.3.1. Refs #1284	2017-01-09 10:42:20 +02:00
Pekka Enberg	9ceea7bbc4	tests/type_tests: DATE type test cases	2017-01-09 10:42:20 +02:00
Pekka Enberg	f0cbfb9e4f	tests/cql_query_test: DATE type test cases	2017-01-09 10:42:20 +02:00
Pekka Enberg	9def7db381	cql3: DATE type support This adds support for the DATE type introduced in CQL 3.3.1. Refs #1284	2017-01-09 10:42:20 +02:00
Pekka Enberg	f83503c09e	date.h: 64-bit year and days representation We need 64-bit year and days representation to support the boundary values of the CQL data type, which is implemented using Joda Time library's DateTime type.	2017-01-09 10:42:20 +02:00
Pekka Enberg	41df14f62d	licenses: Add utils/date.h license	2017-01-09 10:42:20 +02:00
Pekka Enberg	7f2fc6470c	utils/date.h: Import date and time library sources This patch imports the "date.h" date and time library based on the C++11 <chrono> header, which is proposed for standardization: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0355r1.html We need it to implement support for the CQL date type. Import repository https://github.com/HowardHinnant/date Import commit: commit 2935f80109b8cfc15eb1243afe35f7ec3530f971 Author: Howard Hinnant <howard.hinnant@gmail.com> Date: Sun Jan 1 15:02:08 2017 -0500 Have get_version check for the file named version first	2017-01-09 10:39:54 +02:00
Takuya ASADA	42c1e1e0e8	dist/common/systemd: run node-exporter.service as scylla user For security reason, we should run node-exporter.service as scylla user, instead of root. Fixes #1968 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1483543419-16541-1-git-send-email-syuu@scylladb.com>	2017-01-09 09:51:47 +02:00
Paweł Dziepak	3339cced05	sstables: file_writer: make write() non-virtual No one overrides file_writer::write(), so there is no reason to inhibit optimisations and cause the compiler to emit indirect calls. Message-Id: <20170104163618.26251-1-pdziepak@scylladb.com>	2017-01-09 09:47:37 +02:00
Takuya ASADA	5422a8e046	dist/ubuntu: generate Ubuntu/Debian revision correctly The Ubuntu Packaging Guide says that if there's no upstream package (meaning it's not ported from Debian), the revision should be "0ubuntu1", not "ubuntu1", which is what we currently use. On Debian, the Debian Policy Manual says it's conventional to restart the revision from 1 when the upstream version increases, so we should set it to "1". To do this in a single script, we generate the revision at build time. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1483498658-27491-1-git-send-email-syuu@scylladb.com>	2017-01-09 09:45:46 +02:00
Takuya ASADA	920683a882	dist/common/scripts: add scylla_cpuscaling_setup To setup cpu scaling governor to 'performance', add new script to do it on scylla_setup. Fixes #1895 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1483542216-12195-1-git-send-email-syuu@scylladb.com>	2017-01-09 09:44:41 +02:00
Avi Kivity	97ab0d9feb	build: track system header changes too Changes to boost headers should trigger a rebuild if they change.	2017-01-08 20:49:19 +02:00
Avi Kivity	85f4e16336	main: fix incorrect low memory warning A spurious division by smp::count warns that memory is low even when plenty is available. Fix by removing the division. Fix #2002. Message-Id: <20170108122216.27233-1-avi@scylladb.com> Tested-by: Benoît Canet <benoit@scylladb.com>	2017-01-08 15:14:36 +02:00
Amnon Heiman	8cd3d7445c	scylla_setup: remove the uuid file creation Scylla housekeeping can create a uuid file if it is missing. There is no longer a need to create one for it. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1483866553-13855-3-git-send-email-amnon@scylladb.com>	2017-01-08 14:11:04 +02:00
Amnon Heiman	32888fc0aa	scylla-housekeeping: Create a uuid file if one is missing This patch gets housekeeping to create a uuid file if a path to a uuid file is supplied but the file is missing. Because it imports the uuid lib, uuid parameters were renamed. Fixes #1987 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>	2017-01-08 14:11:03 +02:00
Gleb Natapov	9ed3346f98	main: fix error reporting about low memory Message-Id: <20170108112144.GT1829@scylladb.com>	2017-01-08 13:46:48 +02:00
Raphael S. Carvalho	eed2a7d065	sstables: group sstable components that can be shared among shards We intend to share immutable sstable components among shards to reduce excessive memory usage when resharding shared sstables. This change is about grouping those components into a structure, and using foreign ptr to make sure that the structure will be deleted by whichever shard created it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-06 15:16:19 -02:00
Raphael S. Carvalho	a492f8dfaf	sstables: rename sstable member Rename _components to _recognized_components because _components will be used to name a field with shareable components. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-06 15:16:17 -02:00
Avi Kivity	38b2fa27ad	Merge seastar upstream * seastar 1c8e389...240b0bf (15): > file/dup: don't decrease refcnt twice when file is explicitly closed > reactor: Add missing CentOS 7.2 dependency systemtap-sdt-devel > reactor: Cleaning the smp queue metrics when shuting down > metrics: metrics keep the value map while unregistering > change the reactor load metrics to utilization > Merge "ASan fiber switches" from Paweł > tls: Add missing credentials_builder::set_client_auth method > collectd: create metrics with the right format > io_queue: remove owner number from metric name > reactor: change the load metric name to load > Merge "reactor: stop using signals for task_quota timer" > metrics: Allow initializing the metric_group in its constructor > Update DPDK to 16.11 > Revert "rpc: Avoid using zero-copy interface of output_stream" > core::metrics_groups: add a clear() method	2017-01-06 16:34:51 +02:00
Vlad Zolotarov	492295eb7f	init: move supervisor_notify() out of main.cc Transform the supervisor_notify() and related functions into the "supervisor" class and place this class implementation in a separate .cc file. This is going to fix the compilation breakage of tests introduced by a commit `8014adc2a1` init: serialize the creation of system_traces KS objects Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1483663955-20096-1-git-send-email-vladz@scylladb.com>	2017-01-06 10:10:55 +00:00
Avi Kivity	be11b054e1	Merge "Reduce the size of mutation_partition" from Piotr "Reduce the size of mutation_partition by implementing intrusive set using bi::rbtree_algorithms directly and using tree nodes optimized for size. This will reduce the size of mutation_partition by: 24 bytes + <number of cql rows> * 8 bytes This should have a positive impact on performance because mutation_partitions are stored both in memtable and cache. Fixes #742." * 'haaawk/742' of github.com:cloudius-systems/seastar-dev: intrusive_set: rename size() to calculate_size() Make intrusive_set_external_comparator::_value_traits static Implement intrusive set using rbtree_algorithms mutation_partition: make apply_reversibly_intrusive_set nongeneric mutation_partition: take schema in find_row and clustered_row mutation_partition: Extract intrusive set logic to a class. mutation_partition: Replace value_comp with key_comp calls	2017-01-05 17:34:10 +02:00
Tomasz Grabiec	cd630fece6	db: Make system tables use the commitlog Before this patch system table writes were not writing to commit log because database::add_column_family() disables writes to commit log for the table which is added if _commitlog is not set at that time. Fix by initializing commit log before system tables are created. Fixes #1986. Fixes recent regression in batch_test.py:TestBatch.replay_after_schema_change_test after scylla-jmx was updated to not flush system tables on nodetool flush. Could cause system keyspace writes to be delayed for more than before under heavy write workload. Refs #1926. Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>	2017-01-05 14:53:51 +02:00
Avi Kivity	eb520e7352	storage_proxy: fix result ordering for parallel partition range scans During a range scan, we try to avoid sorting according to partition range when we can do so. This is when we scan fewer than smp::count shards -- each shard's range is strictly ordered with respect to the others. However, we use the wrong key for the sort -- we use the shard number. But if we started at shard s > 0 and wrapped around to shard 0, then shard 0's range will be after the range belonging to shard s, but will sort before it. Fix by storing the iteration order as the sort key. We use that when we know that shards do not overlap (shards < smp::count) and the index within the source partition range vector when they do. Fixes #1998. Message-Id: <20170105114253.17492-1-avi@scylladb.com>	2017-01-05 12:51:37 +01:00
Vlad Zolotarov	8014adc2a1	init: serialize the creation of system_traces KS objects Serialize the creation of a system_traces KS objects when they do not exist - the initial cluster boot. Avoid creating them in parallel by different cluster Nodes in order to avoid issue #420. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1483552503-12873-3-git-send-email-vladz@scylladb.com>	2017-01-05 12:41:38 +01:00
Vlad Zolotarov	d3b8b67e66	service::storage_service: serialize the system_auth KS initialization Move the system_auth KS initialization to be before Node moves to the NORMAL state. This way we will serialize this code running on different Nodes and avoid hitting issue #420. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1483552503-12873-2-git-send-email-vladz@scylladb.com>	2017-01-05 12:36:06 +01:00
Piotr Jastrzebski	b159e08764	intrusive_set: rename size() to calculate_size() This hopefully will make it more apparent that the time complexity of this method is O(N) not O(1). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 12:21:43 +01:00
Piotr Jastrzebski	b47a296053	Make intrusive_set_external_comparator::_value_traits static _value_traits can be shared among all instances and there's no need to store it in every single one. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 12:21:10 +01:00
Avi Kivity	4667641f5f	result_memory_tracker: fix too-short short reads 1.6 truncates paged queries early to avoid overrunning server memory with too-large query results, but in the case of partition range queries, this terminates too early due to an uninitialized variable holding the maximum result size. This results in slow performance due to additional round trips. Fix by initializing the maximum result size from the result_memory_tracker running on the coordinating shard. Fixes #1995. Message-Id: <20170105103915.10633-1-avi@scylladb.com>	2017-01-05 10:51:55 +00:00
Piotr Jastrzebski	041b0a65ac	Implement intrusive set using rbtree_algorithms This new implementation takes less memory because it does not store comparator. It also uses tree nodes optimized for size. This means that instead of storing an enum field \|color\| they embed this information inside pointer to parent. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:46:58 +01:00
Piotr Jastrzebski	a0c20f5c49	mutation_partition: make apply_reversibly_intrusive_set nongeneric apply_reversibly_intrusive_set is used only in one place and always with rows_type. There's no need for it to be generic. This will allow changing intrusive set implementation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Piotr Jastrzebski	4bbe05dd47	mutation_partition: take schema in find_row and clustered_row This will allow intrusive set implementation that does not store schema. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Piotr Jastrzebski	fe3c91db90	mutation_partition: Extract intrusive set logic to a class. It will make it easier to change the implementation of the intrusive set. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Piotr Jastrzebski	da67ac7ae4	mutation_partition: Replace value_comp with key_comp calls This will reduce the size of bi::set API being used. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-01-05 11:26:03 +01:00
Pekka Enberg	0ea5652354	tests/type_tests: TINYINT and SMALLINT type test cases	2017-01-05 10:57:35 +02:00
Pekka Enberg	41e3327ebc	tests/cql_query_test: TINYINT and SMALLINT type test cases	2017-01-05 10:57:35 +02:00
Pekka Enberg	fcaa743e3d	cql3: TINYINT and SMALLINT data type support This adds support for the TINYINT and SMALLINT data types introduced in CQL 3.3.1. Refs #1284	2017-01-05 10:57:35 +02:00
Pekka Enberg	257fa541f1	types: Fix integer_type_impl::parse_int() for bytes The integer_type_impl::parse_int() function uses boost::lexical_cast() under the hood, which parses 8-bit numbers as characters. Fix the function to lexical cast to 64-bit integer and convert the result to integer_type_impl template type.	2017-01-05 10:57:35 +02:00
Nadav Har'El	45f19f2633	main: better error message on failing to start Prometheus Previously, if the Prometheus port (by default, 0.0.0.0:9180) could not be opened, the following message appeared in the log about 10 seconds into the run, and Scylla crashed. ERROR 2017-01-01 19:31:04,066 [shard 0] seastar - Exiting on unhandled exception: std::system_error (error system:98, Address already in use) The puzzled user would have no idea which address was already in use, why, or why Scylla stopped. In this patch, before the above message we get the much more informative message: ERROR 2017-01-01 19:58:19,080 [shard 0] init - Could not start Prometheus API server on 0.0.0.0:9180: std::system_error (error system:98, Address already in use) We continue to print the original message - and exit - in this case, under the assumption that it's better not to run the database while improperly configured. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170102121304.2060-1-nyh@scylladb.com>	2017-01-04 14:58:26 +02:00
Tzach Livyatan	0c746b22e0	Fix a typo in scylla_setup housekeeping prompt Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <1483362474-22113-1-git-send-email-tzach@scylladb.com>	2017-01-04 14:54:22 +02:00
Takuya ASADA	43655512e1	dist/redhat: add python-setuptools as a dependency since scylla-housekeeping requires it scylla-housekeeping breaks when python-setuptools isn't installed, so add it as a dependency. Fixes #1884 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1483525828-7507-1-git-send-email-syuu@scylladb.com>	2017-01-04 14:32:10 +02:00
Pekka Enberg	060841b756	tests/types_test: Fix int32 type string conversion boundary case The test case is interested in the upper boundary of 32-bit integer because we already test the lower boundary in assertions below. The old test passed, of course, but it wasn't very interesting. Message-Id: <1483522773-6008-1-git-send-email-penberg@scylladb.com>	2017-01-04 11:57:02 +01:00
Avi Kivity	3232d47d4f	dist: remove another bc dependency No longer used.	2017-01-01 11:13:34 +02:00
Tzach Livyatan	2bfa7cc086	dist/common/scripts: improve scylla_setup wording Fix a few minor typos and improve the user prompt text Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <1482918340-19375-1-git-send-email-tzach@scylladb.com>	2016-12-30 13:18:08 +02:00
Tzach Livyatan	436ce7ae49	conf/scylla.yaml: Move broadcast_rpc_address to the supported section Fixes #1779 Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <1483021417-8415-1-git-send-email-tzach@scylladb.com>	2016-12-29 16:24:56 +02:00
Takuya ASADA	e48cc9cf01	dist/ubuntu: check lsb_release existence since it's not included in a minimal Debian installation Ubuntu has it in a minimal installation but Debian doesn't, so add it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1483003565-2753-1-git-send-email-syuu@scylladb.com>	2016-12-29 11:33:21 +02:00
Pekka Enberg	a443dfa95e	tracing: Add seastar/core/scollectd.hh include Fix the following build breakage: FAILED: build/release/gen/cql3/CqlParser.o g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT -O2 -DBOOST_TEST_DYN_LINK -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp In file included from ./query-request.hh:31:0, from ./locator/token_metadata.hh:51, from ./locator/abstract_replication_strategy.hh:29, from ./database.hh:26, from ./service/storage_proxy.hh:44, from ./db/schema_tables.hh:43, from ./db/system_keyspace.hh:46, from ./cql3/functions/function_name.hh:45, from ./cql3/selection/selectable.hh:48, from ./cql3/selection/writetime_or_ttl.hh:45, from build/release/gen/cql3/CqlParser.hpp:63, from build/release/gen/cql3/CqlParser.cpp:44: ./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type scollectd::registrations _registrations; ^~~~~~~~~ Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>	2016-12-28 18:40:18 +02:00
Nadav Har'El	d49aa7abd2	storage_service: make is_joined() an immediate function Commit `d41cd48a` made the is_joined() method a future<bool> because only cpu 0 knows its real value. This makes this function inconvenient to use. So this patch reverts commit `d41cd48a`, and instead sets this flag's value on all shards, so each shard can read its value locally (and immediately). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20161228160450.5831-1-nyh@scylladb.com>	2016-12-28 18:37:22 +02:00
Pekka Enberg	2aee7f6334	Merge seastar upstream * seastar f32e4c2...1c8e389 (2): > Merge "migrate network related seastar collectd metrics to the new metrics registration API" from Vlad > file: add dup() support	2016-12-28 17:04:11 +02:00
Duarte Nunes	1444a52fae	position_in_partition: Add tri_comparator Will be needed to order view updates with the existing mutations. Signed-off-by: Duarte Nunes <duarte@scylladb.com> [pdziepak: corrected component name in commit message] Message-Id: <1482880989-3086-2-git-send-email-duarte@scylladb.com>	2016-12-28 13:04:16 +01:00
Duarte Nunes	c6b0387f31	clustering_bounds_comparator: Add tri_comparator This patch adds a tri_comparator for bound_view, which will be used to add a tri comparator to position_in_partition. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1482880989-3086-1-git-send-email-duarte@scylladb.com>	2016-12-28 13:02:57 +01:00
Duarte Nunes	adb727f7dc	clustering_row: Add apply() overload This patch adds an overload to the apply() function, which takes a clustering_row by reference, to copy. This will be needed by future patches, when merging base table updates with the existing data. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1482881106-3202-1-git-send-email-duarte@scylladb.com>	2016-12-28 12:45:12 +01:00
Pekka Enberg	302035577e	cql3/statements: Make batch_statement::_type private The _type member variable is never accessed outside of the batch_statement class so make it private. Message-Id: <1482921073-28485-1-git-send-email-penberg@scylladb.com>	2016-12-28 12:08:05 +01:00
Pekka Enberg	20daf43403	cql3/statements: Move batch_statement implementation to source file Clean up batch_statement class by moving implementation to the batch_statement.cc source file to make it easier to modify the class. Message-Id: <1482920872-28303-1-git-send-email-penberg@scylladb.com>	2016-12-28 12:30:03 +02:00
Duarte Nunes	86a109915d	streamed_mutations: Update comments This patch removes references to the old begin_range_tombstone and end_range_tombstone mutation_fragments, which have been replaced by a single range_tombstone fragment. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1482880820-2831-1-git-send-email-duarte@scylladb.com>	2016-12-28 09:06:49 +01:00
Gleb Natapov	4ca58959ad	storage_proxy: do not deref unengaged stdx:optional Fixes intentional short reads. Message-Id: <20161227142133.GE1829@scylladb.com>	2016-12-27 16:30:03 +02:00
Vlad Zolotarov	9606db2f08	api::set_tracing_probability: prevent a server from returning 500 for a bad probability value - Change an exception type thrown by a tracing::tracing::set_trace_probability() to make it different from the one thrown by an std::stod() when it fails to parse a given string. - Catch the std::out_of_range exception thrown by a tracing::tracing::set_trace_probability() and wrap the exception string into the httpd::bad_param_exception() object. - Throw a httpd::bad_param_exception() with a "Bad format in a probability value: <a user given probability string value>" message if std::invalid_argument is caught. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465300738-1557-1-git-send-email-vladz@cloudius-systems.com>	2016-12-27 12:07:09 +02:00
Avi Kivity	339cc0c2fa	main: verify sufficient memory per shard Refuse to boot if we don't have at least 1 GiB per shard, unless in developer mode. The primary violator here is docker, but since it starts in developer mode, it won't get fixed. We need some extra logic for this case. Message-Id: <20161221090222.28677-1-avi@scylladb.com>	2016-12-27 12:05:52 +02:00
Avi Kivity	868b4d110c	Merge "Fixes for intentional short reads" from Paweł "This patchset contains fixes for the changes introduced in "Query result size limiting". It also improves handling of short data reads. In order to minimise chances of digest mismatch during data queries, replicas that were asked just to return a digest also keep track of the size of the data (in the IDL representation) so that they would stop at the same point nodes doing full data queries would. Moreover, data queries are not affected by per-shard memory limit and the coordinator sends individual result size limits to replicas in order not to depend on hardcoded values. It is still possible to get digest mismatches if the IDL changes (e.g. a new field is added), but, hopefully, that won't be a serious problem." * 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev: query: introduce result_memory_accounter::foreign_state storage_proxy: fix short reads in parallel range queries storage_proxy: pass maximum result size to replicas mutation_partition: use result limiter for digest reads query: make result_memory_limiter constants available for linker result_memory_limiter: add accounter for digest reads idl: allow writers to use any output stream result_memory_limiter: split new_read() to new_{data, mutation}_read() idl: is_short_read() was added in 1.6 mutation_partition: honour allowed_short_read for static rows storage_proxy: fix _is_short_read computation storage_proxy: disallow short reads if got no live rows storage_proxy: don't stop after result with no live rows	2016-12-26 10:42:49 +02:00
Avi Kivity	1d9ee358f1	Revert "Merge "Reduce the size of mutation_partition" from Piotr" This reverts commit `aa392810ff`, reversing changes made to a24ff47c637e6a5fd158099b8a65f1191fc2d023; it uses boost::intrusive::detail directly, which it must not, and doesn't compile on all boost versions as a consequence.	2016-12-25 16:07:48 +02:00
Avi Kivity	59d389bd46	Merge seastar upstream * seastar 0b98024...f32e4c2 (11): > Merge "Moving the reactor counters to the metric layer" from Amnon > metrics: Metrics function should take variable as a refernce > Revert "Merge ""Moving the reactor counters to the metric layer from Amnon" > Merge ""Moving the reactor counters to the metric layer from Amnon > Revert "fstream: Auto-close data_sink and data_source" > rpc: Avoid resource unit leaks on failure > fstream: Auto-close data_sink and data_source > http: Move metrics registration to the metrics layer > output_stream: add batching to zero copy interface > Revert "slab: Move the metrics registration to the metrics layer" > slab: Move the metrics registration to the metrics layer	2016-12-25 15:50:09 +02:00
Amnon Heiman	70b2a1bfd4	Set the prometheus prefix to scylla This patch make the prometheus prefix configurable and set the default value to scylla. Fixes #1964 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>	2016-12-25 15:21:53 +02:00
Avi Kivity	b99a0fc076	licenses: clarify that licenses in this directory do not cover entire work	2016-12-25 12:59:38 +02:00
Avi Kivity	aa392810ff	Merge "Reduce the size of mutation_partition" from Piotr "Reduce the size of mutation_partition by implementing intrusive set using bi::rbtree_algorithms directly and using tree nodes optimized for size. This will reduce the size of mutation_partition by: 24 bytes + <number of cql rows> * 8 bytes This should have a positive impact on performance because mutation_partitions are stored both in memtable and cache. Fixes #742." * 'haaawk/742' of github.com:cloudius-systems/seastar-dev: intrusive_set: rename size() to calculate_size() Make intrusive_set_external_comparator::_value_traits static Implement intrusive set using rbtree_algorithms mutation_partition: make apply_reversibly_intrusive_set nongeneric mutation_partition: take schema in find_row and clustered_row mutation_partition: Extract intrusive set logic to a class. mutation_partition: Replace value_comp with key_comp calls	2016-12-25 12:56:10 +02:00
Benoît Canet	a24ff47c63	scylla_setup: Use blkid or ls to list potentials block devices blkid does not list root raw device. Revert to lsblk while taking care of having a fallback path in case the -p option is not supported. Fixes #1963. Suggested-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <20161225100204.13297-1-benoit@scylladb.com>	2016-12-25 12:03:40 +02:00
Takuya ASADA	f3e45bc9ef	dist/redhat: don't try to adduser when user already exists Currently we get "failed adding user 'scylla'" on .rpm installation when the user already exists; we can skip it to prevent the error. Fixes #1958 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>	2016-12-25 11:37:25 +02:00
Piotr Jastrzebski	345ed5b6ff	intrusive_set: rename size() to calculate_size() This hopefully will make it more apparent that the time complexity of this method is O(N) not O(1). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:32:13 +01:00
Piotr Jastrzebski	151fa3aaf0	Make intrusive_set_external_comparator::_value_traits static _value_traits can be shared among all instances and there's no need to store it in every single one. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:32:13 +01:00
Piotr Jastrzebski	671affc36c	Implement intrusive set using rbtree_algorithms This new implementation takes less memory because it does not store comparator. It also uses tree nodes optimized for size. This means that instead of storing an enum field \|color\| they embed this information inside pointer to parent. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:32:13 +01:00
Piotr Jastrzebski	b0f712a4e8	mutation_partition: make apply_reversibly_intrusive_set nongeneric apply_reversibly_intrusive_set is used only in one place and always with rows_type. There's no need for it to be generic. This will allow changing intrusive set implementation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Piotr Jastrzebski	2af6ff68d9	mutation_partition: take schema in find_row and clustered_row This will allow intrusive set implementation that does not store schema. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Piotr Jastrzebski	b3b924dec9	mutation_partition: Extract intrusive set logic to a class. It will make it easier to change the implementation of the intrusive set. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Piotr Jastrzebski	ac7481f4b2	mutation_partition: Replace value_comp with key_comp calls This will reduce the size of bi::set API being used. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-12-23 11:29:07 +01:00
Tomasz Grabiec	f2a63270d1	sstables: Fix double close on index and data files when writing fails file output streams take the responsibility of closing the file, they will close the file as part of closing the stream. During sstable writing we create sstable object and keep file references there as well. Sstable object also has responsibility for closing the files, and does so from sstable::~sstable(). Double close was supposed to be avoided by a construct like this: writer.close().get(); _file = {}; However if close() failed, which can happen when write-ahead failed, _file would not be cleared, and both the writer and sstable would close the file. This will result in a crash in append_challenged_posix_file_impl::close(), which is not prepared to be closed twice. Another problem is that if exception happened before we reached that construct, we still should close the writer. Currently we don't, so there's no double close on the file, but that's a bug which needs to be fixed and once that's fixed double close on _file will be even more likely. The fix employed here is to not keep files inside sstable object when writing. As soon as the writer is constructed, it's the only owner of the file. Fixes #1764. Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>	2016-12-23 11:44:43 +02:00
Raphael S. Carvalho	fd80499b3d	database: make column_family::add_sstable() private again Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <38226308bee2970a91b0e35370d6a646b85ecfe9.1482459877.git.raphaelsc@scylladb.com>	2016-12-23 11:42:16 +02:00
Paweł Dziepak	e6d27ac529	query: introduce result_memory_accounter::foreign_state Range queries used to be performed sequentially and the shard performing part of the read was reading state of the merger's memory accounter directly. Now, they may be performed in parallel so it is safer to just pass relevant data by value to the interested shards so that they are not reading something that another shard is modifying at the same time. Since the query is done in parallel there is a chance of overread. However, the parallelism is high only in sparsely populated tables and that's when the overread is a less serious problem.	2016-12-22 17:16:24 +01:00
Paweł Dziepak	49d675223e	storage_proxy: fix short reads in parallel range queries Since `a1cafed370` "storage_proxy: handle range scans of sparsely populated tables" nonsingular range queries may be performed in parallel on multiple shards. The consequence of this is that results may be added to the merger out of order. This requires more complex logic for handling short reads. As soon as mutation_result_merger gets a short read it starts to discard all subsequently received results that are known to contain partitions with larger keys. Then when the final result is being prepared the merger may need to combine and sort results whose ordering is not known. If at least one of these results is a short one, all partitions with larger keys are removed. Because the request is performed in parallel, it is possible that even though there was a short read the merger has got enough live data to satisfy specified limits. If this has happened the short read flag is not set on the final result.	2016-12-22 17:16:24 +01:00
Paweł Dziepak	1a52569f7d	storage_proxy: pass maximum result size to replicas We may want to change the default individual result size limit in the future. If it is provided by the coordinator and not hardcoded in the replicas this can be done without causing data query digest mismatches or wasteful mutation query results.	2016-12-22 17:16:23 +01:00
Paweł Dziepak	40176ca2f8	mutation_partition: use result limiter for digest reads Even if we are performing a digest query we should do proper result memory accounting so that the result ends exactly in the same place that it would if it was a data query. This is to avoid digest mismatches between replicas.	2016-12-22 17:16:23 +01:00
Avi Kivity	8686a59ea5	dht: use nonwrapping_ranges in ring_position_range_sharder It was the observation that ring_position_range_sharder doesn't support wrapping ranges that started the nonwrapping_range madness, but that class still has some leftover wrapping ranges. Close the circle by removing them. Message-Id: <20161123153113.8944-1-avi@scylladb.com>	2016-12-22 14:40:30 +01:00
Takuya ASADA	7c3b98806d	dist/common/scripts/scylla_setup: improve the message of disk selection prompt Not to confuse users, describe we only list up unmounted disks. Fixes #1841 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>	2016-12-22 15:36:46 +02:00
Paweł Dziepak	a7d694654a	query: make result_memory_limiter constants available for linker	2016-12-22 13:35:04 +01:00
Paweł Dziepak	a0523df8d6	result_memory_limiter: add accounter for digest reads Digest reads differ from data reads in a way that they do not really consume any memory. We still want them to stop in the same place that data reads would, but the per-shard semaphore shouldn't be updated by them.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	38ee69dee0	idl: allow writers to use any output stream Original IDL generated code was hardcoded to always use bytes_ostream. This patch makes the output stream a template parameter so that any valid output stream can be used. Unfortunately, making IDL writers generic requires updates in the code that uses them, this is fixed in C++17 which would be able to deduce the parameter in most cases.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	aa083d3d85	result_memory_limiter: split new_read() to new_{data, mutation}_read() For data queries it is very important that all replicas get limited in the same place (this includes replicas returning only digest). That's why they shouldn't be affected by per-shard result memory limit. Moreover, we should make sure that individual memory limits are the same; having the coordinator provide the limit to replicas allows it to be changed safely in the future. Mutation queries are not as sensitive but it is still beneficial to make sure that all replicas use the same individual limit.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	b8e29cc99c	idl: is_short_read() was added in 1.6	2016-12-22 13:35:04 +01:00
Paweł Dziepak	1c7cade559	mutation_partition: honour allowed_short_read for static rows	2016-12-22 13:35:04 +01:00
Paweł Dziepak	a7a454c388	storage_proxy: fix _is_short_read computation	2016-12-22 13:35:04 +01:00
Paweł Dziepak	8c1e4a707c	storage_proxy: disallow short reads if got no live rows If after reconciliation the coordinator ends up with no live rows and short reads are allowed a retry may not make any progress if replicas end their reads in the same place. The solution is to disallow short reads on retries which are caused by final result having no live rows.	2016-12-22 13:35:04 +01:00
Paweł Dziepak	6db262446f	storage_proxy: don't stop after result with no live rows mutation_result_merger merges results from different shards and stops as soon as a shard returned a short read or memory usage on the merging shard is too high. However, it should never stop unless at least one live row is in the merged result.	2016-12-22 13:35:04 +01:00
Avi Kivity	74ecd7072a	Merge "Reduce overhead of get_max_purgeable_timestamp() during compaction" from Tomasz * 'tgrabiec/calculate-hash-once-compaction' of github.com:cloudius-systems/seastar-dev: sstables: Calculate key hash only once during compaction tests: sstables: Add more test cases to tombstone_purge_test db: Expose column_family::add_sstable tests: sstables: Ensure timestamps are increasing tests: sstables: Simplify tombstone_purge_test	2016-12-22 14:33:30 +02:00
Tomasz Grabiec	045b9fd7c1	sstables: Calculate key hash only once during compaction Improves compaction performance.	2016-12-22 13:24:46 +01:00
Tomasz Grabiec	fb8765bef9	tests: sstables: Add more test cases to tombstone_purge_test	2016-12-22 13:24:46 +01:00
Tomasz Grabiec	c7ff2a2bb0	db: Expose column_family::add_sstable Needed by compaction tests.	2016-12-22 13:24:46 +01:00
Tomasz Grabiec	d841cab02c	tests: sstables: Ensure timestamps are increasing	2016-12-22 13:24:45 +01:00
Tomasz Grabiec	21ade8e4a4	tests: sstables: Simplify tombstone_purge_test - moved to seastar thread - extracted sstable creation and validation logic - reduced code duplication - switched to mutation_reader assertions - used result of compact_sstable() to locate the new sstable - rather than setting gc timestamp in the past, bump the clock before compacting	2016-12-22 13:24:41 +01:00
Tomasz Grabiec	bc6486b304	Use gc_clock instead of db_clock where possible Some code paths were obtaining db_clock timestamp to only convert it to gc_clock later. Avoid this. In the future we could make gc_clock cheaper because it has low precision. Message-Id: <1482401190-2035-1-git-send-email-tgrabiec@scylladb.com>	2016-12-22 13:27:55 +02:00
Raphael S. Carvalho	c26090a6b2	sstables/compress: fix error message for snappy uncompression Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <898ad07db705355bdbf780afdb3aa982b8ca3823.1482364125.git.raphaelsc@scylladb.com>	2016-12-22 09:08:34 +01:00
Raphael S. Carvalho	27fb8ec512	db: avoid excessive disk usage during sstable resharding Shared sstables will now be resharded in the same order to guarantee that all shards owning a sstable will agree on its deletion nearly the same time, therefore, reducing disk space requirement. That's done by picking which column family to reshard in UUID order, and each individual column family will reshard its shared sstables in generation order. Fixes #1952. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>	2016-12-21 23:18:06 +02:00
Tomasz Grabiec	d87d50dc64	db: Use microsecond precision for server-side timestamps Currently server-side timestamps use a clock with millisecond precision. Timestamps have microsecond resolution, with lower bits used to serialize mutations originating from given client. Timestamps for column drops always use just the millisecond base. A column drop which is executed after an insert may thus be given lower timestamp than the insert, even when the two are serialized on the client side over same connection. Use microsecond precision to reduce chances of that event. This is supposed to fix sporadic failures of schema_test.py:TestSchema.drop_column_queries_test dtest. Message-Id: <1482343119-27698-1-git-send-email-tgrabiec@scylladb.com>	2016-12-21 18:03:22 +00:00
Avi Kivity	875635554d	Merge "Reduce overhead of partition presence checker during cache update" from Tomasz Refs #1943. * 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev: db: Compute key hash once in partition_presence_checker bloom_filter: Allow checking presence using pre-hashed key db: Use incremental selector in partition_presence_checker	2016-12-21 14:24:54 +02:00
Takuya ASADA	d356c21512	configure.py: don't allow running multiple 'ninja -C seastar' at the same time Scylla's build.ninja allows multiple 'ninja -C seastar' invocations to run at the same time, which breaks the DPDK build after the upgrade to DPDK-16.10: https://gist.github.com/syuu1228/4bd1170630b7e5f15653281b4728e521 To prevent this, we need to limit the Seastar build to a single invocation at a time. Note: this does not disable parallel builds within Seastar. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1482250560-20289-1-git-send-email-syuu@scylladb.com>	2016-12-21 12:42:52 +02:00
Vlad Zolotarov	62cad0f5f5	tracing: don't start tracing until a Tracing service is fully initialized RPC messaging service is initialized before the Tracing service, so we should prevent creation of tracing spans before the service is fully initialized. We will use an already existing "_down" state and extend it in a way that !_down equals "started", where "started" is TRUE when the local service is fully initialized. We will also split the Tracing service initialization into two parts: 1) Initialize the sharded object. 2) Start the tracing service: - Create the I/O backend service. - Enable tracing. Fixes issue #1939 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>	2016-12-21 12:40:14 +02:00
Gleb Natapov	0a2dd39c75	messaging_service: move MUTATION_DONE messages to separate connection If a node gets more MUTATION requests than it can handle via RPC it will stop reading from this RPC connection, but this will prevent it from getting MUTATION_DONE responses for requests it coordinates, because currently MUTATION and MUTATION_DONE messages share the same connection. To solve this problem this patch moves MUTATION_DONE messages to a separate connection. Fixes: #1843 Message-Id: <20161201155942.GC11581@scylladb.com>	2016-12-21 11:10:15 +02:00
Piotr Jastrzebski	3e502de153	mutation_partition: don't use unique_ptr to manage LSA objects Unique_ptr won't destruct them correctly. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>	2016-12-21 09:40:15 +01:00
Raphael S. Carvalho	e28537b56f	sstables: fix calculation of memory footprint for summary Size of keys wasn't taken into account, so the value reported via collectd is much smaller than the actual footprint. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>	2016-12-20 18:28:47 +00:00
Paweł Dziepak	d0e61fd092	test.py: remove '.cc' from view_schema_test	2016-12-20 18:26:52 +00:00
Avi Kivity	3989e4ed15	Revert "config, dht: reduce default msb ignore bits to 4" This reverts commit `b81a57e8eb`. With exponential range scanning, we should now be able to survive msb ignore bits of 12, which allows better sharding on large clusters.	2016-12-20 19:41:05 +02:00
Duarte Nunes	a9e5b7f124	view_info: Fix comparison Two view_info objects are equal if their fields are equal, not different. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1482253839-2736-1-git-send-email-duarte@scylladb.com>	2016-12-20 18:36:39 +01:00
Avi Kivity	a1cafed370	storage_proxy: handle range scans of sparsely populated tables When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the default), a scan range can be split into a large number of subranges, each going to a separate shard. With the current implementation, subranges were queried sequentially, resulting in very long latency when the table was empty or nearly empty. Switch to an exponential retry mechanism, where the number of subranges queried doubles each time, dropping the latency from O(number of subranges) to O(log(number of subranges)). If, during an iteration of a retry, we read at most one range from each shard, then partial results are merged by concatenation. This optimizes for the dense(r) case, where few partial results are required. If, during an iteration of a retry, we need more than one range per shard, then we collapse all of a shard's ranges into just one range, and merge partial results by sorting decorated keys. This reduces the number of sstable read creations we need to make, and optimizes for the sparse table case, where we need many partial results, most of which are empty. We don't merge subranges that come from different partition ranges, because those need to be sorted in request order, not decorated key order. [tgrabiec: trivial conflicts] Message-Id: <20161220170532.25173-1-avi@scylladb.com>	2016-12-20 18:32:29 +01:00
Tomasz Grabiec	dc94bd0642	Merge branch 'materialized-views/cql/v4' from git@github.com:duarten/scylla.git This patchset implements the multiple CQL3 statements relating to materialized views, as well as ensuring other statements now take materialized views into account. It also adds the necessary internal data structures to hold materialized view metadata.	2016-12-20 14:21:18 +01:00
Duarte Nunes	8ac4d7b2e8	tests: Add view_schema_test This patch adds a set of tests for materialized view schema handling, complementing the dtests for the same feature. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	eb25a8f3cd	cql_test_env: Add do_with_cql_env_thread function This patch introduces the do_with_cql_env_thread() function, which behaves like do_with_cql_env() except that it executes the user-specified function in the context of a Seastar thread. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	124802e196	cql3: Add function to build view's select statement This patch adds a utility function that creates a raw select statement from a set of columns and a where clause. It is intended to be used to create the prepared select statement used by the view class. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	088dfdb108	select_statement: Consider materialized views This patch considers materialized views in select_statement::check_access(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	5511dab914	cql3: Add drop view statement This patch adds the drop_view_statement, which enables users to drop a given materialized view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	5c51a24217	cql3: Parse drop view statement This patch adds the necessary grammar to Cql.g to parse drop view statements. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	3025ea63fc	cql3: Add alter view statement This patch adds the alter_view_statement, which enables users to change the properties of a materialized view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	71b1e7c056	cql3: Parse alter view statement Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	8792fed651	create_view_statement: Complete implementation Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	02bc0d2ab3	create_view_statement: Require MV feature This patch adds the MATERIALIZED_VIEWS_FEATURE to the set of cluster features and requires its presence to allow creating a view. This ensures view schemas can be safely propagated across nodes. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	59682c95a1	create_view_statement: Require experimental switch Creating a materialized view requires running Scylla with the experimental switch. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	c626c983f4	create_view_statement: Reuse validation code This replaces some validation logic with a call to validation::validate_column_family. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	5bd74abee8	create_view_statement: Implement check_access This patch implements check_access according to Cassandra's implementation. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	a9c17b0a52	select_statement: Propagate for_view argument This patch propagates the for_view argument, used by statement_restrictions to ensure IS NOT NULL can be used when creating a materialized view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	65535b3444	modification_statement: Check access for tables with views This patch checks for additional permissions when modifying a table with views, since that update will require reading from the table and writing into its views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	5187fdbb3a	modification_statement: Views aren't updated directly This patch ensures that views cannot be modified directly through an insert or update statement. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	21e34c5054	alter_type_statement: Consider materialized views This patch ensures we also update materialized views where the type being updated occurs. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	a5b7b0464b	migration_manager: Only drop table without views This patch forbids dropping a column family if there are still views associated with it, and also forbids dropping a view through the drop table statement. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	76276f1a53	alter_table_statement: Update materialized view This patch ensures that changes to a base table's schema are reflected in that table's materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	44a1f2d836	query_processor: Use cql3::util::do_with_parser() To minimize code duplication, have query_processor use do_with_parser() instead of manually creating the CqlParser. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	bd1e66f411	cql3: Allow renaming a column in a where clause This patch adds a utility function to rename a column occurring in a textual where clause. It is intended to change a view's where clause when users alter the underlying base table. To do this, we rely on functions that transform a textual where clause into a set of relations, which allows the column to be renamed reliably. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	ced4b6e4ff	cql3: Allow renaming an identifier in a relation This patch adds a utility function to rename an identifier occurring in a cql3 relation. This function will be used when renaming an identifier in a view's where clause. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	282c023524	migration_manager: Announce view drop Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	99aa8eb4b8	migration_manager: Announce view update Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	6ef3358321	migration_manager: Announce new view creation Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	8ce21a9c01	schema_tables: Make drop view mutations Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	61a5a74ea2	schema_tables: Make update view mutations Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	2098c336d9	schema_tables: Make create view mutations This patch builds the mutations to announce a new view. Aside from including the view schema, we include the base table mutations so that a node is resilient against receiving create view mutations before the base table create mutations. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	19a76a82e8	frozen_schema: Support view schemas This patch allows a view schema to be frozen. To unfreeze such a schema, we add an is_view attribute to the schema idl. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	c11eb30225	schema_tables: Replace add_table_to_schema_mutation This patch replaces the add_table_to_schema_mutation() function with add_table_or_view_to_schema_mutation(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	04b93ba803	schema_tables: Make view mutations This patch adds functions that translate a view schema to the corresponding mutations. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	fe632e8ba5	schema_tables: Factor out duplicate code This patch factors out duplicate code between merge_tables() and merge_views(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	3fd79bb6d6	schema_tables: Merge views for schema merging Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	06ab61a570	schema_tables: Extract update_column_family This patch extracts update_column_family from schema_tables into database so it can be used when adding materialized views, in future patches. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	ecc4290bc6	database: Remove view from base table upon drop This patch changes the drop_column_family() function to remove a view schema from the list of views of its base table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	4f166cfa6a	database: Parse views schema table upon init This patch adds code for parsing the views schema table upon init and also ensures that when adding a view column family, that we add it to its base table list of views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	40c684b5f5	database: Extract common create cf code This patch moves some duplicate code into the add_column_family_and_create_directory() function. It also saves some superfluous keyspace lookups and readies the code to be used by materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	42242273f6	schema_tables: Create views from mutations This patch enables views to be created from their low-level, mutation-based representation. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	888a8923c7	read_table_mutations: Support other schemas This patch changes read_table_mutations() so that it can now read schemas from other tables besides the column families schema table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	93458f314c	migration_manager: Notify of view schema changes Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	22d8aa9bb6	migration_listener: Listen for view schema changes Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	b9cf25c4dd	schema_tables: Add views schema table This patch adds the views schema table, containing the definition of views in a keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	e41494996f	thrift: Skip materialized views This patch ensures we don't provide access to materialized views over thrift. This includes preventing updates but also omitting them when describing a keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	2b231f22b8	keyspace_metadata: Add tables() and views() functions This patch adds utility functions to keyspace_metadata to select only the tables or only the views out of all the schemas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	7818339791	materialized views: Add view class This patch adds the view class, which will contain functions related to populating a view, either from the base table's write path or from the view building mechanism which copies over already existing data in the base table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	d0ed8fa29b	schema: Add view_ptr class The view_ptr class contains a schema_ptr known to represent a materialized view. It is intended to be used by functions that require such a schema, and thus obviate the need for the function to check for schema::is_view(). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	82ce8eedbd	schema: Add view_info field This patch adds a view_info optional field to the schema. Its presence indicates the schema represents a materialized view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	4b3ac42914	materialized views: Add view_info class The view_info class is meant to augment a schema with fields relevant for materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	d7e607ff51	query_pagers: Fix over-counting of rows This patch fixes a regression introduced in `0518895`, where we counted one extra row per partition when it contained live, non static rows. We also simplify the visitor logic further, since now we don't need to count rows one by one. Also remove a bunch of unused fields. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>	2016-12-20 11:58:37 +00:00
Tomasz Grabiec	0e487b3499	db: Compute key hash once in partition_presence_checker I measured reduction of cache update time by 20% for 6 sstables and by 40% for 16. Refs #1943.	2016-12-19 14:20:58 +01:00
Tomasz Grabiec	ab5c77fcf1	bloom_filter: Allow checking presence using pre-hashed key Will allow us to calculate the hash once and use it on many filters instead of calculating the hash for each filter separately. Another change made is to avoid precomputing all indexes during filter operations, and have for_each_index() template instead which invokes a functor.	2016-12-19 14:20:58 +01:00
Tomasz Grabiec	78844fa2e5	db: Use incremental selector in partition_presence_checker This reduces the number of sstables we need to check to only those whose token range overlaps with the key. Reduces cache update time. Especially effective with leveled compaction strategy. Refs #1943. Incremental selector works with an immutable sstable set, so cache updates need to be serialized. Otherwise we could mispopulate due to stale presence information. Presence checker interface was changed to accept decorated key in order to gain easy access to the token, which is required by the incremental selector.	2016-12-19 14:20:58 +01:00
Avi Kivity	b740aff777	tests: adjust mutation_query_test for partition and row limits Won't build otherwise.	2016-12-19 11:37:25 +02:00
Avi Kivity	f3c8cbbac5	Merge "Introduce dht::token_range and dht::partition_range" from Asias "nonwrapping_range<ring_position> and nonwrapping_range<token> are used in many places. Let's make an alias for them to make it less verbose. Also there is a query::partition_range in query-request.hh which is the alias of nonwrapping_range<ring_position>. query::partition_range is used in places not related to query at all. Let's unify the usage project wide." * tag 'asias/repair_dht_token_range/v2' of github.com:cloudius-systems/seastar-dev: Convert to use dht::partition_range_vector and dht::token_range_vector dht: Introduce dht::partition_range_vector and dht::token_range_vector Get rid of query::partition_range Convert to use dht::partition_range Convert to use dht::token_range dht: Rename token_range to token_range_endpoints dht: Introduce dht::token_range and dht::partition_range	2016-12-19 10:59:52 +02:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	7a446986fa	dht: Introduce dht::partition_range_vector and dht::token_range_vector std::vector<dht::partition_range> and std::vector<dht::token_range> are used in a lot of places, introduce dht::partition_range_vector and dht::token_range_vector as the alias.	2016-12-19 08:09:28 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Asias He	85034c1b57	Convert to use dht::partition_range	2016-12-19 08:04:30 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Asias He	1f06eedb58	dht: Rename token_range to token_range_endpoints It is a helper class used in storage_service only. Rename it so we can use it for the real dht::token_range.	2016-12-19 08:04:29 +08:00
Asias He	264b6ee69e	dht: Introduce dht::token_range and dht::partition_range nonwrapping_range<ring_position> and nonwrapping_range<token> are used in many places. Let's make an alias for them to make it less verbose. Also there is a query::partition_range in query-request.hh which is the alias of nonwrapping_range<ring_position>. query::partition_range is used in places not related to query at all. Let's unify the usage project wide.	2016-12-19 08:04:29 +08:00
Avi Kivity	32fb4c3661	Merge "repair: Reduce unnecessary streaming traffic even more" from Asias "In `7c873f0d` (repair: Reduce unnecessary streaming traffic), we optimize in cases when 1) all the remote nodes have the same checksum and 2) local node has zero checksum. In this series, we make the optimization more generic and cover more cases." * tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev: repair: Reduce unnecessary streaming traffic even more repair: Add hash specialization for partition_checksum	2016-12-18 16:53:39 +02:00
Avi Kivity	3421ebe8be	Merge "storage_proxy: Enforce row limit" from Duarte "This patchset ensures the row limit is enforced at the storage_proxy level. Upper layers like the pager may already be depending on this behavior." * 'enforce-row-limit/v3' of https://github.com/duarten/scylla: query_pagers: Don't trim returned rows select_statement: Don't always trim result set query_result_merger: Limit rows mutation_query: to_data_query_result enforces row limit	2016-12-18 08:15:51 +02:00
Avi Kivity	6bb875bdb7	Merge "storage_proxy: Enforce partition limit" from Duarte "This patchset ensures the partition limit is enforced at the storage_proxy level. To achieve this, we add the partition count to query::result, and allow the result_merger to trim excess partitions." * 'enforce-partition-limit/v3' of https://github.com/duarten/scylla: storage_proxy: Decrease limits when retrying command storage_proxy: Don't fetch superfluous partitions query::result: Add partition count column_family: Use counters in query::result::builder query_result_builder: Use the underlying counters mutation_partition: Count partitions in query_compacted mutation_partition: Remove tabs in query_compacted query::result::builder: Add partition count query_result_merger: Limit partitions	2016-12-16 13:57:37 +02:00
Glauber Costa	7133583797	track streaming and system virtual dirty memory A case could be made that we should have counters for them no matter what, since it can help us reason about the distribution of memory among the groups. But with the hierarchy being broken in 1.5 it becomes even more important. Now by looking solely at dirty, we will have no idea about how much memory we are using in those groups. After this patch, the dirty_memory_manager will register its metrics for the 3 groups that we have, and the legacy names will be used to show totals. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>	2016-12-16 10:59:40 +02:00
Avi Kivity	293876c72f	Merge "Limit number of readers streaming uses" from Paweł "Original, naive db::make_streaming_reader() implementation created a set of memtable and sstable readers for every partition range. This caused bad interaction with the code limiting sstable readers concurrency and was suboptimal. This series introduces multi range mutation reader that takes mutation source and a sorted, disjoint vector of ranges. It creates only a single set of memtable and sstable readers and fast forwards it to the next range once the current one is completed." * 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev: db: use multi range reader for streaming readers dht: describe split_range[s]_to_shards() guarantees repair: remove outdated fixme test/mutation_reader_test: add multi_range_reader test tests/mutation_reader: extract key creation code mutation_reader: add multi_range_reader	2016-12-15 17:48:31 +02:00
Paweł Dziepak	cf679a413c	db: use multi range reader for streaming readers A naive approach was to create a set of readers for each range and pass them all to combining reader. This however performed badly if the number of ranges was high. The solution is to use multi range reader which uses only a single set of readers and fast forwards from range to range when necessary. This adds another requirement that the ranges passed to make_streaming_reader() are sorted and disjoint.	2016-12-15 13:54:43 +00:00
Paweł Dziepak	b86a826baf	dht: describe split_range[s]_to_shards() guarantees We are going to require these functions to return sorted and disjoint ranges. They already do so (provided that the input ranges are sorted and disjoint), but if the guarantee is not explicitly stated it may disappear some day.	2016-12-15 13:07:32 +00:00
Paweł Dziepak	5287417136	repair: remove outdated fixme	2016-12-15 13:07:32 +00:00
Paweł Dziepak	5b0cf20f75	test/mutation_reader_test: add multi_range_reader test	2016-12-15 13:07:32 +00:00
Paweł Dziepak	787a976c2b	tests/mutation_reader: extract key creation code	2016-12-15 13:07:32 +00:00
Paweł Dziepak	52a4e79210	mutation_reader: add multi_range_reader So far, the only way to combine outputs of multiple readers was to use combining reader. It is very general and, in particular, supports the case when the readers emit mutations from overlapping ranges. However, we have cases (e.g. streaming) when we need to read from several disjoint ranges. Combining reader is a suboptimal solution as it requires creating a reader for each range and ignores the fact that they do not overlap. This patch introduces multi_range_mutation_reader which takes a mutation_source and a sorted set of disjoint ranges. Internally, it uses mutation_reader::fast_forward_to() to move to the next range once the current one is completed.	2016-12-15 13:07:31 +00:00
Duarte Nunes	0518895f5b	query_pagers: Don't trim returned rows Since storage_proxy::query() now respects the read_command limits, we can remove the trimming logic from query_pagers. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 11:00:46 +00:00
Duarte Nunes	7ce859799b	select_statement: Don't always trim result set Trimming the result set is only needed when the query contains an "IN" relation, an ORDER BY clause, and defines a limit, which is the case where we query different ranges concurrently. We don't use the result_merger to trim since we first need to reorder the rows. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 11:00:46 +00:00
Duarte Nunes	fee0b7fa48	query_result_merger: Limit rows This patch makes the row limit enforced by the storage_proxy layer. It adds a row limit to the query_result_merger, useful when merging results for concurrent queries. More importantly, it provides guarantees that upper layers may be relying on implicitly (e.g., the paging code). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 11:00:36 +00:00
Duarte Nunes	efc986d548	mutation_query: to_data_query_result enforces row limit This patch changes mutation_query::to_data_query_result() so that it enforces the row limit alongside the partition limit and the per-partition limit. In the following patch, we'll enforce the row limit in an upper layer, but this lets us optimize the case where only one replica replies. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:56:40 +00:00
Duarte Nunes	c2072c7dc9	storage_proxy: Decrease limits when retrying command This patch changes a read_command's limits when retrying it, so that we don't ask for more rows than necessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:41:06 +00:00
Duarte Nunes	9572c19dc6	storage_proxy: Don't fetch superfluous partitions This patch ensures we keep track of how many partitions we've queried so we don't ask for more than the number we need. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	93be8d7cef	query::result: Add partition count This patch adds a partition count to query::result, filled by the query::result::builder. The partition count is present whenever the result carries data, being absent only for the case where the result contains only a digest. We also ensure that counts are present for an empty query::result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	781cd82cb8	column_family: Use counters in query::result::builder This patch changes column_family::query() to use the counters in the builder to determine how many partitions and rows to ask for and also to implement the stop condition. This saves a continuation to do the bookkeeping, and allows us to remove data_query_result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	05b2ef4fa2	query_result_builder: Use the underlying counters This patch changes the query_result_builder to use the counters provided by the query::result::builder. It also ensures they are kept current. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	f5cf7f7921	mutation_partition: Count partitions in query_compacted This patch changes mutation_partition::query_compacted() to count the number of partitions written to the underlying writer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	f21dfb8217	mutation_partition: Remove tabs in query_compacted Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	2409b6b250	query::result::builder: Add partition count This patch adds a partition count to the query::result::builder. It is intended to be incremented by users, and later used to build a query::result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	108011a839	query_result_merger: Limit partitions This patch adds a partition limit to the query_result_merger, useful when merging results for concurrent queries. This change also makes the partition limit enforced by the storage_proxy layer, no changes being needed by the upper layers, namely the Thrift interface. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:41 +00:00
Pekka Enberg	06c5216c9d	Merge "Improve gossip feature logging" from Asias	2016-12-15 10:36:54 +02:00
Asias He	e578e65103	gossip: Log feature enabled message on shard zero only Feature is per node. No need to log them number of shards times.	2016-12-15 16:33:11 +08:00
Asias He	4137fab91b	gossip: Make log in check_features debug level We saw the message twice for the same feature check. This is a bit confusing. INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {} INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {} INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {} INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {} This is because ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE); ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE); The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE) is constructed. The second message is printed when the ss._range_tombstones_feature is copy-constructed.	2016-12-15 16:33:10 +08:00
Asias He	2b1ebc4719	gossip: Introduce gms:features::enable helper Add the helper function to enable a feature and log that the feature is enabled. When a feature is enabled, we see INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled in the log.	2016-12-15 16:33:10 +08:00
Paweł Dziepak	b70e5d2089	Merge seastar upstream Submodule seastar 6fbd792..0b98024: > fstream: fix read ahead byte metric types > fstream: add read-ahead metrics > future-util: make stop_iteration use bool_class<> > util: introduce bool_class<Tag>	2016-12-14 15:01:13 +00:00
Avi Kivity	57f4910832	Merge "Query result size limiting" from Paweł "This series makes Scylla limit size of query results it produces in case they grow unreasonably large. This is possible because CQL paging queries do not guarantee that the returned page is going to have page_size rows and pages smaller than that do not indicate end of stream. Non-paged queries and Thrift requests do not have such flexibility and they also get all the requested data (though their memory usage is still accounted for and may limit paged queries). There is a maximum result size (1 MB) and all results builders will stop after reaching it. Moreover, there is a per-shard limitation on the amount of memory used by all results combined (10%). To avoid tiny results a query has to reserve (wait if necessary) 4 kB before starting executing, after that it can consume more memory without any additional waiting provided it is below individual and shard-local limits. Enabling the cluster to return fewer rows than requested also means some changes for the coordinator. Firstly, if it receives such a short result from a replica, retrying it with a larger limit obviously makes no sense whatsoever. Instead, in such cases the coordinator removes the clustering rows it has incomplete information about and sends a short result back to the client. Moreover, even if no replica returned a short response, reconciliation may have made it so. In this case, the coordinator does not necessarily need to retry the query as well. Unfortunately, with the current implementation short responses ruin data queries since they will cause a digest mismatch. 
Three new metrics were added: * database_bytes_total_result_memory -- total memory used by query results * database_total_operations_short_data_queries -- data queries that were limited by size, particularly bad as it basically forces coordinator to retry them as mutation queries * database_total_operations_short_mutation_queries -- mutation queries limited by size" * 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev: storage_proxy: clean up after primary_key introduction cql3: allow short reads with paged queries storage_proxy: handle intentional short reads storage_proxy: make sure coordinator has complete data storage_proxy: honour partition limit storage_proxy: use cmd limits to determine that replica reached end db: add metrics for short reads and memory used for results data_query: limit result size mutation_query: limit result size db: create result_memory_accounters when starting query query_builder: add partition_slice getter reconcilable_result: keep result_memory_tracker object mutation_compactor: honour stop_iteration from consumers db: add result_memory_limiter query: add result size limiter reconcilable_result: properly propagate short_read flag query_pagers: handle short reads properly query: allow short reads serializer_impl: add serializer for bool_class<Tag>	2016-12-14 16:53:07 +02:00
Paweł Dziepak	4c69d7e2fe	storage_proxy: clean up after primary_key introduction primary_key was introduced as a replacement for std::pair<dht::decorated_key, std::optional<clustering_key>>. In order to simplify the patch introducing it, its fields were named 'first' and 'second'. This patch changes the names to something less useless, removes old row_address alias and removes is_missing_rows() in favour of primary_key::less_compare_clustering comparator. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:37 +00:00
Paweł Dziepak	dde4bd5051	cql3: allow short reads with paged queries Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:37 +00:00
Paweł Dziepak	3c173d87b5	storage_proxy: handle intentional short reads If the result is going to be too large the replica may decide to make it shorter and coordinator should handle this properly (i.e. do not retry). Moreover, coordinator could avoid some retries by setting the short_read flag itself. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:37 +00:00
Paweł Dziepak	dd67de7218	storage_proxy: make sure coordinator has complete data got_incomplete_information() ensures that the coordinator has received all required data from all replicas. (see `77dbe3c12f` "storage_proxy: fix reconciliation with limits" for the examples when that may not be the case). However, this function is called only if reconciled result has at least as many rows as the user asked for. This was correct when we had only total row limit: if the result was shorter than that either all replicas sent all data they have or the coordinator will retry anyway. However, since then we got partition limit and per partition row limit and a request may be limited by one of these while being still below the total row limit. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	2ff5308d8e	storage_proxy: honour partition limit At the moment the coordinator does not care much for the partition limit. In particular it doesn't check whether after reconciliation the result still contains enough partitions. This patch makes it honour the partition limit and increase it in the retried queries if necessary. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	7bed7aa7de	storage_proxy: use cmd limits to determine that replica reached end Coordinator may retry a query with larger limits. However, code determining whether replica has no more data always used the original limits. This may cause a livelock. For example, consider cluster having the following partitions (deletions cover live cells): node1: pk=0, v=0 pk=1, v=1 node2 delete pk=0 delete pk=1 pk=2, v=2 pk=3, v=3 Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is going to send partitions 0 and 1 while second node is going to send 2 and 3 + tombstones for 0 and 1. The coordinator will decide that it needs to retry the request with larger row limit since node1 may have some information about partitions 2 and 3 that are newer than what node2 has sent. However, when the second response arrives node1 will still send only two rows since it has no more data. Because the coordinator uses original row limit it will not notice that this node reached the end and we are going to get another retry without making any progress. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	cfd4d0f680	db: add metrics for short reads and memory used for results Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	ba51e7e8db	data_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	f1b9f49f2b	mutation_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	6c33a4f177	db: create result_memory_accounters when starting query This patch ensures that when we start executing a query a minimum result size is reserved from result_memory_limiter. Moreover, range queries need a way of merging memory usage information from different shards. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	0bce4047bd	query_builder: add partition_slice getter Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	15de8de9e5	reconcilable_result: keep result_memory_tracker object Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	34f9eb4cbd	mutation_compactor: honour stop_iteration from consumers Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	5d7185fd39	db: add result_memory_limiter Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	ee89d80d5c	query: add result size limiter This patch introduces an infrastructure for limiting result size. There is a shard-local limit which makes sure that all results combined do not use more than 10% of the shard memory. There is also an individual limit which restricts a result to 4 MB. In order to avoid sending tiny results there is a minimum guaranteed size (4 kB), which the query needs to reserve before it starts producing the result. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	43fe3439ca	reconcilable_result: properly propagate short_read flag reconcilable_result can be merged with another or transformed into query::result. Make sure that short_read information is never lost. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	837d24f1b2	query_pagers: handle short reads properly Currently, the paging implementation assumes that the server either returns as many rows as it was asked for or has reached the end. Soon, that's not going to be true, so instead of making any assumptions about the number of rows returned, use the new "short read" flag to determine whether there is going to be more data. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	da7ca85040	query: allow short reads When paging is used the cluster is allowed to return fewer rows than the client asked for. However, if such a possibility is used we need a way of telling that to the coordinator and the paging implementation so that they can differentiate between short reads caused by the replica running out of data to send and short reads caused by any other means. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:01 +00:00
Paweł Dziepak	7a15c89b1d	serializer_impl: add serializer for bool_class<Tag> Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:01 +00:00
Takuya ASADA	8918a4be57	dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed Instead of aborting scylla_setup, print a warning message and then continue to the next setup. Fixes #1357 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>	2016-12-14 13:31:50 +02:00
Tomasz Grabiec	c9344826e9	tests: Remove unintentional enablement of trace-level logging Sneaked in by mistake.	2016-12-14 10:58:07 +01:00
Tomasz Grabiec	fe6a70dba1	tests: commitlog: Fix assumption about write visibility The test assumed that mutations added to the commitlog are visible to reads as soon as a new segment is opened. That's not true because buffers are written back in the background, and new segment may be active while the previous one is still being written or not yet synced. Fix the test so that it expects that the number of mutations read this way is <= the number of mutations read, and that after all segments are synced, the number of mutations read is equal. Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>	2016-12-14 11:29:33 +02:00
Avi Kivity	a61ff53150	Merge "rework flush criteria" from Glauber "The current criteria for memtable flush is not being respected. The problem is demonstrated to happen when the dirty memory group is over limit, and so is the system table extra allowance. In that situation, both the normal region and the system table region will be under pressure and try to flush. More specifically, because the normal region inherits from the system region, if the normal region is under pressure (over the soft limit threshold), the system region will certainly be as well, even though it has an extra allowance. This is because after virtual dirty, we start blocking when we reach half the region, but memory itself can grow up to 100 % of the region. So the total amount of memory used will be certainly bigger than the system pressure threshold, which is now 50 % plus the allowance. To fix that, this patch reworks the flush logic so that the regions are not dependent on each other. Fixes #1918" * 'flush-criteria-v6' of github.com:glommer/scylla: config: get rid of memtable_total_space database: rework dirty memory hierarchy system keyspace: write batchlog mutation in user memory database: remove flush_token database: abstract pressure condition notification database: encapsulate semaphore_units into a flush_permit database: remove friendship declaration database: simplify flush_one database: make memtable_list aware in cases it can't flush	2016-12-14 11:24:10 +02:00
Takuya ASADA	c18a95cddf	dist/redhat: add scylla_lib.sh to scylla.spec Fix .rpm build error. Fixes #1932 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>	2016-12-14 10:27:37 +02:00
Glauber Costa	56df53f51e	compaction_manager: fix shutdown sequence By the time we are able to acquire this semaphore, we may be stopped already. So we need to test it before we go ahead. I can see shutdown hangs before this patch that are fixed with it applied. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>	2016-12-14 09:26:24 +01:00
Asias He	84fa2c91c7	repair: Reduce unnecessary streaming traffic even more In `7c873f0d` (repair: Reduce unnecessary streaming traffic), we optimize in cases when 1) all the remote nodes have the same checksum and 2) local node has zero checksum. In this patch, we make the optimization more generic and cover more cases. 1) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node2 Before: INFO 2016-12-09 16:24:31,961 [shard 0] repair - Found differing range (-4091524285777924069, -4086237930244473115] on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1} INFO 2016-12-09 16:24:31,963 [shard 0] repair - Found differing range (-609511120964672970, -605253169726090861] on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3} INFO 2016-12-09 16:24:31,964 [shard 0] repair - Found differing range (-7655412157560911259, -7652234653747163387] on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1} INFO 2016-12-09 16:24:31,965 [shard 0] repair - Found differing range (-4133815130045531703, -4128528774512080749] on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1} INFO 2016-12-09 16:24:31,967 [shard 0] repair - Found differing range (-605253169726090861, -600995218487508751] on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3} INFO 2016-12-09 16:24:31,968 [shard 0] repair - Found differing range (438510347741343837, 441475345714861354] on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3} After: INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-660606535827658284, -656348584589076175] on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3} INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4234255885181099833, -4228969529647648879] on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3} INFO 
2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4228969529647648879, -4223683174114197925] on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3} INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4223683174114197925, -4218396818580746971] on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3} INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-7728494745277112315, -7725317241463364443] on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3} INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-720217853167807818, -715959901929225709] on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3} Before, we need to fetch data from both node 1 and node 3 and send data back to node 1 and node 3, i.e., 2 IN, 2 OUT After, we only need to fetch data from node 3, i.e. 0 IN, 1 OUT We saved 3X traffic, with higher RF, we can save even more. 2) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node3 Before: INFO 2016-12-09 16:20:11,448 [shard 0] repair - Found differing range (-8533861887892628919, -8052600134279395253] on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {} INFO 2016-12-09 16:20:11,465 [shard 0] repair - Found differing range (7190719703944308372, 7692358524564683543] on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {} INFO 2016-12-09 16:20:11,486 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880] on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {} INFO 2016-12-09 16:20:11,494 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465] on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {} INFO 2016-12-09 16:20:11,518 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252] on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {} INFO 2016-12-09 16:20:11,519 [shard 0] repair - Found differing range (-1137497074548854552, 
-592479316010344531] on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {} After: INFO 2016-12-09 16:29:22,433 [shard 0] repair - Found differing range (67885601051654285, 447405341661896387] on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {} INFO 2016-12-09 16:29:22,454 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465] on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {} INFO 2016-12-09 16:29:22,473 [shard 0] repair - Found differing range (2523396860109747637, 3083778975065200884] on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {} INFO 2016-12-09 16:29:22,474 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880] on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {} INFO 2016-12-09 16:29:22,487 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252] on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {} INFO 2016-12-09 16:29:22,493 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531] on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {} This shows the new, more generic method covers the optimization we had before as well.	2016-12-14 09:37:35 +08:00
Asias He	bd1cd53b2a	repair: Add hash specialization for partition_checksum So we can store partition_checksum in std::map as key.	2016-12-14 09:33:16 +08:00
Glauber Costa	2aa6514667	config: get rid of memtable_total_space Those values are now statically set. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 17:05:12 -05:00
Glauber Costa	80440c0d79	database: rework dirty memory hierarchy Issue #1918 describes a problem, in which we are generating smaller memtables than we could, and therefore not respecting the flush criteria. That happens because group sizes (and limits) for pressure purposes, and the soft threshold is currently at 40 %. This causes system group's soft threshold to be way below regular's virtual dirty limit and close to regular group's soft threshold. The system group was very likely to become under soft pressure when regular was because writes to regular group are not yet throttled when they cross both soft thresholds. This is a direct consequence of the linear hierarchy between the regions and to guarantee that it won't happen we would have to acquire the semaphore of all ancestor regions when flushing from a child region. While that works, it can lead to problems on its own, like priority inversion if the regions have different priorities - like streaming and regular, and groups lower in the hierarchy, like user, blocking explicit flushes from their ancestors. To fix that, this patch reorganizes the dirty memory region groups so that groups are now completely independent. As a disadvantage, when streaming happens we will draw some memory from the cache, but we will live with it for the time being. Fixes #1918 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 14:07:53 -05:00
Glauber Costa	db7cc3cba8	system keyspace: write batchlog mutation in user memory Batchlog is a potentially memory-intensive table whose workload is driven by user needs, not system's. Move it to the user dirty memory manager. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:35 -05:00
Glauber Costa	be9e4c71ad	database: remove flush_token We had a flush_token structure in addition to the flush_permit because we needed to keep a pointer to the dirty_memory_manager and apply changes to the region group upon the region destruction. Since Tomek's latest series, this is no longer needed and now this structure doesn't have a place in the world anymore. Simplify the code by removing it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	98030ad66c	database: abstract pressure condition notification Done in a separate patch to reduce clutter in the main patch. Soon we'll be testing for one more condition. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	c9a8b03311	database: encapsulate semaphore_units into a flush_permit We will soon need to hold more than a semaphore_units<> object per flush, potentially. Preparation patch for that. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	2e8c7d2c62	database: remove friendship declaration Not needed anymore since memtable started having a direct pointer to the memtable list. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	bb1509c21e	database: simplify flush_one flush_one has to make sure that we're using the correct dirty_memory_manager object, because we could be flushing from a region group different than the one the flush request originated. It's simpler to just assume flush_one will be dealing with the right object, and use a different object instead of "this" when calling it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	8ab7c04caa	database: make memtable_list aware in cases it can't flush Some of our CFs can't be flushed. Those are the ones who are not marked as having durable writes. We treat them just the same from the point of view of the flush logic, but they provide a function that doesn't do anything and just returns right away. We already had troubles with that in the past, and that also poses a problem for an upcoming patch reworking the flush memtable pick criteria. It's easier, simpler, and cleaner, to just make the memtable_list aware it can't flush. Achieving that is also not very complicated: we just need a special constructor that doesn't take a seal function and then we make sure that it is initialized to an empty std::function Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Takuya ASADA	0a6312d254	dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant Use it as "if is_debian_variant; then". Fixes #1931 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:29:42 +02:00
Takuya ASADA	ed4cd1908f	dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection CentOS/RHEL is using SELinux, and it's NOT Debian variant, so fixed from "is_debian_variant" to "! is_debian_variant". Fixes #1930 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:29:29 +02:00
Takuya ASADA	6c0dc55495	dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh This fixes following command not found error: ``` /usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found ``` Fixes #1929 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:29:13 +02:00
Takuya ASADA	3b74c50546	dist/ubuntu: add uuidgen to package dependency We haven't added uuidgen to Ubuntu/Debian package dependency, so scylla_setup script may abort because of command not found. Fixes #1928 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:28:48 +02:00
Duarte Nunes	1e75a4950e	database: Complete query when hitting partition limit Currently, we weren't completing a query as early as possible if it reached the partition limit, we instead had to wait until reaching the end of the specified partition ranges. This patch fixes that by including a check to the partition limit in the termination condition. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161213114559.26438-1-duarte@scylladb.com>	2016-12-13 14:53:46 +02:00
Tomasz Grabiec	f451014785	schema: Implement operator<< for column_mapping Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>	2016-12-13 12:20:46 +02:00
Tomasz Grabiec	059a1a4f22	db: Fix commitlog replay to not drop cell mutations with older schema column_mapping is not safe to access across shards, because data_type is not safe to access. One of the manifestations of this is that abstract_type::is_value_compatible_with() always fails if the two types belong to different shards. During replay, column_mapping lives on the replaying shard, and is used by converting_mutation_partition_applier against the schema on the target shard. Since types in the mapping will be considered incompatible with types in the schema, all cells will be dropped. Fix by using column_mapping in a safe way, by copying it to the target shard if necessary. Each shard maintains its own cache of column mappings. Fixes #1924. Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>	2016-12-13 12:19:32 +02:00
Avi Kivity	32d55bbb4c	Merge seastar upstream * seastar 0773e98...6fbd792 (2): > tls: Only run our "verify" function in client session > Merge "Clean the metric definition" from Amnon Includes patch from Amnon adjusting the metrics registration due to seastar API changes.	2016-12-13 12:17:14 +02:00
Avi Kivity	6f9c317b91	Merge "Use uuid file in housekeeping" from Amnon "This patch adds the use of a uuid file to the housekeeping daily version check. The uuid file is optional; if the file is missing, no uuid will be used."	2016-12-13 10:52:44 +02:00
Avi Kivity	c67782f169	Merge seastar upstream * seastar 0a74317...0773e98 (6): > tls: Add support for client certificate verification & priority strings > semaphore: add consume_units > semaphore: add available_units() > thread: check need_preempt for threads in a scheduling group as well > tutorial: fix semaphore example, and text > stop_iteration: add && and \|\| operators	2016-12-12 18:06:19 +02:00
Avi Kivity	c801cc4bd1	Merge "streaming and repair updates" from Asias "This series: - We can make reader with ranges - Fix possible use after free of 'si' - Streaming ranges now are sorted and merged - Fix shard_begin shard_end end loop in both streaming and repair"	2016-12-12 11:32:42 +02:00
Asias He	ba54654af3	streaming: Use interval_set to sort and merge ranges So that the ranges are sorted and have no overlaps. We can have fewer ranges to deal with and it can help the mutation readers to optimize. Here is an example of ranges generated by repair: Before: INFO 2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id = dec9fa90-bc3b-11e6-af78-000000000001, before ranges = {(-3383928698815274642, -3376937163195039606], (-7260764223708720005, -7251657821052234309], (-4767213984179237293, -4747032371925842389], (-7645879646119667643, -7589962743703481776], (-2340199306656526861, -2320523117224780931], (-576028861239229331, -560973674020019962], (-4070378863644120252, -3987599893827407860], (-2551584407739673151, -2498779102482524711], (-5416061903556353312, -5354212455975869358], (37594980457713898, 67885601051654285], (3083778975065200884, 3091232478835418439], (3131345970514528877, 3187922544267434961], (5765437476661317163, 5778671293583720541], (5960610072466058818, 5972289771228014343], (7749618183851698485, 7758080813117351135], (-3987599893827407860, -3899198931034439776], (-7251657821052234309, -7131649010279865221], (-3576581915808403133, -3383928698815274642], (-417850207760366422, -327959672080599465], (-2671876682129336880, -2551584407739673151], (-1305178847032904465, -1137497074548854552], (8540448858050275827, 8610171849752115483], (-560973674020019962, -417850207760366422], (-2498779102482524711, -2340199306656526861], (2394447940525988167, 2523396860109747637], (-6703329224557608009, -6517757811218772762], (-3675103288021821677, -3576581915808403133], (-5622185785296846551, -5416061903556353312], (8610171849752115483, 8742605005068551458], (8068079250973315241, 8185655671734937642], (560264964510741191, 790641981923757238], (5581202487214475094, 5765437476661317163], (8742605005068551458, 8923908282731801645], (-6038176423022601107, -5622185785296846551], (5778671293583720541, 5960610072466058818], (-3899198931034439776, 
-3675103288021821677], (8356739976149429222, 8540448858050275827], (-6517757811218772762, -6038176423022601107], (-8052600134279395253, -7645879646119667643], (-327959672080599465, 37594980457713898], (7758080813117351135, 8019254284118543066], (4781565016737645510, 5067070718000527886], (2523396860109747637, 3083778975065200884], (-5354212455975869358, -4767213984179237293], (6784138025918878582, 7190719703944308372], (67885601051654285, 447405341661896387], (-2190610927722759275, -1305178847032904465], (-4747032371925842389, -4070378863644120252]}, size=48 After: INFO 2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id = dec9fa90-bc3b-11e6-af78-000000000001, after ranges = {(-8052600134279395253, -7589962743703481776], (-7260764223708720005, -7131649010279865221], (-6703329224557608009, -3376937163195039606], (-2671876682129336880, -2320523117224780931], (-2190610927722759275, -1137497074548854552], (-576028861239229331, 447405341661896387], (560264964510741191, 790641981923757238], (2394447940525988167, 3091232478835418439], (3131345970514528877, 3187922544267434961], (4781565016737645510, 5067070718000527886], (5581202487214475094, 5972289771228014343], (6784138025918878582, 7190719703944308372], (7749618183851698485, 8019254284118543066], (8068079250973315241, 8185655671734937642], (8356739976149429222, 8923908282731801645]}, size=15	2016-12-12 11:09:26 +08:00
Asias He	e523803a5d	token_metadata: Introduce interval_to_range helper It is used to convert a boost::icl::interval<token> interval back to a range<token>.	2016-12-12 11:09:26 +08:00
Asias He	af3d76e6ac	repair: Fix a typo in the log sucessfully -> successfully	2016-12-12 11:09:26 +08:00
Asias He	374324e6fb	repair: Fix shard_begin and shard_end A range now alternates between different shards: the first part of the range goes to shard X, the next to shard X+1, but after a while we go back to shard X. So we can't do a simple loop between shard_begin and shard_end. Fix by using the newly introduced dht::split_range_to_shards Use the cf.make_streaming_reader with ranges to simplify the code a bit.	2016-12-12 11:09:26 +08:00
Asias He	1987264beb	streaming: Make streaming reader with ranges Now that we have the new interface to make readers with ranges, we can simplify the code a lot. 1) Fewer readers are needed before: number of ranges of readers after: smp::count readers at most 2) No foreign_ptr is needed There is no need to forward to a shard to make the foreign_ptr for send_info in the first phase and forward to that shard to execute the send_info in the second phase. 3) No do_with is needed in send_mutations since si now is a lw_shared_ptr 4) Fix possible use after free of 'si' in do_send_mutations We need to take a reference of 'si' when sending the mutation with send_stream_mutation rpc call, otherwise: msg1 got exception si->mutations_done.broken() si is freed msg2 got exception si is used again The issue is introduced in `dc50ce0ce5` (streaming: Make the mutation readers when streaming starts) which is master only, branch 1.5 is not affected.	2016-12-12 09:04:21 +08:00
Asias He	463cc4fbde	dht: Introduce split_ranges_to_shards Split ranges into a shard ranges map with ring_position_range_sharder helper.	2016-12-12 09:04:21 +08:00
Asias He	044c4ff44c	dht: Introduce split_range_to_shards Split a range into shard ranges map with ring_position_range_sharder helper.	2016-12-12 09:04:21 +08:00
Asias He	cd2105b8bd	database: make_streaming_reader for ranges Allow to make a streaming reader with a vector of ranges in addition to a single range. This will be used soon in following streaming patch. We can make the reader more efficient later.	2016-12-12 09:04:21 +08:00
Duarte Nunes	ada2f1092e	dht: Make i_partitioner::tri_compare pure virtual This patch makes the i_partitioner::tri_compare() function pure virtual as it is overridden by all partitioners. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161211172037.16496-1-duarte@scylladb.com>	2016-12-11 19:29:37 +02:00
Duarte Nunes	bb66b051ed	dht: Make i_partitioner::tri_compare memory safe This patch fixes a typo in i_partitioner::tri_compare() where we were using std::max instead of std::min, thus avoiding accessing random memory and getting random results. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161211165043.17816-1-duarte@scylladb.com>	2016-12-11 18:58:10 +02:00
Amnon Heiman	08dcd8cb4a	scylla housekeeping ubuntu service: use uuid file This patch adds uuid file support for ubuntu system. It also splits the behaviour between restart and daily checks. The first runs in r mode and the second in d mode. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-12-11 16:35:07 +02:00
Amnon Heiman	6fef24aaf0	housekeeping systemd service: use uuid file This sets the housekeeping systemd service to use a uuid file and use daily mode. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-12-11 16:02:16 +02:00
Amnon Heiman	17b8306bc4	scylla-housekeeping support uuid file Allows scylla-housekeeping getting the uuid from a file instead of the command line. If the file is missing no uuid will be used. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-12-11 16:00:34 +02:00
Avi Kivity	299d1fad0b	Merge "reduce bloom filter overhead in compaction" from Raphael "Function to calculate maximum purgeable timestamp is made 10 times faster when compacting sstables overlap with 10% of all sstables. That's possible with an incremental selector that will incrementally select sstables based on key being compacted. Currently, we iterate through all non-compacting sstables and consult their bloom filter to determine max purgeable timestamp, and that will be very expensive for compactions that are frequently deciding whether or not to purge tombstones." * 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla: compaction: reduce bloom filter overhead with incremental selector tests: add test for sstable set's incremental selector sstable_set: introduce incremental selector compatible_ring_position: add function to return token	2016-12-11 09:46:58 +02:00
Glauber Costa	5803957ab5	compaction: fix build Commit `732ee275` moved tracking of one statistics value inside a lambda without capturing this in that lambda. Compilation fails as a result. Signed-off-by: Glauber Costa <glauber@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>	2016-12-10 09:00:20 +02:00
Raphael S. Carvalho	fcfc84e836	compaction: reduce bloom filter overhead with incremental selector The procedure to calculate max purgeable timestamp is optimized by only visiting sstables that overlap with key being currently compacted. That's done using incremental sstable selector. Function to calculate maximum purgeable timestamp is made 10 times faster when compacting sstables overlap with 10% of all sstables. Fixes #1322. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:17 -02:00
Raphael S. Carvalho	548f6066c5	tests: add test for sstable set's incremental selector Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:17 -02:00
Raphael S. Carvalho	02541e15c1	sstable_set: introduce incremental selector Incrementally select sstables from sstable set using token in ascending order. For leveled strategy, it returns all sstables that belong to current interval. For other strategies, it just returns all sstables from the set. Useful for compaction which needs all sstables that overlap with key being currently compacted to calculate maximum purgeable timestamp. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:16 -02:00
Glauber Costa	9b5e6d6bd8	commitlog: correctly report requests blocked The semaphore future may be unavailable for many reasons. Specifically, if the task quota is depleted right between sem.wait() and the .then() clause in get_units() the resulting future won't be available. That is particularly visible if we decrease the task quota, since those events will be more frequent: we can in those cases clearly see this counter going up, even though there aren't more requests pending than usual. This patch improves the situation by replacing that check. We now verify whether or not there are waiters in the semaphore. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>	2016-12-09 15:02:26 +02:00
Raphael S. Carvalho	732ee275f8	compaction: fix running compaction counter when splitting sstables The counter was being increased before taking the semaphore, so every pending split would count as a running compaction which misleads the user as a result. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <f2050cc3599cee7af29d4579368a154708b37731.1481248048.git.raphaelsc@scylladb.com>	2016-12-09 15:01:43 +02:00
Raphael S. Carvalho	453620a316	compatible_ring_position: add function to return token Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-08 14:25:29 -02:00
Avi Kivity	872b5ef5f0	sstables: fix probe with Unknown component Commit `53b7b7def3` ("sstables: handle unrecognized sstable component") ignores unrecognized components, but misses one code path during probe_file(). Ignore unrecognized components there too. Fixes #1922. Message-Id: <20161208131027.28939-1-avi@scylladb.com>	2016-12-08 15:24:25 +01:00
Glauber Costa	733d87fcc6	database: try to acquire semaphore before we start flush As Tomek pointed out, as we are starting the flush before we acquire the semaphore, we are not really limiting parallelism, but only delaying the end of the flush instead. Fixes #1919 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>	2016-12-08 12:18:32 +01:00
Tomasz Grabiec	3511bf4a81	Merge branch 'tgrabiec/memtable-gentle-clearing' from seastar-dev.git When row cache is disabled, update_cache() will do nothing to the memtable. Active readers may keep the memtable alive for unbounded amount of time, preventing it from going away. This doesn't play well with virtual dirty accounting. Soon before calling update_cache(), the memory which was subtracted during flush is added back to the amount of virtual dirty memory. If there was write pressure all along, we will be at the dirty memory limit. When we give back subtracted memory this will put virtual dirty way above the limit. This will stall all writes until another memtable flush drags virtual dirty down or readers finally release the memtable. We want to prevent upward jumps of virtual dirty. First part of the fix is to ensure that as long as the memtable's region is in the dirty group, we will not revert flushed memory. This must happen synchronously from region's memory being removed from the group in order to prevent upward virtual dirty jumps. To make this easier, tracking of flushed memory was moved to the memtable object. Another part of the fix is to gradually clear the memtable when cache is disabled in a similar fashion as when it's moved to cache. This ensures that the actual memory held by memtable's region is released sooner than it dies. Refs #1879	2016-12-08 12:18:32 +01:00
Gleb Natapov	a05516f14c	storage_proxy: wire up range_slice_timeouts, range_slice_unavailables and read_unavailables counters Message-Id: <20161206105154.GL1866@scylladb.com>	2016-12-08 11:42:52 +02:00
Avi Kivity	5530a61975	sstables: fix build with older boost (boost::variant::get<T&>) Older boost doesn't support boost::variant::get<T&> (where the type parameter is reference qualified); remove (unneeded anyway).	2016-12-08 10:56:05 +02:00
Pekka Enberg	0bc3ce7e09	Merge "sstables: remove sharding metadata from Statistics component" from Avi "Due to my misreading of Cassandra code, I thought it would ignore new components in the Statistics component; however, it doesn't, and the change (introduced in `bdd11648ac` ("sstables: add intra-node sharding metadata")) breaks sstable2json and likely any Cassandra code that touches sstables. To fix, move the sharding data into a new component ("Scylla.db"), which Cassandra does ignore. The new component is designed to be extensible so we don't experience the same issue later on."	2016-12-08 10:14:07 +02:00
Avi Kivity	7f26f9c0f9	Merge "repair refactor and fix" from Asias * tag 'asias/repair/subranges/refactor_fix/v1' of github.com:cloudius-systems/seastar-dev: repair: Limit the number of sub ranges repair: Use estimated_keys_for_range in repair_cf_range repair: Extract the target_partitions into repair_info class repair: Put request_transfer_ranges into repair_info class repair: Introduce check_failed_ranges helper repair: Introduce do_streaming helper repair: Make the neighbors const reference repair: Introduce repair_info repair: Attach the repair id in the stream plan name	2016-12-08 10:06:39 +02:00
Tomasz Grabiec	f7197dabf8	commitlog: Fix replay to not delete dirty segments The problem is that replay will unlink any segments which were on disk at the time the replay starts. However, some of those segments may have been created by current node since the boot. If a segment is part of reserve for example, it will be unlinked by replay, but we will still use that segment to log mutations. Those mutations will not be visible to replay after a crash though. The fix is to record preexisting segments before any new segments will have a chance to be created and use that as the replay list. Introduced in `abe7358767`. dtest failure: commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>	2016-12-07 15:54:47 +02:00
Avi Kivity	4fedbf8430	Merge "service::storage_proxy: rework collectd counters registration" from Vlad - Add "coordinator" and "replica" categories - Use a new seastar/metrics_registration framework * 'rearrange-storage-proxy-stats-v4' of github.com:cloudius-systems/seastar-dev: service::storage_proxy: rework the collectd counters registration service/storage_proxy: regroup collectd statistics	2016-12-07 15:38:40 +02:00
Avi Kivity	3c3a18f222	sstables: move sharding metadata from Statistics component to a new Scylla component The Cassandra derived sstable tools (and likely Cassandra itself) object to a new sub-component in the Statistics component; create a new Scylla component instead to host this data.	2016-12-07 15:20:13 +02:00
Avi Kivity	24140ec8c6	sstables: add support for sets of discriminated union types Allow declaring discriminated unions (with an enum type as the discriminant and any sstable serializable type as a value) and sets of these unions, with the discriminant as the key. Parsers and writers are auto-generated.	2016-12-07 13:27:52 +02:00
Avi Kivity	e0cce9d299	Merge "streaming: Improve logging" from Asias "This series adds streaming bandwidth and streaming plan name to the log when streaming is finished."	2016-12-07 12:21:47 +02:00
Amos Kong	f32f7993cc	systemd: reset housekeeping timer at each start Currently housekeeping timer won't be reset when we restart scylla-server. We expect the service to be run at each start, it will be consistent with upstart script in Ubuntu 14.04. When we restart scylla-server, housekeeping timer will also be restarted, so let's replace "OnBootSec" with "OnActiveSec". Fixes: #1601 Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>	2016-12-06 18:33:37 +02:00
Takuya ASADA	5a5ab51254	dist/ubuntu/dep: fix incorrect file path to detect previously built .deb existence check Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480667672-9453-4-git-send-email-syuu@scylladb.com>	2016-12-06 12:06:30 +02:00
Takuya ASADA	6dd6b868a6	scripts/scylla_install_pkg: support Debian Support Debian in the installation script. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480667672-9453-3-git-send-email-syuu@scylladb.com>	2016-12-06 12:06:30 +02:00
Takuya ASADA	7f2df8f86e	dist/common/scripts: introduce scylla_lib.sh To reduce duplicated code and simplify scripts, introduce scylla_lib.sh for shellscripts, which provides functions to classify distributions and load all sysconfig files. This also fixes script bugs that misdetect Debian and RHEL. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480667672-9453-2-git-send-email-syuu@scylladb.com>	2016-12-06 12:06:30 +02:00
Takuya ASADA	8464903021	dist/common/systemd/scylla-housekeeping.timer: workaround to avoid crash of systemd on RHEL 7.3 RHEL 7.3's systemd contains a known bug in timer.c: https://github.com/systemd/systemd/issues/2632 This is a workaround to avoid hitting the bug. Fixes #1846 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480452194-11683-1-git-send-email-syuu@scylladb.com>	2016-12-06 10:48:28 +02:00
Takuya ASADA	b2c0059da3	dist/common/scripts/scylla_coredump_setup: use systemd-coredump on Ubuntu 16.04 Ubuntu 16.04 has systemd-coredump, better to use it. Fixes #1916 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480679267-30844-1-git-send-email-syuu@scylladb.com>	2016-12-05 17:09:38 +02:00
Takuya ASADA	2976799ef2	main: fix startup failing on Ubuntu 15.10/16.04 Since Ubuntu 15.10/16.04 still uses Upstart to manage GUI session (not as init), when we directly launch Scylla on Ubuntu's GUI Terminal (not using systemctl or initctl), raise(SIGSTOP) is mistakenly called (because the GUI session has the "UPSTART_JOB" environment variable; this won't happen when running Scylla as a systemd service). To avoid this, we need to verify UPSTART_JOB == "scylla-server". If it's part of GUI session UPSTART_JOB has to be "unity7", we need to avoid raise(SIGSTOP) in that case. Fixes #1199 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>	2016-12-05 16:28:25 +02:00
Tomasz Grabiec	527ff6aa40	db: Clear memtable after flush when cache is disabled So that memory is released gradually (impacting latency less) and sooner than when memtable is destroyed. Active readers may keep the memtable alive for unbounded amount of time. Refs #1879	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1bba51319e	memtable: Maintain virtual dirty on clear() When memtable is flushing, it subtracts _flushed_memory from group's size to gradually allow more writes. Ideally _flushed_memory would be equal to region's size when flush ends, so the group's size would reach zero. When the memtable and its region are gone the group size should remain the same as after the flush. This is ensured by adding back _flushed_memory to group's size right before the region is removed from the group. Calling clear() before region is removed from the group breaks the accounting because it will shrink the region, but will not affect the amount of memory subtracted due to _flushed_memory. So group's size would decrease more than we want (twice the region's size). The fix is to change clear() so that it reverts _flushed_memory by the amount by which the region size is reduced. This will keep the group's size constant as long as _flushed_memory > 0.	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	c3768fe4de	memtable: Pass dirty_memory_manager& to memtable constructor The implementation assumes that memtable's region group is owned by dirty_memory_manager, and tries to obtain a reference to it like this: boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group)); This is undefined behavior when the region's group does not come from dirty manager. It's safer to be explicit about this dependency by taking a reference to dirty_memory_manager in the constructor.	2016-12-05 12:59:09 +01:00
Asias He	00d7a35949	utils: Put crc32 under utils namespace It conflicts with crc in zlib Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>	2016-12-05 11:48:29 +02:00
Takuya ASADA	54ea0055fc	dist/common/scripts/node_exporter_install: use curl instead of wget CentOS/Ubuntu contains curl on minimal installation but wget doesn't, and we already have a dependency on curl, so we should switch to curl. Fixes #1902 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480929047-2347-1-git-send-email-syuu@scylladb.com>	2016-12-05 11:26:36 +02:00
Asias He	86c2620b7a	gossip: Skip stopping if it is not started If exception is triggered early in boot when doing an I/O operation, scylla will fail because io checker calls storage service to stop transport services, and not all of them were initialized yet. Scylla was failing as follows: scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = gms::gossiper]: Assertion `local_is_initialized()' failed. Aborting on shard 0. Backtrace: 0x000000000048a2ca 0x000000000048a3d3 0x00007fc279e739ff 0x00007fc279ad6a27 0x00007fc279ad8629 0x00007fc279acf226 0x00007fc279acf2d1 0x0000000000c145f8 0x000000000110d1bc 0x000000000041bacd 0x00000000005520f1 0x00007fc279aeaf1f Aborted (core dumped) Refs #883. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Asias He <asias@scylladb.com> Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>	2016-12-05 09:42:37 +02:00
Asias He	49229964d0	streaming: Add streaming plan name when session is failed Before: [shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000] Stream failed, peers={127.0.0.1, 127.0.0.2} After: [shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000] Stream failed for streaming plan repair-in-29, peers={127.0.0.1, 127.0.0.2}	2016-12-05 08:20:18 +08:00
Asias He	1c47e26913	streaming: Add streaming plan name when all sessions are completed Before: [shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000] All sessions completed, peers={127.0.0.2} After: [shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000] All sessions completed for streaming plan repair-in-32, peers={127.0.0.2}	2016-12-05 08:20:18 +08:00
Asias He	984f427cb5	streaming: Log streaming bandwidth It looks like: [Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.1 is complete, state=COMPLETE [Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.2 is complete, state=COMPLETE [Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.3 is complete, state=COMPLETE [Stream #f3907fd0-a557-11e6-a583-000000000000] bytes_sent = 393284364, bytes_received = 0, tx_bandwidth = 17.048 MiB/s, rx_bandwidth = 0.000 MiB/s [Stream #f3907fd0-a557-11e6-a583-000000000000] All sessions completed, peers={127.0.0.1, 127.0.0.2, 127.0.0.3} Fixes #1826	2016-12-05 08:20:18 +08:00
Asias He	4ae5781e40	repair: Limit the number of sub ranges A range is divided into N sub ranges so that each sub range contains 100 partitions. So N depends on the number of partitions in that range. N can grow unbounded and the memory usage of the vector to hold these sub ranges can go unbounded. Limit the max number of sub ranges a range can be divided into. The downside is that the limited sub range will make us include more partitions in the checksum. Fixes #1917	2016-12-05 08:12:48 +08:00
Asias He	d850b86145	repair: Use estimated_keys_for_range in repair_cf_range Use the newly introduced interface to estimate number of partitions in the range.	2016-12-05 08:05:07 +08:00
Asias He	7b63cbbe0d	repair: Extract the target_partitions into repair_info class We can tune the number on a per repair basis.	2016-12-05 08:05:07 +08:00
Asias He	d9b689321e	repair: Put request_transfer_ranges into repair_info class	2016-12-05 08:05:07 +08:00
Asias He	7741393059	repair: Introduce check_failed_ranges helper To check whether there are any failed ranges and log them.	2016-12-05 08:05:07 +08:00
Asias He	f8d7aa597b	repair: Introduce do_streaming helper To execute the stream_plans to sync data between nodes.	2016-12-05 08:05:07 +08:00
Asias He	d0a6290d4f	repair: Make the neighbors const reference We do not modify it. Make it const reference.	2016-12-05 08:05:07 +08:00
Asias He	6d0f6c1a99	repair: Introduce repair_info To reduce the number of parameters we pass around. Simplify the code a little bit.	2016-12-05 08:05:06 +08:00
Asias He	9be5170c07	repair: Attach the repair id in the stream plan name So that we know which repair id this stream plan belongs to.	2016-12-05 08:05:06 +08:00
Tomasz Grabiec	d496dfeced	Update seastar submodule * seastar 7790e68...0a74317 (2): > core/reactor: Move definitions out of #ifndef > Add systemtap-sdt-devel to fedora dependencies Fixes #1915.	2016-12-02 10:49:17 +01:00
Vlad Zolotarov	e5e7ac1bd4	service::storage_proxy: rework the collectd counters registration Use the new seastar's metrics_registration framework: - Change the registration syntax. - Add a long description for each counter. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-12-01 22:38:09 -05:00
Vlad Zolotarov	3bf12e4ffc	service/storage_proxy: regroup collectd statistics Instead of putting all statistics under the same "storage_proxy" category separate them into 2 groups according to where the corresponding counters are updated: - "storage_proxy_replica" - "storage_proxy_coordinator" Fixes #1763 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-12-01 22:27:47 -05:00
Glauber Costa	99a5a77234	prevent commitlog replay position reordering during reserve refill When requests hit the commitlog, each of them will be assigned a replay position, which we expect to be ordered. If reorders happen, the request will be discarded and re-applied. Although this is supposed to be rare, it does increase our latencies, especially when big requests are involved. Processing big requests is expensive and if we have to do it twice that adds to the cost. The commitlog is supposed to issue replay positions in order, and it could be that the code that adds them to the memtables will reorder them. However, there is one instance in which the commitlog will not keep its side of the bargain. That happens when the reserve is exhausted, and we are allocating a segment directly at the same time the reserve is being replenished. The following sequence of events with its deferring points will illustrate it: on_timer: return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) { At this point, the segment id is already allocated. new_segment(): if (_reserve_segments.empty()) { [ ... ] return allocate_segment(true).then ... At this point, we have a new segment that has an id that is higher than the previous id allocated. Then we resume the execution from the deferring point in on_timer(): i = _reserve_segments.emplace(i, std::move(s)); The next time we need to allocate a segment, we'll pick it from the reserve. But the segment in the reserve has an id that is lower than the id that we have already used. Reorders are bad, but this one is particularly bad: because the reorder happens with the segment id side of the replay position, that means that every request that falls into that segment will have to be reinserted. This bug can be a bit tricky to reproduce. To make it more common, we can artificially add a sleep() fiber after the allocate_segment(false) in on_timer(). 
If we do that, we'll see a sea of reinsertions going on in the logs (if dblog is set to debug). Applying this patch (keeping the sleep) will make them all disappear. We do this by rewriting the reserve logic, so that the segments always come from the reserve. If we draw from a single pool all the time, there is no chance of reordering happening. To make that more amenable, we'll have the reserve filler always running in the background and take it out of the timer code. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>	2016-12-01 13:20:46 +01:00
Tomasz Grabiec	570fc0008b	scylla-gdb: Fix lookup of symbols in 'scylla ptr' Message-Id: <1480529617-26564-1-git-send-email-tgrabiec@scylladb.com>	2016-12-01 12:33:29 +02:00
Raphael S. Carvalho	b30a2cb21a	lcs: generate info that preserves token distribution in higher levels The information (last compacted keys) is lost after node is restarted or schema is updated, which causes strategy to be rebuilt. We need it for strategy to guarantee uniform distribution of token range across sstables, or we could end up with 1 sstable of level L overlapping with lots of sstables of level L+1, and that results in a compaction of undesired length. That information can be generated from scratch by getting last key of newest sstable in each level > 0. Fixes #1906. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>	2016-12-01 11:19:58 +02:00
Raphael S. Carvalho	38743c1948	sstables: provide write time of data component Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <59686148149f2159990329775e0cd8780bc54254.1480533805.git.raphaelsc@scylladb.com>	2016-12-01 11:19:57 +02:00
Glauber Costa	d7256e7b21	database: do not call seal directly from the streaming timer Streaming memtables have a delayed mode where many flushes are coalesced together into one, with the actual flush happening later and propagated to all the previous waiters. However, the timer that triggers the actual flush was not using the newly introduced flush infrastructure. This was a minor problem because those flushes wouldn't try to take the semaphore, and so we could have many flushes going on at the same time. What was a potential performance issue became a correctness issue when we moved the reversal of the dirty memory accounting out of revert_potentially_cleaned_up_memory() into remove_from_flush_manager(). Since the latter is only called through the flush infrastructure, it simply wasn't called. So the deferral of the reversal exposed this bug. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>	2016-11-30 18:00:55 +01:00
Tomasz Grabiec	c35e18ba12	tests: Fix use-after-free on commitlog Only shutdown() ensures all internal processes are complete. Call it before calling clear(). Message-Id: <1480495534-2253-1-git-send-email-tgrabiec@scylladb.com>	2016-11-30 11:03:26 +02:00
Avi Kivity	281b4c64ea	Update ami submodule * dist/ami/files/scylla-ami 25e101f...d5a4397 (1): > scylla_install_ami: allow specify different repository for Scylla installation and receive update	2016-11-29 19:26:49 +02:00
Takuya ASADA	17ef5e638e	dist/ami: allow specify different repository for Scylla installation and receive update This fix splits build_ami.sh --repo to three different options: --repo-for-install is for Scylla package installation, only valid during AMI construction. --repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to receive update package on AMI. --repo is both, for installation and update. Fixes #1872 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>	2016-11-29 19:26:07 +02:00
Avi Kivity	5ea235e3e8	Merge "Prevent overloading memory with timed out writes" from Tomasz "The goal of this series is to prevent unbounded memory use in cases when requests are timing out. Write requests which timed out may still occupy memory for a while because of local mutation application. This memory is not accounted for and can build up. First part of the fix changes local mutation application so that it times out at about the same time as the request handler. Then the life time of the request handler is extended to cover any background activity of that request which hasn't timed out yet. This has two main effects: (1) by timing out local writes we prevent build up of background activity for timed out requests (2) we ensure that memory used by background activity is not left behind unaccounted for. This will prevent CQL server from admitting more requests than memory usage limit allows. Fixes #1756." * tag 'tgrabiec/prevent-oom-on-timeouts-v5' of github.com:cloudius-systems/seastar-dev: storage_proxy: Do not flood logs with timeout errors database: Add counter for timed out writes storage_proxy: Delay timeout response until background work ceases storage_proxy: Propagate timeout to local writes storage_proxy: Use shared ownership for abstract_write_response_handler storage_proxy: Add counter for all alive write handlers db: Allow writes to be timed out db: Introduce counters for failed reads and writes commitlog: Allow allocations to be timed out utils/logalloc: Add ability to timeout run_when_memory_available() task utils/flush_queue: Add ability to wait with a timeout	2016-11-29 18:55:52 +02:00
Avi Kivity	28a5ff51cb	dist: add build dependency on systemtap-sdt Needed by newer seastar.	2016-11-29 18:49:51 +02:00
Tomasz Grabiec	48bbd6733c	storage_proxy: Do not flood logs with timeout errors Timeout errors are flooding the log after local mutate can time out. We don't log remote mutate timeouts, so for consistency we won't log local ones either. There is a database counter for timed out writes which can be consulted in order to check if they're occurring. Perhaps this would be better solved by a generic log message throttling/coalescing mechanism, but that's not ready yet.	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	b5d5612f98	database: Add counter for timed out writes	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	14cb31f69a	storage_proxy: Delay timeout response until background work ceases Write requests which timed out may still occupy memory for a while due to local write. It should time out soon as well but there is a time window in which it has not yet. If we don't delay timeout response, the request would be seen as not consuming any memory too early. This in turn would cause the CQL server to allow more requests than we want, in some cases causing OOM or exceeding memory limits and causing excessive cache eviction. Fixes #1756.	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	ba3779802f	storage_proxy: Propagate timeout to local writes	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	6d195a1538	storage_proxy: Use shared ownership for abstract_write_response_handler	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	5805330d98	storage_proxy: Add counter for all alive write handlers Currently the counter uses _response_handlers.size(), but after later patches we may have an active (timed out) write with no response handler, so count live instances instead.	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	2c561ecaed	db: Allow writes to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	b1ae6ad2ad	db: Introduce counters for failed reads and writes	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	31645e2c4a	commitlog: Allow allocations to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	61d81617e1	utils/flush_queue: Add ability to wait with a timeout	2016-11-29 16:40:58 +01:00
Raphael S. Carvalho	a16425833c	size_tiered: do not recreate bucket when it goes beyond max threshold Problem will cause size tiered to return small jobs when there are more than max_threshold sstables of similar size. For example, if max_threshold is 32, and there are 36 sstables of similar size, strategy will only return 4 sstables to be compacted. That's because we incorrectly create a new bucket when it meets the max threshold. What we should do is to allow buckets to grow beyond max threshold and trim them when selecting the most suitable one for compaction. Important to mention that estimation for size tiered will now work better when there are more than max_threshold sstables of similar size. Fixes #1901. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>	2016-11-29 16:56:02 +02:00
Glauber Costa	353a4cd2d4	commitlog: sync segments before acquiring semaphore on shutdown. Sync all segments before acquiring the semaphore, otherwise waiting may have to wait for the timer to kick in and push them down. Note that we can't guarantee that no other requests were executed in the mean time, so we have to sync again. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>	2016-11-29 11:07:28 +02:00
Tomasz Grabiec	96c7764458	Revert "prevent commitlog replay position reordering during reserve refill" This reverts commit `0e9b75d406`. commitlog_test fails with this: Running 14 test cases... ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen WARN 2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory) ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! 
Ignoring and trying to continue, but shouldn't happen WARN 2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required) *** 1 failure detected in test suite "tests/commitlog_test.cc" WARN 2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)	2016-11-28 20:52:13 +01:00
Raphael S. Carvalho	f141b0cdae	database: atomically add new sstables to cf when refreshing New sstables are loaded and added in parallel, meaning that scylla can potentially return stale data if a new sstable containing a tombstone wasn't loaded yet. Compaction should also not run until all new sstables are added for similar reasons. Fix is about separating blocking and non-blocking steps to allow atomic add of multiple new sstables. Fixes #1368. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>	2016-11-28 20:30:48 +02:00
Glauber Costa	0e9b75d406	prevent commitlog replay position reordering during reserve refill When requests hit the commitlog, each of them will be assigned a replay position, which we expect to be ordered. If reorders happen, the request will be discarded and re-applied. Although this is supposed to be rare, it does increase our latencies, especially when big requests are involved. Processing big requests is expensive and if we have to do it twice that adds to the cost. The commitlog is supposed to issue replay positions in order, and it could be that the code that adds them to the memtables will reorder them. However, there is one instance in which the commitlog will not keep its side of the bargain. That happens when the reserve is exhausted, and we are allocating a segment directly at the same time the reserve is being replenished. The following sequence of events with its deferring points will illustrate it: on_timer: return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) { At this point, the segment id is already allocated. new_segment(): if (_reserve_segments.empty()) { [ ... ] return allocate_segment(true).then ... At this point, we have a new segment that has an id that is higher than the previous id allocated. Then we resume the execution from the deferring point in on_timer(): i = _reserve_segments.emplace(i, std::move(s)); The next time we need to allocate a segment, we'll pick it from the reserve. But the segment in the reserve has an id that is lower than the id that we have already used. Reorders are bad, but this one is particularly bad: because the reorder happens with the segment id side of the replay position, that means that every request that falls into that segment will have to be reinserted. This bug can be a bit tricky to reproduce. To make it more common, we can artificially add a sleep() fiber after the allocate_segment(false) in on_timer(). 
If we do that, we'll see a sea of reinsertions going on in the logs (if dblog is set to debug). Applying this patch (keeping the sleep) will make them all disappear. We do this by rewriting the reserve logic, so that the segments always come from the reserve. If we draw from a single pool all the time, there is no chance of reordering happening. To make that more amenable, we'll have the reserve filler always running in the background and take it out of the timer code. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>	2016-11-28 19:26:26 +01:00
Takuya ASADA	1042e40188	dist/common/scripts/scylla_kernel_check: fix incorrect document URL Fixes #1871 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>	2016-11-28 13:51:19 +02:00
Avi Kivity	18df2d9e9e	partition_version: fix const correctness in rows_entry_compare Using a non-const-correct comparator results in build failures with boost 1.55. Fixes #1892. Message-Id: <20161128104335.28789-1-avi@scylladb.com>	2016-11-28 10:55:12 +00:00
Avi Kivity	5358984982	Merge seastar upstream * seastar 93c3b12...7790e68 (7): > core/reactor: Introduce reactor-*/dervie-busy_ns metric > Collectd: Hold a reference to the metrics implementation in registration > future: Improve comments > fstream: actually use dynamically adjusted buffer > debug: add latency detector script > reactor: add static probes for latency detector > semaphore: Fix with_semaphore() in case wait() throws	2016-11-28 11:05:59 +02:00
Avi Kivity	28857e42e7	Merge " Virtualize size_estimates system table" from Duarte "We currently write the size_estimates system table for every schema on a periodic basis, currently set to 5 minutes, which can interfere with an ongoing workload. This patchset virtualizes it such that queries are intercepted and we calculate the results on the fly, only for the ranges the caller is interested in. Fixes #1616" * 'virtual-estimates/v4' of github.com:duarten/scylla: size_estimates_virtual_reader: Add unit test db: Delete size_estimates_recorder size_estimates: Add virtual reader column_family: Add support for virtual readers storage_service: get_local_tokens() returns a future nonwrapping_range: Add slice() function range: Find a sequence's lower and upper bounds system_keyspace: Build mutations for size estimates size_estimates: Store the token range as bytes range_estimates: Add schema murmur3_partitioner: Convert maximum_token to sstring	2016-11-28 10:12:59 +02:00
Avi Kivity	176fca5775	logalloc: use correct header for unique_ptr <bits/unique_ptr.hh> is a libstdc++ internal header. Use <memory> instead.	2016-11-27 23:08:04 +02:00
Glauber Costa	c32803f2f0	database: move reversion of virtual dirty state closer to update_cache. When we finish writing a memtable, we revert the dirty memory charges immediately. When we do that, dirty memory will grow back to what it was, and soon (we hope) will go down again when we release the requests for real. During that time, we may not accept new requests. Sealing can take a long time, especially in the face of Linux issues like the ones we have seen in the past. It also will take proportionally more time if the SSTables end up being small, which is a possibility in some scenarios. This patch changes the dirty_memory_manager so that the charges won't be reverted right after we finish the flush. Rather, we will hold on to them, and revert them right before we update the cache. We don't need to do it for all classes of memtable writes, because after we finish flushing, flush_one() will destroy the hashed element anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>	2016-11-24 18:18:15 +01:00
Raphael S. Carvalho	4781b6eb71	sstables: use nonwrapping_range::make to avoid compilation issues GCC 5.3.1 was unable to convert bound to optional<bound>. sstables/sstables.cc:2494:123: error: no matching function for call to ‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position, dht::ring_position)’ (dtr.right.exclusive ? dht::ring_position::starting_at : dht::ring_position::ending_at)(std::move(t2))); In file included from ./dht/i_partitioner.hh:52:0, from ./query-request.hh:28, from ./clustering_key_filter.hh:27, from sstables/sstables.hh:35, from sstables/sstables.cc:38: ./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range( const wrapping_range<U>&) [with T = dht::ring_position] explicit nonwrapping_range(const wrapping_range<T>& r) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>	2016-11-24 11:26:16 +02:00
Duarte Nunes	cc3f26c993	lz4: Conditionally use LZ4_compress_default() Since not all distributions have a version of LZ4 with LZ4_compress_default(), we use it conditionally. This is especially important beginning with version 1.7.3 of LZ4, which deprecates the LZ4_compress() function in favour of LZ4_compress_default() and thus prevents Scylla from compiling due to the deprecation warning. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161124092339.23017-1-duarte@scylladb.com>	2016-11-24 11:25:03 +02:00
Avi Kivity	1be95b1227	Merge seastar upstream * seastar d6f26d8...93c3b12 (3): > rpc: Conditionally use LZ4_compress_default() > queue: allow queue to change its maximum size > util/defer: add missing return to move assignment	2016-11-24 11:00:53 +02:00
Duarte Nunes	a527ba285f	thrift: Don't apply cell limit across rows In Thrift, SliceRange defines a count that limits the number of cells to return from that row (in CQL3 terms, it limits the number of rows in that partition). While this limit is honored in the engine, the Thrift layer also applies the same limit, which, while redundant in most cases, is used to support the get_paged_slice verb. Currently, the limit is not being reset per Thrift row (CQL3 partition), so in practice, instead of limiting the cells in a row, we're limiting the rows we return as well. This patch fixes that by ensuring the limit applies only within a row/partition. Fixes #1882 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161123220001.15496-1-duarte@scylladb.com>	2016-11-24 10:38:31 +02:00
Takuya ASADA	ce80fb3a39	dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart) Follow the change of NOFILE for non-systemd environment. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>	2016-11-24 10:13:41 +02:00
Avi Kivity	d58c8aaa32	db: remove unused belongs_to_{current,other}_shard(s) functions Obsoleted by new sharding mechanism, but break the build for some.	2016-11-23 21:39:29 +02:00
Avi Kivity	b81a57e8eb	config, dht: reduce default msb ignore bits to 4 With the default value of 12, a node's range is partitioned into 4096 * smp::count sub-ranges which are queried sequentially for a range scan. If the number of rows in the table is smaller than the required result size, we will query all of them. This can take so long that we time out. A better fix is to query multiple sub-ranges in parallel and merge them, but for that we need to resurrect the non-sequential merger.	2016-11-23 21:25:37 +02:00
Pekka Enberg	c526a9f0be	Update seastar submodule * seastar 7473945...d6f26d8 (2): > semaphore_units: add missing return statement > metrics: Do not detroy the metrics layer if it is been used	2016-11-23 20:27:09 +02:00
Paweł Dziepak	919825a2c7	Merge "Improve sharding in large clusters" from Avi "Clusters with a large number of nodes, or a low number of vnodes, and a high number of shards, or a combination, suffer from an aliasing problem: both vnodes and intra-node sharding consider the most significant bits to select the owning node and owning shard respectively. Since the same bits are used for both, a low number of vnodes leads to some shards being overcommitted relative to others. This series fixes the problem by sharding on bits 0:47 of the token (murmur3 partitioner only), leaving the most significant 12 bits for vnodes. Simulation shows that this value provides reasonable sharding for 100-node, 30-shard clusters. In order to prevent re-sharding sstables on each boot, token ranges for the range are stored in a new sub-component of the sstable Statistics component. With the default 12 ignored bits we have 4096 token ranges for non-Level-compacted SSTables, which takes some space but is still reasonable. Fixes #1277."	2016-11-23 11:25:53 +00:00
Glauber Costa	18b9fa3d43	dist: increase number of open files This limit was found to be too low for production environments. It would be hit at boot, when we're touching a lot of files from multiple shards before deciding that we don't need them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>	2016-11-23 13:10:25 +02:00
Avi Kivity	07d5a20bae	Wire up sharding ignore msb parameter to configuration We might have used a fancy map<sstring, any> to pass the parameters, but that's overkill for now.	2016-11-22 22:40:47 +02:00
Avi Kivity	8b1d689de8	partitioner: add ignore_msb parameters to byte ordered and random partitioners Ignored; doesn't make sense on byte ordered, and random is deprecated.	2016-11-22 21:56:42 +02:00
Avi Kivity	af16c0fac4	murmur3_partitioner: shard on the middle token bits, not most significant bits Sharding on the most significant token bits aliases with the vnode mechanism, which also uses the most significant bits; this requires a huge number of vnodes to achieve good sharding. This patch teaches the murmur3 partitioner to ignore the most significant N bits when calculating a token's shard, so we use token bits which still have some entropy. In effect, this changes the token range layout from shard 0 shard 1 ... shard S-1 to shard 0 shard 1 ... shard S-1 shard 0 shard 1 ... shard S-1 ... shard 0 shard 1 ... shard S-1 Where the number of repetitions of the block is 2^(ignored msb bits). For compatibility, the default is zero ignored bits, matching the pre-patch state, until we wire things up.	2016-11-22 21:56:42 +02:00
Avi Kivity	024c8ef8a1	db: adjust sstable load to use sstable self-reporting of shard ownership Instead of calculating the owning shard from the sstable's partition key range, delegate to the new sstable method for getting owning shard information. This insulates us from changes in the sharding algorithm.	2016-11-22 21:56:40 +02:00
Avi Kivity	98a4544e1c	sstables: add method to get sstable owning shards from an unloaded sstable When we load an sstable, we don't know beforehand which shards it belongs to; we don't want to open it until we do. Add a method that allows us to read just the sharding data, without opening anything else.	2016-11-22 21:52:23 +02:00
Avi Kivity	bdd11648ac	sstables: add intra-node sharding metadata Add a metadata component that describes token ranges that are spanned by this sstable. With the current sharding algorithm, where each shard owns a single token range, the first/last partition key is sufficient to describe sharding information, but for multi-range algorithms, this is not sufficient.	2016-11-22 21:44:25 +02:00
Avi Kivity	316ef1d70a	sstables: automate writing statistics components Add a virtual function to metadata_base so we can loop over statistics components when writing them.	2016-11-22 21:05:06 +02:00
Glauber Costa	13973e7f3b	keep background work semaphore alive during sstable flush We have a semaphore controlling the amount of background work generated by the memtable flush process. However, because we are not moving it inside the memtable post-flush continuation, the units are being released when we start the flush and not when we finish it. That's not the intended behavior and that can cause flushes to accumulate. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>	2016-11-22 19:54:08 +01:00
Avi Kivity	d05b22e502	sstables: automatically calculate offsets in statistics Instead of calculating the offset for each statistic component manually, use a loop to iterate over all components, accumulating the offset as we go along.	2016-11-22 20:35:24 +02:00
Avi Kivity	7c5e6525ef	sstables: switch statistics components to generic serialized_size() implementation	2016-11-22 20:20:38 +02:00
Avi Kivity	096ae59a5b	sstables: introduce generic serialized_size() Introduce a new function that reuses the file_writer code to compute the serialized size of an sstable object, by serializing it into memory and discarding the result.	2016-11-22 20:06:23 +02:00
Avi Kivity	3c06ffac9d	sstables: const correctness for the write(file_writer&, T&) functions write() doesn't need to change its input; so change it to const. The only snag is that describe_type() isn't and can't be made const-correct, so cheat when it is called and const_cast the input. This helps in writing a generic serialized_size() that is const correct, in the next patch.	2016-11-22 20:04:27 +02:00
Tomasz Grabiec	eefc538225	Update seastar submodule * seastar 7504026...7473945 (1): > Merge "Improve support for timeouts in primitives"	2016-11-22 17:51:29 +01:00
Glauber Costa	0b8b5abf16	commitlog: acquire semaphore earlier Recently we have changed our shutdown strategy to wait for the _request_controller semaphore to make sure no other allocations are in-flight. That was done to fix an actual issue. The problem is that this wasn't done early enough. We acquire the semaphore after we have already marked ourselves as _shutdown and released the timer. That means that if there is an allocation in flight that needs to use a new segment, it will never finish - and we'll therefore never acquire the semaphore. Fix it by acquiring it first. At this point the allocations will all be done and gone, and then we can shutdown everything else. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>	2016-11-21 22:19:32 +00:00
Avi Kivity	6bdb8ba31d	storage_proxy: don't query concurrently needlessly during range queries storage_proxy has an optimization where it tries to query multiple token ranges concurrently to satisfy very large requests (an optimization which is likely meaningless when paging is enabled, as it always should be). However, the rows-per-range code severely underestimates the number of rows per range, resulting in a large number of "read-ahead" internal queries being performed, the results of most of which are discarded. Fix by disabling this code. We should likely remove it completely, but let's start with a band-aid that can be backported. Fixes #1863. Message-Id: <20161120165741.2488-1-avi@scylladb.com>	2016-11-21 18:19:46 +02:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. 
Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Duarte Nunes	def2bc72b0	size_estimates_virtual_reader: Add unit test Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	6a37d87c76	db: Delete size_estimates_recorder Now that access to the size_estimates system is virtualized, we no longer need the recorder. Fixes #1616 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	225648780d	size_estimates: Add virtual reader This patch adds a virtual mutation_reader so that queries to the size_estimates system table are handled by the engine without needing to perform any IO. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	cd7e2fd602	column_family: Add support for virtual readers Virtual readers allow queries to selected tables, usually system tables, to be answered by the engine. This is useful for tables which aren't written by users and whose contents can be calculated on demand. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Duarte Nunes	c0d450c57d	storage_service: get_local_tokens() returns a future This patch changes the get_local_tokens() function in storage_service to return a future instead of requiring running under a seastar::thread. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	9b384d375f	nonwrapping_range: Add slice() function This patch adds the slice() function to nonwrapping_range, which uses its bounds to slice an input sequence. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	bdba8d99c3	range: Find a sequence's lower and upper bounds This patch extracts a pair of functions from mutation_partition to calculate the lower and upper bounds of a sequence from a nonwrapping_range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	636287fdf2	system_keyspace: Build mutations for size estimates This patch adds a function to system_keyspace responsible for creating a mutation to a partition of the size_estimates system table from a set of range_estimates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:04 +00:00
Duarte Nunes	18ddec245e	size_estimates: Store the token range as bytes This patch changes the range_estimates struct so that the tokens are represented as utf8 encoded bytes. This will make future patches require less conversions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:14:21 +00:00
Duarte Nunes	e7a5162c1d	range_estimates: Add schema This will be used in future patches, when virtualizing the size_estimates system table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 10:56:32 +00:00
Duarte Nunes	01815ecd24	murmur3_partitioner: Convert maximum_token to sstring This patch ensures we can convert the maximum_token to an sstring. For Cassandra, the minimum and maximum tokens have the same representation. So, we use the string representation of the maximum_token for the maximum_token. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 10:56:32 +00:00
Takuya ASADA	eee63027e5	dist/ami/build_ami.sh: update base AMI to CentOS7-Base5 To drop unnecessary .ssh/authorized_keys, we need to update base AMI. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479496938-29724-1-git-send-email-syuu@scylladb.com>	2016-11-21 10:12:47 +02:00
Avi Kivity	783729c540	Merge "Clean up T::memory_usage() function" from Paweł "This series is just a cleanup which intention is to deal with all confusion related to the way T::memory_usage() functions work. * T::memory_usage() which returned external memory usage are renamed to T::external_memory_usage() * T::memory_usage() is introduced where needed to avoid repeating sizeof(T) + T::external_memory_usage()" Paweł Dziepak (6): rename memory_usage() to external_memory_usage() where applicable streamed_mutation: add memory_usage() to mutation fragment types keys: add memory_usage() partition_snapshot_accounter: use range_tombstone::memory_usage() mutation_rebuilder: use memory_usage() frozen_mutation: use memory_usage()	2016-11-21 10:11:39 +02:00
Avi Kivity	498887ca0d	Merge seastar upstream * seastar 31c5fd7...7504026 (2): > circular_buffer: add move assignment operator > scollectd: Fix serialization of GAUGE-typed values	2016-11-20 20:16:56 +02:00
Gleb Natapov	9222a47fed	sstable test: add test for generated summary data Message-Id: <20161117155051.GV6765@scylladb.com>	2016-11-20 19:50:45 +02:00
Glauber Costa	21c1e2b48c	commitlog: wait for pending allocations to finish before closing gate. allocations may enter the gate, so it would be wise for us to wait for them. Fixes #1860 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>	2016-11-20 19:45:33 +02:00
Avi Kivity	a39b92a40a	build: fix tests-with-symbols generation Bad indentation caused the libs variable for tests-with-symbols to be overwritten, resulting in link failure.	2016-11-20 17:23:26 +02:00
Glauber Costa	504b5ac30f	database: don't check for waiters in the condition variable predicate. In the last iterations of this patchset, we have moved explicit flushes to acquire the semaphore directly and the coalescing inside the memtable_list. As a result, we are no longer keeping any kind of action for them inside the condition variable. Checking for them no longer has a purpose. This is a cleanup patch that removes those checks. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>	2016-11-18 21:34:48 +01:00
Glauber Costa	1933349654	database: fix direct flushes of non-durable column families. If a Column Family is non-durable, then its flushes will never create a memtable flush reader. Our current flush logic depends on that being created and destroyed to release the semaphore permits on the flush. We will remove the permits ourselves if there is an exception, but not under normal circumstances. Given this issue, however, it would be more adequate to always try to remove the permits after we flush. If the permits were already removed by the flush reader, then this test will just see that the permit is not in the map and return. But if it is still there, then it is removed. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>	2016-11-18 21:32:29 +01:00
Avi Kivity	6eecbc80dc	CONTRIBUTING.md: add sections for help and issues Don't scare away users reporting an issue with the CLA.	2016-11-18 22:21:10 +02:00
Glauber Costa	60b7d35f15	commitlog: close file after read, and not at stop There are other code paths that may interrupt the read in the middle and bypass stop. It's safer this way. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>	2016-11-18 14:09:33 +00:00
Paweł Dziepak	249e0ab087	frozen_mutation: use memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	948c062e64	mutation_rebuilder: use memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	e04664e851	partition_snapshot_accounter: use range_tombstone::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	711bd19f16	keys: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	6b8bf030c0	streamed_mutation: add memory_usage() to mutation fragment types This patch introduces memory_usage() to static_row, clustering_row and range_tombstone so that we can avoid repeating sizeof(T) + x.external_memory_usage(). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	ef57b9a26f	rename memory_usage() to external_memory_usage() where applicable Renaming the function to external_memory_usage() makes it clear that sizeof(T) is not included, something that was a source of confusion in the past. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Avi Kivity	fec4ef3390	Merge "Make sure commitlog replay is able to make progress" from Glauber "Fixes #1856 Commitlog replay reads are being issued without a priority. That means they will lose to compaction every time." * 'issue-1856-v2' of github.com:glommer/scylla: commitlog: use read ahead for replay requests commitlog: use commitlog priority for replay commitlog: close replay file	2016-11-18 12:04:18 +02:00
Takuya ASADA	55e5123313	dist/redhat: Support RHEL7 We supported install CentOS7 .rpm on RHEL7, but we haven't supported building on RHEL7, since there is little difference between CentOS, and that causes build error. This patch fixes the error, now we can produce .rpm for RHEL7 without using CentOS. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>	2016-11-18 11:56:05 +02:00
Glauber Costa	461778918b	fix shutdown and exception conditions for flush logic This patch addresses post-merge follow up comments by Tomek. Basically, what we do is: - we don't need to signal() from remove_from_flush_manager(), because the explicit flushes no longer wait on the condition variable. So we don't. - We now wait on the stop() flushes (regardless of their return status) so we can make sure that the _flush_queue will indeed be done with. - we acquire the semaphore before shutting down the dirty_memory_manager to make sure that there are no pending flushes - the flush manager that holds the semaphore has to match in the exception handler Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>	2016-11-17 21:16:44 +01:00
Glauber Costa	59a41cf7f1	commitlog: use read ahead for replay requests Aside from putting the requests in the commitlog class, read ahead will help us going through the file faster. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-17 14:09:54 -05:00
Glauber Costa	aa375cd33d	commitlog: use commitlog priority for replay Right now replay is being issued with the standard seastar priority. The rationale for that at the time is that it is an early event that doesn't really share the disk with anybody. That is largely untrue now that we start compactions on boot. Compactions may fight for bandwidth with the commitlog, and with such low priority the commitlog is guaranteed to lose. Fixes #1856 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-17 14:09:02 -05:00
Glauber Costa	4d3d774757	commitlog: close replay file Replay file is opened, so it should be closed. We're not seeing any problems arising from this, but they may happen. Enabling read ahead in this stream makes them happen immediately. Fix it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-17 12:35:24 -05:00
Avi Kivity	eaf83ab59c	Merge seastar upstream * seastar 3001c08...31c5fd7 (2): > Safe use of collectd during shutdown > udp: abort reader and writer when udp channel close	2016-11-17 18:44:28 +02:00
Piotr Jastrzebski	9d33948487	mutation_rebuilder: fix fragment size calculation It wasn't calculating the size of data correctly. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c03dfff7bf1ca3199991e5864189f98bfa2942ea.1479397736.git.piotr@scylladb.com>	2016-11-17 16:23:42 +00:00
Raphael S. Carvalho	3dc9294023	db: do not leak deleted sstable when deletion triggers an exception The leakage results in deleted sstables being opened until shutdown, and disk space isn't released. That's because column_family::rebuild_sstable_list() will not remove reference to deleted sstables if an exception was triggered in sstables::delete_atomically(). A sstable only has its files closed when its object is destructed. The exception happens when a major compaction is issued in parallel to a regular one, and one of them will be unable to delete a sstable already deleted by the other. That results in remove_by_toc_name() triggering boost::filesystem ::filesystem_error because TOC and temporary TOC don't exist. We wouldn't have seen this problem if major compaction were going through compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should be made resilient. Fixes #1840. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>	2016-11-17 17:46:36 +02:00
Gleb Natapov	c052a1bc4f	sstable: use schema's min_index_interval config when generating missing summary Message-Id: <20161116181937.GA25303@scylladb.com>	2016-11-17 15:24:03 +02:00
Avi Kivity	5d067eebf2	Merge "get rid of memtable size parameter and rework flush logic" from Glauber "This patchset allows Scylla to determine the size of a memtable instead of relying on the user-provided memtable_cleanup_threshold. It does that by allowing the region_group to specify a soft limit which will trigger the allocation as early as it is reached. Given that, we'll keep the memtables in memory for as long as it takes to reach that limit, regardless of the individual size of any single one of them. That limit is set to 1/4 of dirty memory. That's the same as last submission, except this time I have run some experiments to gauge behavior of that versus 1/2 of dirty memory, which was a preferred theoretical value. After that is done, the flush logic is reworked to guarantee that flushes are not initiated if we already have one memtable under flush. That allows us to better take advantage of coalescing opportunities with new requests and prevents the pending memtable explosion that is ultimately responsible for Issue 1817. I have run mainly two workloads with this. The first one a local RF=1 workload with large partitions, sized 128kB and 100 threads.
The results are: Before: op rate : 632 [WRITE:632] partition rate : 632 [WRITE:632] row rate : 632 [WRITE:632] latency mean : 157.8 [WRITE:157.8] latency median : 115.5 [WRITE:115.5] latency 95th percentile : 486.7 [WRITE:486.7] latency 99th percentile : 534.8 [WRITE:534.8] latency 99.9th percentile : 599.0 [WRITE:599.0] latency max : 722.6 [WRITE:722.6] Total partitions : 189667 [WRITE:189667] Total errors : 0 [WRITE:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:00 END After: op rate : 951 [WRITE:951] partition rate : 951 [WRITE:951] row rate : 951 [WRITE:951] latency mean : 104.8 [WRITE:104.8] latency median : 102.5 [WRITE:102.5] latency 95th percentile : 155.8 [WRITE:155.8] latency 99th percentile : 177.8 [WRITE:177.8] latency 99.9th percentile : 686.4 [WRITE:686.4] latency max : 1081.4 [WRITE:1081.4] Total partitions : 285324 [WRITE:285324] Total errors : 0 [WRITE:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:00 END The other workload was the workload described in #1817. And the result is that we now have a load that is very stable around 100k ops/s and hardly any timeouts, instead of the 1.4 baseline of wild variations around 100k ops/s and lots of timeouts, or the deep reduction of 1.5-rc1." * 'issue-1817-v4' of github.com:glommer/scylla: database: rework memtable flush logic get rid of max_memtable_size pass a region to dirty_memory_manager accounting API memtable: add a method to expose the region_group logalloc: allow region group reclaimer to specify a soft limit database: remove outdated comment database: uphold virtual dirty for system tables.	2016-11-17 14:36:43 +02:00
Avi Kivity	18078bea9b	storage_proxy: avoid calculating digest when only one replica is contacted If we're talking to just one replica, the digest is not going to be used, so better not to calculate it at all. The optimization helps with LOCAL_ONE queries where the result is large, but does not contain large blobs (many small rows). This patch adds a digest_algorithm parameter to the READ_DATA verb that can take on two values: none and MD5 (default), and sets it to none when we're reading from one replica. In the future we may add other values for more hardware-friendly digest algorithms. Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>	2016-11-17 13:04:30 +02:00
Asias He	dc50ce0ce5	streaming: Make the mutation readers when streaming starts Currently we make the mutation readers for streaming at different time point, i.e., do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) { make a mutation reader for this range read mutations from the reader and send }) If there is a write workload in the background, we will stream extra data, since the later the reader is made the more data we need to send. Fix it by making all the readers before starting to stream. Fixes #1815 Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>	2016-11-17 12:41:53 +02:00
Gleb Natapov	ae0a2935b4	sstables: fix ad-hoc summary creation If sstable Summary is not present Scylla does not refuse to boot but instead creates summary information on the fly. There is a bug in this code though. The Summary file is a map between keys and offsets into Index file, but the code creates map between keys and Data file offsets instead. Fix it by keeping offset of an index entry in index_entry structure and use it during Summary file creation. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20161116165421.GA22296@scylladb.com>	2016-11-17 11:05:23 +02:00
Glauber Costa	f08162e181	database: rework memtable flush logic The way we currently flush memtables, we seal the current one but wait on a semaphore for the actual flush to proceed. This is pointless, because if the flush is not proceeding we'll use up memory for the new entries anyway, be them in a newly opened memtable or not. As a matter of fact, by opening a new memtable we are foregoing coalescing opportunities. After recent changes to the flush paths, we are now in a position to do differently. We move the semaphore earlier, and if we can't acquire it we keep appending to the current memtable. For explicit flushes, we'll queue and prioritize them over memory-based flushes. This has the nice property of potentially coalescing various flushes for the same CF into one. Coalescing flushes for the same CF is particularly helpful for commitlog-initiated flushes that can't complete within the flush period. What we see currently, is that under heavy load the commitlog will keep sealing memtables adding to the existing load. Another interesting property of this approach is that we can keep the disk utilization higher, by allowing a new flush to start before the memtable is fully sealed. By design, every time a memtable is finished flushing it will call revert_potentially_cleaned_up_memory() to revert the virtual memory charges. That is the perfect moment for us to act. It indicates that all the data flushing part is done. The way we'll do it is by keeping the semaphore_units alive for this memtable. When the flush ends, we destroy that object. This will effectively trigger the next flush if there is a next flush that can be initiated. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:58 -05:00
Glauber Costa	895e838ac0	get rid of max_memtable_size After recent changes to the memtable code, there is no reason for us to uphold a maximum memtable size. Now that we only flush one memtable at a time anyway, and also have soft limit notifications from the region_group_reclaimer, we can just set the soft limit to the target size and let all of that be handled by the dirty_memory_manager. It does have the added property that we'll be flushing when we globally reach the soft limit threshold. In conditions in which we have multiple CF writes fighting for memory, that guarantees that we will start flushing much earlier than the hard limit. The threshold is set to 1/4 of dirty memory. While in theory we would prefer the memtables to go as big as 1/2 of dirty memory, in my experiments I have found 1/4 to be a better fit, at least for the moment. The reason for such behavior is that in situations where we have slow disks, setting the soft limit to 1/2 of dirty will put us in a situation in which we may not have finished writing down the memtable when we hit the limit, and then throttle. When we set the threshold to 1/4 of dirty, we don't throttle at all. This behavior could potentially be fixed by not doing the full memtable-based throttling after we do the commitlog throttling, but that is not something realistic for the moment. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Glauber Costa	2ed3f342c1	pass a region to dirty_memory_manager accounting API We would like to know from which region is a particular flush coming from, and account accordingly. The reasoning behind that, is that soon we'll be driving the flushes internally from the dirty_memory_manager without explicitly triggering them. We need to start a flush before the current one finishes, otherwise we'll have a period without significant disk activity when the current SSTable is being sealed, the caches are being updated, etc. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Glauber Costa	0b337dab14	memtable: add a method to expose the region_group That is technically not needed because a memtable inherits from group. So whenever we have a memtable, we can use its group() method to obtain a group for it, and then from there go to the region_group. However, region() is a const method in the memtable, so we have to play tricks with the const_cast, or remove the constness from the region. An alternative to that, which I prefer, is to expose a method for the region_group directly from the memtable object that does the right thing and bypasses all that. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Glauber Costa	f86c9e36f4	logalloc: allow region group reclaimer to specify a soft limit The region_group_reclaimer will let us know every time we are over the limit we have specified for memory usage. However, for some applications, we would be interested in knowing about memory build up earlier, so we can start doing something about it before we reach that condition. This patch introduces soft limit notifications for the region_group_reclaimer. After this patch is applied, start_reclaim() is called earlier, and stop_reclaim() later, after the soft condition is abated. There are methods that allow one to easily test if the pressure condition is a soft limit condition or a hard, threshold condition and act accordingly. Whether to act on both conditions or just one of them is up to the application. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Glauber Costa	da738a6cd1	database: remove outdated comment Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Glauber Costa	919de98aa5	database: uphold virtual dirty for system tables. Currently the virtual dirty mechanism is not properly set for system tables. We haven't divided the system table allowance by two, which means it won't start throttling earlier as it was supposed to. In practice, this has little effect because system table requests are very well behaved, their sizes well known, and they tend to be force-flushed. But we should be consistent. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Avi Kivity	f26c6569d2	Update scylla-ami submodule * dist/ami/files/scylla-ami 61ff5c6...25e101f (1): > scylla_install_ami: delete unneeded authorized_keys from AMI image	2016-11-16 22:36:31 +02:00
Takuya ASADA	3802f289f8	dist: remove bc from dependency Since we replaced shellscript based cpuset generator with python based one, we no longer depend on the bc command. `d571123afd` So drop it from .rpm/.deb dependency. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1479152876-11020-1-git-send-email-syuu@scylladb.com>	2016-11-16 15:02:55 +02:00
Amnon Heiman	a4be7afbb0	API: cache_capacity should use uint for summing Using integer as a type for the map_reduce causes number overflow. Fixes #1801 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1479299425-782-1-git-send-email-amnon@scylladb.com>	2016-11-16 13:55:46 +01:00
Avi Kivity	31d0e31de2	Merge seastar upstream * seastar 47e1821...3001c08 (5): > core: Introduce weak_ptr<> > timer: Add missing include > tutorial: fix TeX template > Merge "Adding the metrics layer" from Amnon > core/memory: let malloc(0) return a valid pointer	2016-11-16 14:20:49 +02:00
Pekka Enberg	8a4bd6ecd5	README: Guidelines for contributing Message-Id: <1479288359-14168-1-git-send-email-penberg@scylladb.com>	2016-11-16 12:50:02 +02:00
Paweł Dziepak	f877be50b0	Merge "Keep wide partition cache entry longer than others" from Piotr "Cache entries for wide partitions are usually smaller than other entries and the cost of recreating them is higher so it makes sense to keep them longer than ordinary entries."	2016-11-15 20:44:52 +00:00
Paweł Dziepak	b8d737ff0a	tests/row_cache_test: verify that eviction follows lru Refs #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:57:54 +01:00
Paweł Dziepak	999dafbe57	row_cache: touch entries read during range queries Fixes #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:54:11 +01:00
Tomasz Grabiec	11c5f4ab50	storage_proxy: Add counters for throttled writes	2016-11-15 17:18:25 +01:00
Piotr Jastrzebski	5ec668c9c6	Add separate LRU for wide partitions. Evict wide partitions only every 1000 normal partition evictions. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-11-15 16:19:13 +01:00
Piotr Jastrzebski	9a41bfbf69	Add collectd metric for wide partition evictions. This will allow us to see how big is an amount of evictions of cached info about wide partitions. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-11-15 15:53:14 +01:00
Paweł Dziepak	055d78ee4c	query_pagers: distinct queries do not have clustering keys Query pager needs to handle results that contain partitions with possibly multiple clustering rows quite differently than results with just one row per partition (for example a page may end in a middle of partition). However, the logic dealing with partitions with clustering rows doesn't work correctly for SELECT DISTINCT queries, which are much more similar to the ones for schemas without clustering key. The solution is to set _has_clustering_keys to false in case of SELECT DISTINCT queries regardless of the schema which will make pager correctly expect each partition to return at most one row. Fixes #1822. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 11:06:01 +01:00
Glauber Costa	93386bcec7	histograms: do not use latency_in_nano Now that the histogram has its own unit expressed in its template parameter, there is no reason to convert it to nano just so we may need to convert it back if the histogram needs another unit. This patch will keep everything as a duration until last moment, and then we'll convert when needed. This was suggested by Amnon. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>	2016-11-14 18:01:43 +02:00
Nadav Har'El	c5254b6502	repair: fix undefined variable If the "trace" parameter of the repair was not given, we will use the "trace" variable without setting it. We need to set a default value. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1479136239-14204-1-git-send-email-nyh@scylladb.com>	2016-11-14 17:16:19 +02:00
Raphael S. Carvalho	e86de40b49	compaction_manager: inform about compaction cancelled by shutdown After some changes in compaction manager, user no longer is informed that compaction was cancelled in event of shutdown. That's because we only ignore ready future when compaction manager was asked to stop. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>	2016-11-14 16:37:33 +02:00
Piotr Jastrzebski	4fe989d58e	Cleanup sstables::mutation_reader::impl Pointer to sstable seems unnecessary. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>	2016-11-14 11:52:52 +00:00
Avi Kivity	14c1b17105	storage_service: fix construct_range_to_endpoint_map with semi-infinite range After the conversion to nonwrapping ranges, construct_range_to_endpoint_map() may be called with semi-infinite token ranges, but it does not expect this, calling nonwrapping_range::end()->value() unconditionally. Fix by checking whether this is a semi-infinite range on the right, and replace ->value() by maximum_token() instead. Fixes `nodetool describering` (once more). Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>	2016-11-14 11:39:48 +01:00
Raphael S. Carvalho	9a9f0d3a0f	main: fix exception handling when initializing data or commitlog dirs Exception handling was broken because after io checker, storage_io_error exception is wrapped around system error exceptions. Also the message when handling exception wasn't precise enough for all cases. For example, lack of permission to write to existing data directory. Fixes #883. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>	2016-11-14 12:34:10 +02:00
Takuya ASADA	d571123afd	dist/common/scripts/scylla_sysconfig_setup: stop using 'bc' command to generate cpuset parameter, use python script instead We get error from bc command when we run the script on >34 ncpus, to prevent the issue add a python script to generate cpuset parameter. Fixes #1824 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1478887624-12737-1-git-send-email-syuu@scylladb.com>	2016-11-14 11:45:23 +02:00
Duarte Nunes	66f6a367a4	ring_position_range_sharder: Avoid copying eagerly Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161104115632.15974-1-duarte@scylladb.com>	2016-11-13 11:42:23 +02:00
Avi Kivity	bf20aa722b	Merge "Fixes for histogram and moving average calculations" from Glauber "JMX metrics were found to be either not showing, or showing absurd values. Turns out there were multiple things wrong with them. The patches were sent separately but conflict with one another. This series is a collection of the patches needed to fix the issues we saw. Fixes #1832, #1836, #1837"	2016-11-13 11:16:32 +02:00
Avi Kivity	2670e46f3e	storage_service: deinline most methods Most inline methods in storage_service are too large to be inlined, and just increase compile time. De-inline them.	2016-11-12 21:12:28 +02:00
Glauber Costa	608d825790	histogram: fix reporting units We are tracking latencies in nanoseconds, but almost everywhere else they are reported in microseconds. Instead of just converting, this patch tries to be a bit more future proof and embed the unit into the type - and we then default to microseconds. I have verified that the JMX measures now report sane values for both the storage proxy and the column family. nodetool cfhistograms still works fine. That one is reported in nanoseconds, but through the estimated_histogram, not ihistogram. Fixes #1836 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-11 11:36:56 -05:00
Glauber Costa	1342d044eb	moving averages: change metrics calculation We have recently fixed a bug due to which the constructor parameters for moving average were inverted, leading to the numbers being just plain wrong. However, the calculation of alpha was already inverted, meaning it was right by accident and now that's wrong. With the wrong alpha, the values we see are still correct, but they move very quickly. The intention of this code is obviously to smooth things out. This was found out by Nadav. I have tested and confirmed that the smoothing factor now works as expected. Fixes #1837 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-10 22:33:34 -05:00
Amnon Heiman	a977ea85e1	histogram: moving_average and total rate should be calculated in seconds The moving average and the total average should be calculated in seconds and not nanoseconds. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-11-10 22:32:53 -05:00
Glauber Costa	d3f11fbabf	histogram: moving averages: fix inverted parameters moving_averages constructor is defined like this: moving_average(latency_counter::duration interval, latency_counter::duration tick_interval) But when it is time to initialize them, we do this: ... {tick_interval(), std::chrono::minutes(1)} ... As it can be seen, the interval and tick interval are inverted. This leads to the metrics being assigned bogus values. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>	2016-11-10 11:28:51 -08:00
Paweł Dziepak	f16d6f9c40	partition_version: make sure that snapshot is destroyed under LSA Snapshot destructor may free some objects managed by the LSA. That's why partition_snapshot_reader destructor explicitly destroys the snapshot it uses. However, it was possible that exception thrown by _read_section prevented that from happening making snapshot destroyed implicitly without current allocator set to LSA. Refs #1831. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>	2016-11-10 13:13:10 +01:00
Gleb Natapov	27e041606b	fix LOCAL_ONE printout Message-Id: <20161109125307.GH7766@scylladb.com>	2016-11-09 12:53:55 +00:00
Duarte Nunes	e680587b8a	sstable_test: Be explicit about uncompressed tables After 7c28ed, the schemas defined in the test became compressed by default. This patch changes the test so that it is explicit about which schemas shouldn't define a compressor. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>	2016-11-09 11:21:59 +02:00
Pekka Enberg	b3dea313dd	Merge "API changes for Cassandra 3.x migration" from Calle "Mostly small changes/additions to the API calls to match Cv3 requirements/semantics, i.e. updated scylla-jmx can implement required nodetool etc calls in a working fashion."	2016-11-09 10:30:32 +02:00
Duarte Nunes	e33c02aa60	cql3: Disable compression on empty properties The CQL 3.1 documentation specifies that for disabling compression, users should use an empty string: ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''}; However, Cassandra also accepts the absence of the sstable_compression option to disable compression. The patch 7c28ed prevented this behavior in Scylla, which this patch aims to fix. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>	2016-11-09 10:03:59 +02:00
Gleb Natapov	93f068bd44	storage_proxy: fix speculation target selection logic Current speculation target selection logic has several bugs in multi-dc setup. It may select a non local target for CL=LOCAL and it may select more than one target to speculate, one of which is non local. Examples: 1. Two datacenters: DC1 RF 2, DC2 RF 2 and read with LOCAL_QUORUM. In this scenario db::filter_for_query() will return both replicas from local DC and speculation target selection logic will pick one which will be in a different DC. 2. Two datacenters: DC1 RF 2, DC2 RF 2 and read with LOCAL_ONE + RRD.DC_LOCAL In this scenario db::filter_for_query() will return all nodes in local DC and there will already be enough nodes to speculate, but current logic will add one node from non local dc as a speculation target. The patch below fixes both of those scenarios. Message-Id: <20161103154637.GS7766@scylladb.com>	2016-11-08 18:32:47 +01:00
Paweł Dziepak	a8308e2a8d	row_cache: dummy entry does not count as partition Since continuity flag introduction row cache contains a single dummy entry. cache_tracker knows nothing about it so that it doesn't appear in any of the metrics. However, cache destructor calls cache_tracker::on_erase() for every entry in the cache including the dummy one. This is incorrect since the tracker wasn't informed when the dummy entry was created. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>	2016-11-08 13:54:44 +01:00
Piotr Jastrzebski	50b41f7d1d	Fix row_cache_test partition_range passed to row_cache::make_reader has to be kept alive as long as the resulting reader is used. Otherwise weird things start to happen. This used to work only by pure luck. When I started changing the row_cache implementation I ran into very weird behaviors in this test. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>	2016-11-08 13:28:53 +01:00
Calle Wilund	473326d49a	api/column_family: Make mean row size return integral As (at least) per C3, these metrics are integral in origin. Adapt. (Other option would be to translate in jmx).	2016-11-08 12:22:04 +00:00
Calle Wilund	bd646a6755	repair (api): Add option handling (sort of) for nodetool default options	2016-11-08 12:22:04 +00:00
Calle Wilund	0181fc8159	api::cache_service: Add (dummy) calls for key&counter metrics	2016-11-08 12:22:04 +00:00
Calle Wilund	5eb54f9bc4	api::storage_service: c3 compat - make query keyspaces a trinary choice all, user or non-local strategy ones.	2016-11-08 12:22:04 +00:00
Calle Wilund	3b7a7dd383	api::failure_detector: c3 compat - add endpoint phi value query	2016-11-08 12:22:04 +00:00
Calle Wilund	218df55349	failure_detector: add accessor and api shortcut for arrival samples	2016-11-08 12:22:04 +00:00
Calle Wilund	f9836cd23b	api::endpoint_snitch: c3 compat - allow dc/rack query for broadcast	2016-11-08 12:22:04 +00:00
Calle Wilund	54ba06a8bf	api::column_family: Add calls/parameters for c3 compatibility	2016-11-08 12:22:04 +00:00
Amnon Heiman	c8082ccadb	API: fix a typo in storage_proxy This patch fixes a typo in the URL definition that prevented the jmx from finding the metric. Fixes #1821 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1478563869-20504-1-git-send-email-amnon@scylladb.com>	2016-11-08 11:09:21 +02:00
Amos Kong	95fe88c1d3	scripts/scylla_current_repo: use HTTP to access downloads.scylladb.com Https isn't available for downloads.scylladb.com, or we can access it by https://s3.amazonaws.com/downloads.scylladb.com/... Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <d4b65e1724bbeb76c928790d5d3e95b91ee9db79.1478153034.git.amos@scylladb.com>	2016-11-08 11:03:50 +02:00
Avi Kivity	767cfb4fe9	storage_service: fix range wrapping in describe_ring even more Commit `8fca1887c2` ("storage_service: fix range wrapping in describe_ring") fixed incorrect range wrapping code for describe_ring, but fails when the number of endpoints for a token is greater than one, because the endpoints are stored in an unordered vector. Fix by comparing the endpoints in a way that ignores their order. Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>	2016-11-07 16:18:20 +01:00
Calle Wilund	11baf37ab5	commitlog: Prevent exceptions in stream::produce from being set twice Fixes #1775 stream lacks a check "is_open", which is a bummer. We have to both prevent exception propagation and add a flag of our own to make sure exceptions in producer code reaches consumer, and does not simply get lost in the reactor. Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>	2016-11-07 11:41:33 +01:00
Tomasz Grabiec	e6cc0a2e10	Merge branch '1766/v1' from duarten/scylla.git This patchset adds missing properties to the create_view_statement, such as whether the view is compact or the order of its clustering columns. Fixes #1766	2016-11-07 10:44:24 +01:00
Takuya ASADA	0f1ba1a3bb	dist/redhat: remove unused dependencies Seems like we mistakenly added unneeded packages for BuildRequires when we created .spec file, so remove them. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1478504761-15067-1-git-send-email-syuu@scylladb.com>	2016-11-07 09:48:50 +02:00
Paweł Dziepak	985d2f6d4a	Merge "Remove quadratic behavior from atomic sstable deletion" from Avi "The atomic sstable deletion provides exception safety at the cost of quadratic behavior in the number of sstables awaiting deletion. This causes high cpu utilization during startup. Change the code to avoid quadratic complexity, and add some unit tests. See #1812."	2016-11-04 15:48:04 +00:00
Avi Kivity	f75aceabc5	sstables: add unit tests for atomic deletion We simulate shards deleting sstables, but this is all happening on a single core, and no sstables are harmed during test execution.	2016-11-04 15:48:43 +02:00
Avi Kivity	f10b9906d8	sstables: move atomic deletion code to its own files This will simplify unit testing. We move generic code that depends only on seastar, so compile time should not increase too much.	2016-11-04 15:47:35 +02:00
Avi Kivity	9e85653c33	sstables: make atomic_deletion_manager more abstract Make the shard count and method of deleting sstables abstract, in order not to require all that machinery for unit tests.	2016-11-04 15:44:09 +02:00
Avi Kivity	e527da1e3c	sstables: wrap atomic deletion code in a class This makes it easier to abstract and unit-test.	2016-11-04 15:44:07 +02:00
Avi Kivity	a05837936a	sstables: remove quadratic behavior from atomic sstable deletions In order to ensure exception safety, the atomic sstable deletion code creates a copy of the list of sstables pending deletion, modifies that copy, and then replaces the original data with the copy. This guarantees that any exception does not change the data, since the assignment does not require allocation. However, it does result in quadratic behavior. During startup, all sstables are loaded on each shard, and each shard deletes sstables that do not have any partitions served by that shard; this results in almost all sstables being deleted from all shards, with all that work going to shard 0; the list grows to O(nr sstables), and there are O((nr sstables) * (nr shards)) operations to perform. Fix by replacing the copy-modify-assign method with an in-place update, but one that is designed to only commit changes after all allocations have been made; in addition, instead of using a list, use a hash table, removing another source of quadratic behavior. Fixes #1812 (the quadratic behavior part).	2016-11-04 15:42:44 +02:00
Avi Kivity	8fca1887c2	storage_service: fix range wrapping in describe_ring describe_ring() tries to re-wrap the ranges, but fails because the ranges are not sorted. Adjust the code not to rely on sorting. Message-Id: <1478198630-27483-1-git-send-email-avi@scylladb.com>	2016-11-04 10:48:14 +00:00
Paweł Dziepak	8afd9e52c7	Merge "Process range queries sequentially on shards" from Avi "Currently, partition range queries are processed in parallel on all shards. This is inefficient because we are likely to drop the results from all but one shard, assuming a well-populated column family. We are multiplying our work by a factor of smp::count. While this is worthwhile in its own right, it is really an excuse to sneak in the range/shard generator (patch 5), which is preliminary for a new sharding algorithm, dividing tokens among shards based on the middle-significant bits rather than the most-significant bits (which alias with vnodes). Fixes #1573."	2016-11-04 09:58:04 +00:00
Tomasz Grabiec	c1a7e2090e	Revert "database: change find_column_families signature so it returns a lw_shared_ptr" This reverts commit `f3528ede65`.	2016-11-04 10:48:21 +01:00
Tomasz Grabiec	3b5ccda70e	Revert "database: refactor code so apply_in_memory() is called only once" This reverts commit `3f825f593d`.	2016-11-04 10:48:18 +01:00
Tomasz Grabiec	6366eb5cf8	Revert "correctly calculate latencies for writes" This reverts commit `a382f10fc4`.	2016-11-04 10:48:02 +01:00
Tomasz Grabiec	a5ee87611a	Revert "database: when querying, move latency counter instead of copying" This reverts commit `8840a5a593`.	2016-11-04 10:47:58 +01:00
Tomasz Grabiec	f3c1ff78e6	Merge branch 'cql_read_write_counters-v4' from seastar-dev.git New CQL counters from Vlad.	2016-11-04 09:19:07 +01:00
Avi Kivity	b3299d5bc3	storage_proxy: simplify range queries Instead of asking a shard for cmd->partition_limit and cmd->row_limit, just ask it for the number of partitions and rows still needed to satisfy the query. This removes the need to trim the shard's result.	2016-11-03 19:10:20 +02:00
Avi Kivity	a668e575f6	storage_proxy: execute multi-partition query sequentially over shards Since every shard might cause the row_limit quota to be satisfied, every shard might be the last one we need. Hence it is better to process shards sequentially, stopping if the quota is reached or the range is exhausted. The original code tried to yield to reduce latency, but this is now unnecessary, as we're doing a lot less work per iteration (if it becomes necessary, we should do it on the replica shard, not the coordinating shard).	2016-11-03 19:10:20 +02:00
Avi Kivity	1d77e3a03a	partitioner: add unit tests for token_for_next_shard() i_partitioner::token_for_next_shard() is an inverse for i_partitioner::shard_of(), test that this is so.	2016-11-03 19:10:20 +02:00
Avi Kivity	7202b94183	dht: introduce a sharder for vectors of partition ranges Building on the single-range sharder, add a sharder for vectors of partition ranges. This helps with wrapped ranges, which are translated into a vector containing two shards.	2016-11-03 19:10:20 +02:00
Avi Kivity	43a2380899	dht: add a generator for shard/range pairs Divides a ring_position range into a sequence of shard/range pairs. This allows sequential iteration over shards in ring order. The current multi-partition query executes on all shards in parallel, but this is very wasteful, as most of the data will be thrown away if it is not included in the page. With the generator, we can switch to sequential execution.	2016-11-03 19:10:17 +02:00
Avi Kivity	1f88d103a8	partitioner: add i_partitioner::token_for_next_shard() When performing a range query, we want to iterate over shards, running the query on each shard in order until the query range is exhausted or we have the right number of rows. To be able to do this, introduce token_for_next_shard(), which allows us to determine the boundary between shards. It is a sort-of inverse to shard_of(), in that shard_of(token_for_next_shard(t)) == shard_of(t) + 1	2016-11-03 19:09:23 +02:00
Vlad Zolotarov	6c15dd967a	cql3::query_processor: make the collectd metrics registration nicer Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-11-03 11:49:20 -04:00
Vlad Zolotarov	36cc351ae1	cql3::query_processor: add a counter for BATCH CQL statements - Add a "batches" member to cql_stats. - Update it where appropriate. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-11-03 11:49:20 -04:00
Vlad Zolotarov	6e1d27bed1	cql3::query_processor: add a counter for a number of CQL modification requests ("writes") - Add a inserts, updates, deletes members to cql_stats. - Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of a "type" field. - Store cql_stats& in a batch_statement and increment the statistics for each BATCH member. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-11-03 11:49:15 -04:00
Vlad Zolotarov	fa4e1db0cb	cql: add a counter for CQL read (SELECT) requests - Add a "reads" counter to a cql3::cql_stats struct. - Store a reference for a query_processor::_cql_stats in the select_statement object. - Increment a "reads" counter where needed. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-11-03 11:48:57 -04:00
Vlad Zolotarov	7606588267	cql3::query_processor: add cql_stats - Add cql_stats member. - Pass it to cql3::raw::parsed_statement::prepare() virtual method. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-11-03 11:48:57 -04:00
Glauber Costa	8840a5a593	database: when querying, move latency counter instead of copying It is comprised of two time points. Let's move it instead of copying it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c7c155c77780e188bfbe05881c81ce86456016d5.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	a382f10fc4	correctly calculate latencies for writes Right now we are calculating latencies only when we are about to add an item to the memtable. That's incorrect and misleading, for two reasons. First, it leaves the commitlog latencies out. But second, it is done after the memtable wall effect is applied, which means we are not counting throttle time neither in the memtables or in the commitlog. To do that, we'll start the latency_counter object as soon as possible and move it all the way to apply_in_memory(). That should span the entire write operation. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	3f825f593d	database: refactor code so apply_in_memory() is called only once There are two variants of apply_in_memory() being called in do_apply(): with and without the commitlog. The main differences are that when the commitlog is involved, we need to wait for its future to complete before moving to apply_in_memory. That can easily be factored out by providing an always-ready future if we don't have the commitlog enabled, and waiting on that. The second, is that the commitlog version can cause apply_in_memory to generate an exception if there is replay position reordering. However, there is no harm in appending the exception handler to both versions. In one of them it's an impossible exception, but that's fine. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	f3528ede65	database: change find_column_families signature so it returns a lw_shared_ptr There are places in which we need to use the column family object many times, with deferring points in between. Because the column family may have been destroyed in the deferring point, we need to go and find it again. If we use lw_shared_ptr, however, we'll be able to at least guarantee that the object will be alive. Some users will still need to check, if they want to guarantee that the column family wasn't removed. But others that only need to make sure we don't access an invalid object will be able to avoid the cost of re-finding it just fine. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Avi Kivity	6c45b0bae8	partitioner: make comparators public The public comparison operators depend on global_partitioner(), and are therefore less useful for tests.	2016-11-03 11:27:40 +02:00
Avi Kivity	6320181b97	partitioner: const correctness for comparators	2016-11-03 11:27:40 +02:00
Avi Kivity	470826d127	partitioner: change partitioners to have shard counts independent from smp::count Useful for testing.	2016-11-03 11:27:40 +02:00
Avi Kivity	75706c0a26	size_estimates_recorder: sort token range before rewrapping it Since size estimates are stored as wrapped ranges, we call compat::wrap() to convert from the now-standard unwrapped ranges back to wrapped ranges. However, compat::wrap() relies on the ranges being in sorted order, but our input is not. This leads to a crash as we find an unexpected empty token in the middle of the vector. Sort it so compat::wrap() works as expected. Fixes #1804. Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>	2016-11-03 09:43:41 +01:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Takuya ASADA	8c55c99353	dist/common/scripts/scylla_io_setup: pass --smp option to iotune command We ignored the --smp option taken from io.conf since iotune didn't support it, but now that it is supported we can pass it. (We need to pass it because we need to measure I/O performance under the same conditions as Scylla.) Fixes #1768 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1478082591-27205-1-git-send-email-syuu@scylladb.com>	2016-11-02 12:49:50 +02:00
Raphael S. Carvalho	53b7b7def3	sstables: handle unrecognized sstable component As in C*, unrecognized sstable components should be ignored when loading a sstable. At the moment, Scylla fails to do so and will not boot as a result. In addition, unknown components should be remembered when moving a sstable or changing its generation. Fixes #1780. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <b7af0c28e5b574fd577a7a1d28fb006ac197aa0a.1478025930.git.raphaelsc@scylladb.com>	2016-11-02 12:44:53 +02:00
Avi Kivity	72c2982260	dist: require scylla-boost-static for EL RPM build	2016-11-01 18:55:55 +02:00
Pekka Enberg	e1e8ca2788	cql3: Fix selecting same column multiple times Under the hood, the selectable::add_and_get_index() function deliberately filters out duplicate columns. This causes simple_selector::get_output_row() to return a row with all duplicate columns filtered out, which triggers an assertion because of a row mismatch with the metadata (which contains the duplicate columns). The fix is rather simple: just make selection::from_selectors() use selection_with_processing if the number of selectors and column definitions doesn't match -- like Apache Cassandra does. Fixes #1367 Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>	2016-11-01 09:09:01 +00:00
Pekka Enberg	d46ed53e9e	scripts: add `update-version` This patch adds an `update-version` script for updating the Scylla version number in `SCYLLA-VERSION-GEN` file and committing the change to git. Example use: $ ./scripts/update-version 1.4.0 which results into the following git commit: commit 4599c16d9292d8d9299b40a3e44ef7ee80e3c3cf Author: Pekka Enberg <penberg@scylladb.com> Date: Fri Oct 28 10:24:52 2016 +0300 release: prepare for 1.4.0 diff --git a/SCYLLA-VERSION-GEN b/SCYLLA-VERSION-GEN index 753c982..eba2da4 100755 --- a/SCYLLA-VERSION-GEN +++ b/SCYLLA-VERSION-GEN @@ -1,6 +1,6 @@ #!/bin/sh -VERSION=666.development +VERSION=1.4.0 if test -f version then Message-Id: <1477639560-10896-1-git-send-email-penberg@scylladb.com>	2016-10-30 12:43:41 +02:00
Avi Kivity	feb8faf70b	Merge "make refresh resilient to permission denied error" from Raphael Fixes #1709. * 'refresh-resilient-v3' of github.com:raphaelsc/scylla: db: make refresh resilient to permission denied error db: make it possible to use custom error handler with io checker sstables: remove duplicated declaration of remove_by_toc_name	2016-10-30 10:28:09 +02:00
Takuya ASADA	68d9f5212c	dist/ubuntu/dep/thrift.diff: add missing build time dependency We need libcrypto header to build thrift, so add it. Fixes #1798 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1477676716-5726-1-git-send-email-syuu@scylladb.com>	2016-10-29 17:49:30 +03:00
Avi Kivity	71532d8cd5	Merge seastar upstream * seastar 05f6c5c...47e1821 (1): > rpc: Avoid using zero-copy interface of output_stream (Fixes #1786)	2016-10-28 14:09:16 +03:00
Avi Kivity	e03ca06431	dist: fix rpm build --static-boost is supposed to be an input to ./configure.py, not ninja. Move it there.	2016-10-28 08:42:26 +03:00
Pekka Enberg	b54870764f	auth: Fix resource level handling We use `data_resource` class in the CQL parser, which lets users refer to a table resource without specifying a keyspace. This asserts out in get_level() for no good reason as we already know the intended level based on the constructor. Therefore, change `data_resource` to track the level like upstream Cassandra does and use that. Fixes #1790 Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>	2016-10-27 23:37:26 +03:00
Glauber Costa	ef3c7ab38e	auth: always convert string to upper case before comparing We store all auth perm strings in upper case, but the user might very well pass this in lower case. We could use a standard key comparator / hash here, but since the strings tend to be small, the new sstring will likely be allocated on the stack here and this approach yields significantly less code. Fixes #1791. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>	2016-10-27 22:08:57 +03:00
Raphael S. Carvalho	d11e839520	db: make refresh resilient to permission denied error User may forget to set permission of new sstables in upload dir before refreshing them, and that will result in shutdown. io_checker is now able to work with a custom handler, so all we have to do is to whitelist EACCES. Fixes #1709. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-27 16:50:40 -02:00
Raphael S. Carvalho	a3e065da9b	db: make it possible to use custom error handler with io checker By default, io checker will cause Scylla to shut down if it finds specific system errors. Right now, io checker isn't flexible enough to allow a specialized handler. For example, we don't want Scylla to shut down if there's a permission problem when uploading new files from upload dir. This desired flexibility is made possible here by allowing a handler parameter to io check functions and also changing existing code to take advantage of it. That's a step towards fixing #1709. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-27 15:54:21 -02:00
Takuya ASADA	a1b7e76d43	dist/ubuntu: support 16.10 Add 16.10 to 'supported_release' Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1477585454-2115-1-git-send-email-syuu@scylladb.com>	2016-10-27 19:26:14 +03:00
Takuya ASADA	36e831a106	dist/common/scripts/scylla_bootparam_setup: support EC2 paravirtual instances EC2 paravirtual instances uses pv-grub, which refers /boot/grub/menu.lst (grub0.9x config file) instead of grub2 config file. So add boot parameters on /boot/grub/menu.lst when the file exists, and the instance is on EC2. Fixes #1598 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1472056875-17512-1-git-send-email-syuu@scylladb.com>	2016-10-27 18:55:05 +03:00
Avi Kivity	402a3f1c9f	Merge seastar upstream * seastar 9bed76a...05f6c5c (5): > reactor: improve task quota timer resolution > Update dpdk submodule to local-patches-20161027 tag > tests: wire up json_formatter_test > json_formatter_test: Add rudimentary json formatter test > scripts/posix_net_conf.sh: detect IRQs of virtio-net and xen_netfront correctly	2016-10-27 18:19:40 +03:00
Avi Kivity	e995f5a3a7	dist: statically link with boost on RHEL Reduces runtime dependencies on Scylla-provided third-party boost packages. Message-Id: <1477552490-28961-1-git-send-email-avi@scylladb.com>	2016-10-27 12:35:12 +03:00
Avi Kivity	76628a7b0b	dist: make wget quieter wget is often used from scripts recording to logs; as it emits a log line every second, the logs are huge and unreadable. Make it quieter. Message-Id: <1477558534-32718-1-git-send-email-avi@scylladb.com>	2016-10-27 12:11:26 +03:00
Avi Kivity	72d78ffa7e	Merge "Cache fixes" from Paweł "5ff699e09fcbd62611e78b9de601f6c8636ab2f0 ("row_cache: rework cache to use fast forwarding reader") brought some significant changes to the row cache implementation. Unfortunately, "significant changes" often translates to "more bugs" and this time was no different. This series contains fixes for the problems introduced in that rework and makes failing dtest bootstrap_test.py:TestBootstrap.local_quorum_bootstrap_test pass again." * 'pdziepak/cache-fixes/v1' of github.com:cloudius-systems/seastar-dev: row_cache: avoid dereferencing invalid iterator row_cache: set _first_element flag correctly row_cache: fix clearing continuity flag at eviction	2016-10-27 11:44:15 +03:00
Takuya ASADA	5cb7dc5dc3	dist/ubuntu/dep: update thrift to 0.9.3 To make thrift compilable on gcc-6.2, we need to upgrade to the latest version of thrift. This is required to support Ubuntu 16.10. Fixes #1784 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1477517671-18067-1-git-send-email-syuu@scylladb.com>	2016-10-27 10:22:06 +03:00
Paweł Dziepak	a7224ae46e	row_cache: avoid dereferencing invalid iterator Conditions in row_cache::do_find_or_create_entry() make it possible that std::prev(it) is going to be dereferenced even if it is a begin iterator. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-26 15:24:23 +01:00
Paweł Dziepak	654f651e0c	row_cache: set _first_element flag correctly If the continuity flag was set for the first element _first_element flag would not be cleared. This shouldn't cause any correctness problems but properly setting the flag allows to avoid some unnecessary key comparisons. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-26 15:07:24 +01:00
Paweł Dziepak	567ff96f2a	row_cache: fix clearing continuity flag at eviction In the original implementation the continuity flag indicated that the cache has full information about the range between the current partition and the one following it, hence when evicting an entry the one preceding it had to have its continuity flag cleared. This was changed, however, and now the continuity flag tells whether the cache is continuous between the current element and the one before it. This means that eviction code needs to clear the flag for the entry directly following the evicted one. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-26 14:58:20 +01:00
Raphael S. Carvalho	bc2d351c25	sstables: remove duplicated declaration of remove_by_toc_name Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-26 11:21:27 -02:00
Takuya ASADA	7617adadf4	dist/ami/files/.bash_profile: fix confusing message when running AMI on unsupported instance type To describe which instance types are supported, show the documentation URL instead of a confusing message. Fixes #1646 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1477473336-25373-1-git-send-email-syuu@scylladb.com>	2016-10-26 12:48:51 +03:00
Avi Kivity	7faf2eed2f	build: support for linking statically with boost Remove assumptions in the build system about dynamically linked boost unit tests. Includes seastar update which would have otherwise broken the build.	2016-10-26 08:51:21 +03:00
Piotr Jastrzebski	27726cecff	Clean up position_in_partition. Introduce position_in_partition_view and use it in position() method in mutation_fragment, range_tombstone, static_row and clustering_row. Clean up comparators in position_in_partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c65293c71a6aa23cf930ed317fb63df1fdc34fd1.1477399763.git.piotr@scylladb.com>	2016-10-25 15:13:20 +01:00
Tomasz Grabiec	cbaae2bf7f	Merge seastar upstream * seastar e18205b...3777135 (1): > rpc: Do not close client connection on error response for a timed out request Fixes #1778	2016-10-25 13:59:41 +02:00
Raphael S. Carvalho	975ce62dbc	sstables: do not swallow exception when reading TOC That caused problem when refreshing a sstable with bad permissions. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <48e5322c53234209e55da05c64c99b8ec4e190a3.1477372974.git.raphaelsc@scylladb.com>	2016-10-25 12:21:32 +03:00
Avi Kivity	ddd4dbf928	Update scylla-ami submodule * dist/ami/files/scylla-ami e1e3919...61ff5c6 (1): > scylla_ami_setup: run posix_net_conf.sh when NCPUS < 8	2016-10-25 11:18:58 +03:00
Avi Kivity	4b55a687b6	Merge seastar upstream * seastar 98b5a2d...e18205b (1): > json::formatter: Add formatters for maps + rudimentary test	2016-10-25 11:17:29 +03:00
Avi Kivity	e8edaaf6a4	Merge seastar upstream * seastar 69acec1...98b5a2d (9): > rpc: Silence warning about ignored failed future > future: prioritise continuations that can run immediately > iotune: relax aio restrictions > build: support for static linking with boost > rpc: Fix crash during connection teardown > rpc: Move _connected flag to protocol::connection > rpc test: fail test if exception is thrown during test execution > rpc: do not assume underling semaphore type > rpc: fix default resource limit	2016-10-25 11:09:40 +03:00
Avi Kivity	fc8210a875	tests: fix tests with boost 1.60 In boost 1.60, the executable's command-line arguments are expected to be separated from the boost command-line arguments by '--'. Detect this requirement and comply with it. Message-Id: <1477212424-3831-1-git-send-email-avi@scylladb.com>	2016-10-24 09:36:56 +02:00
Avi Kivity	37f112b610	dist: add python3-yaml to Ubuntu dependencies for blocktune	2016-10-23 16:42:13 +03:00
Avi Kivity	7d50d6df9b	blocktune: fix syntax error in exception handling	2016-10-23 16:40:00 +03:00
Avi Kivity	e261a380a9	dist: add PyYAML dependency to rpm (for blocktune)	2016-10-23 10:36:29 +03:00
Raphael S. Carvalho	fa308c079c	database: fix collectd metrics for clustering key filter Same instance name was used for exported metrics, which is definitely wrong. Checked it works properly now via collectd exporter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>	2016-10-22 09:51:18 +03:00
Glauber Costa	a13c410749	commitlog: cycle based on total size, not on mutation size We calculate two sizes during the allocation: "size", which is the in-segment size of this mutation, and "s", which is that plus the overhead. cycle() must be called with the latter, not the former, as doing otherwise may lead to buffer overflows. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>	2016-10-21 18:57:41 +03:00
Glauber Costa	d9875784a1	commitlog: do not wait on pending operations for batch mode This was explicitly mentioned in my set as gone in one of the versions. Somehow it came back in the final version - sorry about that. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>	2016-10-21 17:27:16 +03:00
Vlad Zolotarov	f75a350a8f	service::storage_proxy: use global_trace_state_ptr when using invoke_on When trace_state may migrate to a different shard a global_trace_state_ptr has to be used. This patch completes the patch below: commit `7e180c7bd3` Author: Vlad Zolotarov <vladz@cloudius-systems.com> Date: Tue Sep 20 19:09:27 2016 +0300 tracing: introduce the tracing::global_trace_state_ptr class Fixes #1770 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1476993537-27388-1-git-send-email-vladz@cloudius-systems.com>	2016-10-21 11:34:13 +03:00
Avi Kivity	e3ae54f0fe	Merge "Rework commitlog to avoid timeouts" from Glauber "This patchset reworks the commitlog logic to better handle conditions in which we are getting requests faster than the disk can handle. It does this by building a wall around the commitlog and only allowing allocations to proceed when we are under the desired memory threshold. The main advantage of that is that we can now easily set the commitlog to work at disk speed, more or less allowing an "one byte in for each byte out" approach instead of depending on the current cycle to finish. As a result, max latencies are greatly reduced. Testing Results =============== To test this, I have ran a workload that times out frequently. That workload use 10 threads to write 100 partitions (to isolate from the effects of the memtable introduced latencies) in a loop and each partition is 2MB in size. After 10 minutes running this load, we are left with the following percentiles: latency mean : 51.9 [WRITE:51.9] latency median : 9.8 [WRITE:9.8] latency 95th percentile : 125.6 [WRITE:125.6] latency 99th percentile : 1184.0 [WRITE:1184.0] latency 99.9th percentile : 1991.2 [WRITE:1991.2] latency max : 2338.2 [WRITE:2338.2] After this patch: latency mean : 54.9 [WRITE:54.9] latency median : 43.5 [WRITE:43.5] latency 95th percentile : 126.9 [WRITE:126.9] latency 99th percentile : 253.9 [WRITE:253.9] latency 99.9th percentile : 364.6 [WRITE:364.6] latency max : 471.4 [WRITE:471.4] I have run this with larger sizes as well, and it generally performs much better than the baseline version. For sizes up to 5MB, I have seen no timeouts in my setup. After that, I see some timeouts. Buffer splitting is expected to make this better. Aside from performance testing, this was also tested with batch and periodic mode for various requests sizes."	2016-10-20 16:44:39 +03:00
Glauber Costa	d5618c6ace	commitlog: add total_operations type for requests_blocked_memory Current tracker for pending allocations is a queue_size GAUGE. Add a total_operations version so we have more insight on what's going on. It will be called requests_blocked_memory for consistency with other subsystems that track similar things. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-20 09:25:38 -04:00
Avi Kivity	db2f5e6be1	blocktune: wire up blocktune on startup Message-Id: <1476357027-15014-3-git-send-email-avi@scylladb.com>	2016-10-20 13:24:05 +03:00
Avi Kivity	098d02ad1a	scylla-blocktune: introduce scylla-blocktune is a script that parses scylla.yaml and tunes the data file and commitlog directories it references. Tuning includes: - set the I/O scheduler to noop - disable merging - tune dependent devices (like RAID members) Message-Id: <1476357027-15014-2-git-send-email-avi@scylladb.com>	2016-10-20 13:24:05 +03:00
Avi Kivity	fad34eef6c	scylla_raid_setup: don't mess with read-ahead It doesn't affect O_DIRECT reads, and it's not persistent. Message-Id: <1476269082-2473-2-git-send-email-avi@scylladb.com>	2016-10-20 13:23:38 +03:00
Avi Kivity	a837da06ef	scylla_raid_setup: increase chunk size The current chunk size of 256 gives a 50% probability of a 128k read or write getting split into two accesses. This reduces efficiency and increases latency. Change the chunk size to 1MB, with a 12% probability of cross-member access. Message-Id: <1476269082-2473-1-git-send-email-avi@scylladb.com>	2016-10-20 13:23:38 +03:00
Takuya ASADA	80e3d8286c	dist/ami: fix incorrect /etc/fstab entry on CentOS7 base image There was incorrect rootfs entry on /etc/fstab: /dev/sda1 / xfs defaults,noatime 1 1 This causes boot error when updated to new kernel. (see: https://github.com/scylladb/scylla/issues/1597#issuecomment-250243187) So replaced the entry to UUID=<uuid> / xfs defaults,noatime 1 1 Also all recent security updates applied. Fixes #1597 Fixes #1707 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475094957-9464-1-git-send-email-syuu@scylladb.com>	2016-10-20 11:48:24 +03:00
Takuya ASADA	5f602752a5	dist/ubuntu: backport g++-5 from Debian 9(stretch) to Debian 8(jessie) Since Debian 8(jessie) does not provide g++-5, we frequently got compile errors because we were using an older compiler. To fix the problem, backport g++-5 from Debian 9(stretch). Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1476694318-10640-3-git-send-email-syuu@scylladb.com>	2016-10-20 11:41:02 +03:00
Takuya ASADA	7d67504b56	dist/ubuntu: use VERSION_ID from /etc/os-release instead of 'lsb_release -r' On Debian, lsb_release -r returns the version number something like '8.6'. However, on this script we want to check major version only. Therefore we can use VERSION_ID from /etc/os-release which only contains major version number. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1476694318-10640-2-git-send-email-syuu@scylladb.com>	2016-10-20 11:41:02 +03:00
Avi Kivity	0da2f64cfb	Merge seastar upstream * seastar ccd8649...69acec1 (2): > app/iotune: add --smp option > rpc: Add missing adjustment of snd_buf::size Fixes #1767. Fixes #1768.	2016-10-20 11:16:40 +03:00
Paweł Dziepak	210a390892	tests: add missing sstable for partition skipping test Commit `7dcd70124a` "tests/sstables: add test for fast forwarding reader" added a test for skipping parts of sstable. Unfortunately, it did not include the sstables it was trying to read. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 23:23:49 +01:00
Glauber Costa	1578d7363a	commitlog: rework blocking logic The current incarnation of commitlog establishes a maximum amount of writes that can be in-flight, and blocks new requests after that limit is reached. That is obviously something we must do, but the current approach to it is problematic for two main reasons: 1) It forces the requests that trigger a write to wait on the current write to finish. That is excessive; ideally we would wait for one particular write to finish, not necessarily the current one. That is made worse by the fact that when a write is followed by a flush (happens when we move to a new segment), then we must wait for all writes in that segment to finish. 2) It casts concurrency in terms of writes instead of memory, which makes the aforementioned problem a lot worse: if we have very big buffers in flight and we must wait for them to finish, that can take a long time, often in the order of seconds, causing timeouts. The approach taken by this patch is to replace the _write_semaphore with a request_controller. This data structure will account for the amount of memory used by the buffers and set a limit on it. New allocations will be held until we go below that limit, and will be released as soon as this happens. This guarantees that the latencies introduced by this mechanism are spread out a lot better among requests and will keep higher percentile latencies in check. To test this, I have run a workload that times out frequently. That workload uses 10 threads to write 100 partitions (to isolate from the effects of the memtable introduced latencies) in a loop and each partition is 2MB in size.
After 10 minutes running this load, we are left with the following percentiles: latency mean : 51.9 [WRITE:51.9] latency median : 9.8 [WRITE:9.8] latency 95th percentile : 125.6 [WRITE:125.6] latency 99th percentile : 1184.0 [WRITE:1184.0] latency 99.9th percentile : 1991.2 [WRITE:1991.2] latency max : 2338.2 [WRITE:2338.2] After this patch: latency mean : 54.9 [WRITE:54.9] latency median : 43.5 [WRITE:43.5] latency 95th percentile : 126.9 [WRITE:126.9] latency 99th percentile : 253.9 [WRITE:253.9] latency 99.9th percentile : 364.6 [WRITE:364.6] latency max : 471.4 [WRITE:471.4] Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:56:36 -04:00
Glauber Costa	aec724bbda	commitlog: factor out code for checking mutation size In a subsequent patch, I'll use this code in a different place. To prepare for that, we move it out as a method. It also fits a lot better inside the segment manager, so move it there. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	a50996f376	commitlog: calculate segment-independent size of mutations Goal is to calculate a size that is less than or equal to the segment-dependent size. This was originally written by Tomasz, and featured in his submission "commitlog: Handle overload more gracefully" Extracted here so it sits clearly in a different patch. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	0b7c9fa17f	commitlog: remove _needed_size It is mostly an optimization, and while it makes sense in this context, it won't soon as we'll stop waiting for the current cycle specifically to finish. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	6214bdeb66	commitlog: move segment_manager constructor outside the class definition We'll do that so we can, in following patches, use static members from the segment. Those are not defined at this point. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Glauber Costa	299877f432	commitlog: add a counter for pending allocations We track the amount of pending allocations but we don't really export it. It will be crucial when we stop tracking pending writes. This patch exports it through a method instead of the totals structure, so we can easily change it. Current code probing pending_allocations (the api code) is also converted to use the public method instead of the totals struct. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-19 13:49:47 -04:00
Avi Kivity	07c995ab3d	Merge "Fast forward mutation readers" from Paweł "This patchset enables mutation readers to be fast forwarded to a different partition range. The main reason for introducing such feature are range queries served from cache. If the cache is partially populated in the requested range the reader will end up with multiple subranges that have to be read from the sstables. Originally, each of these subranges would require a new reader to be created, but with fast forwarding we can have just one sstable reader. This is better since there is a chance that buffers kept by the reader may be still useful after fast forwarding it. In this series there are also patches that clean up cache readers in order to make integration with fast forwarding easier. Namely, continuity flag is changed to store information about range before the entry which significantly simplifies the logic. Fixes #1299." * 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits) sstables: keep separate stream history for single and range reads sstables: drop sstable::{lower, upper}_bound() row_cache: rework cache to use fast forwarding reader row_cache: put cache entry flags in a struct row_cache: add do_find_or_create_entry() to reduce code duplication mutation_reader: forward fast_forward_to() calls tests/row_cache: add fast_forward_to() to throttled reader tests/row_cache: count mutations read from _underlying memtable: add support for fast_forward_to() drop key readers tests/mutation_reader: test fast forwarding combined reader database: enable fast forwarding of range_sstable_reader combined_mutation_reader: implement fast_forward_to() mutation_reader: make combinded_reader public tests/sstables: add test for fast forwarding reader tests: add more helpers to mutation reader assertions sstables: enable fast forwarding for range readers mutation_reader: introduce fast_forward_to() sstables: implement 
mutation_reader::impl::fast_forward_to() sstables: introduce index_reader ...	2016-10-19 18:10:44 +03:00
Paweł Dziepak	ab0eeae82d	sstables: keep separate stream history for single and range reads Single partition and partition range reads are expected to behave considerably differently, so it is worth having them use separate file stream histories. This also makes reads use a different history for each sstable, which is also a good thing. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	20bfa1fa52	sstables: drop sstable::{lower, upper}_bound() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	5ff699e09f	row_cache: rework cache to use fast forwarding reader This uncomfortably large patch overhauls cache range reader so that it can take advantage of fast forwarding mutation readers. A significant change in the cache itself is that the continuity flag now is used to determine whether cache is contiguous between the previous entry and the current one. This allows for a significant simplification of the cache code and easier integration with reader fast forwarding. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	18acb0c0e6	row_cache: put cache entry flags in a struct Flags are easier to manage if they are in a single structure. In particular, default initialization and move constructors are simpler and less error prone. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	f248e23db5	row_cache: add do_find_or_create_entry() to reduce code duplication Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	bcd374c05d	mutation_reader: forward fast_forward_to() calls Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	0c24bbe639	tests/row_cache: add fast_forward_to() to throttled reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	69645455f3	tests/row_cache: count mutations read from _underlying Originally, cache tests checked how many times a mutation reader was created from the underlying mutation source to determine whether the continuity flag is working correctly. This is not going to work with fast forwarding mutation readers, so the test is switched to count the number of mutations (+ end of stream markers) returned from the underlying mutation readers, which is much less fragile. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	e14f8027d5	memtable: add support for fast_forward_to() Fast forwarding of memtable readers is needed only for unit tests which often use memtables as underlying data source for cache and the cache readers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	5ac9babe97	tests/mutation_reader: test fast forwarding combined reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	7bebfb851f	database: enable fast forwarding of range_sstable_reader When fast forwarding a reader that combines sstable reader we must also remember that the set of sstables for the new range may be different than for the previous one. The reader introduced in this patch makes sure that we read from correct sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	b7b7b2bd63	combined_mutation_reader: implement fast_forward_to() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	2c0cdd55fc	mutation_reader: make combined_reader public We want to be able to fast forward sstable readers. However, just implementing fast_forward_to() for combined_reader is not enough as the sstables we are reading from may need to change. Following patches are going to introduce a combined sstable reader that derives from combined_reader. To make that possible we first need to make combined_reader public. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	7dcd70124a	tests/sstables: add test for fast forwarding reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	5534dc2817	tests: add more helpers to mutation reader assertions Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	cf024975fe	sstables: enable fast forwarding for range readers Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	62c9492d33	mutation_reader: introduce fast_forward_to() This patch introduces the interface for fast forwarding mutation readers. The main user of this feature is going to be the cache, which, while serving a range query, may need to read multiple small ranges from the sstables to populate itself with the missing entries. Fast forwarding is an alternative to recreating a reader with a different range. Its main advantage is that it avoids dropping data that has already been read. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	c63e88d556	sstables: implement mutation_reader::impl::fast_forward_to() This patch allows sstable readers to be fast forwarded without making it necessary to recreate the reader (and dropping all buffers in the process). It is built on top of index_reader and ability of data_consume_context to be fast forwarded. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	a530762277	sstables: introduce index_reader index_reader is a helper that implements index lookups. Its goal is to avoid dropping read buffers if they still may be needed (for example to get end bound of the range or after fast forwarding the reader). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	f49a9e0d64	sstables: drop unused read_range_rows() overload That overload was used only by unit test and violated guarantee that partition range lives until mutation reader is done. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	0bc873ace5	sstables: add fast_forward_to() to continuous_data_consumer Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	25b91c51e2	sstables: add data_consume_rows_context::reset() reset() is going to be used to restore valid state after fast forwarding the reader. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	2124d08b88	sstables: add skip() to compressed_file_data_source Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	54069162f5	Merge "Add test for partition version list consistency after compaction" from Tomek	2016-10-18 11:03:25 +01:00
Tomasz Grabiec	308434f891	tests: memtable: Add test for partition version list consistency after compaction	2016-10-18 11:57:14 +02:00
Tomasz Grabiec	6548132423	lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions is_compactible() will pass on very small regions. full_compaction() is only used in tests to force objects to be moved due to compaction, so we want all reclaimable regions to be compacted.	2016-10-18 11:16:08 +02:00
Tomasz Grabiec	ecf85cbffb	mutation: Define + operation It's more convenient to write m1 + m2 in tests than to do more elaborate constructs with copy constructors and apply().	2016-10-18 11:16:08 +02:00
Tomasz Grabiec	fe387f8ba0	partition_version: Fix corruption of partition_version list The move constructor of partition_version was not invoking move constructor of anchorless_list_base_hook. As a result, when partition_version objects were moved, e.g. during LSA compaction, they were unlinked from their lists. This can make readers return invalid data, because not all versions will be reachable. It also causes leaks of the versions which are not directly attached to memtable entry. This will trigger assertion failure in LSA region destructor. This assertion triggers with row cache disabled. With cache enabled (default) all segments are merged into the cache region, which currently is not destroyed on shutdown, so this problem would go unnoticed. With cache disabled, memtable region is destroyed after memtable is flushed and after all readers stop using that memtable. Fixes #1753. Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>	2016-10-18 09:25:38 +01:00
Duarte Nunes	1d45f19c78	create_view_statement: Use cf_properties This patch uses cf_properties instead to add the missing attributes to the create_view_statement class. Fixes #1766 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-18 01:18:52 +00:00
Duarte Nunes	7c58b7e764	unimplemented: Add materialized views This patch adds the VIEWS element to the cause enum so we can mark failures due to incomplete support of materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-18 01:18:52 +00:00
Duarte Nunes	7c28ed3dfc	schema: Extract default compressor This patch extracts the definition of the default compressor into the compression_parameters class, so that the table and view creation statements don't have to explicitly deal with it. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-18 01:18:52 +00:00
Duarte Nunes	dc470e6a36	cql3: Extract cf_properties This patch extracts the cf_properties class, which contains common attributes of tables and materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-18 01:18:51 +00:00
Takuya ASADA	587d375e19	main: exit with 1 when verify_seastar_io_scheduler() failed Since we are exiting the Scylla process in engine().at_exit() using ::_exit(0), scylla always exits with 0, even when verify_seastar_io_scheduler() throws an exception. Because of this, systemd mistakenly concludes that scylla-server.service was shut down successfully, so we need to pass the correct exit code to ::_exit() here. Fixes #1674 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>	2016-10-17 13:57:00 +03:00
Avi Kivity	163088c6af	Merge seastar upstream * seastar 207bf3d...ccd8649 (3): > Merge "Augment semaphore with non-blocking operations" from Glauber > Merge "More dynamic fstream patches" from Paweł > Merge "fstream: add dynamic adjustments based on stream history" from Paweł	2016-10-17 12:49:17 +03:00
Avi Kivity	65c27ccf21	bytes_ostream: make max_chunk_size() an inline function Fixes debug build looking for a variable definition and not finding it.	2016-10-17 11:49:33 +03:00
Avi Kivity	c0a1ad0b77	bytes_ostream: use larger allocations A 1MB response will require 2000 allocations with the current 512-byte chunk size. Increase it exponentially to reduce allocation count for larger responses (still respecting the upper limit). Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>	2016-10-16 10:05:48 +01:00
Tomasz Grabiec	d836e8f64b	tests: memtable: Add tests for flushing reader Message-Id: <1476454187-11462-1-git-send-email-tgrabiec@scylladb.com>	2016-10-14 15:11:06 +01:00
Tomasz Grabiec	63784fd921	db: Fix corruption of partition_entry Memory accounting code was attaching partition_snapshot to partition_entry in order to calculate the size of partition_version object. However, it is only allowed if partition_entry doesn't have any snapshot attached already. In this case it always has one, created by the flushing reader. Change the accounting code to reuse existing partition_snapshot reference. Fixes #1746 Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>	2016-10-14 15:10:48 +01:00
Paweł Dziepak	d08cffd3c7	lsa: avoid exceptions during segment_zone creation LSA tries to allocate zones as large as possible (while still leaving enough free space for the standard allocator). It uses the amount of free memory in order to guess how much it can get, but that obviously doesn't account for fragmentation and the allocation attempt may fail. This patch changes the LSA code so that it doesn't throw in case zone couldn't be created but just returns a null pointer which should be more performant if the LSA memory cannot grow any more. Fixes #1394. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>	2016-10-14 11:08:24 +02:00
Amnon Heiman	7829da13b4	scylla_setup: Reorder questions and actions The expected behaviour in the scylla_setup script is that a question will be followed by the answer. For example, after asking if scylla should be run as a service, the relevant actions will be taken before the following question. This patch addresses two such mis-orders: 1. the scylla-housekeeping depends on the scylla-server, but the setup should first set up the scylla-server service and only then ask (and install if needed) the scylla-housekeeping. 2. The node_exporter should be placed after the io_setup is done. Fixes #1739 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>	2016-10-13 18:29:36 +03:00
Pekka Enberg	3b4e6cdc5e	abstract_replication_strategy: Fix exception type if class not found Change abstract_replication_strategy::create_replication_strategy() to throw exceptions::configuration_error if replication strategy class lookup to make sure the error is converted to the correct CQL response. Fixes #1755 Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>	2016-10-13 17:39:28 +03:00
Tomasz Grabiec	e617bcd8a7	logalloc: disable abort on allocation failure in places in which it is benign Some places start big expecting allocation failure, then reduce the requested size. Let's not abort in such cases. Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>	2016-10-13 10:53:32 +03:00
Avi Kivity	13e9d4c8e3	Merge seastar upstream * seastar f937fb0...207bf3d (11): > Merge "iotune: gracefully exit on predictable exceptions" (Fixes #1623) > core/semaphore: Add semaphore_units::release() > Merge "rometheus API with grafana uses labels" from Amnon > core/thread: Fix stack alloc-dealloc mismatch > core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group > file: support for XFS on older kernels > reactor: fix bug when handling EBADF in flush_pending_aio() > prometheus CPU should start in 0 > Collectd: bytes ordering depends on the type > tests: Check that backtrace() doesn't corrupt signal mask > core/thread: Add stack guards to seastar thread stacks	2016-10-12 23:47:12 +03:00
Avi Kivity	63f053e9b7	storage_proxy: fix mutation reordering with wrapping ranges If we have a range query involving a wrapping range (i.e., from thrift), and mutations from both halves of the result are involved, then we will return the results in the wrong order (and potentially the wrong partitions) since we order by token, so the results from the second half of the wrapping range end up before the first. Fix by splitting the two queries, and merging the second half with lower priority compared to the first half. Note: this will be fixed in a better way once we have the sharding iterator, as then we can query sequentially. Fixes #1761. Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>	2016-10-12 15:59:16 +02:00
Avi Kivity	1506b06617	Merge "node_exporter service on ubuntu 16" from Amnon "This series addresses two issues that interfere with running the node_exporter as a service in ubuntu 16. 1. The service file should be packed in the deb file 2. When setting the node_exporter as a service it doesn't need to run with scylla user" * 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev: node-exporter service: No need to run as scylla user debian package: Include the node_exporter service file	2016-10-12 12:11:18 +03:00
Amnon Heiman	1bd50789e0	node-exporter service: No need to run as scylla user The node-exporter does not need to run as the scylla user. It can run without scylla or without the scylla user being configured. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-10-11 12:44:27 +03:00
Amnon Heiman	d523bf56ed	debian package: Include the node_exporter service file This will include the node_exporter service script for ubuntu distribution with systemd support. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-10-11 12:44:14 +03:00
Avi Kivity	f6998bb260	Merge "Implement describe_splits_ex based on Cassandra" from Duarte "This patch-set re-implements the describe_splits_ex() verb to more closely follow Cassandra's implementation, on which some clients rely. Ref #1139 Ref #693" * 'describe-splits/v2' of github.com:duarten/scylla: thrift: Implement describe_splits_ex based on Cassandra storage_service: Implement get_splits() function sstables: Add function to get key samples sstables/key: Add to_partition_key function size_estimates_recorder: Increase estimate accuracy sstables: Get estimates for a particular range sstables/key: Make key::kind public	2016-10-11 11:13:35 +03:00
Takuya ASADA	0007f2d838	dist/common/sbin: add scylla_cpuset_setup and scylla_dev_mode_setup to /usr/sbin We haven't added symlinks to /usr/sbin for newly created scripts, so add them. Fixes #1702 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1474879711-31793-1-git-send-email-syuu@scylladb.com>	2016-10-11 11:02:14 +03:00
Takuya ASADA	ccad720bb1	dist/common/script/scylla_io_setup: handle comma correctly when parsing cpuset The script mistakenly split the value at "," when the cpuset list is separated by commas. Instead of matching possible patterns of the argument, let's pass all characters until reaching a space delimiter or end of line. Fixes #1716 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>	2016-10-11 10:42:32 +03:00
Duarte Nunes	d8cfc56376	thrift: Implement describe_splits_ex based on Cassandra This patch re-implements the describe_splits_ex() verb to more closely follow Cassandra's implementation, on which some clients rely. Ref #1139 Ref #693 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 22:32:10 +02:00
Duarte Nunes	01ab2081cd	storage_service: Implement get_splits() function This patch implements the get_splits() function in storage_service, used to split a particular token range in slices of approximately the specified size, using the sample keys and estimates of the CF's sstables. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 22:32:08 +02:00
Duarte Nunes	c36dbaf0f1	sstables: Add function to get key samples This patch implements the get_key_samples() function, on which a future patch will base an implementation of the describe_splits() thrift verb closer to Cassandra's. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 19:50:14 +02:00
Duarte Nunes	fc07b66678	sstables/key: Add to_partition_key function Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 19:50:11 +02:00
Duarte Nunes	c19c633299	size_estimates_recorder: Increase estimate accuracy This patch uses the estimated_keys_for_range() function to get better estimates. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 17:52:16 +02:00
Duarte Nunes	ceed09b23e	sstables: Get estimates for a particular range This patch adds the estimated_keys_for_range() function, which estimates the number of keys present between the specified range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 17:52:15 +02:00
Duarte Nunes	8c223b31c8	sstables/key: Make key::kind public Needed to create synthetic keys without any value but with ordering properties. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-10-10 17:47:24 +02:00
Avi Kivity	b305d92a65	Merge "housekeeping: check version during setup" from Amnon "The version is taken from the installation rather than the API, a mode command line indicated that this is part of the setup and uuid is used for the interaction with the checkversion server." * 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev: scylla_setup: Check and report the scylla version scylla-housekeeping: check version during setup	2016-10-10 16:37:14 +03:00
Vlad Zolotarov	ab748e829d	docs: tracing.md: initial commit Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1475686745-20383-1-git-send-email-vladz@cloudius-systems.com>	2016-10-10 16:12:02 +03:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge so does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will allow to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Pekka Enberg	3b75ff1496	docs/docker: Tag `--listen-address` as 1.4 feature The Docker Hub documentation is the same for all image versions. Tag `--listen-address` as 1.4 feature. Message-Id: <1475819164-7865-1-git-send-email-penberg@scylladb.com>	2016-10-10 13:26:16 +03:00
Vlad Zolotarov	006999f46c	api::storage_service::slow_query: don't use duration_cast in GET The slow_query_record_ttl() and slow_query_threshold() return the duration of the appropriate type already - no need for an additional cast. In addition there was a mistake in a cast of ttl. Fixes #1734 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1475669400-5925-1-git-send-email-vladz@cloudius-systems.com>	2016-10-09 18:09:13 +03:00
Takuya ASADA	469e9af1f4	dist/common/scripts/scylla_setup: use 'swapon -s' instead of 'swapon --show' Since Ubuntu 14.04 doesn't support the --show option, we need to avoid using it. Fixes #1740 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475788340-22939-2-git-send-email-syuu@scylladb.com>	2016-10-09 18:05:14 +03:00
Takuya ASADA	8452045b85	dist/ubuntu: add realpath to dependency, requires for scylla_setup We need a dependency on realpath, since scylla_setup uses it. Fixes #1740. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475788340-22939-1-git-send-email-syuu@scylladb.com>	2016-10-09 18:05:14 +03:00
Tomasz Grabiec	41e66ebce2	gdb: Introduce 'scylla heapprof' Presents current heap profile recording. Works in text mode or dumps to collapsed stacks format from which flame graph can be generated. To generate a flamegraph: (gdb) scylla heapprof --flame Wrote heapprof.stacks $ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg flamegraph.pl comes from: https://github.com/brendangregg/FlameGraph.git Text mode example: (gdb) scylla heapprof --min 100000000 All (274699676, #10213) \-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169 (268435456, #1) memory::allocate_large_aligned(unsigned long, unsigned long) + 87 memory::allocate_aligned(unsigned long, unsigned long) + 48 aligned_alloc + 9 logalloc::segment_zone::segment_zone() + 304 logalloc::segment_pool::allocate_segment() + 477 logalloc::segment_pool::segment_pool() + 304 __tls_init.part.801 + 72 logalloc::region_group::release_requests() + 1333 logalloc::region_group::add(logalloc::region_group*) + 514 The branches are formatted like this: -- <symbol> (<size>, #<count>) Where <size> is total size of live objects and <count> is total number of live objects, for all objects allocated from paths going through this node. Nodes which share the same <size> and <count> are stacked like this: -- <symbol_1> (<size>, #<count>) <symbol_2> <symbol_3> Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>	2016-10-09 10:54:08 +03:00
Glauber Costa	33e9c2bbdd	memtable: reduce sstable flush concurrency to one Limiting the concurrency of memtable flushes to 4 was a temporary workaround for the fact that we lacked good write behind support. Now that write behind is properly merged we can reduce the concurrency to what it should be, one. This means that memtable flushes will now be serialized, and only when one of them ends will the next one begin. Disk parallelism is obtained through the write-behind mechanism. Fixes #1373 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>	2016-10-09 10:48:57 +03:00
Tomasz Grabiec	2a5a90f391	db: Do not timeout streaming readers There is a limit to concurrency of sstable readers on each shard. When this limit is exhausted (currently 100 readers) readers queue. There is a timeout after which queued readers are failed, equal to read_request_timeout_in_ms (5s by default). The reason we have the timeout here is primarily because the readers created for the purpose of serving a CQL request no longer need to execute after waiting longer than read_request_timeout_in_ms. The coordinator no longer waits for the result so there is no point in proceeding with the read. This timeout should not apply for readers created for streaming. The streaming client currently times out after 10 minutes, so we could wait at least that long. Timing out sooner makes streaming unreliable, which under high load may prevent streaming from completing. The change sets no timeout for streaming readers at replica level, similarly as we do for system tables readers. Fixes #1741. Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>	2016-10-07 15:41:04 +03:00
Raphael S. Carvalho	9175977a9d	cql3: fix build failure by defining out unused function Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <cba6207278ea945ee750d78b189320443843a288.1475793747.git.raphaelsc@scylladb.com>	2016-10-07 08:45:18 +03:00
Avi Kivity	9ac441d3b5	range: adjust split_after to allow split_point outside input range Make split_after() more generic by allowing split_point to be anywhere, not just within the input range. If the split_point is before, the entire range is returned; and if it is after, stdx::nullopt is returned. "before" and "after" are not well defined for wrap-around ranges, but we are phasing them out and soon there will not be wrapping_range::split_after() users. This is a prerequisite for converting partition_range and friends to nonwrapping_range. Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>	2016-10-06 17:54:44 +02:00
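The split_after() behaviour described in 9ac441d3b5 can be illustrated with a minimal Python sketch. It assumes a non-wrapping range over integers with an exclusive start and an inclusive end; the tuple representation and the exact bound handling are assumptions made for illustration, not the Scylla implementation.

```python
from typing import Optional, Tuple

# Hypothetical representation: (start, end] with exclusive start, inclusive end.
Range = Tuple[int, int]

def split_after(r: Range, split_point: int) -> Optional[Range]:
    """Return the part of r that lies strictly after split_point."""
    start, end = r
    if split_point <= start:
        # split_point is before the range: the entire range is returned.
        return r
    if split_point >= end:
        # split_point is at or after the end: nothing is left (nullopt in C++).
        return None
    # split_point is inside the range: keep only the tail.
    return (split_point, end)
```

For example, under these assumptions `split_after((10, 20), 5)` keeps the whole range, while `split_after((10, 20), 25)` yields `None`.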
Raphael S. Carvalho	7ea4513595	database: trigger compaction after loading new sstables Scylla wasn't trying to compact new sstables uploaded via 'nodetool refresh'. Thus, all new sstables were left uncompacted until user issued 'nodetool flush' or a new sstable was written which would trigger compaction too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>	2016-10-06 18:26:49 +03:00
Raphael S. Carvalho	9c59ccc52a	storage_service: improve log message for refresh 'No new SSTables were found for keyspace1.standard1' was printed if user uploaded new sstables to upload dir instead, and that is confusing. We should instead print that if new sstables weren't found in both cf and cf/upload dirs. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <90386f6255407697434213227ae7ff0de7464f99.1475535203.git.raphaelsc@scylladb.com>	2016-10-06 18:26:32 +03:00
Raphael S. Carvalho	76862d0d9c	main: start compaction procedure after commit log is replayed Commit log replay is a synchronous operation in bootstrap, so services will only be started after it's completed. By starting compaction before, less bandwidth will be available to both and consequently boot will be slowed down. Fix is simply about moving compaction, which is an asynchronous operation after commitlog replay is over. Fixes #1620. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>	2016-10-06 18:25:24 +03:00
Nadav Har'El	ee7ec10b11	CQL parser: "CREATE MATERIALIZED VIEW" statement This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement, following Cassandra 3 syntax. For example: CREATE MATERIALIZED VIEW building_by_city AS SELECT * FROM buildings WHERE city IS NOT NULL PRIMARY KEY(city, name); It also adds the "IS NOT NULL" operator needed for this purpose. As in Cassandra, "IS NOT NULL" can only be used for materialized view creation, and not in a normal SELECT. It can only be used with the NULL operand (i.e., "IS NOT 3" will be a syntax error). The current implementation of this statement just does some sanity checking (such as verifying that "city" is a valid column name and that the "buildings" base table exists), and then complains that materialized views are not yet supported: SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS SELECT * FROM buildings WHERE city IS NOT NULL PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported"> As mentioned above, the "IS NOT NULL" restriction is not allowed in ordinary selects that do not create a materialized view: SELECT * FROM buildings WHERE city IS NOT NULL; InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation" Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>	2016-10-06 15:42:37 +03:00
Glauber Costa	7146776d7c	fix sstable tests by not using the flush_reader if no region_group The latest virtual dirty patches broke the SSTable tests. The reason for this is that those tests will flush synthetic memtables that do not have a region_group attached to it. Normally in cases like this we would just give the flush_reader an empty region group. However, the memtable class constructor takes a region_group pointer and that can be null according to the interface. So we must conditionally test it. If there isn't a region_group involved, the virtual dirty accounting should be disabled: after all, we won't even have the baseline memory to begin with. One of the approaches to fix this could be to just provide null accounter classes to be used as a surrogate for the accounting classes in this case. However, since this is mostly used for tests, a much simpler way is to just revert back to the scanning reader in that case. The scanning reader is similar enough to the flush_reader, except that it can handle partial ranges, slices, and delegate accesses to an sstable post-flush. We don't need any of that, but as argued above, there is no need to remove it either. Signed-off-by: Glauber Costa <glommer@scylladb.com> Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>	2016-10-05 12:44:21 +01:00
Avi Kivity	c94fb1bf12	build: reduce inclusions of messaging_service.hh Remove inclusions from header files (primary offender is fb_utilities.hh) and introduce new messaging_service_fwd.hh to reduce rebuilds when the messaging service changes. Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>	2016-10-05 11:46:49 +03:00
Avi Kivity	f8118d9fc2	Merge "Virtual dirty memory management" from Glauber "Description: ============ Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that, is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Results ======= With this patchset running a load big enough to easily saturate the disk, (commitlog disabled to highlight the effects of the memtable writer), I am able to run scylla for many minutes, with timeouts occurring only when I run out of disk space, whereas without this patch a swarm of timeouts would start merely 2 seconds after the load started - and would never get stable. In V2, I have sent a set of graphs illustrating the performance of this solution. This version does not have any significant differences in that front. 
For details, please refer to https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ Accuracy of the accounting: --------------------------- It is important for us to be as accurate as possible when accounting freed memory, since every byte we mark as freed may allow one or more requests to be executed. I have measured the accuracy of this approach (ignoring padding, object size for the mutation fragments) to be 99.83 % of used memory in the test workload I have run (large, 65k mutations). Memtables under this circumstance tend to have a very high occupancy ratio because throttle breeds idle, and idle breeds compact-on-idle. Known Issues: ------------- A lot of time can elapse between destroying the flush_reader and actually releasing memory. The release of memory only happens when the SSTable is fully sealed, and we have to flush the files, as well as finish writing all SSTable components at this point. This happened in practice with a buggy kernel that would result in flushes taking a long time. After that is fixed, this is just a theoretical problem and in practice it shouldn't matter given the time we expect those operations to take." * 'virtual-dirty-v6' of github.com:glommer/scylla: database: allow virtual dirty memory management streamed_mutation: make _buffer private add accounting of memory read to partition_snapshot_reader move partition_snapshot_reader code to header file LSA: allow a group to query its own region group memtables: split scanning reader in two sstables: use special reader for writing a memtable LSA: export information about object memory footprint LSA: export information about size of the throttle queue database: export virtual dirty bytes region group	2016-10-04 20:57:52 +03:00
Avi Kivity	cc33c8b4ba	Merge seastar upstream * seastar 18f7bb8...f937fb0 (5): > Merge "Fix signal mask corruption" from Tomasz > core/memory: Avoid violating strict aliasing when accessing allocation sites > core/memory: Avoid indirection when storing allocation sites > core/memory: Add a way to disable abort on allocation failure in some scope > core/sharded: Allow mapper to take the service by non-const reference	2016-10-04 20:08:57 +03:00
Glauber Costa	f89a67c75c	database: allow virtual dirty memory management Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	7b6e8a2526	streamed_mutation: make _buffer private It is currently protected, but now all users go through push_mutation_fragment(). So we can safely move its visibility to guarantee that it stays that way. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	1db245b52d	add accounting of memory read to partition_snapshot_reader By default, we don't do any accounting. By specializing this class and providing an accounter class, we can account for how much memory we are reading as we read through the elements. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	452eb95943	move partition_snapshot_reader code to header file This is so we can template it without worrying about declaring the specializations in the .cc file. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	86aa0b830d	LSA: allow a group to query its own region group Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	eee15578fb	memtables: split scanning reader in two The code that is common will live in its own reader, the iterator_reader. All friendly private access to memtable attributes and methods happen through the iterator reader. After this patch, we are now left with the scanning_reader - same as always, but now implemented on top of the iterator_reader, and a flush_reader, which will be used by SSTable flushes only. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	16886eeb96	sstables: use special reader for writing a memtable Right now the special reader doesn't do much, but the idea is that we will soon replace it with a reader that specializes in flush, and is in turn able to provide read-side on-flush functionality like virtual dirty. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	28e3f2f6ee	LSA: export information about object memory footprint We allocate objects of a certain size, but we use a bit more memory to hold them. To get a clearer picture of how much memory an object will cost us, we need help from the allocator. This patch exports an interface that allows users to query a specific allocator to get that information. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Pekka Enberg	c3bebea1ef	dist/docker: Add '--listen-address' to 'docker run' Add a '--listen-address' command line parameter to the Docker image, which can be used to set Scylla's listen address. Refs #1723 Message-Id: <1475485165-6772-1-git-send-email-penberg@scylladb.com>	2016-10-04 13:57:55 +03:00
Marius	876775a52c	dist/docker/ubuntu: refactored $IP/listen_address In order to allow Scylla’s docker container to handle multiple network interfaces, the start-scylla script was refactored: - `$IP` is now called `$SCYLLA_LISTEN_ADDRESS`, so it is less likely to be confused or interfere with other environment variables. - `$SCYLLA_LISTEN_ADDRESS` now checks its value and also tries to resolve a hostname, if no IP was set to it. - `$SCYLLA_LISTEN_DEVICE` can now be set as environment variable and contain any available NIC device name (e.g. `eth0`). The script automatically retrieves the IP address from the device. Usage: 1. With `$SCYLLA_LISTEN_ADDRESS` as IP: `docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=192.168.1.100 scylladb/scylla` 2. With `$SCYLLA_LISTEN_ADDRESS` as hostname: `docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=containername.network.lan scylladb/scylla` 3. With `$SCYLLA_LISTEN_DEVICE`: `docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_DEVICE=eth0 scylladb/scylla` Message-Id: <20161003151230.67672-1-marius@twostairs.com>	2016-10-04 13:56:55 +03:00
Raphael S. Carvalho	747b42299c	database: remove unused code Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>	2016-10-04 09:26:43 +03:00
Paweł Dziepak	7599ef6fde	query_pager: fix splitting range at the end bound Currently, the code responsible for calculating ranges for the next request could produce a wrap-around partition range. For example, if the original range was (unimportant, A] and the last partition key A then the output range would be (A, A]. This patch adds checks to make sure that in such cases the range is removed. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>	2016-10-03 19:33:42 +02:00
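The check described in 7599ef6fde can be sketched as follows. This is a minimal Python illustration, assuming a range written as (exclusive start, inclusive end]; the function name `next_page_range` and the tuple representation are hypothetical and not taken from the Scylla pager code.

```python
from typing import Optional, Tuple

# Hypothetical representation: (exclusive start or None if unbounded, inclusive end).
Range = Tuple[Optional[str], str]

def next_page_range(original: Range, last_key: str) -> Optional[Range]:
    """Range still to be read after a page that ended at last_key."""
    _start, end = original
    if last_key == end:
        # (end, end] would be a wrap-around range; drop it instead.
        return None
    return (last_key, end)
```

With the commit's example, `next_page_range((None, "A"), "A")` returns `None` rather than the range (A, A].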
Avi Kivity	8747054d10	exceptions: mark function called before construction static cassandra_exception::prepare_message() is called from derived classes' constructors before the base cassandra_exception object is constructed. This is technically illegal but harmless. Fix by marking the function static. Found by clang.	2016-10-03 16:29:02 +03:00
Calle Wilund	5b815b81b4	auth::password_authenticator: Ensure exceptions are processed in continuation Fixes #1718 (even more) Message-Id: <1475497389-27016-1-git-send-email-calle@scylladb.com>	2016-10-03 14:49:59 +02:00
Pekka Enberg	f3cd21c8f1	Merge seastar upstream * seastar 0e60722...18f7bb8 (1): > core/memory: Fix compilation errors	2016-10-03 12:54:38 +03:00
Calle Wilund	d24d0f8f90	auth::password_authenticator: "authenticate" should not throw undeclared excpt Fixes #1718 Message-Id: <1475487331-25927-1-git-send-email-calle@scylladb.com>	2016-10-03 12:53:30 +03:00
Avi Kivity	a51804eca8	Merge "token_restriction: Deal with minimum tokens" from Duarte "This patch set ensures we can correctly handle queries where the minimum token is specified." * 'min-token/v3' of github.com:duarten/scylla: cql_query_test: Add test case for min/max token bounds token_restriction: Deal with minimum tokens partitioner: Parse token from bytes	2016-10-02 12:32:40 +03:00
Avi Kivity	5071f4c0bf	Merge seastar upstream * seastar 9e1d5db...0e60722 (9): > core/memory: Replace assert with bad_alloc in allocate_large() > chunked_fifo: avoid direct use of sized operator delete > memory: fix build without heap profiler > xen: initialize port::_sem > Merge "Make input streams skippable" from Paweł > semaphore: require explict setting for start value > prometheus: remove invalid chars from meric names > core/memory: Introduce heap profiler > util/backtrace: Mark noexcept if func() doesn't throw	2016-10-02 11:43:22 +03:00
Vlad Zolotarov	7e180c7bd3	tracing: introduce the tracing::global_trace_state_ptr class This object, similarly to a global_schema_ptr, allows to dynamically create the trace_state_ptr objects on different shards in a context of the original tracing session. This object would create a secondary tracing session object from the original trace_state_ptr object when a trace_state_ptr object is needed on a "remote" shard, similarly to what we do when we need it on a remote Node. Fixes #1678 Fixes #1647 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>	2016-10-02 11:31:37 +03:00
Amnon Heiman	a83bd900be	scylla_setup: Check and report the scylla version This patch adds a call to the scylla-housekeeping check version during setup, so a warning will be printed if a newer version is available. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-10-02 11:11:07 +03:00
Amnon Heiman	5e3ab32365	scylla-housekeeping: check version during setup These changes are for running scylla-housekeeping during setup. They contain the following: 1. Get the current version from the command line (as scylla does not run at this stage). 2. Support a mode parameter in the command line to indicate that we are running during the installation. 3. Accept an external uuid that will be used with all interaction with the check_version server. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-10-02 11:11:07 +03:00
Takuya ASADA	15b156c9d4	dist/common/scripts/scylla_io_setup: describe how to set developer mode when validation tests failed Describe how to set developer mode, not to confuse users. Fixes #1701 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475167584-18092-1-git-send-email-syuu@scylladb.com>	2016-10-02 10:58:38 +03:00
Avi Kivity	58ddfea18f	Merge "Fixes for leveled compaction strategy" from Raphael * 'lcs_fixes' of github.com:raphaelsc/scylla: lcs: fix starvation at higher levels lcs: fix broken token range distribution at higher levels	2016-10-01 21:34:21 +03:00
Takuya ASADA	9639cc840e	dist/redhat: add missing build time dependency for libunwind There was missing dependency for libunwind, so add it. Fixes #1722 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475260099-25881-1-git-send-email-syuu@scylladb.com>	2016-09-30 21:33:39 +03:00
Takuya ASADA	c89d9599b1	dist/ubuntu: add missing build time dependency for libunwind There was missing dependency for libunwind, so add it. Fixes #1721 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1475255706-26434-1-git-send-email-syuu@scylladb.com>	2016-09-30 21:33:21 +03:00
Raphael S. Carvalho	a8ab4b8f37	lcs: fix starvation at higher levels When max sstable size is increased, higher levels are suffering from starvation because we decide to compact a given level if the following calculation results in a number greater than 1.001: level_size(L) / max_size_for_level_l(L) Fixes #1720. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-30 14:09:49 -03:00
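The score calculation quoted in a8ab4b8f37 can be written out as a short Python sketch. The ratio and the 1.001 threshold come from the commit message; the body of `max_size_for_level` is an assumption modeled on Cassandra-style leveled compaction (L0 holds four sstables, each higher level is ten times larger), and `needs_compaction` is a hypothetical helper name.

```python
def max_size_for_level(level: int, max_sstable_size: int) -> int:
    # Assumption: fanout of 10 per level, L0 limited to 4 sstables.
    if level == 0:
        return 4 * max_sstable_size
    return (10 ** level) * max_sstable_size

def needs_compaction(level_size: int, level: int, max_sstable_size: int) -> bool:
    # Score from the commit message: level_size(L) / max_size_for_level_l(L).
    score = level_size / max_size_for_level(level, max_sstable_size)
    return score > 1.001
```

Because the limit scales with `max_sstable_size`, the score only stays meaningful if the same sstable size is used consistently when computing it.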
Raphael S. Carvalho	a3bf7558f2	lcs: fix broken token range distribution at higher levels Uniform token range distribution across sstables in a level > 1 was broken, because we were only choosing sstable with lowest first key, when compacting a level > 0. This resulted in performance problem because L1->L2 may have a huge overlap over time, for example. Last compacted key will now be stored for each level to ensure sort of "round robin" selection of sstables for compactions at level >= 1. That's also done by C*, and they were once affected by it as described in https://issues.apache.org/jira/browse/CASSANDRA-6284. Fixes #1719. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-30 14:09:16 -03:00
Paweł Dziepak	eb1fcf3ecc	query_pagers: fix clustering key range calculation Paging code assumes that clustering row range [a, a] contains only one row which may not be true. Another problem is that it tries to use range<> interface for dealing with clustering key ranges which doesn't work because of the lack of correct comparator. Refs #1446. Fixes #1684. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>	2016-09-30 17:32:59 +02:00
Tomasz Grabiec	7e25b958ac	transport: Extend request memory footprint accounting to also cover execution CQL server is supposed to throttle requests so that they don't overflow memory. The problem is that it currently accounts for request's memory only around reading of its frame from the connection and not actual request execution. As a result too many requests may be allowed to execute and we may run out of memory. Fixes #1708. Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>	2016-09-30 14:23:14 +01:00
Duarte Nunes	72af476397	cql_query_test: Add test case for min/max token bounds This patch adds a test case for specifying the minimum and maximum tokens in a cql3 query. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-30 11:45:45 +00:00
Duarte Nunes	98b4814894	token_restriction: Deal with minimum tokens This patch fixes a bug where queries such as the following are not handled properly: "SELECT * FROM ks.cf WHERE token(id) > 9207857967443869328 AND token(id) <= -9223372036854775808" Here -9223372036854775808 represents the minimum token, which we were just translating into a token with kind::key, thus returning incorrect results. Ref #1139 Ref #693 Fixes #1717 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-30 11:17:08 +00:00
Duarte Nunes	862f51cddf	partitioner: Parse token from bytes This patch adds the from_bytes() function to the i_partitioner class, whose purpose is to parse a particular token and explicitly handle the case when the minimum token is specified. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-30 11:17:02 +00:00
Duarte Nunes	0c8f280af7	partition_key_view: Implement operator<< The operator is declared, but it isn't implemented. This patch fixes that. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1475225647-3800-1-git-send-email-duarte@scylladb.com>	2016-09-30 10:54:54 +02:00
Duarte Nunes	a36888f3cb	storage_service: Convert token through partitioner This patch ensures we use the partitioner to convert a token to sstring instead of casting. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1475179683-28552-1-git-send-email-duarte@scylladb.com>	2016-09-30 10:54:26 +02:00
Tomasz Grabiec	91b1bada55	Merge seastar upstream * seastar 5b7252d...9e1d5db (5): > prometheus: prevent illegal prometheus names > scollectd: raw_to_value should not use network order > semaphore: Introduce get_units() > core::scollectd: truncate the identifiers fields on a 63 characters boundary > Merge "Fix ASAN errors in debug builds" from Tomasz	2016-09-29 13:23:24 +02:00
Asias He	511f8aeb91	gossip: Do not remove failure_detector history on remove_endpoint Otherwise a node could wrongly think the decommissioned node is still alive and not evict it from the gossip membership. Backport: CASSANDRA-10371 7877d6f Don't remove FailureDetector history on removeEndpoint Fixes #1714 Message-Id: <f7f6f1eec2aab1b97a2e568acfd756cca7fc463a.1475112303.git.asias@scylladb.com>	2016-09-29 13:00:47 +03:00
Asias He	a6d6341627	streaming: Add total_{incoming,outgoing}_bytes collectd metrics It reflects number of bytes sent or received per second in streaming. To use it: $ tools/scyllatop/scyllatop.py "streaming" Refs #1655 Message-Id: <5f7943cb2b459db5ed4bd8d7365532ea201ad2d9.1475116963.git.asias@scylladb.com>	2016-09-29 11:54:32 +02:00
Asias He	a6529ad582	repair: Fix split_and_add Before: the range is split only once, so it is split into 2 sub ranges INFO 2016-09-29 15:52:43,625 [shard 0] repair - target_partitions=100, estimated_partitions=537, ranges.size=2, range=(8993553141924659802, 8997061146192366917] -> ranges={ (8993553141924659802, 8995307144058513359], (8995307144058513359, 8997061146192366917]} After: the range is split multiple times, resulting in 16 sub ranges. INFO 2016-09-29 15:55:07,934 [shard 0] repair - target_partitions=100, estimated_partitions=67, ranges.size=16, range=(8993553141924659802, 8997061146192366917] -> ranges={ (8993553141924659802, 8993772392191391496], (8993772392191391496, 8993991642458123191], (8993991642458123191, 8994210892724854885], (8994210892724854885, 8994430142991586580], (8994430142991586580, 8994649393258318274], (8994649393258318274, 8994868643525049969], (8994868643525049969, 8995087893791781664], (8995087893791781664, 8995307144058513359], (8995307144058513359, 8995526394325245053], (8995526394325245053, 8995745644591976748], (8995745644591976748, 8995964894858708443], (8995964894858708443, 8996184145125440138], (8996184145125440138, 8996403395392171832], (8996403395392171832, 8996622645658903527], (8996622645658903527, 8996841895925635222], (8996841895925635222, 8997061146192366917]} Without this patch, repair can do a checksum on a range with a lot of partitions, not the expected fewer than 100 partitions per checksum. This can lead to unnecessary data transfer since the checksum is too coarse. For instance, as above, if the checksum of 1 out of 537 partitions is different, the whole 537 partitions will be synced. Fixes #1613 Message-Id: <0775c20c485c105df5f10bd685048227f074c365.1475137029.git.asias@scylladb.com>	2016-09-29 10:09:25 +01:00
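The repeated splitting shown in the a6529ad582 log output can be approximated with a small Python sketch. It assumes each round halves every sub-range at the integer midpoint; the helper name `split_repeatedly` and the explicit `rounds` parameter are hypothetical, since the commit message does not show the stopping rule used by split_and_add.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end] with exclusive start, inclusive end.
TokenRange = Tuple[int, int]

def split_repeatedly(r: TokenRange, rounds: int) -> List[TokenRange]:
    """Halve every sub-range `rounds` times, giving 2**rounds sub-ranges."""
    ranges = [r]
    for _ in range(rounds):
        halved = []
        for lo, hi in ranges:
            mid = lo + (hi - lo) // 2  # integer midpoint, rounded down
            halved.append((lo, mid))
            halved.append((mid, hi))
        ranges = halved
    return ranges

# Four rounds on the range from the log give 16 sub-ranges.
sub = split_repeatedly((8993553141924659802, 8997061146192366917), 4)
```

Under this assumption the first sub-range ends at 8993772392191391496 and the ninth starts at 8995307144058513359, which agrees with the boundaries printed in the "After" log line.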
Pekka Enberg	20dccb4bf7	transport/server: Fix CQL Snappy compression failure The snappy_compress() function expects the "compressed_length" parameter to contain the actual output buffer length but now we're passing random garbage from the stack. Fixes #1711 Message-Id: <1475132127-316-1-git-send-email-penberg@scylladb.com>	2016-09-29 09:29:51 +01:00
Asias He	774d16306f	gossip: Use lowres_clock for scheduled_gossip_task The timer is fired once per second. Using low resolution clock is enough. Message-Id: <1f21514e975afea6ac5c9dde18a881a41561da70.1475130948.git.asias@scylladb.com>	2016-09-29 10:03:14 +03:00
Piotr Jastrzebski	1948ec8061	Update README.md Add --init to git submodules update. It's needed for fmt. Add libunwind-devel dependency to dnf install. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <4918237f91d985649c195035c02b2dd9e9a1ff68.1475087373.git.piotr@scylladb.com>	2016-09-29 10:02:34 +03:00
Gleb Natapov	32989d1e66	Merge seastar upstream * seastar 2b55789...5b7252d (3): > Merge "rpc: serialize large messages into fragmented memory" from Gleb > Merge "Print backtrace on SIGSEGV and SIGABRT" from Tomasz > test_runner: avoid nested optionals Includes patch from Gleb to adapt to seastar changes.	2016-09-28 17:34:16 +03:00
Pekka Enberg	9ea24c9d2b	Merge "repair: less stream_plan and less streaming traffic" from Asias "This series improves repair by 1) using less streaming sessions 2) reducing unnecessary streaming traffic 3) fixing a hang during shutdown See commit log for "repair: Reduce stream_plan usage", "repair: Reduce unnecessary streaming traffic" and "streaming: Fail streaming sessions during shutdown" for details. Tested with repair_additional_test.py."	2016-09-28 09:54:15 +03:00
Glauber Costa	f5fd6bd714	LSA: export information about size of the throttle queue Also add information about how long the oldest entry has been sitting in the queue. This is part of the backpressure work to allow us to throttle incoming requests if we won't have memory to process them. Shortages can happen in all sorts of places, and it is useful when designing and testing the solutions to know where they are, and how bad they are. This counter is named for consistency after similar counters from transport/. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Glauber Costa	aa6a96d09b	database: export virtual dirty bytes region group Currently, we export the region group where memtables are placed as dirty bytes. Upcoming patches will optimistically mark some bytes in this region as free, a scheme we know as "virtual dirty". We are still interested in knowing the real state of the dirty region, so we will keep track of the bytes virtually freed and split the counters in two. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Gleb Natapov	c95df8f053	messaging_service: use correct value for listen_to_bc_address in a constructor used by tests Also make sure to not listen on the same exact address twice in case listen_address == broadcast_address. Scylla configuration code does not allow such a thing to be configured, but better to be safe. Message-Id: <20160927102316.GO32178@scylladb.com>	2016-09-27 11:27:23 +01:00
Pekka Enberg	e35166af10	Merge "gossip: Fix expire_time for gossip membership removal" from Asias "We currently use steady_clock which is not consistent on nodes in the cluster. Use system_clock for it. Fixes #1704"	2016-09-27 11:46:09 +03:00
Asias He	1292341d77	gossip: Improve the expire time logging Print when the node will be removed from gossip membership, e.g., INFO 2016-09-27 08:54:49,262 [shard 0] gossip - Node 127.0.0.3 will be removed from gossip at [2016-09-30 08:54:48]: (expire = 1475196888294489339, now = 1474937689262295270, diff = 259199 seconds)	2016-09-27 16:42:35 +08:00
Asias He	f0d3084c8b	gossip: Switch to use system_clock The expire time which is used to decide when to remove a node from gossip membership is gossiped around the cluster. We switched to steady clock in the past. In order to have a consistent time_point in all the nodes in the cluster, we have to use wall clock. Switch to use system_clock for gossip. Fixes #1704	2016-09-27 16:42:13 +08:00
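The reasoning behind the clock switch can be shown with a small sketch. It uses Python's `time` module as a stand-in for the C++ clocks: `time.monotonic()` plays the role of a steady clock (its reference point is undefined, so a reading is meaningless on another machine), while `time.time_ns()` is a wall clock whose readings can be compared across nodes. The helper names are hypothetical; the two logged values come from the "Improve the expire time logging" commit, and the three-day delay is the one the decommission logging series mentions.

```python
import time

NANOS_PER_SECOND = 1_000_000_000
THREE_DAYS = 3 * 24 * 60 * 60  # 259200 seconds


def expire_time_ns(now_ns, delay_seconds=THREE_DAYS):
    """Wall-clock expiry timestamp in nanoseconds (hypothetical helper)."""
    return now_ns + delay_seconds * NANOS_PER_SECOND


def seconds_left(expire_ns, now_ns):
    """Whole seconds remaining until expiry."""
    return (expire_ns - now_ns) // NANOS_PER_SECOND


# Values copied from the gossip log line: both are wall-clock nanoseconds.
logged_expire = 1475196888294489339
logged_now = 1474937689262295270
print(seconds_left(logged_expire, logged_now))

# A wall-clock expiry can be gossiped and interpreted on any node.
wall_now = time.time_ns()
print(seconds_left(expire_time_ns(wall_now), wall_now))

# A monotonic reading only supports differences taken on the same machine.
print(time.monotonic())
```

The logged difference comes out one second short of three days because the log line was printed roughly a second after the expiry time was computed.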
Avi Kivity	bfa9aa5d23	Merge "Installing the node_exporter" from Amnon "The prometheus project and its sub projects do not have RPM/DEB packaging yet, but they do have binaries for download. This series adds an installation script that downloads, installs and runs the node_exporter as a service. For OSes that use systemd it has a spec file ready that will be packaged with the system. For Ubuntu a service file will be created when running the installer. After this series, running node_exporter_install leaves a node_exporter running as a service on the machine."	2016-09-27 11:00:00 +03:00
Asias He	802c25e67b	repair: Switch to use make_streaming_reader in checksum calculation In patch `ac619820` (streaming: Switch to use make_streaming_reader), we switched to use make_streaming_reader for streaming. In repair, the checksum phase also uses a mutation reader. For the same reasons (no pollution to row cache, bounded new data after the reader is created), switch repair checksum calculation to use the make_streaming_reader too. Fixes #382 Fixes #1682 Message-Id: <9e0ecda861bb0b6f690da5e2378b208159ffa41c.1474933195.git.asias@scylladb.com>	2016-09-27 10:58:31 +03:00
Tomasz Grabiec	c03568d687	Merge tag 'asias/read_data_from_sstable_in_streaming/v2' from seastar-dev.git From Asias: With this series, streaming and repair are improved: - streaming, repair will not pollute the row cache on the sender side any more. Currently, we are risking evicting all the frequently-queried partitions from the cache when an operation like repair reads entire sstables and floods the row cache with swathes of cold data that they read from disk. - less data will be sent because the reader will only return data that existed before the point at which the reader is created, plus a bounded amount of writes which arrive later. This helps reduce the streaming time in the case new data is being inserted all the time while streaming is in progress. E.g., adding a new node while there is a lot of cql write workload. Fixes #382 and #1682	2016-09-26 11:30:12 +02:00
Asias He	ac6198208b	streaming: Switch to use make_streaming_reader Using make_streaming_reader for streaming on the sender side has the following advantages: - streaming, repair will not pollute the row cache on the sender side any more. Currently, we are risking evicting all the frequently-queried partitions from the cache when an operation like repair reads entire sstables and floods the row cache with swathes of cold data that they read from disk. - less data will be sent because the reader will only return data that existed before the point at which the reader is created, plus a bounded amount of writes which arrive later. This helps reduce the streaming time in the case new data is being inserted all the time while streaming is in progress. E.g., adding a new node while there is a lot of cql write workload. Fixes #382 Fixes #1682	2016-09-26 16:12:56 +08:00
Asias He	b505e34062	database: Introduce make_streaming_reader The make_streaming_reader returns a combined mutation reader that reads mutations from sstables and memtable. The memtable reader handles memtable flushing automatically so no special handling is needed here. It will be used by streaming soon.	2016-09-26 16:02:48 +08:00
Asias He	e5a5a9ba15	repair: Rename sync_ranges to request_transfer_ranges To reflect the fact that the function does not sync the ranges but adds the ranges to request from peer or transfer to peer.	2016-09-26 16:00:07 +08:00
Takuya ASADA	d38aa6570f	dist/common/scripts/scylla_setup: do not ask to select disks when there's no free disk When there's no free disk, it asks to select disks from an empty list: "Please select disks from following list: type 'done' to finish selection. selected:" We should avoid asking and abort RAID setup instead. Fixes #1673 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1474429218-28382-1-git-send-email-syuu@scylladb.com>	2016-09-26 08:53:54 +03:00
Gleb Natapov	26ae8e8365	implement listen_on_broadcast_address option When using multiple physical network interfaces, set this to true to listen on broadcast_address in addition to the listen_address, allowing nodes to communicate in both interfaces. Ignore this property if the network configuration automatically routes between the public and private networks such as EC2. Message-Id: <20160921094810.GA28654@scylladb.com>	2016-09-26 08:49:54 +03:00
Asias He	f377a3b7ac	streaming: Fail streaming sessions during shutdown Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test The test does: - Insert data on node1 only - Insert data on node2 only - Run repair on node1 and stop node1 once "starting user-requested repair" is seen The repair shutdown code may wait for the stream session to complete for a very long time if node 1 finishes sending data to node2 and is waiting for node2 to send data to it, when node1 is stopped. The stream session will not be closed in this case until stream session _keep_alive_timeout (10 minutes) expires. Instead of waiting for the stream_session keep alive timer to expire, we can fail all the stream sessions during shutdown. Before 1 - The bad case (repair shutdown will last for 10 minutes): INFO 2016-09-21 16:23:56,617 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Executing streaming plan for repair-in INFO 2016-09-21 16:23:56,617 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Starting streaming to 127.0.0.2 INFO 2016-09-21 16:23:56,617 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Beginning stream session with 127.0.0.2 INFO 2016-09-21 16:23:56,618 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] Prepare completed with 127.0.0.2. 
Receiving 1, sending 0 INFO 2016-09-21 16:23:58,625 [shard 0] storage_service - Stop transport: stop_gossiping done INFO 2016-09-21 16:23:58,625 [shard 0] storage_service - Thrift server stopped INFO 2016-09-21 16:23:58,625 [shard 0] storage_service - CQL server stopped INFO 2016-09-21 16:23:58,625 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done INFO 2016-09-21 16:23:58,626 [shard 0] storage_service - messaging_service stopped INFO 2016-09-21 16:23:58,626 [shard 0] storage_service - Stop transport: shutdown messaging_service done INFO 2016-09-21 16:23:58,626 [shard 0] storage_service - Stop transport: auth shutdown INFO 2016-09-21 16:23:58,626 [shard 0] storage_service - Stop transport: done INFO 2016-09-21 16:23:58,626 [shard 0] storage_service - Drain on shutdown: stop_transport done INFO 2016-09-21 16:23:58,626 [shard 0] tracing - Asked to shut down INFO 2016-09-21 16:23:58,626 [shard 0] tracing - Tracing is down INFO 2016-09-21 16:23:58,626 [shard 1] tracing - Asked to shut down INFO 2016-09-21 16:23:58,626 [shard 1] tracing - Tracing is down INFO 2016-09-21 16:23:58,626 [shard 0] storage_service - Drain on shutdown: tracing is stopped INFO 2016-09-21 16:23:58,669 [shard 0] storage_service - Drain on shutdown: flush column_families done INFO 2016-09-21 16:23:58,669 [shard 0] storage_service - Drain on shutdown: shutdown commitlog done INFO 2016-09-21 16:23:58,669 [shard 0] storage_service - Drain on shutdown: done INFO 2016-09-21 16:23:58,669 [shard 0] repair - Starting shutdown of repair INFO 2016-09-21 16:25:56,624 [shard 0] stream_session - [Stream #bd34fea1-7fd4-11e6-8020-000000000001] The session 0x600021516c00 made no progress with peer 127.0.0.2 Before 2 - The good case: INFO 2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Executing streaming plan for repair-in INFO 2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Starting 
streaming to 127.0.0.2 INFO 2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Beginning stream session with 127.0.0.2 INFO 2016-09-21 16:18:32,087 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Prepare completed with 127.0.0.2. Receiving 1, sending 0 INFO 2016-09-21 16:18:34,098 [shard 0] storage_service - Stop transport: stop_gossiping done INFO 2016-09-21 16:18:34,098 [shard 0] storage_service - Thrift server stopped INFO 2016-09-21 16:18:34,098 [shard 0] storage_service - CQL server stopped INFO 2016-09-21 16:18:34,098 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done INFO 2016-09-21 16:18:34,155 [shard 0] messaging_service - Retry verb=19 to 127.0.0.2:0, retry=10: rpc::closed_error (connection is closed) WARN 2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] COMPLETE_MESSAGE for 127.0.0.2 has failed: rpc::closed_error (connection is closed) WARN 2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Streaming error occurred INFO 2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Session with 127.0.0.2 is complete, state=FAILED INFO 2016-09-21 16:18:34,155 [shard 0] storage_service - messaging_service stopped INFO 2016-09-21 16:18:34,155 [shard 0] storage_service - Stop transport: shutdown messaging_service done INFO 2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] bytes_sent = 0, bytes_received = 245000 WARN 2016-09-21 16:18:34,155 [shard 0] stream_session - [Stream #fbc668d1-7fd3-11e6-bc54-000000000001] Stream failed, peers={127.0.0.2} WARN 2016-09-21 16:18:34,155 [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed) INFO 2016-09-21 16:18:34,155 [shard 0] repair - repair 1 failed - streaming::stream_exception (Stream failed) INFO 2016-09-21 
16:18:34,155 [shard 0] storage_service - Stop transport: auth shutdown INFO 2016-09-21 16:18:34,155 [shard 0] storage_service - Stop transport: done INFO 2016-09-21 16:18:34,155 [shard 0] storage_service - Drain on shutdown: stop_transport done INFO 2016-09-21 16:18:34,155 [shard 0] tracing - Asked to shut down INFO 2016-09-21 16:18:34,155 [shard 0] tracing - Tracing is down INFO 2016-09-21 16:18:34,156 [shard 1] tracing - Asked to shut down INFO 2016-09-21 16:18:34,156 [shard 1] tracing - Tracing is down INFO 2016-09-21 16:18:34,156 [shard 0] storage_service - Drain on shutdown: tracing is stopped INFO 2016-09-21 16:18:34,199 [shard 0] storage_service - Drain on shutdown: flush column_families done INFO 2016-09-21 16:18:34,199 [shard 0] storage_service - Drain on shutdown: shutdown commitlog done INFO 2016-09-21 16:18:34,199 [shard 0] storage_service - Drain on shutdown: done INFO 2016-09-21 16:18:34,199 [shard 0] repair - Starting shutdown of repair INFO 2016-09-21 16:18:34,199 [shard 0] repair - Completed shutdown of repair INFO 2016-09-21 16:18:34,199 [shard 0] compaction_manager - Asked to stop INFO 2016-09-21 16:18:34,199 [shard 1] compaction_manager - Asked to stop After: INFO 2016-09-21 16:06:21,684 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Executing streaming plan for repair-in INFO 2016-09-21 16:06:21,684 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Starting streaming to 127.0.0.2 INFO 2016-09-21 16:06:21,684 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Beginning stream session with 127.0.0.2 INFO 2016-09-21 16:06:21,685 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Prepare completed with 127.0.0.2. 
Receiving 1, sending 0 INFO 2016-09-21 16:06:23,687 [shard 0] storage_service - Stop transport: stop_gossiping done INFO 2016-09-21 16:06:23,687 [shard 0] storage_service - Thrift server stopped INFO 2016-09-21 16:06:23,687 [shard 0] storage_service - CQL server stopped INFO 2016-09-21 16:06:23,687 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - messaging_service stopped INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: shutdown messaging_service done INFO 2016-09-21 16:06:23,688 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Session with 127.0.0.2 is complete, state=FAILED INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - stream_manager stopped INFO 2016-09-21 16:06:23,688 [shard 1] storage_service - stream_manager stopped INFO 2016-09-21 16:06:23,688 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] bytes_sent = 0, bytes_received = 25725 INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: shutdown stream_manager done WARN 2016-09-21 16:06:23,688 [shard 0] stream_session - [Stream #48661c51-7fd2-11e6-8ba7-000000000001] Stream failed, peers={127.0.0.2} WARN 2016-09-21 16:06:23,688 [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed) INFO 2016-09-21 16:06:23,688 [shard 0] repair - repair 1 failed - streaming::stream_exception (Stream failed) INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: auth shutdown INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - Stop transport: done INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - Drain on shutdown: stop_transport done INFO 2016-09-21 16:06:23,688 [shard 0] tracing - Asked to shut down INFO 2016-09-21 16:06:23,688 [shard 0] tracing - Tracing is down INFO 2016-09-21 16:06:23,688 [shard 1] tracing - Asked to shut down INFO 2016-09-21 16:06:23,688 [shard 1] tracing - 
Tracing is down INFO 2016-09-21 16:06:23,688 [shard 0] storage_service - Drain on shutdown: tracing is stopped INFO 2016-09-21 16:06:23,774 [shard 0] storage_service - Drain on shutdown: flush column_families done INFO 2016-09-21 16:06:23,774 [shard 0] storage_service - Drain on shutdown: shutdown commitlog done INFO 2016-09-21 16:06:23,774 [shard 0] storage_service - Drain on shutdown: done INFO 2016-09-21 16:06:23,774 [shard 0] repair - Starting shutdown of repair INFO 2016-09-21 16:06:23,774 [shard 0] repair - Completed shutdown of repair INFO 2016-09-21 16:06:23,774 [shard 0] compaction_manager - Asked to stop INFO 2016-09-21 16:06:23,774 [shard 1] compaction_manager - Asked to stop	2016-09-26 06:29:40 +08:00
Asias He	7c873f0d1f	repair: Reduce unnecessary streaming traffic If the remote peers have the same checksum, we can fetch from only one of the peer nodes instead of all of them, since they all have the same data anyway. In addition to the above optimization, if the local node has no data, we can skip sending the data back to the remote peers. Because all the remote peers have the same checksum and the local node has no data, each and every remote peer already has all the data. There is no need to merge the remote data with local data and send the merged data back to the remote peers. Refs: #1617	2016-09-26 06:28:51 +08:00
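The two optimizations in this commit can be sketched as a small decision function. Everything here is an assumption made for illustration: `plan_transfers`, its parameters, and the fallback branch for peers that disagree with each other are hypothetical and are not taken from the Scylla repair code.

```python
def plan_transfers(local_checksum, peer_checksums, local_is_empty):
    """Return (peers_to_fetch_from, peers_to_send_to) for one range.

    `peer_checksums` maps a peer name to the checksum it reported.
    Hypothetical sketch of the optimizations described in the commit.
    """
    differing = [p for p, c in peer_checksums.items() if c != local_checksum]
    if not differing:
        return [], []  # everyone already agrees with the local node

    all_peers_agree = len(set(peer_checksums.values())) == 1
    if all_peers_agree:
        # Identical peers hold identical data: fetching from one is enough.
        fetch_from = [differing[0]]
        # With no local data there is nothing new to merge and send back.
        send_to = [] if local_is_empty else list(differing)
    else:
        # Assumed fallback: exchange data with every peer that differs.
        fetch_from = list(differing)
        send_to = list(differing)
    return fetch_from, send_to


print(plan_transfers("x", {"n2": "a", "n3": "a"}, local_is_empty=True))
print(plan_transfers("x", {"n2": "a", "n3": "a"}, local_is_empty=False))
```

In the first call both optimizations apply, so a single fetch replaces two fetches and two sends.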
Asias He	99e77e8ec2	repair: Do not abort the repair when one range is failed failed_ranges is added to track the ranges that fail during repair.	2016-09-26 06:28:51 +08:00
Asias He	81c98ff3d9	repair: Reduce stream_plan usage Right now, we are using one stream_plan for each range of a column family. This generates tons of stream_plans and stream_sessions. Each stream_plan can transfer multiple ranges and column families. We can use a single stream_plan to stream data for multiple ranges and column families, so that 1) overhead of stream_plan/session negotiation is reduced 2) it is much easier to debug/monitor a few stream_sessions Fixes #1685	2016-09-26 06:28:50 +08:00
Asias He	a0020fdad2	stream_session: Allow adding ranges to a cf more than once Append the ranges to a stream_transfer_task if the cf is already added to _transfers in add_transfer_ranges.	2016-09-26 06:28:50 +08:00
Asias He	576e15532f	streaming: Add append_ranges for stream_transfer_task Allow to append more ranges to transfer for a stream transfer task.	2016-09-26 06:28:50 +08:00
Avi Kivity	3057ca05bc	Merge "Improve logging when nodes are decommissioned" from Asias "When a node is decommissioned, its gossip state will not be removed from gossip immediately. It will only be removed 3 days later, which helps nodes that were down when the node was decommissioned to learn about the decommission later when they are up again. This series improves the logging to reduce confusion when a node tries to talk to a decommissioned node. In addition, we now do not try to talk to the decommissioned node in the unreachable_endpoints gossip round. Fixes #1615" * tag 'asias/loggging_decommissioned_nodes/v1' of github.com:cloudius-systems/seastar-dev: gossip: Make two log items debug level gossip: Print node status when node is UP or DOWN gossip: Ignore the node which is decommissioned in gossip round gossip: Print convict debug info only when the node is alive gossip: Add more timing log in add_expire_time_for_endpoint streaming: Print on_remove and on_restart log when peer exists streaming: Introduce has_peer in stream_manager	2016-09-25 15:19:13 +03:00
Asias He	830f4ee353	gossip: Make two log items debug level It is duplicated with the "InetAddress x.x.x.x is now UP" message. INFO 2016-09-23 10:35:15,512 [shard 0] gossip - Node 127.0.0.1 has restarted, now UP, status = NORMAL INFO 2016-09-23 10:35:15,513 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL Make the log a bit cleaner.	2016-09-25 07:17:19 +08:00
Asias He	a26a26963c	gossip: Print node status when node is UP or DOWN For example: gossip - InetAddress 127.0.0.4 is now UP, status = NORMAL gossip - InetAddress 127.0.0.3 is now DOWN, status = LEFT gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown	2016-09-25 07:17:19 +08:00
Asias He	1d9401d080	gossip: Ignore the node which is decommissioned in gossip round If the node is decommissioned, there is no point to try to contact it again in the gossip round.	2016-09-25 07:17:19 +08:00
Asias He	4b73443222	gossip: Print convict debug info only when the node is alive	2016-09-25 07:17:19 +08:00
Asias He	99a2ae0fb5	gossip: Add more timing log in add_expire_time_for_endpoint It tells when the node is expected to expire and how many seconds are left.	2016-09-25 07:17:19 +08:00
Asias He	40f7a355a0	streaming: Print on_remove and on_restart log when peer exists We print the following messages even if there is no stream_session with that peer. It is a bit confusing. INFO 2016-09-23 08:26:37,254 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.1 in on_restart INFO 2016-09-23 08:26:37,287 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.3 in on_remove Print only when the streaming session with the peer exists.	2016-09-25 07:17:19 +08:00
Asias He	2ac4ce77a9	streaming: Introduce has_peer in stream_manager It is used to query if a streaming peer with inet_address exists.	2016-09-25 07:17:13 +08:00
Nadav Har'El	fe1ba753ce	Avoid semaphore's default initial value The fact that Seastar's semaphore has a default initializer of 1 if not explicitly initialized is confusing and unexpected and recently led to two bugs. So ScyllaDB should not rely on this default behavior, and specify the initial value of each semaphore explicitly. In several cases in the ScyllaDB code, the explicit initialization was missing, and this patch adds it. In one case (rate_limiter) I even think the default of 1 was a bit strange, and 0 makes more sense. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1474530745-23951-1-git-send-email-nyh@scylladb.com>	2016-09-24 19:25:02 +03:00
Paweł Dziepak	eb59b4c4ab	keys: disable constructing from generic range stdx::optional<T> uses quite elaborate std::enable_if_t magic to decide whether the argument passed to its constructor should be used for a call T constructor or stdx::optional<T> constructor. Apparently, with GCC 6.2 having T constructor which accepts any type confuses that magic and we end up with compile errors. The solution is to have from_range() method that replaces that constructor from range. There is also constructor that creates a key from std::vector<bytes> so that code generated by IDL works as it did before. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1474550971-15309-1-git-send-email-pdziepak@scylladb.com>	2016-09-24 18:57:01 +03:00
Raphael S. Carvalho	cfe7419f0f	sstables: update or remove some outdated comments Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <74bae503447da2544a005e29b7d3aafa9f6e8c90.1474383273.git.raphaelsc@scylladb.com>	2016-09-24 18:53:19 +03:00
Raphael S. Carvalho	0f1bd3c527	db: fix clustering key filter When date tiered strategy is enabled, filter_sstable_for_reader() was returning more sstables than needed because the return type of serialized_tri_compare::operator() was wrong, which results in bad performance. tgrabiec: Refs #1449 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0301d7588e33c7bbb8cd80fed20a1827926a8fff.1474585088.git.raphaelsc@scylladb.com>	2016-09-23 12:33:58 +02:00
Asias He	e352570f52	conf: Move initial_token to supported section in scylla.yaml initial_token is actually supported Fixes #1686 Message-Id: <465da088696f72a3a7bcf19ba8e4895a0a648e7c.1474512235.git.asias@scylladb.com>	2016-09-23 09:34:05 +03:00
Tomasz Grabiec	0b0d126721	Merge seastar upstream Fixes #1622. Fixes #1690. * seastar 40a68fa...2b55789 (5): > input_stream: Fix possible infinite recursion in consume() > iostream: Fix stack overflow in output_stream::split_and_put() > condition_variable: fix spurious wakeup > Merge "assorted rpc fixes" from Gleb > Merge "Simple fixes for doxygen" from Glauber	2016-09-22 14:27:52 +02:00
Amnon Heiman	a6749116a7	scylla_setup: Install node_exporter This adds the option to install node_exporter during setup. The node_exporter exports server information via the prometheus API. It should be used when using the scylla prometheus API to get the server information. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-22 09:43:53 +03:00
Amnon Heiman	4e0dcb59e7	scylla.spec: package the node_exporter scripts This patch adds the node_exporter related files to the rpm. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-22 09:43:53 +03:00
Amnon Heiman	3d242fdb4d	Add a link to node_exporter_install This adds a link to node_exporter_install in sbin, so it will be available in the path. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-22 09:43:53 +03:00
Amnon Heiman	9d3edd3a28	service file for node_exporter with systemd This patch adds a service file for OS that supports systemd. When started, it would run an already installed node_exporter or fail. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-22 09:43:53 +03:00
Amnon Heiman	801b2c4914	An installation script for node_exporter node_exporter is a utility that exports node information via the prometheus API. It takes care of host related metrics such as CPU and memory. The install script downloads the node_exporter binaries and creates a link in /usr/bin. On an OS with systemd support it enables and starts the installed service file so that it runs as a service. On others (ubuntu) it creates a conf file and starts it. The installation should be done using sudo. After a successful installation, the node_exporter runs as a service. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-22 09:33:03 +03:00
Paweł Dziepak	906250dcbd	Merge "Enhance GDB script with new LSA-related commands" from Tomek	2016-09-21 13:22:00 +01:00
Raphael S. Carvalho	67343798cf	api: implement api to return sstable count per level 'nodetool cfstats' wasn't showing per-level sstable count because the API wasn't implemented. Fixes #1119. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0dcdf9196eaec1692003fcc8ef18c77d0834b2c6.1474410770.git.raphaelsc@scylladb.com>	2016-09-21 09:13:40 +03:00
Asias He	aa47265381	gossip: Fix std::out_of_range in setup_collectd It is possible that endpoint_state_map does not contain the entry for the node itself when collectd accesses it. Fixes the issue: Sep 18 11:33:16 XXX scylla[19483]: [shard 0] seastar - Exceptional future ignored: std::out_of_range (_Map_base::at) Fixes #1656 Message-Id: <8ffe22a542ff71e8c121b06ad62f94db54cc388f.1474377722.git.asias@scylladb.com>	2016-09-20 19:38:16 +03:00
Tomasz Grabiec	69aec3835f	scylla-gdb: Enhance 'scylla ptr' to show if object is managed by LSA Example: (gdb) scylla ptr 0x601000480000 thread 1, large, LSA-managed One can then use 'scylla lsa-segment 0x601000480000' to examine LSA segment contents.	2016-09-20 16:53:23 +02:00
Tomasz Grabiec	486a92092b	scylla-gdb: Add 'scylla segment-descs' command Displays information about shard's segment descriptors. One can see which segments belong to LSA, what's their occupancy, etc. (gdb) scylla segment-descs ... 0x601000940000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420 0x601000980000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420 0x6010009c0000: lsa free=261153 region=0x60100036fcf0 zone=0x6010000fb420 0x601000a00000: std 0x601000a40000: lsa free=25508 region=0x60100036d890 zone=0x6010000fb420 0x601000a80000: std 0x601000ac0000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420 0x601000b00000: lsa free=26092 region=0x60100036d890 zone=0x6010000fb420 0x601000b40000: std ...	2016-09-20 16:53:23 +02:00
Tomasz Grabiec	b0b28696b5	scylla-gdb: Add 'scylla lsa-segment' command Allows one to examine contents of LSA segment. Example: (gdb) scylla lsa-segment 0x601000480000 0x601000480e70: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000480f10: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000480fb0: free size=192 0x60100048107e: free size=42 0x6010004814e0: free size=192 0x6010004815ae: free size=40 0x6010004815e8: free size=192 0x6010004816b8: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000481758: free size=192 ...	2016-09-20 16:53:21 +02:00
Tomasz Grabiec	5011b77e15	scylla-gdb: Add std::vector wrapper Makes vector values iterable from python level.	2016-09-20 16:53:20 +02:00
Pekka Enberg	42dd4670dc	transport/server: Add CQL frame Snappy compression support Fixes #1286 Message-Id: <1474370861-5928-1-git-send-email-penberg@scylladb.com>	2016-09-20 12:33:36 +01:00
Pekka Enberg	acc93509a2	transport/server: Fix CQL connection compression negotiation Benoît Canet points out that CQL messages are not always compressed although compression is enabled by the driver. Turns out our CQL compression negotiation is broken. We need to negotiate compression upon STARTUP message and not rely on the incoming request to have the compression bit enabled. Fixes #1680 Message-Id: <1474366693-3001-1-git-send-email-penberg@scylladb.com>	2016-09-20 11:19:27 +01:00
Pekka Enberg	f92bbc6f44	cql3: Kill unimplemented query_options constructor The constructor was added in commit `7f3ce39` ("query_options: Add constructor for batch mode options (multi-level)") but apparently it was never actually implemented. Spotted by CLion. Message-Id: <1474303017-23383-1-git-send-email-penberg@scylladb.com>	2016-09-20 10:01:10 +01:00
Pekka Enberg	f1d0401ed2	main: Use proper logger for API server messages We have a "startlog" that we can use to print out API server messages. Message-Id: <1474358312-26510-1-git-send-email-penberg@scylladb.com>	2016-09-20 11:09:59 +03:00
Pekka Enberg	38b137713f	transport/server: Fix CQL v1 prepared statement execution The EXECUTE message encoding is different between CQL binary protocol versions v1 and v2 (and later). Fix process_execute() to deserialize the message as per the CQL binary protocol v1 specification: Executes a prepared query. The body of the message must be: <id><n><value_1>....<value_n><consistency> where: - <id> is the prepared query ID. It's the [short bytes] returned as a response to a PREPARE message. - <n> is a [short] indicating the number of following values. - <value_1>...<value_n> are the [bytes] to use for bound variables in the prepared query. - <consistency> is the [consistency] level for the operation. Fixes #1676 Message-Id: <1474287392-16792-1-git-send-email-penberg@scylladb.com>	2016-09-19 15:26:30 +03:00
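The v1 EXECUTE body layout quoted in this commit can be decoded with a short sketch. It is an illustration, not the `process_execute()` code: the helper names are hypothetical, and the primitive encodings are stated as assumptions based on the CQL binary protocol notation ([short] as a 2-byte big-endian unsigned integer, [short bytes] as a [short] length followed by that many bytes, [bytes] as a 4-byte signed length where a negative length means null, and [consistency] as a [short]).

```python
import struct


def read_short(buf, pos):
    """[short]: 2-byte big-endian unsigned integer."""
    (value,) = struct.unpack_from(">H", buf, pos)
    return value, pos + 2


def read_short_bytes(buf, pos):
    """[short bytes]: a [short] length n followed by n bytes."""
    length, pos = read_short(buf, pos)
    return bytes(buf[pos:pos + length]), pos + length


def read_bytes(buf, pos):
    """[bytes]: a 4-byte signed length n followed by n bytes; n < 0 is null."""
    (length,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if length < 0:
        return None, pos
    return bytes(buf[pos:pos + length]), pos + length


def parse_execute_v1(body):
    """Decode <id><n><value_1>....<value_n><consistency>."""
    pos = 0
    query_id, pos = read_short_bytes(body, pos)
    count, pos = read_short(body, pos)
    values = []
    for _ in range(count):
        value, pos = read_bytes(body, pos)
        values.append(value)
    consistency, pos = read_short(body, pos)
    return query_id, values, consistency


# A made-up body: 2-byte id, two values (one of them null), consistency 1.
body = (struct.pack(">H", 2) + b"\xab\xcd"
        + struct.pack(">H", 2)
        + struct.pack(">i", 1) + b"\x2a"
        + struct.pack(">i", -1)
        + struct.pack(">H", 1))
print(parse_execute_v1(body))
```

The point of the fix is visible in the field order: in v1 the consistency level comes last, after the bound values.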
Raphael S. Carvalho	0eaa0f46c9	sstables: store first and last decorated keys in sstable object leveled strategy uses heavily first and last decorated keys of a sstable to get overlapping sstables in a given level. By storing first and last decorated keys in sstable object, it's expected that performance of leveled strategy (not compaction) will be improved. We will set first and last keys in sstable when either loading or sealing it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0abca819454ab4c088541bb49714f1f6a7dc4f42.1473959677.git.raphaelsc@scylladb.com>	2016-09-19 13:25:58 +02:00
Raphael S. Carvalho	dffb41f9d8	sstables: remove schema parameter from some sstable methods schema can now be found in the sstable object itself. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0fa44fedbe784d924522d7eeca77c16294479c6e.1473959677.git.raphaelsc@scylladb.com>	2016-09-19 13:25:58 +02:00
Tomasz Grabiec	2282599394	tests: Add test for UUID type ordering Message-Id: <1473956716-5209-2-git-send-email-tgrabiec@scylladb.com>	2016-09-16 11:07:14 +01:00
Tomasz Grabiec	804fe50b7f	types: fix uuid_type_impl::less timeuuid_type_impl::compare_bytes is a "trichotomic" comparator (-1, 0, 1) while less() is a "less" comparator (false, true). The code incorrectly returns c1 instead of c1 < 0 which breaks the ordering. Fixes #1196. Message-Id: <1473956716-5209-1-git-send-email-tgrabiec@scylladb.com>	2016-09-16 11:06:55 +01:00
Duarte Nunes	bc3cbb7009	thrift: Correctly detect clustering range wrap around This patch uses the clustering bounds comparator to correctly detect wrap around of a clustering range in the thrift handler. Refs #1446 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1473938611-8590-1-git-send-email-duarte@scylladb.com>	2016-09-15 14:31:16 +01:00
Shlomi Livne	acb83073e2	ami: Fix instructions how to run scylla_io_setup on non ephemeral instances On instances different than i2/m3/c3 we provide instructions to run scylla_io_setup. Running scylla_io_setup requires access to /var/lib/scylla to create a temporary file. To gain access to that directory the user should run 'sudo scylla_io_setup'. refs: #1645 Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <4ce90ca1ba4da8f07cf8aa15e755675463a22933.1473935778.git.shlomi@scylladb.com>	2016-09-15 13:40:53 +03:00
Avi Kivity	d2b5f3ff44	Merge seastar upstream * seastar e534401...40a68fa (1): > rpc: fix dangling reference in read_rcv_buf	2016-09-15 12:20:49 +03:00
Gleb Natapov	2e8b255741	Merge seastar upstream * seastar 0303e0c...e534401 (6): > Merge "enable rpc to work on non contiguous memory for receive" from Gleb > install-dependencies.sh: install python3 for Ubuntu/Debian, which requires for configure.py > fix tcp stuck when output_stream write more than 212992 bytes once. > scripts/posix_net_conf.sh: supress 'ls: cannot access /sys/class/net/<NIC>/device/msi_irqs/' error message > scripts/posix_net_conf.sh: fix 'command not found' error when specifies --cpu-mask > native_network_stack: Fix use after free/missing wait in dhcp Includes: "Remove utils::fragmented_input_stream and utils::input_stream in favor of seastar version" from Gleb.	2016-09-15 12:12:16 +03:00
Tomasz Grabiec	ed312c2b1a	Merge remote-tracking branch 'duarte/comparator/v1' From Duarte: This patchset reuses the bound_view::comparator in range_tombstone to correctly detect wrap around of a clustering range. This fixes a manifestation of #1446 that results in wrong query results. Introduced by `b1f9688432` Fixes #1669 Refs #1446	2016-09-14 18:21:05 +02:00
Paweł Dziepak	bc2ff41003	cql3: fix units in large batch warning When displaying a warning about batch being too large C* reports batch size and limit in bytes while S* uses kB. This patch switches Scylla to use bytes. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1473867171-18932-1-git-send-email-pdziepak@scylladb.com>	2016-09-14 18:38:46 +03:00
Takuya ASADA	647673195c	dist/redhat/build_rpm.sh: add dependency for rpmbuild Install rpmbuild when it's not installed yet. Fixes #1651 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1473193430-14792-1-git-send-email-syuu@scylladb.com>	2016-09-14 14:57:55 +03:00
Calle Wilund	f126cf769a	column_family: Ensure flush() waits for all previous flushes + self Fixes #1577 Message-Id: <1472569952-4066-1-git-send-email-calle@scylladb.com>	2016-09-14 11:00:41 +01:00
Duarte Nunes	f864bca773	row_cache: Deal with side-effects in allocating_section In row_cache::make_reader, we update statistics inside an allocating_section, which retries the supplied function until it can satisfy all allocations by way of reserving LSA memory up front. Since those updates are interleaved with allocations, retries can lead to miscounts. This patch fixes this by updating statistics after all allocations. Fixes #1659 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1473845977-20205-1-git-send-email-duarte@scylladb.com>	2016-09-14 10:46:25 +01:00
Tomasz Grabiec	a498da1987	database: Ignore spaces in initial_token list Currently we get boost::lexical_cast on startup if initial_token has a list which contains spaces after commas, e.g.: initial_token: -1100081313741479381, -1104041856484663086, ... Fixes #1664. Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>	2016-09-14 11:58:13 +03:00
Paweł Dziepak	c220c676c8	types: honour end of sstring_view There are several places in types.cc where we assume that sstring_view range is null terminated. That may be not true and we should always use either begin()/end() or data()/size() pairs. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-09-07 14:30:56 -07:00
Paweł Dziepak	6373289532	Merge "Adding slow query API" from Amnon "This series adds an API for the slow query recording. After this series it will be possible to set the/get the slow query recording parameters."	2016-09-07 11:06:09 -07:00
Pekka Enberg	1095705a6b	Update scylla-ami submodule * dist/ami/files/scylla-ami 14c1666...e1e3919 (1): > scylla_ami_setup: remove scylla_cpuset_setup	2016-09-07 21:04:03 +03:00
Avi Kivity	7ac729b4d5	Merge "Optimize reads for clustered data" from Raphael "This will be very important for read performance of time series use case, where timestamp is usually stored as a clustering key, and the user asks for specific data using a clustering range filter. Example: CREATE TABLE temperature ( weatherstation_id text, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id,event_time) ); ... SELECT * FROM temperature WHERE weatherstation_id='1234ABCD' AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00'; This is based on: https://issues.apache.org/jira/browse/CASSANDRA-5514 To check correctness, I wrote a dtest that runs scylla with row cache disabled, creates several sstables with non overlapping clustering key ranges, queries data using several clustering range filters, and checks that the database returns the expected results. Tested performance with a tool I wrote myself [1] and performance is indeed improved by this patchset. This tool works as follows: Scylla is started with row cache disabled. That's wanted here because we're measuring a specific code that only gets executed if row cache misses the data we asked for. Then the Scylla node is populated with N sstables ('nodetool flush' is used to ensure it), where each will have M clustering keys, totaling N*M clustering keys. Finally, we will start asking for data using a clustering range filter. The tool measures throughput and min/max/avg latency. 
[1]: https://gist.github.com/raphaelsc/4c415f592aaed14a18be31279d225972 The results follow: BEFORE ----- ('Clustering keys / second: ', 747.9672111659951) ('Max latency (ms): ', 33) ('Min latency (ms): ', 12) ('Avg latency (ms): ', 13.0) The operation took 13.3695700169 seconds AFTER ----- ('Clustering keys / second: ', 3159.115303945648) ('Max latency (ms): ', 22) ('Min latency (ms): ', 2) ('Avg latency (ms): ', 3.0) The operation took 3.16544318199 seconds NOTE: Throughput and average latency are improved by a factor of ~4. -----"	2016-09-04 15:06:32 +03:00
Amnon Heiman	11c687dd93	API: Add slow query logging implementation This adds the implementation for the slow query logging API. After this patch the following will be available: curl -X GET "http://localhost:10000/storage_service/slow_query" curl -X POST "http://localhost:10000/storage_service/slow_query?enable=true&ttl=10&threshold=6000" Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-03 01:15:22 +03:00
Amnon Heiman	ed1d02b1a3	API: Add slow query API definition This adds the GET and POST api for slow query logging. The GET returns an object with the enable, ttl and threshold and the POST lets you configure each of them. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-09-03 01:15:15 +03:00
Raphael S. Carvalho	b9f67351da	db: expose clustering filter info via collectd That's needed to observe behavior of clustering filter, and to check if it's worthwhile for a specific workload. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:32:23 -03:00
Raphael S. Carvalho	a2dc88889d	db: enable clustering optimization only on dtcs Leveled strategy will not benefit from this strategy because there's only a few sstables that will contain a given partition key, which means that a clustering key that belongs to a specific partition key can only be in a few sstables as well. Date tiered strategy is the one that will actually benefit the most from this optimization. Size tiered may benefit from it too if clustering key isn't overwritten, but it will not use the clustering optimization. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:31:07 -03:00
Raphael S. Carvalho	8d03ccd604	sstables: optimize reads with clustering filter If user specifies a clustering filter, it's possible to filter out sstable based on its metadata that tracks min/max clustering value. For example, if sstable stores clustering key from 'a' through 'c', it's possible to filter out that sstable if user asks for data with clustering key greater than 'c'. That's done by comparing each component separately because clustering key may be composite. Further information can be found here: https://issues.apache.org/jira/browse/CASSANDRA-5514 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:51:50 -03:00
Raphael S. Carvalho	768aced741	partition_slice: introduce key-independent function to get ranges That will be important for sstable code that will rule out a sstable if it doesn't cover a given clustering key range. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:50:56 -03:00
Raphael S. Carvalho	dce61ddb02	types: introduce abstract_type::as_tri_comparator() That's akin to abstract_type::as_less_comparator's nature. So we don't have to repeat something like the following everywhere: auto cmp = [&type] (const bytes_view& b1, const bytes_view& b2) { return type->compare(b1, b2); } Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:50:53 -03:00
Raphael S. Carvalho	004617839d	database: check bloom filter of all sstables earlier All sstables will now have bloom filter checked in a single pass before reader iterate through all candidates. It's possible that we will need to futurize the procedure if it holds cpu for too long. This change is also a step towards the optimization that will rule out sstables based on clustering filter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:50:08 -03:00
Raphael S. Carvalho	2a426ab248	tests: add test to check tombstone metadata Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:35 -03:00
Raphael S. Carvalho	94c8ef39c3	sstables: store components ranges in sstable object Store range for each clustering component in sstable itself to optimize sstable filtering based on clustering key. If schema defines no clustering key, this new field will be empty. Each range stores min and max value of that specific component. With this information, it's possible to know if a sstable possibly stores a given clustering component. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:32 -03:00
Raphael S. Carvalho	026853fabb	tests: add test to check composite validity Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:30 -03:00
Raphael S. Carvalho	0a5af61176	sstables: introduce function to validate min max clustering values Scylla was generating a sstable with incorrect min max clustering values. This information is used to filter out a sstable when user asks for a range of clustering rows. So it's important to detect wrong metadata and make sure that it will not be used. The validation is fast and will only happen when loading a sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:28 -03:00
Raphael S. Carvalho	1f31223f32	sstables: store schema in sstable object That will be needed for optimization that will store decorated keys in the sstable object, and also for a subsequent work that will detect wrong metadata (min/max column names) by looking at columns in the schema. As schema is stored in sstable, there's no longer a need to store ks and cf names in it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 10:49:17 -03:00
Avi Kivity	7a140a306e	Revert "sstables: optimize selection of sstables for leveled strategy" This reverts commit c75b07fc34f0e7267a8e49276b96bbd4686cb78d; does not deduplicate the sstable list.	2016-09-01 18:34:08 +03:00
Raphael S. Carvalho	c75b07fc34	sstables: optimize selection of sstables for leveled strategy It's possible to copy sstables directly into vector, and that will improve performance. my benchmark tool[1] shows that new version reduces running time of copy procedure by factor of two after 1024^2 calls. Switching to back_inserter improves throughput even further. [1]: gist.github.com/raphaelsc/a4b27290f362cdecdef399770dda759c Refs #1632. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <7153514a9b5f5eb24dff518ee9fa3680e0881dae.1472741401.git.raphaelsc@scylladb.com>	2016-09-01 18:08:53 +03:00
Glauber Costa	dc5d8e33af	Revert "row_cache: update sstable histograms on cache hits" This reverts commit `1726b1d0cc`. Reverting this patch turns our SSTable access counter into a miss counter only. The estimated histogram always starts its first bucket at 1, so by marking cache accesses we will be wrongly feeding "1" into the buckets. Notice that this is not yet ideal: nodetool is supposed to show a histogram of all reads, and by doing this we are changing its meaning slightly. Workloads that serve mostly from cache will be distorted towards their misses. The real solution is to use a different histogram, but we will need to enforce a newer version of nodetool for that: the current issue is that nodetool expects an EstimatedHistogram in a specific format in the other side. Conflicts: row_cache.hh Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com> Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-01 18:07:31 +03:00
Avi Kivity	e33671c285	Merge "tracing: Trace read sstables" from Duarte "This patchset traces sstables we read from. To do that, we need to flow the trace_state_ptr to the mutation_readers."	2016-09-01 13:24:16 +03:00
Duarte Nunes	ba374da043	database: Trace sstable accesses This patch traces when we read from an sstable, be it a key range or a single one. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:32 +02:00
Duarte Nunes	f4cf2f2aef	tracing: Make trace_state_ptr argument required This patch makes the optional trace_state_ptr arguments introduced in previous patches mandatory where possible. Functions which are called internally don't have a trace context, so for those we keep the argument's default value for convenience. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:32 +02:00
Duarte Nunes	46b86ff801	storage_proxy: Pass along trace_state for queries This patch changes the storage_proxy so it passes along a trace_state_ptr to the layers below, when querying locally or receiving a remote query request. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:32 +02:00
Duarte Nunes	030db65c62	database: Accept a trace_state_ptr This patch changes the database and column_family types so a trace_state_ptr can be passed in when querying. This enables tracing of the inner components. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:28 +02:00
Duarte Nunes	9269256246	row_cache: Accept a trace_state_ptr This patch changes the row_cache so it accepts a trace_state_ptr, which it is responsible for passing to the underlying mutation_reader if needed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:00:55 +02:00
Duarte Nunes	5fd66f00c2	mutation_reader: Accept trace_state_ptr This patch changes the mutation_reader so it optionally accepts a trace_state_ptr. This will allow us to trace, for example, which sstables are accessed during a request. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:00:31 +02:00
Avi Kivity	cc127295e9	Merge "Fill in information for sstables per read histogram" from Glauber "Nodetool cfhistograms is supposed to tell us how many SSTables were touched per read. Currently, we are a bit in the dark as we don't export that information. This patch exports that, so that we can start using it."	2016-09-01 12:54:24 +03:00
Glauber Costa	1726b1d0cc	row_cache: update sstable histograms on cache hits If we have a cache hit, we still need to update our sstable histogram - noting that we have touched 0 SSTables. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:22 -04:00
Glauber Costa	ce24fd05fe	database: keep statistics on SSTables touched per read That is done for single partition queries only - mimicking what Cassandra does on that matter. For this to be correct, we also need to update this histogram on cache hits - in which case we update the read as having touched 0 SSTables. That will be done on a separate patch. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:21 -04:00
Glauber Costa	0f413695ac	database: make column family stats mutable The make_reader method is currently a const method, but we would like to start keeping hit statistics from it. Instead of relaxing the const condition too much, we can just mark the _stats field as mutable, indicating that make_reader will not be able to change anything in the CF, except for keeping statistics. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:24 -04:00
Glauber Costa	5c4d73577a	initialize sstables_per_read histogram with 35 instead of 90 buckets This is to match what Cassandra does. Nodetool may be expecting this on the other side. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:24 -04:00
Glauber Costa	4310635bae	move estimated histogram to utils Nothing sstable-specific in it, really. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:23 -04:00
Glauber Costa	ffc2131c51	decouple estimated_histogram from sstables There is nothing really that fundamentally ties the estimated histogram to sstables. This patch gets rid of the few incidental ties. They are: - the namespace name, which is now moved to utils. Users inside sstables/ now need to add a namespace prefix, while the ones outside have to change it to the right one - sstables::merge, which has a very non-descriptive name to begin with, is changed to a more descriptive name that can live inside utils/ - the disk_types.hh include has to be removed - but it had no reason to be here in the first place. Todo, is to actually move the file outside sstables/. That is done in a separate step for clarity. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:23 -04:00
Yoav Kleinberger	624165da79	scyllatop: dump all output to stdout instead of running a fancy console interface Sometimes the user would like to dump all the metrics into a file or pipe it to another program, as requested in issue #1506. This patch makes scyllatop check if stdout is connected to a TTY, and if not - it does not fire up the fancy urwid UI but instead, just writes all its collected metrics to stdout. Optionally, the user can tell the program to quit after a specific number of iterations via the -n or --iterations flag Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1471777516-9903-1-git-send-email-yoav@scylladb.com>	2016-08-31 08:31:36 +03:00
Paweł Dziepak	e981101fa9	Merge "Remove clustering_key_filtering_context" from Piotr "clustering_key_filtering_context is no longer needed. partition_slice can be used instead so this series removes clustering_key_filtering_context and passes partition_slice down where it's needed. Then a static get_ranges method is used to obtain clustering key ranges for a given partition. Fixes #1614."	2016-08-30 22:30:15 +01:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Piotr Jastrzebski	b05b90b3a5	Introduce clustering_key_filter_ranges. This fixes the problem of multiple concurrent get_ranges calls. Previously each call was invalidating the result of the previous call. Now they don't step on each other's toes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 19:46:38 +02:00
Duarte Nunes	39e0fb1260	storage_proxy: Support multiple partition ranges This patch adds the ability to query multiple partition ranges. This is needed since `55f2cf1626`, where we started unwrapping partition ranges in Thrift. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1472474594-15368-1-git-send-email-duarte@scylladb.com>	2016-08-30 17:43:40 +03:00
Takuya ASADA	533dc0485d	dist/common/scripts/scylla_sysconfig_setup: sync cpuset parameters with rps_cpus settings when posix_net_conf.sh is enabled and NIC is single queue In posix_net_conf.sh's single queue NIC mode (which means RPS enabled mode), cpu0 and its sibling are excluded from the network stack processing cpus, and the NIC IRQ is assigned to cpu0. So the network stack never runs on cpu0 and its sibling; to get better performance we need to exclude these cpus from scylla too. To do this, we need to get the RPS cpu mask from posix_net_conf.sh and pass it to scylla_cpuset_setup to construct /etc/scylla.d/cpuset.conf when scylla_setup is executed. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1472544875-2033-2-git-send-email-syuu@scylladb.com>	2016-08-30 16:51:16 +03:00
Takuya ASADA	0c3bb2ee63	dist/common/scripts/scylla_prepare: drop unnecessary multiqueue NIC detection code on scylla_prepare Right now scylla_prepare specifies the -mq option to posix_net_conf.sh when the number of RX queues > 1, but posix_net_conf.sh sets NIC mode to sq when queues < ncpus / 2. So the logic is different, and actually posix_net_conf.sh no longer needs -sq/-mq to be specified; it autodetects the queue mode. So we need to drop the detection logic from scylla_prepare and let posix_net_conf.sh detect it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1472544875-2033-1-git-send-email-syuu@scylladb.com>	2016-08-30 16:51:15 +03:00
Pekka Enberg	eff14bae0e	transport/server: Explicit CQL type IDs The CQL type IDs are specified as hex in the CQL binary protocol specification. Define CQL type IDs in the code explicitly to make reviewing the code and adding new types easier. Message-Id: <1472537971-26053-1-git-send-email-penberg@scylladb.com>	2016-08-30 09:45:26 +03:00
Avi Kivity	809d739ae8	Merge seastar upstream * seastar 2b07b1f...0303e0c (3): > scripts/posix_net_conf.sh: add support --cpu-mask mode > file: improve tmpfs support > file::close: remove trailing newline in log message	2016-08-29 13:26:04 +03:00
Pekka Enberg	2d3aee73a6	systemd: Don't start Scylla service until network is up Alexandr Porunov reports that Scylla fails to start up after reboot as follows: Aug 25 19:44:51 scylla1 scylla[637]: Exiting on unhandled exception of type 'std::system_error': Error system:99 (Cannot assign requested address) The problem is that because there's no dependency to network service, Scylla simply attempts to start up too soon in the boot sequence and fails. Fixes #1618. Message-Id: <1472212447-21445-1-git-send-email-penberg@scylladb.com>	2016-08-29 13:15:39 +03:00
Takuya ASADA	74d994f6a1	dist/common/scripts/scylla_setup: support enabling services on Ubuntu 15.10/16.04 Right now it ignores Ubuntu, but we share .service between Fedora/CentOS and Ubuntu >= 15.10, so support it. Fixes #1556. Message-Id: <1471932814-17347-1-git-send-email-syuu@scylladb.com>	2016-08-29 13:13:14 +03:00
Avi Kivity	fb3a83a811	Merge "Slow query logging" from Vlad "This series introduces a "slow query logging" feature that allows logging the queries that take more than a specified threshold time to complete. Once such a query is detected, it will be logged in a system_traces.node_slow_log table. In addition all traces for that query that have been collected on a Coordinator are going to be written as well. If the handling time on a replica in the context of a query takes more than (the same) threshold they are going to be written too. The row in a node_slow_log contains a session_id of a corresponding tracing session, thereby allowing the user to query the system_traces tables for the corresponding trace records. The schema of the node_slow_log table is as follows: CREATE TABLE system_traces.node_slow_log ( node_ip inet, shard int, session_id uuid, date timestamp, start_time timeuuid, command text, duration int, parameters map<text, text>, source_ip inet, table_names set<text>, username text, PRIMARY KEY (start_time, node_ip, shard)) WITH default_time_to_live = 86400 where - node_ip: IP of the coordinator Node. - shard: shard ID on a Coordinator where the query was handled. - session_id: ID of a corresponding tracing session. - date: a time when the query began. - start_time: a time-based UUID for this query (needed for a primary key mostly). - command: a query string. - duration: a time it took to handle this query (in microseconds). - parameters: a map of query parameters (like in system_traces.sessions). - source_ip: IP of a Client that sent this query. - table_names: a set of "<keyspace>.<table name>" strings representing column families used in this query. - username: a user name used for this query. The good thing is that most of the data we needed is already collected by the regular tracing framework. The only missing ones are a username and tables' names. So, this series makes the framework collect them too. 
The whole feature is integrated in the Tracing framework. The main changes to the framework that were made are as follows: - Store the constant capabilities of the tracing session in an enum_set, e.g.: - primary/secondary. - write on close. - Introduce two new capabilities to a tracing session of a specific query: - full tracing: collect all traces for this query (as it is before this series). - log slow query: log this query if its duration is above the threshold. These two capabilities may be defined independently. - Add the logic that handles the "log slow query"-only case: - Build the parameters<sstring, sstring> map only if the "duration" is above the given threshold. - The same about writing the trace entries. - In a not-only "log slow query" case: - Write the node_slow_log entry. - Extend the trace_info struct to pass slow query threshold and TTL to the replica Node. In addition to the above this series adds the capability to configure the slow query logging threshold and a TTL for the node_slow_log records. The heaviest patch in the series is the last one. The series contains a few cosmetic (renaming) patches that are meant to align the naming of the existing methods with the ones the last one is going to add."	2016-08-29 13:11:36 +03:00
Gleb Natapov	a2cdddb795	storage_proxy: forward mutation write with correct timeout value Now that mutation handler knows how much time is left for mutation write to be handled it can use this knowledge to set correct timeout for forwarded mutations. Message-Id: <20160828080637.GE9243@scylladb.com>	2016-08-29 13:06:36 +03:00
Avi Kivity	6cb796f38b	Merge seastar upstream * seastar ef063c5...2b07b1f (1): > file: make close() more robust against concurrent calls	2016-08-29 12:25:57 +03:00
Avi Kivity	f5f58b46c7	sstables: enable write-behind Write-behind allows a single sstable write to saturate the disk, improving throughput. Later we can take advantage of this to reduce the number of sstables being written concurrently.	2016-08-29 12:25:15 +03:00
Pekka Enberg	c5e5e7bb40	dist/docker: Clean up Scylla description for Docker image Message-Id: <1472145307-3399-1-git-send-email-penberg@scylladb.com>	2016-08-29 10:48:06 +03:00
Vlad Zolotarov	a491ac0f18	tracing: introduce a log_slow_query logic The main idea is to log queries that take "too long" to complete. The "too long" is above the given threshold. To achieve the above this patch does the following: - Introduce two new properties to the tracing::trace_state: - "Full tracing": when the tracing of this query was explicitly requested. In this state we will record all possible traces related to this query: both on the coordinator and on any replica involved. - "Log slow query": when slow query logging is enabled. If slow query logging is enabled and a session's "duration" is above the specified threshold we will create a record in the "slow queries log" and write all trace records created on the coordinator and on a replica if a replica's session lasts longer than that threshold. (We will propagate the Coordinator's slow query logging threshold to replicas in the context of a specific tracing/logging session). The properties above are independent, namely they may be enabled and/or disabled independently and any combination of them is legal (naturally, creating a tracing session when both states above are disabled makes no sense). - Instrument the tracing::tracing service to allow the following: - Enable/disable slow query logging. - Set/get the slow query duration threshold (in microseconds). - Set/get the slow query log record TTL value (in seconds). - Instrument the trace_keyspace_helper to write a slow query log entry when requested. - The slow query logging is disabled by default and the threshold is set to half a second. - The TTL of a slow log record is set to 86400 seconds by default. - It makes sense to use the same "slow query logging threshold" and a "slow query record TTL" both on a coordinator and on a replica Nodes in a context of the same tracing session: - Pass both TTL and a threshold to the replica in a trace_info. 
This patch also implements the new slow query logging specific logic: - Don't write the pending tracing records before the end of a tracing session until "duration" reaches the logging threshold. - Don't build the parameters<sstring, sstring> map unless we know we will write it to I/O. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-28 18:28:44 +03:00
Avi Kivity	e81c1df557	Merge seastar upstream * seastar 6fadd98...ef063c5 (2): > rpc: pass a timeout to a verb's server handler if the one was specified by a client > rpc: cleanup the old metaprogramming craft	2016-08-25 17:53:19 +03:00
Paweł Dziepak	6012a7e733	mutation_partition: fix iterator invalidation in trim_rows Reversed iterators are adaptors for 'normal' iterators. These underlying iterators point to different objects than the reversed iterators themselves. The consequence of this is that removing an element pointed to by a reversed iterator may invalidate a reversed iterator which points to a completely different object. This is what happens in trim_rows for reversed queries. Erasing a row can invalidate the end iterator and the loop would fail to stop. The solution is to introduce the reversal_traits::erase_dispose_and_update_end() function, which erases and disposes the object pointed to by a given iterator but also takes a reference to an end iterator and updates it if necessary to make sure that it stays valid. Fixes #1609. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>	2016-08-25 16:52:35 +03:00
Paweł Dziepak	5f84348ce1	test.py: add missing nonwrapping_range_test Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1472126087-15484-1-git-send-email-pdziepak@scylladb.com>	2016-08-25 15:36:10 +03:00
Piotr Jastrzebski	cda2e8f833	Remove stateless_clustering_key_filter_factory It can be easily replaced with partition_slice_clustering_key_filter_factory. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-25 08:53:31 +02:00
Piotr Jastrzebski	5bf8807f9b	Remove clustering_key_filtering_context::get_filter* These methods are not used any more so they can go away. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-25 08:53:31 +02:00
Piotr Jastrzebski	7c9de37ef9	Remove clustering_key_filtering_context::want_static_columns It's always true and clustering_key_filtering_context is going away so the first step is to get rid of this method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-25 08:53:31 +02:00
Raphael S. Carvalho	d8be32d93a	api: use estimation of pending tasks in compaction manager too We have API for getting pending compaction tasks both in column family and compaction manager. Column family is already returning pending tasks properly. Compaction manager's one is used by 'nodetool compactionstats', and was returning a value which doesn't reflect pending compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <a20b88938ad39e95f98bfd7f93e4d1666d1c6f95.1471641211.git.raphaelsc@scylladb.com>	2016-08-24 14:00:23 +03:00
Takuya ASADA	b9e02dad2e	dist/ami: install scylla metapackage on AMI Mistakenly we didn't install scylla metapackage on AMI, so install it. Fixes #1572 Message-Id: <1471977742-21984-1-git-send-email-syuu@scylladb.com>	2016-08-24 12:55:01 +03:00
Vlad Zolotarov	8609900621	tracing: introduce trace_state capabilities bit field - Instead of keeping separate booleans introduce a trace_state_props_set enum_set and pass it around instead of separate booleans. - Change the trace_info to hold this value in addition to write_on_close. Initialize a corresponding bit in an enum_set based on a write_on_close value in a trace_info constructor for a backward compatibility. - Separate a trace_state constructor into two: - For a primary session object. - For a secondary session object. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 18:34:36 +03:00
Amnon Heiman	2b98335da4	housekeeping: Silently ignore check version if Scylla is not available Normally, the check version should start and stop with the scylla-server service. If it fails to find scylla server, there is no need to check the version, nor to report it, so it can stop silently. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-23 18:08:59 +03:00
Amnon Heiman	4598674673	housekeeping: Use curl instead of Python's libraries There is a problem with Python SSL's in Ubuntu 14.04: ubuntu@ip-10-81-165-156:~$ /usr/lib/scylla/scylla-housekeeping -q version Traceback (most recent call last): File "/usr/lib/scylla/scylla-housekeeping", line 94, in <module> args.func(args) File "/usr/lib/scylla/scylla-housekeeping", line 71, in check_version latest_version = get_json_from_url(version_url + "?version=" + current_version)["version"] File "/usr/lib/scylla/scylla-housekeeping", line 50, in get_json_from_url response = urllib2.urlopen(req) File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen return _opener.open(url, data, timeout) File "/usr/lib/python2.7/urllib2.py", line 404, in open response = self._open(req, data) File "/usr/lib/python2.7/urllib2.py", line 422, in _open '_open', req) File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain result = func(*args) File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open return self.do_open(httplib.HTTPSConnection, req) File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open raise URLError(err) urllib2.URLError: <urlopen error [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure> Instead of using Python libraries to connect to the check version server, we will use curl for that. Fixes #1600 Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-23 18:07:05 +03:00
Amnon Heiman	91944b736e	housekeeping: Add curl as a dependency To work around an SSL problem with Python on Ubuntu 14.04, we need to use curl. Add it as a dependency so that it's available on the host. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-23 18:06:13 +03:00
Vlad Zolotarov	c8cf2ef82c	tracing::trace_state: introduce is_in_state() and set_state() accessors Use these new methods to manipulate trace_state::_state value. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	39b23cd084	tracing::trace_state: rename: get_write_on_close() -> write_on_close() Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	09624f704f	tracing::trace_state: rename: get_type() -> type() Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	b40a819d1e	tracing::trace_state: rename: get_session_id() -> session_id() Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	ed21398ce9	trace_keyspace_helper: create a system_traces.node_slow_log table This table is going to be used to store information about queries which are slower than a specified threshold. Also added a column caching and mutation creation functions Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	efeb62e72f	tracing: trace_keyspace_helper: introduce a check_column_definition() helper function Checks if a given column definition exists and has a requested type. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	6c3e1935b0	tracing::session_record: change a type of a "ttl" field to be std::chrono::seconds TTL is always defined in seconds - make its type explicitly reflect that. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	e017533229	tracing: set a username session parameter Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	93c2502be4	tracing: set a table_name in a BATCH query Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	25be28bb3c	tracing: set a table_name parameter in a SELECT statement Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	abae3b05e7	tracing: set table_name in a modification statement Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	372da7e71b	tracing: add support for setting a username and a table name parameters - "username" is a name used in the authentication process. - "table name" is a <keyspace>.<cf name> string representing a name of a table used for a query in question. Note that there may be more than one table name in a batch query. Therefore we store an unordered set of tables names. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:42 +03:00
Vlad Zolotarov	eaf5db66a8	tracing::session_record: store "parameters" data in an std::map instead of in an unordered_map Avoid sorting (and creating a new one) container at a backend code when a sorted container is needed. The overhead for the backends where it's not needed is minimal since the size of the map is very small. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 17:58:10 +03:00
Takuya ASADA	80f7449095	dist/ubuntu: support scylla-housekeeping service on all Ubuntu versions Current scylla-housekeeping support on Ubuntu has a bug: it does not install .service/.timer for Ubuntu 16.04. So fix it to make it work. Fixes #1502 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Tested-by: Amos Kong <amos@scylladb.com> Message-Id: <1471607903-14889-1-git-send-email-syuu@scylladb.com>	2016-08-23 13:49:44 +03:00
Takuya ASADA	aac60082ae	dist/common/systemd: don't use .in for scylla-housekeeping.*, since these are not template files .in is the suffix for template files which require rewriting at build time, but these systemd unit files do not require rewriting, so don't name them .in; reference them directly from .spec. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1471607533-3821-1-git-send-email-syuu@scylladb.com>	2016-08-23 13:49:09 +03:00
Duarte Nunes	440c1b2189	thrift: Avoid always recording size estimates Size estimates for a particular column family are recorded every 5 minutes. However, when a user calls the describe_splits(_ex) verbs, they may want to see estimates for a recently created and updated column family; this is legitimate and common in testing. However, a client may also call describe_splits(_ex) very frequently and recording the estimates on every call is wasteful and, worse, can cause clients to give up. This patch fixes this by only recording estimates if the first attempt to query them produces no results. Refs #1139 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1471900595-4715-1-git-send-email-duarte@scylladb.com>	2016-08-23 13:08:25 +03:00
Takuya ASADA	383148de13	dist/common/scripts/scylla_bootparam_setup: fix failing setup hugepages= variable on boot parameter This was caused by mistakenly dropping the sourcing of the sysconfig file, so source it again. Fixes #1599 Message-Id: <1471943742-19684-1-git-send-email-syuu@scylladb.com>	2016-08-23 12:41:39 +03:00
Takuya ASADA	1ad578ecf1	dist/common/scripts/scylla_bootparam_setup: use distribution standard grub.cfg update command on Ubuntu Result is almost same, but let's do it in ubuntu/debian flavor. Message-Id: <1471943898-24490-1-git-send-email-syuu@scylladb.com>	2016-08-23 12:41:34 +03:00
Paweł Dziepak	5feed84e32	sstables: do not call consume_end_partition() after proceed::no After state_processor().process_state() returns proceed::no the upper layer should have a chance to act before more data is pushed to the consumer. This means that in case of proceed::no verify_end_state() should not be called immediately since it may invoke consume_end_partition(). Fixes #1605. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1471943032-7290-1-git-send-email-pdziepak@scylladb.com>	2016-08-23 12:24:39 +03:00
Duarte Nunes	ee2694e27d	cql3: Consider bound type when detecting wrap around This patch uses the clustering bounds comparator to correctly detect wrap around of a clustering range. This fixes a manifestation of #1446, introduced by `b1f9688432`, where a query such as select * from cf where k = 0x00 and c0 = 0x02 and c1 > 0x02 would result in a range containing a clustering key and a prefix, incorrectly ordered by the prefix equality or lexicographical comparators. Refs #1446 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-22 17:55:34 +02:00
Duarte Nunes	084b931457	bounds_view: Create from nonwrapping_range Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-22 17:52:36 +02:00
Duarte Nunes	878927d9d2	range_tombstone: Extract out bounds_view This patch extracts bounds_view from range_tombstone so its comparator can be reused elsewhere. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-22 17:52:36 +02:00
Pekka Enberg	9d1d8baf37	dist/docker: Separate supervisord config files Move scylla-server and scylla-jmx supervisord config files to separate files and make the main supervisord.conf scan /etc/supervisord.conf.d/ directory. This makes it easier for people to extend the Docker image and add their own services. Message-Id: <1471588406-25444-1-git-send-email-penberg@scylladb.com>	2016-08-22 17:20:23 +03:00
Avi Kivity	55f2cf1626	thrift: do not generate wrapping ring_position ranges As part of the move to unwrap ranges, don't generate wrapping ranges from thrift. A little extra motivation is to avoid the need for the solution to #1573 to be able to handle wrapping ranges. This patch may also be fixing a bug in that the range (token, token] was previously translated as (-inf, +inf), while now it is translated as {(token, +inf), (-inf, token]}; the new translation respects ordering better. Reviewed-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1471869587-12972-1-git-send-email-avi@scylladb.com>	2016-08-22 14:43:45 +01:00
Avi Kivity	46caff5b06	Merge "Store frozen_mutation in fragmented buffer" from Paweł "This series switches frozen_mutations and to use bytes_ostream internally so that the size of a single allocation is bounded. Deserializers are also enhanced so that they can cope with reading from fragmented buffers. The goal of the change is to reduce memory pressure in case of large partitions. Performance as measured by perf_simple_query (median of 30). before after diff read 705270.74 702906.35 -0.3% write 814504.81 836462.33 +2.7% Refs #1440. Refs #1545. Fixes #1546."	2016-08-22 13:01:34 +03:00
Paweł Dziepak	1315090bf0	query-result: no need to linearize buffer any more Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	3fe5ed3cd9	query: use result_view::consume() where appropriate Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	cb2a557cf7	query::result: reduce chunk count Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	ea3ac0a270	frozen_mutation: reduce chunk count in constructor Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	1daf4c73a3	frozen_mutation: avoid buffer linearization and copy Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	3707d7fec3	frozen_mutation: use bytes_ostream internally Unlike bytes, bytes_ostream supports fragmented buffers, thus reducing the pressure on the memory allocator caused by large frozen partitions. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	c0425b63ff	frozen_mutation: add mutation_view() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	89f7b46f61	idl: switch to utils::input_stream Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	dcf794b04d	idl: make bytes compatible with bytes_ostream This patch makes idl type "bytes" compatible with both bytes and bytes_ostream. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	2c5ec44281	atomic_cell: add overloads taking const bytes& Deserialization code is going to use a proxy object that will be cast to either bytes or bytes_ostream depending on the demand. It cannot be cast directly to bytes_view though as it won't extend the lifetime of the buffer appropriately. The simplest solution is just to add overloads that accept const bytes&. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	387434a76c	bytes_ostream: add reduce_chunk_count() Deserialization code has now two variants. The faster one can be used only when the source buffer is not fragmented. reduce_chunk_count() aims to increase number of cases when the fast path can be used. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	7d4b7fd5fc	bytes_ostream: add equality operator Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	222bde7e6f	bytes_ostream: introduce upper bound on chunk size This patch makes append() and write() limit the maximum size of a single allocation to bytes_ostream::max_chunk_size. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	0ee98ea4c4	tests: add fragmented input stream test Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	e76203c927	utils: add input_stream input_stream performs a type erasure on seastar::simple_input_stream and fragmented_input_stream. The main goal is to keep the overhead for the cases when simple_input_stream is used minimum. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Paweł Dziepak	29827a9726	utils: add fragmented_input_stream fragmented_input_stream is an input stream usable by IDL-generated deserializers which can read from fragmented buffers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
Asias He	4ffd867ad0	gossip: Add log when cluster or partitioner mismatch This makes it easier for the user to figure out the configuration error. The log looks like: WARN 2016-08-22 15:04:56,214 [shard 0] gossip - ClusterName mismatch from 127.0.0.2 test2!=test WARN 2016-08-22 15:06:16,106 [shard 0] gossip - Partitioner mismatch from 127.0.0.2 org.apache.cassandra.dht.RandomPartitioner!=org.apache.cassandra.dht.Murmur3Partitioner Fixes: #1587 Message-Id: <745ed8857da6f70745735b94eef7b226d2f22e10.1471849834.git.asias@scylladb.com>	2016-08-22 11:06:31 +03:00
Raphael S. Carvalho	77d4cd21d7	sstables: Fix estimation of pending tasks for leveled strategy There were two underflow bugs. 1) in variable i, causing get_level() to see an invalid level and throw an exception as a result. 2) when estimating number of pending tasks for a level. Fixes #1603. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <cce993863d9de4d1f49b3aabe981c475700595fc.1471636164.git.raphaelsc@scylladb.com>	2016-08-22 10:37:15 +03:00
Vlad Zolotarov	92921fe110	tracing::trace_state: push the UUID to the end of an error message in a destructor Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1471780783-25406-1-git-send-email-vladz@cloudius-systems.com>	2016-08-21 16:50:52 +03:00
Vlad Zolotarov	0683d4bd29	tracing::trace_state: don't throw in a destructor The condition in question is a sanity check for a SW bug. This SW bug (if it occurs) is not critical - there is an additional protection against it in the stop_foreground_and_write(). Having said all that, since we shall not throw from a destructor, replace throwing of a std::logic_error with a logger error message. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1471773320-7398-1-git-send-email-vladz@cloudius-systems.com>	2016-08-21 13:50:52 +03:00
Avi Kivity	2dcbbf9006	Merge seastar upstream * seastar 4254111...6fadd98 (6): > simple_input_stream: remove explicit copy constructor > simple_stream: add [[gnu::always_inline]] > perf_fstream: don't initialize variable-size array inline > rpc: annotate template function calls with "template" > iotune: avoid "auto" parameters > file: ext4 support	2016-08-21 13:38:43 +03:00
Paweł Dziepak	e60bb83688	sstables: optimise clustering rows filtering Clustering rows in the sstables are sorted in the ascending order so we can use that to minimise number of comparisons when checking if a row is in the requested range. Refs #1544. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Reviewed-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <1471608921-30818-1-git-send-email-pdziepak@scylladb.com>	2016-08-19 18:11:11 +03:00
Pekka Enberg	2bf5e8de6e	dist/docker: Use Scylla mascot as the logo Glauber "eagle eyes" Costa pointed out that the Scylla logo used in our Docker image documentation looks broken because it's missing the Scylla text. Fix the problem by using the Scylla mascot instead. Message-Id: <1471525154-2800-1-git-send-email-penberg@scylladb.com>	2016-08-19 12:50:02 +03:00
Pekka Enberg	4d90e1b4d4	dist/docker: Fix bug tracker URL in the documentation The bug tracker URL in our Docker image documentation is not clickable because the URL Markdown extracts automatically is broken. Fix that and add some more links on how to get help and report issues. Message-Id: <1471524880-2501-1-git-send-email-penberg@scylladb.com>	2016-08-19 12:49:52 +03:00
Yoav Kleinberger	25fb5e831e	docker: extend supervisor capabilities allow user to use the `supervisorctl' program to start and stop services. `exec` needed to be added to the scylla and scylla-jmx starter scripts - otherwise supervisord loses track of the actual process we want to manage. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1471442960-110914-1-git-send-email-yoav@scylladb.com>	2016-08-18 15:08:11 +03:00
Avi Kivity	d33d958239	Merge "tracing: cleanups" from Vlad "This series includes the time stamp representation changes Avi asked for. In addition it fixes the session "duration" semantics to be the time it took to satisfy the user's request and not the time it took to achieve the complete replication factor."	2016-08-18 14:36:19 +03:00
Pekka Enberg	1553bec57a	dist/docker: Documentation cleanups - Fix invisible characters to be space so that Markdown to PDF conversion works. - Fix formatting of examples to be consistent. - Spellcheck. Message-Id: <1471514924-29361-1-git-send-email-penberg@scylladb.com>	2016-08-18 13:09:37 +03:00
Pekka Enberg	4ca260a526	dist/docker: Document image command line options This patch documents all the command line options Scylla's Docker image supports. Message-Id: <1471513755-27518-1-git-send-email-penberg@scylladb.com>	2016-08-18 13:01:58 +03:00
Avi Kivity	42094524e7	Merge seastar upstream * seastar ab29b12...4254111 (3): > file: fix size() return type > build: adjust -fsanitize=vptr broken warning > thread_impl.hh: add missing include	2016-08-18 10:34:58 +03:00
Avi Kivity	d0308ff488	Merge seastar upstream * seastar 81df893...ab29b12 (1): > core: Fix bug in make_file_impl() which affects directory scanning	2016-08-17 21:57:03 +03:00
Amos Kong	9d53305475	systemd: have the first housekeeping check right after start Issue: https://github.com/scylladb/scylla/issues/1594 Currently systemd run first housekeeping check at the end of first timer period. We expected it to be run right after start. This patch makes systemd to be consistent with upstart. Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <4cc880d509b0a7b283278122a70856e21e5f1649.1471433388.git.amos@scylladb.com>	2016-08-17 16:02:00 +03:00
Avi Kivity	0033197ba9	Merge seastar upstream * seastar 823a404...81df893 (3): > memory: Do not increase g_allocs on failure in allocate and allocate_aligned > memory: Balance the g_frees and g_allocs > Merge "thread: explicitly yield on get()" from Glauber Fixes #1586.	2016-08-17 13:28:30 +03:00
Avi Kivity	4871b19337	Merge "Fixes for streamed_mutation_from_mutation" from Paweł "This series contains fixes for two memory leaks in streamed_mutation_from_mutation. Fixes #1557."	2016-08-17 13:24:22 +03:00
Avi Kivity	e7eb76fc58	Introduce stdx.hh header file So we don't have to create an stdx = std::experimental alias everywhere. Message-Id: <1471417039-21391-1-git-send-email-avi@scylladb.com>	2016-08-17 11:19:49 +01:00
Paweł Dziepak	148e9c5608	streamed_mutation_from_mutation: fix destroying bi::sets Once unlink_leftmost_without_rebalance() has been called on a bi::set no other method can be used. This includes clear_and_disposed() used by the mutation_partition destructor. We like unlink_leftmost_without_rebalance() because it is efficient, so the solution is to manually finish destroying clustering row and range tombstone sets in the reader destructor using that function. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-17 11:03:59 +01:00
Paweł Dziepak	fe9575d01d	streamed_mutation_from_mutation: fix leak on allocation failure mutation_fragment() constructor allocates memory. If it fails the already unlinked parts of mutation (either rows_entry or range_tombstone) will be leaked. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-17 11:02:24 +01:00
Benoit Canet	90ef150ee9	systemd: Remove WorkingDirectory directive The WorkingDirectory directive does not support environment variables on systemd version that is shipped with Ubuntu 16.04. Fortunately, not setting WorkingDirectory implicitly sets it to user home directory, which is the same thing (i.e. /var/lib/scylla). Fixes #1319 Signed-of-by: Benoit Canet <benoit@scylladb.com> Message-Id: <1470053876-1019-1-git-send-email-benoit@scylladb.com>	2016-08-17 12:34:11 +03:00
Raphael S. Carvalho	108fd1fade	database: close file in lister After listing is done, let's close file. This fixes no bug. It's only an improvement. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <2f52d297bcf6a6b6e3429912c28f17e6b37f8842.1471381607.git.raphaelsc@scylladb.com>	2016-08-17 11:01:44 +03:00
Glauber Costa	b361dee488	database: memtables pending flushes tell us nothing We have two counters that track how many memtable flushes are in progress, and how much memory they are pinning. The problem is, after we have revamped the code to limit the amount of flushes in progress, those counters became useless: as they live inside the semaphore side, they will only be incremented once we have passed the semaphore. One wouldn't notice if working with CPU-bound problems, where memtables don't pile up. But as soon as they do, those counters will always show the same numbers: the depth of the semaphore, which doesn't mean much. The problem is poised to become much worse: once we enable write behind in full and set the semaphore's depth to one, that's the number we'll see here all the time. The fix is to move the counters outside the semaphore, which will bring back their old semantics. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>	2016-08-17 10:54:15 +03:00
Piotr Jastrzebski	bb0c4c3c40	Fix compilation errors query::range parameter in mutation_partition::range has to be changed to nonwrapping_range. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <36e444bfe90586f8d3b08ca36d8dc13d5898ef97.1471347402.git.piotr@scylladb.com>	2016-08-16 12:49:54 +01:00
Vlad Zolotarov	37da6f53f8	tracing: fix a session "duration" semantics A session's "duration" should be a time it took to handle a request, which is a time till response to a user. In other words - till a consistency level is reached. Before this patch it was the time it takes to handle a request completely, which is the time it takes to handle all replicas and not only those required to reach a CL. This patch fixes this situation by extending the trace_state's state values to 3 states: inactive, foreground and background. A primary session may be in 3 states: - "inactive": between the creation and a begin() call. - "foreground": after a begin() call and before a stop_foreground_and_write() call. - "background": after a stop_foreground_and_write() call and till the state object is destroyed. - Traces are not allowed while state is in an "inactive" state. - The time the primary session was in a "foreground" state is the time reported as a session's "duration". - Traces that have arrived during the "background" state will be recorded as usual but their "elapsed" time will be greater or equal to the session's "duration". Secondary sessions may only be in an "inactive" or in a "foreground" states. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-16 12:32:34 +03:00
Vlad Zolotarov	f83e33fc13	tracing: make "elapsed" be std::chrono::duration - Define an tracing::elapsed_clock type (std::chrono::steady_clock). Use it instead of trace_state::clock_type. - Store the "elapsed" information in a form of elapsed_clock::duration. - Make all keyspace_backend specific conversions inside the trace_keyspace_helper class, where they belong. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-16 12:32:34 +03:00
Vlad Zolotarov	ebf13da9c9	tracing::session_record: make start_at to be a time_point Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-16 12:32:34 +03:00
Vlad Zolotarov	5e24f6f442	trace_keyspace_helper::apply_events_mutation(): Avoid extra std::move of std::deque events_records are promised to be kept alive till the future returned by apply_events_mutation() resolves: it's dowithificated by a caller already. In addition, since it's passed by a reference, it's a logical thing to demand it to be kept alive by a caller till the future above resolves. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-16 12:32:34 +03:00
Avi Kivity	bf02ca831d	Merge "Tracing: change a back pressure scheme" from Vlad "This series changes the tracing back pressure scheme from limiting the amount of traces in a single session by a fixed number to have a per-shard budget consumed by all active tracing sessions. It was really easy to cause the traces to be dropped even if there weren't too many active traces: e.g. if there was a single active session which creates more traces than a per-session limit (30) the traces above the 30th were going to be dropped. Namely traces were dropped when there were only 30 active traces, which is ridiculous. This series introduces two main changes: - Changes the records budgeting from being per-session to be per-shard. This substantially increases the amount of active records after which new records are going to be dropped. - Introduces a flow when events' records are written BEFORE the corresponding tracing session is over (right now traces are written to I/O back end only when the session object is destroyed). The latter is meant to virtually eliminate the traces drops in normal situations at all. Of course, if a back end is slow or if there are a lot of small sessions that do not complete we would still have to drop new sessions/records in order to avoid uncontrolled growth of a memory footprint of Tracing. If we see the latter case happening a lot in the future we may add lowres timers to each session that would commit the cached records for writing every X time. But let's not try to optimize something that we are not completely sure has to be optimized... "	2016-08-16 12:21:02 +03:00
Amnon Heiman	0706db9387	API: use the estimated sum when converting histogram to json The function that converts a histogram to the json histogram object needs to use the estimated_sum to get the actual sum and not the sampled sum. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1467547341-30438-3-git-send-email-amnon@scylladb.com>	2016-08-16 11:06:51 +03:00
Amnon Heiman	4c14b2a527	histogram: Add an estimated sum method The histogram implementation uses sampling to estimate the mean and sum. This patch adds a method that returns an estimated sum based on the mean and the total number of events measured. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1467547341-30438-2-git-send-email-amnon@scylladb.com>	2016-08-16 11:06:50 +03:00
Pekka Enberg	fde8677f1a	cql3/query_processor: Clean up code formatting Currently, query_processor.cc code formatting is all over the place, which makes the file hard to read. Apply some formatting magic to make it prettier. Message-Id: <1470832486-26020-2-git-send-email-penberg@scylladb.com>	2016-08-16 10:39:15 +03:00
Pekka Enberg	ce07822f49	cql3/query_processor: Use type deduction to make code more readable Use the 'auto' specifier for variables and lambda parameters to make the code more readable. Message-Id: <1470832486-26020-1-git-send-email-penberg@scylladb.com>	2016-08-16 10:39:11 +03:00
Avi Kivity	b1f9688432	Merge "range: Add nonwrapping_range" from Duarte "Ranges that wrap around are a source of complexity and bugs. This patchset adds a nonwrapping_range class, which specifies the range can't wrap around. It is the user of the nonwrapping_range that is required to enforce this constraint. The idea is to incrementally disallow ranges that wrap around. We do it for query::clustering_range in this patchset, and it can be done similarly for other ranges. This moves the burden of unwrapping ranges to the edges. Fixes #1544"	2016-08-16 10:08:24 +03:00
Duarte Nunes	5161ea283f	query: query::clustering_range can't wrap around This patch changes the type of query::clustering_range to express that ranges that wrap around are not allowed, and ranges that have the start bound after the end bound are considered empty. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:50:20 +00:00
Duarte Nunes	3275fabe53	storage_proxy: Short circuit query without clustering ranges This patch makes the storage_proxy return an empty result when the query doesn't define any clustering ranges (default or specific). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:48:57 +00:00
Duarte Nunes	56f10abce3	thrift: Don't always validate clustering range This patch makes make_clustering_range not enforce that the range be non-wrapping, so that it can be validated differently if needed. A make_clustering_range_and_validate function is introduced that keeps the old behavior. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:48:57 +00:00
Duarte Nunes	be4adf212a	nonwrapping_range: Add unit tests Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:48:57 +00:00
Duarte Nunes	bb16e194bc	range: Add nonwrapping_range class This patch introduces the nonwrapping_range class. This class is intended to be used by code that requires non wrapping ranges. Internally, it uses a wrapping_range. Users are responsible for ensuring the bounds are correct when creating a nonwrapping_range. The path proposed here is to incrementally replace usages of wrapping_range/range by nonwrapping_range, pushing usages of wrapping ranges as further to the edges as possible. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:48:57 +00:00
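The distinction drawn in this patchset can be sketched roughly as below. This is an illustrative model only: the types `wrapping_interval` and `nonwrapping_interval`, the integer bounds, and the `unwrap` helper are hypothetical and do not reproduce Scylla's range classes. It assumes closed intervals over a ring of integers `[min_value, max_value]`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical closed interval on a ring of ints (not Scylla's wrapping_range).
struct wrapping_interval {
    int start;
    int end;
    // The interval wraps around when its start bound is after its end bound.
    bool is_wrap_around() const { return start > end; }
    bool contains(int x) const {
        return is_wrap_around() ? (x >= start || x <= end)
                                : (x >= start && x <= end);
    }
};

// Hypothetical non-wrapping interval: start after end means "empty",
// never "wraps around".
struct nonwrapping_interval {
    int start;
    int end;
    bool empty() const { return start > end; }
    bool contains(int x) const { return x >= start && x <= end; }
};

// Unwrapping "at the edge": a wrapping interval is split into at most two
// non-wrapping intervals before it is handed to code that forbids wrapping.
std::vector<nonwrapping_interval> unwrap(const wrapping_interval& r,
                                         int min_value, int max_value) {
    if (!r.is_wrap_around()) {
        return {nonwrapping_interval{r.start, r.end}};
    }
    return {nonwrapping_interval{min_value, r.end},
            nonwrapping_interval{r.start, max_value}};
}
```

The point of the sketch is that `contains` on the non-wrapping type needs no special case, which is the complexity the patchset aims to push out to the callers that still deal with wrapping ranges.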
Duarte Nunes	2bb428973a	range: Rename to wrapping_range This patch renames range to wrapping_range in preparation for adding a new range type, nonwrapping_range. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:48:57 +00:00
Duarte Nunes	2c0b049176	clustering_key_filter: Don't forward declare range Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-15 14:48:57 +00:00
Paweł Dziepak	5cae44114f	partition_version: handle errors during version merge Currently, partition snapshot destructor can throw which is a big no-no. The solution is to ignore the exception and leave versions unmerged and hope that subsequent reads will succeed at merging. However, another problem is that the merge doesn't use allocating sections which means that memory won't be reclaimed to satisfy its needs. If the cache is full this may result in partition versions not being merged for a very long time. This patch introduces partition_snapshot::merge_partition_versions() which contains all the version merging logic that was previously present in the snapshot destructor. This function may throw so that it can be used with allocating sections. The actual merging and handling of potential errors is done from partition_snapshot_reader destructor. It tries to merge versions under the allocating section. Only if that fails it gives up and leaves them unmerged. Fixes #1578 Fixes #1579. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1471265544-23579-1-git-send-email-pdziepak@scylladb.com>	2016-08-15 15:56:53 +03:00
Asias He	ef782f0335	gossip: Add heart_beat_version to collectd $ tools/scyllatop/scyllatop.py 'gossip' node-1/gossip-0/gauge-heart_beat_version 1.0 node-2/gossip-0/gauge-heart_beat_version 1.0 node-3/gossip-0/gauge-heart_beat_version 1.0 Gossip heart beat version changes every second. If everything is working correctly, the gauge-heart_beat_version output should be 1.0. If not, the gauge-heart_beat_version output should be less than 1.0. Message-Id: <cbdaa1397cdbcd0dc6a67987f8af8038fd9b2d08.1470712861.git.asias@scylladb.com>	2016-08-15 12:32:00 +03:00
Nadav Har'El	0d00da7f7f	sstables: don't forget to read static row [v2: fix check for static column (don't check if the schema is not compound) and move want-static-columns flag inside the filtering context to avoid changing all the callers.] When a CQL request asks to read only a range of clustering keys inside a partition, we actually need to read not just these clustering rows, but also the static columns and add them to the response (as explained by Tomek in issue #1568). With the current code, that CQL request is translated into an sstable::read_row() with a clustering-key filter. But this currently only reads the requested clustering keys - NOT the static columns. We don't want sstable::read_row() to unconditionally read the static columns from disk because if, for example, they are already cached, we might not want to read them from disk. We don't have such partial-partition cache yet, but we are likely to have one in the future. This patch adds in the clustering key filter object a flag of whether we need to read the static columns (actually, it's a function, returning this flag per partition, to match the API for the clustering-key filtering). When sstable::read_row() sees the flag for this partition is true, it also requests to read the static columns. Currently, the code always passes "true" for this flag - because we don't have the logic to cache partially-read partitions. The current find_disk_ranges() code does not yet support returning a non-contiguous byte range, so this patch, if it notices that this partition really has static columns in addition to the range it needs to read, falls back to reading the entire partition. This is a correct solution (and fixes #1568) but not the most efficient solution. Because static columns are relatively rare, let's start with this solution (correct but less efficient when there are static columns) and providing the non-contiguous reading support is left as a FIXME.
Fixes #1568 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1471124536-19471-1-git-send-email-nyh@scylladb.com>	2016-08-15 12:30:19 +03:00
Avi Kivity	4fcebd4ca6	random_partitioner: fix overflow in shard_of() uint128_t will overflow if smp::count > 2. Replace with a larger type. Message-Id: <1471188765-30142-1-git-send-email-avi@scylladb.com>	2016-08-15 09:41:54 +03:00
Amnon Heiman	612f677283	scylla.spec: conditionally include the housekeeping.cfg in the conf package When the housekeeping configuration name was changed from conf to cfg it was no longer included as part of the conf rpm. This change adds a macro that determines whether the file should be included and uses that macro to conditionally add the configuration file to the rpm. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1471169042-19099-1-git-send-email-amnon@scylladb.com>	2016-08-14 13:25:59 +03:00
Tomasz Grabiec	1b2ea14d0e	partition_version: Add missing linearization context Snapshot removal merges partitions, and cell merging must be done inside linearization context. Fixes #1574 Reviewed-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1471010625-18019-1-git-send-email-tgrabiec@scylladb.com>	2016-08-12 17:55:23 +03:00
Piotr Jastrzebski	f212a6cfcb	Fix after free access bug in storage proxy Due to speculative reads we can't guarantee that all fibers started by storage_proxy::query will be finished by the time the method returns a result. We need to make sure that no parameter passed to this method ever changes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <31952e323e599905814b7f378aafdf779f7072b8.1471005642.git.piotr@scylladb.com>	2016-08-12 16:34:43 +02:00
Duarte Nunes	918a2939ff	docker: If set, broadcast address is seed This patch configures the broadcast address to be the seed if it is configured, otherwise Scylla complains about it and aborts. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470863058-1011-1-git-send-email-duarte@scylladb.com>	2016-08-12 11:46:50 +03:00
Avi Kivity	21392cf5fd	Merge seastar upstream * seastar 7fd8d49...823a404 (1): > io_priority_class: remove non-explicit operator unsigned	2016-08-11 17:20:23 +03:00
Avi Kivity	65aa9135a1	Merge seastar upstream * seastar 59613e7...7fd8d49 (1): > reactor: Do not test for poll mode default	2016-08-11 14:46:45 +03:00
Amnon Heiman	5a4fc9c503	scylla-housekeeping: rename configuration file from conf to cfg Files with a conf extension are run by the scylla_prepare on the AMI. The scylla-housekeeping configuration file is not a bash script and should not be run. This patch changes its extension to cfg which is more python like. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1470896759-22651-2-git-send-email-amnon@scylladb.com>	2016-08-11 14:44:56 +03:00
Tomasz Grabiec	f1c2481040	sstables: Fix bug in promoted index generation maybe_flush_pi_block, which is called for each cell, assumes that block_first_colname will be empty when the first cell is encountered for each partition. This didn't hold after writing a partition which generated no index entry, because block_first_colname was cleared only when there was any data written into the promoted index. Fix by always clearing the name. The effect was that the promoted index entry for the next partition would be flushed sooner than necessary (still counting since the start of the previous partition) and with offset pointing to the start of the current partition. This will cause a parsing error when such an sstable is read through the promoted index entry because the offset is assumed to point to a cell, not to the partition start. Fixes #1567 Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com>	2016-08-11 13:08:48 +03:00
Amnon Heiman	a24941cc5f	build_deb: Add dist flag The dist flag marks the debian package as a distributed package. As such, the housekeeping configuration file will be included in the package and will not need to be created by scylla_setup. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1470907208-502-2-git-send-email-amnon@scylladb.com>	2016-08-11 12:25:07 +03:00
Pekka Enberg	d1a052237d	dist/docker: Fix typo in "--overprovisioned" help text Reported by Mathias Bogaert (@analytically). Message-Id: <1470904395-4614-1-git-send-email-penberg@scylladb.com>	2016-08-11 11:38:03 +03:00
Nadav Har'El	7409688356	README.md: add another required package I tried to compile scylladb on a new Fedora 24 system, and the "-lsystemd" library was missing during link. We need the systemd-devel package for that. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470865063-12871-1-git-send-email-nyh@scylladb.com>	2016-08-11 10:21:21 +02:00
Avi Kivity	42d8701121	Merge seastar upstream * seastar 64ae228...59613e7 (20): > reactor: fix I/O queue pending requests collectd metric > simple_input_stream: introduce copy_to() member function > scollectd: Fix merge skew between "disabled" and "descriped" metrics patches > Merge "Write-behind for XFS" > Merge "Fix the SMP queue poller" from Tomasz > Merge "collectd syntactical sugar & descriptions" from Calle > Fix build failure introduced by 5b1051ce0de5b772f22444abe6a1d97076b49b1f > Add --abort-on-seastar-bad-alloc option > Merge "Make arp comply with C++ strict aliasing rules" > doc: use install-dependencies.sh on docs > add execute bit on install-dependencies.sh > core/reactor: Fix use-after-free on io_event's promise > tcp: write option length correctly > tcp: make tcp options comply with strict aliasing rules > tcp: comply with strict aliasing rules > add libdl to library list > reactor: add exception counter > install-dependencies.sh: remove unnecessary sudo > install-dependencies.sh: install add-apt-repository when it's not installed > install-dependencies.sh: add protobuf to dependencies, for newly added prometheus API support Fixes #1558.	2016-08-10 15:14:31 +03:00
Tomasz Grabiec	d7f8ce7722	Merge branch 'raphael/fix_min_max_metadata_v2' from git@github.com:raphaelsc/scylla.git Fix for generation of sstables min/max clustering metadata from Raphael.	2016-08-10 10:43:35 +02:00
Pekka Enberg	6a5ab6bff4	dist/docker: Add '--smp', '--memory', and '--overprovisioned' options Add '--smp', '--memory', and '--overprovisioned' options to the Docker image. The options are written to /etc/scylla.d/docker.conf file, which is picked up by the Scylla startup scripts. You can now, for example, restrict your Docker container to 1 CPU and 1 GB of memory with: $ docker run --name some-scylla penberg/scylla --smp 1 --memory 1G --overprovisioned 1 Needed by folks who want to run Scylla on Docker in production. Cc: Sasha Levin <alexander.levin@verizon.com> Message-Id: <1470680445-25731-1-git-send-email-penberg@scylladb.com>	2016-08-10 11:34:08 +03:00
Raphael S. Carvalho	8deb1ca19d	tests: add test to check sstables's min and max clustering values Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-08-09 15:54:40 -03:00
Raphael S. Carvalho	ef6ddf2398	sstables: fix tracking of min and max clustering components Scylla was tracking min and max column names instead. Min and max clustering components are tracked to optimize reads that use a clustering filter. For more details: https://issues.apache.org/jira/browse/CASSANDRA-5514 Also fix potential bug if clustering value is empty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-08-09 15:01:30 -03:00
Vlad Zolotarov	5deec0e327	tracing::write_complete(): improve a message in case of a logic error Improve a message if there is a logic error and add logging of such errors. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 19:00:43 +03:00
Vlad Zolotarov	67d537ecb5	tracing: issue a write event if a single session creates a lot of events Currently write events are issued every time a trace session is closed. However if a single session creates a lot of events we will start dropping them after the total amount of pending records bypasses the limit. This patch will issue a write event before the session end in that case. Since now new events may be added to the active tracing session while it's scheduled for write we have to ensure the following: - Not to add the already pending for write session to the pending bulk. - Grab all pending data in a specific session in a synchronous way during the write event. - Serialize creation of events mutations - otherwise the "monotonic nanos" logic won't work. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 19:00:43 +03:00
Vlad Zolotarov	5391bcc5a9	tracing: improve a back pressure policy Use a per-shard tracing records budget instead of maintaining a fixed-size per-session records budget and a per-shard sessions budget. The original policy could lead to some irrational situations, when we have a single tracing session that creates a substantial amount of records that we can handle but we would start dropping new records after it surpasses the per-session limit. The new policy handles a per-shard trace records budget that is being consumed by each trace() call and by a primary session destructor when a session record is created. Each active record may only be in one of the following states: - cached: stored in its session's object. When record is in this state it's not going to be written to I/O during the next write event. - pending for write: when record is in this state it's going to be written to I/O during the next write event. - flushing: the record is being currently written to the I/O. There are counters of the total amount of records in each state above. Each record may only be in a specific state at every point of time and thereby it must be accounted only in one and only one of the three counters. The sum of all three counters should not be greater than (max_pending_trace_records + write_event_records_threshold) at any time (actually it can get as high as a value above plus (max_pending_sessions) if all sessions are primary but we won't take this into account for simplicity). The same is about the number of outstanding sessions: it may not be greater than (max_pending_sessions + write_event_sessions_threshold) at any time. If total number of tracing records is greater than or equal to the limit above, the new trace point is going to be dropped. If the current number of records plus the expected number of trace records per session (exp_trace_events_per_session) is greater than the limit above new sessions will be dropped.
A new session will also be dropped if there are too many active sessions. When the record or a session is dropped the appropriate statistics counters are updated and there is a rate-limited warning message printed to the log. Every time a number of records pending for write is greater or equal to (write_event_records_threshold) or a number of sessions pending for write is greater or equal to (write_event_sessions_threshold) a write event is issued. Every 2 seconds a timer would write all pending for write records available so far. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 19:00:43 +03:00
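The record accounting described in the commit above can be sketched as a small state machine. This is a hedged illustration under simplifying assumptions, not the code in Scylla's tracing module: the class `records_budget` and its method names are hypothetical, session budgets and the 2-second timer are left out, and only the two parameters named in the message (`max_pending_trace_records`, `write_event_records_threshold`) are modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-shard trace records budget. Every live record is counted
// in exactly one of three counters: cached, pending for write, or flushing.
class records_budget {
public:
    records_budget(uint64_t max_pending_trace_records,
                   uint64_t write_event_records_threshold)
        : _limit(max_pending_trace_records + write_event_records_threshold)
        , _write_threshold(write_event_records_threshold) {}

    uint64_t total() const { return _cached + _pending_for_write + _flushing; }
    uint64_t dropped() const { return _dropped; }

    // A new trace point is dropped once the total reaches the limit;
    // otherwise the record starts out in the "cached" state.
    bool try_add_record() {
        if (total() >= _limit) {
            ++_dropped;
            return false;
        }
        ++_cached;
        return true;
    }

    // Move up to n cached records to "pending for write". Returns true when
    // enough records are pending that a write event should be issued.
    bool schedule_for_write(uint64_t n) {
        n = std::min(n, _cached);
        _cached -= n;
        _pending_for_write += n;
        return _pending_for_write >= _write_threshold;
    }

    // A write event takes everything pending and starts flushing it.
    void start_flush() {
        _flushing += _pending_for_write;
        _pending_for_write = 0;
    }

    // Flushed records leave the budget, making room for new ones.
    void finish_flush() { _flushing = 0; }

private:
    uint64_t _limit;
    uint64_t _write_threshold;
    uint64_t _cached = 0;
    uint64_t _pending_for_write = 0;
    uint64_t _flushing = 0;
    uint64_t _dropped = 0;
};
```

Because a record only ever moves between counters and is never counted twice, `total()` stays bounded by the sum of the two parameters, which is the invariant the message states.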
Vlad Zolotarov	d8fe5317d1	tracing::trace_keyspace_helper: make events' mutations applying loop interruptible When building events' mutation don't apply them in a tight loop but rather apply each of them in a separate continuation to allow reactor to interrupt this loop if it takes too long for it to complete (e.g. where there are a lot of mutations to apply). Since building all events' mutations is asynchronous now we can no longer keep the "nanos" state in a global trace_keyspace_helper object but rather have to move it into the per-session backend_session_state class. backend_session_state class is a backend-specific implementation of a tracing::backend_session_state_base class. An instance of the above object is created by a tracing::i_tracing_backend_helper::allocate_session_state() virtual method and is stored in a tracing::one_session_records object. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 19:00:39 +03:00
Vlad Zolotarov	63a0502ed1	tracing: rework the interface between the tracing/trace_state and the backend Before this patch the interaction between the layers above was as follows: - trace_state was passing the trace event data to a backend object every time trace() method was called. - trace_state was passing the session data to a backend object in a destructor. - A backend object was storing this data in a form of lambda where all data above was caught in a capture list. This was primarily done in order to delay the call for make_xxx_mutation(). Lambdas were stored in a map by a session ID and they were executed when a kick() method was called. - A tracing::tracing object was periodically calling a kick() method of a backend that was initiating a write of all pending data to the storage. All backend methods used in the described above interactions were virtual. Thereby, for instance, for each and every trace record we were calling a virtual method that received a significant number of parameters, stored a lambda in a map and returned. This is clearly a suboptimal way of using virtual functions since we prevent a compiler from inlining obviously inlinable operations. This patch changes the interaction scheme to be as follows: - Trace events and session data are stored and passed around in a form of structs that hold all relevant information (no more lambdas). - As long as a trace session is active its data is aggregated inside the corresponding trace_state object. - The object containing all records is passed and stored as a lw_shared_ptr to save extra copies and to shorten capture lists. - All aggregated data is passed to a tracing::tracing object in a trace_state destructor. The data is stored in a std::deque in a tracing::tracing object (instead of a map by a session ID). - A single backend's virtual method call writes all data aggregated so far (kick() method is not needed any more), every time a write event occurs.
- Backend has only one virtual method now: - Write a bulk of sessions' data aggregated so far. - Backend's virtual method receives a records bulk object by reference. As a result: - A latency of a single trace event that has no formatting improved from 0.2us to 0.1us. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 15:25:52 +03:00
Vlad Zolotarov	960b423ce0	tracing/tracing.cc: rename a logger object s/logger/tracing_logger/ Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 15:21:47 +03:00
Nadav Har'El	c2e4f5ba16	Avoid some warnings in debug build The sanitizer of the debug build warns when a "bool" variable is read when containing a value not 0 or 1. In particular, if a class has an uninitialized bool field, which class logic allows to only be set later, then "move"ing such an object will read the uninitialized value and produce this warning. This patch fixes four of these warnings seen in sstable_test by initializing some bool fields to false, even though the code doesn't strictly need this initialization. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>	2016-08-09 13:21:45 +01:00
Vlad Zolotarov	e1b2926a8d	tracing: add a missing try-catch in params building Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-09 15:21:41 +03:00
Duarte Nunes	0ed19ec64d	thrift: Set default validator This patch sets the default validator for dynamic column families. Doing so has no consequences in terms of behavior, but it causes the correct type to be shown when describing the column family through cassandra-cli. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470739773-30497-1-git-send-email-duarte@scylladb.com>	2016-08-09 13:55:07 +02:00
Avi Kivity	da4d33802e	Merge "Add configuration file to scylla-housekeeping" from Amnon "The series adds an optional configuration file to the scylla-housekeeping. The file acts as a way to prevent the scylla-housekeeping from running. A missing configuration file, will make the scylla-housekeeping immediately. The series adds a flag to the build_rpm that differentiates between public distributions that would contain the configuration file and private distributions that will not contain it, which will cause the setup script to create it."	2016-08-09 14:52:19 +03:00
Nadav Har'El	e005762271	sstable: avoid copying non-existent value The promoted-index reading code contained a bug where it copied the value of a disengaged optional (this non-value was never used, but it was still copied). Fix it by keeping the optional<> as such longer. This bug caused tests/sstable_test in the debug build to crash (the release build somehow worked). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470742418-8813-1-git-send-email-nyh@scylladb.com>	2016-08-09 14:35:18 +03:00
Duarte Nunes	f63886b32e	thrift: Send empty col metadata when describing ks This patch ensures we always send the column metadata, even when the column family is dynamic and the metadata is empty, as some clients like cassandra-cli always assume its presence. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470740971-31169-1-git-send-email-duarte@scylladb.com>	2016-08-09 14:33:18 +03:00
Pekka Enberg	9ff242d339	cql3: Filter compaction strategy class from compaction options Cassandra 2.x does not store the compaction strategy class in compaction options so neither should we to avoid confusing the drivers. Fixes #1538. Message-Id: <1470722615-29106-1-git-send-email-penberg@scylladb.com>	2016-08-09 10:38:37 +02:00
Pekka Enberg	c23acbe5e6	Update scylla-ami submodule * dist/ami/files/scylla-ami 2e599a3...14c1666 (1): > setup coredump on first startup	2016-08-09 11:09:24 +03:00
Nadav Har'El	bce020efbd	Fix failing tests Commit `0d8463aba5` broke some of the tests with an assertion failure about local_is_initialized(). It turns out that there is more than one level of local_is_initialized() we need to check... For some tests, neither locals were initialized, but for others, one was and the other wasn't, and the wrong one was tested. With this patch, all unit tests except "flush_queue_test.cc" pass on my machine. I doubt this test is relevant to the promoted index patches, but I'll continue to investigate it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com>	2016-08-09 10:51:40 +03:00
Calle Wilund	1593f4254b	flush_queue_test: Start semaphore in propagation tests not initialized Somehow, all my local runs timed ok anyway, but obviously not on all machines. Message-Id: <1470727968-1759-1-git-send-email-calle@scylladb.com>	2016-08-09 09:35:28 +02:00
Asias He	d8bff4f745	gossip: Fix debug log in wait_for_gossip_to_settle There is an extra '{}' in the logger format string. Fixes: gossip - Gossip looks settled. 8 gossip round completed: ??? Message-Id: <1470278008-29914-2-git-send-email-asias@scylladb.com>	2016-08-08 16:38:21 +03:00
Takuya ASADA	60ce16cd54	dist/common/scripts: mkdir -p /var/lib/scylla/coredump before symlinking We are creating this dir in scylla_raid_setup, but user may create XFS volume w/o using the command, scylla_coredump_setup should work on such condition. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1470638615-17262-1-git-send-email-syuu@scylladb.com>	2016-08-08 16:02:55 +03:00
Pekka Enberg	3b31d500c8	dist/docker: Document data volume and cpuset configuration Message-Id: <1470649675-5648-1-git-send-email-penberg@scylladb.com>	2016-08-08 15:50:30 +03:00
Pekka Enberg	4372da426c	dist/docker: Add '--broadcast-rpc-address' command line option We already have a '--broadcast-address' command line option so let's add the same thing for RPC broadcast address configuration. Message-Id: <1470656449-11038-1-git-send-email-penberg@scylladb.com>	2016-08-08 15:46:42 +03:00
Avi Kivity	98226a14ac	Merge "Exception propagation writers in commitlog batch" "While periodic mode is a all-bets-off crap-shoot as far as knowing if data actually reached disk or not, batch mode is supposed to be somewhat more reliable/deterministic. Thus, if we get an exception writing/flushing the current buffer, we should propagate exceptions to all execution paths involved in this buffer. Flush queue can now (optionally) propagate exceptions to all clients, and commit log uses this to ensure that commit log writers in batch mode all generate exceptions on disk errors. Also includes some rudimentary tests for flush queue mechanisms. Note: other main user, sstable flushing, is not affected, as default mode is still to keep exceptions to individual worker continuations, not waiters."	2016-08-08 15:33:26 +03:00
Avi Kivity	700feda0db	Merge "promoted index for reading partial partitions" from Nadav "The goal of this patch series is to support reading and writing of a "promoted index" - the Cassandra 2.* SSTable feature which allows reading only a part of the partition without needing to read an entire partition when it is very long. To make a long story short, a "promoted index" is a sample of each partition's column names, written to the SSTable Index file with that partition's entry. See a longer explanation of the index file format, and the promoted index, here: https://github.com/scylladb/scylla/wiki/SSTables-Index-File There are two main features in this series - first enabling reading of parts of partitions (using the promoted index stored in an sstable), and then enable writing promoted indexes to new sstables. These two features are broken up into smaller stand-alone pieces to facilitate the review. Three features are still missing from this series and are planned to be developed later: 1. When we fail to parse a partition's promoted index, we silently fall back to reading the entire partition. We should log (with rate limiting) and count these errors, to help in debugging sstable problems. 2. The current code only uses the promoted index when looking for a single contiguous clustering-key range. If the ck range is non-contiguous, we fall back to reading the entire partition. We should use the promoted index in that case too. 3. The current code only uses the promoted index when reading a single partition, via sstable::read_row(). When scanning through all or a range of partitions (read_rows() or read_range_rows()), we do not yet use the promoted index; We read contiguously from data file (we do not even read from the index file, so unsurprisingly we can't use it)."	2016-08-07 17:53:17 +03:00
Avi Kivity	bbaebb39a8	Update scylla-ami submodule * dist/ami/files/scylla-ami 863cc45...2e599a3 (1): > Do not set developer-mode on unsupported instance types	2016-08-07 17:51:24 +03:00
Nadav Har'El	fc063ae62d	tests: add test for promoted index writing In this unit test, we create using Scylla C++ code, the same large partition with 13520 CQL rows as we previously imported from Cassandra for the large partition test. We then verify that the sstable index file we just wrote is byte-for-byte identical to the one previously created by Cassandra. They should indeed be identical, because the data file has the same layout (even if timestamps are different) and our default promoted-index block size is the same (64K) so the sample of columns should be identical. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:15 +03:00
Nadav Har'El	0d8463aba5	sstables: promoted index write support This patch adds writing of promoted index to sstables. The promoted index is basically a sample of columns and their positions for large partitions: The promoted index appears in the sstable's index file for partitions which are larger than 64 KB, and divides the partition into 64 KB blocks (as in Cassandra, this interval is configurable through the column_index_size_in_kb config parameter). Beyond modifying the index file, having a promoted index may also modify the data file: Since each of the blocks may be read independently, we need to add in the beginning of each block the list of range tombstones that are still open at that position. See also https://github.com/scylladb/scylla/wiki/SSTables-Index-File Fixes #959 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:12 +03:00
Nadav Har'El	09ede5f333	range tombstone accumulator: add method Add to the range_tombstone_accumulator a range_tombstones_for_row(ck) method. Just like the existing tombstone_for_row(ck), this function drops from the accumulator tombstones that end before ck. But while the existing function returned just a single tombstone affecting the given row (the most recent tombstone), the new function range_tombstones_for_row(ck) returns all the accumulated range tombstones which cover ck. This function will be useful for the promoted-index writing code later, which divides a partition into blocks which may be read independently, so each block needs to start with a repeat of the earlier tombstones which still cover the first row in the new block. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:10 +03:00
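The behaviour described for range_tombstones_for_row(ck) can be sketched as below. This is a simplified model, not Scylla's range_tombstone_accumulator: clustering keys are plain integers, ranges are closed intervals, and the class `tombstone_accumulator` with its `apply` and `open_count` members is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified range tombstone: deletes clustering keys in [start, end].
struct range_tombstone {
    int start;
    int end;
    int64_t timestamp;
};

// Hypothetical accumulator, assuming rows are visited in increasing key order.
class tombstone_accumulator {
public:
    void apply(range_tombstone rt) { _open.push_back(rt); }

    std::size_t open_count() const { return _open.size(); }

    // Drops accumulated tombstones that end before ck, then returns every
    // remaining tombstone that covers ck (not just the most recent one).
    std::vector<range_tombstone> range_tombstones_for_row(int ck) {
        _open.erase(std::remove_if(_open.begin(), _open.end(),
                                   [ck](const range_tombstone& rt) {
                                       return rt.end < ck;
                                   }),
                    _open.end());
        std::vector<range_tombstone> covering;
        for (const range_tombstone& rt : _open) {
            if (rt.start <= ck) {
                covering.push_back(rt);
            }
        }
        return covering;
    }

private:
    std::vector<range_tombstone> _open;
};
```

In the promoted-index writer this kind of query would be made for the first row of each new block, so the block can begin by repeating the tombstones that are still open at that position.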
Nadav Har'El	022a69caea	sstables: promoted index read support This patch adds support for more efficiently reading small parts of a large partition, without reading the entire partition as we had to do so far. This is done using the "promoted index". The "promoted index" is stored in the sstable index file, and provides for each large sstable row ("partition" in CQL nomenclature) a sample of the column names at (for example) 64KB intervals. This means that when we read a slice of columns (e.g., cql rows), or page through a large partition, we do not have to read the entire partition from disk. This patch only implements the read side of promoted index - a later patch will add the write-side support (i.e., writing the promoted index to the index file while saving the sstable). Nevertheless this patch can already be tested by reading existing sstables from Cassandra which include a promoted index - such as the one included in the test in the previous patch. The use of the promoted index currently has two limitations: 1. It is only used when reading a single partition with sstable::read_row(), not when scanning through many partitions with sstable::read_range_rows() or sstable::read_rows(). 2. It is only used when filtering a single clustering-key range, rather than a list of disjoint ranges. A single range is the common case. These two issues will be improved later. In the meantime, in those unsupported cases we simply continue to read entire partitions, so we're not worse-off than before. Also note that this patch only helps when sstable::read_row() is used with a clustering-key prefix (i.e., a slice). Our higher-level request handling code may decide to read an entire partition into the cache, and not use a clustering-key prefix at all when reading. We will need to independently improve the high-level code to use read_row()'s slicing capabilities when paging through large partitions, for example. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:09 +03:00
Nadav Har'El	6cdf5684f5	sstables: introduce find_disk_ranges() Our sstable reading code is currently hard-coded to read entire partitions, even if we know that only a subset of the columns are requested. This patch introduces find_disk_ranges(), a function to find the ranges of bytes we need to read from the sstable data file to guarantee that the desired columns from the desired partition are read. The returned range may be the entire byte range of the given partition - as found using the summary and index files - but if the index contains a "promoted index" (basically a sample of column positions for each key) we may return a smaller range. The "disk_read_range" type introduced in the previous patch is extended here to support reading a partial partition - by including additional information which would be missed when reading only part of a partition (viz., the partition key and the partition's tombstone). This function isn't used in this patch - we will wire its use in the next patch, which will complete the read-side support for the promoted index. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:08 +03:00
Nadav Har'El	1d38a69e49	sstables: expose promoted index in index entry Our index_entry type, holding one partition's entry that we read from the index file, already contained the "_promoted_index" which we read from disk - as an unparsed byte buffer. But there wasn't any API to access this buffer after it was read. This patch adds a trivial getter, to get a read-only view of this buffer. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:06 +03:00
Nadav Har'El	4e5a09538d	make column-name parsing code public The "struct column" code in partition.cc is generally useful code for parsing serialized column names from the sstable. It is currently private inside the "mp_row_consumer" class. But in a next patch we'll also want to use it in the "sstable" class, for the promoted-index parsing code, which among other things also needs to deserialize column names. The trivial fix, in this patch, is to make this code "public". However, for now it is still available only in partition.cc. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:04 +03:00
Nadav Har'El	1975abbfd6	sstables: disk_read_range Currently, the main sstable data parsing entry point data_consume_rows() takes a contiguous range of bytes to read from disk and parse. This range is supposed to be an entire partition or contiguous group of partitions, and is self-contained (can be parsed without extra information about the identity of these partitions). For the promoted index feature (which we will add in a following patch) we will want the range to span only a part of a partition, and will need the caller to provide some information not available to the parser (such as the partition's key). In the future, we will also want to support a vector of byte ranges, instead of just one. So in preparation for this, this patch simply replaces the start/end pair by a new class disk_read_range, which can be easily extended in later patches. No new functionality is introduced in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:02 +03:00
Nadav Har'El	6fed716bd3	sstables: STOP_THEN_ATOM_START parser state In a later patch adding "promoted index" read support, we would like to parse only part of an sstable row. In that case, the parser should start not at the usual ROW_START state, but rather at the ATOM_START state. But there's a problem: The sstable parser consumer currently assumes that the parser stops after the start of the row, before reading any atoms. So in the partial row case too, we must stop parsing before reading the first atom. For this, this patch adds the new "STOP_THEN_ATOM_START" parser state. When starting in this state, the parser stops immediately (with row_consumer::proceed::no), and when restarted again it will be in the ATOM_START case. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:47:01 +03:00
Nadav Har'El	3dd079fb7a	tests: add test for reading parts of a large partition This patch adds a test that takes an sstable with one partition of 13,520 clustering rows (spanning 700 KB in the data file), and attempts to read various slices of CQL rows, counting that we got back the expected number of rows. The sstable included here was generated by Cassandra, and includes a promoted index. Promoted index reading is not supported yet (we will add it in the next patch), so for now the code will always read the entire partition from disk; but still the clustering-key filtering is already functional, and will drop some of the rows as requested, so this test will pass. Later, when we add promoted index support, we should check that this test still passes - promoted index will make the reads in this test more efficient (which the test cannot verify), but the important thing to check is that it doesn't break any of these tests. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2016-08-07 17:46:59 +03:00
Takuya ASADA	3d45d6579b	dist/ami/files: add a message for privacy policy agreement on login prompt Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1470212054-351-1-git-send-email-syuu@scylladb.com>	2016-08-07 17:40:25 +03:00
Amnon Heiman	beba3a31cb	scylla-housekeeping.service: Add a configuration file Adding the configuration file is a way to make the running scylla-housekeeping service not run the check version. If the file does not exist, it will be set by the setup script. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-07 16:24:43 +03:00
Amnon Heiman	9c4bf651ae	install housekeeping.conf based on dist flag The new dist flag differentiates between public distributions and private compilations. For public distributions the housekeeping configuration file will be installed. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-07 16:24:42 +03:00
Amnon Heiman	060c3029b0	scylla_setup: create the scylla-housekeeping conf file if missing When scylla_setup runs, the script checks that the housekeeping configuration file, housekeeping.conf, exists. If not, it asks the user whether to run the check version option and creates the file accordingly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-07 16:24:23 +03:00
Amnon Heiman	63582dd043	Add a config file for housekeeping The housekeeping.conf is a configuration file for scylla-housekeeping. By default it will be included in the rpm and state that the check-version would be run. If the file is missing, or if check-version is set to false, the check version operation will not be performed. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-07 13:14:59 +03:00
Amnon Heiman	93737a601f	scylla-housekeeping: An optional config file This patch adds an optional config file that can be passed from the command line. If the config file is specified and does not exist, the script will terminate. The only parameter that is currently available is check-version, which can be either true or false. The ConfigParser module is used to read the config file, which should be in the form: [housekeeping] check-version: True Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-08-07 13:14:59 +03:00
Duarte Nunes	e0a43a82c6	system_keyspace: Correctly deal with wrapped ranges This patch ensures we correctly deal with ranges that wrap around when querying the size_estimates system table. Ref #693 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470412433-7767-1-git-send-email-duarte@scylladb.com>	2016-08-05 19:17:00 +03:00
Avi Kivity	b0a275945f	Merge "Remove compact columns" from Duarte "The compact column is a dense schema's single regular column. Its existence has been a source of bugs, so this patchset removes the column_kind::compact_column, as well as further references to compact columns from the code base. Fixes #1542"	2016-08-05 12:39:23 +03:00
Takuya ASADA	bd1ab3a0ad	dist/ami/files: show warning message for unsupported instance types Notify users to run scylla_io_setup before launching scylla on unsupported instance types. Fixes #1511 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1470090415-8632-1-git-send-email-syuu@scylladb.com>	2016-08-05 09:51:13 +03:00
Avi Kivity	28ee2bdbd2	Merge "Docker image fixes" from Pekka "Kubernetes is unhappy with our Docker image because we start systemd under the hood. Fix that by switching to use "supervisord" to manage the two processes -- "scylla" and "scylla-jmx": http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/ While at it, fix up "docker logs" and "docker exec cqlsh" to work out-of-the-box, and update our documentation to match what we have. Further work is needed to ensure Scylla production configuration works as expected and is documented accordingly."	2016-08-04 15:11:18 +03:00
Pekka Enberg	394c8f8c4f	dist/docker: Document Scylla cluster setup Add instructions on how to make a cluster of two Scylla nodes.	2016-08-04 12:20:46 +03:00
Glauber Costa	fe6a0d97d1	logalloc: make sure allocations in release_requests don't recurse back into the allocator Calls like later() and with_gate() may allocate memory, although that is not very common. This can create a problem in the sense that it will potentially recurse and bring us back to the allocator during free - which is the very thing we are trying to avoid with the call to later(). This patch wraps the relevant calls in the reclaimer lock. This does mean that the allocation may fail if we are under severe pressure - which includes having exhausted all reserved space - but at least we won't recurse back to the allocator. To make sure we do this as early as possible, we just fold both release_requests and do_release_requests into a single function. Thanks Tomek for the suggestion. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>	2016-08-04 11:16:53 +02:00
Pekka Enberg	7deddbe17a	dist/docker: Fix Docker Hub documentation Fix Docker Hub documentation to match what we have right now. More work is needed in the following areas: * How to make a cluster * How to configure Docker image for production use	2016-08-04 10:05:08 +03:00
Pekka Enberg	6c8c60a5fc	dist/docker: Setup hostname in cqlshrc We configure the hostname in the "CQLSH_HOST" environment variable but that is only picked up if we first start the shell. Setup the hostname in $HOME/.cqlshrc file instead so that we can start "cqlsh" directly: docker exec -it scylla cqlsh	2016-08-04 09:57:08 +03:00
Pekka Enberg	d0aeb53e7c	dist/docker: Log to stdout instead of syslog We don't have systemd running on the image so "journalctl" is useless. Log to stdout instead which has the nice benefit of making "docker logs" produce meaningful output on the host.	2016-08-04 09:46:26 +03:00
Glauber Costa	ad58691afb	logalloc: make sure blocked requests memory allocations are served from the standard allocator Issue 1510 describes a scenario in which, under load, we allocate memory within release_requests() leading to a reentry into an invalid state in our blocked requests' shared_promise. This is not easy to trigger since not all allocations will actually get to the point in which they need a new segment, let alone have that happening during another allocator call. Having those kinds of reentry is something we have always sought to avoid with release_requests(): this is the reason why most of the actual routine is deferred after a call to later(). However, that is a trick we cannot use for updating the state of the blocked requests' shared_promise: we can't guarantee when that is going to run, and we always need a valid shared_promise, in a valid state, waiting for new requests to hook into. The solution employed by this patch is to make sure that no allocation operations whatsoever happen during the initial part of release_requests on behalf of the shared promise. Allocation is now deferred to first use, which relieves release_requests() from all allocation duties. All it needs to do is free the old object and signal to its user that an allocation is needed (by storing {} into the shared_promise). Fixes #1510 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>	2016-08-03 20:40:30 +02:00
Duarte Nunes	cb0516a76c	schema: Remove compact_column concept This is a confusing one, and can be replaced by the fact that dense schemas have a single regular column. Ref #1542 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-03 17:21:41 +00:00
Duarte Nunes	529c3a3ae6	column_kind: Drop compact_column A compact column is a dense schema's single regular column. The fact that it is a different column_kind has led to various bugs (#1535), derived by the schema being dense and the column being regular. Fixes #1542 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-03 17:21:37 +00:00
Calle Wilund	0f9e868839	commitlog: Use exception propagation in flush_queue (for batch) Fixes: #1490 While periodic mode is an all-bets-off crap-shoot as far as knowing if data actually reached disk or not, batch mode is supposed to be somewhat more reliable/deterministic. Thus, if we get an exception writing/flushing the current buffer, we should propagate exceptions to all execution paths involved in this buffer. Thus, adding a mutation to commit log in batch will now, if an error is generated, result in an exception to the caller, which should be interpreted as "data might not have been persisted". The failing segment is then closed, and we happily hope things will get better in the next. Which they probably won't. Missing: registration of some sort of "error-handling policy", similar to origin, which can either kill transports or shut down process. (A reasonable guess is that disk errors in commit log are not gonna be recoverable).	2016-08-03 14:49:43 +00:00
Calle Wilund	9098eed30b	flush_queue_test: Add tests for exception propagation v2: * Remove leading "_" in template types	2016-08-03 14:49:43 +00:00
Calle Wilund	620e54cae4	flush_queue: Allow exception propagation to waiters Re-worked to use shared_promise<> as signal mechanism (because we have that now), which also makes it less painful to implement exceptions propagating not only from "func" to "post", but also from given func->post chain entry to any waiters. v2: * Remove leading "_" in template types	2016-08-03 14:49:38 +00:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what was the cause of allocation failure. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Avi Kivity	9df4ac53e5	conf: synchronize internode_compression between scylla.yaml and code Our default is "none", to give reasonable performance, so have scylla.yaml reflect that.	2016-08-03 16:50:48 +03:00
Amnon Heiman	b18b067b26	Add prometheus API This patch adds the prometheus API: it adds the proto library to the compilation, adds an optional configuration parameter to change the prometheus listening port, and starts the prometheus API in main. To disable the prometheus API, set its listening port to 0. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1470231628-22831-2-git-send-email-amnon@scylladb.com>	2016-08-03 16:49:42 +03:00
Amnon Heiman	bb4268a8a5	Add prometheus API This patch adds the prometheus API: it adds the proto library to the compilation, adds an optional configuration parameter to change the prometheus listening port, and starts the prometheus API in main. To disable the prometheus API, set its listening port to 0. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1470228764-19545-2-git-send-email-amnon@scylladb.com>	2016-08-03 15:55:18 +03:00
Duarte Nunes	1516cd4c08	schema: Dense schemas are correctly upgraded When upgrading a dense schema, we would drop the cells of the regular (compact) column. This patch fixes this by making the regular and compact column kinds compatible. Fixes #1536 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470172097-7719-1-git-send-email-duarte@scylladb.com>	2016-08-03 13:39:01 +02:00
Paweł Dziepak	02ffc28f0d	sstables: extend sstable life until reader is fully closed data_consume_rows_context needs to have close() called and the returned future waited for before it can be destroyed. data_consume_context::impl does that in the background upon its destruction. However, it is possible that the sstable is removed before data_consume_rows_context::close() completes in which case EBADF may happen. The solution is to make data_consume_context::impl keep a reference to the sstable and extend its life time until closing of data_consume_rows_context (which is performed in the background) completes. Side effect of this change is also that data_consume_context no longer requires its user to make sure that the sstable exists as long as it is in use since it owns its own reference to it. Fixes #1537. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com>	2016-08-03 13:19:08 +02:00
Pekka Enberg	59bd5e485b	dist/docker: Use supervisord to manage multiple processes Switch to supervisord to manage the two processes we have: Scylla server and Scylla JMX proxy. We need this to make the Docker image run under Kubernetes, which now fails as follows as we try to start the systemd init process: Couldn't find an alternative telinit implementation to spawn. I have not seen other people hitting the issue, except for GitLab Docker image: https://gitlab.com/gitlab-org/gitlab-ce/issues/18612 which "solved" the problem by not running init... https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/838/diffs Furthermore, the "supervisord" approach seems to be what people actually use in Docker land: http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/ The only downside is that we now sort of duplicate functionality that's already in the systemd configuration files. However, we should work towards Scylla figuring out its configuration rather than compose a long list of command line arguments. Once we do that, the duplication in Docker supervisord scripts disappears.	2016-08-03 11:59:04 +03:00
Paweł Dziepak	5f11a727c9	Merge "partition_limit: Don't count dead partitions" from Duarte "This patch series ensures we don't count dead partitions (i.e., partitions with no live rows) towards the partition_limit. We also enforce the partition limit at the storage_proxy level, so that limits with smp > 1 work correctly."	2016-08-03 09:49:30 +01:00
Duarte Nunes	db1118e4f7	database_test: Add case for partition limit Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 22:11:15 +00:00
Duarte Nunes	84e3969014	mutation_query_test: Add test for partition limit Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 22:08:39 +00:00
Duarte Nunes	b0c5996580	read_command: Add comment explaining partition_limit Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 21:17:06 +00:00
Duarte Nunes	54ad038aa6	storage_proxy: Enforce partition_limit This patch enforces the partition_limit at the mutation_result_merger. Ref #693 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 21:17:06 +00:00
Duarte Nunes	ec490ffaba	query_result_builder: Don't count dead partitions With this patch we stop counting dead partitions (i.e., partitions containing only tombstones) towards the partition limit, which should apply only to partitions with live rows. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 21:17:06 +00:00
Duarte Nunes	167e400ca8	compact_mutation: Don't count dead partitions With this patch we stop counting dead partitions (i.e., partitions containing only tombstones) towards the partition limit, which should apply only to partitions with live rows. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-08-02 21:17:06 +00:00
Paweł Dziepak	0f902738f0	Revert "storage_proxy: Enforce partition_limit" This reverts commit `141ea49e05`. There was a confusion around the meaning of "partition limit". Parts of our code interpreted it just as "maximum number of partitions". This is also how Cassandra behaves. However, the other parts of the code, including data query, interpreted it as "maximum number of live partitions" or otherwise skipped dead partitions resulting in #1447. A decision has been made to stick to the "maximum number of live partitions" interpretation everywhere. The consequences are, among others, that the patch reverted by this one is no longer correct. While the actual series fixing the interpretations of partition limit and getting rid of the confusion is yet to come, the purpose of this revert is to make backporting easier (as the patch being reverted hasn't made it to branch-1.3 yet).	2016-08-02 16:53:01 +01:00
Duarte Nunes	5995aebf39	schema_builder: Ensure dense tables have compact col This patch ensures that when the schema is dense, regardless of compact_storage being set, the single regular column is translated into a compact column. This fixes an issue where Thrift dynamic column families are translated to a dense schema with a regular column, instead of a compact one. Since a compact column is also a regular column (e.g., for purposes of querying), no further changes are required. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470062410-1414-1-git-send-email-duarte@scylladb.com>	2016-08-02 14:49:13 +02:00
Avi Kivity	fbc3377ad4	row_cache: add a counter for a miss that did not result in an insertion Such misses are due to concurrent access to the same key. Add a counter to track this as it results in unnecessary I/O being performed. See #1534. Message-Id: <1470139871-14693-1-git-send-email-avi@scylladb.com>	2016-08-02 14:14:27 +02:00
Tomasz Grabiec	d2ed75c9ff	Merge 'pdziepak/row-cache-wide-entries/v4' from seastar-dev.git This series adds the ability for the partition cache to keep information on whether a partition's size makes it uncacheable. During reads, these entries save us IO operations since we already know that the partition is too big to be put in the cache. The first part of the patchset makes all mutation_readers allow the streamed_mutations they produce to outlive them, which is a guarantee used later by the code handling reading large partitions.	2016-08-02 12:26:56 +02:00
Duarte Nunes	141ea49e05	storage_proxy: Enforce partition_limit This patch enforces the partition_limit at the mutation_result_merger. Ref #693 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1470065526-3174-1-git-send-email-duarte@scylladb.com>	2016-08-02 10:10:43 +01:00
Avi Kivity	9f35e4d328	checked_file: preserve DMA alignment Inherit the alignment parameters from the underlying file instead of defaulting to 4096. This gives better read performance on disks with 512-byte sectors. Fixes #1532. Message-Id: <1470122188-25548-1-git-send-email-avi@scylladb.com>	2016-08-02 10:03:02 +01:00
Paweł Dziepak	d7123d21eb	Merge "cql3: remove schema_altering_statement::prepare" from Avi "schema_altering_statement is both a cql_statement and a prepared_statement. This makes it hard to understand because virtual functions from both base classes are present, and hard to separate the raw and prepared variants. This patchset removes schema_altering_statement::prepare() (and the enable_shared_from_this<> that makes it work) in preparation for splitting its subclasses into raw and prepared variants (note that create_table_statement was already split)."	2016-08-02 09:46:55 +01:00
Tomasz Grabiec	0c1bf6c861	scylla-gdb.py: Fix lookup of global symbols Fixes errors like the one below: (gdb) scylla memory Python Exception <class 'gdb.error'> A syntax error in expression, near `memory::cpu_mem'.: Error occurred in Python command: A syntax error in expression, near `memory::cpu_mem'. Wrapping the symbol in quotes instructs GDB to lookup in the global context instead of the context of current frame. Message-Id: <1470050751-3167-1-git-send-email-tgrabiec@scylladb.com>	2016-08-01 13:51:15 +01:00
Calle Wilund	e4a845145a	flush_queue_test: Do "close" at end of tests to ensure gate is balanced	2016-08-01 08:23:55 +00:00
Calle Wilund	a277975fd4	flush_queue: ensure gate is always closed (even with exceptions)	2016-08-01 08:23:42 +00:00
Takuya ASADA	9b59bb59f2	dist/ami: Install scylla metapackage and debuginfo on AMI Install scylla metapackage and debuginfo on AMI to make reporting bugs from the AMI easier. Fixes #1496 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1469635071-16821-1-git-send-email-syuu@scylladb.com>	2016-07-31 18:37:41 +03:00
Takuya ASADA	89b790358e	dist/common/scripts: disable coredump compression by default, add an argument to enable compression on scylla_coredump_setup On large-memory machines compression takes too long, so disable it by default. Also provide a way to enable it again. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1469706934-6280-1-git-send-email-syuu@scylladb.com>	2016-07-31 18:36:51 +03:00
Takuya ASADA	d3746298ae	dist/ami: setup correct repository when --localrpm specified There was no way to set up the correct repo when the AMI is built with the --localrpm option, since the AMI does not have access to the 'version' file, and we did not pass the repo URL to the AMI. So detect the optimal repo path when starting to build the AMI, pass the repo URL to the AMI, and set it up correctly. Note: this changes the behavior of build_ami.sh/scylla_install_pkg's --repo option. It used to be a repository URL, but is now a .repo/.list file URL. This is optimal for distributions that require 3rd-party packages to install scylla, like CentOS7. Existing shell scripts that invoke build_ami.sh, such as our Jenkins jobs, need to be updated accordingly. Fixes #1414 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1469636377-17828-1-git-send-email-syuu@scylladb.com>	2016-07-31 18:36:21 +03:00
Avi Kivity	ec62f0d321	Merge "housekeeping: Switch to python2 and handle version" from Amnon This series handles two issues: * Moving to python2: though python3 is supported, some modules that we need are not rpm-installable, so python3 will wait until it is more mature. * Check version should send the current version when it checks for a new one, and a simple string compare is wrong.	2016-07-31 14:55:36 +03:00
Amnon Heiman	3170b477d0	ubuntu control.in: set python2 request dependency scylla-housekeeping moved to python2; this changes the dependency to take the python2 requests module. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-31 10:47:22 +03:00
Amnon Heiman	c8bcb5a8bf	scylla.spec: Set the python dependencies for housekeeping scylla-housekeeping moved to python2; this sets the python dependencies under redhat. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-31 10:40:40 +03:00
Avi Kivity	a6ec4aa547	cql3: schema_altering_statement: remove prepare() In preparation for splitting raw and prepared variants of subclasses of schema_altering_statement, remove schema_altering_statement::prepare(). All subclasses already implement it themselves (by creating a new instance).	2016-07-30 23:29:32 +03:00
Avi Kivity	0687dc3689	cql3: add fake create_table_statement::prepare() In preparation for the removal of schema_altering_statement::prepare(), add a fake create_table_statement::prepare(). create_table_statement has already been split to raw and prepared variants, so this prepare() will never be called, but it is required because schema_altering_statement is both a cql_statement and a prepared_statement. This confusion will be fixed later on.	2016-07-30 23:26:28 +03:00
Avi Kivity	713cfc3182	cql3: copy drop_type_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:14:49 +03:00
Avi Kivity	6631cd3277	cql3: copy drop_table_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:14:35 +03:00
Avi Kivity	382235fca4	cql3: copy drop_keyspace_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:14:18 +03:00
Avi Kivity	6a099f2690	cql3: copy create_type_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:14:03 +03:00
Avi Kivity	2d961334d8	cql3: copy create_keyspace_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:13:45 +03:00
Avi Kivity	35ef82e78d	cql3: copy create_index_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:12:55 +03:00
Avi Kivity	23eef0a610	cql3: copy alter_type_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:12:24 +03:00
Avi Kivity	8b50f75958	cql3: copy alter_table_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 23:10:43 +03:00
Avi Kivity	81b75dfa47	cql3: copy alter_keyspace_statement when preparing it To prepare for the separation of prepared and raw schema_altering_statements, avoid the reliance on this class implementing both the raw and prepared variants and the use of shared_from_this().	2016-07-30 22:45:08 +03:00
Avi Kivity	b881945d45	estimated_histograms: fix indentation, bracing	2016-07-30 20:13:16 +03:00
Avi Kivity	75ee8fc2a7	size_estimates_recorder: adjust indentation	2016-07-30 20:10:12 +03:00
Piotr Jastrzebski	ca9c29e296	Cache information about partition being wide Once we encounter a wide partition store information about this in cache entry and don't try to read it all and cache next time it's requested. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [Paweł: rebased, moved large partition reading logic to cache_entry::read_wide()] Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 18:39:22 +01:00
Paweł Dziepak	42f566433e	tests/row_cache_test: use BOOST_REQUIRE_EQUAL() instead of raw assert() In case of failure BOOST_REQUIRE_EQUAL() is nicer and prints the actual values that were supposed to be equal, but aren't. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 17:19:18 +01:00
Paweł Dziepak	ee1f1ee1c4	row_cache: fix creating readers for large partitions There were cases of use-after-free introduced by the code responsible for creating mutation_readers for large partitions – the lifetimes of partition ranges and the readers themselves weren't sufficiently extended. Another problem was that if the partition was no longer present in the sstable the reader would return EOS which was then returned by range_populating_reader itself causing its users to incorrectly interpret that as an end of stream. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 17:02:17 +01:00
Paweł Dziepak	7b479d8b41	clarify relations between mutation_reader and streamed_mutation Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 15:58:42 +01:00
Paweł Dziepak	3a27582cd4	sstables: allow streamed_mutation to outlive mutation_reader This patch makes sstable_streamed_mutation keep a reference to sstable_data_source object which contains full state necessary to read the sstable. That state is also shared with parent mutation_reader (only for range queries), but now its lifetime is appropriately extended if the mutation_reader is destroyed before streamed_mutation. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 15:53:09 +01:00
Paweł Dziepak	5ba4cd1a0b	sstables: enable_lw_shared_from_this for sstable sstable has member functions that create objects which need to extend lifetime of the sstable (for example mutation_readers), the easiest way to achieve that is to enable_lw_shared_from_this for sstable. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 15:51:12 +01:00
Duarte Nunes	7d1b7e8da3	storage_service: Fix get_range_to_address_map_in_local_dc This patch fixes a couple of bugs in get_range_to_address_map_in_local_dc. Fixes #1517 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469782666-21320-1-git-send-email-duarte@scylladb.com>	2016-07-29 11:11:07 +02:00
Gleb Natapov	3531dd8d71	api: fix use after free in sum_sstable get_sstables_including_compacted_undeleted() may return temporary shared ptr which will be destroyed before the loop if not stored locally. Fixes #1514 Message-Id: <20160728100504.GD2502@scylladb.com>	2016-07-28 14:25:40 +03:00
Piotr Jastrzebski	fdfd1af694	Use continuity flag correctly with concurrent invalidations Between reading a cache entry and actually using it, invalidations can happen, so we have to check that no flag was cleared; if one was, we need to read the entry again. Fixes #1464. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <7856b0ded45e42774ccd6f402b5ee42175bd73cf.1469701026.git.piotr@scylladb.com>	2016-07-28 11:55:18 +01:00
Duarte Nunes	25a44ee6cf	sstables: Validate static cell is on static column This patch enforces compatibility between a cell and the corresponding column definition with regards to them being static. [tgrabiec: Fixed typo in "definition"] Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469699532-26984-1-git-send-email-duarte@scylladb.com>	2016-07-28 12:01:31 +02:00
Duarte Nunes	6fc6adbdeb	sstable_mutation_test: Test non-compound cell name This patch adds a test case for reading non-compound cell names, validating that such a cell is not incorrectly marked as static. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616205-4550-5-git-send-email-duarte@scylladb.com>	2016-07-28 11:12:20 +02:00
Duarte Nunes	40f388f2b5	sstables: Don't assume cell name is compound The current code assumes cell names are always compound and may wrongly report a non-static row as such, since it looks at the first bytes of the name assuming they are the component's length. Tables with compact storage (which cannot contain static rows) may not have a compound comparator, so we check for the table's compoundness before checking for the static marker. We do this by delegating to composite_view::is_static. Fixes #1495 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616205-4550-4-git-send-email-duarte@scylladb.com>	2016-07-28 11:12:06 +02:00
Duarte Nunes	05c3d4f22b	composite: Use operator[] instead of at() Since we already do bounds checking on is_static(), we can use bytes_view::operator[] instead of bytes_view::at() to avoid repeating the bounds checking. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616205-4550-3-git-send-email-duarte@scylladb.com>	2016-07-28 11:11:54 +02:00
Duarte Nunes	6c9076fdd7	composite_view: Fix is_static composite_view's is_static function is wrong because: 1) It doesn't guard against the composite being a compound; 2) Doesn't deal with widening due to integral promotions and consequent sign extension. This patch fixes this by ensuring there's only one correct implementation of is_static, to avoid code duplication and enforce test coverage. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616205-4550-2-git-send-email-duarte@scylladb.com>	2016-07-28 11:11:38 +02:00
Amnon Heiman	406fa11cc5	scylla-housekeeping: check version should use the current version This patch handles two issues with check version: * When checking for a version, the script sends the current version * Instead of string compare it uses parse_version to compare the versions. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-28 11:31:05 +03:00
Amnon Heiman	641e5dc57c	scylla-housekeeping: Switching to python2 There is a problem with python module installation in python3, especially on centos. Though python34 has a normal package, a lot of the modules are missing yum installation and can only be installed by pip. This patch switches the scylla-housekeeping implementation to use python2; we should switch back to python3 when CentOS python3 is more mature. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-28 11:27:52 +03:00
Tomasz Grabiec	27f6c0d62b	tests: lsa_async_eviction_test: Use chunked_fifo<> To protect against large reallocations during push() which are done under reclaim lock and may fail.	2016-07-28 09:33:12 +02:00
Tomasz Grabiec	17b45fae9e	tests/memory_footprint: Fix runtime errors The test has bit-rotted. One problem is that the cache is no longer empty right after creation, since we added ghost entries to it. The second problem is that the mutation serializer needs the storage service to be initialized, so we need to set up a full cql test env. Message-Id: <1469626546-4279-1-git-send-email-tgrabiec@scylladb.com>	2016-07-27 14:38:36 +01:00
Duarte Nunes	f468453cbe	README.md: Replace yum with dnf yum is démodé. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616395-5537-1-git-send-email-duarte@scylladb.com>	2016-07-27 13:50:31 +03:00
Duarte Nunes	e4464f7500	README.md: Add protobuf dependencies Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616338-5199-1-git-send-email-duarte@scylladb.com>	2016-07-27 13:50:19 +03:00
Duarte Nunes	5aaf43d1bc	thrift: Preserve partition order when accumulating This patch changes the column_visitor so that it preserves the order of the partitions it visits when building the accumulation result. This is required by verbs such as get_range_slice, on top of which users can implement paging. In such cases, the last key returned by the query will be the start of the range for the next query. If that key is not actually the last in the partitioner's order, then the new request will likely result in duplicate values being sent. Ref #693 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469568135-19644-1-git-send-email-duarte@scylladb.com>	2016-07-27 10:34:29 +02:00
Avi Kivity	64d0cf58ea	size_estimates_recorder: unwrap ranges before searching for sstables column_family::select_sstables() requires unwrapped ranges, so unwrap them. Fixes crash with Leveled Compaction Strategy. Fixes #1507. Reviewed-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469563488-14869-1-git-send-email-avi@scylladb.com>	2016-07-27 10:06:21 +03:00
Takuya ASADA	b542581d97	dist: add protobuf to dependencies, for newly added prometheus API support Since we added prometheus API support, we need protobuf compiler and header to build seastar/scylla, so add dependencies on packaging. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1469573936-8551-1-git-send-email-syuu@scylladb.com>	2016-07-27 08:48:58 +03:00
Paweł Dziepak	efa690ce8c	sstables: fix skipping partitions with no rows If a partition contains no static and clustering rows or range tombstones, mp_row_consumer will return a disengaged mutation_fragment_opt with the is_mutation_end flag set to mark the end of this partition. Currently, the mutation_reader::impl code incorrectly recognizes a disengaged mutation fragment as the end of the stream of all mutations. This patch fixes that by using the is_mutation_end flag to determine whether end of partition or end of stream was reached. Fixes #1503. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1469525449-15525-1-git-send-email-pdziepak@scylladb.com>	2016-07-26 13:02:03 +03:00
Amos Kong	64530e9686	scylla-housekeeping: fix typo of script path I tried to start the scylla-housekeeping service by: # sudo systemctl restart scylla-housekeeping.service But it failed because of a wrong script path; error detail: systemd[5605]: Failed at step EXEC spawning /usr/lib/scylla/scylla-Housekeeping: No such file or directory The right script name is 'scylla-housekeeping' Signed-off-by: Amos Kong <amos@scylladb.com> Message-Id: <c11319a3c7d3f22f613f5f6708699be0aa6bd740.1469506477.git.amos@scylladb.com>	2016-07-26 09:18:43 +03:00
Raphael S. Carvalho	b7cdfafbdd	tests: fix compilation of partitioner test For some unknown reason, there were some duplicated definitions of bytes3. They were not needed at all. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <7f2b87211c592e573f45f277dc0ab4a8c037f258.1469490327.git.raphaelsc@scylladb.com>	2016-07-26 09:04:08 +03:00
Avi Kivity	1d1f61ac65	bytes_ostream: fix assignment operators Protect against self-assignment, and provide the strong exception guarantees for copying assignment. Fixes #1499. Message-Id: <1469466156-11114-1-git-send-email-avi@scylladb.com>	2016-07-25 19:06:42 +02:00
Avi Kivity	9e515c17a3	Merge "Optimize deserialization" from Tomasz Fixes #1330.	2016-07-25 19:52:28 +03:00
Tomasz Grabiec	cea4957a11	serializer: Avoid copying when deserializing bytes_ostream	2016-07-25 17:35:42 +02:00
Tomasz Grabiec	22664b9b7f	serializer: Optimize serializer<>::skip() for remaining types	2016-07-25 17:35:42 +02:00
Tomasz Grabiec	2e121d4b18	serializer: Implement serializer<std::vector<T>>::skip() using skip_array<T>()	2016-07-25 17:35:42 +02:00
Tomasz Grabiec	af94ed538e	serializer: Optimize serializer<std::array<T, N>>::skip()	2016-07-25 17:35:42 +02:00
Tomasz Grabiec	fd5ccce919	serializer: Move skip() to serializer.hh It is part of the core API, like serialize() and deserialize().	2016-07-25 17:35:42 +02:00
Tomasz Grabiec	d3658b33da	tests: Add test for skip() not doing full deserialization	2016-07-25 17:35:42 +02:00
Tomasz Grabiec	c3707ff754	idl: Avoid full deserialization in skip() Fixes #1330.	2016-07-25 17:35:34 +02:00
Tomasz Grabiec	033312f686	serializer: Add serializer<enum_set<T>>	2016-07-25 17:22:28 +02:00
Tomasz Grabiec	1ddba66861	serializer: Add serializer<std::unique_ptr<T>>	2016-07-25 17:22:28 +02:00
Tomasz Grabiec	4ec29d88d3	serializer: Add serializer<sstring>	2016-07-25 17:22:28 +02:00
Tomasz Grabiec	517a501ace	serializer: Add serializer<std::experimental::optional<T>>	2016-07-25 17:22:28 +02:00
Tomasz Grabiec	c67e047b92	serializer: Add serializer<bytes_ostream>	2016-07-25 17:22:28 +02:00
Tomasz Grabiec	1bc63a133b	serializer: Add serializer<bytes>	2016-07-25 17:22:28 +02:00
Tomasz Grabiec	51e25cb50e	serializer: Add serializer<std::chrono::time_point<Clock, Duration>>	2016-07-25 17:22:25 +02:00
Tomasz Grabiec	f965e64a05	serializer: Add serializer<std::map<K, V>>	2016-07-25 17:22:21 +02:00
Tomasz Grabiec	5ffaccfa7d	serializer: Add serializer<std::array<T, N>>	2016-07-25 17:20:52 +02:00
Tomasz Grabiec	43a69e64f6	serializer: Add serializer<std::chrono::duration<T, Ratio>>	2016-07-25 17:20:08 +02:00
Tomasz Grabiec	445f763fa3	serializer: Add serializer<std::vector<T>>	2016-07-25 17:20:08 +02:00
Tomasz Grabiec	953fce3f7b	serializer: Define serializer<> specializations for integral types	2016-07-25 17:19:50 +02:00
Avi Kivity	c908e630f0	Merge seastar upstream * seastar 2425aab...64ae228 (1): > Merge "Adding prometheus API support" from Amnon	2016-07-25 15:32:40 +03:00
Avi Kivity	45e1274064	Merge "Add GDB commands for working with thread lists" from Tomasz	2016-07-25 15:32:02 +03:00
Tomasz Grabiec	0b8aafd72c	scylla-gdb.py: Introduce "scylla thread apply all ..." Similar to gdb's "thread apply all". Executes given command in the context of each seastar thread. For example to print backtrace of all threads: scylla thread apply-all bt	2016-07-25 12:40:47 +02:00
Tomasz Grabiec	2d6341d4cb	scylla-gdb.py: Introduce "scylla threads" command Lists all seastar threads. Example: (gdb) scylla threads [shard 1] (seastar::thread_context) 0x602008c9aa00 [shard 1] (seastar::thread_context) 0x602008c9ca00 [shard 1] (seastar::thread_context) 0x602008cf5800 [shard 1] (seastar::thread_context) 0x602008cbe000 [shard 1] (seastar::thread_context) 0x602008c4bc00 [shard 2] (seastar::thread_context) 0x601008d6b800 [shard 2] (seastar::thread_context) 0x601008d89400 [shard 2] (seastar::thread_context) 0x601008c95a00 [shard 0] (seastar::thread_context) 0x600008d84400 [shard 0] (seastar::thread_context) 0x6000000a1600 [shard 0] (seastar::thread_context) 0x600008ce9a00 [shard 0] (seastar::thread_context) 0x600008df9a00 [shard 0] (seastar::thread_context) 0x600008dfee00 [shard 0] (seastar::thread_context) 0x600008d85800 [shard 0] (seastar::thread_context) 0x600008df9000 [shard 0] (seastar::thread_context) 0x600008d82c00 [shard 0] (seastar::thread_context*) 0x600008cece00	2016-07-25 12:32:48 +02:00
Tomasz Grabiec	0e1b21eab6	scylla-gdb.py: Make intrusive_list() work with newer version of boost	2016-07-25 12:32:48 +02:00
Duarte Nunes	d8a4bd6b1a	sstables: Remove duplication in extract_clustering_key This patch removes some duplicated code in extract_clustering_key(), which is already handled in composite_view. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469397806-8067-1-git-send-email-duarte@scylladb.com>	2016-07-25 10:07:25 +02:00
Duarte Nunes	b2278697ce	sstables: Remove superfluous call to check_static() When building a column we're calling check_static() two times; refactor things a bit so that this doesn't happen and we reuse the previous calculation. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469397748-7987-1-git-send-email-duarte@scylladb.com>	2016-07-25 10:06:56 +02:00
Duarte Nunes	7a81553d17	compound_compat: Only compound values can be static If a composite is not a compound, then it doesn't carry a length prefix where static information is encoded. In its absence, a non-compound composite can never be static. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469397561-7748-1-git-send-email-duarte@scylladb.com>	2016-07-25 10:05:19 +02:00
Avi Kivity	f2939b125e	dist: reduce kernel's inclination to OOM-kill Scylla Scylla is likely the most important process on the machine, make the kernel less inclined to kill it, on systemd-enabled hosts. See #1393. Message-Id: <1468847986-9429-1-git-send-email-avi@scylladb.com>	2016-07-25 10:00:19 +02:00
Duarte Nunes	5c4a2044d5	thrift: Fail when creating mixed CF This patch ensures we fail when creating a mixed column family, either when adding columns to a dynamic CF through updated_column_family() or when adding a dynamic column upon insertion. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469378658-19853-1-git-send-email-duarte@scylladb.com>	2016-07-25 10:35:37 +03:00
Duarte Nunes	560cc12fd7	thrift: Correctly translate no_such_column_family The no_such_column_family exception is translated to InvalidRequestException instead of to NotFoundException. `8991d35231` exposed this problem. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469376674-14603-1-git-send-email-duarte@scylladb.com>	2016-07-25 10:35:07 +03:00
Avi Kivity	df346d8b49	Merge "thrift: Implements describe_splits verbs" from Duarte "This patchset implements the describe_splits and describe_splits_ex verbs in Thrift. It also contains a couple of related fixes."	2016-07-25 10:08:00 +03:00
Raphael S. Carvalho	56ef3d4fde	sstables: get rid of magic number Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <c0b3acf58aa6f7632bf15a2dd0dcac4d1b45d444.1469389290.git.raphaelsc@scylladb.com>	2016-07-25 10:06:51 +03:00
Takuya ASADA	b5bb702b35	dist: add dependency for lspci Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1469429320-13645-1-git-send-email-syuu@scylladb.com>	2016-07-25 10:02:30 +03:00
Duarte Nunes	ab08561b89	thrift: Implement describe_splits verb This patch implements the describe_splits verb on top of describe_splits_ex. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:44:07 +00:00
Duarte Nunes	472c23d7d2	thrift: Implement describe_splits_ex verb This patch implements the describe_splits_ex verb by querying the size_estimates system table for all the estimates in the specified token range. If the keys_per_split argument is bigger than the estimated partitions count, then we merge ranges until keys_per_split is met. Note that because the tokens can't be split any further, keys_per_split might be less than the reported number of keys in one or more ranges. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:44:01 +00:00
Duarte Nunes	ecfa04da77	system_keyspace: Add query_size_estimates() function The query_size_estimates() function queries the size_estimates system table for a given keyspace and table, filtering out the token ranges according to the specified tokens. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:43:58 +00:00
Duarte Nunes	d984cc30bf	size_estimates_recorder: Fix stop() This patch fixes stop() by checking if the current CPU instead of whether the service is active (which it won't be at the time stop() is called). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:43:58 +00:00
Duarte Nunes	e16f3f2969	system_keyspace: Avoid pointers in range_estimates This patch makes range_estimates a proper struct, where tokens are represented as dht::tokens rather than dht::ring_position*. We also pass other arguments to update_ and clear_size_estimates by copy, since one will already be required. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-24 22:43:35 +00:00
Asias He	2f4cd86809	random_partitioner: Implement random_partitioner Cassandra 1.x clusters often use RandomPartitioner. Supporting RandomPartitioner will allow easier migration to Scylla. Tests are added to make sure scylla generates the same token as Cassandra does for the same partition key. Fixes #1438 Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>	2016-07-24 16:25:25 +03:00
Avi Kivity	a0369a6d5b	Merge seastar upstream * seastar 9d1db3f...2425aab (5): > Merge "Track all threads globally" from Tomasz > net: provide remote address on native_server_socket_impl<Protocol>::accept() > Introduce install-dependencies.sh > reactor: make sure a poll cycle always happens when later is called > Merge "rpc: various fixes and cleanups" from Gleb	2016-07-24 16:22:05 +03:00
Avi Kivity	b339337287	Merge "Add more tools to the GDB script" from Tomasz	2016-07-24 16:21:31 +03:00
Duarte Nunes	2be45c4806	thrift: Handle and convert invalid_request_exception This patch converts an exceptions::invalid_request_exception into a Thrift InvalidRequestException instead of into a generic one. This makes TitanDB work correctly, which expects an InvalidRequestException when setting a non-existent keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469362086-1013-1-git-send-email-duarte@scylladb.com>	2016-07-24 14:08:58 +02:00
Duarte Nunes	b5968ac244	sstables: Fix format string This patch fixes a format string by using {} instead of %s. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469352081-2167-1-git-send-email-duarte@scylladb.com>	2016-07-24 14:01:04 +02:00
Avi Kivity	de62285591	bloom_filter: use correct types 'long' is not a defined size. It happens to match Java's long on Linux x86_64, but may not on other platforms (e.g. Windows x64). Message-Id: <1469352705-1079-1-git-send-email-avi@scylladb.com>	2016-07-24 14:00:37 +02:00
Avi Kivity	900639915d	bloom_filter: fix overflow for large filters We use ::abs(), which has an int parameter, on long arguments, resulting in incorrect results. Switch to std::abs() instead, which has the correct overloads. Fixes #1494. Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com>	2016-07-24 11:31:26 +03:00
Vlad Zolotarov	9423c13419	cql_server::connection::process_prepare(): don't std::move() a shared_ptr captured by reference in value_of() lambda A seastar::value_of() lambda used in a trace point was doing the unthinkable: it called std::move() on a value captured by reference. Not only did it compile(!!!) but it also actually std::move()ed the shared_ptr before it was used in a make_result() which naturally caused a SIGSEGV crash. Fixes #1491 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1469193763-27631-1-git-send-email-vladz@cloudius-systems.com>	2016-07-22 16:32:21 +03:00
Tomasz Grabiec	0db79c7153	scylla-gdb.py: Introduce "scylla mem-range" command Prints memory range belonging to current shard.	2016-07-22 15:19:48 +02:00
Tomasz Grabiec	26e66bf49e	scylla-gdb.py: Introduce "scylla thread" command Switches into a seastar thread. "scylla unthread" switches back. Example: ``` (gdb) scylla thread 0x6000000e1800 Switched to thread 1, (seastar::thread_context*) 0x6000000e1800 (gdb) bt #0 seastar::thread_context::switch_out (this=0x6000000e1800) at core/thread.cc:104 #1 0x00000000004cfb07 in future<>::wait() (this=0x600008ca2c70) at core/future.hh:817 #2 0x0000000000f7752c in future<>::get() (this=0x600008ca2c70) at /home/tgrabiec/src/scylla2/seastar/core/future.hh:787 ... #16 seastar::thread_context::main (this=0x6000000e1800) at core/thread.cc:166 #17 0x000000000051a702 in seastar::thread_context::s_main (lo=<optimized out>, hi=<optimized out>) at core/thread.cc:157 #18 0x00007f2c34861f20 in ?? () from /lib64/libc.so.6 #19 0x0000000000000000 in ?? () (gdb) scylla unthread Switched to thread 1 ```	2016-07-22 15:19:48 +02:00
Tomasz Grabiec	d6fc9ad48e	scylla-gdb.py: Break down code into finer abstractions	2016-07-22 15:19:48 +02:00
Raphael S. Carvalho	c4f34f5038	compaction: do not convert timestamp resolution to uppercase C* only allows timestamp resolution in uppercase, so we shouldn't be forgiving about it, otherwise migration to C* will not work. Timestamp resolution is stored in compaction strategy options of schema BTW. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d64878fc9bbcf40fd8de3d0f08cce9f6c2fde717.1469133851.git.raphaelsc@scylladb.com>	2016-07-22 07:03:27 +01:00
Avi Kivity	106e3703d9	sstables: stop using unaligned_cast unaligned_cast violates strict aliasing, and causes code misgeneration on gcc 6. Replace it with read_be/write_be, which are nicer anyway. Message-Id: <1469122850-7511-1-git-send-email-avi@scylladb.com>	2016-07-22 07:03:08 +01:00
Raphael S. Carvalho	56a50784f8	compaction_manager: make registration of sstables and weight exception safe Compacting sstables and weight could be left unregistered in event of an exception. Let's make it safe by using a RAII approach. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <f2cf9d0c12f22046293bd2185ef14ede3f4d63d4.1469114161.git.raphaelsc@scylladb.com>	2016-07-22 07:02:48 +01:00
Vlad Zolotarov	4647ad9d8a	tracing: set a default TTL for system_traces tables when they are created Fixes #1482 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1469104164-4452-1-git-send-email-vladz@cloudius-systems.com>	2016-07-22 07:02:29 +01:00
Vlad Zolotarov	57b58cad8e	SELECT tracing instrumentation: improve inter-nodes communication stages messages Add/fix "sending to"/"received from" messages. With this patch the single key select trace with a data on an external node looks as follows: Tracing session: 65dbfcc0-4f51-11e6-8dd2-000000000001 activity \| timestamp \| source \| source_elapsed ---------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+---------------- Execute CQL3 query \| 2016-07-21 17:42:50.124000 \| 127.0.0.2 \| 0 Parsing a statement [shard 1] \| 2016-07-21 17:42:50.124127 \| 127.0.0.2 \| -- Processing a statement [shard 1] \| 2016-07-21 17:42:50.124190 \| 127.0.0.2 \| 64 Creating read executor for token 2309717968349690594 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 1] \| 2016-07-21 17:42:50.124229 \| 127.0.0.2 \| 103 read_data: sending a message to /127.0.0.1 [shard 1] \| 2016-07-21 17:42:50.124234 \| 127.0.0.2 \| 108 read_data: message received from /127.0.0.2 [shard 1] \| 2016-07-21 17:42:50.124358 \| 127.0.0.1 \| 14 read_data handling is done, sending a response to /127.0.0.2 [shard 1] \| 2016-07-21 17:42:50.124434 \| 127.0.0.1 \| 89 read_data: got response from /127.0.0.1 [shard 1] \| 2016-07-21 17:42:50.124662 \| 127.0.0.2 \| 536 Done processing - preparing a result [shard 1] \| 2016-07-21 17:42:50.124695 \| 127.0.0.2 \| 569 Request complete \| 2016-07-21 17:42:50.124580 \| 127.0.0.2 \| 580 Fixes #1481 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1469112271-22818-1-git-send-email-vladz@cloudius-systems.com>	2016-07-21 19:46:43 +03:00
Tomasz Grabiec	5e8f0efc85	schema_tables: Fix hang during keyspace drop Fixes #1484. We drop tables as part of keyspace drop. Table drop starts with creating a snapshot on all shards. All shards must use the same snapshot timestamp which, among other things, is part of the snapshot name. The timestamp is generated using supplied timestamp generating function (joinpoint object). The joinpoint object will wait for all shards to arrive and then generate and return the timestamp. However, we drop tables in parallel, using the same joinpoint instance. So joinpoint may be contacted by snapshotting shards of tables A and B concurrently, generating timestamp t1 for some shards of table A and some shards of table B. Later the remaining shards of table A will get a different timestamp. As a result, different shards may use different snapshot names for the same table. The snapshot creation will never complete because the sealing fiber waits for all shards to signal it, on the same name. The fix is to give each table a separate joinpoint instance. Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>	2016-07-21 19:14:57 +03:00
Vlad Zolotarov	e1480cd00d	tracing: add a word "shard" to a "thread" value Add a word "shard" to a "thread" column value. From now its format is "shard X", where X is a shard index. Fixes #1480 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1469099732-23334-1-git-send-email-vladz@cloudius-systems.com>	2016-07-21 15:08:09 +03:00
Paweł Dziepak	1c7c944eee	Merge "Thrift: small cleanups" from Duarte "This patchset adds several small thrift related cleanups."	2016-07-21 11:56:51 +01:00
Duarte Nunes	8991d35231	thrift: Use database::find_schema directly This patch changes lookup_schema() so it directly calls database::find_schema() instead of going through database::find_column_family(). It also drops conversion of the no_such_column_family exception, as that is already handled at a higher layer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-21 10:37:38 +00:00
Duarte Nunes	038d42c589	thrift: Remove hardcoded version constant ...and use the one in thrift_server.hh instead. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-21 10:37:33 +00:00
Duarte Nunes	8bb43d09b1	thrift: Remove unused with_cob_dereference function Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-21 10:13:55 +00:00
Paweł Dziepak	8a386a51bd	Merge "Don't cache wide partitions" from Piotr "When reading a partition try to read it all but once more bytes are read than a given limit we decide that partition is wide and we don't cache it. Instead we retry the read with clustering key filtering applied."	2016-07-21 10:24:25 +01:00
Benoît Canet	4ce7bced27	docker: Add documentation page for Docker Hub Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466438296-5593-1-git-send-email-benoit@scylladb.com>	2016-07-21 12:22:57 +03:00
Yoav Kleinberger	d1d1be4c1a	docker: bring docker image closer to a more 'standard' scylla installation Previously, the Docker image could only be run interactively, which is not conducive to running clusters. This patch makes the docker image run in the background (using systemd). This makes the docker workflow similar to working with virtual machines, i.e. the user launches a container, and once it is running they can connect to it with docker exec -it <container_name> bash and immediately use `cqlsh` to control it. In addition, the configuration of scylla is done using established scripts, such as `scylla_dev_mode_setup`, `scylla_cpuset_setup` and `scylla_io_setup`, whereas previously code from these scripts was duplicated into the docker startup file. To specify seeds for making a cluster, use the --seeds command line argument, e.g. docker run -d --privileged scylladb/scylla docker run -d --privileged scylladb/scylla --seeds 172.17.0.2 other options include --developer-mode, --cpuset, --broadcast-address The --developer-mode option is on by default - so that we don't fail users who just want to play with this. The Dockerfile entrypoint script was rewritten as a few Python modules. The move to Python is merited because: * Using `sed` to manipulate YAML is fragile * Lack of proper command line parsing resulted in introducing ad-hoc environment variables * Shell scripts don't throw exceptions, and it's easy to forget to check exit codes for every single command I've made an effort to make the entrypoint `go` script very simple and readable. The gory details are hidden inside the other python modules. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1468938693-32168-1-git-send-email-yoav@scylladb.com>	2016-07-21 12:20:39 +03:00
Duarte Nunes	a436cf945c	thrift: Omit regular columns for dynamic CFs This patch skips adding the auto-generated regular column when describing a dynamic Column family for the describe_keyspace(s) verbs. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469091720-10113-1-git-send-email-duarte@scylladb.com>	2016-07-21 12:06:10 +03:00
Pekka Enberg	d90f8dc2d9	Merge "tracing: cleanup" from Vlad	2016-07-21 11:57:54 +03:00
Avi Kivity	a9e07b292b	Merge seastar upstream * seastar 103543a...9d1db3f (8): > reactor: limit task backlog > iotune: Fix SIGFPE with some executions > Merge "Preparation for protobuf" from Amnon > byteorder: add missing cpu_to_be(), be_to_cpu() functions > rpc: fix gcc-7 compilation error > reactor: Register the smp metrics disabled > scollectd: Allow creating metric that is disabled > Merge "Propagate timeout to a server" from Gleb	2016-07-21 10:54:48 +03:00
Piotr Jastrzebski	7d29cdf81f	Add tests for wide partition handling in cache. They shouldn't be cached. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:48:09 +02:00
Piotr Jastrzebski	37a7d49676	Add collectd counter for uncached wide partitions. Keep track of every read of wide partition that's not cached. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:49 +02:00
Piotr Jastrzebski	636a4acfd0	Add flag to configure max size of a cached partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:20 +02:00
Piotr Jastrzebski	98c12dc2e2	Try to read whole streamed_mutation up to limit If limit is exceeded then return the streamed_mutation and don't cache it. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:35:35 +02:00
Piotr Jastrzebski	0d39bb1ad0	Implement mutation_from_streamed_mutation_with_limit If mutation is bigger than this limit it won't be read and mutation_from_streamed_mutation will return empty optional. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:35:23 +02:00
Avi Kivity	5e3b019447	Merge "Fix sstable reader for duplicated range tombstones" from Paweł "This series fixes sstable reader so that it can handle duplicated range tombstones which may appear if promoted index is used."	2016-07-21 10:13:29 +03:00
Vlad Zolotarov	276ef041d2	tracing: use "trace" log level instead of "debug". Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-20 19:50:40 +03:00
Vlad Zolotarov	d7d72c4cd4	tracing: "inline" cleanup - Don't use inline for templates. - Put "inline" qualifier for out-of-class defined methods where they are defined and not where they are declared. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-20 19:49:30 +03:00
Vlad Zolotarov	5376b053f9	tracing: use seastar::format() for formatted trace() Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-20 19:49:30 +03:00
Duarte Nunes	8d87976fa1	tracing: Downgrade debug level log messages Two periodic debug log level messages triggered by tracing::write_timer_callback() quickly fill the logs, so this patch downgrades them to trace level. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Reviewed-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1469003034-1147-1-git-send-email-duarte@scylladb.com>	2016-07-20 11:34:02 +03:00
Pekka Enberg	aff8cf319d	db/config: Start Thrift server by default We have Thrift support now so start the server by default. Message-Id: <1469002000-26767-1-git-send-email-penberg@scylladb.com>	2016-07-20 09:25:44 +01:00
Avi Kivity	5ceb55827d	Merge "Add more tools to the gdb script" from Tomasz	2016-07-20 11:21:58 +03:00
Tomasz Grabiec	d2f0711608	scylla-gdb: Fix bounds checking in scylla ptr command Message-Id: <1468951987-10184-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 11:20:45 +03:00
Duarte Nunes	64dff69077	thrift: Actually concatenate strings This patch fixes concatenating a char[] with an int by using sprint instead of just increasing the pointer. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468971542-9600-1-git-send-email-duarte@scylladb.com>	2016-07-20 11:08:44 +03:00
Tomasz Grabiec	0d26294fac	database: Add table name to log message about sealing Message-Id: <1468917744-2539-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:31 +03:00
Tomasz Grabiec	a0832f08d2	schema_tables: Add more logging Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>	2016-07-20 10:12:00 +03:00
Pekka Enberg	7b5c2266b6	Merge "date tiered compaction strategy options" from Raphael "After this patchset, user can now tune date tiered compaction strategy by playing with its parameters."	2016-07-20 09:13:08 +03:00
Raphael S. Carvalho	cf54af9e58	tests: add new test for date tiered strategy This test sets the time window to 1 hour and checks that the strategy works accordingly. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-19 19:20:40 -03:00
Raphael S. Carvalho	eaa6e281a2	compaction: implement date tiered compaction strategy options Now date tiered compaction strategy will take into account the strategy options which are defined in the schema. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-19 19:20:35 -03:00
Tomasz Grabiec	93a942dc53	scylla-gdb.py: Add "scylla mem-ranges" command Prints memory ranges corresponding to seastar heap. The output can be easily turned into arguments to "find".	2016-07-19 20:23:40 +02:00
Tomasz Grabiec	724f99641f	scylla-gdb.py: Add "scylla shard" command Beams you up to the thread corresponding to given seastar shard. Example: (gdb) scylla shard 0 Switched to thread 49	2016-07-19 20:23:40 +02:00
Tomasz Grabiec	d5fb452e94	scylla-gdb.py: Add "scylla apply" command Executes given command on all shards.	2016-07-19 20:23:40 +02:00
Tomasz Grabiec	720d8149ba	scylla-gdb.py: Add "scylla timers" command Lists all active timers on current shard.	2016-07-19 20:23:40 +02:00
Tomasz Grabiec	6e4506a1f2	scylla-gdb.py: Add std::array wrapper which allows iteration	2016-07-19 20:23:40 +02:00
Tomasz Grabiec	29af5b22da	scylla-gdb.py: Import intrusive_list() function Imported from: https://github.com/cloudius-systems/osv/blob/master/scripts/loader.py	2016-07-19 20:23:40 +02:00
Avi Kivity	dc50b845b4	Merge seastar upstream * seastar 823bc05...103543a (1): > core: add a seastar::format()	2016-07-19 19:09:42 +03:00
Avi Kivity	1a1b7fe3f2	Merge "CQL Tracing patch bomb" from Vlad "This series includes the following: - Introduction of a formatted message support in trace(). - Major rename: s/flush_/write_/, s/flush()/kick()/, s/store_/write_/. - Some cosmetic fixes found on the way. - Fix a bug in a shutdown flow. - Instrumentation to MUTATE, PREPARE, EXECUTE and BATCH flow and some related changes. - A patch that aligns the QUERY tracing format with the Origin. - Methods and functions description in tracing/trace_state.hh."	2016-07-19 18:46:59 +03:00
Vlad Zolotarov	7c590295ef	SELECT instrumentation: add a nice trace point Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:59 +03:00
Vlad Zolotarov	a197323b47	tracing::trace_state.hh: Add descriptions for main methods and functions Add a proper description to a tracing::trace() that clarifies that the tracing message string and the positional parameters are going to be copied if tracing state is initialized. Add a description for trace_state::begin() methods and for a tracing::begin() helper function. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:59 +03:00
Vlad Zolotarov	b36b69c1d6	service::storage_proxy: remove a default value for a tracing::trace_state_ptr parameter Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:59 +03:00
Vlad Zolotarov	baa6496816	service::storage_proxy: READ instrumentation: store trace state object in abstract_read_executor Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:59 +03:00
Vlad Zolotarov	b0a39f210d	transport: CQL tracing: QUERY instrumentation: align the session creation parameters with origin - Don't put the query name as a 'request' but rather save it as one of entries in a 'params' map. - Save some additional query parameters in 'params'. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	962bddf8fe	transport: CQL tracing: instrument a BATCH command Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	d21eaabcfe	transport: CQL tracing: instrument EXECUTE command Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	89a49c346c	tracing::trace_state: add begin() overload for seastar::value_of given as a "request" parameter. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	1f9b858d83	cql3: prepared_statement: add raw_cql_statement field This field will contain an original statement given to a PREPARE command. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	147dd72517	transport: CQL tracing: instrument a PREPARE command Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	be88074f47	service::query_state: get rid of begin_tracing() Use tracing::begin() directly. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	982d301178	service::client_state: add a const version of get_trace_state() tracing::begin() requires a non-const version, tracing::trace() requires a const version. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	da56aa4256	service::client_state: rename: trace_state_ptr() -> get_trace_state() Rename the method for consistency with other classes methods returning the same value. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	4c16df9e4c	service: instrument MUTATE flow with tracing Store the trace state in the abstract_write_response_handler. Instrument send_mutation RPC to receive an additional rpc::optional parameter that will contain optional<trace_info> value. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	54a758dfff	cql3::select_statement: simplify the tracing code by using a tracing::make_trace_info() helper Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	952dc8a3d4	query_state: add get_trace_state() method Adding this method allows to use tracing helper functions and remove the no longer needed accessors in the query_state. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	0552ffcd17	service/storage_proxy: tracing: adjust the existing SELECT instrumentation with the new trace() interface From now on trace_state::trace() is able to receive the sprint-ready format string with the arguments that will be applied only during the flush event. This patch also optimizes the way the source address is evaluated - do it only once instead of twice if tracing is requested. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	c1bb4d147d	query::read_command: std::move() std::experimental::optional when initializing trace_info Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	0689843e79	tracing::trace_state: add method to set the session's "params" map entries Sometimes we want to be able to set "params" map after we started a tracing session, e.g. when the parameters values, like a consistency level parsed from the "options" part of a binary frame, are available only after some heavy part of a flow we would like to trace. This patch includes the following changes: - No longer pass a map to the begin(). - Limit the parameters to the known set. - Define a method to set each such parameter and save its value till the final sstring->sstring map is created. - Construct the final sstring->sstring map in the destructor of the trace_state object in order to defer all the formatting to be after the traced flow. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	9c0a725c56	tracing: add a _local_tracing to a i_tracing_backend_helper A backend helper has to constantly communicate with the corresponding tracing::tracing instance. Saving a reference to the tracing::tracing instance will save us a lot of tracing::get_local_tracing_instance() calls and thus a lot of dereferencing. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	2bb054748e	tracing: record events' time stamps - Extend the i_tracing_backend_helper interface to accept the event record timestamp. - Grab the current timestamp when the event record is taken. - Add the instrumentation to the trace_keyspace_helper to create a unique time-UUID from a given std::chrono::duration object. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	f64f27beb9	utils: add get_time_UUID(system_clock::time_point) Creates a type 1 UUID (time-based UUID) with the given system_clock::time_point Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:58 +03:00
Vlad Zolotarov	06d4221382	tracing: add tracing::make_trace_info() helper This helper returns an std::experimental::optional<trace_info> which is initialized or not initialized depending on whether a given trace_state_ptr is initialized or not. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Vlad Zolotarov	7a5fc9fcdc	tracing::trace_state: add const qualifiers to a trace_state_ptr parameter Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Vlad Zolotarov	b0673aabd5	tracing: fix a logger name Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Vlad Zolotarov	da4836becc	tracing::trace_state: add support for a formatted message in trace() Add support for passing a format string plus positional parameters for creation of a trace point message. Format string should be given in a fmt library native format described here: http://fmtlib.net/latest/syntax.html#syntax . Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Vlad Zolotarov	ee0e986e96	tracing: make a service shutdown stages more strict kick() backend during shutdown and restrict accessing a backend after that. Flush pending records when service is being shut down. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Vlad Zolotarov	6e38133f82	tracing: prevent a destruction of a tracing::tracing while it's used Prevent the destruction of tracing::tracing instances while there are still tracing::trace_state objects that are using it: - Make tracing::tracing inherit from seastar::async_sharded_service<tracing::tracing>. - Grab a tracing::tracing.shared_from_this() in each tracing::trace_state object using it. - Use a saved pointer to the local tracing::tracing instance in a destructor instead of accessing it via tracing::get_local_tracing_instance() to avoid "local is not initialized" assert when sessions are being destroyed after the service was stopped. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Vlad Zolotarov	a5022a09a4	tracing: use 'write' instead of 'flush' and 'store' for consistency with seastar's API In names of functions and variables: s/flush_/write_/ s/store_/write_/ In a i_tracing_backend_helper: s/flush()/kick()/ Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-07-19 18:21:57 +03:00
Paweł Dziepak	b405ff8ad2	tests/sstables: test reading sstable with duplicated range tombstones Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-19 15:13:27 +01:00
Paweł Dziepak	04f2c278c2	sstables: avoid recursion in sstable_streamed_mutation::read_next() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-19 15:12:43 +01:00
Paweł Dziepak	08032db269	sstables: protect against duplicated range tombstones Promoted index may cause sstable to have range tombstones duplicated several times. These duplicates appear in the "wrong" place since they are smaller than the entity preceding them. This patch ignores such duplicates by skipping range tombstones that are smaller than previously read ones. Moreover, these duplicated range tombstones may appear in the middle of a clustering row, so the sstable reader has also gained the ability to merge parts of the row in such cases. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-19 15:07:12 +01:00
Paweł Dziepak	50469e5ef3	tests: extract streamed_mutation assertions Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-19 14:45:36 +01:00
Pekka Enberg	cbf5283a93	Merge "Populate size_estimates table" from Duarte "This patchset implements the size_estimates_recorder, which periodically writes estimations for all the non-system column families in the size_estimates system table. This table is updated per schema with a set of token ranges and the associated estimations of how many partitions there are and their mean size. Fixes #352"	2016-07-19 14:31:12 +03:00
Duarte Nunes	9ffdf4a5cd	db: Implement size_estimates_recorder This patch implements the size_estimates_recorder, which periodically writes estimations for all the non-system column families in the size_estimates system table. The size_estimates_recorder class corresponds to the one in Cassandra's SizeEstimatesRecorder.java. Estimation is carried out by shard 0. Since we're estimating based on data in shared sstables, having multiple shards doing this would skew the results. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-19 09:44:58 +00:00
Avi Kivity	86661db178	Merge seastar upstream * seastar ceb0c94...823bc05 (1): > Revert "util::lazy_eval: add an implicit cast operator overload"	2016-07-19 12:02:44 +03:00
Duarte Nunes	f8f61cf246	system_keyspace: Record and clear size estimates This patch implements functions that allow the size_estimates system table to be updated and cleared. The size_estimates table is updated per schema with a set of token ranges and the associated estimations of how many partitions there are and their mean size. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	3518db531e	database: Get non-system column_families This patch adds a utility function that allows fetching the set of column_families that do not belong to the system keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	4bc00c2055	database: Expose selection of sstables by a range This patch allows a set of a column_family's sstables to be selected according to a range of ring_positions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	d7ae25c572	range: Make transform template arguments deducible This patch makes it so that the template arguments of range<T>::transform are more easily deducible by the compiler. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	3c05ea2f80	types: Add to_bytes_view for sstrings This patch adds an overload of to_bytes_view for sstrings Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Tomasz Grabiec	ce768858f5	types: Fix update_types() We should replace the old type, not insert the new type before the old type. Fixes #1465 Message-Id: <1468861076-20397-1-git-send-email-tgrabiec@scylladb.com>	2016-07-18 20:14:22 +03:00
Avi Kivity	f886f7a2f5	Merge seastar upstream * seastar a45823a...ceb0c94 (2): > print: switch to fmtlib > logging: simplify stringer array building	2016-07-18 19:37:34 +03:00
Avi Kivity	d261927fa3	logalloc: change sprint() of a pointer to use void* explicitly Otherwise, fmtlib dislikes it.	2016-07-18 19:37:16 +03:00
Avi Kivity	1d1b03a7cb	cql3: change sprint() of a pointer to use void* explicitly Otherwise, fmtlib dislikes it.	2016-07-18 19:36:35 +03:00
Raphael S. Carvalho	7b9cf528ad	tests: fix occasional failure in date tiered test That was a bug in the test itself. It could happen that an sstable would incorrectly belong to the next time window if the current minute is approaching its end. The fix is to have all sstables that we want in the same time window share the same min/max timestamp. Fixes #1448. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ee25d49e7ed12b4cf7d018a08163404c3d122e56.1468782787.git.raphaelsc@scylladb.com>	2016-07-18 15:18:29 +02:00
Paweł Dziepak	4497204b7d	streamed_mutation: do not leave mutation in an invalid state This patch avoids moving entries from range tombstones and clustering rows sets in streamed_mutation_from_mutation(). Such action breaks these sets as the entries will be left in some unknown state. Instead, the sets are being broken in a supported and predictable way using unlink_leftmost_without_rebalance(). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1468843205-18852-1-git-send-email-pdziepak@scylladb.com>	2016-07-18 15:14:21 +02:00
Avi Kivity	9d1b813f45	Merge seastar upstream * seastar d699205...a45823a (5): > rpc: do not call shutdown function on already closed fd > log: Do not crash if logger is invoked from non-reactor thread > rpc: remove unaligned_cast and reinterpret_cast uses > unaligned: note unaligned_cast<> is deprecated > byteorder: add unaligned read/write helpers Fixes #1463.	2016-07-18 15:24:43 +03:00
Avi Kivity	60491476e3	Merge "thrift: Add authentication and authorization" from Duarte "This patchset implements the login verb to enable authentication in the thrift API, and it adds access control to the already implemented verbs."	2016-07-18 11:32:32 +03:00
Duarte Nunes	b6663f050d	thrift: Add authorization for DML verbs Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Duarte Nunes	63354320b8	thrift: Add authorization to thrift DDL verbs This patch adds authorization to the DDL thrift verbs. Since checking for authorization is asynchronous, we now need to copy the verb arguments so they can be accessed from the continuations. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Duarte Nunes	3c389ba871	client_state: Add has_schema_access function This function is similar to has_column_family_access, but skips validating if the specified keyspace and column family names map to a valid schema, as it already takes one as an argument. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Duarte Nunes	dbbf4b3cc2	thrift: Group mutation map by column family This patch transforms the mutation map, a map of keys to a map of column families to mutations, into a map of column families to a map of keys to mutations. This is a more natural organization, as things like checking access permissions are done by column family. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Duarte Nunes	f14628dc49	thrift: Introduce with_schema function This is a wrapper around with_cob, which fetches a schema and forwards it to a supplied function. The patch also removes superfluous return instructions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Duarte Nunes	09a5560b1b	thrift: Validate login This patch validates that a user is correctly logged in (if authentication is required) for the required verbs. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Duarte Nunes	a3e507eb1c	thrift: Implement login verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-17 17:38:23 +00:00
Amnon Heiman	d096a6762a	scylla_setup: Ask whether to start scylla-housekeeping The scylla-server.service will try to start the scylla-housekeeping. This patch adds a question to scylla_setup asking whether to enable the version check; if the answer is no, scylla-housekeeping will be masked. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1468741129-1977-1-git-send-email-amnon@scylladb.com>	2016-07-17 17:57:12 +03:00
Nadav Har'El	c647d917e0	sstables: move to_bytes_view to header file Move the to_bytes_view(temporary_buffer<char>) function from source file to header file where it can be used in more places. This saves one use of reinterpret_cast (which we are now re-evaluating), and moreover, we want to use this function also in the promoted index code (to return a bytes_view from the promoted index which was saved as a temporary_buffer). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>	2016-07-17 16:29:26 +03:00
Avi Kivity	b6b35a986a	Merge seastar upstream * seastar 5e97d5f...d699205 (3): > rpc: fix race between send loop and expiration timer > rpc: fix cancellable type move operations > reactor: create new files with a more reasonable default mode	2016-07-17 13:27:23 +03:00
Paweł Dziepak	81e4952c78	row_cache: fix marking last entry as continuous Range queries need to take special care when transitioning between ranges that are read from sstables and ranges that are already in the cache. Original code in such case just started a secondary reader and told it to unconditionally mark the last entry as continuous (primary reader has already returned an element that immediately follows the range that is going to be read from sstables). However, that information may get stale. For instance, by the time secondary reader finishes reading its range the element immediately following it may get evicted from the cache thus causing continuity flag to be incorrectly set. The solution is to ensure that the element immediately after the range read from sstables is still in the cache. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1468586893-15266-1-git-send-email-pdziepak@scylladb.com>	2016-07-15 15:15:02 +02:00
Tomasz Grabiec	7328a8eff8	cql: modification_statement: Avoid copying keyspace and table names Message-Id: <1468574135-4701-1-git-send-email-tgrabiec@scylladb.com>	2016-07-15 10:36:53 +01:00
Duarte Nunes	aaa76d58ba	query: Move to_partition_range to dht namespace This patch moves to_partition_range, from the query namespace to the dht namespace, where it is a more natural fit. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468498060-19251-1-git-send-email-duarte@scylladb.com>	2016-07-15 10:41:52 +02:00
Tomasz Grabiec	32937f354e	Merge branch 'duarten/thrift/dml/v9' from git@github.com:duarten/scylla.git From Duarte: This patchset adds support for the data manipulation verbs. It defers support for super columns and mixed CFs (a static CF treated as dynamic) to later patchsets. Everything is done on top of storage_proxy; it was only necessary to modify the layers below to add support for different kinds of limits: per partition row limit, which corresponds to limiting the number of columns returned when querying a dynamic CF, and limit on the number of partitions returned, so that we can emulate the one thrift row per key model when querying dynamic CFs. Ref #399	2016-07-14 18:26:07 +02:00
Duarte Nunes	df1234d86a	thrift: Mark static CFs as non-compound By default, the schema is marked as compound regardless of the comparator. Since a composite comparator for static CFs is currently unsupported (otherwise thrift column families would be indistinguishable from CQL ones), just mark them as non-compound. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:55 +02:00
Duarte Nunes	901d4d1628	thrift: Skip CQL3 column families This patch prevents CQL3 column families from being returned to clients or subject to updates from thrift. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	92adbaab0a	thrift: Warn about unimplemented features Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	a924f14441	thrift: Validate thrift Columns Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	7c1bf41b0d	thrift: Implement truncate verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	4f440217e5	thrift: Implement remove verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	237e3b28d6	thrift: Implement insert verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	5c5056e4f9	thrift: Implement atomic_batch_mutate verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	f237b5ff19	thrift: Implement batch_mutate on top of storage_proxy So that the specified consistency level can be respected. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	12dca9fdc9	thrift: Convert thrift Mutation to internal one Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	822a315dfa	thrift: Implement get_multi_slice verb The get_multi_slice verb is used to perform multiple slices on a single row key in one operation. It takes a set of column_slices, which we normalize to not contain any overlapping ranges. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:54 +02:00
Duarte Nunes	9792a77266	range: Add deoverlap function This patch adds the deoverlap function to range.hh, which takes in a vector of possibly overlapping ranges and returns a vector of non-overlapping ranges covering the same values. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 18:20:41 +02:00
Duarte Nunes	c910a4639c	thrift: Implement get_paged_slice verb The get_paged_slice verb is similar to the get_range_slices verb, except that it doesn't take a SlicePredicate. Instead, it takes a column from which to start the query. For dynamic CFs, we use the partition_slice::specific_ranges to single out the first partition, and query starting from the start_column row. For static CFs, we issue an initial query to fetch the remainder of columns from the first partition, and at least one more query to fetch the subsequent columns until the limit is reached. This implies a performance penalty for static CFs. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	370572884c	thrift: Implement get_range_slices verb The get_range_slices verb is similar to the multiget_slice verb, except that it operates on a range of partition keys (or tokens). In origin, empty partitions are returned as part of the KeySlice, for which the key will be filled in but the columns vector will be empty. Since in our case we don't return empty partitions, we don't know which partition keys in the specified range we should return back to the client. So for now, our behavior differs from Origin. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	b872db55bd	thrift: Implement get_count verb on top of multiget_count Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	a42b7ba3f7	thrift: Implement multiget_count verb This patch implements the multiget_count verb in a similar fashion as multiget_slice, but using an accumulator that counts the returned columns instead of creating thrift ColumnOrSuperColumn objects. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	a44561870a	thrift: Implement get verb in terms of get_slice Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	db4c26d5b8	thrift: Implement get_slice in terms of multiget_slice Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	cd3a12535e	thrift: Implement multiget_slice verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	acd39d871f	thrift: Validate column names Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	4e9af0dc8e	thrift: Make read_command from SlicePredicate This patch builds a query::read_command from a SlicePredicate, for both dynamic and static column families. For dynamic CFs, restrictions on the clustering columns are added, and for static CFs, limits and ordering are defined inline by selecting the correct regular columns. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	21d0a2c764	query: Optionally send cell ttl This patch adds support to send a cell's ttl as part of a query's result. This is needed for thrift support. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	eb8f5fafb2	thrift: Add partition key validation This patch validates whether the specified partition key is not empty and under the size limit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	f57136f2f3	thrift: Make key_from_thrift take schema ref Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	e2b4cc4849	types: Add to_bytes_view function This patch adds a function that converts a reference to an std::string to a bytes_view. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	a647fea30b	schema: Add is_dynamic to thrift_schema This patch adds the is_dynamic() function to thrift_schema, which tells whether the underlying column family is dynamic or not, according to thrift rules. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	527ec2ab59	thrift: Support composite keys This patch adds support for composite comparators (which, for dynamic column families, it means composite clustering keys) and for composite keys (composite partition keys). Support for composite column names and regular columns is deferred, which will entail making compound_type an abstract_type. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	7f5ec71b1f	thrift: Extract ttl calculation Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Duarte Nunes	324b776c1b	thrift: Add lookup_schema function Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-14 15:36:23 +02:00
Avi Kivity	4ef2b1b25f	Merge seastar upstream * seastar e660d54...5e97d5f (7): > util::lazy_eval: add an implicit cast operator overload > rpc: consolidate read_(request\|response)_frame logic > rpc: handle lz4 compressor errors > iotune: provide a status dump if we can't calculate a proper number of io_queues > rpc: adjust lz4 compression for older lz4.h > Fix chunked_fifo move assignment > rpc: add missing header file protectors	2016-07-14 16:27:23 +03:00
Avi Kivity	32d670a792	Merge "Scylla-housekeeping check version" from Amnon "This series replaces the original scylla-help.py. It contains only a basic script that checks the version daily and reports if a newer version is available. The script is added as a service and will be started and shut down with scylla-server."	2016-07-14 14:58:33 +03:00
Avi Kivity	1048e1071b	db: do not create column family directories belonging to foreign keyspaces Currently, for any column family, we create a directory for it in all keyspace directories. This is incredibly awkward. Fix by iterating over just the keyspace's column families, not all column families in existence. Fixes #1457. Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>	2016-07-14 14:31:05 +03:00
Amnon Heiman	260761f2dd	rules.in: Add the scylla-timer to ubuntu This adds a rule to install the scylla-timer as part of the ubuntu package. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-14 12:46:47 +03:00
Amnon Heiman	3be9ab38e2	ubuntu.in: Add dependency to python3-requests The check version script uses the python requests package; this adds the dependency to the ubuntu package. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-14 12:46:47 +03:00
Amnon Heiman	3b9db378ac	scylla-server.install.in: Pack the scylla-housekeeping on ubuntu This adds the scylla-housekeeping to the ubuntu packaging.	2016-07-14 12:46:47 +03:00
Amnon Heiman	948140bec3	Adding a timer service for ubuntu scylla-housekeeping Ubuntu 14.04 upstart does not support timers for recurrent operations. The upstart cookbook suggests a way to mimic this functionality here: http://upstart.ubuntu.com/cookbook/#run-a-job-periodically This patch adds a service that runs the house-keeping daily. Setting it as a service ensures that it would start and stop with scylla-server service.	2016-07-14 12:46:39 +03:00
Avi Kivity	23edc1861a	db: estimate queued read size more conservatively There are plenty of continuations involved, so don't assume it fits in 1k. Message-Id: <1468429516-4591-1-git-send-email-avi@scylladb.com>	2016-07-14 11:42:24 +02:00
Avi Kivity	d3c87975b0	db: don't over-allocate memory for mutation_reader column_family::make_reader() doesn't deal with sstables directly, so it doesn't need to reserve memory for them. Fixes #1453. Message-Id: <1468429143-4354-1-git-send-email-avi@scylladb.com>	2016-07-14 10:01:42 +02:00
Paweł Dziepak	10c144ffd4	types: fix type aliasing violation Any pointer can be cast to char*, but not the other way around. This causes GCC6 to misoptimize timestamp_type_impl::from_string(). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1468413349-27267-1-git-send-email-pdziepak@scylladb.com>	2016-07-13 17:22:16 +03:00
Tomasz Grabiec	c97871d95c	migration_manager: Uncomment logging for keyspace drop Message-Id: <1468413673-6899-1-git-send-email-tgrabiec@scylladb.com>	2016-07-13 13:42:23 +01:00
Gleb Natapov	9cc076c9f3	storage_proxy: preserve endpoint's order while filtering local nodes for query filter_for_query() gets a list of endpoints sorted by preference and should preserve that order after filtering out non-local endpoints for a local query. partition() does not guarantee this while stable_partition() does, so use it instead. Fixes #1450. Message-Id: <20160713100909.GM10767@scylladb.com>	2016-07-13 13:17:28 +03:00
Tomasz Grabiec	7227c537ce	Merge branch 'pdziepak/streamed-mutations-hashing/v5' from seastar-dev.git From Paweł: This is another episode in the "convert X to streamed mutations" series. Hashing mutations (mainly for repair) is converted so that it doesn't need to rebuild whole mutation. The first part of the series changes the way streamed mutations deal with range tombstones. Since it is not necessary to make sure we write disjoint tombstones to sstables there is no need anymore for streamed mutations to produce disjoint tombstones and, consequently, no need for range tombstones to be split into range_tombstone_begin and range_tombstone_end. The second part is the actual hashing implementation. However, to ensure that the hash depends only on the contents of the mutation and not the way it is stored in different data sources, range tombstones have to be made disjoint before they are hashed. This series also ensures that any changes caused by streamed mutations to hashing and streaming do not break repair during upgrade.	2016-07-13 11:24:00 +02:00
Duarte Nunes	674afc52bc	compound_test: Test singular composite_view::explode() This patch adds a test case for composite_view::explode() called on a non-compound composite. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468353393-3074-1-git-send-email-duarte@scylladb.com>	2016-07-13 11:23:24 +02:00
Paweł Dziepak	3fe1aec29d	streaming: avoid word "ERROR" in non-error messages Some tools (e.g. ccm) get confused and consider messages containing word "ERROR" as error level messages irrespective of their actual severity level. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1468399752-5228-1-git-send-email-pdziepak@scylladb.com>	2016-07-13 12:06:33 +03:00
Paweł Dziepak	eb88181347	repair: ask for streamed checksums if cluster supports them Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	7e06499458	repair: convert hashing to streamed_mutations This patch makes hashing for repair calculate checksums in a way that doesn't require rebuilding whole mutation. Unfortunately, such checksums are incompatible with the old ones so the old way for computing checksums is preserved for compatibility reasons. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	e779e2f0c9	streaming: do not fragment mutations in mixed cluster The receiving side needs to handle fragmented mutations properly so that isolation guarantees are not broken. If the receiving node may be an old one do not fragment mutations. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	85c092c56c	storage_service: add LARGE_PARTITIONS_FEATURE Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	c5662919df	tests/streamed_mutation: test hashing Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	fe172484bd	streamed_mutation: add mutation_hasher mutation_hasher is a consumer of streamed_mutation that feeds its data to a specified hasher. It is not compatible with hashing_partition_visitor. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	eb1dcf08e7	tests/streamed_mutation: add test for range_tombstones_stream Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:23 +01:00
Paweł Dziepak	93cc4454a6	streamed_mutation: emit range_tombstones directly Originally, streamed_mutations guaranteed that emitted tombstones are disjoint. In order to achieve that two separate objects were produced for each range tombstone: range_tombstone_begin and range_tombstone_end. Unfortunately, this forced sstable writer to accumulate all clustering rows between range_tombstone_begin and range_tombstone_end. However, since there is no need to write disjoint tombstones to sstables (see #1153 "Write range tombstones to sstables like Cassandra does") it is also not necessary for streamed_mutations to produce disjoint range tombstones. This patch changes that by making streamed_mutation produce range_tombstone objects directly. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:18 +01:00
Paweł Dziepak	c3a8539074	streamed_mutation: add more comparators to position_in_partition Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:08 +01:00
Paweł Dziepak	27fea7bf2c	mutation_partition: add non-const rows and tombstones accessors Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:07 +01:00
Paweł Dziepak	2208d4b53e	range_tombstone_list: add non-const begin() and end() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:07 +01:00
Paweł Dziepak	5a790a9b49	range_tombstone: add flip() range_tombstone::flip() flips range bounds. This is necessary in order to use range tombstone in reversed mutation fragment streams. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:07 +01:00
Paweł Dziepak	e1d306fa0d	range_tombstone: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:07 +01:00
Paweł Dziepak	91a866501d	range_tombstone: add range_tombstone_accumulator range_tombstone_accumulator is a helper class that allows determining tombstone for a clustering row when range tombstones and clustering rows are streamed from streamed_mutation. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:07 +01:00
Paweł Dziepak	cd7937d33b	range_tombstone: add apply() range_tombstone::apply() allows merging two, possibly overlapping, range tombstones with the same start bound and produces one or two disjoint range tombstones as a result. It is intended to be used for merging tombstones coming from different sources. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:50:07 +01:00
Nadav Har'El	aec90a22da	sstable parsing: assert we do not lose clustering rows The sstable parsing code calls mp_row_consumer::flush() after every clustering row has been read, and this puts the now complete row in a single field "_ready". The assumption is that at this point parsing will stop, the consumer will move out this _ready (mp_row_consumer::get_mutation_fragment()) and when flush() is later called again, _ready will be empty again. This assumption is correct in our code, but is based on an intricate combination of esoteric parts of the code, such as: 1. In data_consume_row_context we stop parsing after reading the partition's header, before reading any clustering rows, giving the caller the chance to call sstable_streamed_mutation::read_next() to be prepared for the incoming mutations. 2. In mp_row_consumer::flush_if_needed(), we stop the parser after each individual clustering row. It is easy to break this assumption, and I did this in one of my code changes, and the result was silent loss of clustering rows, as "_ready" got silently overwritten before the reader had a chance to move it out. What this patch does is to add an assertion: If a clustering row is silently lost before being transferred to the mutation fragment reader, we croak. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1468389955-24600-1-git-send-email-nyh@scylladb.com>	2016-07-13 09:42:48 +01:00
Duarte Nunes	4eca7632ec	sstables: Replace composite fields with raw bytes This patch fixes a regression introduced in `f81329be60`, which made keys compound by default when using a particular ctor, in turn leading to mismatches when comparing the same key built with functions that properly consider compoundness. As a temporary fix, the sstable::key and sstable::key_view classes store raw bytes instead of a composite. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468339295-3924-1-git-send-email-duarte@scylladb.com>	2016-07-12 18:08:04 +02:00
Duarte Nunes	f013425bb5	query: Ensure timestamp is last param in read_command Since the timestamp is not serialized, it must always be the last parameter of query::read_command. This patch reorders it with the partition_limit parameters and updates callers that specified a timestamp argument. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468312334-10623-1-git-send-email-duarte@scylladb.com>	2016-07-12 10:41:54 +01:00
Amnon Heiman	41546747d8	scylla-server.service: Start the scylla-housekeeping This makes scylla-server try to start the scylla-housekeeping. Failing to start the service will not interfere with the scylla-server start. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-12 12:32:52 +03:00
Amnon Heiman	0eba2b8fd5	scylla.spec.in: Pack the scylla-housekeeping service This change packs and installs the scylla-housekeeping service under redhat like systems. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-12 12:32:48 +03:00
Tomasz Grabiec	c5e3c9bc35	Merge branch 'duarten/composite-v7' from git@github.com:duarten/scylla.git From Duarte: This patchset adds a representation of a legacy composite value to compound_compat.hh and replaces the one in sstables/key.hh. This patchset is needed for the thrift series.	2016-07-12 10:49:02 +02:00
Amnon Heiman	6d5049d90b	Adding the scylla-housekeeping service The scylla housekeeping service is responsible for recurrent tasks. It is currently set to run daily and report if the version is correct. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-12 11:47:04 +03:00
Amnon Heiman	30efdabf55	Introducing the scylla-housekeeping script scylla-housekeeping is a script that checks and reports hardware and software issues. The first phase of it checks for a newer version and reports if the version is old. To see the available options run scylla-housekeeping help	2016-07-12 11:12:43 +03:00
Glauber Costa	73a70e6d0a	config: Use Scylla in user visible options We have imported most of our data about config options from Cassandra. Due to that, many options that mention the database by name are still using "Cassandra". Especially for the user visible options, which is something that a user sees, we should really be using Scylla here. This patch was created by automatically replacing every occurrence of "Cassandra" with "Scylla" and then later on discarding the ones in which the change didn't make sense (such as Unused options and mentions of the Cassandra documentation) Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1423e1d7e36874a1f46bd091aec96dcb4d8482d9.1468267193.git.glauber@scylladb.com>	2016-07-12 09:18:17 +03:00
Duarte Nunes	f81329be60	sstables: sstables::key delegates to composite The sstables::key class now delegates much of its functionality to the composite class. All existing behavior is preserved. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-11 23:37:33 +02:00
Gleb Natapov	726b79ea91	messaging_service: enable internode_compression option Use LZ4 for internode compression if enabled. Message-Id: <20160711141734.GZ18455@scylladb.com>	2016-07-11 18:30:21 +03:00
Avi Kivity	201f585ab6	Merge seastar upstream * seastar e7a7d41...e660d54 (1): > rpc: add factory class for lz4 compressor	2016-07-11 18:29:43 +03:00
Glauber Costa	f7706d51d1	scyllatop: fix typo Keyborad -> Keyboard Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <349f20fd69be2f2e05ae0b7800e34a336cd2472b.1468248179.git.glauber@scylladb.com>	2016-07-11 18:27:49 +03:00
Duarte Nunes	ad8ff1df7e	sstables: Replace composite class This patch replaces the sstables::composite class with the one in compound_compat.hh. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-11 16:55:11 +02:00
Duarte Nunes	0b87d16699	composite: Add unit tests Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-11 16:55:11 +02:00
Duarte Nunes	b179d8d378	compound_compat: Parse legacy compound values This patch adds support for parsing legacy compound values by introducing the composite class, a wrapper around a sequence of bytes serialized in the legacy format for compounds. Compound values can be sent through the thrift API. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-11 16:55:07 +02:00
Avi Kivity	9b08ddb639	Merge seastar upstream * seastar 9267dfa...e7a7d41 (3): > Merge "Compression support for RPC" from Gleb > reactor: allow sleeping while disk aio is pending > sstring: add resize method	2016-07-11 16:23:29 +03:00
Calle Wilund	4ab03e98cf	commitlog: Ensure we don't end up in a loop when we must wait for alloc Continuation reordering could cause us to repeatedly see the segment-local flag var even though actual write/sync ops are done. Can cause wild recursion without actual delayed continuation -> SOE. Fix by also checking queue status, since this is the wait object. Message-Id: <1468234873-13581-1-git-send-email-calle@scylladb.com>	2016-07-11 14:12:38 +03:00
Calle Wilund	14b0fe23c5	commitlog: Ensure we don't end up in a loop when we must wait for alloc Continuation reordering could cause us to repeatedly see the segment-local flag var even though actual write/sync ops are done. Can cause wild recursion without actual delayed continuation -> SOE. Fix by also checking queue status, since this is the wait object.	2016-07-11 07:45:36 +00:00
Avi Kivity	f126efd7f2	transport: encode user-defined type metadata Right now we fall back to tuples, which confuses the client. Fixes #1443. Reviewed-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468167120-1945-1-git-send-email-avi@scylladb.com>	2016-07-11 08:51:17 +03:00
Takuya ASADA	d2caa486ba	dist/redhat/centos_dep: disable go and ada language on scylla-gcc package, since ScyllaDB never uses them centos-master jenkins job failed at building libgo, but we don't need go language, so let's disable it on scylla-gcc package. Also we never use ada, disable it too. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1468166660-23323-1-git-send-email-syuu@scylladb.com>	2016-07-10 19:12:52 +03:00
Avi Kivity	24e3026e32	Merge "compaction manager refactoring" from Raphael	2016-07-10 17:16:23 +03:00
Tomasz Grabiec	6a1f9a9b97	db: Improve logging Message-Id: <1467997671-16570-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 16:15:03 +03:00
Avi Kivity	b5bef73ad2	Merge "Avoiding checking bloom filters during compaction" from Tomasz "Checking bloom filters of sstables to compute max purgeable timestamp for compaction is expensive in terms of CPU time. We can avoid calculating it if we're not about to GC any tombstone. This patch changes compacting functions to accept a function instead of ready value for max_purgeable. I verified that bloom filter operations no longer appear on flame graphs during compaction-heavy workload (without tombstones). Refs #1322."	2016-07-10 11:33:41 +03:00
Tomasz Grabiec	8c4b5e4283	db: Avoiding checking bloom filters during compaction Checking bloom filters of sstables to compute max purgeable timestamp for compaction is expensive in terms of CPU time. We can avoid calculating it if we're not about to GC any tombstone. This patch changes compacting functions to accept a function instead of ready value for max_purgeable. I verified that bloom filter operations no longer appear on flame graphs during compaction-heavy workload (without tombstones). Refs #1322.	2016-07-10 09:54:20 +02:00
Tomasz Grabiec	c0233c877d	db: Avoid out-of-memory when flushing cannot keep up memtable_list::seal_on_overflow() is called on each mutation to check if current memtable should be flushed. It will call memtable_list::seal_active_memtable() when that is the case. The number of concurrent seals is guarded by a semaphore, starting from commit `0f64eb7e7d`, and allows at most 4 of them. If there are 4 flushes already pending, every incoming mutation will enqueue a new flush task on the semaphore's wait list, without waiting for it. The wait queue can grow without bounds, eventually leading to out-of-memory. The fix is to seal the memtable immediately to satisfy should_flush() condition, but limit concurrency of actual flushes. This way the wait queue size on the semaphore is limited by memtables pending a flush, which is fairly limited. Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 10:53:51 +03:00
Tomasz Grabiec	74ff30a31a	mutation_reader: Introduce stable_flattened_mutations_consumer adaptor Needed to make compact_mutation class non-movable later. It is used in do_with, so needs to be movable. Will be solved by using this adaptor.	2016-07-09 22:31:28 +02:00
Tomasz Grabiec	fb44f895b2	mutation_reader: Name template parameters after concepts With so many consumer concepts out there, it is confusing to name parameters using the generic "Consumer" name; let's name them after (already defined) concepts: CompactedMutationsConsumer, FlattenedConsumer.	2016-07-09 22:31:27 +02:00
Raphael S. Carvalho	ed5e7e6842	compaction: refactor compaction manager Previously, same function was used to handle both regular compaction and cleanup requests. That's bad because a lot of conditions were added for both compaction types to live in the same function. Now, cleanup and regular compaction will live in different functions. They share a lot of code, so helper functions were introduced. This change is also important for user-initiated compaction that will go through compaction manager in the future. Code is also a lot easier to read now. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 16:37:53 -03:00
Raphael S. Carvalho	da6a2b429d	compaction: add functions to register and deregister compacting sstables Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 16:00:51 -03:00
Raphael S. Carvalho	4d6dce8ec9	compaction: add helper function to get candidates for strategy Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:06:14 -03:00
Raphael S. Carvalho	e38f66c6fe	database: make certain column family functions const qualified Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:05:22 -03:00
Raphael S. Carvalho	bfc5376548	compaction: remove gate from compaction manager task There is no longer a need to use gate for regular termination of fiber that runs compaction. Now, we only set task->stopping to true, ask for compaction termination, and wait for its future to resolve. Code is simplified a lot with this change. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:05:10 -03:00
Paweł Dziepak	cba996a3ea	Merge "Implement missing functions for byte_ordered_partitioner" from Asias	2016-07-08 10:49:25 +01:00
Asias He	f4389349e4	config: Enable partitioner option Enable --partitioner option so that user can choose partitioner other than the default Murmur3Partitioner. Currently, only Murmur3Partitioner and ByteOrderedPartitioner are supported. When an unsupported partitioner is specified, an error will be propagated to the user.	2016-07-08 17:44:55 +08:00
Asias He	9c27b5c46e	byte_ordered_partitioner: Implement missing describe_ownership and midpoint In order to support ByteOrderedPartitioner, we need to implement the missing describe_ownership and midpoint function in byte_ordered_partitioner class. As a starter, this patch uses a simple node token distance based method to calculate ownership. C* uses a complicated key samples based method. We can switch to what C* does later. Tests are added to tests/partitioner_test.cc. Fixes #1378	2016-07-08 17:44:55 +08:00
Asias He	e0949a8f4f	storage_service: Exit shadow round state if it fails If a node fails to talk to any seed node, shadow round will fail. We should exit shadow round state before we continue. This issue is spotted by consistency_test.TestConsistency.data_query_digest_test dtest. Message-Id: <ba0613532a69bac369ca316ab61d907b320c8e68.1467963674.git.asias@scylladb.com>	2016-07-08 10:05:07 +01:00
Avi Kivity	8dab93a853	sstables: fix low disk utilization with compression and small chunk lengths As Nadav notes we use the chunk length as the buffer size for the compressed stream too. Fix by using it only for the outer (uncompressed) stream; the inner (compressed) stream uses the sstable buffer size, 128 kiB. Fixes #1402. Message-Id: <1467910556-5759-1-git-send-email-avi@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com>	2016-07-07 18:13:30 +01:00
Vlad Zolotarov	f2bf453be2	database: revive mutation retry in case of replay_position_reordered_exception The logic that would retry applying a mutation in case of a replay_position_reordered_exception error was broken by a commit `0c31f3e626` Author: Glauber Costa <glauber@scylladb.com> Date: Wed Apr 20 19:09:21 2016 -0400 database: move memtable throttler to the LSA throttler This patch makes it work again. Fixes #1439 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1467893342-30559-1-git-send-email-vladz@cloudius-systems.com>	2016-07-07 15:00:35 +02:00
Tomasz Grabiec	de429d6a53	Merge branch 'dev/pdziepak/streamed-mutations-streaming/v3' Support for streaming of large partitions from Paweł: This series converts streaming to streaming_mutations so that there is no need to store full mutation in memory in order to send or receive it. The first several patches add a way of estimating mutation fragment memory usage and introduce fragment_and_freeze() which produces a stream of reasonably sized frozen mutations from a single streamed mutation. The second part of this patchset makes sure that streaming mutations in fragments doesn't break isolation guarantees. This is achieved by delaying visibility of sstables produced by streaming until the streaming is completed. However, our current receiving code merges mutations from all streaming plans together thus making it impossible to track which data was received from a particular streaming plan. The solution to that problem is to introduce an additional flag to STREAM_MUTATION verb which informs the receiver whether the mutation is fragmented and care must be taken to preserve isolation. Small mutations behaved as they were, with writes from different stream plans coalesced while big mutations are handled separately for each streaming task.	2016-07-07 13:23:39 +02:00
Paweł Dziepak	d9eb4d8028	streaming: use fragment_and_freeze() to send mutations Commit `206955e4` "streaming: Reduce memory usage when sending mutations" moved streaming mutation limiter from do_send_mutations() to send_mutations(). The reason for that was that send_mutation() did full mutation copies. That's no longer the case and streaming limiter should be moved back to do_send_mutation() in order to provide back pressure to fragment_and_freeze(). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:36 +01:00
Paweł Dziepak	32a5de7a1f	db: handle receiving fragmented mutations If mutations are fragmented during streaming, special care must be taken so that isolation guarantees are not broken. Mutations received with flag "fragmented" set are applied to a memtable that is used only by that particular streaming task and the sstables created by flushing such memtables are not made visible until the task is complete. Also, in case the streaming fails all data is dropped. This means that fragmented mutations cannot benefit from coalescing of writes from multiple streaming plans, hence separate way of handling them so that there is no loss of performance for small partitions. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	f2ae31711e	streaming: inform CF when streaming fails Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	4031c0ed8f	streaming: pass plan_id to column family for apply and flush plan_id is needed to keep track of the origin of mutations so that if they are fragmented all fragments are made visible at the same time, when that particular streaming plan_id completes. Basically, each streaming plan that sends big (fragmented) mutations is going to have its own memtables and a list of sstables which will get flushed and made visible when that plan completes (or dropped if it fails). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	51ec7a7285	db: wait for ongoing flushes at end of streaming When flush_streaming_mutations() is called at the end of streaming it is supposed to flush all data and then invalidate cache ranges. However, if there are already some memtable flushes in progress it won't wait for them. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	5bc51821fe	sstables: allow writing unsealed sstables The purpose of this patch is to split the actions of writing sstable and sealing it. As long as the sstable is unsealed it is considered incomplete and is going to be removed on reboot. Such functionality is needed in order to defer visibility of sstables created during streaming until the streaming is complete. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	a7b6c1110f	sstables: do not require seal_sstable() to be run in thread Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	4e34bd4e8a	tests/streamed_mutation: test fragment_and_freeze() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	19629e95e2	frozen_mutation: add fragment_and_freeze() fragment_and_freeze() produces a stream of frozen mutations from a single streamed_mutation. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:30 +01:00
Paweł Dziepak	820bd6c9bc	streamed_mutation: add mutation_fragment::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	23d0bfd065	mutation_partition: add row::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	1d54327afd	atomic_cell_or_collection: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	d0ee750cec	keys: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	cfa581b426	utils/managed_vector: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	703509a1c7	utils/managed_bytes: add memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	a289816b31	streamed_mutation: fix mutation_fragment::consume() return type Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Paweł Dziepak	37bd7230bc	streamed_mutation: add mutation fragment visitor Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:17:25 +01:00
Glauber Costa	54ce6221a7	allow the dirty memory manager to be used without a database object Some of our tests don't provide a database object to a CF. Create a default dirty memory manager object that can be used without a database for them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <872f8c9232ff87d788e271b1db86c814d7a75d9f.1467832713.git.glauber@scylladb.com>	2016-07-07 10:00:43 +01:00
Raphael S. Carvalho	0772d20c60	fix compilation in debug mode build/debug/sstables/compaction_strategy.o: In function `date_tiered_manifest::date_tiered_manifest(std::map<basic_sstring<char, unsigned int, 15u>, basic_sstring<char, unsigned int, 15u>, std::less<basic_sstring<char, unsigned int, 15u> >, std::allocator<std::pair<basic_sstring<char, unsigned int, 15u> const, basic_sstring<char, unsigned int, 15u> > > > const&)': /home/centos/scylla/sstables/date_tiered_compaction_strategy.hh:67: undefined reference to `date_tiered_manifest::DEFAULT_BASE_TIME_SECONDS' That's fixed by moving definition of static constexpr outside the class. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20c16ad71f64900aa5591018bc4e976406cfebb3.1467870383.git.raphaelsc@scylladb.com>	2016-07-07 11:52:37 +03:00
Avi Kivity	9a8788019d	row_cache: fix visitor for boost <= 1.55 Older boosts can't return a future from a visitor (likely lacking support for move-only objects). Supply a dirty hackaround. Message-Id: <1467822548-25940-1-git-send-email-avi@scylladb.com>	2016-07-06 19:55:51 +03:00
Avi Kivity	21031d276b	Merge seastar upstream * seastar c82c36f...9267dfa (6): > app_template: Make run() wait for func when reactor exit is triggered externally > core: Introduce futurize_apply() helper > rpc: make unexpected eof messages more informative > Fix boost version check > reactor: more fix for smp poll with older boost > reactor: fix build on older boost due to spsc_queue::read_available()	2016-07-06 18:14:13 +03:00
Avi Kivity	02530faeb2	compaction: fix tombstones not being garbage collected during compaction `2a46410f4a` changed sstable_list from a map to a set, so it is no longer sorted by generation. The code for finding the list of sstables not being compacted relied on this sort order, and now broke, returning a longer list than needed (including some of the sstables being compacted). As a result, the compaction code preserved the tombstones, incorrectly thinking there was still live data they referenced. Fix by sorting the set explicitly. Fixes #1429. Message-Id: <1467793026-6571-1-git-send-email-avi@scylladb.com>	2016-07-06 10:22:31 +02:00
Asias He	0c56bbe793	gossip: Make get_supported_features and wait_for_feature_on{_all}_node private They are used only inside gossiper itself. Also make the helper get_supported_features(std::unordered_map<gms::inet_address, sstring>) static. Message-Id: <f434c145ad9138084708b60c1d959b84360e47b2.1467775291.git.asias@scylladb.com>	2016-07-06 09:54:56 +03:00
Avi Kivity	ab279a4752	Merge "Add support to date tiered compaction strategy" from Raphael "After this patchset, date tiered compaction strategy is supported by Scylla. For those who don't know what it is about, the following article may help: https://labs.spotify.com/2014/12/18/date-tiered-compaction/ It's also nicely explained here by our wiki page: https://github.com/scylladb/scylla/wiki/SSTable-compaction#date-tiered-compaction Basically, date tiered strategy was developed to help the database perform better when facing a time series workload. Date tiered strategy will work to keep data written at nearly the same time together, such that the number of relevant sstables for a time-based query is relatively low. We still lack support to filter out sstables based on time parameters of a query, but that feature should come ASAP. The following dtests now pass: compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_delete_test compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test Used cassandra-stress with the parameter '-schema compaction$strategy=DateTieredCompactionStrategy$' to check stability. Fixes #511."	2016-07-06 09:51:12 +03:00
Avi Kivity	7438c9de5c	Merge "Fix database freeze with load for multiple CFs" from Glauber "Issue 1195 describes a scenario with a fairly easy reproducer in which we can freeze the database. That involves writing simultaneously to multiple CFs, such that the sum of all the memory they are using is larger than the dirty memory limit, without any of them individually being larger than the memtable size. This patchset rewrites the throttling code, including now active flushes so that this situation cannot happen. Fixes #1195"	2016-07-06 09:48:13 +03:00
Raphael S. Carvalho	b5ec4d46c6	tests: add test for date tiered compaction strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	b699ef2de3	compaction: wire up date tiered compaction strategy After this commit, date tiered compaction strategy is supported on Scylla. To understand how it works, take a look at our wiki page: https://github.com/scylladb/scylla/wiki/SSTable-compaction#date-tiered-compaction Fixes #511. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	e5cc0cc6c4	compaction: implement date tiered compaction strategy This commit is basically about converting Java to C++. Date tiered compaction strategy isn't wired yet. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	cab2892866	tests: add test for sstables::get_fully_expired_sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	e9076f39be	compaction: implement function to get fully expired sstables Strongly based on org.apache.cassandra.db.compaction. CompactionController.getFullyExpiredSSTables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	69b3860662	tests: add test for leveled_manifest::overlapping Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:45 -03:00
Raphael S. Carvalho	92848efc42	sstables: make overlapping functions static That's needed for a function that will get overlapping sstables to get fully expired ones. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:34:34 -03:00
Raphael S. Carvalho	8d38fa49d4	sstables: move code to get uncompacting sstables to a function Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:33:55 -03:00
Raphael S. Carvalho	1118cfc51a	tests: test that sstable max_local_deletion_time is properly updated Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:13:34 -03:00
Raphael S. Carvalho	cc6c383249	sstables: properly keep track of max local deletion time We weren't updating max local deletion time for cells that contain ttl, or for tombstone cells. If there is a live cell with no ttl, then max local deletion time is supposed to store maximum value, which means that the sstable will not be fully expired later on. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:13:24 -03:00
Raphael S. Carvalho	1ecd9bdefc	sstables: fix type of max_local_deletion_time max_local_deletion_time was incorrectly using an unsigned type instead of a signed one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:13:13 -03:00
Raphael S. Carvalho	f9ab94d266	compaction: import DateTieredCompactionStrategy.java File can be found at the following C* directory: src/java/org/apache/cassandra/db/compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:12:49 -03:00
Glauber Costa	b0932ceb04	database: act on LSA pressure notification Issue 1195 describes a scenario with a fairly easy reproducer in which we can freeze the database. That involves writing simultaneously to multiple CFs, such that the sum of all the memory they are using is larger than the dirty memory limit, without any of them individually being larger than the memtable size. Because we will never reach the individual memtable seal size for any of them, none of them will initiate a flush, leading the database to a halt. The LSA has now gained infrastructure that allows us to be notified when pressure conditions mount. What we will do in this case is initiate a flush ourselves. Fixes #1195 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
Glauber Costa	7169b727ea	move system tables to its own region In the spirit of what we are doing for the read semaphore, this patch moves system writes to its own dirty memory manager. Not only will it make sure that system tables will not be serialized by its own semaphore, but it will also put system tables in its own region group. Moving system tables to its own region group has the advantage that system requests won't be waiting during throttle behind a potentially big queue of user requests, since requests are tended to in FIFO order within the same region group. However, system tables being more controlled and predictable, we can actually go a step further and give them some extra reservation so they may not necessarily block even if under pressure (up to 10 MB more). Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
Glauber Costa	c358947284	database: wrap semaphore and region group into a new dirty memory manager We currently have a semaphore in the column family level that protects us against multiple concurrent sstable flushes. However, storing that semaphore into the CF, not the database, was a (implementation, not design) mistake. One comment in particular makes it quite clear: // Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind // would take care of the rest. But that still has issues, so we'll limit parallelism to some // number (4), that we will hopefully reduce to 1 when write behind works. So I aimed for the shard, but ended up coding it into the CF because that's closer to the flush point - my bad. This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The long term benefit is that we now have one unified structure that can hold shared flush data in all of the CFs. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 15:29:04 -04:00
Glauber Costa	d41fcd45d1	memtables: make memtable inherit from region The LSA memory pressure mechanism will let us know which region is the best candidate for eviction when under pressure. We need to somehow then translate region -> memtable -> column family. The easiest way to convert from region to memtable is having memtable inherit from region. Despite the fact that this requires multiple inheritance, which always raises a flag a bit, the other class we inherit from is enable_shared_from_this, which has a very simple and well defined interface. So I think it is worthy for us to do it. Once we have the memtable, grabbing the column family is easy provided we have a database object. We can grab it from the schema. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 15:05:29 -04:00
Glauber Costa	0c31f3e626	database: move memtable throttler to the LSA throttler The LSA infrastructure, through the use of its region groups, now has a throttler mechanism built-in. This patch converts the current throttlers so that the LSA throttler is used instead. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 15:05:19 -04:00
Yoav Kleinberger	0ad940bfc3	tools/scyllatop: fix crash due to mouse events due to an urwid-library technicality, some mouse events like scroll or click would crash scyllatop. This patch fixes this problem. closes issue #1396. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1467294117-19218-1-git-send-email-yoav@scylladb.com>	2016-07-05 19:08:55 +03:00
Avi Kivity	cb59e724ee	Merge "Fix enabling sstable read ahead" from Paweł "This series contains remaining changes necessary to safely enable read ahead of sstables. Basically, it makes sure that input_streams are always properly closed (even in case of exception during read)."	2016-07-05 19:04:19 +03:00
Raphael S. Carvalho	e688fc9550	api: provide estimation of pending compaction Use compaction_strategy::estimated_pending_compaction() to provide user with an estimation of number of compaction for strategy to be fully satisfied. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <39b7d91f2525ca38fb2ce9d8885d0c2e727de7ed.1467667054.git.raphaelsc@scylladb.com>	2016-07-05 19:03:12 +03:00
Raphael S. Carvalho	43926026c3	compaction: introduce compaction strategy method to estimate pending compaction At the moment, it's not possible to know how many compactions are needed for compaction strategy to be satisfied. It's not possible to know exactly the number of pending compactions, but the strategy can provide an estimation. For size tiered, it's based on number of sstables in each bucket. By dividing bucket size by max threshold, we get number of compactions needed to compact that single bucket. For leveled, it's about the number of sstables that exceed the limit in each level. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <e209e52f6159ee274a8358b69961a7c0ce357f7d.1467667054.git.raphaelsc@scylladb.com>	2016-07-05 19:03:11 +03:00
Avi Kivity	76cc6408cd	Merge "feature check for seed node" from Asias "This series implements feature check for seed node.	2016-07-05 19:01:01 +03:00
Asias He	6f69963ef9	system_keyspace: Simplify load_host_ids implementation - Use plain loop instead of do_for_each - Use row.get_as() instead of row.template get_as() Message-Id: <3e108d3a6258c0caaf569eb9c79532d9789ea411.1467703722.git.asias@scylladb.com>	2016-07-05 09:47:21 +02:00
Asias He	3f31be58b6	system_keyspace: Simplify load_tokens implementation - Use plain loop instead of do_for_each - Use row.get_as() instead of row.template get_as() Message-Id: <f959ace4f30078695d383c849ed4520169228f97.1467703722.git.asias@scylladb.com>	2016-07-05 09:47:21 +02:00
Asias He	5236e7a379	storage_service: Implement feature check for seed node Checking features for seed node is a bit more complicated than non-seed node, because non-seed node can always talk to at least one seed node, seed node may not. In this patch, we distinguish new cluster and existing cluster by checking if the system table is empty. We relax the feature check for new cluster because the feature check is mostly useful when upgrading an existing cluster to prevent old node to join new cluster. When talking to a seed node failed during the check, we fall back to the check using features stored in the system table. This makes restarting a seed node when no other seed node is up possible (no other seed node at all, or other seed node is not up yet). I tested the following scenarios. 1) start a completely new seed node in a new cluster * system table is empty, skip the check. 2) start a cluster, restart one seed node, at least one other seed node is up * system table is not empty, check with shadow round, shadow round will * succeed 3) start a cluster, restart one seed node, no other seed node is up * system table is not empty, check with shadow round, shadow round will * fail, fallback to system table check. 4) start a cluster, shutdown all the nodes, start one seed node with new ip address, seed list in yaml is updated with new ip address * system table is not empty, check with shadow round, shadow round will * fail, fallback to system table check	2016-07-05 10:09:54 +08:00
Asias He	bb80362c3f	gossip: Insert with result.end() in get_supported_features It is faster than result.begin(), suggested by Avi.	2016-07-05 10:09:54 +08:00
Asias He	72cb4a228b	gossip: Add to_feature_set helper To convert a "," split feature string to a feature set.	2016-07-05 10:09:54 +08:00
Asias He	1d6c57fb40	gossip: Reduce timeout in shadow round In `3a36ec33db` (gossip: Wait longer for seed node during boot up), we increased the timeout by the factor of 60, i.e., ring_delay * 60 = 5 seconds * 60 = 5 minutes. In `57ee9676c2` (storage_service: Fix default ring_delay time), we fixed the default ring_delay to 30 seconds. Now the timeout is 30 * 60 seconds = 30 minutes, which is too long. Make it 5 minutes.	2016-07-05 10:09:54 +08:00
Asias He	88f0bb3a7b	gossip: Add check_knows_remote_features To check if this node knows features in std::unordered_map<inet_address, sstring> peer_features_string	2016-07-05 10:09:54 +08:00
Asias He	2b53c50c15	gossip: Add get_supported_features To get features supported by all the nodes listed in the address/feature map.	2016-07-05 10:09:53 +08:00
Asias He	31df4e5316	system_keyspace: Introduce load_peer_features To get the peer features stored in the system.peers table.	2016-07-05 10:09:53 +08:00
Paweł Dziepak	4acf77d755	sstables: drop unused data_stream_at() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-04 18:17:43 +01:00
Paweł Dziepak	2cdf498bbd	sstables: close input stream in sstable::data_read() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-04 18:17:42 +01:00
Paweł Dziepak	8931b939a1	sstables: use finally() to close input streams Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-04 18:17:42 +01:00
Paweł Dziepak	e6ececce7f	Merge seastar upstream Submodule seastar a47f893..c82c36f: > reactor: fix build error > util: lazy_eval: fix compilation errors related to operator<<()s definitions	2016-07-04 18:14:05 +01:00
Duarte Nunes	41843b32c5	thrift: Correctly mark a CF as dense And store whether the comparator is a composite type in the case of dynamic CFs. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1467307688-11059-1-git-send-email-duarte@scylladb.com>	2016-07-04 17:40:53 +02:00
Nadav Har'El	c4e871ea2d	Work around unexpected data_value constructor If someone tried to naively use utf8_type->decompose("18wX"), this would mysteriously fail, returning an empty key. decompose takes a data_value, so the compiler looked for an implicit conversion from the string constant (const char*) to data_value. We did not have such a conversion, only conversion from sstring. But the compiler chose (backed by the C++ standard, no doubt) to implicitly convert the const char* to a bool (!), and then use data_value(bool). It did not convert the const char* to an sstring, nor did it warn about the possible ambiguity. So this patch adds a data_value(const char*) constructor, so people will not fall into the same trap that I fell into... Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1467643462-6349-1-git-send-email-nyh@scylladb.com>	2016-07-04 17:50:53 +03:00
Avi Kivity	e22517bafc	Merge "Optimize reads from leveled sstables" In a leveled column family, there can be many thousands of sstables, since each sstable is limited to a relatively small size (160M by default). With the current approach of reading from all sstables in parallel, cpu quickly becomes a bottleneck as we need to check the bloom filter for each of these sstables. This patch addresses the problem by introducing a compaction-strategy-specific data structure for holding sstables. This data structure has a method to obtain the sstables used for a read. For leveled compaction strategy, this data structure is an interval map, which can be efficiently used to select the right sstables.	2016-07-04 16:00:35 +03:00
Asias He	610a0f7ef0	storage_service: Skip feature check for seed node for now When a seed node boots up with more than one node in the seed list, it will fail to talk to the other seed node which is not up yet. This fails the feature check, so the seed node will not boot. Skip the feature check for seed node for now, until we have a proper solution. Fixes recent dtest failure due to fail to boot the seed node. Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>	2016-07-04 15:09:57 +03:00
Avi Kivity	28fab55e6e	Merge "Convert sstable writes to streamed mutations" from Paweł "This series converts sstable writers (including compaction) to streamed mutations and makes them use consumer-style interface. Code related to sstable writes and compaction is converted to consumers that can be used with consume_flattened_in_thread() (which is a variant of consume_flattened() intended to be run inside a thread). compact_for_query is improved so that it can be reused by sstable compaction."	2016-07-04 15:07:47 +03:00
Avi Kivity	171054e87b	Merge seastar upstream * seastar d4d9e16...a47f893 (1): > Merge "overprovisioning support"	2016-07-04 13:46:03 +03:00
Paweł Dziepak	5d0de2179a	Merge "Adding scylla version API" from Amnon Amnon says: The API that returns the version currently returns the compatibility version (i.e. the compatible origin version, currently 2.1.8). The check version functionality needs to know the currently running version of scylla. For that, a new API was added that returns the current version. The result is equivalent to running scylla --version. After this series a call to: $ curl -X GET "http://localhost:10000/storage_service/scylla_release_version" "666.development-20160703.72f0d4d" Which is the json representation of: $ ./build/release/scylla --version 666.development-20160703.72f0d4d	2016-07-04 10:52:44 +01:00
Asias He	f6a2672be0	storage_service: Modify log to match config option of scylla We currently log as follows: May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This node was decommissioned and will not rejoin the ring unless cassandra.override_decommission=true has been set,or all existing data is removed and the node is bootstrapped again However, user should use override_decommission:true instead of cassandra.override_decommission:true in scylla.yaml where the cassandra prefix is stripped. Fixes #1240 Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>	2016-07-04 10:47:49 +02:00
Avi Kivity	76cc0c0cf9	auth: fix performance problem when looking up permissions data_resource lookup uses data_resource::name(), which uses sprint(), which uses (indirectly) locale, which takes a global lock. This is a bottleneck on large machines. Fix by not using name() during lookup. Fixes #1419 Message-Id: <1467616296-17645-1-git-send-email-avi@scylladb.com>	2016-07-04 10:26:18 +02:00
Yoav Kleinberger	49cba035ea	tools/scyllatop: leave terminal in a functioning state when user quits with CTRL-C closes issue #1417. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1467556769-11851-1-git-send-email-yoav@scylladb.com>	2016-07-03 17:43:46 +03:00
Amnon Heiman	e66a1cd705	API: Add implementation for the scylla release version This adds the implementation to the scylla release version API. After this patch a call to: curl -X GET "http://localhost:10000/storage_service/scylla_release_version" Will return the current scylla release version. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-03 16:29:09 +03:00
Amnon Heiman	56ea8c943e	API: add scylla release version API This adds a definition to the scylla release version. The API already returns the compatibility version (i.e. the compatible origin version). This definition returns the scylla version; a call to the API should return the same result as running scylla --version. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-07-03 16:26:21 +03:00
Avi Kivity	68e613b313	Rebuild _column_family::_sstables when changing compaction_strategy The concrete sstable_set type depends on the compaction strategy, so ask the compaction_strategy to create a new sstable_set object and populate it.	2016-07-03 13:42:10 +03:00
Avi Kivity	44a6cef4e1	sstable mutation readers: use sstable_set::select() Apply compaction strategy specific logic to narrow down the set of sstables used for a query; can speed up reads using LeveledCompactionStrategy significantly. Fixes #1185.	2016-07-03 10:50:58 +03:00
Avi Kivity	4cb7618601	Convert column_family::_sstables to sstable_set Using sstable_set will allow us to filter sstables during a query before actually creating a reader (this is left to the next patch; here we just convert the users of the _sstables field).	2016-07-03 10:32:27 +03:00
Avi Kivity	c8237fc262	compaction_strategy: introduce make_sstable_set() Allow compaction_strategy to create a container for sstables that is optimized for the strategy. Most compaction_strategies return bag_sstable_set; leveled compaction returns the specialized partitioned_sstable_set.	2016-07-03 10:27:01 +03:00
Avi Kivity	168696c558	Introduce partitioned_sstable_set partitioned_sstable_set assumes that sstable are mostly partitioned along the token range: only a few sstables will be needed to access a particular token. It is implemented as an interval_map.	2016-07-03 10:27:00 +03:00
Avi Kivity	64e4357461	Introduce bag_sstable_set bag_sstable_set is a generic sstable_set implementation: it assumes nothing about the sstables. It is implemented as a vector, and any select will return the entire sstable set.	2016-07-03 10:27:00 +03:00
Avi Kivity	85e9cf4616	Introduce sstable_set sstable_set abstracts the notion of a container of sstables, allowing different compaction strategies to supply their own implementation. The intended user is leveled compaction strategy; since it partitions sstables, it can quickly restrict the number of sstables that participate in a query by looking at the min/max partition key. sstable_set also maintains an internal lw_shared_ptr<sstable_list>, in parallel with the abstract container. This is to support column_family::get_sstable(), which returns a lw_shared_ptr<sstable_list> which must be anchored somewhere if it is not saved at the caller side, as it isn't in most current callers.	2016-07-03 10:27:00 +03:00
Avi Kivity	c1815abd15	Introduce compatible_ring_position ring_position is built for modern code that does not require default constructors or stateless comparators. But not all code is modern, so supply a compatible_ring_position that works with old code, at the cost of some extra storage. Intended user is boost's interval container library.	2016-07-03 10:27:00 +03:00
Avi Kivity	2a46410f4a	Change sstable_list from a map to a set sstable_list is now a map<generation, sstable>; change it to a set in preparation for replacing it with sstable_set. The change simplifies a lot of code; the only casualty is the code that computes the highest generation number.	2016-07-03 10:26:57 +03:00
Duarte Nunes	386c0dd4b2	storage_proxy: Correctly calculate new limit This patch fixes a bug where we would always return query::max_rows when calculating the new limit for a retry read command. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1467289746-18177-1-git-send-email-duarte@scylladb.com>	2016-06-30 14:49:56 +02:00
Paweł Dziepak	b150720361	sstable: enable read ahead Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 13:18:24 +01:00
Paweł Dziepak	4513f8b52c	sstables: add compressed_file_data_source_impl::close() compressed_file_data_source_impl should close the underlying data source properly when asked to. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 13:07:07 +01:00
Paweł Dziepak	55a6911d7a	sstables: close input_stream<> properly If read ahead is going to be enabled it is important to close input_stream<> properly (and wait for completion) before destroying it. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	e44e12c74a	sstables: drop no longer needed code Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	c2f0ee9b5f	sstables: add consumer-style sstable compactor This patch moves compaction logic to a consumer that can be used with consume_flattened_in_thread(). Internally, sstable_writer is used to write individual sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	18a9ee105f	sstables: add consumer-style sstable writer sstable_writer encapsulates all logic related to writing sstable. Previously introduced component_writer is used to write actual mutations. sstable_writer is intended to be used with consume_flattened_in_thread(). Its purpose is to be used by higher-level consumer that needs to write possibly more than one sstable (sstable compaction is an example of such consumer). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	0e8b8463ba	sstables: introduce consumer-style components writer This patch rewrites do_write_components() so that it can use consume_flattened_in_thread(). All components-writing code is moved to a new consumer: component_writer. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	0287e0c9ac	mutation_reader: add consume_flattened_in_thread() This is a version of consume_flattened() intended to be run inside a thread. All consumer code is going to be invoked in the same thread context. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	7a95847014	mutation_compactor: prepare for sstable compaction compact_mutation code is going to be shared among queries and sstable compaction. There are some differences though. Queries don't provide _max_purgeable and sstable compaction don't need any limits. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	00bcc05d36	mutation_compactor: _max_purgeable depends on the decorated key _max_purgeable can be different for each partition, since it is computed using sstables in which that partition is present (or likely to be present). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:01 +01:00
Paweł Dziepak	4133cc7a53	mutation_reader: make consume_flattened() produce decorated keys Since decorated keys are already computed it is better to pass more information than less. Consumers interested just in partition key can just drop token and the ones requiring full decorated key don't need to recompute it. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:39:00 +01:00
Paweł Dziepak	fe4b739828	mutation_compactor: rename compact_for_query to compact_mutation Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	3e86f9ab73	mutation_partition: extract compact_for_query to a separate header The compacting logic inside compact_for_query is going to be shared with sstable compaction. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	9b14c93677	streamed_mutation: return reference to decorated key Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	3c08ffb275	query: add full_slice query::full_slice is a partition slice which has full clustering row ranges for all partition keys and no per-partition row limit. Options and columns are not set. It is used as a helper object in cases when a reference to partition_slice is needed but the user code needs just all data there is (an example of such case would be sstable compaction). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	599ed7f1ed	sstables: restore indentation Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Paweł Dziepak	e7ff20b3bb	sstables: run compaction code inside a thread Currently, each sstable write has its separate thread. However, the goal is to have compaction use consume_flattened() with a consumer that creates and writes the sstables. consume_flattened() needs to be executed inside a thread, since sstable writer may defer. This patch is a first step in preparations and it just makes whole compaction logic run inside a thread. That makes little sense now, since all sstable writes spawn their own threads but that's going to change in the following patches. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-30 11:37:54 +01:00
Duarte Nunes	0ae6eafadd	query: Make partition_limit last parameter The partition_limit should have been added to the end of the ctor argument list, as its current placement causes some callers to pass it the timestamp instead of the limit. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1467239360-6853-3-git-send-email-duarte@scylladb.com>	2016-06-30 12:31:11 +02:00
Gleb Natapov	8bf82cc31c	put additional info into cql timeout exception Fixes #1397 Message-Id: <20160628101829.GR14658@scylladb.com>	2016-06-30 12:03:48 +02:00
Paweł Dziepak	b70bf086b7	frozen_mutation: handle reversed streams properly Freezing streamed_mutations assumed that mutation fragments are streamed in the order they appear in the frozen mutation. That is not true for reversed streams. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1467277069-18702-1-git-send-email-pdziepak@scylladb.com>	2016-06-30 11:26:45 +02:00
Avi Kivity	9ac730dcc9	mutation_reader: make restricting_mutation_reader even more restricting While limiting the number of concurrently executing sstable readers reduces our memory load, the queued readers, although consuming a small amount of memory, can still grow without bounds. To limit the damage, add two limits on the queue: - a timeout, which is equal to the read timeout - a queue length limit, which is equal to 2% of the shard memory divided by an estimate of the queued request size (1kb) Together, these limits bound the amount of memory needed by queued disk requests in case the disk can't keep up. Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>	2016-06-29 15:17:35 +02:00
Raphael S. Carvalho	85cb2a6d35	database: trigger compaction on boot At the moment, we only trigger compaction after creating a new sstable as a result of memtable flush, or some other event such as changing compaction strategy of a column family. However, it's important to trigger compaction on boot too. That will happen after loading all column families. Fixes #1404. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <54d38a418157454eec97aaba6b8a6b6e51484db4.1467135349.git.raphaelsc@scylladb.com>	2016-06-29 13:47:42 +03:00
Amnon Heiman	610fe274fd	services: Make scylla-jmx service depend on scylla-server The scylla-jmx service no longer shuts itself down. A better setup would be for it to start when scylla-server starts and to shut down when scylla-server shuts down. This patch does the scylla-server part of the change. The scylla-server definition would Want the scylla-jmx.service, so there is no need to enable the scylla-jmx.service. A patch to scylla-jmx would cause it to shut down when scylla-server shuts down. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1467184502-4358-1-git-send-email-amnon@scylladb.com>	2016-06-29 11:36:04 +03:00
Avi Kivity	2c4501f317	Merge seastar upstream * seastar c15055c...d4d9e16 (4): > semaphore: switch to chunked_fifo > fair_queue: add missing include > chunked_fifo: implement back() > Chunked FIFO queue	2016-06-28 19:30:29 +03:00
Avi Kivity	1b448877d7	Merge " thrift: Implement CQL over thrift" from Duarte "This patchset implements the CQL over thrift verbs. Only CQL3 is supported, and the CQL2 verbs are disabled."	2016-06-28 13:36:12 +03:00
Piotr Jastrzebski	59d0d9e666	Fix cache_tracker::clear Make sure that artificial entries for all column families are set to non continuous. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <f9e517fe40482c05f6c388faab7d6b9eca6b159e.1467103548.git.piotr@scylladb.com>	2016-06-28 11:18:23 +02:00
Piotr Jastrzebski	27575a0528	Fix previous_entry_is_continuous Rename it to check_previous_entry. Remove unnecessary test. Make sure ring_position always has a working relation_to_keys method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <6bc790d492ba9b5c302a50218f3e26b924f657d0.1467101754.git.piotr@scylladb.com>	2016-06-28 10:27:08 +02:00
Piotr Jastrzebski	68e5a199e9	Clean continuous flag of cache entry preceding invalidated decorated key even when it's not found. Add test. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c7b8f4df37256363bf304e0396f84b5f37921b81.1467059472.git.piotr@scylladb.com>	2016-06-28 10:26:02 +02:00
Piotr Jastrzebski	cd9f3f94c4	Fix row_cache::update Clear continuous flag on the last cache entry with key smaller than a partition being dropped from memtable on flush and not saved in cache. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <0b5293cc0bf8bb858e62aa8dd00ae7fe7a484380.1467059472.git.piotr@scylladb.com>	2016-06-28 10:25:38 +02:00
Piotr Jastrzebski	eb959a8b81	Change check for artificial entry in cache_entry destructor from _key.has_key() to _lru_link.is_linked() Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <f6d3d1bc49d9f6dd5b67a10cbe862466047b039d.1467059472.git.piotr@scylladb.com>	2016-06-28 10:24:29 +02:00
Nadav Har'El	164c760324	Switch compression chunk default from 64 KB to 4 KB Following Cassandra, our default sstable compression chunk size is 64 KB. The big downside of this default size is that small reads need to read and uncompress a large chunk, around 32 KB (if compression halves the data size). In this patch we switch the default chunk size to 4 KB, which allows faster small reads (the report in issue #1337 was of a 60-fold speedup...). Since commit `2f56577`, large reads will not be significantly slowed down by changing to a small chunk size. The remaining potential downside of this change is a lower compression ratio, because the smaller chunks are compressed individually. However, experimentation shows that the compression ratio is hurt somewhat, but not dramatically, by lowering the chunk size: A recent survey of Cassandra compression in https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/ reports a compression ratio of 2 for 64 KB chunks, vs. 1.75 for 4 KB chunks. My own test on a cassandra-stress workload (whose data is relatively hard to compress) showed a compression ratio of 1.25 for 64 KB chunks, vs. 1.23 for 4 KB chunks. Also remember that if a user wants to control the chunk length for a particular table, they can; the 64 KB or 4 KB sizes are just the default. Fixes #1337 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1467063335-12096-1-git-send-email-nyh@scylladb.com>	2016-06-28 08:50:24 +03:00
Tomasz Grabiec	6108d91362	scylla-gdb: Introduce scylla ptr Helps in identifying pointers allocated through seastar allocator. Shows to which thread the pointer belongs, to which size class, whether it's live or free, and what the offset is relative to the live object. Example: (gdb) scylla ptr 0x6040abe88170 thread 1, small (size <= 320), live (0x6040abe88140 +48) Message-Id: <1467047215-1763-1-git-send-email-tgrabiec@scylladb.com>	2016-06-27 20:11:56 +03:00
Avi Kivity	22ec25b1b3	Merge seastar upstream * seastar 3029ebe...c15055c (5): > memory: add option to mlock() all memory > reactor: run idle poll handler with a pure poll function > ignore all but one failed futures in map_reduce > tutorial: more general exception printout on startup > resource: don't abort on too-high io queue count Fixes #1395. Fixes #1400.	2016-06-27 19:24:04 +03:00
Tomasz Grabiec	85a37cb379	Merge tag '1398/v3' from https://github.com/avikivity/scylla From Avi: Both the cql binary transport and the rpc server have protection against too many concurrent requests overwhelming the database due to transient allocations. They work by estimating the amount of memory a request requires, and accounting that against a semaphore. When the semaphore blocks, we stop dequeuing requests from the tcp connection. Unfortunately, this doesn't work for reads, because we can't estimate the required memory size. A small read request can require many sstables to be read, perhaps concurrently, and a large response to be generated. Fix by limiting the number of concurrent reads in a shard to 100. This is more than enough concurrency for any reasonable disk, and there is no network communication at this level, so we're safe from high network latency requiring high concurrency. Fixes #1398.	2016-06-27 18:04:33 +02:00
Avi Kivity	f03cd6e913	db: add statistics about queued reads	2016-06-27 17:25:08 +03:00
Avi Kivity	edeef03b34	db: restrict replica read concurrency Since reading mutations can consume a large amount of memory, which, moreover, is not predictable at the time the read is initiated, restrict the number of reads to 100 per shard. This is more than enough to saturate the disk, and hopefully enough to prevent allocation failures. Restriction is applied in column_family::make_sstable_reader(), which is called either on a cache miss or if the cache is disabled. This allows cached reads to proceed without restriction, since their memory usage is supposedly low. Reads from the system keyspace use a separate semaphore, to prevent user reads from blocking system reads. Perhaps we should select the semaphore based on the source of the read rather than the keyspace, but for now using the keyspace is sufficient.	2016-06-27 17:17:56 +03:00
Avi Kivity	bea7d7ee94	mutation_reader: introduce restricting_reader A restricting_reader wraps a mutation_reader, and restricts its concurrency using a provided semaphore; this allows controlling read concurrency, which is important since reads can consume a lot of resources ((number of participating sstables) * 128k after we have streaming mutations, and a lot more before).	2016-06-27 17:17:52 +03:00
Duarte Nunes	d31b52a07b	thrift: Disable CQL2 verbs And make set_cql_version a no-op. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:39:33 +02:00
Duarte Nunes	60094f4033	thrift: Implement execute_prepared_cql3_query verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:39:28 +02:00
Duarte Nunes	96068084ca	thrift: Implement prepare_cql3_query verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:39:22 +02:00
Duarte Nunes	c8afb4cc46	query_processor: Support thrift prepared statements This patch adds support for thrift prepared statements. It specializes the result_message::prepared into two types: result_message::prepared::cql and result_message::prepared::thrift, as their identifiers have different types. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:39:02 +02:00
Paweł Dziepak	1addbb9c1d	thrift: implement execute_cql3_query Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:38:52 +02:00
Duarte Nunes	2e7cb32601	query_options: Adjust value_views after prepare() query_options::prepare() changes the values array, but this is not the one used by query_options internally (e.g., in get_value_at). So we need to also recalculate the value_views after prepare() is called. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Duarte Nunes	2683a49c69	query_options: Remove value_views arg from ctor Having both the values and value_views arguments in the query_options ctor is confusing, since query_options uses only the value_views field but that is not communicated to the caller. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Duarte Nunes	62cfc4ab55	thrift: Add with_exn_cob helper function Similarly to the with_cob functions, this one takes the exn_cob function and ensures it is called in case of an exception. This is useful when the return type of the thrift verb is not nothrow move constructible; by holding on to the cob inside the verb and calling it directly when we have the result we avoid having to wrap it in a smart pointer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Duarte Nunes	b74ee6fdea	thrift: Add consistency level conversion Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Paweł Dziepak	0c441378f2	client_state: support thrift clients Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Paweł Dziepak	002d2bc353	thrift: pass query_processor to the thrift handler Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Duarte Nunes	225c5be78e	thrift: Add query_state to thrift_handler This patch adds a query_state object to the thrift handler, as it is required for CQL3 operations. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-27 15:24:27 +02:00
Avi Kivity	f96e5d7c1b	managed_bytes: fix build with gcc 6 gcc 6 complains that deleting a managed_bytes::external isn't defined because the size isn't known. I'm not sure it's correct, but there's no way to tell because flexible arrays aren't standardized. Fix by using an array of zero size. Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>	2016-06-27 10:54:10 +02:00
Avi Kivity	056b427855	range_tombstone_list: use non-template lambda for cloning tombstones Using a template lambda invokes a bug in Fedora 24's boost where the lambda's parameter is an internal boost type rather than a range_tombstone. Constraining the parameter with an explicit type avoids the problem. Message-Id: <1466844211-17298-1-git-send-email-avi@scylladb.com>	2016-06-27 10:48:59 +02:00
Amnon Heiman	a439a6b8d3	API: Add the collectd enable/disable implementation This adds the implementation for enabling and disabling the collectd metrics. An example for disabling all collectd metrics that have write in their type_instance part: curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://localhost:10000/collectd/.?instance=.&type=.&type_instance=.write.&enable=false" After that, a call to: curl -X GET "http://localhost:10000/collectd/" would return those metrics with the enable flag set to "false". An example for enabling all the metrics in cache whose type starts with byt: curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://localhost:10000/collectd/cache?type=byt.&enable=true" Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1466932139-19264-3-git-send-email-amnon@scylladb.com>	2016-06-26 12:26:50 +03:00
Amnon Heiman	4d7837af40	API Definition: collectd to support enable disable This adds to the definition of the collectd API the ability to turn specific collectd metrics on and off. For the GET endpoint, a POST option was added that allows enabling or disabling a metric. The general GET endpoint now returns the enable flag that indicates whether the metric is enabled. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1466932139-19264-2-git-send-email-amnon@scylladb.com>	2016-06-26 12:26:48 +03:00
Duarte Nunes	dfbf68cd24	commitlog: Define operator<< in namespace db Needed for compilation with gcc6. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1466852874-8448-1-git-send-email-duarte@scylladb.com>	2016-06-26 10:05:28 +03:00
Avi Kivity	5b81448ed6	main: add scylla --version option Fixes #1384. Message-Id: <1466691517-29964-1-git-send-email-avi@scylladb.com>	2016-06-23 16:24:03 +02:00
Duarte Nunes	1ffae6e6ee	database_test: Add test case for row limit This patch introduces database_test and adds a test case to ensure the row limit is respected when querying multiple partition ranges. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20160623111723.17523-1-duarte@scylladb.com>	2016-06-23 14:20:34 +02:00
Avi Kivity	e647ec1c4a	Merge "thrift: Implement describe verbs" from Duarte "This patchset implements the thrift describe verbs: - describe_keyspace - describe_keyspaces - describe_cluster_name - describe_version - describe_ring - describe_local_ring - describe_token_map - describe_partitioner - describe_snitch - describe_schema_versions The verbs describe_splits and describe_splits_ex are not implemented because they are marked as experimental (Origin's thrift interface has this to say about them: "experimental API for hadoop/parallel query support. may change violently and without warning."). Some drivers have moved away from depending on this verb (SPARKC-94). The correct way to implement the verbs for us would be to use the size_estimates system table (CASSANDRA-7688). However, we currently don't populate size_estimates, which is done by SizeEstimatesRecorder.java in Origin."	2016-06-23 13:30:39 +03:00
Duarte Nunes	b291c22e39	thrift: Complete describe_keyspace verb This patch completes the describe_keyspace verb by adding setting the remaining fields. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 12:02:47 +02:00
Duarte Nunes	febc48166d	thrift: Type name is already based on Origin This patch removes a conversion function from an internal type name to Origin's naming, which isn't needed because the abstract_type hierarchy already keeps that mapping. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 12:02:47 +02:00
Duarte Nunes	8b00fe3989	thrift: Add explanatory note about describe_splits We don't implement describe_splits, and this patch describes why that is. In a nutshell, to properly implement this, we would need something like Origin's SizeEstimatesRecorder.java, but as the verb is marked as experimental, we don't do it for now. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 12:02:46 +02:00
Duarte Nunes	b175204cfe	thrift: Implement describe_snitch verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:47 +02:00
Duarte Nunes	9e6ab878d6	thrift: Implement describe_partitioner verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:47 +02:00
Duarte Nunes	358b03c409	thrift: Implement describe_token_map verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:47 +02:00
Duarte Nunes	1ea7102d9f	thrift: Implement describe_ring verbs Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:42 +02:00
Duarte Nunes	8377264226	thrift: Implement describe_version verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:10 +02:00
Duarte Nunes	8370450dcb	thrift: Implement describe_cluster_name verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:09 +02:00
Duarte Nunes	2a898743c6	thrift: Implement describe_schema_versions verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:53:07 +02:00
Pekka Enberg	c464e85a3f	Merge "thrift: Implement DDL verbs" from Duarte "This patchset implements the thrift DDL verbs: - system_add_column_family - system_drop_column_family - system_update_column_family - system_add_keyspace - system_drop_keyspace - system_update_keyspace"	2016-06-23 12:46:58 +03:00
Duarte Nunes	3c02af083c	thrift: Implement system_update_keyspace verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:29 +02:00
Duarte Nunes	aa16c303ca	thrift: Implement system_drop_keyspace verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:29 +02:00
Duarte Nunes	8ff3fbe916	thrift: Implement system_drop_column_family verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:29 +02:00
Duarte Nunes	f6fab027c6	thrift: Implement system_update_column_family verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:28 +02:00
Duarte Nunes	de46653036	thrift: Implement system_add_column_family verb Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:28 +02:00
Duarte Nunes	25a8ffb09a	thrift: Extract keyspace_from_thrift function Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:28 +02:00
Duarte Nunes	74cb796de7	thrift: Extract schema_from_thrift function Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:28 +02:00
Duarte Nunes	9d85ea6304	thrift: Complete system_add_keyspace verb This patch completes the system_add_keyspace verb by setting all relevant options on the new schemas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:28 +02:00
Duarte Nunes	05f7a6d63e	thrift: Add basic support for dynamic CF In thrift, a static column family is one where all columns are defined upon schema creation. It maps to a CQL table with a singular partition key and a set of regular columns. On the other hand, a dynamic column family is one which allows columns to be dynamically added by insertion requests. It maps to a CQL table with a partition key and a clustering key, which will hold the names of the dynamic columns, and a regular column, which will hold the respective values. If the thrift comparator type is composite, then there will be a clustering column for each of the composite's components. There can also be mixed column families; supporting those is future work. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:41:28 +02:00
Duarte Nunes	49b8bff21c	thrift: Extract make_exception to common header This patch moves the make_exception function from thrift/handler.cc to the new header file thrift/utils.hh. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-23 11:40:52 +02:00
Benoît Canet	e37b18b231	scylla_ntp_setup: Define an ntp server on ubuntu if there is none The pool directive from ntp.conf is not recognized by ntpdate. Strip it and put the ubuntu server in place. Fixes: #1345 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466607457-14029-1-git-send-email-benoit@scylladb.com>	2016-06-23 12:40:13 +03:00
Benoît Canet	8b6bb0251d	README.md: Fix markdown formatting I suspect wrong formatting causes us trouble in the docker hub descriptions. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466603787-13423-1-git-send-email-benoit@scylladb.com>	2016-06-23 12:39:04 +03:00
Duarte Nunes	aacc7193f2	schema: Replace keyspace's schema_ptr on CF update This patch ensures we replace the schema_ptr held by its respective keyspace object when a column family is being updated. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20160623085710.26168-1-duarte@scylladb.com>	2016-06-23 11:11:52 +02:00
Tzach Livyatan	3fa7bb1292	scylla_setup: Ignore case in prompt responses Fix #1376 by converting each response to lowercase. Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <1466672539-5625-1-git-send-email-tzach@scylladb.com>	2016-06-23 12:08:26 +03:00
Glauber Costa	e08fa7dafa	fix potential stale data in cache update We currently have a problem in update_cache that can be triggered by ordering issues related to memtable flush termination (not initiation) and/or update_cache() call duration. That issue is described in #1364 and, in short, happens if a call to update_cache starts before an ongoing call finishes. There is now a new SSTable that should be consulted by the presence checker but is not. The partition checker operates on a stale list because we need to make sure the SSTable we just wrote is excluded from it. This patch changes the partition checker so that all SSTables currently in use are consulted, except for the one we have just flushed. That provides both the guarantee that we won't check our own SSTable and access to the most up-to-date SSTable list. Fixes #1364 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>	2016-06-23 10:54:44 +02:00
Pekka Enberg	bcba45f546	Merge "Prevent old node to join new cluster" from Asias Fixes #1253	2016-06-23 10:25:38 +03:00
Piotr Jastrzebski	9b011bff18	row_cache: add contiguity flag to cache entry to reduce disk IO during scans Add contiguity flag to cache entry and set it in scanning reader. Partitions fetched during scanning are continuous and we know there's nothing between them. Clear contiguity flag on cache entries when the succeeding entry is removed. Use continuous flag in range queries. Don't go to disk if we know that there's nothing between two entries we have in cache. We know that when the continuous flag of the first one is set to true. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <72bae432717037e95d1ac9465deaccfa7c7da707.1466627603.git.piotr@scylladb.com>	2016-06-23 09:43:15 +03:00
Avi Kivity	5af22f6cb1	main: handle exceptions during startup If we don't, std::terminate() causes a core dump, even though an exception is sort-of-expected here and can be handled. Add an exception handler to fix. Fixes #1379. Message-Id: <1466595221-20358-1-git-send-email-avi@scylladb.com>	2016-06-23 09:25:33 +03:00
Avi Kivity	a192c80377	gdb: fully-qualify type names gdb gets confused if a non-fully-qualified class name is used when we are in some namespace context. Help it out by adding a :: prefix. Message-Id: <1466587895-8690-1-git-send-email-avi@scylladb.com>	2016-06-22 12:04:17 +02:00
Avi Kivity	9dacd4fb80	Merge "query: Add new limits" from Duarte This patchset adds two new types of query limits: - Per partition row limit, which limits how many rows a given partition may return; needed both for thrift and for future CQL features; - Limit on the number of partitions returned, needed by thrift.	2016-06-22 11:03:13 +03:00
Duarte Nunes	82dbf5bff3	storage_proxy: Trace when retrying a query Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:48:15 +02:00
Duarte Nunes	69798df95e	query: Limit number of partitions returned This is required to implement a thrift verb. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:48:13 +02:00
Duarte Nunes	594e43a60a	compact_query: Rename partition_limit This patch renames compact_query::_partition_limit to _current_partition_limit for clarity, as the next patch adds a partition limit that limits the number of partitions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:47:29 +02:00
Duarte Nunes	e9ebd87991	compact_query: Rename limit to row_limit This patch renames compact_query::_limit to _row_limit for clarity, as a subsequent patch introduces yet another limit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:47:28 +02:00
Duarte Nunes	01b18063ea	query: Add per-partition row limit This patch adds a per-partition row limit. It ensures both local queries and the reconciliation logic abide by this limit. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 09:46:51 +02:00
Duarte Nunes	20d9813a89	storage_proxy: Fetch last replica row just in time This patch changes the way we fetch each replica's last row to determine if we got incomplete information from any of them. Instead of fetching the last rows up front, we fetch them on demand only if we actually trigger the code that needs them. We now get the last row from the versions vector of vectors. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 00:15:38 +02:00
Duarte Nunes	4ce9fc24cb	storage_proxy: Extract finding last row This patch extracts to a function the code that actually determines the last row of a partition based on the direction of the query. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-22 00:15:38 +02:00
Takuya ASADA	73ba4ac337	dist: drop sudoers.d from .rpm, since systemd moved to PermissionsStartOnly Since systemd moved to PermissionsStartOnly, only upstart uses sudoers. So move common/sudoers.d to dist/ubuntu and drop it from .rpm. Also, Ubuntu 15.10/16.04 do not require sudoers since they use systemd. So copy sudoers only for 14.04. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1466536491-9860-1-git-send-email-syuu@scylladb.com>	2016-06-21 22:59:18 +03:00
Glauber Costa	4e81f19ab5	LSA: fix typo in region merge There are many potentially tricky things about referring to different regions from the LSA perspective. Madness, however, is not one of them. I can only assume we meant made? Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8eb81f35de4b208a494e43cb392eea07b87b2bf1.1466534798.git.glauber@scylladb.com>	2016-06-21 22:58:44 +03:00
Benoît Canet	8e4dee0bd1	scylla_setup: Hide /dev/loop* The user probably doesn't want to use /dev/loop* as RAID devices. Fixes: #1259 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466520602-7888-1-git-send-email-benoit@scylladb.com>	2016-06-21 19:27:40 +03:00
Tzach Livyatan	27b99f47e8	scylla_setup: improve the wording of disk setup phase. Fix #1197 by adding XFS related info to the interactive prompt Signed-off-by: Tzach Livyatan <tzach@scylladb.com> Message-Id: <1466504625-28926-1-git-send-email-tzach@scylladb.com>	2016-06-21 19:26:31 +03:00
Avi Kivity	96ebc4e7b5	Merge seastar upstream * seastar 401c333...3029ebe (3): > util: add a seastar::value_of() helper function > rpc: force closing listen fd on server stop > reactor: fix I/O priority class id assignment	2016-06-21 15:11:26 +03:00
Tomasz Grabiec	597cbbdedc	Merge branch 'pdziepak/streamed-mutations/v5' from seastar-dev.git From Paweł: This series introduces streamed_mutations, which allow mutations to be streamed between the producers and the consumers as a series of mutation_fragments. Because of that, the mutation streaming interface works well with partitions larger than available memory, provided that actual producer and consumer implementations can support this as well. mutation_fragments are the basic objects that are emitted by streamed_mutations; they can represent a static row, a clustering row, or the beginning or the end of a range tombstone. They are ordered by their clustering keys (with static rows being always the first emitted mutation fragment). The beginning of a range tombstone is emitted before any clustering row affected by that tombstone, and the end of a range tombstone is emitted after the last clustering row affected by it. Range tombstones are disjoint. In this series all producers are converted to fully support the new interface; that includes cache, memtables and sstables. Mutation queries and data queries are the only consumers converted so far. To minimize the per-mutation_fragment overhead, streamed_mutations use batching. The actual producer implementation fills a buffer until it is full (currently, buffer size is 16; the limit should, however, be changed to depend on the actual size in memory of the stored elements) or end of stream is reached. In order to guarantee isolation of writes, reads from cache and memtable use MVCC. When a reader is created it takes a snapshot of the particular cache or memtable entry. The snapshot is immutable, and if there happen to be any incoming writes while the read is active, a new version of the partition is created. When the snapshot is destroyed, partition versions are merged together as much as possible.
Performance results with perf_simple_query (median of results with duration 15): write: before 618652.70, after 618047.58, diff -0.10%; read: before 661712.44, after 608070.49, diff -8.11%	2016-06-21 12:15:21 +02:00
Pekka Enberg	11dd20d640	Revert "ami: Change type from EBS to Instance" This reverts commit `2d7f8f4a47`. Avi sayeth: "Isn't this the other way round? EBS is persistent." and "The patch is wrong too. Instance store takes 5 minutes to boot compared to 1 minute for EBS."	2016-06-21 12:41:30 +03:00
Tomasz Grabiec	e783b58e3b	Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi From Glauber: This is my new take at the "Move throttler to the LSA" series, except this one don't actually move anything anywhere: I am leaving all memtable conversion out, and instead I am sending just the LSA bits + LSA active reclaim. This should help us see where we are going, and then we can discuss all memtable changes in a series on its own, logically separated (and hopefully already integrated with virtual dirty). [tgrabiec: trivial merge conflicts in logalloc.cc]	2016-06-21 10:22:26 +02:00
Calle Wilund	2b812a392a	commitlog_replayer: Fix calculation of global min pos per shard If a CF does not have any sstables at all, we should treat it as having a replay position of zero. However, since we also must deal with potential re-sharding, we cannot just set shard->uuid->zero initially, because we don't know what shards existed. Go through all CF:s post map-reduce, and for every shard where a CF does not have an RP-mapping (no sstables found), set the global min pos (for shard) to zero. Fixes #1372 Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>	2016-06-21 10:05:05 +03:00
Benoît Canet	2d7f8f4a47	ami: Change type from EBS to Instance Instance types do not have ephemeral drives that disappear on reboot. Fixes #1229 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466443232-5898-1-git-send-email-benoit@scylladb.com>	2016-06-21 09:56:26 +03:00
Calle Wilund	88ffe60138	batchlog_manager: Change replay mutation CL to ALL Try to emulate the origin behaviour for batch replay. They use an explicit write handler, combining 1.) Hinting to all known dead endpoints 2.) Sending to all presumed live, requiring ack from all 3.) Hinting to endpoint to which send failed. We don't have hints, so try to work around by doing send with cl=ALL, and if send fails (wholly or partially), retain the batch in the log. This is still a slight behavioural difference, and we also risk filling up the batch log in extreme cases. (Though probably not in any real environment). Refs #1222 Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>	2016-06-21 09:41:09 +03:00
Glauber Costa	7f29cb8aba	tests: add logalloc tests for pressure notification tests to make sure various scenarios of pressure notification for active asynchronous reclaim work. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:58:39 -04:00
Glauber Costa	8f5047fc5f	tests: add tests to new region_group throttle interface Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	579d121db8	LSA: export largest region We now keep the regions sorted by size, and the children region groups as well. Internally, the LSA has all information it needs to make size-based reclaim decisions. However, we don't do reclaim internally, but rather warn our user that a pressure situation is mounted. The user of a region_group doesn't need to evict the largest region in case of pressure and is free to do whatever it chooses - including nothing. But more likely than not, taking into account which region is the largest makes sense. This patch puts together this last missing piece of the puzzle, and exports the information we have internally to the user. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	35f8a2ce2c	LSA: add a backpointer to the region from its private data Region is implemented using the pimpl pattern (region_impl), and all its relevant data is present in a private structure instead of the region itself. That private structure is the one that the other parts of the LSA will refer to, the region_group being the prime example. To allow classes such as the region_group to externally export a particular region, we will introduce a backpointer region_impl -> region. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	38a402307d	LSA: enhance region_group reclaimer We are currently just allowing the region_group to specify a throttle_threshold, that triggers throttling when a certain amount of memory is reached. We would like to notify the callers that such condition is reached, so that the callers can do something to alleviate it - like triggering flushes of their structures. The approach we are taking here is to pass a reclaimer instance. Any user of a region_group can specialize its methods start_reclaiming and stop_reclaiming that will be called when the region_group becomes under pressure or ceases to be, respectively. Now that we have such facility, it makes more sense to move the throttle_threshold here than having it separately. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	6404028c6a	LSA: move subgroups to a heap as well When we decide to evict from a specific region_group due to excessive memory usage, we must also consider looking at each of their children (subgroups). It could very well be that most of memory is used by one of the subgroups, and we'll have to evict from there. We also want to make sure we are evicting from the biggest region of all, and not the biggest region in the biggest region_group. To understand why this is important, consider the case in which the regions are memtables associated with dirty region groups. It could be that a very big memtable was recently flushed, and a fairly small one took its place. That region group is still quite large because the memtable hasn't finished flushing yet, but that doesn't mean we should evict from it. To allow us to efficiently pick which region is the largest, each root of each subtree will keep track of its maximal score, defined as the maximum between our largest region total_space and the maximum maximal score of subtrees. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	e1eab5c845	LSA: store regions in a heap for regions_group Currently, the regions in a region group are organized in a simple vector. We can do better by using a binomial heap, as we do for segments, and then updating when there is change. Internally to the LSA, we are in good position to always know when change happens, so that's really the best way to do it. The end game here, is to easily call for the reclaim of the largest offending region (potentially asynchronously). Because of that, we aren't really interested in the region occupancy, but in the region reclaimable occupancy instead: that's simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing that effectively lists all non reclaimable regions in the end of the heap, in no particular order. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	54d4d46cf7	LSA: move throttling code to LSA. The database code uses a throttling function to make sure that memory used for the dirty region never is over the limit. We track that with a region group, so it makes sense to move this as generic functionality into LSA. This patch implements the LSA-side functionality and a later patch will convert the current memtable throttler to use it. Unlike the current throttling mechanism, we'll not use a timer-based mechanism here. Aside from being more generic and friendlier towards other users, this is a good change for current memtable by itself. The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we would be better off without them. Let's discuss the merits of each separately: 1) 10ms timer: If we are throttling, we expect somebody to flush the memtables for memory to be released. Since we are in position to know exactly when a memtable was written, thus releasing memory, we can just call unthrottle at that point, instead of using a timer. 2) 1MB release threshold: we do that because we have no idea how much memory a request will use, so we put the cut somehow. However, because of 1) we don't call unthrottle through a timer anymore, and do it directly instead. This means that we can just execute the request and see how much memory it has used, with no need to guess. So we'll call unthrottle at the end of every request that was previously throttled. Writing the code this way also has the advantage that we need one less continuation in the common case of the database not being throttled. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:34:19 -04:00
Paweł Dziepak	6f25533f4e	mutation_query: drop querying_reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:31:52 +01:00
Paweł Dziepak	ed12c164f8	mutation_query: make mutation queries streaming-friendly Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:31:28 +01:00
Paweł Dziepak	0828c88b25	mutation_partition: implement streaming-friendly data_query() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:31:19 +01:00
Paweł Dziepak	67ae9457e3	mutation_partition: introduce mutation_querier mutation_querier is a streamed_mutation consumer that adds the mutation content to query::result. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:53 +01:00
Paweł Dziepak	f54e604a16	mutation_partition: introduce compact_for_query compact_for_query is an intermediate stage used to compact data in a flattened stream of mutations before they are consumed by query building consumers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:53 +01:00
Paweł Dziepak	2b7e62599d	mutation_reader: add consume_flattened() Mutation reader produces a stream of streamed_mutations. Each streamed_mutation itself is a stream so basically we are dealing here with a stream of streams. consume_flattened() flattens such stream of streams making all its elements consumable by a single consumer. It also allows reversing the mutations before consumption using reverse_streamed_mutation().	2016-06-20 21:29:52 +01:00
Paweł Dziepak	5566d23180	streamed_mutation: add reverse_streamed_mutation() reverse_streamed_mutation() is an inefficient way of reversing streamed_mutations. First, it collects all mutation_fragments and then it emits them in reversed order (except the static row, which is always the first element); it also flips the bounds of range tombstones. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	f676d1779b	range_tombstone: add flip_bound_kind() flip_bound_kind() changes start bound to end bound and vice versa while preserving the inclusiveness. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	a3423bac38	tests/streamed_mutation: test freezing streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	6e68f0931e	frozen_mutation: freeze streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	349905d0fd	range_tombstone_list: add clear() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	494c6fa9c1	tests/mutation_query_test: make sure mutations are sliced properly Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	8dfabf2790	mutation_reader: support slicing in make_reader_returning_many() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	6871bd5fa0	memtable: fully support streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	983321f194	tests/mutation: do not create memtable on stack Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	4a5a9148e3	tests/row_cache: test slicing mutation reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	e1a8d94542	tests/row_cache: test mvcc Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	b2c37429e7	row_cache: drop slicing_reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	f605499aec	row_cache: fully support streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	e4ae7894d4	tests/mutation: test slicing mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	f95c5542dc	mutation_partition: allow slicing moved mutation_partition Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	db5ea591ad	add mvcc implementation for mutation_partitions To ensure isolation of operation when streaming a mutation from a mutable source (such as cache or memtable) MVCC is used. Each entry in memtable or cache is actually a list of used versions of that entry. Incoming writes are either applied directly to the last version (if it wasn't being read by anyone) or prepended to the list (if the former head was being read by someone). When reader finishes it tries to squash versions together provided there is no other reader that could prevent this. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	b2e6e95de7	clustering_key_filter: always return ranges in ascending order Originally, ranges for reversed queries were in descending order and ranges for forward queries in ascending order. However, streamed_mutations require them to always be in ascending order. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	2ab1a73efa	memtable: rename partition_entry to memtable_entry partition_entry is going to be a more general object used by both cache and memtable entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	4992ea9949	tests: add test for anchorless_list Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	dfa827161d	utils: add anchorless list The main user of this list is MVCC implementation in partition_version.cc. The reason why boost::intrusive::list<> cannot be used is that there is no single owner of the list who could keep boost::intrusive::list<> object alive. In the MVCC case there is at least one partition_entry object and possibly multiple partition_snapshot objects whose lifetimes are independent and the list must remain in a valid state as long as at least one of them is alive. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	f991a2deb5	tests/row_cache_alloc_stress: use another memtable for underlying storage It is incorrect to update row_cache with a memtable that is also its underlying storage. The reason for that is that after memtable is merged into row_cache they share lsa region. Then when there is a cache miss it asks underlying storage for data. This will result in memtable reader running under row_cache allocation section. Since memtable reader also uses allocation section the result is an assertion fault since allocation sections from the same lsa region cannot be nested. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	5a5c519fa0	tests/row_cache_alloc_stress: use large cells instead of many rows With streamed_mutations a partition with many small rows doesn't stress the cache as much as the test expects. Use large clustering rows instead. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	71e961427a	test/sstables: test reading sstables with incorrect ordering Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	2ee69860d2	sstables: make sstable reader produce streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	e82cc68196	streamed_mutation: add range_tombstone_stream range_tombstone_stream encapsulates logic responsible for turning range_tombstone_list into a stream of mutation_fragments and merging that stream with a stream of clustering rows. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	a200189541	range_tombstone_list: mark apply() argument as const Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	5a60f6d1ec	range_tombstone: extract is_single_clustering_row_tombstone() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	b6f78a8e2f	sstable: make sstable reads return streamed_mutation Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	9e8db53c46	sstables: allow row consumer to stop at any point Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	125c4e20e2	tests/sstables: add test for sliced mutation reads Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	71088b4f4a	sstables: fix partition slicing for row markers and collections Row markers and collections weren't filtered out even if they belonged to a clustering row that shouldn't be in the result. The check whether to include cell or not was done only for live and dead atomic cells. This patch adds appropriate checks for collections and row markers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	575daea897	sstables: make deletion_time to tombstone cast safer Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	7074b439d8	mutation_reader: do not ask for mutation before current is consumed mutation_reader and streamed_mutation may use the same stream as a source of mutation_fragments and mutations themselves (this happens in sstable reader). In such case asking for next streamed_mutation from mutation_reader would invalidate all other streamed_mutations. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	737eb73499	mutation_reader: make readers return streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	52a0b405f8	tests/row_cache: simplify verify_has() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Paweł Dziepak	fec3346343	tests: add streamed_mutation assertions Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	11f43a8e91	tests/sstable: drop sstable_range_wrapping_reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	5b45d46f82	row_cache: simplify slicing_reader Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	9c83eb9542	mutation_reader: drop joining and lazy readers Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	579de26e95	storage_proxy: drop make_local_reader() This code was used only by its unit test. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	c8f4b96e76	tests: add streamed_mutation_tests Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	a1fc5888d3	streamed_mutation: add mutation_merger Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	48e08fa997	mutation: add mutation_from_streamed_mutation() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	9df01c2a36	streamed_mutation: add streamed_mutation_from_mutation() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	22160ae6d5	mutation_partition: make rows_type public Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	675f684788	streamed_mutation: introduce streamed_mutation streamed_mutation represents a mutation in a form of a stream of mutation_fragments. streamed_mutation emits mutation fragments in the order they should appear in the sstables, i.e. static row is always the first one, then clustering rows and range tombstones are emitted according to the lexicographical ordering of their clustering keys and bounds of the range tombstones. Range tombstones are disjoint, i.e. after emitting range_tombstone_begin it is guaranteed that there is going to be a single range_tombstone_end before another range_tombstone_begin is emitted. The ordering of mutation_fragments also guarantees that by the time the consumer sees a clustering row it has already received all relevant tombstones. Partition key and partition tombstone are not streamed and are part of the streamed_mutation itself. streamed_mutation uses batching. The mutation implementations are supposed to fill a buffer with mutation fragments until is_buffer_full() or the end of stream is encountered. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	262337768a	streamed_mutation: introduce mutation_fragment This commit introduces mutation_fragment class which represents the parts of mutation streamed by streamed_mutation. mutation_fragment can be: - a static row (only one in the mutation) - a clustering row - start of range tombstone - end of range tombstone There is an ordering (implemented in position_in_partition class) between mutation_fragment objects. It reflects the order in which content of partition appears in the sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	84713d2236	utils: extract optimized_optional<> from mutation_opt Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:49 +01:00
Paweł Dziepak	847bf878ec	mutation_partition: add more row::apply() overloads Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:48 +01:00
Paweł Dziepak	7809adc6ce	keys: add compound_wrapper::tri_compare Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:48 +01:00
Paweł Dziepak	c24f08a683	range_tombstone_list: compare full tombstones not just timestamps Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:48 +01:00
Paweł Dziepak	df4c1c6293	range_tombstone: simplify bound_view::equal() Bounds are equal only if they are of the same kind. No need to check weights. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:48 +01:00
Paweł Dziepak	a6aceb179d	range_tombstone: fix bound ordering Assuming the clustering keys are equal: excl_end < incl_start < incl_end < excl_start. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:48 +01:00
Paweł Dziepak	3a0e76d635	range_tombstone: check for adjacent instead of equal bounds Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:48 +01:00
Nadav Har'El	3372052d48	Rewriting shared sstables only after all shards loaded sstables After commit `faa4581`, each shard only starts splitting its shared sstables after opening all sstables. This was important because compaction needs to be aware of all sstables. However, another bug remained: If one shard finishes loading its sstables and starts the splitting compactions, and in parallel a different shard is still opening sstables - the second shard might find a half-written sstable being written by the first shard, and abort on a malformed sstable. So in this patch we start the shared sstable rewrites - on all shards - only after all shards finished loading their sstables. Doing this is easy, because main.cc already contains a list of sequential steps where each uses invoke_on_all() to make sure the step completes on all shards before continuing to the next step. Fixes #1371 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>	2016-06-20 16:25:24 +03:00
Calle Wilund	7cdea1b889	commitlog: Use flush queue for write/flush ordering, improve batch Using an ordering mechanism better than rw-locks for write/flush means we can wait for pending write in batch mode, and coalesce data from more than one mutation into a chunk. It also means we can wait for a specific read+flush pair (based on file position). Downside is that we will not do parallel writes in batch mode (unless we run out of buffer), which might underutilize the disk bandwidth. Upside is that running in batch mode (i.e. per-write consistency) now has way better bandwidth, and also, at least with high mutation rate, better average latency. Message-Id: <1465990064-2258-1-git-send-email-calle@scylladb.com>	2016-06-20 13:09:16 +03:00
Benoît Canet	77375cefaa	docker: normalize environment variables names Use a more docker like form. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466414939-5019-1-git-send-email-benoit@scylladb.com>	2016-06-20 12:30:13 +03:00
Benoît Canet	4c7ac4cab7	docker: implement seeds and broadcast_address variables Implement the seeds and broadcast_address variable required for clustering behavior. Do it raw with sed in the startup script. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466412846-4760-3-git-send-email-benoit@scylladb.com>	2016-06-20 11:55:03 +03:00
Benoît Canet	fd811c90fc	docker: Complete the missing part of production mode Scylla will not start if the disk was not benchmarked so run io_tune with the right parameters. Also add the cpu_set environment variables for passing cpu set to iotune and scylla. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466412846-4760-2-git-send-email-benoit@scylladb.com>	2016-06-20 11:54:54 +03:00
Pekka Enberg	1d5f7be447	systemd: Use PermissionsStartOnly instead of running sudo Use the PermissionsStartOnly systemd option to apply the permission related configurations only to the start command. This allows us to stop using "sudo" for ExecStartPre and ExecStopPost hooks and drop the "requiretty" /etc/sudoers hack from Scylla's RPM. Tested-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1466407587-31734-1-git-send-email-penberg@scylladb.com>	2016-06-20 11:53:24 +03:00
Vlad Zolotarov	baf3614e8f	sstables: don't backup sstables that are a result of a compaction According to incremental backup description (http://docs.datastax.com/en/cassandra_win/2.2/cassandra/operations/opsBackupIncremental.html) sstables that are a result of a compaction process should not be backed up since original sstables had already been backed up. Fixes #1308 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <1466338622-7323-1-git-send-email-vladz@cloudius-systems.com>	2016-06-20 09:52:30 +03:00
Pekka Enberg	f4153c75a0	cql3: Bump CQL language version to 3.2.1 We already added 3.2.1 support in commit `569d288` ("cql3: Add TRUNCATE TABLE alias for TRUNCATE") but never got around fixing the CQL version reported to drivers. Fixes #1358. Message-Id: <1466403967-28654-1-git-send-email-penberg@scylladb.com>	2016-06-20 09:42:12 +03:00
Avi Kivity	07045ffd7c	dist: fix scylla-kernel-conf postinstall scriptlet failure Because we build on CentOS 7, which does not have the %sysctl_apply macro, the macro is not expanded, and therefore executed incorrectly even on 7.2, which does. Fix by expanding the macro manually. Fixes #1360. Message-Id: <1466250006-19476-1-git-send-email-avi@scylladb.com>	2016-06-20 09:36:39 +03:00
Lucas Meneghel Rodrigues	ae622b0c08	dist/common/scripts/scylla_kernel_check: Update messages Small grammar tweaks to the script's output messages. Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com> Message-Id: <1466205496-3885-3-git-send-email-lmr@scylladb.com>	2016-06-19 19:28:58 +03:00
Lucas Meneghel Rodrigues	aacf7eb2ae	dist/common/scripts/scylla_kernel_check: Fix conditional statement Since most of the time people are running scylla_setup on a fully upgraded ubuntu 14.04 box, we rarely reach that code path, but once we do we end up with an error. Let's fix that. Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com> Message-Id: <1466205496-3885-2-git-send-email-lmr@scylladb.com>	2016-06-19 19:28:56 +03:00
Nadav Har'El	faa45812b2	Rewrite shared sstables only after entire CF is read Starting in commit `721f7d1d4f`, we start "rewriting" a shared sstable (i.e., splitting it into individual shards) as soon as it is loaded in each shard. However as discovered in issue #1366, this is too soon: Our compaction process relies in several places on compaction being done only after all the sstables of the same CF have been loaded. One example is that we need to know the content of the other sstables to decide which tombstones we can expire (this is issue #1366). Another example is that we use the last generation number we are aware of to decide the number of the next compaction output - and this is wrong before we saw all sstables. So with this patch, while loading sstables we only make a list of shared sstables which need to be rewritten - and the actual rewrite is only started when we finish reading all the sstables for this CF. We need to do this in two cases: reboot (when we load all the existing sstables we find on disk), and nodetool refresh (when we import a set of new sstables). Fixes #1366. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>	2016-06-19 16:50:51 +03:00
Paweł Dziepak	dde87e0b0e	row_cache: drop schema upgrade for new entries in update() Commit `daad2eb` "row_cache: fix memory leak in case of schema upgrade failure" has fixed a memory leak caused by failed upgrade_entry(). However, in case of upgrade failure memtable_entry used to create the new cache entry was left in some invalid state. If the operation was retried the cache would attempt again to apply that memtable_entry which now would be in invalid state. The solution is either to ignore upgrade_entry() exceptions or to not call it at all and let the cache entry be upgraded on demand. This patch implements the latter. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1466163435-27367-1-git-send-email-pdziepak@scylladb.com>	2016-06-17 13:43:01 +02:00
Paweł Dziepak	daad2ebf81	row_cache: fix memory leak in case of schema upgrade failure When update() causes a new entry to be inserted to the cache the procedure is as follows: 1. allocate and construct new entry 2. upgrade entry schema 3. add entry to lru list and cache tree Step 2 may fail and at this point the pointer to the entry is neither protected by RAII nor added in any of the cache containers. The solution is to swap steps 2 and 3 so that even if the upgrade fails the entry is already owned by the cache and won't leak. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1466161709-25288-1-git-send-email-pdziepak@scylladb.com>	2016-06-17 13:12:01 +02:00
Asias He	4f3ce42163	storage_service: Prevent old version node to join a new version cluster We want to prevent older version of scylla which has fewer features to join a cluster with newer version of scylla which has more features, because when scylla sees a feature is enabled on all other nodes, it will start to use the feature and assume existing nodes and future nodes will always have this feature. In order to support downgrade during rolling upgrade, we need to support mixed old and new nodes case. 1) All old nodes O O O O O <- N OK O O O O O <- O OK 2) All new nodes N N N N N <- N OK N N N N N <- O FAIL 3) Mixed old and new nodes O N O N O <- N OK O N O N O <- O OK (O == old node, N == new node, <- == joining the cluster) With this patch, I tested: 1.1) Add new node to new node cluster gossip - Feature check passed. Local node 127.0.0.4 features = {RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES} 1.2) Add old node to old node cluster gossip - Feature check passed. Local node 127.0.0.4 features = {}, Remote common_features = {} 2.1) Add new node to new node cluster gossip - Feature check passed. Local node 127.0.0.4 features = {RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES} 2.2) Add old node to new node cluster seastar - Exiting on unhandled exception: std::runtime_error (Feature check failed. This node can not join the cluster because it does not understand the feature. Local node 127.0.0.4 features = {}, Remote common_features = {RANGE_TOMBSTONES}) 3.1) Add new node to mixed cluster gossip - Feature check passed. Local node 127.0.0.4 features = {RANGE_TOMBSTONES}, Remote common_features = {} 3.2) Add old node to mixed cluster gossip - Feature check passed. Local node 127.0.0.4 features = {}, Remote common_features = {} Fixes #1253	2016-06-17 10:49:45 +08:00
Asias He	32ed468e42	gossip: Remove empty string feature in get_supported_features If the feature string is empty, boost::split will return std::set<sstring> = {""} instead of std::set<sstring> = {} which will make a node with a feature, e.g. std::set<sstring> = {"RANGE_TOMBSTONES"}, think it does not understand the feature of a node with no features at all.	2016-06-17 10:49:45 +08:00
Gleb Natapov	4659800ab9	storage_proxy: implement custom speculative retry strategy User may specify time after which speculative retry should happen instead of relying on cf statistics. Use provided value in speculative executor. Message-Id: <20160616104422.GH5961@scylladb.com>	2016-06-16 13:45:56 +03:00
Pekka Enberg	d72c608868	service/storage_service: Make do_isolate_on_error() more robust Currently, we only stop the CQL transport server. Extract a stop_transport() function from drain_on_shutdown() and call it from do_isolate_on_error() to also shut down the inter-node RPC transport, Thrift, and other communications services. Fixes #1353	2016-06-16 13:34:09 +03:00
Avi Kivity	85bb5ea064	Merge "Reduce LSA reclaim latency" from Tomasz "Reclaiming many segments was observed to cause up to multi-ms latency. With the new setting, the latency of reclamation cycle with full segments (worst case mode) is below 1ms. I saw no difference in throughput in a CQL write micro benchmark in either of these workloads: - full segments, reclaim by random eviction - sparse segments (3% occupancy), reclaim by compaction and no eviction Fixes #1274."	2016-06-16 10:47:57 +03:00
Pekka Enberg	a8f95e8081	dist/docker: Use Scylla superpackage for installation Make the Dockerfile more future-proof by using the Scylla superpackage for installation. Message-Id: <1466015996-19792-1-git-send-email-penberg@scylladb.com>	2016-06-16 10:32:18 +03:00
Benoît Canet	c133748a24	scylla_setup: Fix RAID device enumeration Commit `f42673ed1e` ("scylla_setup: Hide busy block devices from RAID0 configuration") wasn't enumerating anything. Additionally it listed from /dev/ and not /dev/dm which broke the test conditions. This one uses blkid instead of /proc/partitions. A follow up patch will be required to mask encrypted devices. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466059657-12377-1-git-send-email-benoit@scylladb.com>	2016-06-16 09:52:25 +03:00
Glauber Costa	01a658f51d	LSA: helper function for region_group current hierarchy walk converted, but more users will come. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	741aa16748	LSA: allow a region_group to have a threshold for throttling specified Allocations will still be allowed if made directly, but callers will have the choice (in an upcoming patch) to proceed only if memory is below this threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	7cd0c0731e	region_group: delete move constructor Tomek correctly points out that since we are now using "this" in lambda captures, we should make the region_group not movable. We currently define a move constructor, but there are no users. So we should just remove them. copy constructor is already deleted, and so are the copy and move assignment operators. So by removing the move constructor, we should be fine. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Benoît Canet	0cf8144485	scylla_setup: Propose default values when judicious Also takes care of explaining the options. Fixes #1031 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466011848-11054-1-git-send-email-benoit@scylladb.com>	2016-06-15 20:33:55 +03:00
Benoît Canet	263a55c0da	scylla_setup: Inform the user that he can skip any step Fixes: #1188 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466001423-9547-3-git-send-email-benoit@scylladb.com>	2016-06-15 19:38:23 +03:00
Benoît Canet	f42673ed1e	scylla_setup: Hide busy block devices from RAID0 configuration This patch looks in /proc/mount for the device name so the device or its subdevices will be excluded from the available RAID0 targets. It does the same with physical volumes from device mapper. Fixes #1189 Message-Id: <1466001423-9547-4-git-send-email-benoit@scylladb.com>	2016-06-15 19:36:11 +03:00
Paweł Dziepak	c8e75d2e84	schema: cache is_atomic() in column_definition is_atomic() is called for each cell in mutation applies, compaction and query. Since the value doesn't change it can be easily cached which would save one indirection and virtual call. Results of perf_simple_query -c1 (median, duration 60): before after read 54611.49 55396.01 +1.44% write 65378.92 68554.25 +4.86% Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1465991045-11140-1-git-send-email-pdziepak@scylladb.com>	2016-06-15 19:18:13 +03:00
Benoît Canet	4def1f4524	dist: sysctl.d: Disable automatic numa balancing On NUMA hardware, autonuma may reduce performance by unmapping memory. Since we do manual NUMA placement, autonuma will not help anything. We ought to disable it by setting the kernel.numa_balancing sysctl to 0. Fixes: #1120 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1466006345-9972-1-git-send-email-benoit@scylladb.com>	2016-06-15 19:11:00 +03:00
Gleb Natapov	7f54333c45	storage_proxy: fix compilation on older boost boost before 1.56.0 had a broken boost::size() implementation. Do not use it. Message-Id: <20160615123134.GD5961@scylladb.com>	2016-06-15 15:34:57 +03:00
Asias He	de0fd98349	repair: Switch log level to warn instead of error dtest takes error level log as serious error. It is not a serious error for streaming to fail to send a verb and fail a streaming session which triggers a repair failure, for example, the peer node is gone or stopped. Switch to use log level warn instead of level error. Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test Fixes: #1335 Message-Id: <406fb0c4a45b81bd9c0aea2a898d7ca0787b23e9.1465979288.git.asias@scylladb.com>	2016-06-15 13:01:35 +03:00
Asias He	94c9211b0e	streaming: Switch log level to warn instead of error dtest takes error level log as serious error. It is not a serious error for streaming to fail to send a verb and fail a streaming session, for example, the peer node is gone or stopped. Switch to use log level warn instead of level error. Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test Fixes: #1335 Message-Id: <0149d30044e6e4d80732f1a20cd20593de489fc8.1465979288.git.asias@scylladb.com>	2016-06-15 13:01:22 +03:00
Vlad Zolotarov	c616e74ae4	locator::gossiping_property_file_snitch: use a lowres_clock time source for a timer gossiping_property_file_snitch checks a configuration file every 60s. lowres_clock clock source should be good enough for that. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465314448-11611-1-git-send-email-vladz@cloudius-systems.com>	2016-06-15 13:01:05 +03:00
Tomasz Grabiec	207c8d94f1	idl: Rename variable to a more meaningful name Message-Id: <1465909911-10534-2-git-send-email-tgrabiec@scylladb.com>	2016-06-14 17:02:59 +03:00
Raphael S. Carvalho	80d8c5ef6f	compaction: use proper type in constructor Correctness is not affected due to long type, but an unsigned long type should be definitely used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d3ab15a3206306de195aeb3d78f9b5bc4ca9208e.1465908970.git.raphaelsc@scylladb.com>	2016-06-14 17:02:32 +03:00
Tomasz Grabiec	8e8f63de85	mutation_partition_view: Avoid unnecessary copy into temporary Message-Id: <1465909038-8174-1-git-send-email-tgrabiec@scylladb.com>	2016-06-14 17:02:17 +03:00
Tomasz Grabiec	75f899cc93	lsa: Make reclamation step configurable via config	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	cd9955d2ce	lsa: Reclaim 1 segment by default Reclaiming many segments was observed to cause up to multi-ms latency. With the new setting, the latency of reclamation cycle with full segments (worst case mode) is below 1ms. I saw no decrease in throughput compared to the step of 16 segments in either of these modes: - full segments, reclaim by random eviction - sparse segments (3% occupancy), reclaim by compaction and no eviction Fixes #1274.	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	86b76171a8	lsa: Use the same step in both internal and external reclamations	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	d74d902a01	lsa: Make reclamation step configurable	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	93bb95bd0d	lsa: Log reclamation rate	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	cb18418022	lsa: Print more details before aborting	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	7cb98c916f	tests: lsa_async_eviction_test: Push to refs with reclaim lock push_back() is not reentrant with pop_front(), used by the evictor. If the reclaimer runs when std::deque allocates a new node it will get corrupted. Fix by running push_back() under reclaim lock.	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	de8772525a	tests: lsa_async_eviction_test: Make sure refs scope encloses reclaimer scope	2016-06-14 15:13:14 +02:00
Tomasz Grabiec	c4a556ac13	tests: lsa_async_eviction_test: Fix use after free due to at_exit() callback The callback will run after thread is destroyed. We don't really need the stop feature, so for now just remove it.	2016-06-14 15:13:14 +02:00
Pekka Enberg	155ad2eeb5	storage_service: Fix start_rpc_server() to use logger Message-Id: <1465882880-7392-1-git-send-email-penberg@scylladb.com>	2016-06-14 09:52:04 +02:00
Raphael S. Carvalho	0b2cd41daf	database: remember sstable level when cleaning it up Cleanup operation wasn't preserving level of sstables. That will have a bad impact on performance because compaction work is lost. Fixes #1317. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <35ce8fbbb4590725bb0414e6a5450fcbe6cb7212.1465843387.git.raphaelsc@scylladb.com>	2016-06-14 08:06:00 +03:00
Vlad Zolotarov	d3960f0bbb	tracing: rearrange shut down tracing::tracing local instance is dereferenced from a cql_server::connection::process_request(), therefore tracing::tracing service may be stop()ed only after a CQL server service is down. On the other hand it may not be stopped before RPC service is down because a remote side may request a tracing for a specific command too. This patch splits the tracing::tracing stop() into two phases: 1) Flush all pending tracing records and stop the backend. 2) Stop the service. The first phase is called after CQL server is down and before RPC is down. The second phase is called after RPC is down. Fixes #1339 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>	2016-06-14 07:58:04 +03:00
Avi Kivity	49449fc30c	Merge seastar upstream * seastar 864d6dc...401c333 (8): > scollectd: Support filtering specific collectd metrics > core: Integrate error reporting with the logging framework > rpc: wait for all replies to be completed before closing rpc server > rpc: clean up resource accounting > queue: fix race between pop_eventually() and abort() > rpc_test: fix cancel test to not depend on timing. > tutorial: explain application-specific command line options > add ostream output operator for std::unordered_map	2016-06-13 19:35:00 +03:00
Gleb Natapov	e089166cfa	storage_proxy: wait only for expected CL when writing back data during read repair When read repair writes diffs back to replicas it is enough to wait for the requested CL to guarantee read monotonicity. This patch makes read repair write reuse regular mutate functionality which already tracks CL status. This is done by changing the write response handler to not hold a mutation directly, but instead hold a container that, depending on whether this is a read repair write or a regular one, can provide a different mutation per destination. Message-Id: <20160613124727.GL1096@scylladb.com>	2016-06-13 19:01:51 +03:00
Duarte Nunes	c896309383	database: Actually decrease query_state limit query_state expects the current row limit to be updated so it can be enforced across partition ranges. A regression introduced in `e4e8acc946` prevented that from happening by passing a copy of the limit to querying_reader. This patch fixes the issue by having column_family::query update the limit as it processes partitions from the querying_reader. Fixes #1338 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1465804012-30535-1-git-send-email-duarte@scylladb.com>	2016-06-13 10:03:27 +02:00
Avi Kivity	465c0a4ead	Merge "Make stronger guarantees in row_cache's clear/invalidate" from Tomasz "Correctness of current uses of clear() and invalidate() relies on the fact that cache is not populated using readers created before invalidation. Sstables are first modified and then cache is invalidated. This is not guaranteed by the current implementation though. As pointed out by Avi, a populating read may race with the call to clear(). If that read started before clear() and completed after it, the cache may be populated with data which does not correspond to the new sstable set. To provide such a guarantee, invalidate() variants were adjusted to synchronize using _populate_phaser, similarly to how row_cache::update() does. Fixes #1291."	2016-06-13 09:55:29 +03:00
Shlomi Livne	ac6f2b5c13	dist/common: Update scylla_io_setup to use settings done in cpuset.conf scylla_io_setup is searching for --smp and --cpuset setting in SCYLLA_ARGS. We have moved the settings of this args into /etc/scylla.d/cpuset.conf and they are set by scylla_cpuset_setup into CPUSET. Fixes: #1327 Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <2735e3abdd63d245ec96cfa1e65f766b1c12132e.1465508701.git.shlomi@scylladb.com>	2016-06-10 09:37:44 +03:00
Vlad Zolotarov	89375d4c2a	service::storage_proxy: tracing: instrument read_digest and read_mutation_data Instrument read_digest and read_mutation_data handlers similarly to a read_data handler instrumentation. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465304055-4263-1-git-send-email-vladz@cloudius-systems.com>	2016-06-09 14:32:42 +02:00
Pekka Enberg	8df5aa7b0c	utils/exceptions: Whitelist EEXIST and ENOENT in should_stop_on_system_error() There are various call-sites that explicitly check for EEXIST and ENOENT: $ git grep "std::error_code(E" database.cc: if (e.code() != std::error_code(EEXIST, std::system_category())) { database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) { database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) { database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) { sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) { sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) { Commit `961e80a` ("Be more conservative when deciding when to shut down due to disk errors") turned these errors into a storage_io_exception that is not expected by the callers, which causes 'nodetool snapshot' functionality to break, for example. Whitelist the two error codes to revert back to the old behavior of io_check(). Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>	2016-06-09 10:03:04 +02:00
Pekka Enberg	02d033667a	utils: Improve storage_io_exception error message Make storage_io_exception exception error message less cryptic by actually including the human-readable error message from std::system_error... Before: nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage io error errno: 2 After: nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage I/O error: 2: No such file or directory We can improve this further by including the name of the file that the I/O error happened on. Message-Id: <1465452061-15474-1-git-send-email-penberg@scylladb.com>	2016-06-09 09:58:00 +02:00
Tomasz Grabiec	d5a2d7a88d	row_cache: Add eviction and removal counters Fixes #1273. Message-Id: <1465315433-8473-1-git-send-email-tgrabiec@scylladb.com>	2016-06-08 16:08:32 -04:00
Nadav Har'El	721f7d1d4f	Rewrite shared sstables soon after startup Several shards may share the same sstable - e.g., when re-starting scylla with a different number of shards, or when importing sstables from an external source. Sharing an sstable is fine, but it can result in excessive disk space use because the shared sstable cannot be deleted until all the shards using it have finished compacting it. Normally, we have no idea when the shards will decide to compact these sstables - e.g., with size-tiered-compaction a large sstable will take a long time until we decide to compact it. So what this patch does is to initiate compaction of the shared sstables - on each shard using it - so that as soon as possible after the restart, the original sstable is split into separate sstables per shard, and the original sstable can be deleted. If several sstables are shared, we serialize this compaction process so that each shard only rewrites one sstable at a time. Regular compactions may happen in parallel, but they will not be able to choose any of the shared sstables because those are already marked as being compacted. Commit `3f2286d0` increased the need for this patch, because since that commit, if we don't delete the shared sstable, we also cannot delete additional sstables which the different shards compacted with it. For one scylla user, this resulted in so much excessive disk space use, that it literally filled the whole disk. After this patch, commit `3f2286d0`, or the discussion in issue #1318 on how to improve it, is no longer necessary, because we will never compact a shared sstable together with any other sstable - as explained above, the shared sstables are marked as "being compacted" so the regular compactions will avoid them. Fixes #1314. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-06-08 15:44:29 -04:00
Raphael S. Carvalho	1b8e170254	compaction: retry compaction until strategy is satisfied Previously, we were using a stat to decide if compaction should be retried, but that's not efficient. The information is also lost after node is restarted. After these changes, compaction will be retried until strategy is satisfied, i.e. there is nothing to compact. We will now be doing the following in a loop: Get compaction job from compaction strategy. If cannot run, finish the loop. Otherwise, compact this column family. Go back to start of the loop. By the way, pending_compactions stat will be deprecated after this commit. Previously, it was increased to indicate the want for compaction and decreased when compaction finished. Now, we can compact more than we asked for, so it would be decreased below 0. Also, it's the strategy that will tell the want for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>	2016-06-08 11:31:56 -04:00
Avi Kivity	7bd4b7ca63	cql3: split use_statement into raw and prepared variants Rather than having one class fulfil both roles, have one class per role, disentangling dependencies. Message-Id: <1465053407-20931-1-git-send-email-avi@scylladb.com>	2016-06-08 16:48:45 +03:00
Yoav Kleinberger	43071bf488	tools/scyllatop: handle absentee metrics Sometimes a metric previously reported from collectd is not available anymore. Previously, this caused scyllatop to log an exception to the user - which in effect destroys the user experience and inhibits monitoring other metrics. This patch makes ScyllaTop handle this problem. It will display such metrics as 'not available', and exclude them from sum and average computations. Closes issue #1287. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1465301178-27544-1-git-send-email-yoav@scylladb.com>	2016-06-08 16:35:55 +03:00
Vlad Zolotarov	24624b2600	tests/gossiping_property_file_snitch_test: cancel O_DIRECT enforcement Cancel O_DIRECT enforcement on shard 0 (a default I/O shard for this snitch) to ensure proper functioning on any FS (e.g. ecryptfs). Otherwise the tests fail on file systems not supporting O_DIRECT. Fixes #1324 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465385087-20510-1-git-send-email-vladz@cloudius-systems.com>	2016-06-08 16:21:44 +03:00
Tomasz Grabiec	287ff7dbd3	Merge tag 'ms_update/v2' from seastar-dev.git From Asias: In `f27e5d2a6` (messaging_service: Delay listening ms during boot up), messaging_service startup is split into two stages. Adjust the api registration code and fix up the messaging_service stop code.	2016-06-08 10:25:14 +02:00
Paul McGuire	8326fe7760	Clean up idl-compiler pyparsing usage This patch makes a few minor improvements in the parser: - merge first and rest into 2-argument form of Word to define identifier – should give some performance boost, simpler code - replace Literal(keyword_string) with Keyword(keyword_string) throughout - stricter parsing, avoids misinterpreting identifiers with keywords - replace expr.setResultsName("name") with expr("name") throughout – this is a style change (no actual change in underlying parser behavior), but I find this form easier to follow - add calls to setName to make exceptions more readable Message-Id: <005901d1bbd2$711f7bb0$535e7310$@austin.rr.com>	2016-06-08 08:13:05 +03:00
Asias He	b36d3be5d4	messaging_service: Fix messaging_service::stop There are two problems: 1. _server_tls is not stopped 2. _server and _server_tls might not be created if messaging_service::start_listen is not called yet.	2016-06-08 11:13:36 +08:00
Asias He	e6f63a50e1	main: Delay the messaging_service api registration Since messaging_service is fully initialized in storage_service::init_server which calls messaging_service::start_listen, we need to delay the messaging_service api registration after it.	2016-06-08 11:13:35 +08:00
Asias He	f7d25e6bae	messaging_service: Handle _server is not created in foreach_server_connection_stats It is possible _server is not created yet when foreach_server_connection_stats is called. Handle this case.	2016-06-08 11:13:35 +08:00
Gleb Natapov	9635e67a84	config: adjust boost::program_options validator to work with db::string_map Fixes #1320 Message-Id: <20160607064511.GX9939@scylladb.com>	2016-06-07 10:42:27 +03:00
Amnon Heiman	2cf882c365	rate_moving_average: mean_rate is not initialized The rate_moving_average is used by timed_rate_moving_average to return its internal values. If there are no timed events, the mean_rate is not properly initialized. To solve that the mean_rate is now initialized to 0 in the structure definition. Refs #1306 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1465231006-7081-1-git-send-email-amnon@scylladb.com>	2016-06-07 09:38:58 +03:00
Vlad Zolotarov	ce08bc611c	tracing: fix debug compilation Define flush_period as a const and not as constexpr. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1465240516-20128-1-git-send-email-vladz@cloudius-systems.com>	2016-06-06 22:15:27 -04:00
Avi Kivity	e25380347a	Merge "tracing: probabilistic tracing" from Vlad "This series includes some fixes to and adds a probabilistic tracing feature."	2016-06-06 11:25:18 -04:00
Benoît Canet	b508aaf0d9	docker: Add the production environment variable This variable if set to true will activate developer mode. It will be set by using the -e option of docker run. The xfs bind mount behavior and the cpuset behavior will be set by using the relevant docker command-line options and documented in the scylla/docker howto. Fixes: #1267 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1465213713-2537-1-git-send-email-benoit@scylladb.com>	2016-06-06 16:28:17 +03:00
Benoît Canet	c771854120	docker: Start scylla on ubuntu docker Make it behave on par with the redhat version Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1465218003-2740-1-git-send-email-benoit@scylladb.com>	2016-06-06 16:27:03 +03:00
Vlad Zolotarov	0611417c76	api::storage_service: add set_trace_probability/get_trace_probability Trace probability defines a probability for the next CQL command to be traced. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-06 15:44:28 +03:00
Vlad Zolotarov	905190ac06	tracing: add support for probabilistic tracing Add support for defining a probability (a value in a [0,1] range) for tracing the next CQL request. Traces for requests that are chosen to be traced due to this feature are not going to be flushed immediately. Use std::subtract_with_carry_engine (implements the "lagged Fibonacci" algorithm) random number engine for fastest generation of random integer values. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-06 15:41:01 +03:00
Vlad Zolotarov	779ff88c76	tracing: add flush timer Flush pending sessions to the storage every 2 seconds. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-06 14:34:08 +03:00
Tomasz Grabiec	170a214628	row_cache: Make stronger guarantees in clear/invalidate Correctness of current uses of clear() and invalidate() relies on the fact that cache is not populated using readers created before invalidation. Sstables are first modified and then cache is invalidated. This is not guaranteed by the current implementation though. As pointed out by Avi, a populating read may race with the call to clear(). If that read started before clear() and completed after it, the cache may be populated with data which does not correspond to the new sstable set. To provide such a guarantee, invalidate() variants were adjusted to synchronize using _populate_phaser, similarly to how row_cache::update() does.	2016-06-06 13:21:06 +02:00
Vlad Zolotarov	4b008ac5ea	tracing: rework maximum sessions amount back pressure strategy A tracing session life cycle includes 3 stages: 1) Active: when new trace records are being added to this session. 2) Pending for flushing to a storage: when session is over but not yet flushed to the storage ("backend"). 3) Flushing: when session's records are being flushed to the storage and this process is not yet completed. Sessions may accumulate in each of the stages above and we should limit the maximum amount of sessions being accumulated in each of them in order to avoid an OOM situation. The current in-tree implementation only limits the number of tracing sessions accumulated in the first ("Active") stage. Since currently every closing session is being immediately flushed (as long as "settraceprobability" is not implemented) the second stage never accumulates tracing sessions. The third stage is currently not controlled at all and if, for instance, we succeed in pushing enough tracing sessions towards a slow storage backend, they may accumulate there consuming an uncontrolled amount of memory and may eventually consume all of it. This patch fixes this unpleasant situation by applying the following strategy: - Limit the total amount of accumulated tracing sessions in all stages above together by a static value - 2 times "flush threshold". "2 times" is needed to allow new tracing sessions to accumulate in stage 2 while sessions in stage 3 are still being processed. - Forcefully flush sessions in stage 2 to the storage when their count reaches a "flush threshold". This would ensure that there will not be more than (2 * "flush threshold") sessions in total (in any stage) on each shard. An advantage of this strategy is its simplicity - we only need a single threshold to control all stages. If we feel that we need finer granularity for each stage we may add separate limits for each of them in the future. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-06 13:50:41 +03:00
Pekka Enberg	b407031401	Update scylla-ami submodule * dist/ami/files/scylla-ami 72ae258...863cc45 (3): > Move --cpuset/--smp parameter settings from scylla_sysconfig_setup to scylla_ami_setup > convert scylla_install_ami to bash script > 'sh -x -e' is not valid since all scripts converted to bash script, so remove them	2016-06-06 13:37:21 +03:00
Vlad Zolotarov	35402b965f	service/client_state: don't try to dereference a tracing state if it's not initialized Call for a tracing::tracing::create_session() doesn't promise a session creation. Check that the session is actually created before trying to use it. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-06 13:00:31 +03:00
Vlad Zolotarov	139fa9d1bd	tracing: minor cleanups - Make small functions on a fast path "inline". - Add "const" qualifier where needed. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-06 13:00:31 +03:00
Avi Kivity	961e80ab74	Be more conservative when deciding when to shut down due to disk errors Currently we only shut down on EIO. Expand this to shut down on any system_error. This may cause us to shut down prematurely due to a transient error, but this is better than not shutting down due to a permanent error (such as ENOSPC or EPERM). We may whitelist certain errors in the future to improve the behavior. Fixes #1311. Message-Id: <1465136956-1352-1-git-send-email-avi@scylladb.com>	2016-06-06 10:56:34 +02:00
Raphael S. Carvalho	17b56eb459	compaction: leveled: improve log message for overlapping table Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <2dcbe3c8131f1d88a3536daa0b6cdd25c6e41d76.1464883077.git.raphaelsc@scylladb.com>	2016-06-05 18:20:01 +03:00
Raphael S. Carvalho	588ce915d6	compaction: disable parallel compaction for leveled strategy It was discussed that leveled strategy may not benefit from the parallel compaction feature because almost all compaction jobs will have similar size. It was also found that leveled strategy wasn't working correctly with it because two overlapping sstables (targeting the same level) could be created in parallel by two ongoing compactions. Fixes #1293. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <60fe165d611c0283ca203c6d3aa2662ab091e363.1464883077.git.raphaelsc@scylladb.com>	2016-06-05 18:20:00 +03:00
Amnon Heiman	5f84e55bf6	histogram: total needs to be incremented on plus operator The total counter (the one that counts the actual number of sample points) should be incremented when adding histograms. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1464172277-4251-1-git-send-email-amnon@scylladb.com>	2016-06-05 12:09:36 +03:00
Tomasz Grabiec	2ab18dcd2d	row_cache: Implement clear() using invalidate() Reduces code duplication.	2016-06-03 13:34:40 +02:00
Tomasz Grabiec	57413618e8	Merge branch 'range-tombstone-v9' from https://github.com/duarten/scylla.git From Duarte: This patchset adds the range_tombstone_list data structure, used to hold a set of disjoint range tombstones, and changes the internal representation of row tombstones to use that data structure. Fixes #1155 [tgrabiec: Added compound_wrapper::make_empty(const schema&) overload to fix compilation failure in tracing code]	2016-06-02 22:17:17 +02:00
Raphael S. Carvalho	3f4500cb71	db: compaction strategy changes via alter table must have immediate effect At the moment, compaction strategy changes via ALTER TABLE have no effect until node restart. Tomek says: "Statements of the following form should have immediate effect: ALTER TABLE t WITH compaction = { 'class' : 'LeveledCompactionStrategy' };" Fixes #877. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <3b72c494f887643b82a272ef0a9995edb970382c.1464726828.git.raphaelsc@scylladb.com>	2016-06-02 16:59:50 +02:00
Pekka Enberg	d03f65d94e	database: Don't use std::cbegin() and std::cend() They're not supported by GCC 4.9. Fixes #1305 Message-Id: <1464877984-27856-1-git-send-email-penberg@scylladb.com>	2016-06-02 16:57:24 +02:00
Duarte Nunes	c970d682d1	storage_service: Announce range tombstones feature This patch enables the RANGE_TOMBSTONES supported feature, meaning that the node is capable of accepting row entry tombstones as range tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:59 +02:00
Duarte Nunes	70083efee2	sstables: Read and write range tombstone bounds This patch uses the composite_marker to add inclusiveness information to the prefixes of a range tombstone. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:59 +02:00
Duarte Nunes	7628e403a3	sstables: Drop code for tombstone merging Since Scylla now supports proper range tombstones, the code for reading ranges from sstables and converting them to overlapping tombstones is no longer necessary, and is, in fact, wasteful as the internal representation converts overlapping tombstones back to ranges. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:59 +02:00
Duarte Nunes	79bff2742f	random_mutation_generator: Generate range tombstones Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:59 +02:00
Duarte Nunes	95594b8171	mutations: Encapsulate row tombstones difference This patch moves the difference between two mutation_partition's row_tombstones inside the range_tombstone_list. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:59 +02:00
Duarte Nunes	91aac30f12	mutations: Row tombstones are now a set of ranges This patch changes the type of the mutation partition's row_tombstones to be a range_tombstone_list, so that they are now represented as a set of disjoint ranges. All of its usages are updated accordingly. Fixes #1155 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:59 +02:00
Duarte Nunes	e46537b7d3	storage_service: Include range tombstones feature This patch adds the range tombstones feature, which is not enabled yet, to the storage_service, so that consumers can query for it. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	17a544c4a6	gossip: Add feature default ctor and operator= This allows a feature to be declared and initialized later. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	2c82dcd309	gossip: Decouple feature lifetime from the gossiper This patch changes the gms::feature destructor so it checks whether the gossiper has been stopped before trying to unregister the feature. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	351aaf9738	range_tombstone: Introduce range_tombstone_to_prefix_tombstone_converter This patch extracts the code from sstables/partition.cc which is used to transform a set of range tombstones into a set of overlapping scylladb tombstones. The range_tombstone_merger will be used to send mutations to nodes not yet updated to support the internal range tombstone representation. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	f7809bcaef	range_tombstone_list: Add unit test Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	284bb6b66f	range_tombstone_list: Make it ReversiblyMergeable This patch implements the ReversiblyMergeable cancellative monoid for the range_tombstone_list. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	86030885c8	mutations: Introduce range tombstone list This class is responsible for representing a set of range tombstones as non-overlapping disjoint sets of range tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	6a111fdd01	mutations: Introduce the range_tombstone class This patch introduces the range_tombstone class, composed of a [start, end] pair of clustering_key_prefixes, the type of inclusiveness of each bound, and a tombstone. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:58 +02:00
Duarte Nunes	dc8319ed91	keys: Remove schema argument from make_empty An empty key is independent of the schema. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:36 +02:00
Duarte Nunes	7f8c35dd8c	idl: Add range tombstone IDL This patch adds the range tombstone IDL, preserving backwards compatibility. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:36 +02:00
Duarte Nunes	9bd7d08fc7	idl-compiler: Default expr can refer to previous fields This patch changes the idl-compiler so that the default value of a field can be set to the value of a previous field in the class: class P { uint32_t x; uint32_t y = x; }; Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:36 +02:00
Duarte Nunes	e2812c1b7a	idl: Rename range_tombstone::key to start ... and make it a clustering_key_prefix, in preparation of supporting not-whole-row range tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-06-02 16:21:36 +02:00
Pekka Enberg	f64c25a495	cql3/statements/select_statement: Unify coding style The coding style in select_statement.cc is very inconsistent which makes the code hard to read. Clean that up. Message-Id: <1464871790-21031-1-git-send-email-penberg@scylladb.com>	2016-06-02 16:17:21 +02:00
Avi Kivity	6da0449fc7	tests: adjust config_test for db::string_map changes	2016-06-02 14:48:02 +03:00
Gleb Natapov	9132604a90	config: make string_map to be a unique type instead of an alias to unordered_map Config provides operators << >> for string_map which makes it impossible to have generic stream operators for unordered_map. Fix it by making string_map a separate type and not just an alias. Message-Id: <20160602102642.GJ9939@scylladb.com>	2016-06-02 13:28:40 +03:00
Asias He	96463cc17c	streaming: Fix indentation in do_send_mutations Message-Id: <bc8cfa7c7b29f08e70c0af6d2fb835124d0831ac.1464857352.git.asias@scylladb.com>	2016-06-02 11:56:03 +03:00
Asias He	206955e47c	streaming: Reduce memory usage when sending mutations Limit disk bandwidth to 5MB/s to emulate a slow disk: echo "8:0 5000000" > /cgroup/blkio/limit/blkio.throttle.write_bps_device echo "8:0 5000000" > /cgroup/blkio/limit/blkio.throttle.read_bps_device Start scylla node 1 with low memory: scylla -c 1 -m 128M --auto-bootstrap false Run c-s: taskset -c 7 cassandra-stress write duration=5m cl=ONE -schema 'replication(factor=1)' -pop seq=1..100000 -rate threads=20 limit=2000/s -node 127.0.0.1 Start scylla node 2 with low memory: scylla -c 1 -m 128M --auto-bootstrap true Without this patch, I saw std::bad_alloc during streaming: ERROR 2016-06-01 14:31:00,196 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: std::bad_alloc (std::bad_alloc) ... ERROR 2016-06-01 14:31:10,172 [shard 0] database - failed to move memtable to cache: std::bad_alloc (std::bad_alloc) ... To fix: 1. Apply the streaming mutation limiter before we read the mutation into memory, to avoid wasting memory holding a mutation which we cannot send. 2. Reduce the parallelism of sending streaming mutations. Before, we sent each range in parallel; after, we send the ranges one by one. before: nr_vnode * nr_shard * (send_info + cf.make_reader memory usage) after: nr_shard * (send_info + cf.make_reader memory usage) We save memory usage by at least a factor of nr_vnode, 256 by default. In my setup, fix 1) alone is not enough; with both fix 1) and 2), I saw no std::bad_alloc. Also, I did not see streaming bandwidth drop due to 2). In addition, I tested grow_cluster_test.py:GrowClusterTest.test_grow_3_to_4, as described: https://github.com/scylladb/scylla/issues/1270#issuecomment-222585375 With this patch, I saw no std::bad_alloc any more. Fixes: #1270 Message-Id: <7703cf7a9db40e53a87f0f7b5acbb03fff2daf43.1464785542.git.asias@scylladb.com>	2016-06-02 11:01:58 +03:00
Gleb Natapov	1476becd28	config: put operators << and >> into db namespace Makes ADL find the right version of the overload. Message-Id: <20160601130952.GJ2381@scylladb.com>	2016-06-02 10:45:01 +03:00
Pekka Enberg	b6b2c84316	Merge "CQL tracing" from Vlad "This series introduces a tracing infrastructure that may be used for tracing the execution of CQL commands and measuring the latencies of the separate stages of CQL handling as defined by the CQL binary protocol specification. To begin tracing one should create a "tracing session", which may then be used to issue tracing events. If execution of a specific CQL command involves other Nodes (not only a Coordinator), then a "tracing session ID" is passed to that Node (in the context of the corresponding RPC call). This "session ID" may then be used to create a "secondary tracing session" to issue tracing events in the context of the original session. The series contains an implementation of tracing that uses a keyspace in the current cluster for storing tracing information. This series contains a demo per-request tracing instrumentation of a QUERY CQL command, and even this instrumentation is partial: it only fully instruments a QUERY->SELECT->read_data call chain. This is only the very beginning of the proper instrumentation which is to come.
Right now the latencies for a single SELECT of a single row with RF 1 from a 2 Nodes cluster on my laptop started using ccm (for C* all default parameters, for scylla - memory 256MB, --smp 2) are as follows:
Setup | scylla (2 Nodes x 2 shards each) | C* 2.1.8
Coordinator and replica are same Node (TRACING OFF): c-s with a single thread mean latency value | 0.3ms (was 0.2ms before the last rebase with a master) | 0.3ms
Coordinator and replica are same Node (TRACING ON), running a SELECT command from a cqlsh a few times | ~250us | ~1200us
Coordinator and replica are not on the same Node (TRACING ON) | ~700us | >2500us
To begin tracing one may use the cqlsh "TRACING ON/OFF" commands:
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> select "C0", "C1" from keyspace1.standard1 where key=0x12345679;
C0 | C1
--------------------+------
0x000000000001e240 | null
(1 rows)
Tracing session: 146f0180-21e7-11e6-b244-000000000000
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------+----------------------------+-----------+----------------
select "C0", "C1" from keyspace1.standard1 where key=0x12345679; | 2016-05-24 22:38:24.536000 | 127.0.0.1 | 0
message received from /127.0.0.1 [0] | 2016-05-24 22:38:24.537000 | 127.0.0.2 | --
Done reading options [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 3
read_data handling is done [0] | 2016-05-24 22:38:24.537000 | 127.0.0.2 | 37
Parsing a statement [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 3
Processing a statement [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 56
Done processing - preparing a result [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 550
Request complete | 2016-05-24 22:38:24.536560 | 127.0.0.1 | 560
cqlsh>"	2016-06-02 08:35:33 +03:00
Avi Kivity	c7953897d1	build: remove obsolete log.cc dependency	2016-06-01 22:35:07 +03:00
Vlad Zolotarov	69bd8efc40	storage_proxy: instrument a read_data handler to accept a tracing info This is a demo instrumentation: - Check if a tracing info is present in the read_command. - If yes - create a tracing session with the given tracing session ID. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:17:25 +03:00
Vlad Zolotarov	4c17a422e0	cql3: instrument a SELECT query to send tracing info Instrument a coordinator of a SELECT query to send tracing session info to the corresponding replica Nodes. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:17:25 +03:00
Vlad Zolotarov	6e26909b02	query::read_command: add an optional trace_info field Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:17:19 +03:00
Vlad Zolotarov	a53d329b25	tracing: add a serializable trace_info object tracing::trace_info is used to pass the tracing information between nodes. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:16:53 +03:00
Vlad Zolotarov	099ff0d2d5	transport: instrument a QUERY with tracing - Store a trace state inside a client_state. - Start tracing in a cql_server::connection::process_query(). Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:14:29 +03:00
Vlad Zolotarov	f994e0a8d0	transport/server: add support for sending a tracing session ID in a CQL response - Add a tracing ID (UUID) optional field to cql_server::response. - If _tracing_id is set make_frame() would insert a tracing ID in the response message. According to CQL spec it should be the first thing in the response "body" and the TRACING bit (0x02) should be set in the "flags" field. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Vlad Zolotarov	9e61a3498d	cql_server::response: rework make_frame() Use a template function to avoid code duplication. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Vlad Zolotarov	8bf34fca02	service::client_state: store a client address When client_state is created with an external_tag - store a client address in the client state. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Vlad Zolotarov	c58c56bccc	gms::inet_address: add a constructor from socket_address Currently only IPv4 addresses are supported. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Vlad Zolotarov	63c724c41d	service::client_state: make private fields actually private Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Vlad Zolotarov	4b43b08ffc	main: start a tracing service Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:53 +03:00
Vlad Zolotarov	c965528a03	tracing: add a trace_state and tracing classes trace_state: Is a single tracing session. tracing: A sharded service that contains an i_trace_backend_helper instance and is a "factory" of trace_state objects. trace_state main interface functions are: - begin(): Start time counting (should be used via tracing::begin() wrapper). - trace(): Create a tracing event - it's coupled with a time passed since begin() (should be used via tracing::trace() wrapper). - ~trace_state(): Destructor will close the tracing session. "tracing" service main interface functions are: - start(): Initialize a backend. - stop(): Shut down a backend. - create_session(): Creates a new tracing session. (tracing::end_session(): Is called by a trace_state destructor). When trace_state needs to store a tracing event it uses a backend helper from a "tracing" service. A "tracing" service limits the number of open tracing sessions to a static number. If this number is reached, further sessions will be dropped. trace_state implements a similar strategy with regard to tracing events per single session. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:13:42 +03:00
Vlad Zolotarov	fa14ad3a99	service/client_state: don't allow modification of a system_trace KS Only users with enough permissions are allowed to modify system_trace KS. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:12:19 +03:00
Vlad Zolotarov	d3988a8113	tracing::trace_keyspace_helper: a keyspace based i_tracing_backend_helper implementation Uses a CQL keyspace system_traces to store tracing information. Uses two tables: CREATE TABLE system_traces.sessions ( session_id uuid, command text, client inet, coordinator inet, duration int, parameters map<text, text>, request text, started_at timestamp, PRIMARY KEY ((session_id))) and CREATE TABLE system_traces.events ( session_id uuid, event_id timeuuid, activity text, source inet, source_elapsed int, thread text, PRIMARY KEY ((session_id), event_id)) system_traces.sessions table contains records of tracing sessions. system_traces.sessions columns description: - session_id: an ID of the session. - command: type of a command this session was created for (currently supported "NONE", "QUERY" and "REPAIR"). - client: IP of the client that issued the command. - coordinator: IP of a coordinator that received the command. - duration: total duration of the tracing session (in us). - parameters: optional parameters for this session, passed to i_trace_state::begin() call. - request: a CQL command this tracing session is created for. - started_at: the time the session has been started at. system_traces.events contains records of separate tracing events. system_traces.events columns description: - session_id: an ID of the session. - event_id: an ID of the event. - activity: the trace point description - a message given to i_trace_state::trace(). - source: IP of the Node where trace event was issued. - source_elapsed: time passed since creation of a tracing session (in us) on the Node where this trace event was issued. - thread: name of the thread in whose context this trace event was issued (currently it's "core N", where 'N' is an index of a shard the trace event was issued on). This class will cache lambdas creating the corresponding mutations for each tracing record requested to be stored until the flush() method is called.
flush() will merge all pending mutations to "sessions" and "events" tables and then apply a mutation to "events" table and when it completes - to "sessions" table. This way it'll ensure that when some tracing session is visible, all its events are visible too. trace_keyspace_helper exposes a few metrics via collectd: - tracing_error - a total number of errors (not including OOM) - bad_column_family_errors - number of times a tracing record wasn't stored because system_trace tables' schema didn't match the expected value. This may happen if a DB administrator is doing funny things like altering the schemas of the above tables. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:12:19 +03:00
Vlad Zolotarov	a2994ffd7f	tracing: add i_tracing_backend_helper interface This class represents an interface for a specific backend that is going to store tracing information. The specific implementation may, and is expected to, implement caching of pending tracing records. Interface functions are: - start(): Initialize a backend (e.g. create keyspace and tables). - stop(): Flush all pending work and shut down the backend. - store_session_record()/store_event_record(): Cache/store the corresponding tracing records. - flush(): Flush pending tracing records. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-06-01 20:12:13 +03:00
Gleb Natapov	91c773fdde	storage_proxy: fix writes_attempts counter writes_attempts is supposed to count how many times data was sent out, but currently it also counts those replicas in other DCs that get the data through a coordinator. Fix it by counting only when data is actually sent. Message-Id: <20160601153124.GB9939@scylladb.com>	2016-06-01 18:46:23 +03:00
Avi Kivity	8dcbddc7ed	Merge "Serialize memtable flushes" from Glauber "One of the things we need to do as part of the throttle rework I am doing is to serialize memtable flushes to some extent - that will guarantee that in case we're throttling, the flushes finish earlier and release memory earlier, if compared to the case in which we just let all tables flush freely and simultaneously."	2016-06-01 18:31:18 +03:00
Avi Kivity	0c7b2e2d5c	Merge	2016-06-01 18:29:23 +03:00
Avi Kivity	d2e4548b35	Merge seastar upstream * seastar 0bcdd28...864d6dc (4): > Logging framework > Add libubsan and libasan to fedora deps docs > tests: add rpc cancellable tests > rpc: add cancellable interface Dropped logging implementation in favor of seastar's due to a link conflict with operator<<.	2016-06-01 18:28:42 +03:00
Tomasz Grabiec	56736389c1	Merge branch 'sstable-errors/v2' from https://github.com/penberg/scylla.git This series adds a constructor to malformed_sstable_exception that includes a filename and converts some call-sites to use it. There are still plenty of low-level sites that don't even know the sstable filename they are operating on. We need to either change the code to carry the filename to lower layers or find a higher-level call-site where we can catch malformed_sstable_exception and rethrow it with the sstable filename. But that's for another series by someone who knows the sstable code well. Refs #669.	2016-06-01 16:59:56 +02:00
Gleb Natapov	26b50eb8f4	storage_proxy: drop debug output Message-Id: <20160601132641.GK2381@scylladb.com>	2016-06-01 17:13:12 +03:00
Pekka Enberg	94c35cc135	sstables/sstables: Add sstable filename to thrown malformed_sstable_exceptions	2016-06-01 17:11:05 +03:00
Pekka Enberg	3ca7fc2a8b	database: Add sstable filename to thrown malformed_sstable_exceptions	2016-06-01 14:56:10 +03:00
Pekka Enberg	fa5354dda4	sstables: Add optional filename to malformed_sstable_exception Add a constructor to malformed_sstable_exception that accepts an error message and an sstable name.	2016-06-01 14:48:08 +03:00
Pekka Enberg	de0634c289	Merge "Extract modification_statement's (and related) parsed statement into raw" from Avi "Move parsed statements into raw namespace. Mindless but therapeutic."	2016-06-01 14:19:53 +03:00
Avi Kivity	92d815a6cf	Make github issue template less shouty	2016-06-01 10:45:04 +03:00
Pekka Enberg	0255318bf3	Revert "Revert "main: change order between storage service and drain execution during exit"" This reverts commit `b3ed55be1d`. The issue is in the failing dtest, not this commit. Gleb writes: "The bug is in the test, not the patch. Test waits for repair session to end one way or the other when node is killed, but for nodetool to know if repair is completed it needs to poll for it. If node dies before nodetool managed to see repair completion it will be stuck forever since jmx is alive, but does not provide answers any more. The patch changes timing, repair is completed much closer to exit now, so problem appears, but it may happen even without the patch. The fix is for dtest to kill jmx as part of killing a node operation." Now that Lucas fixed the problem in scylla-ccm, revert the revert.	2016-06-01 08:48:50 +03:00
Glauber Costa	0f64eb7e7d	serialize memtable flush for a memtable_list We can only free memory for a region_group when the entire memtable is released. This means that while the disk can handle requests from multiple memtables just fine, we won't free any memory until all of them finish. If we are under a pressure situation we will take a lot more time to leave it. Ideally, with write-behind, we would allow just one memtable to be flushed at a time. But since we don't have it enabled, it's better to serialize the flushes so that only some memtables (4) are flushed at a time. Having the memtable writer bandwidth all to itself, the memtable will finish sooner, release memory sooner, and recover the system's health sooner. We would like to do that without having streaming and memtables starve each other. Ideally, that should mean half the bandwidth for each - but that sacrifices memtable writes in the common case there is no streaming. Again, write behind will help here, and since this is something we intend to do, there is no need to complicate the code too much for an interim solution. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-05-31 17:18:35 -04:00
Glauber Costa	46c79be401	database: allow callers to specify memtable list's flush behavior This patch introduces an explicit behavior enum class - one of delayed or immediate - that allows callers to tell the memtable list whether they want a delayed flush (default), or to force an immediate flush. So far this only affects the streaming code (memtables just ignore it), but the concept is one that can be easily generalized. With that in place, we can revert the stop function back to using the standard flush. I have argued before that adding infrastructure like that would not be worth it for the sake of stop alone, but some other code could now use it. Specifically, the active reclaimer for the throttler would like to force immediate flushes, as delayed flushes really won't make a lot of difference in reducing memory usage. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-05-31 17:17:48 -04:00
Avi Kivity	c8b5104aa5	cql3: extract raw batch_statement into raw sub-namespace prepare() was moved to .cc to avoid circular dependencies.	2016-05-31 21:41:26 +03:00
Avi Kivity	1d144699f6	cql3: extract raw delete_statement into raw sub-namespace	2016-05-31 21:24:56 +03:00
Avi Kivity	e596799962	cql3: extract raw update_statement into raw sub-namespace update_statement also has an insert_statement counterpart, convert it too.	2016-05-31 21:16:53 +03:00
Avi Kivity	10213c4211	cql3: extract raw modification_statement into raw sub-namespace	2016-05-31 20:53:37 +03:00
Asias He	f27e5d2a68	messaging_service: Delay listening ms during boot up When a node starts up, peer node can send gossip syn message to it before the gossip message handlers are registered in messaging_service. We can see: scylla[123]: [shard 0] rpc - client a.b.c.d: unknown verb exception 6 ignored To fix, we delay the listening of messaging_service to the point when gossip message handlers are registered. Message-Id: <9b20d85e199ef0e44cdcde2920123a301a88f3d7.1464254400.git.asias@scylladb.com>	2016-05-31 12:28:11 +03:00
Avi Kivity	f3fc3afe00	cql3: optimize make_empty_metadata() All empty metadata objects are equal, so make just one and keep returning it. Message-Id: <1464334638-7971-4-git-send-email-avi@scylladb.com>	2016-05-31 09:12:20 +03:00
Avi Kivity	0135b4d5cd	cql3: constify metadata users Metadata usually doesn't change after it is created; make that visible in the code, allowing further optimizations to be applied later. Message-Id: <1464334638-7971-3-git-send-email-avi@scylladb.com>	2016-05-31 09:12:11 +03:00
Avi Kivity	6728454591	cql3: rationalize extract_result_metadata() Rather than dynamic_cast<>ing the statement to see whether it is a select statement, add a virtual function to cql_statement to get the result metadata. This is faster and easier to follow. Message-Id: <1464334638-7971-2-git-send-email-avi@scylladb.com>	2016-05-31 09:12:02 +03:00
Avi Kivity	25b3d74f45	cql3: Split select_statement::raw_statement into raw namespace cql3::select_statement::raw_statement -> cql3::raw::select_statement Message-Id: <1464609556-3756-4-git-send-email-avi@scylladb.com>	2016-05-31 09:09:30 +03:00
Avi Kivity	c8f98c5981	cql3: move cf_statement into raw hierarchy cql3::statements::cf_statement -> cql3::statements::raw::cf_statement Message-Id: <1464609556-3756-3-git-send-email-avi@scylladb.com>	2016-05-31 09:09:21 +03:00
Avi Kivity	caf8d4f0e6	cql3: separate parsed_statement and parsed_statement::prepared cql3::statements::parsed_statement -> cql3::statements::raw::parsed_statement cql3::statements::parsed_statement::prepared -> cql3::statements::prepared_statement Message-Id: <1464609556-3756-2-git-send-email-avi@scylladb.com>	2016-05-31 09:09:10 +03:00
Duarte Nunes	a15ed3c60f	mutation_test: Specify tmp data dir Otherwise we attempt to create sstable files under /. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1464618602-1124-1-git-send-email-duarte@scylladb.com>	2016-05-30 20:34:47 +02:00
Pekka Enberg	b3ed55be1d	Revert "main: change order between storage service and drain execution during exit" This reverts commit `0ebd8b18b7`. The change breaks repair_additional_test.py:RepairAdditionalTest.repair_kill_1_test	2016-05-30 12:48:09 +03:00
Avi Kivity	e515933c70	dist: tune scheduler for lower latency Scylla-jmx and collectd can preempt scylla and induce long latencies. Tune the scheduler to provide lower latencies. Since when the support processes are not running we normally do not context switch (one thread per core, remember?), there should be no effect on throughput. The tunings are provided in a separate package, which can be uninstalled if the server is shared with other applications which are negatively affected by the tuning. Fixes #1218. Message-Id: <1464529625-12825-1-git-send-email-avi@scylladb.com>	2016-05-30 08:42:19 +03:00
Avi Kivity	e8e00338d1	config: document defragment_memory_on_idle Message-Id: <1464261650-14136-2-git-send-email-avi@scylladb.com>	2016-05-30 08:39:26 +03:00
Avi Kivity	b50cb3eca8	config: rename compact_on_idle compact_on_idle will lead users to thinking we're talking about sstable compaction, not log-structured-allocator compaction. Rename the variable to reduce the probability of confusion. Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>	2016-05-30 08:39:13 +03:00
Yoav Kleinberger	e580ac5dae	docker: fix Ubuntu Dockerfile one needs to update the repository info before one can install packages. Fixes issue #1296. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <a906e76d584baff5988cb31a4003de27455e0741.1464529740.git.yoav@scylladb.com>	2016-05-29 17:00:25 +03:00
Avi Kivity	3f6ecb9f28	Merge "cancel cross DC read repair if non matching data was recently modified" from Gleb	2016-05-29 15:58:55 +03:00
Gleb Natapov	2efbccc901	storage_proxy: do only local read repair if non matching data was recently modified When read/write to a partition happens in parallel, the reader may detect a digest mismatch that may potentially cause a cross DC read repair attempt, but the repair is not really needed, so the added latency is not justified. This patch tries to prevent such parallel access from causing a heavy cross DC repair operation by checking the timestamp of the most recent modification. If the modification happened less than "write timeout" seconds ago, the patch assumes that the read operation raced with a write and cancels the cross DC repair, but only if CL is LOCAL_*.	2016-05-29 15:26:51 +03:00
Amnon Heiman	d4123ba613	API: column_family count sstable space used correctly The space calculation counters in column family had two problems: 1. The total bytes is an ever-growing counter, which is meaningless for the API. 2. Simply summing the size on all shards ignores the fact that the same sstable file can be referenced by multiple shards; this is especially noticeable during migration time. To solve this, the implementation was modified so that instead of collecting the sizes, the API collects a map of file name to size and then does the summing. This removes the duplications and fixes the total bytes calculation. Calling cfstats before the change with load after a compaction happened: $ nodetool cfstats keyspace1 Keyspace: keyspace1 Verify write latency 1068253.0 76435 Read Count: 75915 Read Latency: 0.5953986037015082 ms. Write Count: 76435 Write Latency: 0.013975966507490025 ms. Pending Flushes: 0 Table: standard1 SSTable count: 5 Space used (live): 44261215 Space used (total): 219724478 After the fix: $ nodetool cfstats keyspace1 Keyspace: keyspace1 Verify write latency 1863206.0 124219 Read Count: 125401 Read Latency: 0.9381053978835895 ms. Write Count: 124219 Write Latency: 0.01499936402643718 ms. Pending Flushes: 0 Table: standard1 SSTable count: 6 Space used (live): 50402904 Space used (total): 50402904 Space used by snapshots (total): 0 Fixes: #1042 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1464518757-14666-2-git-send-email-amnon@scylladb.com>	2016-05-29 14:11:03 +03:00
Gleb Natapov	32c9a06faf	messaging_service: abort retrying send during exit Fixes #862 Message-Id: <1463579574-15789-3-git-send-email-gleb@scylladb.com>	2016-05-29 11:39:36 +03:00
Gleb Natapov	0ebd8b18b7	main: change order between storage service and drain execution during exit Even the comment says drain_on_shutdown should be called first, but for that it has to be registered last. Fixes #862 Message-Id: <1463579574-15789-2-git-send-email-gleb@scylladb.com>	2016-05-29 11:39:24 +03:00
Glauber Costa	30d54cef38	database: add a comment explaining the choice of function in CF stop We have recently committed a fix to a broken streaming bug that involved reverting column_family::stop() back to calling the custom seal functions explicitly for both memtables and streaming memtables. Here we add a comment to explain why that had to be done. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <fe94b5883e9c29adc7fc9ee9f498894c057e7b64.1464293167.git.glauber@scylladb.com>	2016-05-29 11:28:15 +03:00
Avi Kivity	8e124b31aa	Merge "gossip: Refactor waiting for supported features" from Duarte "This patch changes the way we wait for supported features. We no longer sleep periodically, waking up to check if the wanted features are now available. Instead, we register waiters in a condition variable that is signaled whenever new endpoint information is received. We also add a new poll interface based on the feature class, which encapsulates the availability of a cluster feature."	2016-05-27 20:24:25 +03:00
Duarte Nunes	f613dabf53	gossip: Introduce the gms::feature class This class encapsulates the waiting for a cluster feature. A feature object is registered with the gossiper, which is responsible for later marking it as enabled. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-05-27 17:20:51 +00:00
Duarte Nunes	4684b8ecbb	gossip: Refactor waiting for features This patch replaces the sleep-based mechanism of detecting new features with waiters registered on a condition variable that is signaled whenever new endpoint information is received. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-05-27 17:20:51 +00:00
Duarte Nunes	422f244172	gossip: Don't timeout when waiting for features This patch removes the timeout when waiting for features, since future patches will make this argument unnecessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-05-27 17:20:51 +00:00
Avi Kivity	fab4cc8d6d	Merge seastar upstream * seastar 8bfbb1a...0bcdd28 (1): > Merge "introduce sleep_abortable() that throws exception on application exit" from Gleb	2016-05-27 20:14:49 +03:00
Duarte Nunes	b3011c9039	gossip: Rename set_heart_beat_state ...to set_heart_beat_state_and_update_timestamp in order to make it explicit to callers that the update_timestamp is also changed. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1464309023-3254-3-git-send-email-duarte@scylladb.com>	2016-05-27 09:11:39 +03:00
Duarte Nunes	8c0e2e05b7	gossip: Fix modification to shadow endpoint state This patch fixes an inadvertent change to the shadow endpoint state map in gossiper::run, done by calling get_heart_beat_state() which also updates the endpoint state's timestamp. This did not happen for the normal map, but did happen for the shadow map. As a result, every time gossiper::run() was scheduled, endpoint_map_changed would always be true and all the shards would make superfluous copies of the endpoint state maps. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1464309023-3254-2-git-send-email-duarte@scylladb.com>	2016-05-27 09:10:38 +03:00
Pekka Enberg	b7e79b72d5	Merge "Introduce SET_NIC for non-AMI environment" from Takuya "This patchset provides a way to enable SET_NIC(posix_net_conf.sh) on non-AMI environment. Also support -mq option of the script. This also contains number of bug fixes of scripts. Fixes #1192"	2016-05-26 13:37:06 +03:00
Yoav Kleinberger	26c0d86401	tools/scyllatop: improved user interface: scrollable views NOTE: scyllatop now requires the urwid library. Previously, if there were more metrics than lines in the terminal window, the user could not see some of the metrics. Now the user can scroll. As an added bonus, the program will not crash when the window size changes. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1464098832-5755-1-git-send-email-yoav@scylladb.com>	2016-05-26 13:36:28 +03:00
Piotr Jastrzebski	136b8148d2	Use idle CPU to compact LSA memory Register an idle CPU handler that compacts a single segment every time there's nothing better to execute on CPU. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c26aa608a1e0752fb9e6db1833ef3ba1de95f161.1464169748.git.piotr@scylladb.com>	2016-05-26 12:43:53 +03:00
Avi Kivity	d7f36a093f	Merge seastar upstream * seastar e5faea8...8bfbb1a (1): > reactor: advertise the logging_failures metric as a DERIVE counter Fixes #1292.	2016-05-26 11:46:08 +03:00
Tomasz Grabiec	f0c2b1d161	config: Fix typos Message-Id: <1464201938-4778-1-git-send-email-tgrabiec@scylladb.com>	2016-05-26 08:19:57 +03:00
Asias He	f1b3cb4a08	storage_service: Catch and fail an invalid configuration with --replace-address Vlad reported a strange user configuration: SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level info --collectd-address=127.0.0.1:25826 --collectd=1 --collectd-poll-period 60000 --network-stack posix --num-io-queues 32 --max-io-requests 128 --replace-address 10.0.4.131" seed_provider: - class_name: org.apache.cassandra.locator.SimpleSeedProvider parameters: - seeds: "10.0.4.131" Meanwhile, 10.0.4.131 is the IP address of the node itself. When the node was started, the following messages were reported. Apr 13 06:31:12 n0 scylla[19681]: [shard 0] gossip - Connect seeds again ... (20 seconds passed) Apr 13 06:31:13 n0 scylla[19681]: [shard 0] gossip - Connect seeds again ... (21 seconds passed) Apr 13 06:31:14 n0 scylla[19681]: [shard 0] gossip - Connect seeds again ... (22 seconds passed) Apr 13 06:31:15 n0 scylla[19681]: [shard 0] gossip - Connect seeds again ... (23 seconds passed) The configuration is invalid, because for --replace-address to work, at least one working seed node should be alive. Catch the configuration error and fail it with an appropriate error message. Fixes #1183 Message-Id: <a94a082d896313e7a668915ae21fe2c03719da3a.1464164058.git.asias@scylladb.com>	2016-05-25 14:42:19 +03:00
Asias He	fed1e65e1e	gossip: Do not insert the same node into _live_endpoints_just_added _live_endpoints_just_added tracks the peer node which just becomes live. When a down node gets back, the peer nodes can receive multiple messages which would mark the node up, e.g., the message piled up in the sender's tcp stack, after a node was blocked with gdb and released. Each such message will trigger a echo message and when the reply of the echo message is received (real_mark_alive), the same node will be added to _live_endpoints_just_added.push_back more than once. Thus, we see the same node be favored more than once: INFO 2016-04-12 12:09:57,399 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 INFO 2016-04-12 12:09:58,412 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 INFO 2016-04-12 12:09:59,429 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 INFO 2016-04-12 12:10:00,429 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 INFO 2016-04-12 12:10:01,430 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 INFO 2016-04-12 12:10:02,442 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 INFO 2016-04-12 12:10:03,454 [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2 To fix, do not insert the node if it is already in _live_endpoints_just_added. Fixes #1178 Message-Id: <6bcfad4430fbc63b4a8c40ec86a2744bdfafb40f.1464161975.git.asias@scylladb.com>	2016-05-25 14:19:40 +03:00
Glauber Costa	46f60f52d9	database: do not use implicitly stated seal function when closing the CF In commit `4981362f57`, I have introduced a regression that was thankfully caught by our dtest infrastructure. That patch is a preparation patch for the active reclaim patchset that is to come, and it consolidated all the flushes using the memtable_list's seal_fn function instead of calling the seal function explicitly. The problem here is that the streaming memtables have the delayed mechanism, about which the memtable_list is unaware. Calling memtable_list's seal_active_memtable() for the streaming memtables calls the delayed version, that does not guarantee flush. If we're lucky, we will indeed flush after the timer expires, but if we're not we'll just stop the CF with data not flushed. There are two options to fix this: the first is to teach the memtable_list about the delayed/forced mechanism, and the second is to just call the correct function explicitly during shutdown, and then when the time comes to add continuations to the result of the seal, add them here as well. Although the second option involves a bit more work and duplication, I think it is better in the sense that the delayed / forced mechanism really is something that belong to the streaming only. Being this the only user, I don't think it justifies complicating the memtable_list with this concept. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <b26017c825ccf585f39f58c4ab3787d78e551f5f.1464126884.git.glauber@scylladb.com>	2016-05-25 08:21:24 +03:00
Avi Kivity	2d4d6c9c92	Merge seastar upstream * seastar aed893e...e5faea8 (5): > Catch exceptions thrown by idle cpu handler > core::gate: add a get_count() method > reactor: Introduce idle CPU handler > core: add missing header for g++-4.9 > Add lksctp-tools-devel do required packages	2016-05-24 20:42:41 +03:00
Pekka Enberg	ceb29f9d32	Merge "Introduce upload dir for sstable migration" from Raphael "This change is intended to make migration process safer and easier. All column families will now have a directory called upload. With this feature, users may choose to copy migrated sstables to upload directory of respective column families, and run 'nodetool refresh'. That's supposed to be the preferred option from now on."	2016-05-24 16:36:47 +03:00
Gleb Natapov	7f6b12c97a	query: add user provided timestamp to read_command If read query supplies timestamp move it to read_command to be used later otherwise get local timestamp.	2016-05-24 15:19:35 +03:00
Pekka Enberg	d7d8c76fe5	transport/server: Add CQL frame LZ4 compression support The default CQL frame compression algorithm in Cassandra is LZ4. Add support for decompressing incoming frames and compressing outgoing frames with LZ4 if the CQL driver asks for that. Fixes #416 Message-Id: <1464086807-11325-1-git-send-email-penberg@scylladb.com>	2016-05-24 15:03:33 +03:00
Takuya ASADA	53cebb4a5e	dist/ubuntu: don't rebuild dependency packages by default Same as CentOS, do not build dependencies by default, install binary packages from our repository. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1464023451-21436-1-git-send-email-syuu@scylladb.com>	2016-05-24 14:10:59 +03:00
Gleb Natapov	12cf60c302	messaging_service: add timestamp of last modification to READ_DIGEST verb return value	2016-05-24 13:27:34 +03:00
Gleb Natapov	1e6f64f4ab	query: add latest modification timestamp to result structure	2016-05-24 13:27:34 +03:00
Gleb Natapov	5fef0717cc	query: find latest modification timestamp while calculating result digest	2016-05-24 13:27:34 +03:00
Avi Kivity	9637c2232c	Merge "Move the JMX timer polling logic to Scylla" from Amnon	2016-05-24 13:07:52 +03:00
Raphael S. Carvalho	c2fa3b796d	db: fix read consistency after refresh If sstable loaded by refresh covers a row that is cached by the column family, read query may fail to return consistent data. What we should do is to clear cache for the column family being loaded with new sstables. Fixes #1212. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <a08c9885a5ceb0b2991e40337acf5b7679580a66.1464072720.git.raphaelsc@scylladb.com>	2016-05-24 12:11:41 +03:00
Takuya ASADA	5d5d525a14	dist/ubuntu: fix incorrect dependency package name PyYAML is CentOS/RHEL/Fedora package name, python-yaml is correct one. Fixes #1279 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1463987823-22837-1-git-send-email-syuu@scylladb.com>	2016-05-23 10:21:29 +03:00
Pekka Enberg	8a7197e390	dist/docker: Fetch RPM repository from Scylla web site Fix the hard-coded Scylla RPM repository by downloading it from Scylla web site. This makes it easier to switch between different versions. Message-Id: <1463981271-25231-1-git-send-email-penberg@scylladb.com>	2016-05-23 09:45:41 +03:00
Piotr Jastrzebski	2be4ec4e06	Add lksctp-tools-devel to required packages in fedora build instructions. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <15f3db34f12f01cb9da32fd14c16ba87e64ad5f4.1463947999.git.piotr@scylladb.com>	2016-05-23 08:26:02 +03:00
Avi Kivity	5e5317b228	dist: add build dependencies for sctp Required by new seastar	2016-05-22 19:10:25 +03:00
Avi Kivity	5bb1255da1	Merge seastar upstream * seastar 6a849ac...aed893e (3): > net: move 'transport' enum to seastar namespace > net: sctp protocol support for posix stack > future: Support get() when state is at a promise	2016-05-22 16:32:33 +03:00
Amnon Heiman	e26002d581	idl-compiler: default constructor of complex types This patch solves a problem where a complex type is defined as version dependent (with the version attribute) but doesn't have a default value. In those cases the default constructor is used, but in the case of complex types (template) param_type should be used to get the C++ type. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1463916723-15322-1-git-send-email-amnon@scylladb.com>	2016-05-22 15:32:29 +03:00
Raphael S. Carvalho	e5f0314afd	db: introduce upload directory for sstable migration This change is intended to make migration process safer and easier. All column families will now have a directory called upload. With this feature, users may choose to copy migrated sstables to upload directory of respective column families, and call 'nodetool refresh'. That's supposed to be the preferred option from now on. For each sstable in upload directory, refresh will do the following: 1) Mutate sstable level to 0. 2) Create hard links to its components in column family dir, using a new generation. We make it safe by creating a hard link to temporary TOC first. 3) Remove all of its components in upload directory. This new code runs after refresh checked for new sstables in the column family directory. Otherwise, we could have a generation conflict. Unlike the first step, this new step runs with sstable write enabled. It's easier here because we know exactly which sstables are new. After that, refresh will load new sstables found in column family and upload directories. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-20 17:26:21 -03:00
Raphael S. Carvalho	70b793e4d3	tests: add test for statistics rewrite Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-20 17:26:12 -03:00
Raphael S. Carvalho	74c8a87777	sstables: fix statistics rewrite It's not working because it tries to overwrite existing statistics file with exclusive flag. It's fixed by writing new statistics into temporary file and renaming it into place. If Scylla failed in middle of rewrite, a temporary file is left over. So boot code was adjusted to delete a temporary file created by this rewrite procedure. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-20 17:24:15 -03:00
Pekka Enberg	94e7e61cd0	api: Register snitch API earlier Currently, we register snitch API in set_server_gossip_settle() which waits until a node has joined the cluster. This makes 'nodetool status' not properly show the status of a joining node. Fix the issue by registering snitch API earlier. Fixes #1269. Message-Id: <1463576381-15484-1-git-send-email-penberg@scylladb.com>	2016-05-20 14:24:14 +03:00
Gleb Natapov	7a54b5ebbb	gossiper: cleanup mark_alive() even more Message-Id: <20160519100513.GE984@scylladb.com>	2016-05-19 12:47:19 +02:00
Takuya ASADA	03a762bb0b	dist/common/scripts: Ask to set SET_NIC=yes on scylla_setup interactive prompt We supported SET_NIC on non-AMI environment, so ask user to use it on scylla_setup interactive prompt.	2016-05-19 06:26:23 +09:00
Takuya ASADA	88fde0a91e	dist/ami: fix dependency unresolved error on AMI build script with local package, by adding scylla-conf package Since we added scylla-conf package, we cannot install scylla-server/-tools without the package, because of this --localrpm is failing. So copy scylla-conf package to AMI, and install it to fix the problem.	2016-05-19 06:26:23 +09:00
Takuya ASADA	898243929f	dist/common/scripts: specify queue settings for posix_net_conf.sh on scylla_prepare posix_net_conf.sh wants -sq/-mq options, so detect number of queues and specify the option in scylla_prepare.	2016-05-19 06:26:23 +09:00
Takuya ASADA	f84b7b094f	dist/common/scripts: drop special condition to enable SET_NIC on AMI, do this on AMI installation script Remove special case of SET_NIC in AMI, do this in scylla-ami-setup.service.	2016-05-19 06:25:41 +09:00
Takuya ASADA	49cdd0b786	dist: move '--cpuset' and '--smp' configuration to scylla_cpuset_setup / cpuset.conf These parameters are only required for AMI, not for non-AMI environments which want to enable SET_NIC, so split them into an individual script / conf file and call it from the AMI install script.	2016-05-19 06:25:28 +09:00
Takuya ASADA	46fa80a5a6	dist/common/scripts: replace IFNAME variable when --nic specified to scylla_sysconfig_setup scylla_sysconfig_setup has a bug where it does not replace the IFNAME variable; fix it.	2016-05-19 06:25:15 +09:00
Glauber Costa	4eff07d773	database: reorder initialization In a preparation move for the LSA throttler, we have reordered the initialization fields in database.hh so that the sizes of the regions are computed before the initialization of the region. However, that seemingly innocent move broke one of our tests. The reason behind that, is that if we don't destroy the column families before destroying the region, we may end up with a use after free in the memtable destructor - that itself expects to call into the region. This patch reorders the initialization so that the CF list still comes after the dirty regions (therefore being destroyed first), while maintaining the relative ordering between size / region that we needed in the first place. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0669984b5bccdb2c950f2444bdee4427abad56ba.1463508884.git.glauber@scylladb.com>	2016-05-18 11:02:40 +03:00
Asias He	eb9ac9ab91	gms: Optimize gossiper::is_alive In perf-flame, I saw in service::storage_proxy::create_write_response_handler (2.66% cpu) gossiper::is_alive takes 0.72% cpu locator::token_metadata::pending_endpoints_for takes 1.2% cpu After this patch: service::storage_proxy::create_write_response_handler (2.17% cpu) gossiper::is_alive does not show up at all locator::token_metadata::pending_endpoints_for takes 1.3% cpu There is no need to copy the endpoint_state from the endpoint_state_map to check if a node is alive. Optimize it since gossiper::is_alive is called in the fast path. Message-Id: <2144310aef8d170cab34a2c96cb67cabca761ca8.1463540290.git.asias@scylladb.com>	2016-05-18 10:12:38 +03:00
Avi Kivity	6ec0000df8	Merge "fix migration of tables with level > 0" from Raphael	2016-05-17 19:14:01 +03:00
Raphael S. Carvalho	cbc2e96a58	tests: check that overlapping sstable has its level changed to 0 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-17 11:11:05 -03:00
Raphael S. Carvalho	ee0f66eef6	db: fix migration of sstables with level greater than 0 Refresh will rewrite statistics of any migrated sstable with level > 0. However, this operation is currently not working because O_EXCL flag is used, meaning that create will fail. It turns out that we don't actually need to change on-disk level of a sstable by overwriting statistics file. We can only set in-memory level of a sstable to 0. If Scylla reboots before all migrated sstables are compacted, leveled strategy is smart enough to detect sstables that overlap, and set their in-memory level to 0. Fixes #1124. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-17 11:08:08 -03:00
Gleb Natapov	76e0eb426e	gossiper: simplify mark_alive() The code runs in a thread so there is no need to use heap to communicate between statements. Message-Id: <20160517120245.GK984@scylladb.com>	2016-05-17 15:37:21 +03:00
Avi Kivity	4413176051	Merge "reduce performance degradation when adding node" from Asias "With this series, the operations per second drop during adding node period gets much better. Before: 45K to 10K After: 45k to 38K Refs: #1223 "	2016-05-17 14:31:31 +03:00
Asias He	089734474b	token_metadata: Speed up pending_endpoints_for pending_endpoints_for is called frequently by storage_proxy::create_write_response_handler when doing cql query. Before this patch, each call to pending_endpoints_for involves converting a multimap (std::unordered_multimap<range<token>, inet_address>>) to map (std::unordered_map<range<token>, std::unordered_set<inet_address>>). To speed up the token to pending endpoint mapping search, an interval map is introduced. It is faster than searching the map linearly and can avoid caching the token/pending endpoint mapping. With this patch, the operations per second drop during adding node period gets much better. Before: 45K to 10K After: 45k to 38K (The number is measured with the streaming code skipping to send data to rule out the streaming factor.) Refs: #1223	2016-05-17 17:32:15 +08:00
Asias He	ee0585cee9	dht: Add default constructor for token It is needed to put token in to a boost interval_map in the following patch.	2016-05-17 17:32:15 +08:00
Amnon Heiman	ad34f80e6f	API: change cache_service, column_family and storage_proxy to rate object The API now exposes rate_moving_average and rate_moving_average_and_histogram. The old endpoints remain for the transition period, but are marked as deprecated. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:56:52 +03:00
Amnon Heiman	b33ed48527	API Definition: change cache_service, column_family and storage_proxy to use rate objects This patch replaces the latency histogram with rate_moving_average_and_histogram and the counters with rate_moving_average. The old endpoints were left unchanged but marked as deprecated when needed.	2016-05-17 11:55:06 +03:00
Amnon Heiman	20a48b0f20	API: column family stats break the map_reduce functionality This patch replaces the helper function for column family with two functions, one that collects the relevant column family from all shards and another that does the translation to a JSON object. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:53:15 +03:00
Amnon Heiman	750f30cf07	column_family: Change histogram to timed_rate_moving_average_and_histogram As part of moving the derived statistic in to scylla, this replaces the histogram object in the column_family to timed_rate_moving_average_and_histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:53:15 +03:00
Amnon Heiman	468bcfbf1f	row_cache: Change counter to timed_rate_moving_average_and_histogram As part of moving the derived statistic in to scylla, this replaces the counter in the row_cache stats to timed_rate_moving_average_and_histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:53:15 +03:00
Amnon Heiman	64e0c8cd1b	storage_proxy: Change histogram to timed_rate_moving_average_and_histogram As part of moving the derived statistic in to scylla, this replaces the histogram object in the storage_proxy to timed_rate_moving_average_and_histogram, and the read, write and range counters were replaced by rate_moving_average. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:52:16 +03:00
Amnon Heiman	f6a5a4e3da	API: Add helper function for the rate objects This patch adds the helper functions that are used to sum the rate_moving_average and rate_moving_average_and_histogram. The current sum functionality for histogram was modified to support rate and histogram but return a histogram. This way current endpoints would continue to behave the same. It also cleans the histogram related method by using the plus operator in the histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:49:34 +03:00
Amnon Heiman	8ef25ceb05	Add weighted average rate related object This patch adds a few data structures for derived and accumulative statistics that are similar to the yammer implementation used by the JMX. It also adds a plus operator to histogram which cleans the histogram usage. moving_average - An exponentially-weighted moving average. Calculates an event rate on a given interval. rate_moving_average and timed_rate_moving_average - Calculate 1m, 5m and 15m EWMA, an all-time average and a counter. rate_moving_average_and_histogram and timed_rate_moving_average_and_histogram - Combines a histogram with a rate_moving_average. It also exposes a histogram API so it will be an easy task to replace a histogram with a timed_rate_moving_average_and_histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:47:49 +03:00
Glauber Costa	17b9203719	database: invert order of elements So that the sizes of the region can be initialized first Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <dc3df186a977b492d83c0a397f206c2db940aa37.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:39 +03:00
Glauber Costa	2ff6d38d0c	database: use a single constructor for the column family We've been keeping two constructors for the column family to allow for a version without the commitlog. But it's by now quite complicated to maintain the two, because changes always have to be made in two places. This patch adds a private constructor that does the actual construction, and have the public constructors to call it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:39 +03:00
Glauber Costa	8fede5b98e	memtables: isolate logic for disk writes disabled When we have disk writes disabled, we exit immediately from the flush function. We can just encode that separately and pass a different function in the memtable_list creation. That simplifies the memtable flush a bit. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <908e3b5eb2c6ee84b8ad7b31c3673be5531a087c.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:38 +03:00
Glauber Costa	4981362f57	memtables: always seal through memtable_list seal function I would like to be able to apply a function at the end of every flush, that is common for both memtables and streaming memtables. For instance, to unthrottle current waiters. Right now some calls to seal_active_memtable are open coded, calling the column family's function directly, for both the main memtable list and the streaming list. This patch moves all the current open code callers to call the respective memtable_list function. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0c780254f3c4eb03e2bcd856b83941cf49a84b85.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:37 +03:00
Takuya ASADA	4972a72380	dist: drop 'sudo -E' and SETENV for security reason, source envfile from scripts As Nadav pointed out, SETENV and sudo -E might cause a security hole: https://github.com/scylladb/scylla/issues/1028#issuecomment-196202171 So drop them now, sourcing envfiles from scylla_prepare / scylla_stop scripts instead. Also, on the "[PATCH] ubuntu: Fix the init script variable sourcing" thread we had a problem passing variables from envfiles to scylla_prepare / scylla_stop on Ubuntu, so it seems better to source them from these scripts. Additionally, this fixes #1249 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462989906-30062-1-git-send-email-syuu@scylladb.com>	2016-05-17 10:31:03 +03:00
Pekka Enberg	9c450f673c	cql3: Clean up prepared_metadata class Return vectors by const reference in prepared_metadata class and add a FIXME to result_message class. Message-Id: <1463425756-20225-1-git-send-email-penberg@scylladb.com>	2016-05-17 10:02:14 +03:00
Pekka Enberg	217c1ffa95	cql3: Specify result set flag ABI explicitly As Avi points out, the flag values are an ABI. So specify them explicitly. Message-Id: <1463413379-8355-1-git-send-email-penberg@scylladb.com>	2016-05-16 19:00:52 +03:00
Avi Kivity	a3b23d75b9	Merge "Fix Prepared message metadata serialization" "The Prepared message has a metadata section that's similar to result set metadata but not exactly the same. Fix serialization by introducing a separate prepared_metadata class like Origin has and implement serialization as per the CQL protocol specification. This fixes one CQL binary protocol version 4 issue that we currently have. The changes have been verified by running the gocql integration tests using v4. Please note that this series does not enable v4 for clients because Cassandra 2.1.x series only supports CQL binary protocol v3."	2016-05-16 18:59:54 +03:00
Pekka Enberg	868ff5107c	cql3: Introduce prepared_metadata class Introduce a new prepared_metadata class that holds prepared statement metadata and implement CQL binary protocol serialization that works for all versions.	2016-05-16 18:06:01 +03:00
Tomasz Grabiec	272e89846d	Merge branch 'cache' from git@github.com:haaawk/scylla.git From Piotr: Fixes #656. It makes it possible to slice using clustering ranges in mutation readers. We don't have row index yet so the slicing is just ignoring data which is out of range.	2016-05-16 14:44:33 +02:00
Piotr Jastrzebski	dcba6f5c45	Pass clustering_row_ranges to mutation readers. This will allow readers to reduce the amount of data read. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 14:36:57 +02:00
Pekka Enberg	a68671e247	cql3: Add column_specification::all_in_same_table() helper We need it for the prepared_metadata class that we're about to introduce.	2016-05-16 14:13:31 +03:00
Takuya ASADA	80037aa95b	dist/common/scripts: don't proceed to run scylla_raid_setup when disks not selected, on interactive RAID setup When disks not selected, run disk select prompt again. Fixes #1260 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1463388933-3640-1-git-send-email-syuu@scylladb.com>	2016-05-16 13:45:17 +03:00
Pekka Enberg	adfb4d7bbd	cql3: Move result_set class implementation to source file	2016-05-16 13:20:45 +03:00
Pekka Enberg	8552f222f5	cql3: Clean up result_set class Kill some left-over ifdef'd code from the result_set class. Message-Id: <1463392997-22921-1-git-send-email-penberg@scylladb.com>	2016-05-16 13:09:37 +03:00
Piotr Jastrzebski	23c23abe53	Make memtable mutation_reader slice using clustering ranges. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:41 +02:00
Piotr Jastrzebski	484d2ecd0a	Slice data with clustering key range in sstable reader Add additional parameters to mp_row_consumer to be able to fetch only cells for given clustering key ranges This will be used in row_cache when it will work on clustering key level instead of partition key level. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:30 +02:00
Piotr Jastrzebski	8307681975	Introduce clustering_ranges type. It will be used to slice data returned by mutation_readers. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:09 +02:00
Amnon Heiman	7e07d97e4b	API utils: Adding rate moving average rate_moving_average and rate_moving_average_and_histogram are types that are used by the JMX. They are based on the yammer meter and timer and are used to collect derivative information. Specifically: rate_moving_average calculates rates and rate_moving_average_and_histogram collects rates and histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-16 11:40:19 +03:00
Pekka Enberg	17765b6c06	Merge seastar upstream * seastar 3dec26f...6a849ac (4): > seastar::socket: Be resilient against ENOTCONN > Merge " improve performance and predictability of syscall thread communications" from Glauber > rpc_test: Shutdown properly > [PATCH} future: better detect get_future() on already used promise	2016-05-16 08:04:47 +03:00
Yoav Kleinberger	de7952a8db	tools/scyllatop: log input from collectd for easier debugging When running with DEBUG verbosity, scyllatop will now log every single value it receives from collectd. When you suspect that scyllatop is somehow distorting values, this is a good way to check it. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1463320730-6631-1-git-send-email-yoav@scylladb.com>	2016-05-15 19:17:10 +03:00
Tomasz Grabiec	1eabe9b840	storage_proxy: Add trace-level logging for mutating Message-Id: <1462978554-31217-1-git-send-email-tgrabiec@scylladb.com>	2016-05-12 13:52:56 +03:00
Tomasz Grabiec	7207cc8b1a	storage_proxy: Improve error reporting Knowing the source node can help in debugging the issue. Message-Id: <1462978535-31164-1-git-send-email-tgrabiec@scylladb.com>	2016-05-12 13:52:39 +03:00
Pekka Enberg	b5d9aa866d	Merge "Fixes for schema synchronization" from Tomek "Writes may start to be rejected by replicas after issuing alter table which doesn't affect columns. This affects all versions with alter table support. Fixes #1258"	2016-05-12 09:43:25 +03:00
Duarte Nunes	7dbeef3c39	storage_service: Fix ignored future in on_alive This patch ensures the future created by invoke_on_all is not ignored by waiting on it, which is safe to do since we are within a seastar::async context. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1462989837-7326-1-git-send-email-duarte@scylladb.com>	2016-05-12 09:03:46 +03:00
Tomasz Grabiec	13d8cd0ae9	migration_manager: Invalidate prepared statements on every schema change Currently we only do that when column set changes. When prepared statements are executed, parameters like read repair chance are read from schema version stored in the statement. Not invalidating prepared statements on changes of such parameters will appear as if alter took no effect. Fixes #1255. Message-Id: <1462985495-9767-1-git-send-email-tgrabiec@scylladb.com>	2016-05-12 08:58:40 +03:00
Tomasz Grabiec	90c31701e3	tests: Add unit tests for schema_registry	2016-05-11 17:31:22 +02:00
Tomasz Grabiec	443e5aef5a	schema_registry: Fix possible hang in maybe_sync() if syncer doesn't defer Spotted during code review. If it doesn't defer, we may execute then_wrapped() body before we change the state. Fix by moving then_wrapped() body after state changes.	2016-05-11 17:31:22 +02:00
Tomasz Grabiec	8703136a4f	migration_manager: Fix schema syncing with older version The problem was that "s" would not be marked as synced-with if it came from shard != 0. As a result, mutation using that schema would fail to apply with an exception: "attempted to mutate using not synced schema of ..." The problem could surface when altering schema without changing columns and restarting one of the nodes so that it forgets past versions. Fixes #1258. Will be covered by dtest: SchemaManagementTest.test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns	2016-05-11 17:29:14 +02:00
Takuya ASADA	8503600e30	dist/common/systemd: drop hardcoded path Stop using /var/lib/scylla, use $SCYLLA_HOME instead. systemd does not seem to extract variables on Environment="HOME=$SCYLLA_HOME", but both CentOS and Ubuntu are able to run scylla-server without $HOME, so drop it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462977871-26632-1-git-send-email-syuu@scylladb.com>	2016-05-11 17:53:53 +03:00
Calle Wilund	152bd82a05	alter_keyspace_statement: Handle missing replication strategy ALTER KEYSPACE should allow no replication strategy to be set, in which case old strategy should be kept. Initial translation from origin missed this. Fixes #1256 Message-Id: <1462967584-2875-2-git-send-email-calle@scylladb.com>	2016-05-11 16:02:22 +03:00
Calle Wilund	5604fb8aa3	cql3::statements::cf_prop_defs: Fix compaction min/max not handled Property parsing code was looking at the wrong property level for the initial guard statement. Fixes #1257 Message-Id: <1462967584-2875-1-git-send-email-calle@scylladb.com>	2016-05-11 16:02:16 +03:00
Takuya ASADA	c38b5fbb3d	dist/common/scripts: On scylla_io_setup, run iotune on the correct data directory as specified in scylla.yaml Currently scylla_io_setup is hardcoded to run iotune on /var/lib/scylla, but the user may change the data directory by modifying scylla.yaml, and it may be on a different block device. So use scylla_config_get.py to get the configuration from scylla.yaml and pass it to iotune. Fixes #1167 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462955824-21983-2-git-send-email-syuu@scylladb.com>	2016-05-11 13:02:25 +03:00
Takuya ASADA	53820393da	dist/common/scripts: add scylla.yaml parser for scripts To parse scylla.yaml, scylla_config_get.py is added. It can be used like 'scylla_config_get.py [key name]' from a shell script or the command line. This is needed for scylla_io_setup, to get 'data_file_directories' from a shell script. Currently it does not support specifying a key name inside a nested data structure, but that is enough for scylla_io_setup. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462955824-21983-1-git-send-email-syuu@scylladb.com>	2016-05-11 13:02:23 +03:00
Pekka Enberg	d93d46e721	Merge "ALTER KEYSPACE" from Calle "Implementation of ALTER KEYSPACE. Fixes #429"	2016-05-10 22:07:06 +03:00
Takuya ASADA	a73924b4e0	dist/ubuntu/dep: introduce scylla-gdb-7.11 for Ubuntu 14.04LTS Introduce scylla-gdb-7.11 for Ubuntu 14.04LTS, to get better support of recent version of g++ on gdb. Fixes #969 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462825880-20866-3-git-send-email-syuu@scylladb.com>	2016-05-10 17:53:32 +03:00
Takuya ASADA	9ff2efb28b	dist/common/dep: add Ubuntu support for scylla-env Since Ubuntu 14.04LTS needs scylla-gdb package which install to /opt/scylladb, we need to port scylla-env package to Ubuntu as well. This change introduces scylla-env package to Ubuntu 14.04LTS. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462825880-20866-2-git-send-email-syuu@scylladb.com>	2016-05-10 17:53:32 +03:00
Takuya ASADA	43cc77d1b8	dist/redhat/centos_dep: move scylla-env to dist/common to share with Ubuntu Since Ubuntu 14.04LTS needs scylla-gdb package which install to /opt/scylladb, we need to port scylla-env package to Ubuntu as well. To do it, share the package directory on dist/common/dep at first. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462825880-20866-1-git-send-email-syuu@scylladb.com>	2016-05-10 17:53:31 +03:00
Calle Wilund	147aa81177	Cql.g: Handle ALTER KEYSPACE	2016-05-10 14:36:46 +00:00
Calle Wilund	5c36d2e09e	alter_keyspace_statement: Implement Note: Like create keyspace, we don't properly validate replication strategy yet.	2016-05-10 14:36:17 +00:00
Piotr Jastrzebski	240a185727	Stop scanning keyspace data directory when populating. Iterate over column families and check/create directories for them instead of scanning keyspace data directory and filtering directories against column families that exist in system tables for this keyspace. Fixes #1008 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <26da66eec67a1ab1318917a66161915cdef924ab.1462890592.git.piotr@scylladb.com>	2016-05-10 17:35:55 +03:00
Calle Wilund	63b6c6bb5a	migration_manager: Implement announce_keyspace_update More or less the same as create keyspace...	2016-05-10 14:34:51 +00:00
Calle Wilund	8cdf4e37fb	schema_tables: Fix merge_keyspaces to handle alter keyspace Must keep "altered" alive into the call chain.	2016-05-10 14:32:51 +00:00
Calle Wilund	6ef7885ae3	database: Implement update_keyspace Reloads keyspace metadata and replaces in existing keyspace. Note: since keyspace metadata, and consequently, replication strategy now becomes volatile, keyspace::metadata now returns shared pointer by value (i.e. keep-alive). Replication strategy should receive the same treatment, but since it is extensively used, but never kept across a continuation, I've just added a comment for now.	2016-05-10 14:31:30 +00:00
Raphael S. Carvalho	d80d194873	compaction_manager: stop compaction tasks in parallel Purpose is to speed up shutdown. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <a8db3492f1ceeea2a886d3920e5effa841ea155f.1462838670.git.raphaelsc@scylladb.com>	2016-05-10 10:03:35 +03:00
Avi Kivity	28cc6f97af	Merge	2016-05-09 14:25:25 +03:00
Calle Wilund	917bf850fa	transport::server: Do not treat accept exception as fatal 1.) It most likely is not, i.e. either tcp or more likely, ssl negotiation failure. In any case, we can still try next connection. 2.) Not retrying will cause us to "leak" the accept, and then hang on shutdown. Also, promote logging message on accept exception to "warn", since dtest(s?) depend on seeing log output. Message-Id: <1462283265-27051-4-git-send-email-calle@scylladb.com>	2016-05-09 14:13:07 +03:00
Calle Wilund	437ebe7128	cql_server: Use credentials_builder to init tls Slightly cleaner, and shard-safe tls init. Message-Id: <1462283265-27051-3-git-send-email-calle@scylladb.com>	2016-05-09 14:12:59 +03:00
Calle Wilund	58f7edb04f	messaging_service: Change tls init to use credentials_builder To simplify init of msg service, use credentials_builder to encapsulate tls options so actual credentials can be more easily created in each shard. Message-Id: <1462283265-27051-2-git-send-email-calle@scylladb.com>	2016-05-09 14:12:53 +03:00
Avi Kivity	29e103a2ae	Merge seastar upstream * seastar 7782ad4...3dec26f (3): > tests/mkcert.gmk: Fix makefile bug in snakeoil cert generator > tls_test: Add case to do a little checking of credentials_builder > tls: Add credentials_builder - copyable credentials "factory"	2016-05-09 14:12:29 +03:00
Tomasz Grabiec	1ca5ceadff	Merge tag '1235-v2' from https://github.com/avikivity/scylla From Avi: When we shut down, we may have to give up on some pending atomic sstable deletions, because not all shards may have agreed to delete all members of the set. This is expected, so silence these frightening error messages. Fixes #1235.	2016-05-09 12:22:41 +02:00
Duarte Nunes	dada385826	rpc: Secure connection attempts can be cancelled This patch adds support for secure connection attempts to be cancellable. Fixes #862 Includes seastar upstream merge: * seastar f1a3520...7782ad4 (1): > Merge "rpc: Allow client connections to be cancelled" from Duarte Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1462783335-10731-1-git-send-email-duarte@scylladb.com>	2016-05-09 11:44:53 +03:00
Takuya ASADA	f7d41ba07a	dist: Extract scylla.yaml and create metapackage This patch create a scylla-conf package containing scylla.yaml and a scylla package acting as a metapackage. Fixes #421 Signed-off-by: Benoît Canet <benoit@scylladb.com> Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1462280987-26909-1-git-send-email-syuu@scylladb.com>	2016-05-09 11:23:28 +03:00
Avi Kivity	4b34152870	Merge seastar upstream * seastar ab74536...f1a3520 (2): > rpc: clear outgoing queue of a socket after failed connection > Merge "unconnected socket (now seastar::socket)" from Duarte Fixes #1236.	2016-05-09 10:16:15 +03:00
Raphael S. Carvalho	3ac22bc0d7	compaction_manager: simplify code that waits for cleanup termination Now that a task is created on demand, it's possible to wait for termination of cleanup without extra machinery. However, shared_future<> is now used because we may have more than one fiber waiting for completion of task. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <209de365c7782742dc2876a66f9d0784998cae53.1462599296.git.raphaelsc@scylladb.com>	2016-05-08 11:26:36 +03:00
Avi Kivity	ee7225a9cb	sstables: silence atomic deletion cancellation logs during sstable deletion Those logs are expected during shutdown.	2016-05-07 20:37:49 +03:00
Avi Kivity	80302d98dd	database: silence atomic deletion cancellation logs during compaction Those logs are expected during shutdown.	2016-05-07 20:37:48 +03:00
Avi Kivity	43221fc7e2	sstables: make delete_atomically() throw a distinct exception when cancelled Throwing a runtime_error makes it impossible to catch the cancellation exception, so replace it with a distinct exception class.	2016-05-07 20:37:46 +03:00
Calle Wilund	709dd82d59	storage_service: Add logging to match origin Pointing out if CQL server is listening in SSL mode. Message-Id: <1462368016-32394-2-git-send-email-calle@scylladb.com>	2016-05-06 13:27:55 +03:00
Raphael S. Carvalho	bf18025937	main: stop compaction manager earlier Avi says: "During shutdown, we prevent new compactions, but perhaps too late. Memtables are flushed and these can trigger compaction." To solve that, let's stop compaction manager at a very early step of shutdown. We will still try to stop compaction manager in database::stop() because user may ask for a shutdown before scylla was fully started. It's fine to stop compaction manager twice. Only the first call will actually stop the manager. Fixes #1238. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <c64ab11f3c91129c424259d317e48abc5bde6ff3.1462496694.git.raphaelsc@scylladb.com>	2016-05-06 07:41:29 +03:00
Calle Wilund	d8ea85cd90	messaging_service: Add logging to match origin To announce rpc port + ssl if on. Message-Id: <1462368016-32394-1-git-send-email-calle@scylladb.com>	2016-05-05 10:26:01 +03:00
Raphael S. Carvalho	b8277979ef	compaction_manager: fix indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <82c6b93b24cbcc97f5eff3f91b05d4c1b415ecee.1462412927.git.raphaelsc@scylladb.com>	2016-05-05 10:06:56 +03:00
Avi Kivity	3aefa4f1d2	Merge seastar upstream * seastar e536555...ab74536 (4): > reactor: kill max_inline_continuations > smp: optimize smp_message_queue::flush_request_batch() for empty queue > thread: do not yield if idle > Merge "Fixes for iotune" from Glauber	2016-05-05 09:48:58 +03:00
Gleb Natapov	f1cd52ff3f	tests: test for result row counting Message-Id: <1462377579-2419-2-git-send-email-gleb@scylladb.com>	2016-05-04 18:18:17 +02:00
Gleb Natapov	b75475de80	query: fix result row counting for results with multiple partitions Message-Id: <1462377579-2419-1-git-send-email-gleb@scylladb.com>	2016-05-04 18:18:15 +02:00
Gleb Natapov	2a00c06dd5	query: fix non full clustering key deserialization Clustering key prefix may have fewer columns than described in the schema. Deserialization should stop when the end of the buffer is reached. Message-Id: <20160503140420.GP23113@scylladb.com>	2016-05-04 17:42:28 +02:00
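To illustrate the fix described in the row above, here is a minimal Python sketch (not the Scylla C++ code) of a deserializer that stops at the end of the buffer instead of expecting one component per schema column. The wire format, a 2-byte big-endian length before each component, and the name `deserialize_prefix` are assumptions made purely for this illustration.

```python
import struct
from typing import List


def deserialize_prefix(buf: bytes, schema_columns: int) -> List[bytes]:
    """Read at most `schema_columns` components, stopping early at end of buffer.

    Assumed (hypothetical) layout: each component is a 2-byte big-endian
    length followed by that many bytes of payload.
    """
    components: List[bytes] = []
    pos = 0
    # A prefix may carry fewer components than the schema describes,
    # so the end of the buffer is a valid stopping point.
    while len(components) < schema_columns and pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        components.append(buf[pos:pos + length])
        pos += length
    return components
```

With a three-column schema and a buffer holding only two components, the sketch returns the two components rather than failing on the missing third.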
Raphael S. Carvalho	5aeeb0b3e8	compaction: add support to parallel compaction on the same column family It was noticed that small sstables will accumulate for a column family because scylla was limited to two compactions per shard, and a column family could have at most one compaction running at a given shard. With the number of sstables increasing rapidly, read performance is degraded. At the moment, our compaction manager works by running two compaction task handlers that run in parallel to the rest of the system. Each task handler gets to run when needed, gets a column family from the compaction manager queue, runs compaction on it, and goes to sleep again. That's basically its cycle. Compaction manager only allows one instance of a column family to be on its queue, meaning that it's impossible for a column family to be compacted in parallel. One compaction starts after another for a given column family. To solve the problem described, we want to concurrently run compaction jobs of a column family that have different "size tier" (or "weight"). For those unfamiliar, a compaction job contains a list of sstables that will be compacted together. The "size tier" of a compaction job is the log of the total size of the input sstables. So a compaction job only gets to run if its "size tier" is not the same as that of an ongoing compaction. There is no point in compacting concurrently at the same "size tier", because that slows down both compactions. We will no longer queue column families in compaction manager. Instead, we create a new fiber to run compaction on demand. This fiber that runs asynchronously will do the following: 1) Get a compaction job from compaction strategy. 2) Calculate "size tier" of compaction job. 3) Run compaction job if its "size tier" is not the same as that of an ongoing compaction for the given column family. As before, it may decide to re-compact a column family based on a stat stored in column family object. Ran all compaction-related dtests. Fixes #1216. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>	2016-05-04 11:46:09 +03:00
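The three steps in the commit message above can be sketched roughly as follows. This is an illustrative Python model, not the Scylla implementation: the class `TierGate`, its methods, and the choice of base 2 for the logarithm are assumptions (the message only says "the log of the total size").

```python
from collections import defaultdict


def size_tier(sstable_sizes):
    """Floor of log2 of the total input size in bytes (base 2 is an assumption)."""
    total = sum(sstable_sizes)
    return max(total, 1).bit_length() - 1


class TierGate:
    """Hypothetical helper tracking which size tiers are being compacted per column family."""

    def __init__(self):
        self._running = defaultdict(set)  # column family name -> set of active tiers

    def try_start(self, column_family, sstable_sizes):
        """Return True and register the job unless its tier is already active."""
        tier = size_tier(sstable_sizes)
        if tier in self._running[column_family]:
            return False  # same tier already compacting: running both would slow each
        self._running[column_family].add(tier)
        return True

    def finish(self, column_family, sstable_sizes):
        """Release the tier once the job completes."""
        self._running[column_family].discard(size_tier(sstable_sizes))
```

In this model, two jobs on the same column family run concurrently only when their total input sizes fall into different tiers; jobs on different column families never block each other.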
Calle Wilund	6d2caedafd	auth: Make auth.* schemas use deterministic UUIDs In initial implementation I figured this was not required, but we get issues communicating across nodes if system tables don't have the same UUID, since creation is forcefully local, yet shared. Just do a manual re-create of the schema with a name UUID, and use migration manager directly. Message-Id: <1462194588-11964-1-git-send-email-calle@scylladb.com>	2016-05-03 10:48:24 +03:00
Avi Kivity	24f90b087f	Merge "fix range queries with limiter to not generate more requests than needed" from Gleb Fixes #1204.	2016-05-02 15:14:45 +03:00
Gleb Natapov	3039e4c7de	storage_proxy: stop range query with limit after the limit is reached	2016-05-02 15:10:15 +03:00
Gleb Natapov	db322d8f74	query: put live row count into query::result The patch calculates row count during result building and while merging. If one of results that are being merged does not have row count the merged result will not have one either.	2016-05-02 15:10:15 +03:00
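The merge rule stated in the row above (a merged result has a row count only if every input has one) can be sketched as below. This is a Python illustration, not the Scylla C++ code; the function name `merge_row_counts` and the use of `None` for "no row count" are assumptions.

```python
from typing import Iterable, Optional


def merge_row_counts(counts: Iterable[Optional[int]]) -> Optional[int]:
    """Sum per-result live row counts; a single missing count makes the total unknown."""
    total = 0
    for count in counts:
        if count is None:
            # One partial result without a row count means the merged
            # result cannot carry a trustworthy count either.
            return None
        total += count
    return total
```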
Gleb Natapov	41c586313a	storage_proxy: fix calculation of concurrency queried ranges	2016-05-02 15:10:15 +03:00
Gleb Natapov	c364ab9121	storage_proxy: add logging for range query row count estimation	2016-05-02 15:10:15 +03:00
Calle Wilund	751ba2f0bf	messaging_service: Change init to use per-shard tls credentials Fixes: #1220 While the server_credentials object is technically immutable (esp with last change in seastar), the ::shared_ptr holding them is not safe to share across shards. Pre-create cpu x credentials and then move-hand them out in service start-up instead. Fixes assertion error in debug builds. And just maybe real memory corruption in release. Requires seastar tls change: "Change server_credentials to copy dh_params input" Message-Id: <1462187704-2056-1-git-send-email-calle@scylladb.com>	2016-05-02 15:04:40 +03:00
Raphael S. Carvalho	ae95ce1bd7	sstables: optimize leveled compaction strategy Leveled compaction strategy is doing a lot of work whenever it's asked to get a list of sstables to be compacted. It's checking if a sstable overlaps with another sstable in the same level twice. First, when adding a sstable to a list with sstables at the same level. Second, after adding all sstables to their respective lists. It's enough to check that a sstable creates an overlap in its level only once. So I am changing the code to unconditionally insert a sstable to its respective list, and after that, it will call repair_overlapping_sstables() that will send any sstable that creates an overlap in its level to L0 list. By the way, the optimization isn't in the compaction itself, instead in the strategy code that gets a set of sstables to be compacted. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <8c8526737277cb47987a3a5dbd5ff3bb81a6d038.1461965074.git.raphaelsc@scylladb.com>	2016-05-02 11:18:39 +03:00
Avi Kivity	dc69999fd8	Merge seastar upstream * seastar dab58e4...e536555 (5): > rpc: introduce outgoing packet queue > Add condition variable implementation. > future-utils: support futures with multiple values in map_reduce > tests: rpc: stop client and server > tls_test: Add test for large-ish buffer send/recieve	2016-05-02 11:10:33 +03:00
Takuya ASADA	122330a5eb	dist/common/scripts: add interactive prompt for package installation check, also check scylla-tools installed Currently scylla_setup is unusable when the user does not want to install scylla-jmx because it checks for the package unconditionally, but some users (or developers) do not want to install it, so ask on an interactive prompt whether to skip the check. Also, the scylla-tools package should be installed in most cases, so a check for that package was added. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1460662354-10221-1-git-send-email-syuu@scylladb.com>	2016-05-01 14:50:50 +03:00
Takuya ASADA	cc74b6ff5f	dist/ubuntu: move lines from rules to .install/.dirs/.docs To simplify the build script, and make it easier to split into two packages, use .install/.dirs/.docs instead of rules. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461960695-30647-1-git-send-email-syuu@scylladb.com>	2016-05-01 10:16:35 +03:00
Avi Kivity	434db0bc8b	Update scylla-ami submodule * dist/ami/files/scylla-ami 7019088...72ae258 (1): > Add --repo option to scylla_install_ami to construct AMI with custom repository URL	2016-04-28 16:41:30 +03:00
Takuya ASADA	6723978891	dist/ami: Add --repo option to build_ami.sh to construct AMI with custom repository URL To build AMI from specified build of .rpm, custom repo URL option is required. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461849370-11963-1-git-send-email-syuu@scylladb.com>	2016-04-28 16:40:49 +03:00
Takuya ASADA	3ec47fbcf0	dist/ubuntu: unofficial support Debian 8.4 Unofficial support for Debian 8.4. We now support both Ubuntu and Debian, but keep the directory name as 'dist/ubuntu' for now. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461006868-28273-1-git-send-email-syuu@scylladb.com>	2016-04-27 15:39:20 +03:00
Pekka Enberg	31090f3116	Merge "Fix for systemd support on Ubuntu, add Ubuntu 16.04 support" from Takuya "This is bug fix for systemd support on Ubuntu, and add Ubuntu 16.04 support."	2016-04-27 15:37:25 +03:00
Takuya ASADA	1cfde50102	dist/ubuntu: support 16.04 Drop 'unsupported release' message on 16.04. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-04-27 18:06:59 +09:00
Takuya ASADA	988b7bcd3d	dist/ubuntu: don't use ubuntu-toolchain-r/test ppa repo on recent versions of Ubuntu, since it has newer g++ On Ubuntu 15.04 and newer, official g++ package is >= g++-4.9. So we don't need to use development repository, just use official package. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-04-27 18:06:59 +09:00
Takuya ASADA	fa0b90b727	dist/ubuntu: add dependency for libsystemd-dev to handle startup correctly on recent versions of Ubuntu To handle scylla startup correctly on systemd versions of Ubuntu, scylla needs to be built with libsystemd-dev. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-04-27 18:06:59 +09:00
Takuya ASADA	eae881ff70	dist/ubuntu: skip dh_installinit --upstart-only on recent versions of Ubuntu Since 16.04LTS does not support this argument anymore, drop it on recent versions of Ubuntu which do not use Upstart. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-04-27 18:06:59 +09:00
Takuya ASADA	d5efa02eab	dist/ubuntu/dep: Drop python-support on Ubuntu 16.04 Ubuntu 16.04 seems dropped python-support, so remove it from thrift package. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-04-27 18:06:59 +09:00
Takuya ASADA	e733c2aae8	dist/ubuntu/dep: use distribution's thrift-compiler-0.9.1 on newer versions of Ubuntu Use distribution's thrift if version > 14.04LTS. 14.04LTS doesn't have thrift-compiler-0.9.1, use our version. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-04-27 18:06:59 +09:00
Avi Kivity	ad9e75a3fa	Merge seastar upstream * seastar 15a92cf...dab58e4 (6): > tls: Fix tls sink::put so it deals with larger packets > tls: Change server_credentials to copy dh_params input > seastar thread: allow the thread_scheduling_group's usage fraction to change > seastar::async allow passing an attribute > thread: document undocumented classes > fair_queue: fix inconsistency during renormalization	2016-04-27 10:39:09 +03:00
Avi Kivity	454512a272	dist/redhat: package scylla_kernel_check Can't build rpm without this. Message-Id: <1461683947-30356-1-git-send-email-avi@scylladb.com>	2016-04-27 08:38:48 +03:00
Tomasz Grabiec	61435108a5	query: Do not take arguments via ... in the visitor Amnon reports that current code fails to compile on gcc 4.9: distcc[9700] ERROR: compile /home/amnon/.ccache/tmp/query.tmp.localhost.localdomain.9673.ii on localhost failed In file included from query.cc:30:0: query-result-reader.hh: In instantiation of ‘void query::result_view::consume(const query::partition_slice&, ResultVisitor&&) [with ResultVisitor = query::result::calculate_row_count(const query::partition_slice&)::<anonymous struct>&]’: query.cc:196:32: required from here query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class clustering_key_prefix’ through ‘...’ visitor.accept_new_row(*row.key(), static_row, view); ^ query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’ query-result-reader.hh:184:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’ query-result-reader.hh:186:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’ visitor.accept_new_row(static_row, view); ^ query-result-reader.hh:186:21: error: cannot pass objects of non-trivially-copyable type ‘class query::result_row_view’ through ‘...’ Work around the problem by not using '...'. Message-Id: <1460964042-2867-1-git-send-email-tgrabiec@scylladb.com>	2016-04-26 14:50:35 +03:00
Takuya ASADA	eb9bd3ee21	dist/common/scripts: show knowledge base URL when kernel is too old To explain why this kernel is not supported, we need to show kb URL here. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461644708-32078-1-git-send-email-syuu@scylladb.com>	2016-04-26 14:43:10 +03:00
Takuya ASADA	05ac4bb99d	dist/common/scripts: notice restart required after changing bootparameters Fixes #1115 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459330851-32470-1-git-send-email-syuu@scylladb.com>	2016-04-26 14:41:49 +03:00
Tomasz Grabiec	88bb5fcb53	api: Fix error message Keyspace and table names are separated by a single colon. Message-Id: <1461600269-4070-1-git-send-email-tgrabiec@scylladb.com>	2016-04-26 08:40:28 +03:00
Takuya ASADA	e7f438eeae	dist/ubuntu: Drop dependency to libthrift0, link it statically Drop dependency to libthrift0 on installation time, link libthrift statically. With this fix, we don't need to distribute libthrift0 deb package anymore to install scylla-server binary package. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461594460-2403-2-git-send-email-syuu@scylladb.com>	2016-04-25 17:44:46 +03:00
Takuya ASADA	ec2ef467c8	configure.py: configure.py: add --static-thrift option to link libthrift statically This is needed for Ubuntu packaging, to drop dependency to libthrift0 on installation time. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461594460-2403-1-git-send-email-syuu@scylladb.com>	2016-04-25 17:44:44 +03:00
Avi Kivity	af803a9149	Merge seastar upstream * seastar 2b3c363...15a92cf (2): > smp: allow more than 128 in-flight operations on core-to-core queue > future: balance constructors and destructors in future_state<> Fixes #1205.	2016-04-25 13:34:27 +03:00
Calle Wilund	cdd0f00de5	client_state: Remove unwarranted keyspace check "has_keyspace_access" is not supposed to (according to origin) verify that a keyspace exists. Remove. It (and all others) are however supposed to check "ks" (name) not empty. Add this. Message-Id: <1461578072-24113-1-git-send-email-calle@scylladb.com>	2016-04-25 13:16:36 +03:00
Calle Wilund	49d3d79dfe	sstables: Fix compilation error on boost 1.55 Message-Id: <1461067254-526-2-git-send-email-calle@scylladb.com>	2016-04-25 12:54:44 +03:00
Calle Wilund	9130b0de16	database.cc: Fix compilation error with boost 1.55 Message-Id: <1461067254-526-1-git-send-email-calle@scylladb.com>	2016-04-25 12:54:43 +03:00
Takuya ASADA	c657a431dc	dist/common/scripts: Fix incorrect order to run scylla_sysconfig_setup on scylla_setup Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461174517-12441-2-git-send-email-syuu@scylladb.com>	2016-04-25 11:09:49 +03:00
Takuya ASADA	9a99231f6b	dist/common/scripts: On scylla_setup, skip showing 'lo' interface on sysconfig prompt Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461174517-12441-1-git-send-email-syuu@scylladb.com>	2016-04-25 11:09:48 +03:00
Takuya ASADA	611b0a3400	dist/common/scripts: Add kernel version check Check kernel version at beginning of scylla_setup, show error when kernel is too old. Use iotune --fs-check to check kernel. Fixes #1116 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459886738-10882-1-git-send-email-syuu@scylladb.com>	2016-04-24 17:47:13 +03:00
Vlad Zolotarov	813ad4024f	query_processor: account unprepared statements executions Add the statistics counter for a number of unprepared statements executions and expose it with collectd. Since in our implementation a number of unprepared statements executions equals to a number of executions of prepare() function we may simply increment the new statistics counter every time query_processor::get_statement() is called. Fixes #1068 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1461503492-32228-1-git-send-email-vladz@cloudius-systems.com>	2016-04-24 16:55:15 +03:00
Avi Kivity	c6b5890eb2	Merge	2016-04-24 16:17:00 +03:00
Pekka Enberg	f6da9bc92b	Merge "Additional mutations/queries related collectd metrics" from Vlad "This series introduces some additional metrics (mostly) in a storage_proxy and a database level that are meant to create a better picture of how data flows in the cluster. First of all where possible counters of each category (e.g. total writes in the storage proxy level) are split into the following categories: - operations performed on a local Node - operations performed on remote Nodes aggregated per DC In a storage_proxy level there are the following metrics that have this "split" nature (all on a sending side): - total writes (attempts/errors) - writes performed as a result of a Read Repair logic - total data reads (attempts/completed/errors) - total digest reads (attempts/completed/errors) - total mutations data reads (attempts/completed/errors) In a batchlog_manager: - writes performed as a result of a batchlog replay logic Thereby if for instance somebody wants to get an idea of how many writes the current Node performs due to user requested mutations only he/she has to take a counter of total writes and subtract the writes resulted by Read Repairs and batchlog replays. On a receiving side of a storage_proxy we add the two following counters: - total number of received mutations - total number of forwarded mutations (attempts/errors) In order to get a better picture of what is going on on a local Node we are adding two counters on a database level: - total number of writes - total number of reads Comparing these to total writes/reads in a storage_proxy may give a good idea if there is an excessive access to a local DB for example."	2016-04-21 15:58:45 +03:00
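The subtraction described in the merge message above (user-requested writes = total writes minus read-repair writes minus batchlog-replay writes) can be written out as a small worked example. The function and parameter names are hypothetical and do not correspond to actual collectd metric names.

```python
def user_requested_writes(total_writes: int,
                          read_repair_writes: int,
                          batchlog_replay_writes: int) -> int:
    """Estimate writes caused by user mutations only, per the description above."""
    return total_writes - read_repair_writes - batchlog_replay_writes
```

For example, a node reporting 1000 total writes, of which 50 came from read repair and 20 from batchlog replay, performed 930 writes on behalf of user-requested mutations.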
Takuya ASADA	2bfc8e8c12	main: add tcp_syncookies sanity check Check net.ipv4.tcp_syncookies, show error message when it set to 0. Fixes #1118 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1460738415-3798-1-git-send-email-syuu@scylladb.com>	2016-04-21 14:55:26 +03:00
Pekka Enberg	3f1fcca3bc	cql3: Fix DROP KEYSPACE error message when keyspace does not exist Commit `d3fe0c5` ("Refactor db/keyspace/column_family toplogy") changed database::find_keyspace() to throw a std::nested_exception so the catch block in migration_manager::announce_keyspace_drop() no longer catches the exception. Fix the issue by explicitly checking if the keyspace exists and throwing the correct exception type if it doesn't. Fixes TestCQL.keyspace_test. Message-Id: <1461218910-26691-1-git-send-email-penberg@scylladb.com>	2016-04-21 12:42:45 +02:00
Vlad Zolotarov	4ef5b11e9b	batchlog_manager: add a counter for a total number of write attempts Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-04-21 11:29:21 +03:00
Vlad Zolotarov	97e5bfa815	database: add metrics for total writes and reads This patch adds a counter of total writes and reads for each shard. It seems that nothing ensures that all database queries are ready before database object is destroyed. Make _stats lw_shared_ptr in order to ensure that the object is alive when lambda gets to incrementing it. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-04-21 11:28:53 +03:00
Vlad Zolotarov	9bf8253412	storage_proxy: add read requests split counters Add split (local Nodes, external Nodes aggregated per Nodes' DCs) counters for the following read categories: - data reads - digest reads - mutation data reads Each category is added attempts, completions and errors metrics. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-04-21 11:28:19 +03:00
Vlad Zolotarov	cbcbdc3b4a	storage_proxy: add split counters for writes Added split metrics for operations on a local Node and on external Nodes aggregated per Nodes' DCs. Added separate split counters for: - total writes attempts/errors - read repair write attempts (there is no easy way to separate errors at the moment) Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-04-21 11:28:15 +03:00
Vlad Zolotarov	c92654b281	storage_proxy: add counters for received and forwarded mutations Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-04-21 11:27:29 +03:00
Piotr Jastrzebski	8231385e0c	sstables: Remove unused code from mp_row_consumer _mutation_to_subscription is not used anywhere so it should probably be removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <90ef62daee0c183b29dcb86d08843145d657ea38.1461179970.git.piotr@scylladb.com>	2016-04-20 23:10:43 +03:00
Raphael S. Carvalho	eb51c93a5a	tests: fix use-after-free in sstable test After commit `a843aea547`, a gate was introduced to make sure that an asynchronous operation is finished before column family is destroyed. A sstable testcase was not stopping column family, instead it just removed column family from compaction manager. That could cause a use-after-free if column family is destroyed while the asynchronous operation is running. Let's fix it by stopping column family in the test. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ed910ec459c1752148099e6dc503e7f3adee54da.1461177411.git.raphaelsc@scylladb.com>	2016-04-20 22:08:08 +03:00
Pekka Enberg	7af9ac2880	Merge "Add support for User Defined Types" from Duarte "This patchset enables support for user defined types, completing the functionality that was already in place. Fixes #426"	2016-04-20 21:26:03 +03:00
Yoav Kleinberger	1543253bfd	scyllatop: differentiate metrics coming from different hosts Fix issue #1173. Previously scyllatop aggregated metrics coming from a cluster with many hosts so that individual contributions could not be recognized. This is now changed so that aggregation is also by hostname. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <8a4d8b82216d8c1aa855026ff31bcfd8bfac7e47.1461150261.git.yoav@scylladb.com>	2016-04-20 20:20:09 +02:00
Duarte Nunes	c04f8c239e	udt: Enable user type query test case This patch enables the test case for user defined types in cql_query_test. Fixes #426 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:07 +02:00
Duarte Nunes	bc90d6a730	udt: type_parser handles user defined types This patch ensures type_parser can handle user defined types. It also prefixes user_type_impl::make_name() with org.apache.cassandra.db.marshal.UserType. Fixes #631 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:07 +02:00
Duarte Nunes	b5a87f8bdc	udt: Add unit test for user type schema changes Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:07 +02:00
Duarte Nunes	7911438de0	udt: Add grammar for altering user types This patch adds support in Cql.g for the alter user type statement. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:07 +02:00
Duarte Nunes	fbf70e9bed	udt: Add alter type statement Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:07 +02:00
Duarte Nunes	3e663cfa9a	udt: Add capability to replace a user_type This patch adds a function to abstract_type that locates the usage of a given user_type and recursively returns an updated version of the containing type containing the updated user type. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:06 +02:00
Duarte Nunes	6cb57a567f	udt: Add grammar for dropping user types This patch adds support in Cql.g for the drop user type statement. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:06 +02:00
Duarte Nunes	809b45e160	udt: Add drop type statement Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 18:07:02 +02:00
Calle Wilund	7f85373e15	cql3/drop_table_statement: Fix exception handling in access check Tried to handle possibly benign exception in continuation, but this is always thrown synchronously. Fixes ttl_test dtest failures. Message-Id: <1461154499-10674-1-git-send-email-calle@scylladb.com>	2016-04-20 15:49:04 +03:00
Duarte Nunes	66c60f03fe	udt: Add references_user_type to abstract_type This patch adds a virtual function to the abstract_type hierarchy to tell whether a given type references the specified type. Needed to implement the drop and alter type statements. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:07 +02:00
Duarte Nunes	6732da67ab	udt: Add is_user_type function to abstract_type This patch adds a function to identify a given abstract_type as a user_type_impl. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:07 +02:00
Duarte Nunes	ddb4a4b29b	udt: Implement as_cql3_type for user_type_impl Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	35a88b5d49	udt: Complete create_type_statement Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	d1f215b743	udt: Merge user defined type mutations This patch implements the merge_types() function, allowing mutations to user defined types to be applied. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	fdddcfb3ea	udt: Fix user type compatibility check A new user type is checked for compatibility against the previous version of that type, so as to ensure that an updated field type is compatible with the previous field type (e.g., altering a field type from text to blob is allowed, but not the other way around). However, it is also possible to add new fields to a user type. So, when comparing a user type against its previous version, we should also allow the current, new type to be longer than the previous one. The current code instead allows for the previous type to be longer, which this patch fixes. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	eae7f10906	map_difference: Allow on unordered_map This patch changes the map_difference interface so difference() can be called on unordered_maps. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	7dc895e63d	types: Add operator== for abstract_types This patch allows abstract_types to be compared for equality. In particular, it enables the indirect_equal_to<abstract_type> idiom. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	0aeb4dcaaf	udt: Implement equals() for user_type_impl Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	d6d29f7c52	schema: Replace ad hoc func with indirect_equal_to Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	08a7bba4ed	udt: Announce UDT migrations This patch defines the member functions responsible for announcing create, update and drop migrations for user defined types. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	dd75fe8ec0	udt: Add mutations for user defined types This patch implements mutations for user defined types. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	37a1547971	udt: Add migration notifications This patch adds migration notifications for user defined types. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	c2e3e918e8	udt: Take name by ref when querying for an UDT ..so as not to incur a copy. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	2c15778fe0	udt: Remove user_types field from keyspace This field is superfluous and adds confusion regarding the user_types field in the keyspace metadata. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	c7b3a4b144	udt: Parse user types system table This patch loads and parses the user types system table during bootstrap. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Duarte Nunes	f8d8dbdeb7	types: Don't wrap tombstone in an std::optional All the callers of do_serialize_mutation_form pass a valid tombstone that is converted into a non-empty optional. This happens even if the tombstone is empty (tombstone::timestamp == api::missing_timestamp). This patch fixes this by passing in a reference to the tombstone which is convertible to bool, based on whether it is empty or not. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1460620528-3628-1-git-send-email-duarte@scylladb.com>	2016-04-20 09:22:01 +02:00
Duarte Nunes	40c1b29701	cql3: Implement contains relation Although it doesn't work in the absence of secondary indexes, now we provide the same error messages as origin when trying to use the contains relation. Fixes #1158 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1461088626-26958-1-git-send-email-duarte@scylladb.com>	2016-04-20 09:22:25 +03:00
Pekka Enberg	4ed702f0da	Merge "Authorizer support" from Calle "Conversion/implementation of "authorizer" code from origin, handling permissions management for users/resources. Default implementation keeps mapping of <user.resource>->{permissions} in a table, contents of which is cached for slightly quicker checks. Adds access control to all (existing) cql statements. Adds access management support to the CQL impl. (GRANT/REVOKE/LIST) Verified manually and with dtest auth_test.py. Note that several of these still fail due to (unrelated) unimplemented features, like index, types etc. Fixes #1138"	2016-04-19 15:00:38 +03:00
Calle Wilund	4c246b5cc3	scylla.yaml: Move authorizer/authenticator options to supported section	2016-04-19 11:49:06 +00:00
Calle Wilund	9ed25a970e	Cql.g: Permission statements parsing	2016-04-19 11:49:06 +00:00
Calle Wilund	3b101c6e19	cql3::statements::drop_user_statement: Drop all permissions for user	2016-04-19 11:49:06 +00:00
Calle Wilund	14cc47d8b9	cql3::statements::revoke_statement: Initial conversion	2016-04-19 11:49:06 +00:00
Calle Wilund	4e1ef3c1bc	cql3::statements::grant_statement: Initial conversion	2016-04-19 11:49:05 +00:00
Calle Wilund	04c37def3a	cql3::statements::list_permissions_statement: Initial conversion	2016-04-19 11:49:05 +00:00
Calle Wilund	fe23447f6f	cql3::statements::permission_altering_statement: Initial conversion Alter permission base type	2016-04-19 11:49:05 +00:00
Calle Wilund	add2111c0a	cql3::statements::authorizarion_statement: Initial conversion Auth cql base type	2016-04-19 11:49:05 +00:00
Calle Wilund	3906dc9f0d	cql3::statements: Change check_access to future<> + implement	2016-04-19 11:49:05 +00:00
Calle Wilund	dac6cf69eb	service::client_state: Add authorization checkers	2016-04-19 11:49:05 +00:00
Calle Wilund	072acc68da	validation: Add KS validation + convenience methods Looking up local db.	2016-04-19 11:49:05 +00:00
Calle Wilund	a7e1af1c06	db::config: Add permissions cache entries/mark auth/perm as used	2016-04-19 11:49:05 +00:00
Calle Wilund	36bb40c205	auth::auth: Add authorizer initialization + permissions getter Create and init authorizer object on start. Create thread local permissions cache to front end the actual authorizer.	2016-04-19 11:49:05 +00:00
Calle Wilund	03568d0325	tests::cql_test_env: Fake logged in user in case test requires it.	2016-04-19 11:49:05 +00:00
Calle Wilund	ead1c882f8	utils::loading_cache: Version of the LoadingCache type used in origin Simple, expiring, cache of potentially limited number of entries.	2016-04-19 11:49:05 +00:00
Calle Wilund	956ee87e12	auth::authenticator: Change "protected_resources" to return reference It is an immutable static value anyway.	2016-04-19 11:49:05 +00:00
Calle Wilund	1f0bbf2d9a	auth::authorizer: Initial conversion Main authorization endpoint. Default (and only) real authorizer keeps a mapping resource -> permission sets in system table	2016-04-19 11:49:04 +00:00
Benoît Canet	e17795d2dd	scylla_dev_mode_setup: Unify --developer-mode prompt and parsing Fixes: #1194 Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1461002978-5379-2-git-send-email-benoit@scylladb.com>	2016-04-19 09:38:03 +03:00
Takuya ASADA	f6252be0c1	utils: fix compilation error on utils/exceptions.hh It is not able to find std::system_error due to a missing header. Fixes #1202 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1461006884-28316-1-git-send-email-syuu@scylladb.com>	2016-04-19 09:37:31 +03:00
Raphael S. Carvalho	bf03cd1ea6	sstables: kill unused code from size tiered strategy Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <485b1e49419cb052218ab4558f27270ce3bd03b4.1460761821.git.raphaelsc@scylladb.com>	2016-04-19 08:46:06 +03:00
Raphael S. Carvalho	29db5f5e1f	sstables: move compaction strategy code to a new source file Moving compaction strategy code from sstables/compaction.cc to sstables/compaction_strategy.cc That improves readability. Strategy code should be separated from the generic compaction code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <5af6fc8f7321351a071fc0ce03c80ffea21f8396.1460761821.git.raphaelsc@scylladb.com>	2016-04-19 08:45:43 +03:00
Pekka Enberg	a3a0404d33	Merge seastar upstream * seastar 2185f37...2b3c363 (1): > net/tls: Fix compilation with older GnuTLS versions	2016-04-19 08:43:36 +03:00
Calle Wilund	c446fe50e6	tuple_hash: Add convenience operator for two arguments (non-pair)	2016-04-18 13:51:15 +00:00
Calle Wilund	f0d2efd206	data_value: Add constructor from unordered_set<>	2016-04-18 13:51:15 +00:00
Calle Wilund	690c7207fe	cql3::untyped_result_set: Add get_set<> method Gets a value as a, you guessed it, set.	2016-04-18 13:51:15 +00:00
Calle Wilund	443af44f24	log: Add output operator for std::exception&/std::system_error&	2016-04-18 13:51:15 +00:00
Calle Wilund	ca7d339110	auth::authenticated_user: Add copy/move constructors	2016-04-18 13:51:15 +00:00
Calle Wilund	d3a9650646	auth::permission_set: Add < operator	2016-04-18 13:51:15 +00:00
Calle Wilund	c93d114949	auth::permission: Add stringizers + move sets into namespace	2016-04-18 13:51:15 +00:00
Calle Wilund	6e09920f93	auth::data_resource: Fix to_string to match origin	2016-04-18 13:51:15 +00:00
Calle Wilund	bb96e5bd66	auth::data_resource: Move declaration of "resource_ids"	2016-04-18 13:51:15 +00:00
Takuya ASADA	2eb91421eb	dist/ami: Show correct login message when scylla-ami-setup.service is still running While scylla-ami-setup.service is running, the login message says "run systemctl status scylla-server" to see status, but scylla-server has not actually been launched yet. This patch fixes the message to note that RAID construction is running, and that 'systemctl status scylla-ami-setup' is the correct way to see status. Fixes #1035 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1460660628-10103-2-git-send-email-syuu@scylladb.com>	2016-04-18 15:22:02 +03:00
Takuya ASADA	07a6057c03	dist/ami: fix incorrect service name on .bash_profile Ubuntu's service name on .bash_profile is incorrect, fix it. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1460660628-10103-1-git-send-email-syuu@scylladb.com>	2016-04-18 15:21:48 +03:00
Tomasz Grabiec	45527fcffa	Merge branch 'glommer/issue-1144-v5' From Glauber: There are currently some outstanding issues with the throttling code. It's easier to see them with the streaming code, but at least one of them is general. One of them is related to situations in which the amount of memory available leaves only one memtable fitting in memory. That would only happen with the general code if we set the memtable cleanup threshold to 100 % - and I don't even know if it is valid - but will happen quite often with the streaming code. If that happens, we'll start throttling when that memtable is being written, but won't be able to put anything else in its place - leading to unnecessary throttling. The second, and more serious, happens when we start throttling and the amount of available memory is not at least 1MB. This can deadlock the database in the sense that it will prevent any request from continuing, and in turn causing a flush due to memtable size. It is a good practice anyway to always guarantee progress. Fixes #1144	2016-04-18 12:20:13 +02:00
Gleb Natapov	f3b515052b	udt: fix error generation if accessed type is not udt Fixes #1198 Message-Id: <1460884314-3717-2-git-send-email-gleb@scylladb.com>	2016-04-18 12:45:03 +03:00
Duarte Nunes	ece89069dd	udt: Implement to_string() for selectable Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1460884314-3717-1-git-send-email-gleb@scylladb.com>	2016-04-18 12:44:48 +03:00
Pekka Enberg	edf7f098e2	Merge "Fix query of collection cell with all items deleted" from Tomek	2016-04-18 11:01:24 +03:00
Tomasz Grabiec	2e08d0f698	Merge branch 'dev/gleb/logging' Logging improvements from Gleb.	2016-04-15 19:03:44 +02:00
Tomasz Grabiec	89bc32b020	tests: Add test for query of collection with deleted item	2016-04-15 18:14:05 +02:00
Tomasz Grabiec	c69d0a8e87	mutation_partition: Fix collection emptiness check Broken by `f15c380a4f`. This resulted in empty collection being returned in the results instead of no collection. Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest from cassandra-unit-tests.	2016-04-15 18:14:05 +02:00
Tomasz Grabiec	b0d4782016	types: Add default argument values to is_any_live()	2016-04-15 18:14:05 +02:00
Avi Kivity	0de32ab120	Merge seastar upstream * seastar 2aeb9dd...2185f37 (15): > reactor: avoid issuing systemwide memory barriers in parallel > Revert "Use sys_membarrier() when available" > Merge "Various exception-safety fixes" from Tomasz > future-util: make map reduce exception safe > collectd: do not give up after a failure > future-util: make repeat_until_value exception safe > rpc: do not block connection when unknown verbs is received > rpc: do not wait for a reply after timeout > rpc: move connection stats to base class > core/reactor: Handle io_submit failures inside flush_pending_aio > apps/iotune: add --fs-check option to use iotune for kernel version check > Merge "Some exception safety patches" from Paweł > tls: Fix conversion of dh_params::level to gnutls_sec_param_t > core: posix_thread: Mark start_routine as noexcept > fair_queue: better overflow protection	2016-04-15 16:06:53 +03:00
Pekka Enberg	3f2286d02e	Merge "Delete compacted sstables atomically" from Avi "If we compact sstables A, B into a new sstable C we must either delete both A and B, or none of them. This is because a tombstone in B may delete data in A, and during compaction, both the tombstone and the data are removed. If only B is deleted, then the data gets resurrected. Non-atomic deletion occurs because the filesystem does not support atomic deletion of multiple files; but the window for that is small and is not addressed in this patchset. Another case is when A is shared across multiple shards (as is the case when changing shard count, or migrating from existing Cassandra sstables). This case is covered by this patchset. Fixes #1181."	2016-04-14 22:04:15 +03:00
Glauber Costa	9c87ae3496	throttle: always release at least one request if we are below the limit Our current throttling code releases one request per 1MB of available memory. If we are below the memory limit, but not by 1MB or more, then we will keep getting to unthrottle, but never really do anything. If another memtable is close to the flushing point, those requests may be exactly the ones that would make it flush. Without them, we'll freeze the database. In general, we need to always release at least one request to make sure that progress is always achieved. This fixes #1144 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 13:13:15 -04:00
Gleb Natapov	9801d69d53	storage_proxy: add query result row count to brief format Report number of rows in brief reporting format, but only if we can count them without linearizing result's buffer.	2016-04-14 19:26:00 +03:00
Gleb Natapov	53993527ed	storage_proxy: move verbose query result printing into separate logger If query result is large tracing cannot be done since printing the result takes too much time and space.	2016-04-14 19:26:00 +03:00
Gleb Natapov	46e5d05220	storage_proxy: cleanup query logging. Since commit `c1cffd06` the logger catches errors internally, so there is no need to catch most of them at the top level. Only those that can happen during parameter evaluation can reach here. Change parameters to not throw too.	2016-04-14 19:26:00 +03:00
Gleb Natapov	15ebe5e4e5	query: add calculate_row_count function to query::result	2016-04-14 19:26:00 +03:00
Gleb Natapov	f47b2dad18	query: add lazy printer to query::result Transforming query::result to printable form is a very heavy operation that allocates memory and thus can fail. Add a class to query::result that can be used with the logger to defer the to-string conversion until output is performed.	2016-04-14 19:26:00 +03:00
Glauber Costa	2c5dfe08c1	memtable_list: make sure at least two memtables are available This is usually not a problem for the main memtable list - although it can be, depending on settings, but shows up easily for the streaming memtables list. We would like to have at least two memtables, even if we have to cut it short. If we don't do that, one memtable will use all available memory and we'll force throttling until the memtable gets totally flushed. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 12:12:50 -04:00
Glauber Costa	1daede7396	unnest throttle_state throttle_state is currently a nested member of database, but there is no particular reason - aside from the fact that it is currently only ever referenced by the database for us to do so. We'll soon want to have some interaction between this and the column family, to allow us to flush during throttle. To make that easier, let's unnest it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 12:12:50 -04:00
Glauber Costa	39def369ce	move information about memtables' region group inside memtable list This is a preparation patch so we can move the throttling infrastructure inside the memtable_list. To do that, the region group will have to be passed to the throttler so let's just go ahead and store it. In consequence of that, all that the CF has to tell us is what is the current schema - no longer how to create a new memtable. Also, with a new parameter to be passed to the memtable_list the creation code gets quite big and hard to follow. So let's move the creation functions to a helper. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 12:12:50 -04:00
Avi Kivity	a843aea547	db: delete compacted sstables atomically If sstables A, B are compacted, A and B must be deleted atomically. Otherwise, if A has data that is covered by a tombstone in B, and that tombstone is deleted, and if B is deleted while A is not, then the data in A is resurrected. Fixes #1181.	2016-04-14 17:14:26 +03:00
Avi Kivity	3798d04ae8	sstables: convert sstable::mark_for_deletion() to atomic deletion infrastructure All deletions must go through the same data structure, or some atomic deletions will never be satisfied.	2016-04-14 17:14:26 +03:00
Avi Kivity	e43dbac836	main: cancel pending atomic deletions on shutdown A shared sstable must be compacted by all shards before it can be deleted. Since we're stopping, that's not going to happen. Cancel those pending deletions to let anyone waiting on them to continue.	2016-04-14 17:14:26 +03:00
Avi Kivity	2ba584db8d	sstables: add delete_atomically(), for atomically deleting multiple sstables When we compact a set of sstables, we have to remove the set atomically, otherwise we can resurrect data if the following happens: insert data to sstable A insert tombstone to sstable B compact A+B -> C (removing both data and tombstone) delete B only read data from A Since an sstable may be shared by multiple shards, and each shard performs compaction at a different time, we need to defer deletion of an sstable set until all shards agree that the set can be deleted. An additional atomicity issue exists because posix does not provide a way to atomically delete multiple files. This issue is not addressed by this patch.	2016-04-14 17:14:26 +03:00
Pekka Enberg	a1a9294d8c	Merge "Support nodetool removenode force and status" from Asias "With this series, we support all the 3 nodetool removenode commands, e.g., $ nodetool removenode 778948bf-6709-4eb5-80fe-bee911e9c3bf $ nodetool removenode status RemovalStatus: Removing token (-8969872965815280276). Waiting for replication confirmation from [127.0.0.3,127.0.0.1]. $ nodetool removenode force RemovalStatus: No token removals in process. Tested with: 1) - start 3 nodes - inject data with cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)' - kill -9 node2 - wait for node2 to be in DOWN state - run nodetool removenode host2_host_id on node1 2) - start 3 nodes - inject data with cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)' - kill -9 node2 - wait for node2 to be in DOWN state - run nodetool removenode host2_host_id on node1 - kill -9 node3 - nodetool removenode will wait forever since node3 is gone, node3 will never send the replication confirmation to node1 - run nodetool removenode force on node1 nodetool removenode completes with the following error: $ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e nodetool: Scylla API server HTTP POST to URL '/storage_service/remove_node' failed: nodetool removenode force is called by user nodetool removenode force completes successfully $ nodetool removenode force RemovalStatus: Removing token (-9171569494049085776). Waiting for replication confirmation from [127.0.0.3,127.0.0.1]. Fixes #1135."	2016-04-14 15:44:33 +03:00
Pekka Enberg	144d1e3216	dist/docker/redhat: Start up JMX proxy and include tools Make the Docker image more user-friendly by starting up JMX proxy in the background and install Scylla tools in the image. Also add a welcome banner like we have with our AMI so that users have pointers to nodetool and cqlsh, as well as our documentation. Message-Id: <1460376059-3678-1-git-send-email-penberg@scylladb.com>	2016-04-14 15:41:21 +03:00
Pekka Enberg	355c3ea331	dist/docker/redhat: Make sure image builds against latest Scylla Use "yum clean expire-cache" to make sure we build against the latest Scylla release. Message-Id: <1460374418-27315-1-git-send-email-penberg@scylladb.com>	2016-04-14 15:41:10 +03:00
Gleb Natapov	6f13715f8c	storage_proxy: add logging to read executor creation path Message-Id: <1460549369-29523-4-git-send-email-gleb@scylladb.com>	2016-04-14 14:58:02 +03:00
Gleb Natapov	14ecadb247	storage_proxy: add logging for mutation write path Message-Id: <1460549369-29523-3-git-send-email-gleb@scylladb.com>	2016-04-14 14:57:29 +03:00
Gleb Natapov	dbb1217896	cl: enable logging for insufficient LOCAL_QUORUM consistency Message-Id: <1460549369-29523-2-git-send-email-gleb@scylladb.com>	2016-04-14 14:56:58 +03:00
Gleb Natapov	dfdbb1e703	storage_proxy: move hack to make coordinator most preferable node for read into sorting function This is kind of sorting, so it belongs there, but it also fixes a bug in storage_proxy::get_read_executor() that assumes filter_for_query() does not change the order of nodes in all_nodes when an extra replica is chosen. Otherwise if the coordinator ip happens to be last in all_nodes then it will be chosen as the extra replica and will be queried twice. Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>	2016-04-14 14:56:21 +03:00
Duarte Nunes	73e3b5ac5d	udt: Fix user type compatibility check A new user type is checked for compatibility against the previous version of that type, so as to ensure that an updated field type is compatible with the previous field type (e.g., altering a field type from text to blob is allowed, but not the other way around). However, it is also possible to add new fields to a user type. So, when comparing a user type against its previous version, we should also allow the current, new type to be longer than the previous one. The current code instead allows for the previous type to be longer, which this patch fixes. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1460627939-11376-12-git-send-email-duarte@scylladb.com>	2016-04-14 13:30:37 +03:00
Takuya ASADA	f98997120a	dist: #!/bin/bash for all scripts We chose #!/bin/sh for the shebang when we started to implement installer scripts, not bash. After we started to work on Ubuntu, we found that we had mistakenly used bash syntax in the AMI script, which caused an error since /bin/sh is dash on Ubuntu. So we changed the shebang to /bin/bash for that script, and since then we have had both sh scripts and bash scripts. (`2f39e2e269`) If we use bash syntax in sh scripts, it won't work on Ubuntu but works on Fedora/CentOS, which can easily be confusing. So switch all scripts to #!/bin/bash. It will be much safer. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1460594643-30666-1-git-send-email-syuu@scylladb.com>	2016-04-14 12:01:28 +03:00
Pekka Enberg	60352f810a	Merge "Fixes for the reading of missing Summary" from Glauber "This patchset contains some fixes spotted during post-merged review by {Nad,}av{,i}. I don't consider any of them a must for backport to 1.0, but since we haven't yet even backported the main series, might as well backport everything. It also includes some unit tests to make sure that they will be kept working in the future."	2016-04-13 11:32:05 +03:00
Raphael S. Carvalho	beaacbda2e	tests: test that leveled strategy was fixed L1 wasn't being compacted into L2. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <1a357896a448eafa7da4d28bc56fa02b89d4193e.1460508373.git.raphaelsc@scylladb.com>	2016-04-13 11:14:28 +03:00
Raphael S. Carvalho	c7b728e716	sstables: Fix leveled compaction strategy There is a problem in the implementation of leveled compaction strategy that prevents level 1 from being compacted into level 2, and so forth. As a result, all sstables will only belong to either level 0 or 1. One of the consequences is level 1 being overwhelmed by a huge amount of sstables. The root of the problem is a conditional statement in the code that prevents a single sstable, with level > 0, from being compacted into a subsequent level that is empty or has no overlapping sstables. Fixes #1180. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>	2016-04-13 11:14:14 +03:00
Asias He	1e84699a64	api: Wire up storage_service removal_status and force_remove_completion They are used by nodetool removenode: $ nodetool removenode force $ nodetool removenode status For example: $ nodetool removenode status RemovalStatus: Removing token (-8969872965815280276). Waiting for replication confirmation from [127.0.0.3,127.0.0.1]. $ nodetool removenode force RemovalStatus: No token removals in process. Tested with: 1) - start 3 nodes - inject data with cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)' - kill -9 node2 - wait for node2 to be in DOWN state - run nodetool removenode host2_host_id on node1 2) - start 3 nodes - inject data with cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)' - kill -9 node2 - wait for node2 to be in DOWN state - run nodetool removenode host2_host_id on node1 - kill -9 node3 - nodetool removenode will wait forever since node3 is gone, node3 will never send the replication confirmation to node1 - run nodetool removenode force on node1 nodetool removenode completes with the following error: $ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e nodetool: Scylla API server HTTP POST to URL '/storage_service/remove_node' failed: nodetool removenode force is called by user nodetool removenode force completes successfully $ nodetool removenode force RemovalStatus: Removing token (-9171569494049085776). Waiting for replication confirmation from [127.0.0.3,127.0.0.1]. Fixes 1135.	2016-04-13 14:53:28 +08:00
Asias He	891e947314	storage_service: Rename remove_node to removenode nodetool uses removenode command to remove a node. Rename the implementation in storage_service to match the command.	2016-04-13 14:53:28 +08:00
Asias He	9ffb95216d	storage_service: Add force_remove_completion It is needed by the $ nodetool removenode force command.	2016-04-13 14:53:28 +08:00
Asias He	7c7e5967f6	storage_service: Add get_removal_status It is needed by the $ nodetool removenode status command.	2016-04-13 14:53:28 +08:00
Asias He	8d7cd07d6c	storage_service: Add print info in confirm_replication The message is rare but it is very useful to debug removenode operation.	2016-04-13 14:53:28 +08:00
Asias He	ffe91b5755	token_metadata: Do not assert in get_host_id Throw an exception instead of assert.	2016-04-13 14:53:27 +08:00
Raphael S. Carvalho	c28d168619	sstables: allow user to specify max sstable size with leveled strategy This change will allow user to specify the maximum size of a new sstable created as a result of leveled compaction. Example of using this setting: ALTER TABLE ks.test5 with compaction = {'sstable_size_in_mb': '1000', 'class': 'LeveledCompactionStrategy'} Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ebb9844401af74388bda12586c2435283f6d8db8.1460486043.git.raphaelsc@scylladb.com>	2016-04-13 09:13:33 +03:00
Raphael S. Carvalho	15246f31f7	sstables: fix incorrect sstable size when compression is enabled Size of uncompressed sstable was being unconditionally used to determine when to stop writing a table. When compression is enabled, compressed size should be used instead. Problem affected Scylla when compression and leveled strategy were used. Fixes #1177. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d9bf26def41fb33ca297f4127ce042b7f67adf96.1460484529.git.raphaelsc@scylladb.com>	2016-04-13 09:01:01 +03:00
Glauber Costa	60ab3b3f50	sstable_tests: make sure the generation of the Summary is sane When we recreate the summary from a missing Summary, we should make sure it is generated sanely, and that it resembles the Summary that would have otherwise been there. In these tests we'll grab one of the Summary tests we've been doing, and just apply them to the non-existent Summary file. We expect the same results in those cases. Plus, a new test is added with some sanity checking. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	114ba5e3a8	be robust against broken summary files Now that we can boot without a Summary file, we can just as easily boot with a broken one. Suggested by Nadav, and it is actually very easy to do, so do it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	72dc45999d	review fixes for generate_summary Spotted by Avi post-merge 1) Need to close the file 2) Should be using the parameter pc instead of the default_class Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	f78f43850d	clear components if reading toc fail This shouldn't be a problem in practice, because if read_toc() fails, the users will just tend to discard the sstable object altogether, and not insist on using it. However, if somebody does try to keep using it, a subsequent read_toc() could theoretically have some components filled up leading the new reader to believe the toc was populated successfully. It is easier to just clear the _components set and never worry about it, than trying to reason about whether or not that could happen. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Glauber Costa	0f41ef1b84	index_reader: avoid misleading parent name Also add comments about the expected signature of IndexConsumer Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:15:11 -04:00
Takuya ASADA	1eebe8bce1	dist: Support systemd for Ubuntu 15.10 To share the systemd unit file between Fedora/CentOS and Ubuntu, generate the systemd unit file at build time, since Fedora/CentOS and Ubuntu have sysconfdir in different places. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459779957-11007-1-git-send-email-syuu@scylladb.com>	2016-04-12 14:39:26 +03:00
Avi Kivity	715794cce6	sstables: filter sstables single-row read using first_key/last_key Using leveled compaction strategy, only a few sstables will contain a given key, so we need to filter out the rest. Using the summary entries to filter keys works if the key is before the first summary entry, but does not work if it is after the last summary entry, because the last summary entry does not represent the last key; so sstables that are towards the beginning of the ring are read even if they do not contain the key, greatly reducing read performance. Fix by consulting the summary's first_key/last_key entries before consulting the summary entry array.	2016-04-12 10:33:17 +03:00
Pekka Enberg	64c9ebb962	Merge "More exception safety fixes" from Paweł "This is the second part of exception safety fixes for issues discovered using memory allocation failure injector."	2016-04-12 08:08:00 +03:00
Paweł Dziepak	d53354947c	storage_proxy: mark hint_to_dead_endpoints() noexcept Hints are currently unimplemented but there is code depending on the fact that hint_to_dead_endpoints() doesn't throw. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-12 00:06:10 +01:00
Paweł Dziepak	209b373412	exceptions: make exception constructors noexcept Some of the exceptions are not thrown but constructed and set to some future. In such a case, if another exception is thrown in the constructor, it won't be propagated properly, as it will cause the stack to be unwound in the place where the future is set, not in the continuation chain waiting for it. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-12 00:06:02 +01:00
Paweł Dziepak	b00a3a76cc	transport: ignore errors during connection shutdown If the other end of the connection has already disconnected the shutdown will fail with ENOTCONN. The resulting exception is going to propagate through the continuation chain that is supposed to shut the cql server down preventing it from properly waiting for all outstanding continuations. The solution is to just ignore any errors that shutdown() may return. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:54:47 +01:00
Paweł Dziepak	0d3d0a3c08	gossiper: handle failures in gossiper thread creation seastar::async() creates a seastar thread and to do that allocates memory. That allocation, obviously, may fail so the error handling code needs to be moved so that it also catches errors from thread creation. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:54:47 +01:00
Paweł Dziepak	c1cffd0639	log: try to report logger failure Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:54:47 +01:00
Paweł Dziepak	b75c4098f2	storage_proxy: catch all errors in abstract_read_executor::execute() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:52:13 +01:00
Paweł Dziepak	9cd3da496e	transport: retry do_accept() in case of bad_alloc Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:52:13 +01:00
Paweł Dziepak	2db70cf912	database: remove throw() specifiers Most of them are missing std::bad_alloc (which leads to aborts) and they force the compiler to add unnecessary runtime checks. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:52:13 +01:00
Gleb Natapov	3734dcbace	storage_proxy: cleanup data_read_resolver::resolve() live_row_count is summed several times in the same function. Do it only once. -- v1->v2: - call get() on std::reference_wrapper<std::vector<partition>> to get to reference for moving out of it. Message-Id: <20160411123829.GE21479@scylladb.com>	2016-04-11 17:13:48 +02:00
Pekka Enberg	7af46e41e5	Merge "CQL authentication implementation" from Calle "Adds support for CQL commands to create, alter, drop and list users. Verified manually and by relevant dtests. With this patch set, scylla supports adding super/regular users and run sessions logged in as these. Note however that since actual authorization is still not implemented, no CF/KS is really protected by the authentication beyond initial login. Some fixes for lingering bugs in user management in the existing code as well. Fixes #1121"	2016-04-11 12:57:00 +03:00
Pekka Enberg	4e04805352	cql3: Make lexer and parser error messages compatible with Cassandra The default recognition error messages in antlr C++ backend are different from Java backend which makes Scylla's CQL error messages incompatible with Cassandra. This makes it very hard to write CQL level test cases which are portable between Scylla and Cassandra. To fix the issue, override the most common lexer and parser error messages to follow the convention set by the antlr Java backend. This unlocks various test cases in AlterTest, for example. Message-Id: <1460032883-14422-1-git-send-email-penberg@scylladb.com>	2016-04-11 12:35:53 +03:00
Calle Wilund	ceac4df164	Cql.g: Add create/drop/alter/list user parsing	2016-04-11 09:10:41 +00:00
Calle Wilund	b8bd77e621	cql3::list_users_statement: Initial conversion	2016-04-11 09:10:41 +00:00
Calle Wilund	adaf21403b	cql3::drop_user_statement: Initial conversion	2016-04-11 09:10:41 +00:00
Calle Wilund	8732b3eed7	cql3::alter_user_statement: Initial conversion	2016-04-11 09:10:41 +00:00
Calle Wilund	da89189308	cql3::create_user_statement: Initial conversion	2016-04-11 09:10:41 +00:00
Calle Wilund	57f5bb854f	cql3::authentication_statement: cql auth base class	2016-04-11 09:10:41 +00:00
Calle Wilund	cef52d1653	cql3::user_options: Add options wrapper type	2016-04-11 09:10:41 +00:00
Calle Wilund	7ebac35779	client_state: break up setting login/validation transport::server uses client_state in a move-temporary-around fashion. Having a setter that does continuation-bound validation makes this messier. Break them up to separate "this" placement from the actual validation continuation logic	2016-04-11 09:10:41 +00:00
Calle Wilund	83e2604bc6	client_state: Propagate login user in merge	2016-04-11 09:10:41 +00:00
Calle Wilund	3daf768a82	client_state : Add ensure_not_anonymous method	2016-04-11 09:10:41 +00:00
Calle Wilund	1d7930c4bd	authenticated_user: implement "is_super" Which also, unfortunately, must be a continuation. (Queries tables)	2016-04-11 09:10:41 +00:00
Calle Wilund	d9b176307f	auth::authenticator: option<->string	2016-04-11 09:10:41 +00:00
Raphael S. Carvalho	8fe7524e46	sstables: enable leveled strategy feature to prevent L0 from falling behind If level 0 falls behind, size tiered strategy is used on it to reduce overhead until we can catch up on the higher levels. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <17bf15b7d12cd5dc652cc92939c0c68f921662a2.1459976469.git.raphaelsc@scylladb.com>	2016-04-11 11:52:00 +03:00
Nadav Har'El	92ef11ffaa	stables_mutation_test: more compare keys not representations Commit `0fc4c36952` missed one place where keys were compared using their byte representation. Fix that. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459778074-10759-4-git-send-email-nyh@scylladb.com>	2016-04-11 11:36:17 +03:00
Nadav Har'El	9f9353ae5b	sstable_mutation_test: another test for range tombstone merging This is even a more elaborate tombstone merging unit test, with 3 levels of nesting, which did not pass with older range-tombstone merging algorithms, and works with the current one. I started with deletion of three nested levels of row - aaa, aaa:bbb, and aaa:bbb::ccc. I then complicated the sstable even further by adding additional middle-points with the same timestamps (which we saw happening in some real-life sstables), resulting in: [ {"key": "pk", "cells": [["aaa:_","aaa:bba:_",1459438519943668,"t",1459438519], ["aaa:bba:_","aaa:bbb:_",1459438519943668,"t",1459438519], ["aaa:bbb:_","aaa:bbb:ccb:_",1459438519950348,"t",1459438519], ["aaa:bbb:ccb:_","aaa:bbb:ccc:_",1459438519950348,"t",1459438519], ["aaa:bbb:ccc:_","aaa:bbb:ccc:!",1459438519958850,"t",1459438519], ["aaa:bbb:ccc:!","aaa:bbb:ddd:!",1459438519950348,"t",1459438519], ["aaa:bbb:ddd:!","aaa:bbb:!",1459438519950348,"t",1459438519], ["aaa:bbb:!","aaa:!",1459438519943668,"t",1459438519]]} ] Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459778074-10759-3-git-send-email-nyh@scylladb.com>	2016-04-11 11:35:59 +03:00
Nadav Har'El	77a793048e	sstable_mutation_test: strengthen tombstone_merging test In the tombstone_merging test, we expected one row tombstone. But we did not verify that in addition to that row tombstone, there are no other rows (deleted or otherwise). It turns out that in the old merging algorithm, we did produce additional deleted rows which shouldn't have been there. So this patch adds a test that there are no such additional deleted rows beyond the one row tombstone we expect. The test passes with the new range tombstone merging algorithm. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459778074-10759-2-git-send-email-nyh@scylladb.com>	2016-04-11 11:35:46 +03:00
Nadav Har'El	818f14f444	stable: overhaul (again) range tombstone merging In commit `99ecda3c96`, we overhauled the way we read Cassandra's disjoint range tombstones, and convert them to the overlapping whole-prefix tombstones which we support. Unfortunately, while this algorithm worked correctly for a couple of test cases, it did not for additional test cases. While the previous algorithm could not generate "wrong" tombstones (it didn't generate things it didn't see), it could generate redundant overlapping tombstones, and missed some sanity checks about the correctness of the merge process. In this patch, a new algorithm makes sure to not generate redundant tombstones, and includes additional tests to ensure that we do not mistakenly merge range tombstones which cannot actually be merged. The following patches will include tests which failed with the previous algorithm, and succeed with this one. I described the new algorithm on the ScyllaDB mailing list this way: 1. Have a stack of open ranges, start & timestamp for each (no end for each), and just one "end of last contiguous deletion" Processing each range tombstone: 2. If the start of a range tombstone is not adjacent to the "end of last deletion", assert we have no open range on the stack (because we can never close those). In any case, set the "end of last deletion" to the end of this tombstone. 3. If the current tombstone's timestamp is STRICTLY HIGHER than that on the top of the stack, push the new tombstone's start+timestamp to the stack. Note: If it was STRICTLY LOWER, throw error (it means the open range will never be closed). 4. If the current tombstone's end matches (i.e., closes row) of the start on the top of the stack, emit this tombstone and pop the stack. When the row ends: 5. Assert the stack is empty. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459778074-10759-1-git-send-email-nyh@scylladb.com>	2016-04-11 11:35:23 +03:00
Avi Kivity	0c7f9917dc	Merge seastar upstream * seastar aa281bd...2aeb9dd (20): > memory: avoid exercising the reclaimers for oversized requests > tests: test cross-cpu free not underflowing live object counter > memory: fix live objects counter underflow due to cross-cpu free > core/reactor: Don't abort in allocate_aligned_buffer() on allocation failure > build: add --tests-debuginfo, to avoid stripping tests > connected_socket: Add buffer size arg to output() > scripts/posix_net_conf.sh: added a support for bonding interfaces > scripts/posix_net_conf.sh: move the NIC configuration code into a separate function > scripts/posix_net_conf.sh: implement the logic for selecting default MQ mode > scripts/posix_net_conf.sh: forward the interface name as a parameter > http/routes: Remove request failure logging to stderr > lowres_clock: Initialize _now when the clock is created > apps/iotune: fix broken URL > tutorial: expand and improve semaphore section > DPDK: support set RSS key to port_conf when hash_key_size is unknown > dpdk: aware of vmxnet3 max xmit frags and do linearizing > packet_util: insert out of order packet when map is empty > core: Fix use-after-free of scollectd::impl > futures: Optimize finally() > futures: Factor out exceptional path of finally()	2016-04-10 18:08:51 +03:00
Pekka Enberg	9b98278436	Merge "Be able to boot without a Summary" from Glauber "Summary files are a relatively recent addition to Cassandra. I thought that every SSTable converted to 2.1 would have them, but that does not seem to be true. It's easy to generate a stream of files that will boot in Cassandra 2.1 just fine, but not in Scylla as they will be missing the Summary. Cassandra can boot those files because they are robust against the Summary not existing, and we should do the same. Since we keep the Summary in memory, in case one does not exist we create a memory copy of it from the Index - the filesystem is not touched. Hopefully, compaction will run soon and the next time we boot we won't have to do such thing. Fixes #1170"	2016-04-09 20:38:57 +03:00
Pekka Enberg	992dab3fcb	Merge "Fixes for mutation querying" from Tomek "Fixes dtest failure of paging_test.py:TestPagingData.static_columns_paging_test"	2016-04-09 09:07:36 +03:00
Glauber Costa	8a50b027aa	summary: generate one if it is not present There are cases in which a Summary file will not be present, and imported SSTables will have just the Index and Data files. In earlier versions of Cassandra, a Summary didn't exist, so one may not be generated when migrating. In Issue #1170, we can see an example of tables generated by CQLSSTableWriter, and they lack a Summary. Cassandra is robust against this and can cope perfectly with the Summary not existing. I will argue that we should do the same. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	4de26fdec8	sstables: allow read_toc to be called more than once We do that by bailing immediately if we detect that the components map is already populated. This allows us to call read_toc() earlier if we need to - for instance, to inquire about the existence of the Summary - without the need to re-read the components again later. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	736e21222e	sstables: avoid passing schema unnecessarily for prepare_summary we can just pass the min interval as a parameter and avoid having the schema do yet another hop. For sealing the summary, it is completely unused and we can do away with it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	0de3a32147	index reader: make index_consumer a template parameter This is done so we can use other consumers. An example of that, is regeneration of the Summary from an existing Index. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	8453ff7788	make get_sstable_key_range an instance method Because just creating an SSTable object does not generate any I/O, get_sstable_key_range should be an instance method. The main advantage of doing that is that we won't have to read the summary twice. The way we're doing it currently, if it happens to be a shard-relevant table we'll call load() - which reads the summary again. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Glauber Costa	6ae601a025	do not re-read the summary There are times in which we read the Summary file twice. That actually happens every time during normal boot (it doesn't during refresh). First during get_sstable_key_range and then again during load(). Every summary will have at least one entry, so we can easily test for whether or not this is properly initialized. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Tomasz Grabiec	3e0c24934b	tests: cql_query_test: Add test for slicing in reverse	2016-04-08 20:53:33 +02:00
Tomasz Grabiec	c2b955d40b	mutation_partition: Fix static row being returned when paginating Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test. Broken by `f15c380a4f`, where the calculation of has_ck_selector got broken, in such a way that present clustering restrictions were treated as if not present, which resulted in static row being returned when it shouldn't. While at it, unify the check between query_compacted() and do_compact() by extracting it to a function.	2016-04-08 20:53:33 +02:00
Tomasz Grabiec	a1539fed95	mutation_partition: Fix reversed trim_rows() The first erase_and_dispose(), which removes rows between last position and beginning of the next range, can invalidate end() iterator of the range. Fix by looking up end after erasing. mutation_partition::range() was split into lower_bound() and upper_bound() to allow for that. This affects for example queries with descending order where the selected clustering range is empty and falls before all rows. Exposed by `f15c380a4f`, which is now calling do_compact() during query. Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test	2016-04-08 20:53:33 +02:00
Avi Kivity	db03295c8a	Merge "Fix query digest mismatch" from Tomasz "Currently data query digest includes cells and tombstones which may have expired or be covered by higher-level tombstones. This causes digest mismatch between replicas if some elements are compacted on one of the nodes and not on others. This mismatch triggers read-repair which doesn't resolve because mutations received by mutation queries are not differing, they are compacted already. The fix adds compacting step before writing and digesting query results by reusing the algorithm used by mutation query. This is not the most optimal way to fix this. The compaction step could be folded with the query writing, there is redundancy in both steps. However such change carries more risk, and thus was postponed. perf_simple_query test (cassandra-stress-like partitions) shows regression from 83k to 77k (7%) ops/s. Fixes #1165."	2016-04-08 12:13:29 +03:00
Pekka Enberg	47a904c0f6	Merge "gossip: Introduce SUPPORTED_FEATURES" from Asias "There is a need to have an ability to detect whether a feature is supported by entire cluster. The way to do it is to advertise feature availability over gossip and then each node will be able to check if all other nodes have a feature in question. The idea is to have new application state SUPPORTED_FEATURES that will contain set of strings, each string holding feature name. This series adds API to do so. The following patch on top of this series demonstrates how to wait for features during boot up. FEATURE1 and FEATURE2 are introduced. We use wait_for_feature_on_all_node to wait for FEATURE1 and FEATURE2 successfully. Since FEATURE3 is not supported, the wait will not succeed and will time out. --- a/service/storage_service.cc +++ b/service/storage_service.cc @@ -95,7 +95,7 @@ sstring storage_service::get_config_supported_features() { // Add features supported by this local node. When a new feature is // introduced in scylla, update it here, e.g., // return sstring("FEATURE1,FEATURE2") - return sstring(""); + return sstring("FEATURE1,FEATURE2"); } std::set<inet_address> get_seeds() { @@ -212,6 +212,11 @@ void storage_service::prepare_to_join() { // gossip snitch infos (local DC and rack) gossip_snitch_info().get(); + gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE1"), sstring("FEATURE2")}, std::chrono::seconds(30)).get(); + logger.info("Wait for FEATURE1 and FEATURE2 done"); + gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE3")}).get(); + logger.info("Wait for FEATURE3 done"); + We can query the supported_features: cqlsh> SELECT supported_features from system.peers; supported_features -------------------- FEATURE1,FEATURE2 FEATURE1,FEATURE2 (2 rows) cqlsh> SELECT supported_features from system.local; supported_features -------------------- FEATURE1,FEATURE2 (1 rows)"	2016-04-08 09:22:50 +03:00
Benoît Canet	7c99ecf16f	scylla_setup: Check if scylla-jmx is installed Signed-off-by: Benoît Canet <benoit@scylladb.com> Fixes #1107 Message-Id: <1460045692-815-1-git-send-email-benoit@scylladb.com>	2016-04-08 09:03:38 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Tomasz Grabiec	474a35ba6b	tests: Add test for query digest calculation	2016-04-07 19:57:19 +02:00
Tomasz Grabiec	4418da77e6	tests: mutation_source: Include random mutations in generate_mutation_sets() result Probably increases coverage.	2016-04-07 19:57:19 +02:00
Tomasz Grabiec	5d768d0681	tests: mutation_test: Move mutation generator to mutation_source_test.hh So that it can be reused.	2016-04-07 19:57:19 +02:00
Tomasz Grabiec	30d25bc47a	tests: mutation_test: Add test case for querying of expired cells	2016-04-07 19:57:19 +02:00
Tomasz Grabiec	58bbd4203f	partition_slice_builder: Add new setters	2016-04-07 19:57:19 +02:00
Tomasz Grabiec	7cd8e61429	tests: result_set_assertions: Add and_only_that()	2016-04-07 19:57:19 +02:00
Tomasz Grabiec	f15c380a4f	database: Compact mutations when executing data queries Currently data query digest includes cells and tombstones which may have expired or be covered by higher-level tombstones. This causes digest mismatch between replicas if some elements are compacted on one of the nodes and not on others. This mismatch triggers read-repair which doesn't resolve because mutations received by mutation queries are not differing, they are compacted already. The fix adds compacting step before writing and digesting query results by reusing the algorithm used by mutation query. This is not the most optimal way to fix this. The compaction step could be folded with the query writing, there is redundancy in both steps. However such change carries more risk, and thus was postponed. perf_simple_query test (cassandra-stress-like partitions) shows regression from 83k to 77k (7%) ops/s. Fixes #1165.	2016-04-07 19:56:58 +02:00
Tomasz Grabiec	e4e8acc946	mutation_query: Extract main part of mutation_query() into more generic querying_reader So that it can be reused in query()	2016-04-07 19:03:04 +02:00
Takuya ASADA	ed7a3beed2	dist/ubuntu: drop unused scripts These were used when we didn't share scripts between CentOS/Fedora and Ubuntu, but they are not used anymore, so drop them. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459408583-13497-1-git-send-email-syuu@scylladb.com>	2016-04-06 08:21:09 +03:00
Asias He	d5dce8016b	storage_service: Advertise supported_features into cluster Advertise features supported by this node, so that other nodes can know this info. For example, on a 3 node cluster with supported_features == FEATURE1 and FEATURE2, it looks like: cqlsh> SELECT supported_features from system.peers; supported_features -------------------- FEATURE1,FEATURE2 FEATURE1,FEATURE2 (2 rows) cqlsh> SELECT supported_features from system.local; supported_features -------------------- FEATURE1,FEATURE2 (1 rows)	2016-04-06 07:12:34 +08:00
Asias He	0e1738943d	storage_service: Add supported_features into system.peers table	2016-04-06 07:12:34 +08:00
Asias He	50bcfe569a	system_keyspace: Add supported_features into system.local table	2016-04-06 07:12:34 +08:00
Asias He	b710a5f9ee	storage_service: Introduce get_config_supported_features It tells features supported by this local node. When new feature is introduced in scylla, update features returned by get_config_supported_features, e.g., return sstring("FEATURE1,FEATURE2")	2016-04-06 07:12:34 +08:00
Asias He	e0a82a1107	gossip: Add supported_features helper in versioned_value Given a supported features sstring, return a versioned_value for it.	2016-04-06 07:12:34 +08:00
Asias He	214c0f72b2	db: Add supported_features column in system.local and system.peers table	2016-04-06 07:12:34 +08:00
Asias He	04e8727793	gossip: Introduce wait_for_feature_on_{all}_node API to wait until features are available on a node or all the nodes in the cluster. $timeout specifies how long we want to wait. If the features are not available yet, sleep 2 seconds and retry.	2016-04-06 07:12:34 +08:00
Asias He	1e437e925c	gossip: Introduce get_supported_features - Get features supported by this particular node std::set<sstring> get_supported_features(inet_address endpoint) const; - Get features supported by all the nodes this node knows about std::set<sstring> get_supported_features() const;	2016-04-06 07:12:34 +08:00
Asias He	a6080773b3	gossip: Add SUPPORTED_FEATURES application_state It is used to negotiate cluster wide features.	2016-04-06 07:12:34 +08:00
Piotr Jastrzebski	d3f91eec61	Implement tuple_type_impl::from_string This is a fix for: https://github.com/scylladb/scylla/issues/574 It mirrors the behavior of: org.apache.cassandra.db.marshal.TupleType.java#fromString Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <24a7d6253727d0faebb1df117c2f52410523d42f.1459843091.git.piotr@scylladb.com>	2016-04-05 16:00:18 +03:00
Vlad Zolotarov	2daaa00c4f	conf: resurrect the important text related to endpoint_snitch configuration commit `d1b44cef1b` removed an important part of a comment related to an 'endpoint_snitch' configuration. This patch puts it back. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1459858934-12005-1-git-send-email-vladz@cloudius-systems.com>	2016-04-05 15:23:13 +03:00
Raphael S. Carvalho	e15ce5eb4d	api: Add support to get column family compression ratio After this change, user can query compression ratio on a per column family basis with 'nodetool cfstats'. look at 'nodetool cfstats' output: ./bin/nodetool cfstats ks.test5 Keyspace: ks Read Count: 0 Read Latency: NaN ms. Write Count: 0 Write Latency: NaN ms. Pending Flushes: 0 Table: test5 SSTable count: 1 Space used (live): 4774 Space used (total): 4774 Space used by snapshots (total): 0 Off heap memory used (total): 131384 SSTable Compression Ratio: 0.833333 ... Fixes #636. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <a1bee5a23fe63787df3e387a88f2d216ba4a4134.1459802771.git.raphaelsc@scylladb.com>	2016-04-05 12:46:40 +03:00
Asias He	d1b44cef1b	conf: Drop duplicated section for endpoint_snitch endpoint_snitch is supported and it is the "Supported Parameters". Remove the duplicated section in "Unsupported parameters". Message-Id: <f8260b72558305f9186c011b8f8f452b3b91339b.1459325982.git.asias@scylladb.com>	2016-04-05 08:48:48 +03:00
Pekka Enberg	32471fcb96	Merge "Do batch log replay in decommission" from Asias "batchlog_manager is modified to allow the storage_service to initiate a batchlog replay operation. Refs #1085. Tested with tests/batchlog_manager_test and batch_test.py"	2016-04-05 08:42:47 +03:00
Gleb Natapov	70575699e4	commitlog, sstables: enlarge XFS extent allocation for large files With big rows I see contention in XFS allocations which cause reactor thread to sleep. Commitlog is a main offender, so enlarge extent to commitlog segment size for big files (commitlog and sstable Data files). Message-Id: <20160404110952.GP20957@scylladb.com>	2016-04-04 14:15:00 +03:00
Amnon Heiman	725231a7a0	api: set the api_doc before registering any api This is a leftover from the reordering of the API init. The api_doc should be set first, so later API registration will enable their relevant swagger doc. Currently, the swagger documentation of the system API is not available. Fixes #1160 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1459750490-15996-1-git-send-email-amnon@scylladb.com>	2016-04-04 11:37:59 +03:00
Avi Kivity	6a3cf4ac41	cql: unlock ALTER TABLE syntax It was marked experimental for 1.0, but will be fully supported in the next release. Message-Id: <1459707946-5860-1-git-send-email-avi@scylladb.com>	2016-04-04 11:36:11 +03:00
Piotr Jastrzebski	613e7d8618	Add more info to wrong RPC address error If listening on RPC address failed then report IP address and port in the error message. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c4db3527df2ce6dccb3b619584ee3fcb1e70ffd1.1459512258.git.piotr@scylladb.com>	2016-04-03 12:57:19 +03:00
Takuya ASADA	cad5edc53b	dist: fix build error at copy symlinks Both build_rpm.sh and build_deb.sh will fail with "cannot stat 'xxx': No such file or directory" when scylla-server package is not installed, need to prevent it by --no-dereference option of cp. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459523585-9108-1-git-send-email-syuu@scylladb.com>	2016-04-03 12:49:55 +03:00
Tomasz Grabiec	0fc4c36952	tests: sstable_mutation_test: Compare keys not representations Representation is opaque at this level of abstraction. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459508193-7086-1-git-send-email-tgrabiec@scylladb.com>	2016-04-03 11:39:03 +03:00
Nadav Har'El	6c4ee49bd3	sstables: another test for range tombstone merging This is another unit test for range tombstone merging, introduced in commit `0fc9a5ee4d` and rewritten in commit `99ecda3c96`. In this test, a single large deletion was broken up into several smaller ranges, all with the same timestamps, so we should recombine them into one row tombstone, instead of failing the read. The sstable in this test case was artificially created using json2sstable. We don't yet know how to produce such a case using Cassandra 2, but we have seen a similar occurrence in the wild, in a real SSTable. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459429243-15821-1-git-send-email-nyh@scylladb.com>	2016-04-01 11:55:14 +02:00
Takuya ASADA	d59c1c7648	dist/redhat: drop very old %pre script These lines are needed for very old version of scylla, not for 1.0. Can be removed now. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459177601-20269-1-git-send-email-syuu@scylladb.com>	2016-04-01 09:41:18 +03:00
Pekka Enberg	b9a1aef670	Merge "Random exception safety fixes" from Paweł "These patches fix some of the problems found by randomly injecting memory allocation failures."	2016-04-01 08:58:00 +03:00
Paweł Dziepak	8f78b8e190	log: ignore logging exceptions Logging is used in many places including those that shouldn't really throw any exceptions (destructors, noexcept functions). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-31 16:43:32 +01:00
Paweł Dziepak	c8159eca52	commitlog: make sure that segment destructor doesn't throw Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-31 16:42:56 +01:00
Paweł Dziepak	3e0555809e	storage_proxy: catch all exceptions in read executor abstract_read_executor::reconcile() is supposed to make sure that _result_promise is eventually set to either a result or an exception. That may not happen however if reconciliation throws any exception since only read timeouts are being caught. When that happens the continuation chain becomes stuck. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-31 16:38:41 +01:00
Paweł Dziepak	3c107c4b05	sstables: remove HyperLogLog throw() specifier HyperLogLog constructor promises that it only throws instances of std::invalid_argument. That's a lie since it also adds elements to a vector (and doesn't catch potential bad_allocs). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-31 16:36:53 +01:00
Avi Kivity	417bcb122d	commitlog: ignore commitlog segments generated by Cassandra-derived tools Cassandra-derived tools (such as sstable2json) may write commitlog segments, that Scylla cannot recognize. Since we now write them with a distinct name, we can recognize the name and ignore these segments, as we know the data they contain is not interesting. Fixes #1112. Message-Id: <1459356904-20699-1-git-send-email-avi@scylladb.com>	2016-03-31 16:01:08 +03:00
Nadav Har'El	78c9f49585	sstables: Move check_marker() to source file The check_marker() function is used as a sanity-check of data we read from sstable, so instead of the header file key.hh, let's move it to the sstable-parsing source file partition.cc. In addition to having less code in header files, another benefit is that the function can now throw a more specific exception (malformed sstable exception). Also fixed the exception's message (which had a second "%d" but only one parameter). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459420430-5968-1-git-send-email-nyh@scylladb.com>	2016-03-31 14:22:51 +03:00
Nadav Har'El	99ecda3c96	sstables: overhaul range tombstone reading Until recently, we believed that range tombstones we read from sstables will always be for entire rows (or more generalized clustering-key prefixes), not for arbitrary ranges. But as we found out, because Cassandra insists that range tombstones do not overlap, it may take two overlapping row tombstones and convert them into three range tombstones which look like general ranges (see the patch for a more detailed example). Not only do we need to accept such "split" range tombstones, we also need to convert them back to our internal representation which, in the above example, involves two overlapping tombstones. This is what this patch does. This patch also contains a test for this case: We created in Cassandra an sstable with two overlapping deletions, and verify that when we read it to Scylla, we get these two overlapping deletions - despite the sstable file actually having contained three non-overlapping tombstones. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>	2016-03-31 12:49:50 +03:00
Pekka Enberg	2629389d5d	dist/docker/ubuntu: Use bash in start-scylla script The default shell in Ubuntu is "dash" which causes the following error when "scylla-start" script is executed: /start-scylla: 8: /start-scylla: source: not found Message-Id: <1459406561-20141-1-git-send-email-penberg@scylladb.com>	2016-03-31 11:21:36 +03:00
Duarte Nunes	26a3461908	cql: Fix antlr3 missing token leak This patch overrides the antlr3 function that allocates the missing tokens that would eventually leak. The override stores these tokens in a vector, ensuring memory is freed whenever the parser is destroyed. Fixes #1147 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1459355146-17402-1-git-send-email-duarte@scylladb.com>	2016-03-31 08:44:45 +03:00
yan cui	6fc29843cd	dist/docker: refine docker file for ubuntu	2016-03-30 18:54:14 +03:00
Duarte Nunes	f7a12adb6f	cql3: Disable pg-style string format test antlr3 leaks the token it creates itself when recovering from a mismatch in the case the missing token can be determined. Until this bug is fixed or circumvented, the test should remain disabled. Ref #1147 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1459345403-8243-1-git-send-email-duarte@scylladb.com>	2016-03-30 16:44:47 +03:00
Asias He	bc1889b7ab	storage_service: Shutdown batchlog_manager after decommission On the node which was decommissioned, I saw 2016-03-29 09:35:52,097 [shard 0] storage_service - DECOMMISSIONED: 2016-03-29 09:35:52,097 [shard 0] storage_service - DECOMMISSIONING: done 2016-03-29 09:36:28,814 [shard 0] batchlog_manager - Batchlog replay on shard 0: starts 2016-03-29 09:36:28,814 [shard 0] batchlog_manager - Batchlog replay on shard 0: done 2016-03-29 09:37:28,819 [shard 0] batchlog_manager - Batchlog replay on shard 1: starts 2016-03-29 09:37:28,820 [shard 0] batchlog_manager - Batchlog replay on shard 1: done 2016-03-29 09:38:28,830 [shard 0] batchlog_manager - Batchlog replay on shard 0: starts 2016-03-29 09:38:28,830 [shard 0] batchlog_manager - Batchlog replay on shard 0: done 2016-03-29 09:39:28,844 [shard 0] batchlog_manager - Batchlog replay on shard 1: starts 2016-03-29 09:39:28,844 [shard 0] batchlog_manager - Batchlog replay on shard 1: done We should stop the batchlog_manager to avoid initiating any future batchlog replay operation.	2016-03-30 20:54:30 +08:00
Asias He	5d1140b1eb	storage_service: Do batch log replay in decommission Replay the batch log during decommission. Kill one FIXME. Refs #1085	2016-03-30 20:54:30 +08:00
Asias He	5550aeba1d	batchlog_manager: Avoid stopping batchlog_manager more than once We can stop batchlog_manager in decommission and drain. Avoid stopping it more than once. Fix the following error: $ nodetool decommission $ nodetool drain storage_service - DECOMMISSIONING: stop_gossiping done storage_service - messaging_service stopped storage_service - DECOMMISSIONING: stop messaging_service done storage_service - DECOMMISSIONING: set_bootstrap_state done storage_service - DECOMMISSIONED: storage_service - DECOMMISSIONING: done storage_service - DRAINING: starting drain process gossip - gossip is already stopped scylla: ./seastar/core/gate.hh:93: future<> seastar::gate::close(): Assertion `!_stopped && "seastar::gate::close() cannot be called more than once"' failed.	2016-03-30 20:54:30 +08:00
Asias He	cdb43c5586	batchlog_manager: Allow user initiated batchlog replay operation During decommission, the storage_service::unbootstrap() needs to initiate a batchlog replay operation. To sync the replay operation initiated by the timer in batchlog_manager and storage_service, a semaphore is introduced. To simplify the semaphore locking, the management code now always runs on shard zero, but the real work is distributed to all shards.	2016-03-30 20:54:30 +08:00
Nadav Har'El	0fc9a5ee4d	sstables: merge range tombstones if possible This is a rewrite of Glauber's earlier patch to do the same thing, taking into account Avi's comments (do not use a class, do not throw from the constructor, etc.). I also verified that the actual use case which was broken in #1136 was fixed by this patch. Currently, we have no support for range tombstones because CQL will not generate them as of version 2.x. Thrift will, but we can safely leave this for the future. However, we have seen cases during a real migration in which a pure-CQL Cassandra would generate range tombstones in its SSTables. Although we are not sure how and why, those range tombstones were of a special kind: their end and next's start range were adjacent, which means that in reality, they could very well have been written as a single range tombstone for an entire clustering key - which we support just fine. This code will attempt to fix this problem temporarily by merging such ranges if possible. Care must be taken so that we don't end up accepting a true generic range tombstone by accident. Fixes #1136 Signed-off-by: Glauber Costa <glauber@scylladb.com> Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459333972-20345-1-git-send-email-nyh@scylladb.com>	2016-03-30 13:40:10 +03:00
Calle Wilund	0f5ca342b8	lists.cc: setter_by_uuid does not require read before execute Fixes #1082 Setting by UUID does not need existing data in list, so need no read before execute Message-Id: <1459325931-16387-1-git-send-email-calle@scylladb.com>	2016-03-30 11:24:20 +03:00
Takuya ASADA	73fa36b416	dist/common/scripts: update SET_NIC when --setup-nic passed to scylla_sysconfig_setup scylla_sysconfig_setup mistakenly ignores --setup-nic argument. Fixes #1132 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459285500-22185-1-git-send-email-syuu@scylladb.com>	2016-03-30 11:07:33 +03:00
Takuya ASADA	58fb7000b1	dist: add setup scripts symlink to /usr/sbin Instead of moving script to /usr/sbin, create symlink from /usr/lib/scylla/scylla_*setup to /usr/sbin/ Fixes #1092 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459324684-31364-1-git-send-email-syuu@scylladb.com>	2016-03-30 11:04:41 +03:00
Glauber Costa	23808ba184	sstables: fix exception printouts in check_marker As Nadav noticed in his bug report, check_marker is creating its error messages using characters instead of numbers - which is what we intended here in the first place. That happens because sprint(), when faced with an 8-bit type, interprets this as a character. To avoid that we'll use uint16_t types, taking care not to sign-extend them. The bug also noted that one of the error messages is missing a parameter, and that is also fixed. Fixes #1122 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>	2016-03-29 19:23:28 +03:00
Takuya ASADA	c1277bacb4	dist/common/scripts: prevent misinterpret blank input as '/dev/', show error when inputted device path is not found Fixes #1110 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459267786-19123-1-git-send-email-syuu@scylladb.com>	2016-03-29 19:18:51 +03:00
Glauber Costa	d5c1366e85	compaction: be verbose about which table is causing an exception When we, for some reason, fail to compact an SSTable, we do not log the file name leaving us with cryptic messages that tell us what happened, but not where it happened. This patch adds logging in compaction so that we'll know what's going on. Please note that readers are more of a concern, because the SSTable being written technically does not exist yet. Still, better safe than sorry: if open_data fails, or we leave an unfinished SSTable, it is still good to know which one was the culprit. Some argument can be made about whether we should log this at the lower SSTable level, or at the compaction level. The reason I am logging this at the compaction level, is that we don't really know which exception will trigger, and where: it may be the case that we're seeing exceptions that are not SSTable specific, and may not have the chance to log it properly. In particular, if the exception happens inside the reader: read_rows() and friends only return a mutation reader, which doesn't really do anything until we call read(). But at that time, we don't hold any pointers to the SSTable anymore. In summary, logging at the compaction level guarantees that we always do it no matter what. Exceptions that are part of the main SSTable path can log the file name as well if they want: if that's the case, we'll be left with the name appearing twice. That's totally harmless, and better than none. Fixes #1123 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>	2016-03-29 18:15:56 +03:00
Glauber Costa	d536846433	commitlog: initialize sync period with actual sync period commitlog's sync period is initialized as the batch period, and not as the sync period itself as it should be. I've found this by code inspection, but unless I am missing something really fundamental, this seems to be completely wrong. It's been working fine because in our defaults, I have checked that both variables default to the same value. But it seems to me that as long as anyone would change one of them, the behavior wouldn't be as expected. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2e7c565242fe5d4481a3ee8b0ba425ef14f5e42a.1459252783.git.glauber@scylladb.com>	2016-03-29 15:21:02 +03:00
Takuya ASADA	a5bb6c4b1b	dist/ubuntu: drop classical sysv init script, only support Upstart for Ubuntu 14.04LTS The sysv init script was added just to prevent a warning message from lintian, and was never really used by Ubuntu users. As a result, we often break this script since the upstart/systemd unit files change frequently. It may confuse users; it's better to use Upstart only, just like Fedora/CentOS. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459177601-20269-2-git-send-email-syuu@scylladb.com>	2016-03-29 11:48:18 +03:00
Takuya ASADA	42ce77a3b7	dist/redhat: prevent 'yum: command not found' on some Fedora environment On some Fedora environments such as Fedora official AMI, dnf-yum package is not installed by default, causes command not found error when we run our setup scripts. To prevent this, we need to add dnf-yum to scylla-server package dependency. Fixes #1106 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459099744-23068-1-git-send-email-syuu@scylladb.com>	2016-03-29 11:29:09 +03:00
Avi Kivity	adffb1c061	dist/ubuntu: improve handling of bad command line options On a bad command line, Scylla will exit with an exit code of 2. Mark it as a "normal" exit, to prevent a respawn. Fixes #1087 Message-Id: <1458827221-12833-1-git-send-email-avi@scylladb.com>	2016-03-29 11:14:45 +03:00
Avi Kivity	c1d8fb56f7	dist/ubuntu: specify kill timeout Allow more time for commitlog flushing Message-Id: <1458827216-12778-1-git-send-email-avi@scylladb.com>	2016-03-29 11:14:27 +03:00
Raphael Carvalho	d515a7fd85	sstables: fix deletion of sstable with temporary TOC After `4e52b41a4`, remove_by_toc_name() became aware of temporary TOC files, however, it doesn't consider that some components may be missing if temporary TOC is present. When creating a new sstable, the first thing we do is to write all components into temporary TOC, so content of a temporary TOC isn't reliable until it is renamed. Solution is about implementing the following flow (described by Avi): "Flow should be: - remove all components in parallel - forgive ENOENT, since the component may not have been written; otherwise deletion error should be raised - fsync the directory - delete the temporary TOC " This problem can be reproduced by running compaction without disk space, so compaction would fail and leave a partial sstable that would be marked for deletion. Afterwards, remove_by_toc_name() would try to delete a component that doesn't exist because it looked at the content of temporary TOC. Fixes #1095. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>	2016-03-29 10:38:01 +03:00
Tomasz Grabiec	d1db23e353	storage_service: Fix typos Message-Id: <1458837390-26634-1-git-send-email-tgrabiec@scylladb.com>	2016-03-29 10:29:04 +03:00
Pekka Enberg	994390769f	Update scylla-ami submodule * dist/ami/files/scylla-ami 89e7436...7019088 (1): > Re-enable clocksource=tsc on AMI	2016-03-29 10:18:06 +03:00
Takuya ASADA	201b0c6ab3	dist: re-enable clocksource=tsc on AMI clocksource=tsc on boot parameter mistakenly dropped on `b3c85aea89`, need to re-enable. [ penberg: Manual backport of commit `050fb911d5` to 1.0. ] Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `80242ff443`)	2016-03-29 10:17:41 +03:00
Pekka Enberg	227daecba6	Revert "dist: move setup scripts to /usr/sbin" This reverts commit `989357189a` because it broke our Jenkins packaging jobs.	2016-03-29 10:17:05 +03:00
Pekka Enberg	d1ec97e76f	Revert "dist: re-enable clocksource=tsc on AMI" This reverts commit `050fb911d5` in preparation for reverting `989357189a`.	2016-03-29 10:16:48 +03:00
Takuya ASADA	050fb911d5	dist: re-enable clocksource=tsc on AMI clocksource=tsc on boot parameter mistakenly dropped on `b3c85aea89`, need to re-enable. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com>	2016-03-29 09:53:23 +03:00
Asias He	62d443a07d	streaming: Fix log of plan_id and session address in stream_session They got swapped. Fix it up. Spotted by looking at the log. Message-Id: <d163d71e9a96d1a45c3a4c529519790eeff7c486.1459172778.git.asias@scylladb.com>	2016-03-29 09:01:06 +03:00
Nadav Har'El	a05577ca41	sstable: fix read failure of certain sstables We had a problem reading certain existing Cassandra sstables into Scylla. Our consume_range_tombstone() function assumes that the start and end columns have certain "end of component" markers, and wants to verify that assumption. But because of bugs in older versions of Cassandra, see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the "end of component" was missing (set to 0). CASSANDRA-7593 suggested this problem might exist on the start column, so we allowed for that, but now we discovered a case where also the end column is set to 0 - causing the test in consume_range_tombstone() to fail and the sstable read to fail - causing Scylla to not be able to import that sstable from Cassandra. Allowing for a 0 also on the end column made it possible to read that sstable, compact it, and so on. Fixes #1125. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>	2016-03-28 17:09:37 +03:00
Duarte Nunes	db881fdc8f	cql: Add support for pg-style string literal This patch adds support for pg-style string literals to the CQL grammar. Fixes #1078 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1459093238-2529-1-git-send-email-duarte@scylladb.com>	2016-03-28 17:06:03 +03:00
yan cui	e5d1c031ac	dist: add ubuntu docker file	2016-03-28 10:14:12 +03:00
Avi Kivity	a919113fdb	schema_tables: fix deadlock in cross-node communications Seastar wrongly limits the number of concurrent submit_to()s to a single remote shard. This can cause an ABBA deadlock: fiberA fiberB (x127) submit_to(0) # lock schema <- returns submit_to(0) # lock schema (waits) submit_to(0) # do work (waits) The fiberBs wait for fiberA, which in turn waits for a fiberB to return. While the correct fix is to remove the client-side limit and replace it with a server-side per-verb limit, we start with a simpler fix that replaces the blocking lock call with a non-blocking call, removing the deadlock. Fixes #1088. Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>	2016-03-28 10:12:10 +03:00
Raphael Carvalho	e6e5999282	Fix corner-case in refresh Problem found by dtest which loads sstables with generation 1 and 2 into an empty column family. The root of the problem is that reshuffle procedure changes new sstables to start from generation 2 at least. So reshuffle could try to set generation 1 to 2 when generation 2 exists. This problem can be fixed by starting from generation 1 instead, so reshuffle would handle this case properly. Fixes #1099. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>	2016-03-27 10:03:32 +03:00
Avi Kivity	077c0d1022	dist: ami: fix AMI_OPT receiving no value We assign AMI=0 and AMI_OPT=1, so in the true case, AMI_OPT has no value, and a later compare fails.	2016-03-26 21:16:28 +03:00
Takuya ASADA	989357189a	dist: move setup scripts to /usr/sbin Since these scripts are user command, should be on $PATH. Fixes #1092 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458860407-25269-1-git-send-email-syuu@scylladb.com>	2016-03-25 11:50:13 +03:00
Takuya ASADA	2582dbe4a0	dist/ami: use tilde for release candidate builds Sync with ubuntu package versioning rule Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458882718-29317-1-git-send-email-syuu@scylladb.com>	2016-03-25 11:34:28 +03:00
Glauber Costa	e750a94300	sanity check Seastar's I/O queue configuration While Seastar in general can accept any parameter for its I/O queues, Scylla in particular shouldn't run with them disabled. Such will be the status when the max-io-requests parameter is not enabled. On top of that, we would like to have enough depth per I/O queue not to allow for shard-local parallelism. Therefore, we will require a minimum per-queue capacity of 4. In machines where the disk iodepth is not enough to allow for 4 concurrent requests per shard, one should reduce the number of I/O queues. For --max-io-requests, we will check the parameter itself. However, the --num-io-queues parameter is not mandatory, and given enough concurrent requests, Seastar's default configuration can very well just be doing the right thing. So for that, we will check the final result of each I/O queue. As it is the case with other checks of the sorts, this can be overridden by the --developer-mode switch. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>	2016-03-25 11:33:57 +03:00
Tomasz Grabiec	53bbcf4a1e	schema_tables: Wait for notifications to be processed. Listeners may defer since: `93015bcc54` "migration_manager: Make the migration callbacks runs inside seastar thread" Not all places were adjusted to wait for them. Fix that. Message-Id: <1458837613-27616-1-git-send-email-tgrabiec@scylladb.com>	2016-03-24 19:04:12 +02:00
Avi Kivity	12744217b8	Initial github issue template Message-Id: <1458817106-1513-1-git-send-email-avi@scylladb.com>	2016-03-24 15:37:00 +02:00
Benoît Canet	4ac1126677	collectd: Write to the network to get rid of spurious log messages Closes #1018 Suggested-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458759378-4935-1-git-send-email-benoit@scylladb.com>	2016-03-24 12:34:14 +02:00
Calle Wilund	ff5df306e3	database: Use disk-marking delete function in discard_sstables Fixes #797 To make sure an inopportune crash after truncate does not leave sstables on disk to be considered live, and thus resurrect data, after a truncate, use delete function that renames the TOC file to make sure we've marked sstables as dead on disk when we finish this discard call. Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>	2016-03-24 12:02:08 +02:00
Calle Wilund	4e52b41a46	sstables: Add delete func to rename TOC ensuring table is marked dead Note: "normal" remove_by_toc_name must now be prepared for and check if the TOC of the sstable is already moved to temp file when we get to the juicy delete parts. Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>	2016-03-24 12:01:53 +02:00
Asias He	6fd6e57e80	streaming: Harden keep alive timer - Do nothing in case the session is closed, to prevent firing up the timer again - Print log info when no progress has been made if the timer expires, it is very useful to debug an idle session - Grab a reference when the keep alive timer is running Message-Id: <9f2cc3164696905a6a39c0d072a980765d598dfd.1458782956.git.asias@scylladb.com>	2016-03-24 11:58:54 +02:00
Avi Kivity	112a930f92	Merge "Bring back simplify session completion logic" from Asias "The following patches were reverted because they were thought to break Glauber's "Make sure repairs do not cripple incoming load" series. It turns out these two patches just made another bug more visible. The bug is fixed in `c2eff7e824` (streaming: Complete receive task after the flush). We can bring the two patches back now. Passed repair_additional_test.py and update_cluster_layout_tests.py with smp 2."	2016-03-24 11:57:20 +02:00
Tomasz Grabiec	341b509f68	cql_test_env: Make initialization exception-safe Currently start() is not prepared to handle exceptions thrown from service initialization. It's easy to trigger such an exception by starting two tests at the same time, which will result in socket bind error. Exception thrown from start() typically results in assertion failures like this one: seastar::sharded<Service>::~sharded() [with Service = database]: Assertion `_instances.empty()' failed. This patch fixes the problem by combining start() and stop() in a single do_with() and using RAII for stopping services. Now exceptions thrown from service initialization should stop services in proper order and let the original exception pass through. Example result: fatal error in "test_new_schema_with_no_structural_change_is_propagated": std::runtime_error: bind: Address already in use Message-Id: <1458768018-27662-1-git-send-email-tgrabiec@scylladb.com>	2016-03-24 11:20:01 +02:00
Shlomi Livne	d3a91e737b	fix a collision between --ami command line param and env sysconfig scylla-server includes an AMI, the script also used an AMI variable. Fix this by renaming the script variable. `6a18634f9f` introduced this issue since it started importing the sysconfig scylla-server Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <0bc472bb885db2f43702907e3e40d871f1385972.1458767984.git.shlomi@scylladb.com>	2016-03-24 08:14:41 +02:00
Asias He	fe263e5436	Revert "Revert "streaming: Start to send mutations after PREPARE_DONE_MESSAGE"" This reverts commit `1f29a698d5`.	2016-03-24 08:43:17 +08:00
Asias He	a6dd6e6d55	Revert "Revert "streaming: Simplify session completion logic"" This reverts commit `354fca9d56`.	2016-03-24 07:48:27 +08:00
Gleb Natapov	0afd1c6f0a	config: enable truncate_request_timeout_in_ms option Option truncate_request_timeout_in_ms is used by truncate. Mark it as used. Message-Id: <20160323162649.GH2282@scylladb.com>	2016-03-23 18:50:24 +02:00
Yoav Kleinberger	91269d0c15	tools/scyllatop: add sums to aggregate view the aggregate view now supports both sums and means. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1328af8efb113a786d7402b0704220108bfb28db.1458749600.git.yoav@scylladb.com>	2016-03-23 18:49:57 +02:00
Shlomi Livne	6a18634f9f	scylla_io_setup import scylla-server env args scylla_io_setup requires the scylla-server env to be setup to run correctly. previously scylla_io_setup was encapsulated in scylla-io.service that assured this. extracting CPUSET,SMP from SCYLLA_ARGS as CPUSET is needed for invoking io_tune Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>	2016-03-23 17:54:06 +02:00
Pekka Enberg	8bf3d4f550	Merge "Make sure repairs do not cripple incoming load" from Glauber "This series makes sure that the influence of repairs on the ongoing loads is limited. This patch does not fix the situation completely, but it will be the best we can do for 1.0 Here's a brief explanation about some potentially contentious points, and future work: 1) With the old parallelism semaphore in tree, we could never really drop parallelism below 256, since even with (local) parallelism = 1, we would still have 256 vnodes. So while the number 100 is totally empirical, we know for a fact that around 200-something, we start having real trouble. (total) parallelism = 100 is enough to allow us to survive a load as much as 3 times heavier than the load described in Issue944. So while it is empirical, at least it is based on something 2) I totally support changing the checksumming algorithm. However, I would rather focus my efforts on testing this to exhaustion than doing this at the moment. But if anybody wants to do it, I think it is a great thing to have before 1.0. Especially because we'll probably need a new verb for that, so we would be better off having it from the start 3) This problem was made harder due to the fact that there are three conditions really that can affect the ongoing load. Only one of them needs to trigger for us to see degradation, so fixing them individually will usually buy us nothing. Those are: a) The disk bandwidth. Since the mutations are all together in the same memtable/commitlog as normal memtables, we can differentiate between them from the I/O Scheduler perspective. This is not an issue of course if the incoming mutations are not enough for us to saturate the disk, but especially given the highly parallel nature of repair, we usually will. If the commitlog queue starts getting too big, for instance, new requests will start being put to wait. 
The effect of this part of the series is to completely shift the high waiting times from those classes to the streaming ones (unfortunately compaction is still affected, but that's fine IMHO). With the new streaming classes, the waiting time of a memtable / commitlog request is still kept in the microseconds range. The streaming classes, on the other hand, will be in the hundreds of milliseconds range, or even seconds. b) The memory consumption: since the whole problem that leads to a) is the fact that due to high disk activity some requests will have to wait, we will end up with a lot of streaming memtables not yet flushed. Because of that, we will start throttling new incoming CQL requests and all the isolation efforts are rendered useless. Once again, due to the highly parallel nature of repair, this turned out to be a very easy condition to trigger. The solution proposed here is to limit a maximum amount of dirty memory for the repair job (in here, 25 %). This way, we can endure even slightly heavier loads without sweating too much. c) The task scheduler: repair generates a ton of requests for range checksums, and we actually want to keep it that way - so that the ranges checksummed are small enough so we don't have to resend a lot of mutations for no reason. However, if we pile up thousands of continuations in the task scheduler, seastar has absolutely no mechanism (right now) to prioritize between different kinds of requests. That means that the continuations that are supposed to be handling user requests will simply not run for a long time. Even if the Seastar load is less than 100 % that is still a problem, since that is just adding hundreds of milliseconds worth of latencies to any request processing. Fixes #944 and fixes #1033."	2016-03-23 16:07:06 +02:00
Yoav Kleinberger	d2cfb86dc8	tools/scyllatop: defend against unexpected strings from collectd Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <cd7ecf6b3b82bd2027179cbec4e689a946469e9a.1458740337.git.yoav@scylladb.com>	2016-03-23 16:05:59 +02:00
Asias He	c2eff7e824	streaming: Complete receive task after the flush A STREAM_MUTATION_DONE message will signal the receiver that the sender has completed the sending of stream mutations. When the receiver finds it has zero task to send and zero task to receive, it will finish the stream_session, and in turn finish the stream_plan if all the stream_sessions are finished. We should call receive_task_completed only after the flush finishes so that when stream_plan is finished all the data is on disk. Fixes repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make sure repairs do not cripple incoming load" series ====================================================================== FAIL: repair_disjoint_data_test (repair_additional_test.RepairAdditionalTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "scylla-dtest/repair_additional_test.py", line 102, in repair_disjoint_data_test self.check_rows_on_node(node1, 3000) File "scylla-dtest/repair_additional_test.py", line 33, in check_rows_on_node self.assertEqual(len(result), rows, len(result)) AssertionError: 2461	2016-03-23 09:40:49 -04:00
Glauber Costa	f49e965d78	repair: rework repair code so we can limit parallelism The repair code as it is right now is a bit convoluted: it resorts to detached continuations + do_for_each when calling sync_ranges, and deals with the problem of excessive parallelism by employing a semaphore inside that range. Still, even by doing that, we still generate a great number of checksum requests because the ranges themselves are processed in parallel. It would be better to have a single-semaphore to limit the overall parallelism for all requests. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:40:49 -04:00
Glauber Costa	34a9fc106f	database: keep streaming memtables in their own region group Theoretically, because we can have a lot of pending streaming memtables, we can have the database start throttling and incoming connections slowing down during streaming. Turns out this is actually a very easy condition to trigger. That is basically because the other side of the wire in this case is quite efficient in sending us work. This situation is alleviated a bit by reducing parallelism, but not only does it not go away completely, once we have the tools to start increasing parallelism again it will become commonplace. The solution for this is to limit the streaming memtables to a fraction of the total allowed dirty memory. Using the nesting capability built into the LSA regions, we will make the streaming region group a child of the main region group. With that, we can throttle streaming requests separately, while at the same time being able to control the total amount of dirty memory as well. Because of the property, it can still be the case that incoming requests will throttle earlier due to streaming - unless we allow for more dirty memory to be used during repairs - but at least that effect will be limited. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:40:47 -04:00
Glauber Costa	455d5a57d2	streaming memtables: coalesce incoming writes The repair process will potentially send ranges containing few mutations, definitely not enough to fill a memtable. It wants to know whether or not each of those ranges individually succeeded or failed, so we need a future for each. Small memtables being flushed are bad, and we would like to write bigger memtables so we can better utilize our disks. One of the ways to fix that is changing the repair itself to send more mutations in a single batch. But relying on that is a bad idea for two reasons: First, the goals of the SSTable writer and the repair sender are at odds. The SSTable writer wants to write as few SSTables as possible, while the repair sender wants to break down the range in pieces as small as it can and checksum them individually, so it doesn't have to send a lot of mutations for no reason. Second, even if the repair process wants to process larger ranges at once, some ranges themselves may be small. So while most ranges would be large, we would still have potentially some fairly small SSTables lying around. The best course of action in this case is to coalesce the incoming streams write-side. repair can now choose whatever strategy - small or big ranges - it wants, rest assured that the incoming memtables will be coalesced together. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:38:22 -04:00
Glauber Costa	5fa866223d	streaming: add incoming streaming mutations to a different sstable Keeping the mutations coming from the streaming process as mutations like any other have a number of advantages - and that's why we do it. However, this makes it impossible for Seastar's I/O scheduler to differentiate between incoming requests from clients, and those who are arriving from peers in the streaming process. As a result, if the streaming mutations consume a significant fraction of the total mutations, and we happen to be using the disk at its limits, we are in no position to provide any guarantees - defeating the whole purpose of the scheduler. To implement that, we'll keep a separate set of memtables that will contain only streaming mutations. We don't have to do it this way, but doing so makes life a lot easier. In particular, to write an SSTable, our API requires (because the filter requires), that a good estimate on the number of partitions is informed in advance. The partitions also need to be sorted. We could write mutations directly to disk, but the above conditions couldn't be met without significant effort. In particular, because mutations can be arriving from multiple peer nodes, we can't really sort them without keeping a staging area anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:13:00 -04:00
Glauber Costa	10c8ca6ace	priority manager: separate streaming reads from writes Streaming has currently one class, that can be used to contain the read operations being generated by the streaming process. Those reads come from two places: - checksums (if doing repair) - reading mutations to be sent over the wire. Depending on the amount of data we're dealing with, that can generate a significant chunk of data, with seconds worth of backlog, and if we need to have the incoming writes intertwined with those reads, those can take a long time. Even if one node is only acting as a receiver, it may still read a lot for the checksums - if we're talking about repairs, those are coming from the checksums. However, in more complicated failure scenarios, it is not hard to imagine a node that will be both sending and receiving a lot of data. The best way to guarantee progress on both fronts, is to put both kinds of operations into different classes. This patch introduces a new write class, and rename the old read class so it can have a more meaningful name. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	78189de57f	database: make seal_on_overflow a method of the memtable_list Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	635bb942b2	database: move add_memtable as a method of the memtable_list The column family still has to teach the memtable list how to allocate a new memtable, since it uses CF parameters to do so. After that, the memtable_list's constructor takes a seal and a create function and is complete. The copy constructor can now go, since there are no users left. The behavior of keeping a reference to the underlying memtables can also go, since we can now guarantee that nobody is keeping references to it (it is not even a shared pointer anymore). Individual memtables are, and users may be keeping references to them individually. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	6ba95d450f	database: move active_memtable to memtable_list Each list can have a different active memtable. The column family method keeps existing, since the two separate sets of memtable are just an implementation detail to deal with the problem of streaming QoS: the active memtable keeps being the one from the main list. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	af6c7a5192	database: create a class for memtable_list memtable_list is currently just an alias for a vector of memtables. Let's move them to a class on its own, exporting the relevant methods to keep user code unchanged as much as possible. This will help us keeping separate lists of memtables. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Avi Kivity	8ed95754c0	Merge seastar upstream * seastar 9f2b868...aa281bd (7): > shared_promise: Add move assignment operator > lowres_clock: Fix stretched time > scripts: Delete tap with ip instead of tunctl > vla: Actually be exception-safe > vla: Ensure memory is freed if ctor throws > vla: Ensure memory is correctly freed > net: Improve error message when parsing invalid ipv4 address	2016-03-23 14:39:31 +02:00
Takuya ASADA	50db64de33	dist: drop -j2 option on .spec, make build_rpm.sh able to specify -j option Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458678665-30273-1-git-send-email-syuu@scylladb.com>	2016-03-23 13:32:14 +02:00
Gleb Natapov	48c83163b9	init: make more initialization threaded Since initialization now runs in a thread storage, messaging and gossiper services initialization code may take advantage of it too. Message-Id: <20160323094732.GF2282@scylladb.com>	2016-03-23 11:53:11 +02:00
Shlomi Livne	4ecc37111f	dist/ami: Use the actual number of disks instead of AWS meta service We have seen in some cases that when using the boto api to start instances the aws metadata service http://169.254.169.254/latest/meta-data/block-device-mapping/ returns an incorrect number of disks - work around that by checking the actual number of disks using lsblk Adding a validation at the end verifying that after all computations the NR_IO_QUEUES will not be greater than the number of shards (we had an issue with i2.8x) Fixes: #1062 Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <54c51cd94dd30577a3fe23aef3ce916c01e05504.1458721659.git.shlomi@scylladb.com>	2016-03-23 10:47:08 +02:00
Raphael Carvalho	370b1336fe	service: fix refresh Vlad and I were working on finding the root of the problems with refresh. We found that refresh was deleting existing sstable files because of a bug in a function that was supposed to return the maximum generation of a column family. The intention of this function is to get generation from last element of column_family::_sstables, which is of type std::map. However, we were incorrectly using std::map::end() to get last element, so garbage was being read instead of maximum generation. If the garbage value is lower than the minimum generation of a column family, then reshuffle_sstables() would set generation of all existing sstables to a lower value. That would confuse our mechanism used to delete sstables because sstables loaded at boot stage were touched. Solution to this problem is about using rbegin() instead of end() to get last element from column_family::_sstables. The other problem is that refresh will only load generations that are larger than or equal to X, so new sstables with lower generation will not be loaded. Solution is about creating a set with generation of live SSTables from all shards, and using this set to determine whether a generation is new or not. The last change was about providing an unused generation to reshuffle procedure by adding one to the maximum generation. That's important to prevent reshuffle from touching an existing SSTable. Tested 'refresh' under the following scenarios: 1) Existing generations: 1, 2, 3, 4. New ones: 5, 6. 2) Existing generations: 3, 4, 5, 6. New ones: 1, 2. 3) Existing generations: 1, 2, 3, 4. New ones: 7, 8. 4) No existing generation. No new generation. 5) No existing generation. New ones: 1, 2. I also had to adapt existing testcase for reshuffle procedure. Fixes #1073. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <1c7b8b7f94163d5cd00d90247598dd7d26442e70.1458694985.git.raphaelsc@scylladb.com>	2016-03-23 10:21:58 +02:00
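The `end()` versus `rbegin()` mix-up described in `370b1336fe` can be illustrated with a minimal sketch. It is not the Scylla code: `sstable_map`, `max_generation` and `next_unused_generation` are hypothetical names, and a plain `std::map` stands in for `column_family::_sstables`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for column_family::_sstables: generation -> sstable name.
using sstable_map = std::map<long, std::string>;

// Returns the largest generation, or nullopt for an empty map.
// rbegin() refers to the last (largest-key) element; end() is a past-the-end
// iterator, and dereferencing it is undefined behavior (the "garbage" read).
std::optional<long> max_generation(const sstable_map& sstables) {
    if (sstables.empty()) {
        return std::nullopt;
    }
    return sstables.rbegin()->first;
}

// An unused generation to hand to a reshuffle step: one past the current
// maximum, or 1 when there are no sstables yet (an assumption of this sketch).
long next_unused_generation(const sstable_map& sstables) {
    auto max_gen = max_generation(sstables);
    return max_gen ? *max_gen + 1 : 1;
}
```

With existing generations 3, 4, 5, 6 the sketch yields 6 as the maximum and 7 as the next unused generation, so nothing that already exists is touched.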
Benoît Canet	1594bdd5bb	dist/ubuntu: Fix the init script variable sourcing The variable sourcing was crashing the init script on ubuntu. Fix it with the suggestion from Avi. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458685099-1160-1-git-send-email-benoit@scylladb.com>	2016-03-23 09:03:17 +02:00
Tomasz Grabiec	5f44afa311	cql3: batch_statement: Execute statements sequentially Currently we execute all statements in parallel, but some statements depend on order, in particular list append/prepend. Fix by executing sequentially. Fixes cql_additional_tests.py:TestCQL.batch_and_list_test dtest. Fixes #1075. Message-Id: <1458672874-4749-1-git-send-email-tgrabiec@scylladb.com>	2016-03-22 20:59:40 +02:00
Pekka Enberg	354fca9d56	Revert "streaming: Simplify session completion logic" This reverts commit `208b7fa7ba`. It breaks Glauber's upcoming repair series.	2016-03-22 20:37:50 +02:00
Pekka Enberg	1f29a698d5	Revert "streaming: Start to send mutations after PREPARE_DONE_MESSAGE" This reverts commit `4c06221766`. It breaks Glauber's upcoming repair series.	2016-03-22 20:37:22 +02:00
Avi Kivity	7df21768d6	Merge "Fix row_cache_alloc_stress test" from Tomasz "The test predates LSA zones and was not anticipating that LSA would take much more free memory from the system than it needs in its assertions. Fix by accounting for the fact properly."	2016-03-22 18:46:31 +02:00
Avi Kivity	b8f80bb2be	Update scylla-ami submodule * dist/ami/files/scylla-ami 56f1ab7...89e7436 (1): > Merge "iotune packaging fix for scylla-ami" from Takuya	2016-03-22 17:55:00 +02:00
Takuya ASADA	dac2bc3055	dist: on scylla_io_setup, SMP and CPUSET should be empty when the parameter not present Fixes #1060 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458659928-2050-1-git-send-email-syuu@scylladb.com>	2016-03-22 17:49:06 +02:00
Avi Kivity	8cf785e53a	Merge "Merge "iotune packaging fix" from Takuya "This implements #1065 - iotune will NOT be a part of scylla service - remove the scylla.io.service - User will have to run it manually - using a script called scylla_io_tune_setup (that will do the exact same thing the service does today) - if they won't, and do not use --developer-mode, scylla init will fail with a proper error - scylla will not start (in the same manner it does not start if you run scylla on non XFS FS) - For c3,m3,i2 we will use the evaluation formula we have (that takes the number of disks, cores etc.) - For other instances we will set --developer-mode. if the user logs into the instance - he will get a developer-mode warning - No iotune on AWS" Fixes #1065.	2016-03-22 17:46:32 +02:00
Takuya ASADA	9889712d43	dist: remove scylla-io-setup.service and make it standalone script	2016-03-22 17:45:58 +02:00
Takuya ASADA	2cedab07f2	dist: on scylla_io_setup print out message both for stdout and syslog	2016-03-22 17:45:58 +02:00
Takuya ASADA	83112551bb	dist: introduce dev-mode.conf and scylla_dev_mode_setup	2016-03-22 17:45:58 +02:00
Tomasz Grabiec	a4e3adfbec	Fix assertion in row_cache_alloc_stress Fixes the following assertion failure: row_cache_alloc_stress: tests/row_cache_alloc_stress.cc:120: main(int, char**)::<lambda()>::<lambda()>: Assertion `mt->occupancy().used_space() < memory::stats().free_memory()' failed. memory::stats()::free_memory() may be much lower than the actual amount of reclaimable memory in the system since LSA zones will try to keep a lot of free segments to themselves. Fix by using actual amount of reclaimable memory in the check.	2016-03-22 16:31:04 +01:00
Tomasz Grabiec	a0cba3c86f	logalloc: Introduce tracker::occupancy() Returns occupancy information for all memory allocated by LSA, including segment pools / zones.	2016-03-22 16:28:10 +01:00
Yoav Kleinberger	97bb7a35d9	tools/scyllatop: some sensible default metrics Previously if the user did not specify any metrics, scyllatop used whatever it could find. Now we have some preset defaults which are probably more interesting. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1458658804-377-1-git-send-email-yoav@scylladb.com>	2016-03-22 17:04:13 +02:00
Tomasz Grabiec	529c8b8858	logalloc: Rename tracker::occupancy() to region_occupancy()	2016-03-22 14:56:44 +01:00
Pekka Enberg	5019b709ba	service/migration_manager: Simplify verb unregistration You can safely unregister verbs even if they're not registered yet. Simplify code in migration manager by dropping the redundant checks. Message-Id: <1458027669-6517-1-git-send-email-penberg@scylladb.com>	2016-03-22 15:24:55 +02:00
Pekka Enberg	3e1a660839	Merge seastar upstream * seastar c193821...9f2b868 (4): > memory: set free memory to non-zero value in debug mode > Merge "Increase IOTune's robustness by including a timeout" from Glauber > shared_future: add companion class, shared_promise > rpc: fix client connection stopping	2016-03-22 15:16:21 +02:00
Asias He	4c06221766	streaming: Start to send mutations after PREPARE_DONE_MESSAGE Below are 3 possible cases in a stream session, after commit `208b7fa7ba` (streaming: Simplify session completion logic) We might close the session before the exchange of the PREPARE_DONE_MESSAGE message in case 1). To fix, we defer the sending of mutations after PREPARE_DONE_MESSAGE is sent at the initiator node. 1) Initiator Follower tx rx tx rx 1 0 0 1 send prepare send back prepare recev prepare send mutations (close the session before prepare_done msg is sent) recv mutations (close session before prepare_done msg is received) send prepare_done recv prepare_done and send no mutations 2) Initiator Follower tx rx tx rx 0 1 1 0 send prepare send back prepare recv prepare nothing to send send prepare_done recv prepare_done and send mutations (close session) recv mutations (close session) 3) Initiator Follower tx rx tx rx 1 1 1 1 send prepare send back prepare recv prepare send mutations recv mutations, can not close session since we have mutations to send send prepare_done recv prepare_done and send mutations (close session) recv mutations (close session) Message-Id: <d6510b558565db23202164fa491b883ef3796e58.1458634037.git.asias@scylladb.com>	2016-03-22 15:05:57 +02:00
Takuya ASADA	6b2a8a2f70	dist: enable collectd on scylla_setup by default, to make scyllatop usable Fixes #1037 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458324769-9152-1-git-send-email-syuu@scylladb.com>	2016-03-22 15:02:18 +02:00
Tomasz Grabiec	ca08db504b	managed_bytes: Make operator[] work for large blobs as well Fixes assertion in mutation_test: mutation_test: ./utils/managed_bytes.hh:349: blob_storage::char_type* managed_bytes::data(): Assertion `!_u.ptr->next' Introduced in `ea7c2dd085` Message-Id: <1458648786-9127-1-git-send-email-tgrabiec@scylladb.com>	2016-03-22 14:43:52 +02:00
Gleb Natapov	1e6352e398	messaging: do not admit new requests during messaging service shutdown. Sending a message may open new client connection which will never be closed in case messaging service is shutting down already. Fixes #1059 Message-Id: <1458639452-29388-3-git-send-email-gleb@scylladb.com>	2016-03-22 13:00:18 +02:00
Gleb Natapov	357c91a076	messaging: do not delete client during messaging service shutdown Messaging service stop() method calls stop() on all clients. If remove_rpc_client_one() is called while those stops are running, client::stop() will be called twice, which is not supposed to happen. Fix it by ignoring client remove requests during messaging service shutdown. Fixes #1059 Message-Id: <1458639452-29388-2-git-send-email-gleb@scylladb.com>	2016-03-22 13:00:18 +02:00
Asias He	b8abd88841	messaging_service: Take reference of ms in send_message_timeout_and_retry Take a reference of messaging_service object inside send_message_timeout_and_retry to make sure it is not freed during the life time of send_message_timeout_and_retry operation.	2016-03-22 12:32:19 +02:00
Pekka Enberg	ae33e9fe76	dist/ubuntu: Use tilde for release candidate builds The version number ordering rules are different for rpm and deb. Use tilde ('~') for the latter to ensure a release candidate is ordered _before_ a final version. Message-Id: <1458627524-23030-1-git-send-email-penberg@scylladb.com>	2016-03-22 11:52:05 +02:00
Avi Kivity	5a20a70728	Merge "CQL syntax extension to handle sstable loader lists" from Calle "Adds an extension function SCYLLA_TIMEUUID_LIST_INDEX to CQL syntax for collection element indexing, which, if the target is a list, will attempt to directly index the list (which is really a map) by the ordering time uuid (as index parameter)."	2016-03-22 11:42:47 +02:00
Duarte Nunes	36571a2018	init: Trim spaces in seeds list This patch ensures we are resilient against spaces before or after IP addresses in the seeds list. Fixes #958 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1458637617-5761-1-git-send-email-duarte@scylladb.com>	2016-03-22 11:10:29 +02:00
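The kind of trimming that `36571a2018` describes could look roughly like the sketch below. It assumes a comma-separated seeds string; `trim` and `parse_seeds` are hypothetical helpers written for illustration, not functions from the Scylla tree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Strip leading and trailing spaces and tabs from one entry.
std::string trim(const std::string& s) {
    const char* whitespace = " \t";
    auto first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return "";  // entry was empty or whitespace only
    }
    auto last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Split a comma-separated seeds string and trim every entry.
// Entries that are empty after trimming are skipped.
std::vector<std::string> parse_seeds(const std::string& seeds) {
    std::vector<std::string> out;
    std::size_t start = 0;
    while (start <= seeds.size()) {
        auto comma = seeds.find(',', start);
        if (comma == std::string::npos) {
            comma = seeds.size();
        }
        auto entry = trim(seeds.substr(start, comma - start));
        if (!entry.empty()) {
            out.push_back(entry);
        }
        start = comma + 1;
    }
    return out;
}
```

Given `" 10.0.0.1 ,10.0.0.2,  10.0.0.3 "`, `parse_seeds` returns the three addresses with no surrounding spaces.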
Avi Kivity	1798889e85	Merge "Make apply() exception-safe" from Tomasz "We cannot leave partially applied mutation behind when the write fails. It may fail if memory allocation fails in the middle of apply(). This for example would violate write atomicity, readers should either see the whole write or none at all. This fix makes apply() revert partially applied data upon failure, by the means of ReversiblyMergeable concept. In a nutshell the idea is to store old state in the source mutation as we apply it and swap back in case of exception. At cell level this swapping is inexpensive, just rewiring pointers. For this to work, the source mutation needs to be brought into mutable form, so frozen mutations need to be unfrozen. In practice this doesn't increase amount of cell allocations in the memtable apply path because incoming data will usually be newer and we will have to copy it into LSA anyway. There are extra allocations though for the data structures that hold cells. I didn't see significant change in performance of: build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13 The score fluctuates around ~77k ops/s. The change was tested with a unit test (patch to mutation_test) which generates random mutations and injects allocation failures at every possible allocation site in the apply path. This also uncovered other preexisting bugs."	2016-03-22 10:43:41 +02:00
Gleb Natapov	ea92064d38	avoid invoke_on_all during developer-mode application if possible Message-Id: <20160315145327.GW6117@scylladb.com>	2016-03-22 10:40:30 +02:00
Nadav Har'El	2eb0627665	sstable: fix use-after-free of temporary ioclass copy Commit `6a3872b355` fixed some use-after-free bugs but introduced a new one because of a typo: Instead of capturing a reference to the long-living io-class object, as all the code does, one place in the code accidentally captured a copy of this object. This copy had a very temporary life, and when a reference to that copy was passed to sstable reading code which assumed that it lives at least as long as the read call, a use-after-free resulted. Fixes #1072 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458595629-9314-1-git-send-email-nyh@scylladb.com>	2016-03-21 22:28:05 +01:00
Tomasz Grabiec	6e73c3f3dc	perf_simple_query: Make duration configurable	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	2fbb55929d	mutation_test: Add allocation failure stress test for apply() The test injects allocation failures at every allocation site during apply(). Only allocations through allocation_strategy are instrumented, but currently those should include all allocations in the apply() path. The target and source mutations are randomized.	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	8ede27f9c6	mutation_test: Add more apply() tests	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	36575d9f01	mutation_test: Hoist make_blob() to a function	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	4c85d06df7	mutation_test: Make make_blob() return different blob each time random_bytes was constructed with the same seed each time.	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	19b3df9f0f	mutation_test: Fix use-after-free The problem was that verify_row() was returning a future which was not waited on. Fix by running the code in a thread.	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	a7966e9b71	mutation_partition: Fix friend declarations Missing "class" confuses CLion IDE.	2016-03-21 21:49:53 +01:00
Tomasz Grabiec	dc290f0af7	mutation_partition: Make apply() atomic even in case of exception We cannot leave partially applied mutation behind when the write fails. It may fail if memory allocation fails in the middle of apply(). This for example would violate write atomicity, readers should either see the whole write or none at all. This fix makes apply() revert partially applied data upon failure, by the means of ReversiblyMergeable concept. In a nutshell the idea is to store old state in the source mutation as we apply it and swap back in case of exception. At cell level this swapping is inexpensive, just rewiring pointers. For this to work, the source mutation needs to be brought into mutable form, so frozen mutations need to be unfrozen. In practice this doesn't increase amount of cell allocations in the memtable apply path because incoming data will usually be newer and we will have to copy it into LSA anyway. There are extra allocations though for the data structures that hold cells. I didn't see significant change in performance of: build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13 The score fluctuates around ~77k ops/s. Fixes #283.	2016-03-21 21:49:52 +01:00
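The "store old state in the source and swap back on exception" idea from `dc290f0af7` can be sketched in a few lines. This is a toy model, not the ReversiblyMergeable implementation: `cell`, `apply_reversibly` and the `fail_at` fault-injection parameter are all hypothetical, and real cells, REVERT flags and LSA allocation are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Toy stand-in for a cell: the cell with the newer timestamp wins a merge.
struct cell {
    long timestamp;
    int value;
};

// Merge src into dst position by position. When the src cell wins, the two
// cells are swapped, so src now holds the old state of dst. If a later step
// throws, the swaps made so far are undone in reverse order and dst is left
// exactly as it was before the call.
// fail_at is a hypothetical fault-injection hook: the index at which to throw.
void apply_reversibly(std::vector<cell>& dst, std::vector<cell>& src,
                      std::size_t fail_at = static_cast<std::size_t>(-1)) {
    assert(dst.size() == src.size());
    std::vector<std::size_t> swapped;
    swapped.reserve(src.size());  // reserve up front so push_back cannot throw
    try {
        for (std::size_t i = 0; i < src.size(); ++i) {
            if (i == fail_at) {
                throw std::bad_alloc();  // simulated allocation failure
            }
            if (src[i].timestamp > dst[i].timestamp) {
                std::swap(dst[i], src[i]);
                swapped.push_back(i);
            }
        }
    } catch (...) {
        for (auto it = swapped.rbegin(); it != swapped.rend(); ++it) {
            std::swap(dst[*it], src[*it]);  // swap the old state back
        }
        throw;
    }
}
```

Injecting a failure at every index in turn, in the spirit of the allocation-failure stress test mentioned in this series, is a direct way to check that a failed call never leaves `dst` partially updated.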
Tomasz Grabiec	e09d186c7c	mutation_partition: Make intrusive sets ReversiblyMergeable	2016-03-21 21:49:52 +01:00
Tomasz Grabiec	f1a4feb1fc	mutation_partition: Make row_tombstones_entry ReversiblyMergeable	2016-03-21 19:26:24 +01:00
Tomasz Grabiec	e4a576a90f	mutation_partition: Make rows_entry ReversiblyMergeable	2016-03-21 19:26:24 +01:00
Tomasz Grabiec	aadcd75d89	mutation_partition: Make row_marker ReversiblyMergeable	2016-03-21 19:26:24 +01:00
Tomasz Grabiec	ea7c2dd085	mutation_partition: Make row ReversiblyMergeable	2016-03-21 19:26:24 +01:00
Tomasz Grabiec	c9d4f5a49c	atomic_cell_or_collection: Introduce as_atomic_cell_ref() Needed for setting the REVERT flag on existing cell.	2016-03-21 19:25:54 +01:00
Tomasz Grabiec	1ffe06165d	atomic_cell_hash: Specialize appending_hash<> for atomic_cell and collection_mutation	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	bfc6413414	atomic_cell: Add REVERT flag Needed to make atomic cells ReversiblyMergeable.	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	7fcfa97916	tombstone: Make ReversiblyMergeable	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	1407173186	Introduce the concept of ReversiblyMergeable	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	9fc7f8a5ed	mutation_partition: row: Add empty()	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	d5e66a5b0d	mutation_partition: row: Allow storing empty cells internally Currently only "set" storage could store empty cells, but not the "vector" one because there an empty cell has the meaning of being missing. To implement rollback, we need to be able to distinguish empty cells from missing ones. Solve by making vector storage use a bitmap for presence checking instead of emptiness. This adds 4 bytes to vector storage.	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	ed1e6515db	mutation_partition: Make row::merge() tolerate empty row The row may be empty and still have a set storage, in which case rbegin() dereference is undefined behavior.	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	184e2831e7	managed_bytes: Mark move-assignment noexcept	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	92d4cfc3ab	managed_bytes: Make copy assignment exception-safe	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	22d193ba9f	managed_bytes: Make linearization_context::forget() noexcept It is needed for noexcept destruction, which we need for exception safety in higher layers. According to [1], erase() only throws if key comparison throws, and in our case it doesn't. [1] http://en.cppreference.com/w/cpp/container/unordered_map/erase	2016-03-21 18:41:27 +01:00
Tomasz Grabiec	87d7279267	mutation: Add copy assignment operator We already have a copy constructor, so can have copy assignment as well.	2016-03-21 18:41:27 +01:00
Shlomi Livne	b7e338275b	fix centos local ami creation (revert some changes) in centos we do not have a version file created - revert this changes introduced when adding ubuntu ami creation Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <69c80dcfa7afe4f5db66dde2893d9253a86ac430.1458578004.git.shlomi@scylladb.com>	2016-03-21 18:41:40 +02:00
Asias He	208b7fa7ba	streaming: Simplify session completion logic Both the initiator and follower of a stream session knows how many transfer task and receive task the stream session contains in the preparation phase. They use the _transfers and _receivers map to track the tasks, like below: std::map<UUID, stream_transfer_task> _transfers; std::map<UUID, stream_receive_task> _receivers; A stream_transfer_task will send STREAM_MUTATION verb to transfer data with frozen_mutation, when all the STREAM_MUTATIONs are sent, it will send STREAM_MUTATION_DONE to tell the peer the stream_transfer_task is completed and remove the stream_transfer_task from _transfers map. The peer will remove the corresponding stream_receive_task in _receivers. We do not really need the COMPLETE_MESSAGE verb to notify the peer we have completed sending. It makes the session completion logic much simpler and cleaner if we do not depend on COMPLETE_MESSAGE verb. However, to be compatible with older version, we always send a COMPLETE_MESSAGE message and do nothing in the COMPLETE_MESSAGE handler and replies a ready future even if the stream_session is closed already. This way, node with older version will get a COMPLETE_MESSAGE message and manage to send a COMPLETE_MESSAGE message to new node as before. Message-Id: <1458540564-34277-2-git-send-email-asias@scylladb.com>	2016-03-21 16:58:03 +02:00
Pekka Enberg	4892a6ded9	build: Invoke Seastar build only once Make sure we invoke the Seastar ninja build only once from our own build process so that we don't have multiple ninjas racing with each other. Refs #1061. Message-Id: <1458563076-29502-1-git-send-email-penberg@scylladb.com>	2016-03-21 16:22:11 +02:00
Takuya ASADA	6edd909b00	dist: stop using '-p' option on lsblk since Ubuntu doesn't support it On scylla_setup interactive mode we are using lsblk to list candidate block devices for RAID, and the -p option is to print full device paths. Since the Ubuntu 14.04LTS version of lsblk doesn't support this option, we need to use non-full path names and complete the paths before passing them to scylla_raid_setup. Fixes #1030 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458325411-9870-1-git-send-email-syuu@scylladb.com>	2016-03-21 14:54:36 +02:00
Calle Wilund	5982c0ee10	Cql.g: Add extension function SCYLLA_TIMEUUID_LIST_INDEX Allows scylla sstable loader (cql) to do by-uuid updates to non-frozen lists.	2016-03-21 12:28:37 +00:00
Calle Wilund	5b570c417b	cql3::operation: Allow set_element to be "by uuid" (for lists) Just add an instantiation flag to keep track. Then choose the actual operation to perform in prepare.	2016-03-21 12:28:37 +00:00
Calle Wilund	71170f51a8	cql3::lists: Add setter_by_uuid operation Allows direct setting of list element by UUID key	2016-03-21 12:28:36 +00:00
Asias He	39992dd559	gossip: Sync gossip_digest.idl.hh and application_state.hh We did the clean up in idl/gossip_digest.idl.hh, but the patch to clean up gms/application_state.hh was never merged. To maintain compatibility with previous version of scylla, we can not change application_state.hh, instead change idl to be sync with application_state.hh. Message-Id: <3a78b159d5cb60bc65b354d323d163ce8528b36d.1458557948.git.asias@scylladb.com>	2016-03-21 13:07:22 +02:00
Pekka Enberg	bcdd034512	dist/ubuntu: Install wget package if it's not available The build scripts use wget so make sure it's actually installed on the machine. Message-Id: <1458554706-14558-1-git-send-email-penberg@scylladb.com>	2016-03-21 12:36:52 +02:00
Asias He	7acc9816d2	gossip: Handle unknown application_state when printing In case an unknown application_state is received, we should be able to handle it when printing. Message-Id: <98d2307359292e90c8925f38f67a74b69e45bebe.1458553057.git.asias@scylladb.com>	2016-03-21 11:59:04 +02:00
Asias He	28ccd866e2	streaming: Move ranges in stream_plan The ranges are not used afterwards. We can move instead of copy. Message-Id: <1458540564-34277-1-git-send-email-asias@scylladb.com>	2016-03-21 10:10:09 +01:00
Avi Kivity	e1e4766cc6	Merge "Ubuntu based AMI support" from Takuya "This provides Ubuntu based AMI support. With this patchset, you will able to run build_ami.sh on Ubuntu 14.04LTS."	2016-03-20 20:40:21 +02:00
Raphael Carvalho	de4b4e593d	db: better handling of failure in column_family::populate Improve handling of failure by saving first exception and ignoring the remaining futures. At the moment, code only throws first exception and doesn't care about any possible remaining future. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <383dc4445db09dd2fbce093d4609a0a0bc38a405.1458240398.git.raphaelsc@scylladb.com>	2016-03-20 17:33:20 +02:00
Avi Kivity	7869a48c31	Update scylla-ami submodule * dist/ami/files/scylla-ami 84bcd0d...56f1ab7 (2): > Ubuntu AMI support on scylla_install_ami > scylla_ami_setup is not POSIX sh compatible, change shebang to /bin/bash	2016-03-20 17:26:03 +02:00
Takuya ASADA	769204d41e	dist: allow more requests for i2 instances i2 instances has better performance than others, so allow more requests. Fixes #921 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458251067-1533-1-git-send-email-syuu@scylladb.com>	2016-03-20 17:24:52 +02:00
Tomasz Grabiec	c518e852ee	modificiation_statement: Use result_view::do_with() Reduces code duplication. Message-Id: <1458336592-22065-1-git-send-email-tgrabiec@scylladb.com>	2016-03-20 15:14:28 +02:00
Avi Kivity	6d031b4c6b	Merge seastar upstream * seastar 6a207e1...c193821 (6): > semaphore: allow wait() and signal() after broken() > run reactor::stop() only once > sharded: fix start with reference parameter > core: add asserts to rwlock > util/defer: Fix cancel() not being respected > tcp: Do not return accept until the connection is connected	2016-03-20 13:32:18 +02:00
Tomasz Grabiec	8134992024	mutation_partition: Add cell_entry constructor which makes an empty cell	2016-03-18 22:30:04 +01:00
Tomasz Grabiec	518e956736	mutation_partition: Make row::vector_to_set() exception-safe Currently allocation failure can leave the old row in a half-moved-from state and leak cell_entry objects.	2016-03-18 22:30:04 +01:00
Tomasz Grabiec	c91eefa183	mutation_partition: Unmark cell_entry's copy constructor as noexcept It was a mistake, it certainly may throw because it copies cells.	2016-03-18 22:30:04 +01:00
Glauber Costa	e52b869b25	fix small typo will sent -> will send Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20eaf0cea6fe14b03332547b7c4a3b85e9b619e7.1458325926.git.glauber@scylladb.com>	2016-03-18 20:34:22 +02:00
Takuya ASADA	a6cd085c38	dist: allow to run 'sudo scylla_ami_setup' for Ubuntu AMI Allows to run scylla_ami_setup from scylla-server.conf Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-03-18 05:57:50 +09:00
Takuya ASADA	7828023599	dist: launch scylla_ami_setup on Ubuntu AMI Since upstart does not have same behavior as systemd, we need to run scylla_io_setup and scylla_ami_setup in scylla-server.conf's pre-start stanza. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-03-18 05:57:50 +09:00
Takuya ASADA	93bf7bff8e	dist: fix broken scylla_install_pkg --local-pkg and --unstable on Ubuntu --local-pkg and --unstable arguments didn't handled on Ubuntu, support it. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-03-18 05:57:50 +09:00
Takuya ASADA	0c83b34d0c	dist: prevent to show up dialog on apt-get in scylla_raid_setup "apt-get -y install mdadm" shows up a dialog to select install mode of postfix, this will block scylla-ami-setup.service forever since it is running as background task, we need to prevent it. Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-03-18 05:57:50 +09:00
Takuya ASADA	b097ed6d75	dist: Ubuntu based AMI support This introduces Ubuntu AMI. Both CentOS AMI and Ubuntu AMI are need to build on same distribution, so build_ami.sh script automatically detect current distribution, and selects base AMI image. Fixes #998 Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2016-03-18 05:57:40 +09:00
Takuya ASADA	4cc589872d	dist: follow sysconfig setting when counting number of cpus on scylla_io_setup When NR_CPU >= 8, we disabled cpu0 for AMI on scylla_sysconfig_setup. But scylla_io_setup doesn't know that, try to assign NR_CPU queues, then scylla fails to start because queues > cpus. So on this fix scylla_io_setup checks sysconfig settings, if '--smp <n>' specified on SCYLLA_ARGS, use n to limit queue size. Also, when instance type is not supported pre-configured parameters, we need to passes --cpuset parameters to iotune. Otherwise iotune will run on a different set of CPUs, which may have different performance characteristics. Fixes #996, #1043, #1046 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458221762-10595-2-git-send-email-syuu@scylladb.com>	2016-03-17 16:44:46 +02:00
Takuya ASADA	6f71173827	dist: On scylla_sysconfig_setup, don't disable cpu0 on non-AMI environments Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458221762-10595-1-git-send-email-syuu@scylladb.com>	2016-03-17 16:44:45 +02:00
Benoît Canet	3b1d3d977d	exceptions: Shutdown communications on non file I/O errors Apply the same treatment to non file filesystem I/O errors. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-2-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:54 +02:00
Benoît Canet	1fb9a48ac5	exception: Optionally shutdown communication on I/O errors. I/O errors cannot be fixed by Scylla the only solution is to shutdown the database communications. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-1-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:52 +02:00
Pekka Enberg	69dacf9063	main: Fix broadcast_address and listen_address validation errors Fix the validation error message to look like this: Scylla version 666.development-20160316.49af399 starting ... WARN 2016-03-17 12:24:15,137 [shard 0] config - Option partitioner is not (yet) used. WARN 2016-03-17 12:24:15,138 [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum setting 10000; you may run out of file descriptors. ERROR 2016-03-17 12:24:15,138 [shard 0] init - Bad configuration: invalid 'listen_address': eth0: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> > (Invalid argument) Exiting on unhandled exception of type 'bad_configuration_error': std::exception Instead of: Exiting on unhandled exception of type 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >': Invalid argument Fixes #1051. Message-Id: <1458210329-4488-1-git-send-email-penberg@scylladb.com>	2016-03-17 14:59:00 +02:00
Tomasz Grabiec	b9af32c9d5	Merge branch 'pdziepak/fix-lsa-memory-accounting/v1' from seastar-dev.git Memory accounting fix from Paweł.	2016-03-17 12:55:21 +01:00
Paweł Dziepak	13849fd129	tests/lsa: add test for region groups Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-17 11:20:22 +00:00
Paweł Dziepak	ed53784cb6	tests/lsa: do not leak memory in large allocation test Large allocations test, unsurprisingly, allocates a lot of memory. Do not leak it so that any tests that are going to be run afterwards have still some memory left. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-17 11:19:13 +00:00
Paweł Dziepak	338fd34770	lsa: update _closed_occupancy after freeing all segments _closed_occupancy will be used when a region is removed from its region group, make sure that it is accurate. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-17 11:12:05 +00:00
Pekka Enberg	0434bc3d33	dist: Fix '--developer-mode' parsing in scylla_io_setup We need to support the following variations: --developer-mode true --developer-mode 1 --developer-mode=true --developer-mode=1 Fixes #1026. Message-Id: <1458203393-26658-1-git-send-email-penberg@scylladb.com>	2016-03-17 09:58:34 +01:00
Pekka Enberg	972fc6e014	main: Defer API server hooks until commitlog replay Defer registering services to the API server until commitlog has been replayed to ensure that nobody is able to trigger sstable operations via 'nodetool' before we are ready for them. Message-Id: <1458116227-4671-1-git-send-email-penberg@scylladb.com>	2016-03-17 10:04:35 +02:00
Takuya ASADA	95161d5db7	dist: add scylla-gdb.py on Ubuntu dbg package Fixes #969 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458150248-10632-1-git-send-email-syuu@scylladb.com>	2016-03-17 09:03:00 +02:00
Pekka Enberg	303dd76205	Merge "Fix debug messages for streaming session" from Glauber "One of the messages is printed twice, and one of the verbs is missing a message. That makes it hard to debug the session."	2016-03-17 08:11:50 +02:00
Glauber Costa	a3ebf640c6	stream_session: print debug message for STREAM_MUTATION For this verb(), we don't call get_session - and it doesn't look like we will. We currently have no debug message for this one, which makes it harder to debug the stream of messages. Print it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-16 22:09:46 -04:00
Glauber Costa	0ab4275893	stream_session: remove duplicated debug message Whenever we call get_session, that will print a debug message about the arrival of this new verb. Because we also print that explicitly in PREPARE_DONE, that message gets duplicated. That confuses poor developers who are, for a while, left wondering why is it that the sender is sending the message twice. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-16 22:04:25 -04:00
Glauber Costa	6a3872b355	sstables: do not assume mutation_reader will be kept alive Our sstables::mutation_reader has a specialization in which start and end ranges are passed as futures. That is needed because we may have to read the index file for those. This works well under the assumption that every time a mutation_reader will be created it will be used, since whoever is using it will surely keep the state of the reader alive. However, that assumption is no longer true - for a while. We use a reader interface for reading everything from mutations and sstables to cache entries, and when we create an sstable mutation_reader, that does not mean we'll use it. In fact we won't, if the read can be serviced first by a higher level entity. If that happens to be the case, the reader will be destructed. However, since it may take more time than that for the start and end futures to resolve, by the time they are resolved the state of the mutation reader will no longer be valid. The proposed fix for that is to only resolve the future inside mutation_reader's read() function. If that function is called, we can have a reasonable expectation that the caller object is being kept alive. A second way to fix this would be to force the mutation reader to be kept alive by transforming it into a shared pointer and acquiring a reference to itself. However, because the reader may turn out not to be used, the delayed read actually has the advantage of not even reading anything from the disk if there is no need for it. Also, because sstables can be compacted, we can't guarantee that the sst object itself, used in the resolution of start and end can be alive and that has the same problem. If we delay the calling of those, we will also solve a similar problem. We assume here that the outer reader is keeping the SSTable object alive. I must note that I have not reproduced this problem. What goes above is the result of the analysis we have made in #1036. 
That being the case, a thorough review is appreciated. Fixes #1036 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a7e4e722f76774d0b1f263d86c973061fb7fe2f2.1458135770.git.glauber@scylladb.com>	2016-03-16 17:51:02 +02:00
Nadav Har'El	02ba8ffbe8	Allow uncompression at end of file Asking to read from byte 100 when a file has 50 bytes is an obvious error. But what if we ask to read from byte 50? What if we ask to read 0 bytes at byte 50? :-) Before this patch, code which asked to read from the EOF position would get an exception. After this patch, it would simply read nothing, without error. This allows, for example, reading 0 bytes from position 0 on a file with 0 bytes, which apparently happened in issue #1039... A read which starts at a position higher than the EOF position still generates an exception. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458137867-10998-1-git-send-email-nyh@scylladb.com>	2016-03-16 17:50:23 +02:00
Nadav Har'El	73297c7872	Fix out-of-range exception when uncompressing 0 bytes The uncompression code reads the compressed chunks containing the bytes pos through pos + len - 1. This, however, is not correct when len==0, and pos + len - 1 may even be -1, causing an out-of-range exception when calling locate() to find the chunks containing this byte position. So we need to treat len==0 specially, and in this case we don't read anything, and don't need to locate() the chunks to read. Refs #1039. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458135987-10200-1-git-send-email-nyh@scylladb.com>	2016-03-16 15:54:48 +02:00
Takuya ASADA	f1d18e9980	dist: do not auto-start scylla-server job on Ubuntu package install time Fixes #1017 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458122424-22889-1-git-send-email-syuu@scylladb.com>	2016-03-16 13:55:12 +02:00
Pekka Enberg	2f519b9b34	tests/gossip_test: Fix messaging service stop This fixes gossip test shutdown similar to what commit `13ce48e` ("tests: Fix stop of storage_service in cql_test_env") did for CQL tests: gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed. Running 1 test case... [snip] unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested) seastar/tests/test-utils.cc(32): last checkpoint Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>	2016-03-16 13:15:18 +02:00
Asias He	2d50c71ca3	streaming: Handle cf is deleted after the deletion check The cf can be deleted after the cf deletion check. Handle this case as well. Use "warn" level to log if cf is missing. Although we can handle the case, but it is good to distinguish where the receiver of streaming applied all the stream mutations or not. We believe that the cf is missing because it was dropped, but it could be missing because of a bug or something we didn't anticipate here. Related patch: "streaming: Handle cf is deleted when sending STREAM_MUTATION_DONE" Fixes simple_add_new_node_while_schema_changes_test failure. Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>	2016-03-16 09:46:41 +01:00
Asias He	13ce48e775	tests: Fix stop of storage_service in cql_test_env In stop() of storage_service, it unregisters the verb handler. In the test, we stop messaging_service before storage_service. Fix it by deferring stop of messaging_service. Message-Id: <c71f7b5b46e475efe2fac4c1588460406f890176.1458086329.git.asias@scylladb.com>	2016-03-16 08:32:01 +02:00

1193 changed files with 122125 additions and 27282 deletions

									
										9

.github/ISSUE_TEMPLATE.md
									
										vendored
									
										Normal file
									
												
				@@ -0,0 +1,9 @@

				*Installation details*

				Scylla version (or git commit hash):

				Cluster size:

				OS (RHEL/CentOS/Ubuntu/AWS AMI):

				*Hardware details (for performance issues)*          Delete if unneeded

				Platform (physical/VM/cloud instance type/docker):

				Hardware: sockets= cores= hyperthreading= memory=

				Disks: (SSD/HDD, count)

9

.gitignore vendored


@@ -9,3 +9,12 @@ dist/ami/files/*.rpm
 dist/ami/variables.json
 dist/ami/scylla_deploy.sh
 *.pyc
 Cql.tokens
 .kdev4
 *.kdev4
 CMakeLists.txt.user
 .cache
 .tox
 *.egg-info
 __pycache__CMakeLists.txt.user
 .gdbinit

2

.gitmodules vendored


@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../seastar
+	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui

									
										140

CMakeLists.txt
									
										Normal file
									
												
				@@ -0,0 +1,140 @@

				##

				## For best results, first compile the project using the Ninja build-system.

				##

				cmake_minimum_required(VERSION 3.7)

				project(scylla)

				if (NOT DEFINED FOR_IDE AND NOT DEFINED ENV{FOR_IDE} AND NOT DEFINED ENV{CLION_IDE})

				    message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in IDEs, please define FOR_IDE to acknowledge this.")

				endif()

				# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.

				set(SEASTAR_INCLUDE_DIRS "seastar")

				# These paths are always available, since they're included in the repository. Additional DPDK headers are placed while

				# Seastar is built, and are captured in `SEASTAR_INCLUDE_DIRS` through parsing the Seastar pkg-config file (below).

				set(SEASTAR_DPDK_INCLUDE_DIRS

				        seastar/dpdk/lib/librte_eal/common/include

				        seastar/dpdk/lib/librte_eal/common/include/generic

				        seastar/dpdk/lib/librte_eal/common/include/x86

				        seastar/dpdk/lib/librte_ether)

				find_package(PkgConfig REQUIRED)

				set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/seastar/build/release:$ENV{PKG_CONFIG_PATH}")

				pkg_check_modules(SEASTAR seastar)

				find_package(Boost COMPONENTS filesystem program_options system thread)

				##

				## Populate the names of all source and header files in the indicated paths in a designated variable.

				##

				## When RECURSIVE is specified, directories are traversed recursively.

				##

				## Use: scan_scylla_source_directories(VAR my_result_var [RECURSIVE] PATHS [path1 path2 ...])

				##

				function (scan_scylla_source_directories)

				    set(options RECURSIVE)

				    set(oneValueArgs VAR)

				    set(multiValueArgs PATHS)

				    cmake_parse_arguments(args "${options}" "${oneValueArgs}" "${multiValueArgs}" "${ARGN}")

				    set(globs "")

				    foreach (dir ${args_PATHS})

				        list(APPEND globs "${dir}/*.cc" "${dir}/*.hh")

				    endforeach()

				    if (args_RECURSIVE)

				        set(glob_kind GLOB_RECURSE)

				    else()

				        set(glob_kind GLOB)

				    endif()

				    file(${glob_kind} var

				            ${globs})

				    set(${args_VAR} ${var} PARENT_SCOPE)

				endfunction()

				## Although Seastar is an external project, it is common enough to explore the sources while doing

				## Scylla development that we'll treat the Seastar sources as part of this project for easier navigation.

				scan_scylla_source_directories(

				        VAR SEASTAR_SOURCE_FILES

				        RECURSIVE

				        PATHS

				          seastar/core

				          seastar/http

				          seastar/json

				          seastar/net

				          seastar/rpc

				          seastar/tests

				          seastar/util)

				scan_scylla_source_directories(

				        VAR SCYLLA_ROOT_SOURCE_FILES

				        PATHS .)

				scan_scylla_source_directories(

				        VAR SCYLLA_SUB_SOURCE_FILES

				        RECURSIVE

				        PATHS

				          api

				          auth

				          cql3

				          db

				          dht

				          exceptions

				          gms

				          index

				          io

				          locator

				          message

				          repair

				          service

				          sstables

				          streaming

				          tests

				          thrift

				          tracing

				          transport

				          utils)

				scan_scylla_source_directories(

				        VAR SCYLLA_GEN_SOURCE_FILES

				        RECURSIVE

				        PATHS build/release/gen)

				set(SCYLLA_SOURCE_FILES

				        ${SCYLLA_ROOT_SOURCE_FILES}

				        ${SCYLLA_GEN_SOURCE_FILES}

				        ${SCYLLA_SUB_SOURCE_FILES})

				add_executable(scylla

				        ${SEASTAR_SOURCE_FILES}

				        ${SCYLLA_SOURCE_FILES})

				# Note that since CLion does not understand GCC6 concepts, we always disable them (even if users configure otherwise).

				# CLion seems to have trouble with `-U` (macro undefinition), so we do it this way instead.

				list(REMOVE_ITEM SEASTAR_CFLAGS "-DHAVE_GCC6_CONCEPTS")

				# If the Seastar pkg-config information is available, append to the default flags.

				#

				# For ease of browsing the source code, we always pretend that DPDK is enabled.

				target_compile_options(scylla PUBLIC

				        -std=gnu++14

				        -DHAVE_DPDK

				        -DHAVE_HWLOC

				        "${SEASTAR_CFLAGS}")

				# The order matters here: prefer the "static" DPDK directories to any dynamic paths from pkg-config. Some files are only

				# available dynamically, though.

				target_include_directories(scylla PUBLIC

				        .

				        ${SEASTAR_DPDK_INCLUDE_DIRS}

				        ${SEASTAR_INCLUDE_DIRS}

				        ${Boost_INCLUDE_DIRS}

				        build/release/gen)

									
										11

CONTRIBUTING.md
									
										Normal file
									
												
				@@ -0,0 +1,11 @@

				# Asking questions or requesting help

				Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.

				# Reporting an issue

				Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues.  Fill in as much information as you can in the issue template, especially for performance problems.

				# Contributing Code to Scylla

				To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.

									
										233

HACKING.md
									
										Normal file
									
												
				@@ -0,0 +1,233 @@

				# Guidelines for developing Scylla

				This document is intended to help developers and contributors to Scylla get started. The first part consists of general guidelines that make no assumptions about a development environment or tooling. The second part describes a particular environment and work-flow for exemplary purposes.

				## Overview

				This section covers some high-level information about the Scylla source code and work-flow.

				### Getting the source code

				Scylla uses [Git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules) to manage its dependency on Seastar and other tools. Be sure that all submodules are correctly initialized when cloning the project:

				```bash

				$ git clone https://github.com/scylladb/scylla

				$ cd scylla

				$ git submodule update --init --recursive

				```

				### Dependencies

				Scylla depends on the system package manager for its development dependencies.

				Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.

				### Build system

				**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
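One way to turn the memory note above into a job count is sketched here. It is only an illustration: `suggest_jobs` is a hypothetical helper that is not part of the Scylla tree, it assumes a Linux host with `/proc/meminfo` and `nproc`, and it budgets the conservative 3 GB linking figure for every job by default.

```bash
# Hypothetical helper (not in the Scylla repository): pick a Ninja job count
# that fits the per-thread memory budget quoted in the note above.

# suggest_jobs THREADS MEM_KB [GB_PER_JOB]
# Prints min(THREADS, floor(MEM_KB / GB_PER_JOB GiB)), but never less than 1.
# The body runs in a subshell, so its variables stay local.
suggest_jobs() (
    threads=$1
    mem_kb=$2
    gb_per_job=${3:-3}   # assumption: budget the 3 GB linking figure per job
    by_mem=$(( mem_kb / (gb_per_job * 1024 * 1024) ))
    jobs=$threads
    if [ "$by_mem" -lt "$jobs" ]; then jobs=$by_mem; fi
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    echo "$jobs"
)

# On Linux, feed it the real values (MemTotal is reported in kB).
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "suggested: ninja-build -j$(suggest_jobs "$(nproc)" "$mem_kb")"
fi
```

Passing `2` as the third argument uses the lower compile-only figure instead, at the risk of running short on memory during the link step.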

				Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.

				To build for the first time:

				```bash

				$ ./configure.py

				$ ninja-build

				```

				Afterwards, it is sufficient to just execute Ninja.

				The full suite of options for project configuration is available via

				```bash

				$ ./configure.py --help

				```

				The most important options are:

				- `--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.

				- `--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.

				Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.

				To save time -- for instance, to avoid compiling all unit tests -- you can also specify specific targets to Ninja. For example,

				```bash

				$ ninja-build build/release/tests/schema_change_test

				```

				### Unit testing

				Unit tests live in the `/tests` directory. As with application source files, test sources and executables are specified manually in `configure.py`, which needs to be updated when changes are made.

				A test target can be any executable. A non-zero return code indicates test failure.
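The exit-status convention can be illustrated with a small wrapper. This is a sketch only: `run_one` is a hypothetical name and has nothing to do with how `test.py` is actually implemented; it simply treats any command as a test and reports the result from its return code.

```bash
# Hypothetical illustration of the convention above: any executable is a test,
# exit status 0 means success, anything else means failure.

# run_one CMD [ARGS...]
run_one() {
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $1"
    else
        # In the else branch, $? still holds the status of the command above.
        echo "FAIL: $1 (exit status $?)"
    fi
}

run_one true                 # a "test" that succeeds
run_one sh -c 'exit 3'       # a "test" that fails with status 3
```

A real test binary would be passed the same way, for example `run_one build/release/tests/row_cache_test -- -c1 -m1G`.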

				Most tests in the Scylla repository are built using the [Boost.Test](http://www.boost.org/doc/libs/1_64_0/libs/test/doc/html/index.html) library. Utilities for writing tests with Seastar futures are also included.

				Run all tests through the test execution wrapper with

				```bash

				$ ./test.py --mode={debug,release}

				```

				The `--name` argument can be specified to run a particular test.

				Alternatively, you can execute the test executable directly. For example,

				```bash

				$ build/release/tests/row_cache_test -- -c1 -m1G

				```

				The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread and 1 GB of memory.

				### Preparing patches

				All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.

				Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/).

				### Running Scylla

				Once Scylla has been compiled, executing the (`debug` or `release`) target will start a running instance in the foreground:

				```bash

				$ build/release/scylla

				```

				The `scylla` executable requires a configuration file, `scylla.yaml`. By default, this is read from `$SCYLLA_HOME/conf/scylla.yaml`. A good starting point for development is located in the repository at `/conf/scylla.yaml`.

				For development, a directory at `$HOME/scylla` can be used for all Scylla-related files:

				```bash

				$ mkdir -p $HOME/scylla $HOME/scylla/conf

				$ cp conf/scylla.yaml $HOME/scylla/conf/scylla.yaml

				$ # Edit configuration options as appropriate

				$ SCYLLA_HOME=$HOME/scylla build/release/scylla

				```

				The `scylla.yaml` file in the repository by default writes all database data to `/var/lib/scylla`, which likely requires root access. Change the `data_file_directories` and `commitlog_directory` fields as appropriate.
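As a rough sketch, the two fields could point at directories under the development `SCYLLA_HOME`. The `data` and `commitlog` sub-directory names and the `print_dir_overrides` helper are assumptions made for this example; only the field names come from the text above. The helper just prints a YAML fragment to merge by hand into your copy of `scylla.yaml`.

```bash
# Hypothetical helper: print the two directory settings for a given base
# directory. The sub-directory names "data" and "commitlog" are assumptions.

# print_dir_overrides BASE_DIR
print_dir_overrides() {
    printf 'data_file_directories:\n    - %s/data\ncommitlog_directory: %s/commitlog\n' "$1" "$1"
}

# Default to the development layout described earlier in this section.
print_dir_overrides "${SCYLLA_HOME:-$HOME/scylla}"
```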

				Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.

				Additionally, when running on under-powered platforms like portable laptops, the `--overprovisioned` flag is useful.

				On a development machine, one might run Scylla as

				```bash

				$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes

				```

				### Branches and tags

				Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.

				Similarly, tags are used to pin-point precise release versions, including hot-fix versions like 1.5.4. These are named `scylla-1.5.4`, for example.

				Most development happens on the `master` branch. Release branches are cut from `master` based on time and/or features. When a patch against `master` fixes a serious issue like a node crash or data loss, it is backported to a particular release branch with `git cherry-pick` by the project maintainers.

				## Example: development on Fedora 25

				This section describes one possible work-flow for developing Scylla on a Fedora 25 system. It is presented as an example to help you to develop a work-flow and tools that you are comfortable with.

				### Preface

				This guide will be written from the perspective of a fictitious developer, Taylor Smith.

				### Git work-flow

				Having two Git remotes is useful:

				- A public clone of Scylla (`"public"`)

				- A private clone of Scylla (`"private"`) for in-progress work or work that is not yet ready to share

				The first step to contributing a change to Scylla is to create a local branch dedicated to it. For example, a feature that fixes a bug in the CQL statement for creating tables could be called `ts/cql_create_table_error/v1`. The branch name is prefaced by the developer's initials and has a suffix indicating that this is the first version. The version suffix is useful when branches are shared publicly and changes are requested on the mailing list. Having a branch for each version of the patch (or patch set) shared publicly makes it easier to reference and compare the history of a change.
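The naming scheme described above can be written down as two small helpers. Both function names are hypothetical and exist only for this sketch; they build and bump branch names of the form `initials/topic/vN` and do not invoke Git.

```bash
# Hypothetical helpers for the "<initials>/<topic>/v<N>" convention above.

# branch_name INITIALS TOPIC VERSION
branch_name() {
    printf '%s/%s/v%s\n' "$1" "$2" "$3"
}

# next_version BRANCH: turn ".../vN" into ".../v(N+1)".
# The body runs in a subshell, so its variables stay local.
next_version() (
    base=${1%/v*}      # everything before the trailing "/vN"
    n=${1##*/v}        # the version number N
    printf '%s/v%s\n' "$base" "$(( n + 1 ))"
)

branch_name ts cql_create_table_error 1
next_version ts/cql_create_table_error/v1
```

The output of either helper can be handed to `git checkout -b` when starting a new version of a patch set.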

				Setting the upstream branch of your development branch to `master` is a useful way to track your changes. You can do this with

				```bash

				$ git branch -u master ts/cql_create_table_error/v1

				```

				As a patch set is developed, you can periodically push the branch to the private remote to back-up work.

				Once the patch set is ready to be reviewed, push the branch to the public remote and prepare an email to the `scylladb-dev` mailing list. Including a link to the branch on your public remote allows for reviewers to quickly test and explore your changes.

				### Development environment and source code navigation

				Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.

				[CLion](https://www.jetbrains.com/clion/) is a commercial IDE that offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.

				Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).

				To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.

				[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many of the same issues as CLion's.

				### Distributed compilation: `distcc` and `ccache`

				Scylla's compilation times can be long. Two tools help somewhat:

				- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible

				- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines

				A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.

				Having a direct wired connection between the machines ensures that object files can be transmitted quickly and limits the overhead of remote compilation.

				The coordinator has been assigned the static IP address `10.0.0.1` and the passive compilation machine has been assigned `10.0.0.2`.

				On Fedora, installing the `ccache` package places symbolic links for `gcc` and `g++` in the `PATH`. This allows normal compilation to transparently invoke `ccache` for compilation and cache object files on the local file-system.

				Next, set `CCACHE_PREFIX` so that `ccache` is responsible for invoking `distcc` as necessary:

				```bash

				export CCACHE_PREFIX="distcc"

				```

				On each host, edit `/etc/sysconfig/distccd` to include the allowed coordinators and the total number of jobs that the machine should accept.

				This example is for the laptop, which has 2 physical cores (4 logical cores with hyper-threading):

				```

				OPTIONS="--allow 10.0.0.2 --allow 127.0.0.1 --jobs 4"

				```

				`10.0.0.2` has 8 physical cores (16 logical cores) and 64 GB of memory.

				As a rule of thumb, the number of jobs a machine is configured to accept should equal its number of native threads.

				Restart the `distccd` service on all machines.

				On the coordinator machine, edit `$HOME/.distcc/hosts` with the available hosts for compilation. Order of the hosts indicates preference.

				```

				10.0.0.2/16 localhost/2

				```

				In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine will be sent up to 2. Allowing for two extra threads on the host machine for coordination, we run compilation with `16 + 2 + 2 = 20` jobs in total: `ninja-build -j20`.
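The arithmetic above can be reproduced from the hosts line itself. This is a sketch under stated assumptions: `total_jobs` is a hypothetical helper, and it only understands plain `HOST/LIMIT` entries like the ones in this example, not the other option syntax that `distcc` host specifications allow.

```bash
# Hypothetical helper: sum the per-host job limits from a distcc hosts line
# and add a few extra jobs for coordination on the local machine.

# total_jobs "HOST/LIMIT HOST/LIMIT ..." [EXTRA]
# The body runs in a subshell, so its variables stay local.
total_jobs() (
    extra=${2:-2}      # two extra coordination threads, as in the text above
    total=0
    for spec in $1; do
        case $spec in
            */*) total=$(( total + ${spec##*/} )) ;;   # text after the last "/"
        esac
    done
    echo $(( total + extra ))
)

total_jobs "10.0.0.2/16 localhost/2"    # prints 20
```

The result can be passed straight to Ninja, for example `ninja-build -j"$(total_jobs "10.0.0.2/16 localhost/2")"`.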

				When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.

				One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section for a way to speed up this process.

				### Using the `gold` linker

				Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds up the linking process. On Fedora, you can switch the system linker using:

				```bash

				$ sudo alternatives --config ld

				```

				### Testing changes in Seastar with Scylla

				Sometimes Scylla development is closely tied to a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.

				One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:

				```bash

				$ cd $HOME/src/scylla

				$ cd seastar

				$ git remote add local /home/tsmith/src/seastar

				$ git remote update

				$ git checkout -t local/my_local_seastar_branch

				```

									
README.md
												
				@@ -1,29 +1,19 @@

				#Scylla

				# Scylla

				##Building Scylla

				## Quick-start

				In addition to required packages by Seastar, the following packages are required by Scylla.

				### Submodules

				Scylla uses submodules, so make sure you pull the submodules first by doing:

				```

				git submodule init

				git submodule update --recursive

				```bash

				$ git submodule update --init --recursive

				$ sudo ./install-dependencies.sh

				$ ./configure.py --mode=release

				$ ninja-build -j4 # Assuming 4 system threads.

				$ ./build/release/scylla

				$ # Rejoice!

				```

				### Building and Running Scylla on Fedora

				* Installing required packages:

				Please see [HACKING.md](HACKING.md) for detailed information on building and developing Scylla.

				```

				sudo yum install yaml-cpp-devel lz4-devel zlib-devel snappy-devel jsoncpp-devel thrift-devel antlr3-tool antlr3-C++-devel libasan libubsan gcc-c++ gnutls-devel ninja-build ragel libaio-devel cryptopp-devel xfsprogs-devel numactl-devel hwloc-devel libpciaccess-devel libxml2-devel python3-pyparsing

				```

				* Build Scylla

				```

				./configure.py --mode=release --with=scylla --disable-xen

				ninja-build build/release/scylla -j2 # you can use more cpus if you have tons of RAM

				```

				## Running Scylla

				* Run Scylla

				```

				@@ -83,14 +73,6 @@ Run the image with:

				docker run -p $(hostname -i):9042:9042 -i -t <image name>

				```

				## Contributing to Scylla

				Do not send pull requests.

				Send patches to the mailing list address scylladb-dev@googlegroups.com.

				Be sure to subscribe.

				In order for your patches to be merged, you must sign the Contributor's

				License Agreement, protecting your rights and ours.  See

				http://www.scylladb.com/opensource/cla/.

				[Guidelines for contributing](CONTRIBUTING.md)

SCYLLA-VERSION-GEN

@@ -1,6 +1,6 @@
 #!/bin/sh
 VERSION=666.development
 VERSION=2.1.6
 if test -f version
 then
@@ -10,7 +10,12 @@ else
 	DATE=$(date +%Y%m%d)
 	GIT_COMMIT=$(git log --pretty=format:'%h' -n 1)
 	SCYLLA_VERSION=$VERSION
 	SCYLLA_RELEASE=$DATE.$GIT_COMMIT
 	# For custom package builds, replace "0" with "counter.your_name",
 	# where counter starts at 1 and increments for successive versions.
 	# This ensures that the package manager will select your custom
 	# package over the standard release.
 	SCYLLA_BUILD=0
 	SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
 fi
 echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"

									
api/api-doc/cache_service.json
												
				@@ -397,6 +397,36 @@

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/key/hits_moving_avrage",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get key hits moving avrage",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_key_hits_moving_avrage",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/key/requests_moving_avrage",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get key requests moving avrage",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_key_requests_moving_avrage",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/key/size",

				      "operations": [

				@@ -487,6 +517,36 @@

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/row/hits_moving_avrage",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get row hits moving avrage",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_row_hits_moving_avrage",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/row/requests_moving_avrage",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get row requests moving avrage",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_row_requests_moving_avrage",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/row/size",

				      "operations": [

				@@ -577,6 +637,36 @@

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/counter/hits_moving_avrage",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get counter hits moving avrage",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_counter_hits_moving_avrage",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/counter/requests_moving_avrage",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get counter requests moving avrage",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_counter_requests_moving_avrage",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/cache_service/metrics/counter/size",

				      "operations": [

									
api/api-doc/collectd.json
												
				@@ -55,6 +55,57 @@

				                     "paramType":"query"

				                  }

				               ]

				            },

				            {

				               "method":"POST",

				               "summary":"Start reporting on one or more collectd metric",

				               "type":"void",

				               "nickname":"enable_collectd",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"pluginid",

				                     "description":"The plugin ID, describe the component the metric belongs to. Examples are cache, thrift, etc'. Regex are supported.The plugin ID, describe the component the metric belong to. Examples are: cache, thrift etc'. regex are supported",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"instance",

				                     "description":"The plugin instance typically #CPU indicating per CPU metric. Regex are supported. Omit for all",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"type",

				                     "description":"The plugin type, the type of the information. Examples are total_operations, bytes, total_operations, etc'. Regex are supported. Omit for all",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"type_instance",

				                     "description":"The plugin type instance, the specific metric. Exampls are total_writes, total_size, zones, etc'. Regex are supported, Omit for all",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"enable",

				                     "description":"set to true to enable all, anything else or omit to disable",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				@@ -63,10 +114,10 @@

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get a collectd value",

				               "summary":"Get a list of all collectd metrics and their status",

				               "type":"array",

				               "items":{

				                  "type":"type_instance_id"

				                  "type":"collectd_metric_status"

				               },

				               "nickname":"get_collectd_items",

				               "produces":[

				@@ -74,6 +125,25 @@

				               ],

				               "parameters":[

				               ]

				            },

				            {

				               "method":"POST",

				               "summary":"Enable or disable all collectd metrics",

				               "type":"void",

				               "nickname":"enable_all_collectd",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"enable",

				                     "description":"set to true to enable all, anything else or omit to disable",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      }

				@@ -113,6 +183,20 @@

				               }

				            }

				         }

				      },

				      "collectd_metric_status":{

				         "id":"collectd_metric_status",

				         "description":"Holds a collectd id and an enable flag",

				         "properties":{

				            "id":{

				               "description":"The metric ID",

				               "type":"type_instance_id"

				            },

				            "enable":{

				               "description":"Is the metric enabled",

				               "type":"boolean"

				            }

				         }

				      }

				   }

				}

									
api/api-doc/column_family.json
												
				@@ -78,11 +78,19 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"split_output",

				                     "description":"true if the output of the major compaction should be split in several sstables",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"bool",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -102,7 +110,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -129,7 +137,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -153,7 +161,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -180,7 +188,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -204,7 +212,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -244,7 +252,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -271,7 +279,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -298,7 +306,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -317,7 +325,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -349,7 +357,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -381,7 +389,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -405,7 +413,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -432,7 +440,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -459,7 +467,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -491,7 +499,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -518,7 +526,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -545,7 +553,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -569,7 +577,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -593,7 +601,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -633,7 +641,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -673,7 +681,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -713,7 +721,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -753,7 +761,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -793,7 +801,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -833,7 +841,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -873,7 +881,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -916,7 +924,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -943,7 +951,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -970,7 +978,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -994,7 +1002,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1034,7 +1042,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1058,7 +1066,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1094,14 +1102,14 @@

				               "method":"GET",

				               "summary":"Get read latency histogram",

				               "$ref": "#/utils/histogram",

				               "nickname":"get_read_latency_histogram",

				               "nickname":"get_read_latency_histogram_depricated",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1121,6 +1129,49 @@

				               "items":{

				                  "$ref": "#/utils/histogram"

				               },

				               "nickname":"get_all_read_latency_histogram_depricated",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/column_family/metrics/read_latency/moving_average_histogram/{name}",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get read latency moving avrage histogram",

				               "$ref": "#/utils/rate_moving_average_and_histogram",

				               "nickname":"get_read_latency_histogram",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/column_family/metrics/read_latency/moving_average_histogram/",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get read latency moving avrage histogram from all column family",

				               "type":"array",

				               "items":{

				                  "$ref": "#/utils/rate_moving_average_and_histogram"

				               },

				               "nickname":"get_all_read_latency_histogram",

				               "produces":[

				                  "application/json"

				@@ -1160,7 +1211,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1200,7 +1251,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1224,7 +1275,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1260,14 +1311,14 @@

				               "method":"GET",

				               "summary":"Get write latency histogram",

				               "$ref": "#/utils/histogram",

				               "nickname":"get_write_latency_histogram",

				               "nickname":"get_write_latency_histogram_depricated",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1287,6 +1338,49 @@

				               "items":{

				                  "$ref": "#/utils/histogram"

				               },

				               "nickname":"get_all_write_latency_histogram_depricated",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/column_family/metrics/write_latency/moving_average_histogram/{name}",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get write latency moving average histogram",

				               "$ref": "#/utils/rate_moving_average_and_histogram",

				               "nickname":"get_write_latency_histogram",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/column_family/metrics/write_latency/moving_average_histogram/",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get write latency moving average histogram of all column family",

				               "type":"array",

				               "items":{

				                  "$ref": "#/utils/rate_moving_average_and_histogram"

				               },

				               "nickname":"get_all_write_latency_histogram",

				               "produces":[

				                  "application/json"

				@@ -1326,7 +1420,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1366,7 +1460,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1406,7 +1500,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1446,7 +1540,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1486,7 +1580,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1526,7 +1620,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1566,7 +1660,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1606,7 +1700,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1646,7 +1740,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1686,7 +1780,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1726,7 +1820,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1766,7 +1860,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1806,7 +1900,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1846,7 +1940,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1886,7 +1980,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1926,7 +2020,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -1966,7 +2060,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2006,7 +2100,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2030,7 +2124,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2070,7 +2164,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2110,7 +2204,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2150,7 +2244,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2190,7 +2284,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2214,7 +2308,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2238,7 +2332,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2265,7 +2359,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2292,7 +2386,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2319,7 +2413,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2346,7 +2440,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2415,7 +2509,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2439,7 +2533,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2463,7 +2557,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2487,7 +2581,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2511,7 +2605,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2535,7 +2629,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2559,7 +2653,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2583,7 +2677,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2607,7 +2701,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2631,7 +2725,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2655,7 +2749,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				@@ -2679,7 +2773,7 @@

				               "parameters":[

				                  {

				                     "name":"name",

				                     "description":"The column family name in keysspace:name format",

				                     "description":"The column family name in keyspace:name format",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

									
api/api-doc/endpoint_snitch_info.json (8 changed lines)

				@@ -21,8 +21,8 @@

				               "parameters":[

				                  {

				                     "name":"host",

				                     "description":"The host name",

				                     "required":true,

				                     "description":"The host name. If absent, the local server broadcast/listen address is used",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				@@ -45,8 +45,8 @@

				               "parameters":[

				                  {

				                     "name":"host",

				                     "description":"The host name",

				                     "required":true,

				                     "description":"The host name. If absent, the local server broadcast/listen address is used",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

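The two hunks above change the `host` query parameter from required to optional; when it is absent, the description says the local server broadcast/listen address is used. A minimal sketch of how a client might build such a request URL, assuming a hypothetical helper name, base URL and example path (the actual operation paths are outside the hunks shown, so none of these come from the commit):

```python
from urllib.parse import urlencode


def snitch_url(base_url, path, host=None):
    """Build a request URL, adding the optional "host" query parameter.

    base_url and path are placeholders supplied by the caller; only the
    "host" parameter name and its optional status come from the diff.
    """
    url = base_url.rstrip("/") + path
    if host is not None:
        # "host" is sent only when the caller wants a specific node.
        url += "?" + urlencode({"host": host})
    return url


# Hypothetical base URL and path, for illustration only.
print(snitch_url("http://localhost:10000", "/snitch/datacenter"))
print(snitch_url("http://localhost:10000", "/snitch/datacenter", host="10.0.0.1"))
```

Leaving `host` out relies on the server-side default described in the updated parameter description.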
									
api/api-doc/failure_detector.json (33 changed lines)

				@@ -42,6 +42,25 @@

				            }

				         ]

				      },

				      {

				         "path":"/failure_detector/endpoint_phi_values",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get end point phi values",

				               "type":"array",

				               "items":{

				                  "type":"endpoint_phi_values"

				               },

				               "nickname":"get_endpoint_phi_values",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/failure_detector/endpoints/",

				         "operations":[

				@@ -202,6 +221,20 @@

				                    "description": "The application state version"

				                }

				            }

				        },

				        "endpoint_phi_value": {

				            "id" : "endpoint_phi_value",

				            "description": "Holds phi value for a single end point",

				            "properties": {

				                "phi": {

				                    "type": "double",

				                    "description": "Phi value"

				                },

				                "endpoint": {

				                    "type": "string",

				                    "description": "end point address"

				                }

				            }

				        }

				    }

				}
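
The hunks above add a `/failure_detector/endpoint_phi_values` operation and an `endpoint_phi_value` model with a `phi` double and an `endpoint` string. A minimal sketch of consuming such a response, assuming a hand-written sample payload, a hypothetical helper name and an arbitrary threshold (none of which come from the commit):

```python
import json

# Hypothetical sample payload shaped like the endpoint_phi_value model
# (fields "phi" and "endpoint"); the values are invented for illustration.
SAMPLE_RESPONSE = """
[
  {"endpoint": "10.0.0.1", "phi": 0.7},
  {"endpoint": "10.0.0.2", "phi": 9.3}
]
"""


def suspect_endpoints(payload, threshold):
    """Return the endpoints whose phi value exceeds the threshold."""
    entries = json.loads(payload)
    return [e["endpoint"] for e in entries if float(e["phi"]) > threshold]


print(suspect_endpoints(SAMPLE_RESPONSE, threshold=8.0))
```

The threshold here is only a stand-in; the value a deployment treats as a failure signal is a configuration matter not covered by this diff.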

									
api/api-doc/storage_proxy.json (137 changed lines)

				@@ -716,6 +716,36 @@

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/read/timeouts_rates",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get read metrics rates",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_read_metrics_timeouts_rates",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/read/unavailables_rates",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get read metrics rates",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_read_metrics_unavailables_rates",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/read/histogram",

				      "operations": [

				@@ -723,7 +753,7 @@

				          "method": "GET",

				          "summary": "Get read metrics",

				          "$ref": "#/utils/histogram",

				          "nickname": "get_read_metrics_latency_histogram",

				          "nickname": "get_read_metrics_latency_histogram_depricated",

				          "produces": [

				            "application/json"

				          ],

				@@ -738,6 +768,36 @@

				          "method": "GET",

				          "summary": "Get range metrics",

				          "$ref": "#/utils/histogram",

				          "nickname": "get_range_metrics_latency_histogram_depricated",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/read/moving_average_histogram",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get read metrics",

				          "$ref": "#/utils/rate_moving_average_and_histogram",

				          "nickname": "get_read_metrics_latency_histogram",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/range/moving_average_histogram",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get range metrics rate and histogram",

				          "$ref": "#/utils/rate_moving_average_and_histogram",

				          "nickname": "get_range_metrics_latency_histogram",

				          "produces": [

				            "application/json"

				@@ -776,6 +836,36 @@

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/range/timeouts_rates",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get range metrics rates",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_range_metrics_timeouts_rates",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/range/unavailables_rates",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get range metrics rates",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_range_metrics_unavailables_rates",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/write/timeouts",

				      "operations": [

				@@ -806,6 +896,36 @@

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/write/timeouts_rates",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get write metrics rates",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_write_metrics_timeouts_rates",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/write/unavailables_rates",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get write metrics rates",

				          "type": "#/utils/rate_moving_average",

				          "nickname": "get_write_metrics_unavailables_rates",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/write/histogram",

				      "operations": [

				@@ -813,6 +933,21 @@

				          "method": "GET",

				          "summary": "Get write metrics",

				          "$ref": "#/utils/histogram",

				          "nickname": "get_write_metrics_latency_histogram_depricated",

				          "produces": [

				            "application/json"

				          ],

				          "parameters": []

				        }

				      ]

				    },

				    {

				      "path": "/storage_proxy/metrics/write/moving_average_histogram",

				      "operations": [

				        {

				          "method": "GET",

				          "summary": "Get write metrics",

				          "$ref": "#/utils/rate_moving_average_and_histogram",

				          "nickname": "get_write_metrics_latency_histogram",

				          "produces": [

				            "application/json"

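The hunks above add `timeouts_rates`, `unavailables_rates` and `moving_average_histogram` paths under `/storage_proxy/metrics/` for each of `read`, `range` and `write`. A minimal sketch that enumerates those nine paths, assuming a hypothetical helper name and a placeholder base URL (the host and port are not taken from the commit):

```python
from itertools import product

# Operation kinds and metric suffixes as they appear in the added paths.
OPERATIONS = ("read", "range", "write")
NEW_METRICS = ("timeouts_rates", "unavailables_rates", "moving_average_histogram")


def new_metric_urls(base_url):
    """Return the URLs of the storage_proxy metric paths added in this diff."""
    base = base_url.rstrip("/")
    return [
        f"{base}/storage_proxy/metrics/{op}/{metric}"
        for op, metric in product(OPERATIONS, NEW_METRICS)
    ]


# Placeholder base URL, for illustration only.
for url in new_metric_urls("http://localhost:10000"):
    print(url)
```

All nine operations are declared with `"method": "GET"` and an empty parameter list, so a plain GET against each URL is what the spec describes.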
									
api/api-doc/storage_service.json (108 changed lines)

				@@ -177,6 +177,22 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/scylla_release_version",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Fetch a string representation of the Scylla version.",

				               "type":"string",

				               "nickname":"get_scylla_release_version",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/schema_version",

				         "operations":[

				@@ -936,6 +952,22 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/force_terminate_repair",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Force terminate all repair sessions",

				               "type":"void",

				               "nickname":"force_terminate_all_repair_sessions_new",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/decommission",

				         "operations":[

				@@ -1185,11 +1217,12 @@

				               ],

				               "parameters":[

				                  {

				                     "name":"non_system",

				                     "description":"When set to true limit to non system",

				                     "name":"type",

				                     "description":"Which keyspaces to return",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "type":"string",

				                     "enum": [ "all", "user", "non_local_strategy" ],

				                     "paramType":"query"

				                  }

				               ]

				@@ -1720,6 +1753,57 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/slow_query",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Set slow query parameter",

				               "type":"void",

				               "nickname":"set_slow_query",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"enable",

				                     "description":"set it to true to enable, anything else to disable",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"ttl",

				                     "description":"TTL in seconds",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"long",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"threshold",

				                     "description":"Slow query record threshold in microseconds",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"long",

				                     "paramType":"query"

				                  }

				               ]

				            },

				            {

				               "method":"GET",

				               "summary":"Returns the slow query record configuration.",

				               "type":"slow_query_info",

				               "nickname":"get_slow_query_info",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/auto_compaction/{keyspace}",

				         "operations":[

				@@ -2117,6 +2201,24 @@

				            }

				         }

				      },

				      "slow_query_info": {

				         "id":"slow_query_info",

				         "description":"Slow query triggering information",

				         "properties":{

				            "enable":{

				               "type":"boolean",

				               "description":"Is slow query logging enable or disable"

				            },

				            "ttl":{

				               "type":"long",

				               "description":"The slow query TTL in seconds"

				            },

				            "threshold":{

				               "type":"long",

				               "description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"

				            }

				         }

				      },

				      "endpoint_detail":{

				         "id":"endpoint_detail",

				         "description":"Endpoint detail",

									
										39

api/api-doc/utils.json
									
												
				@@ -65,6 +65,41 @@

				               "description":"The series of values to which the counts in `buckets` correspond"

				            }

				         }

				      }

				   }

				      },

				    "rate_moving_average": {

				         "id":"rate_moving_average",

				         "description":"A meter metric which measures mean throughput and one, five, and fifteen-minute exponentially-weighted moving average throughputs",

				         "properties":{

				             "rates": {

				               "type":"array",

				               "items":{

				                  "type":"double"

				               },

				               "description":"One, five and fifteen mintues rates"

				            },

				            "mean_rate": {

				               "type":"double",

				               "description":"The mean rate from startup"

				            },

				            "count": {

				               "type":"long",

				               "description":"Total number of events from startup"

				            }

				         }

				    },

				    "rate_moving_average_and_histogram": {

				         "id":"rate_moving_average_and_histogram",

				         "description":"A timer metric which aggregates timing durations and provides duration statistics, plus throughput statistics",

				         "properties":{

				            "meter": {

				               "type":"rate_moving_average",

				               "description":"The metric rate moving average"

				            },

				            "hist": {

				               "type":"histogram",

				               "description":"The metric histogram"

				            }

				         }

				    }

				  }

				}

									
										12

api/api.cc
									
												
				@@ -49,7 +49,7 @@ static std::unique_ptr<reply> exception_reply(std::exception_ptr eptr) {

				        throw bad_param_exception(ex.what());

				    }

				    // We are never going to get here

				    return std::make_unique<reply>();

				    throw std::runtime_error("exception_reply");

				}

				future<> set_server_init(http_context& ctx) {

				@@ -61,10 +61,10 @@ future<> set_server_init(http_context& ctx) {

				                new content_replace("html")));

				        r.add(GET, url("/ui").remainder("path"), new httpd::directory_handler(ctx.api_dir,

				                new content_replace("html")));

				        rb->set_api_doc(r);

				        rb->register_function(r, "system",

				                "The system related API");

				        set_system(ctx, r);

				        rb->set_api_doc(r);

				    });

				}

				@@ -83,6 +83,10 @@ future<> set_server_storage_service(http_context& ctx) {

				    return register_api(ctx, "storage_service", "The storage service API", set_storage_service);

				}

				future<> set_server_snitch(http_context& ctx) {

				    return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);

				}

				future<> set_server_gossip(http_context& ctx) {

				    return register_api(ctx, "gossiper",

				                "The gossiper API", set_gossiper);

				@@ -118,10 +122,6 @@ future<> set_server_gossip_settle(http_context& ctx) {

				        rb->register_function(r, "cache_service",

				                "The cache service API");

				        set_cache_service(ctx,r);

				        rb->register_function(r, "endpoint_snitch_info",

				                "The endpoint snitch info API");

				        set_endpoint_snitch(ctx, r);

				    });

				}

									
										88

api/api.hh
									
												
				@@ -29,6 +29,7 @@

				#include "utils/histogram.hh"

				#include "http/exception.hh"

				#include "api_init.hh"

				#include "seastarx.hh"

				namespace api {

				@@ -110,61 +111,49 @@ future<json::json_return_type>  sum_stats(distributed<T>& d, V F::*f) {

				    });

				}

				inline double pow2(double a) {

				    return a * a;

				}

				// FIXME: Move to utils::ihistogram::operator+=()

				inline utils::ihistogram add_histogram(utils::ihistogram res,

				        const utils::ihistogram& val) {

				    if (res.count == 0) {

				        return val;

				    }

				    if (val.count == 0) {

				        return std::move(res);

				    }

				    if (res.min > val.min) {

				        res.min = val.min;

				    }

				    if (res.max < val.max) {

				        res.max = val.max;

				    }

				    double ncount = res.count + val.count;

				    // To get an estimated sum we take the estimated mean

				    // and multiply it by the true count

				    res.sum = res.sum + val.mean * val.count;

				    double a = res.count/ncount;

				    double b = val.count/ncount;

				    double mean =  a * res.mean + b * val.mean;

				    res.variance = (res.variance + pow2(res.mean - mean) )* a +

				            (val.variance + pow2(val.mean -mean))* b;

				    res.mean = mean;

				    res.count = res.count + val.count;

				    for (auto i : val.sample) {

				        res.sample.push_back(i);

				    }

				    return res;

				}

				inline

				httpd::utils_json::histogram to_json(const utils::ihistogram& val) {

				    httpd::utils_json::histogram h;

				    h = val;

				    h.sum = val.estimated_sum();

				    return h;

				}

				inline

				httpd::utils_json::rate_moving_average meter_to_json(const utils::rate_moving_average& val) {

				    httpd::utils_json::rate_moving_average m;

				    m = val;

				    return m;

				}

				inline

				httpd::utils_json::rate_moving_average_and_histogram timer_to_json(const utils::rate_moving_average_and_histogram& val) {

				    httpd::utils_json::rate_moving_average_and_histogram h;

				    h.hist = to_json(val.hist);

				    h.meter = meter_to_json(val.rate);

				    return h;

				}

				template<class T, class F>

				future<json::json_return_type>  sum_histogram_stats(distributed<T>& d, utils::ihistogram F::*f) {

				future<json::json_return_type>  sum_histogram_stats(distributed<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return p.get_stats().*f;}, utils::ihistogram(),

				            add_histogram).then([](const utils::ihistogram& val) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).hist;}, utils::ihistogram(),

				            std::plus<utils::ihistogram>()).then([](const utils::ihistogram& val) {

				        return make_ready_future<json::json_return_type>(to_json(val));

				    });

				}

				template<class T, class F>

				future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {

				        return make_ready_future<json::json_return_type>(timer_to_json(val));

				    });

				}

				inline int64_t min_int64(int64_t a, int64_t b) {

				    return std::min(a,b);

				}

				@@ -178,33 +167,36 @@ inline int64_t max_int64(int64_t a, int64_t b) {

				 * It combines the total and the subset for the ratio, and its

				 * to_json method returns the ratio sub/total

				 */

				struct ratio_holder : public json::jsonable {

				    double total = 0;

				    double sub = 0;

				template<typename T>

				struct basic_ratio_holder : public json::jsonable {

				    T total = 0;

				    T sub = 0;

				    virtual std::string to_json() const {

				        if (total == 0) {

				            return "0";

				        }

				        return std::to_string(sub/total);

				    }

				    ratio_holder() = default;

				    ratio_holder& add(double _total, double _sub) {

				    basic_ratio_holder() = default;

				    basic_ratio_holder& add(T _total, T _sub) {

				        total += _total;

				        sub += _sub;

				        return *this;

				    }

				    ratio_holder(double _total, double _sub) {

				    basic_ratio_holder(T _total, T _sub) {

				        total = _total;

				        sub = _sub;

				    }

				    ratio_holder& operator+=(const ratio_holder& a) {

				    basic_ratio_holder<T>& operator+=(const basic_ratio_holder<T>& a) {

				        return add(a.total, a.sub);

				    }

				    friend ratio_holder operator+(ratio_holder a, const ratio_holder& b) {

				    friend basic_ratio_holder<T> operator+(basic_ratio_holder a, const basic_ratio_holder<T>& b) {

				        return a += b;

				    }

				};

				typedef basic_ratio_holder<double>  ratio_holder;

				typedef basic_ratio_holder<int64_t> integral_ratio_holder;

				class unimplemented_exception : public base_exception {

				public:

									
										1

api/api_init.hh
									
												
				@@ -38,6 +38,7 @@ struct http_context {

				};

				future<> set_server_init(http_context& ctx);

				future<> set_server_snitch(http_context& ctx);

				future<> set_server_storage_service(http_context& ctx);

				future<> set_server_gossip(http_context& ctx);

				future<> set_server_load_sstable(http_context& ctx);

									
										68

api/cache_service.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -177,6 +177,20 @@ void set_cache_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(0);

				    });

				    cs::get_key_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

				        // See above

				        return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));

				    });

				    cs::get_key_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

				        // See above

				        return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));

				    });

				    cs::get_key_size.set(r, [] (std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

				@@ -194,41 +208,57 @@ void set_cache_service(http_context& ctx, routes& r) {

				    });

				    cs::get_row_capacity.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, 0, [](const column_family& cf) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().get_cache_tracker().region().occupancy().used_space();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_hits.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, 0, [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits;

				        }, std::plus<int64_t>());

				        return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_requests.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, 0, [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits + cf.get_row_cache().stats().misses;

				        }, std::plus<int64_t>());

				        return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_hit_rate.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), [](const column_family& cf) {

				            return ratio_holder(cf.get_row_cache().stats().hits + cf.get_row_cache().stats().misses,

				                    cf.get_row_cache().stats().hits);

				            return ratio_holder(cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count(),

				                    cf.get_row_cache().stats().hits.count());

				        }, std::plus<ratio_holder>());

				    });

				    cs::get_row_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cs::get_row_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate() + cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cs::get_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        // In origin row size is the weighted size.

				        // We currently do not support weights, so we use num entries instead

				        return map_reduce_cf(ctx, 0, [](const column_family& cf) {

				            return cf.get_row_cache().num_entries();

				            return cf.get_row_cache().partitions();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, 0, [](const column_family& cf) {

				            return cf.get_row_cache().num_entries();

				            return cf.get_row_cache().partitions();

				        }, std::plus<uint64_t>());

				    });

				@@ -264,6 +294,20 @@ void set_cache_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(0);

				    });

				    cs::get_counter_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

				        // See above

				        return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));

				    });

				    cs::get_counter_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

				        // See above

				        return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));

				    });

				    cs::get_counter_size.set(r, [] (std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

									
										2

api/cache_service.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										58

api/collectd.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -25,10 +25,14 @@

				#include "core/scollectd_api.hh"

				#include "endian.h"

				#include <boost/range/irange.hpp>

				#include <regex>

				namespace api {

				using namespace scollectd;

				using namespace httpd;

				using namespace json;

				namespace cd = httpd::collectd_json;

				static auto transformer(const std::vector<collectd_value>& values) {

				@@ -36,19 +40,27 @@ static auto transformer(const std::vector<collectd_value>& values) {

				    for (auto v: values) {

				        switch (v._type) {

				        case scollectd::data_type::GAUGE:

				            collected_value.values.push(v.u._d);

				            collected_value.values.push(v.d());

				            break;

				        case scollectd::data_type::DERIVE:

				            collected_value.values.push(v.u._i);

				            collected_value.values.push(v.i());

				            break;

				        default:

				            collected_value.values.push(v.u._ui);

				            collected_value.values.push(v.ui());

				            break;

				        }

				    }

				    return collected_value;

				}

				static const char* str_to_regex(const sstring& v) {

				    if (v != "") {

				        return v.c_str();

				    }

				    return ".*";

				}

				void set_collectd(http_context& ctx, routes& r) {

				    cd::get_collectd.set(r, [&ctx](std::unique_ptr<request> req) {

				@@ -72,7 +84,7 @@ void set_collectd(http_context& ctx, routes& r) {

				    });

				    cd::get_collectd_items.set(r, [](const_req req) {

				        std::vector<cd::type_instance_id> res;

				        std::vector<cd::collectd_metric_status> res;

				        auto ids = scollectd::get_collectd_ids();

				        for (auto i: ids) {

				            cd::type_instance_id id;

				@@ -80,10 +92,44 @@ void set_collectd(http_context& ctx, routes& r) {

				            id.plugin_instance = i.plugin_instance();

				            id.type = i.type();

				            id.type_instance = i.type_instance();

				            res.push_back(id);

				            cd::collectd_metric_status it;

				            it.id = id;

				            it.enable = scollectd::is_enabled(i);

				            res.push_back(it);

				        }

				        return res;

				    });

				    cd::enable_collectd.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {

				        std::regex plugin(req->param["pluginid"].c_str());

				        std::regex instance(str_to_regex(req->get_query_param("instance")));

				        std::regex type(str_to_regex(req->get_query_param("type")));

				        std::regex type_instance(str_to_regex(req->get_query_param("type_instance")));

				        bool enable = strcasecmp(req->get_query_param("enable").c_str(), "true") == 0;

				        return smp::invoke_on_all([enable, plugin, instance, type, type_instance]() {

				            for (auto id: scollectd::get_collectd_ids()) {

				                if (std::regex_match(std::string(id.plugin()), plugin) &&

				                        std::regex_match(std::string(id.plugin_instance()), instance) &&

				                        std::regex_match(std::string(id.type()), type) &&

				                        std::regex_match(std::string(id.type_instance()), type_instance)) {

				                    scollectd::enable(id, enable);

				                }

				            }

				        }).then([] {

				            return json::json_return_type(json_void());

				        });

				    });

				    cd::enable_all_collectd.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {

				        bool enable = strcasecmp(req->get_query_param("enable").c_str(), "true") == 0;

				        return smp::invoke_on_all([enable] {

				            for (auto id: scollectd::get_collectd_ids()) {

				                scollectd::enable(id, enable);

				            }

				        }).then([] {

				            return json::json_return_type(json_void());

				        });

				    });

				}

				}

									
										2

api/collectd.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										358

api/column_family.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -24,7 +24,7 @@

				#include <vector>

				#include "http/exception.hh"

				#include "sstables/sstables.hh"

				#include "sstables/estimated_histogram.hh"

				#include "utils/estimated_histogram.hh"

				#include <algorithm>

				namespace api {

				@@ -40,7 +40,7 @@ const utils::UUID& get_uuid(const sstring& name, const database& db) {

				    if (pos == sstring::npos) {

				        pos  = name.find(":");

				        if (pos == sstring::npos) {

				            throw bad_param_exception("Column family name should be in keyspace::column_family format");

				            throw bad_param_exception("Column family name should be in keyspace:column_family format");

				        }

				        end = pos + 1;

				    } else {

				@@ -77,14 +77,14 @@ future<json::json_return_type>  get_cf_stats(http_context& ctx,

				}

				static future<json::json_return_type>  get_cf_stats_count(http_context& ctx, const sstring& name,

				        utils::ihistogram column_family::stats::*f) {

				        utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    return map_reduce_cf(ctx, name, int64_t(0), [f](const column_family& cf) {

				        return (cf.get_stats().*f).count;

				        return (cf.get_stats().*f).hist.count;

				    }, std::plus<int64_t>());

				}

				static future<json::json_return_type>  get_cf_stats_sum(http_context& ctx, const sstring& name,

				        utils::ihistogram column_family::stats::*f) {

				        utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([uuid, f](database& db) {

				        // Histogram information is a sample of the actual load

				@@ -92,7 +92,7 @@ static future<json::json_return_type>  get_cf_stats_sum(http_context& ctx, const

				        // with count. The information is gathered in nanoseconds,

				        // but reported in microseconds

				        column_family& cf = db.find_column_family(uuid);

				        return ((cf.get_stats().*f).count/1000.0) * (cf.get_stats().*f).mean;

				        return ((cf.get_stats().*f).hist.count/1000.0) * (cf.get_stats().*f).hist.mean;

				    }, 0.0, std::plus<double>()).then([](double res) {

				        return make_ready_future<json::json_return_type>((int64_t)res);

				    });

				@@ -100,28 +100,29 @@ static future<json::json_return_type>  get_cf_stats_sum(http_context& ctx, const

				static future<json::json_return_type>  get_cf_stats_count(http_context& ctx,

				        utils::ihistogram column_family::stats::*f) {

				        utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    return map_reduce_cf(ctx, int64_t(0), [f](const column_family& cf) {

				        return (cf.get_stats().*f).count;

				        return (cf.get_stats().*f).hist.count;

				    }, std::plus<int64_t>());

				}

				static future<json::json_return_type>  get_cf_histogram(http_context& ctx, const sstring& name,

				        utils::ihistogram column_family::stats::*f) {

				        utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    utils::UUID uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const database& p) {return p.find_column_family(uuid).get_stats().*f;},

				    return ctx.db.map_reduce0([f, uuid](const database& p) {

				        return (p.find_column_family(uuid).get_stats().*f).hist;},

				            utils::ihistogram(),

				            add_histogram)

				            std::plus<utils::ihistogram>())

				            .then([](const utils::ihistogram& val) {

				                return make_ready_future<json::json_return_type>(to_json(val));

				    });

				}

				static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::ihistogram column_family::stats::*f) {

				static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    std::function<utils::ihistogram(const database&)> fun = [f] (const database& db)  {

				        utils::ihistogram res;

				        for (auto i : db.get_column_families()) {

				            res = add_histogram(res, i.second->get_stats().*f);

				            res += (i.second->get_stats().*f).hist;

				        }

				        return res;

				    };

				@@ -132,6 +133,33 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils:

				    });

				}

				static future<json::json_return_type>  get_cf_rate_and_histogram(http_context& ctx, const sstring& name,

				        utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    utils::UUID uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([f, uuid](const database& p) {

				        return (p.find_column_family(uuid).get_stats().*f).rate();},

				            utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>())

				            .then([](const utils::rate_moving_average_and_histogram& val) {

				                return make_ready_future<json::json_return_type>(timer_to_json(val));

				    });

				}

				static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram column_family::stats::*f) {

				    std::function<utils::rate_moving_average_and_histogram(const database&)> fun = [f] (const database& db)  {

				        utils::rate_moving_average_and_histogram res;

				        for (auto i : db.get_column_families()) {

				            res += (i.second->get_stats().*f).rate();

				        }

				        return res;

				    };

				    return ctx.db.map(fun).then([](const std::vector<utils::rate_moving_average_and_histogram> &res) {

				        std::vector<httpd::utils_json::rate_moving_average_and_histogram> r;

				        boost::copy(res | boost::adaptors::transformed(timer_to_json), std::back_inserter(r));

				        return make_ready_future<json::json_return_type>(r);

				    });

				}

				static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ctx, const sstring& name) {

				    return map_reduce_cf(ctx, name, int64_t(0), [](const column_family& cf) {

				        return cf.get_unleveled_sstables();

				@@ -141,7 +169,7 @@ static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ct

				static int64_t min_row_size(column_family& cf) {

				    int64_t res = INT64_MAX;

				    for (auto i: *cf.get_sstables() ) {

				        res = std::min(res, i.second->get_stats_metadata().estimated_row_size.min());

				        res = std::min(res, i->get_stats_metadata().estimated_row_size.min());

				    }

				    return (res == INT64_MAX) ? 0 : res;

				}

				@@ -149,30 +177,113 @@ static int64_t min_row_size(column_family& cf) {

				static int64_t max_row_size(column_family& cf) {

				    int64_t res = 0;

				    for (auto i: *cf.get_sstables() ) {

				        res = std::max(i.second->get_stats_metadata().estimated_row_size.max(), res);

				        res = std::max(i->get_stats_metadata().estimated_row_size.max(), res);

				    }

				    return res;

				}

				static double update_ratio(double acc, double f, double total) {

				    if (f && !total) {

				        throw bad_param_exception("total should include all elements");

				    } else if (total) {

				        acc += f / total;

				    }

				    return acc;

				}

				static ratio_holder mean_row_size(column_family& cf) {

				    ratio_holder res;

				static integral_ratio_holder mean_row_size(column_family& cf) {

				    integral_ratio_holder res;

				    for (auto i: *cf.get_sstables() ) {

				        auto c = i.second->get_stats_metadata().estimated_row_size.count();

				        res.sub += i.second->get_stats_metadata().estimated_row_size.mean() * c;

				        auto c = i->get_stats_metadata().estimated_row_size.count();

				        res.sub += i->get_stats_metadata().estimated_row_size.mean() * c;

				        res.total += c;

				    }

				    return res;

				}

				static std::unordered_map<sstring, uint64_t> merge_maps(std::unordered_map<sstring, uint64_t> a,

				        const std::unordered_map<sstring, uint64_t>& b) {

				    a.insert(b.begin(), b.end());

				    return a;

				}

				static json::json_return_type sum_map(const std::unordered_map<sstring, uint64_t>& val) {

				    uint64_t res = 0;

				    for (auto i : val) {

				        res += i.second;

				    }

				    return res;

				}

				static future<json::json_return_type>  sum_sstable(http_context& ctx, const sstring name, bool total) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([uuid, total](database& db) {

				        std::unordered_map<sstring, uint64_t> m;

				        auto sstables = (total) ? db.find_column_family(uuid).get_sstables_including_compacted_undeleted() :

				                db.find_column_family(uuid).get_sstables();

				        for (auto t : *sstables) {

				            m[t->get_filename()] = t->bytes_on_disk();

				        }

				        return m;

				    }, std::unordered_map<sstring, uint64_t>(), merge_maps).

				            then([](const std::unordered_map<sstring, uint64_t>& val) {

				        return sum_map(val);

				    });

				}

				static future<json::json_return_type> sum_sstable(http_context& ctx, bool total) {

				    return map_reduce_cf_raw(ctx, std::unordered_map<sstring, uint64_t>(), [total](column_family& cf) {

				        std::unordered_map<sstring, uint64_t> m;

				        auto sstables = (total) ? cf.get_sstables_including_compacted_undeleted() :

				                cf.get_sstables();

				        for (auto t : *sstables) {

				            m[t->get_filename()] = t->bytes_on_disk();

				        }

				        return m;

				    },merge_maps).then([](const std::unordered_map<sstring, uint64_t>& val) {

				        return sum_map(val);

				    });

				}

				template <typename T>

				class sum_ratio {

				    uint64_t _n = 0;

				    T _total = 0;

				public:

				    future<> operator()(T value) {

				        if (value > 0) {

				            _total += value;

				            _n++;

				        }

				        return make_ready_future<>();

				    }

				    // Returns average value of all registered ratios.

				    T get() && {

				        return _n ? (_total / _n) : T(0);

				    }

				};

				static double get_compression_ratio(column_family& cf) {

				    sum_ratio<double> result;

				    for (auto i : *cf.get_sstables()) {

				        auto compression_ratio = i->get_compression_ratio();

				        if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {

				            result(compression_ratio);

				        }

				    }

				    return std::move(result).get();

				}

				static std::vector<uint64_t> concat_sstable_count_per_level(std::vector<uint64_t> a, std::vector<uint64_t>&& b) {

				    a.resize(std::max(a.size(), b.size()), 0UL);

				    for (auto i = 0U; i < b.size(); i++) {

				        a[i] += b[i];

				    }

				    return a;

				}

				ratio_holder filter_false_positive_as_ratio_holder(const sstables::shared_sstable& sst) {

				    double f = sst->filter_get_false_positive();

				    return ratio_holder(f + sst->filter_get_true_positive(), f);

				}

				ratio_holder filter_recent_false_positive_as_ratio_holder(const sstables::shared_sstable& sst) {

				    double f = sst->filter_get_recent_false_positive();

				    return ratio_holder(f + sst->filter_get_recent_true_positive(), f);

				}

				void set_column_family(http_context& ctx, routes& r) {

				    cf::get_column_family_name.set(r, [&ctx] (const_req req){

				        vector<sstring> res;
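The `sum_ratio` helper introduced in the hunk above averages only the positive values it receives, which is how `get_compression_ratio` skips sstables without a usable ratio. A minimal standalone sketch of that behaviour follows. It assumes a plain `void` call operator instead of the Seastar `future<>` used in the diff, and the names `positive_average` and `average_ratio` are hypothetical, chosen for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Standalone analogue of sum_ratio<T>: accumulates only positive samples
// and reports their arithmetic mean. Unlike the code in the diff,
// operator() returns void rather than a Seastar future<> (an assumption
// made so the sketch compiles without Seastar).
template <typename T>
class positive_average {
    uint64_t _n = 0;
    T _total = 0;
public:
    void operator()(T value) {
        if (value > 0) {
            _total += value;
            _n++;
        }
    }
    // Mean of the accepted samples, or T(0) when nothing was accepted.
    T get() const {
        return _n ? (_total / _n) : T(0);
    }
};

// Hypothetical helper: average a list of per-sstable ratios.
inline double average_ratio(const std::vector<double>& ratios) {
    positive_average<double> acc;
    for (double r : ratios) {
        acc(r);
    }
    return acc.get();
}
```

With this sketch, zero or negative inputs do not affect either the sum or the sample count, so an empty or all-zero input yields 0 rather than a division by zero.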

				@@ -293,21 +404,21 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {

				            sstables::estimated_histogram res(0);

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            utils::estimated_histogram res(0);

				            for (auto i: *cf.get_sstables() ) {

				                res.merge(i.second->get_stats_metadata().estimated_row_size);

				                res.merge(i->get_stats_metadata().estimated_row_size);

				            }

				            return res;

				        },

				        sstables::merge, utils_json::estimated_histogram());

				        utils::estimated_histogram_merge, utils_json::estimated_histogram());

				    });

				    cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				            uint64_t res = 0;

				            for (auto i: *cf.get_sstables() ) {

				                res += i.second->get_stats_metadata().estimated_row_size.count();

				                res += i->get_stats_metadata().estimated_row_size.count();

				            }

				            return res;

				        },

				@@ -315,14 +426,14 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {

				            sstables::estimated_histogram res(0);

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            utils::estimated_histogram res(0);

				            for (auto i: *cf.get_sstables() ) {

				                res.merge(i.second->get_stats_metadata().estimated_column_count);

				                res.merge(i->get_stats_metadata().estimated_column_count);

				            }

				            return res;

				        },

				        sstables::merge, utils_json::estimated_histogram());

				        utils::estimated_histogram_merge, utils_json::estimated_histogram());

				    });

				    cf::get_all_compression_ratio.set(r, [] (std::unique_ptr<request> req) {

				@@ -355,10 +466,14 @@ void set_column_family(http_context& ctx, routes& r) {

				        return get_cf_stats_count(ctx, &column_family::stats::writes);

				    });

				    cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				    cf::get_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, req->param["name"], &column_family::stats::reads);

				    });

				    cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, req->param["name"], &column_family::stats::reads);

				    });

				    cf::get_read_latency.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats_sum(ctx,req->param["name"] ,&column_family::stats::reads);

				    });

				@@ -367,24 +482,40 @@ void set_column_family(http_context& ctx, routes& r) {

				        return get_cf_stats_sum(ctx, req->param["name"] ,&column_family::stats::writes);

				    });

				    cf::get_all_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				    cf::get_all_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, &column_family::stats::writes);

				    });

				    cf::get_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				    cf::get_all_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, &column_family::stats::writes);

				    });

				    cf::get_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, req->param["name"], &column_family::stats::writes);

				    });

				    cf::get_all_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				    cf::get_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, req->param["name"], &column_family::stats::writes);

				    });

				    cf::get_all_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_histogram(ctx, &column_family::stats::writes);

				    });

				    cf::get_all_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_rate_and_histogram(ctx, &column_family::stats::writes);

				    });

				    cf::get_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, req->param["name"], &column_family::stats::pending_compactions);

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {

				            return cf.get_compaction_strategy().estimated_pending_compactions(cf);

				        }, std::plus<int64_t>());

				    });

				    cf::get_all_pending_compactions.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family::stats::pending_compactions);

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.get_compaction_strategy().estimated_pending_compactions(cf);

				        }, std::plus<int64_t>());

				    });

				    cf::get_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<request> req) {

				@@ -400,19 +531,19 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_live_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, req->param["name"], &column_family::stats::live_disk_space_used);

				        return sum_sstable(ctx, req->param["name"], false);

				    });

				    cf::get_all_live_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family::stats::live_disk_space_used);

				        return sum_sstable(ctx, false);

				    });

				    cf::get_total_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, req->param["name"], &column_family::stats::total_disk_space_used);

				        return sum_sstable(ctx, req->param["name"], true);

				    });

				    cf::get_all_total_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family::stats::total_disk_space_used);

				        return sum_sstable(ctx, true);

				    });

				    cf::get_min_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				@@ -432,17 +563,19 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), mean_row_size, std::plus<ratio_holder>());

				        // Cassandra 3.x mean values are truncated as integrals.

				        return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());

				    });

				    cf::get_all_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), mean_row_size, std::plus<ratio_holder>());

				        // Cassandra 3.x mean values are truncated as integrals.

				        return map_reduce_cf(ctx, integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());

				    });

				    cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst.second->filter_get_false_positive();

				                return s + sst->filter_get_false_positive();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -450,7 +583,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst.second->filter_get_false_positive();

				                return s + sst->filter_get_false_positive();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -458,7 +591,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst.second->filter_get_recent_false_positive();

				                return s + sst->filter_get_recent_false_positive();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -466,51 +599,39 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return s + sst.second->filter_get_recent_false_positive();

				                return s + sst->filter_get_recent_false_positive();

				            });

				        }, std::plus<uint64_t>());

				    });

				    cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], double(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {

				                double f = sst.second->filter_get_false_positive();

				                return update_ratio(s, f, f + sst.second->filter_get_true_positive());

				            });

				        }, std::plus<double>());

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_all_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, double(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {

				                double f = sst.second->filter_get_false_positive();

				                return update_ratio(s, f, f + sst.second->filter_get_true_positive());

				            });

				        }, std::plus<double>());

				        return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], double(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {

				                double f = sst.second->filter_get_recent_false_positive();

				                return update_ratio(s, f, f + sst.second->filter_get_recent_true_positive());

				            });

				        }, std::plus<double>());

				        return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_all_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, double(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {

				                double f = sst.second->filter_get_recent_false_positive();

				                return update_ratio(s, f, f + sst.second->filter_get_recent_true_positive());

				            });

				        }, std::plus<double>());

				        return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {

				            return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());

				        }, std::plus<>());

				    });

				    cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst.second->filter_size();

				                return sst->filter_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -518,7 +639,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst.second->filter_size();

				                return sst->filter_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -526,7 +647,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst.second->filter_memory_size();

				                return sst->filter_memory_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -534,7 +655,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst.second->filter_memory_size();

				                return sst->filter_memory_size();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -542,7 +663,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst.second->get_summary().memory_footprint();

				                return sst->get_summary().memory_footprint();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -550,7 +671,7 @@ void set_column_family(http_context& ctx, routes& r) {

				    cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {

				            return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {

				                return sst.second->get_summary().memory_footprint();

				                return sst->get_summary().memory_footprint();

				            });

				        }, std::plus<uint64_t>());

				    });

				@@ -623,27 +744,35 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_row_cache_hit.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits;

				        }, std::plus<int64_t>());

				        return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cf::get_all_row_cache_hit.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits;

				        }, std::plus<int64_t>());

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cf::get_row_cache_miss.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().stats().misses;

				        }, std::plus<int64_t>());

				        return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const column_family& cf) {

				            return cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cf::get_all_row_cache_miss.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](const column_family& cf) {

				            return cf.get_row_cache().stats().misses;

				        }, std::plus<int64_t>());

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const column_family& cf) {

				            return cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				@@ -669,10 +798,10 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            return cf.get_stats().estimated_sstable_per_read;

				        },

				        sstables::merge, utils_json::estimated_histogram());

				        utils::estimated_histogram_merge, utils_json::estimated_histogram());

				    });

				    cf::get_tombstone_scanned_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {

				@@ -719,25 +848,29 @@ void set_column_family(http_context& ctx, routes& r) {

				        return std::vector<sstring>();

				    });

				    cf::get_compression_ratio.set(r, [](const_req) {

				        // FIXME

				        // Currently there are no compression information

				        // so we return 0 as the ratio

				        return 0;

				    cf::get_compression_ratio.set(r, [&ctx](std::unique_ptr<request> req) {

				        auto uuid = get_uuid(req->param["name"], ctx.db.local());

				        return ctx.db.map_reduce(sum_ratio<double>(), [uuid](database& db) {

				            column_family& cf = db.find_column_family(uuid);

				            return make_ready_future<double>(get_compression_ratio(cf));

				        }).then([] (const double& result) {

				            return make_ready_future<json::json_return_type>(result);

				        });

				    });

				    cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            return cf.get_stats().estimated_read;

				        },

				        sstables::merge, utils_json::estimated_histogram());

				        utils::estimated_histogram_merge, utils_json::estimated_histogram());

				    });

				    cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return map_reduce_cf(ctx, req->param["name"], sstables::estimated_histogram(0), [](column_family& cf) {

				        return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {

				            return cf.get_stats().estimated_write;

				        },

				        sstables::merge, utils_json::estimated_histogram());

				        utils::estimated_histogram_merge, utils_json::estimated_histogram());

				    });

				    cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<request> req) {

				@@ -766,12 +899,11 @@ void set_column_family(http_context& ctx, routes& r) {

				    });

				    cf::get_sstable_count_per_level.set(r, [&ctx](std::unique_ptr<request> req) {

				        // TBD

				        // FIXME

				        // This is a workaround, until there will be an API to return the count

				        // per level, we return an empty array

				        vector<uint64_t> res;

				        return make_ready_future<json::json_return_type>(res);

				        return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const column_family& cf) {

				            return cf.sstable_count_per_level();

				        }, concat_sstable_count_per_level).then([](const std::vector<uint64_t>& res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				}

				}
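The new `get_sstable_count_per_level` handler reduces per-shard vectors with `concat_sstable_count_per_level`, which, despite its name, adds the vectors element by element. The sketch below shows that reduction in isolation. It is an illustration under assumptions: the function name `add_level_counts` is hypothetical, and the second argument is taken by const reference rather than by rvalue reference as in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise sum of two per-level sstable counts. The result is as long
// as the longer input; levels missing from the shorter one count as zero.
inline std::vector<uint64_t> add_level_counts(std::vector<uint64_t> a,
                                              const std::vector<uint64_t>& b) {
    a.resize(std::max(a.size(), b.size()), 0);
    for (std::size_t i = 0; i < b.size(); i++) {
        a[i] += b[i];
    }
    return a;
}
```

Resizing `a` before the loop is what makes the indexing safe when one shard reports more levels than another.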

									
										37

api/column_family.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -34,31 +34,44 @@ future<> foreach_column_family(http_context& ctx, const sstring& name, std::func

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([mapper, uuid](database& db) {

				        return mapper(db.find_column_family(uuid));

				    }, init, reducer).then([](const I& res) {

				    }, init, reducer);

				}

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([](const I& res) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				template<class Mapper, class I, class Reducer, class Result>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				        Mapper mapper, Reducer reducer, Result result) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    return ctx.db.map_reduce0([mapper, uuid](database& db) {

				        return mapper(db.find_column_family(uuid));

				    }, init, reducer).then([result](const I& res) mutable {

				    }, init, reducer);

				}

				template<class Mapper, class I, class Reducer, class Result>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				        Mapper mapper, Reducer reducer, Result result) {

				    return map_reduce_cf_raw(ctx, name, init, mapper, reducer, result).then([result](const I& res) mutable {

				        result = res;

				        return make_ready_future<json::json_return_type>(result);

				    });

				}

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,

				future<I> map_reduce_cf_raw(http_context& ctx, I init,

				        Mapper mapper, Reducer reducer) {

				    return ctx.db.map_reduce0([mapper, init, reducer](database& db) {

				        auto res = init;

				@@ -66,10 +79,18 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,

				            res = reducer(res, mapper(*i.second.get()));

				        }

				        return res;

				    }, init, reducer).then([](const I& res) {

				    }, init, reducer);

				}

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,

				        Mapper mapper, Reducer reducer) {

				    return map_reduce_cf_raw(ctx, init, mapper, reducer).then([](const I& res) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& name,

				        int64_t column_family::stats::*f);
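The header change above splits each `map_reduce_cf` overload in two: a `_raw` variant that returns the reduced value itself, and a thin wrapper that converts the value into the JSON reply. The following synchronous sketch illustrates that layering only. It assumes a plain vector of shards instead of Seastar's sharded `database`, uses `std::string` as a stand-in for `json::json_return_type`, and every name in it (`shard`, `map_reduce_raw`, `map_reduce`, `shard_bytes`) is hypothetical.

```cpp
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for one shard's data; not a ScyllaDB type.
struct shard {
    std::vector<long> sstable_sizes;
};

// "Raw" variant: maps every shard and folds the results, returning the
// reduced value of type I untouched so that callers can post-process it.
template <class Mapper, class I, class Reducer>
I map_reduce_raw(const std::vector<shard>& shards, I init, Mapper mapper, Reducer reducer) {
    for (const auto& s : shards) {
        init = reducer(init, mapper(s));
    }
    return init;
}

// Convenience variant: same reduction, then converts the result to the
// reply format (a plain string here, standing in for the JSON return type).
template <class Mapper, class I, class Reducer>
std::string map_reduce(const std::vector<shard>& shards, I init, Mapper mapper, Reducer reducer) {
    return std::to_string(map_reduce_raw(shards, init, mapper, reducer));
}

// Hypothetical mapper: total bytes held by one shard.
inline long shard_bytes(const shard& s) {
    return std::accumulate(s.sstable_sizes.begin(), s.sstable_sizes.end(), 0L);
}
```

Keeping the raw variant separate lets a handler chain its own conversion step after the reduction, as the row-cache handlers in `api/column_family.cc` do with `meter_to_json`.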

									
										2

api/commitlog.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/commitlog.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										9

api/compaction_manager.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -20,12 +20,13 @@

				 */

				#include "compaction_manager.hh"

				#include "sstables/compaction_manager.hh"

				#include "api/api-doc/compaction_manager.json.hh"

				#include "db/system_keyspace.hh"

				#include "column_family.hh"

				namespace api {

				using namespace scollectd;

				namespace cm = httpd::compaction_manager_json;

				using namespace json;

				@@ -78,7 +79,9 @@ void set_compaction_manager(http_context& ctx, routes& r) {

				    });

				    cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return get_cm_stats(ctx, &compaction_manager::stats::pending_tasks);

				        return map_reduce_cf(ctx, int64_t(0), [](column_family& cf) {

				            return cf.get_compaction_strategy().estimated_pending_compactions(cf);

				        }, std::plus<int64_t>());

				    });

				    cm::get_completed_tasks.set(r, [&ctx] (std::unique_ptr<request> req) {

									
										2

api/compaction_manager.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										16

api/endpoint_snitch.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -22,16 +22,22 @@

				#include "locator/snitch_base.hh"

				#include "endpoint_snitch.hh"

				#include "api/api-doc/endpoint_snitch_info.json.hh"

				#include "utils/fb_utilities.hh"

				namespace api {

				void set_endpoint_snitch(http_context& ctx, routes& r) {

				    httpd::endpoint_snitch_info_json::get_datacenter.set(r, [] (const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(req.get_query_param("host"));

				    static auto host_or_broadcast = [](const_req req) {

				        auto host = req.get_query_param("host");

				        return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);

				    };

				    httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));

				    });

				    httpd::endpoint_snitch_info_json::get_rack.set(r, [] (const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(req.get_query_param("host"));

				    httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {

				        return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));

				    });

				    httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {
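
Review note: the hunk above makes the `host` query parameter optional for `get_datacenter` and `get_rack`, falling back to the node's own broadcast address when it is empty. A minimal standalone sketch of that fallback, assuming plain strings in place of `gms::inet_address`; `broadcast_address()` is a hypothetical stand-in for `utils::fb_utilities::get_broadcast_address()` and its return value is made up:

```cpp
#include <string>

// Hypothetical stand-in for utils::fb_utilities::get_broadcast_address();
// the address returned here is an arbitrary placeholder.
inline std::string broadcast_address() {
    return "10.0.0.1";
}

// Returns the requested host, or this node's broadcast address when the
// "host" query parameter was omitted (seen by the handler as an empty string).
inline std::string host_or_broadcast(const std::string& host_param) {
    return host_param.empty() ? broadcast_address() : host_param;
}
```

With this shape, a request without `host` describes the local node instead of failing on an empty address.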

									
										2

api/endpoint_snitch.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										16

api/failure_detector.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -88,6 +88,20 @@ void set_failure_detector(http_context& ctx, routes& r) {

				            return make_ready_future<json::json_return_type>(state);

				        });

				    });

				    fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {

				        return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {

				            std::vector<fd::endpoint_phi_value> res;

				            auto now = gms::arrival_window::clk::now();

				            for (auto& p : map) {

				                fd::endpoint_phi_value val;

				                val.endpoint = p.first.to_sstring();

				                val.phi = p.second.phi(now);

				                res.emplace_back(std::move(val));

				            }

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				}

				}

									
										2

api/failure_detector.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/gossiper.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/gossiper.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										3

api/hinted_handoff.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -24,7 +24,6 @@

				namespace api {

				using namespace scollectd;

				using namespace json;

				namespace hh = httpd::hinted_handoff_json;

									
										2

api/hinted_handoff.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										6

api/lsa.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -29,11 +29,11 @@

				namespace api {

				static logging::logger logger("lsa-api");

				static logging::logger alogger("lsa-api");

				void set_lsa(http_context& ctx, routes& r) {

				    httpd::lsa_json::lsa_compact.set(r, [&ctx](std::unique_ptr<request> req) {

				        logger.info("Triggering compaction");

				        alogger.info("Triggering compaction");

				        return ctx.db.invoke_on_all([] (database&) {

				            logalloc::shard_tracker().reclaim(std::numeric_limits<size_t>::max());

				        }).then([] {

									
										2

api/lsa.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										8

api/messaging_service.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -27,7 +27,7 @@

				#include <sstream>

				using namespace httpd::messaging_service_json;

				using namespace net;

				using namespace netw;

				namespace api {

				@@ -120,13 +120,13 @@ void set_messaging_service(http_context& ctx, routes& r) {

				    }));

				    get_version.set(r, [](const_req req) {

				        return net::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));

				        return netw::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));

				    });

				    get_dropped_messages_by_ver.set(r, [](std::unique_ptr<request> req) {

				        shared_ptr<std::vector<uint64_t>> map = make_shared<std::vector<uint64_t>>(num_verb);

				        return net::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {

				        return netw::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {

				            for (auto i = 0; i < num_verb; i++) {

				                (*map)[i]+= local_map[i];

				            }

									
										2

api/messaging_service.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										87

api/storage_proxy.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -33,17 +33,36 @@ namespace sp = httpd::storage_proxy_json;

				using proxy = service::storage_proxy;

				using namespace json;

				static future<json::json_return_type>  sum_estimated_histogram(http_context& ctx, sstables::estimated_histogram proxy::stats::*f) {

				    return ctx.sp.map_reduce0([f](const proxy& p) {return p.get_stats().*f;}, sstables::estimated_histogram(),

				            sstables::merge).then([](const sstables::estimated_histogram& val) {

				static future<utils::rate_moving_average>  sum_timed_rate(distributed<proxy>& d, utils::timed_rate_moving_average proxy::stats::*f) {

				    return d.map_reduce0([f](const proxy& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average(),

				            std::plus<utils::rate_moving_average>());

				}

				static future<json::json_return_type>  sum_timed_rate_as_obj(distributed<proxy>& d, utils::timed_rate_moving_average proxy::stats::*f) {

				    return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {

				        httpd::utils_json::rate_moving_average m;

				        m = val;

				        return make_ready_future<json::json_return_type>(m);

				    });

				}

				static future<json::json_return_type>  sum_timed_rate_as_long(distributed<proxy>& d, utils::timed_rate_moving_average proxy::stats::*f) {

				    return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {

				        return make_ready_future<json::json_return_type>(val.count);

				    });

				}

				static future<json::json_return_type>  sum_estimated_histogram(http_context& ctx, utils::estimated_histogram proxy::stats::*f) {

				    return ctx.sp.map_reduce0([f](const proxy& p) {return p.get_stats().*f;}, utils::estimated_histogram(),

				            utils::estimated_histogram_merge).then([](const utils::estimated_histogram& val) {

				        utils_json::estimated_histogram res;

				        res = val;

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				static future<json::json_return_type>  total_latency(http_context& ctx, utils::ihistogram proxy::stats::*f) {

				    return ctx.sp.map_reduce0([f](const proxy& p) {return (p.get_stats().*f).mean * (p.get_stats().*f).count;}, 0.0,

				static future<json::json_return_type>  total_latency(http_context& ctx, utils::timed_rate_moving_average_and_histogram proxy::stats::*f) {

				    return ctx.sp.map_reduce0([f](const proxy& p) {return (p.get_stats().*f).hist.mean * (p.get_stats().*f).hist.count;}, 0.0,

				            std::plus<double>()).then([](double val) {

				        int64_t res = val;

				        return make_ready_future<json::json_return_type>(res);

				@@ -291,41 +310,77 @@ void set_storage_proxy(http_context& ctx, routes& r) {

				    });

				    sp::get_read_metrics_timeouts.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_stats(ctx.sp, &proxy::stats::read_timeouts);

				        return sum_timed_rate_as_long(ctx.sp, &proxy::stats::read_timeouts);

				    });

				    sp::get_read_metrics_unavailables.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_stats(ctx.sp, &proxy::stats::read_unavailables);

				        return sum_timed_rate_as_long(ctx.sp, &proxy::stats::read_unavailables);

				    });

				    sp::get_range_metrics_timeouts.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_stats(ctx.sp, &proxy::stats::range_slice_timeouts);

				        return sum_timed_rate_as_long(ctx.sp, &proxy::stats::range_slice_timeouts);

				    });

				    sp::get_range_metrics_unavailables.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_stats(ctx.sp, &proxy::stats::range_slice_unavailables);

				        return sum_timed_rate_as_long(ctx.sp, &proxy::stats::range_slice_unavailables);

				    });

				    sp::get_write_metrics_timeouts.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_stats(ctx.sp, &proxy::stats::write_timeouts);

				        return sum_timed_rate_as_long(ctx.sp, &proxy::stats::write_timeouts);

				    });

				    sp::get_write_metrics_unavailables.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_stats(ctx.sp, &proxy::stats::write_unavailables);

				        return sum_timed_rate_as_long(ctx.sp, &proxy::stats::write_unavailables);

				    });

				    sp::get_range_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				    sp::get_read_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::read_timeouts);

				    });

				    sp::get_read_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::read_unavailables);

				    });

				    sp::get_range_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::range_slice_timeouts);

				    });

				    sp::get_range_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::range_slice_unavailables);

				    });

				    sp::get_write_metrics_timeouts_rates.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::write_timeouts);

				    });

				    sp::get_write_metrics_unavailables_rates.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timed_rate_as_obj(ctx.sp, &proxy::stats::write_unavailables);

				    });

				    sp::get_range_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_histogram_stats(ctx.sp, &proxy::stats::range);

				    });

				    sp::get_write_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				    sp::get_write_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_histogram_stats(ctx.sp, &proxy::stats::write);

				    });

				    sp::get_read_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				    sp::get_read_metrics_latency_histogram_depricated.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_histogram_stats(ctx.sp, &proxy::stats::read);

				    });

				    sp::get_range_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timer_stats(ctx.sp, &proxy::stats::range);

				    });

				    sp::get_write_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timer_stats(ctx.sp, &proxy::stats::write);

				    });

				    sp::get_read_metrics_latency_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_timer_stats(ctx.sp, &proxy::stats::read);

				    });

				    sp::get_read_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_estimated_histogram(ctx, &proxy::stats::estimated_read);

				    });

				@@ -342,7 +397,7 @@ void set_storage_proxy(http_context& ctx, routes& r) {

				    });

				    sp::get_range_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {

				        return sum_histogram_stats(ctx.sp, &proxy::stats::read);

				        return sum_timer_stats(ctx.sp, &proxy::stats::range);

				    });

				    sp::get_range_latency.set(r, [&ctx](std::unique_ptr<request> req) {

									
										2

api/storage_proxy.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										113

api/storage_service.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -22,6 +22,8 @@

				#include "storage_service.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include "db/config.hh"

				#include <boost/range/adaptor/map.hpp>

				#include <boost/range/adaptor/filtered.hpp>

				#include <service/storage_service.hh>

				#include <db/commitlog/commitlog.hh>

				#include <gms/gossiper.hh>

				@@ -31,6 +33,8 @@

				#include "locator/snitch_base.hh"

				#include "column_family.hh"

				#include "log.hh"

				#include "release.hh"

				#include "sstables/compaction_manager.hh"

				namespace api {

				@@ -121,6 +125,9 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return service::get_local_storage_service().get_release_version();

				    });

				    ss::get_scylla_release_version.set(r, [](const_req req) {

				        return scylla_version();

				    });

				    ss::get_schema_version.set(r, [](const_req req) {

				        return service::get_local_storage_service().get_schema_version();

				    });

				@@ -355,16 +362,22 @@ void set_storage_service(http_context& ctx, routes& r) {

				            try {

				                res = fut.get0();

				            } catch(std::runtime_error& e) {

				                return make_ready_future<json::json_return_type>(json_exception(httpd::bad_param_exception(e.what())));

				                throw httpd::bad_param_exception(e.what());

				            }

				            return make_ready_future<json::json_return_type>(json::json_return_type(res));

				        });

				    });

				    ss::force_terminate_all_repair_sessions.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				        return make_ready_future<json::json_return_type>(json_void());

				        return repair_abort_all(service::get_local_storage_service().db()).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::force_terminate_all_repair_sessions_new.set(r, [](std::unique_ptr<request> req) {

				        return repair_abort_all(service::get_local_storage_service().db()).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::decommission.set(r, [](std::unique_ptr<request> req) {

				@@ -382,21 +395,21 @@ void set_storage_service(http_context& ctx, routes& r) {

				    ss::remove_node.set(r, [](std::unique_ptr<request> req) {

				        auto host_id = req->get_query_param("host_id");

				        return service::get_local_storage_service().remove_node(host_id).then([] {

				        return service::get_local_storage_service().removenode(host_id).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::get_removal_status.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				        return make_ready_future<json::json_return_type>("");

				        return service::get_local_storage_service().get_removal_status().then([] (auto status) {

				            return make_ready_future<json::json_return_type>(status);

				        });

				    });

				    ss::force_remove_completion.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				        return make_ready_future<json::json_return_type>(json_void());

				        return service::get_local_storage_service().force_remove_completion().then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    ss::set_logging_level.set(r, [](std::unique_ptr<request> req) {

				@@ -453,8 +466,15 @@ void set_storage_service(http_context& ctx, routes& r) {

				    });

				    ss::get_keyspaces.set(r, [&ctx](const_req req) {

				        auto non_system = req.get_query_param("non_system");

				        return map_keys(ctx.db.local().keyspaces());

				        auto type = req.get_query_param("type");

				        if (type == "user") {

				            return ctx.db.local().get_non_system_keyspaces();

				        } else if (type == "non_local_strategy") {

				            return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {

				                return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;

				            }));

				        }

				        return map_keys(ctx.db.local().get_keyspaces());

				    });

				    ss::update_snitch.set(r, [](std::unique_ptr<request> req) {

				@@ -538,9 +558,7 @@ void set_storage_service(http_context& ctx, routes& r) {

				    });

				    ss::is_joined.set(r, [] (std::unique_ptr<request> req) {

				        return service::get_local_storage_service().is_joined().then([] (bool is_joined) {

				            return make_ready_future<json::json_return_type>(is_joined);

				        });

				        return make_ready_future<json::json_return_type>(service::get_local_storage_service().is_joined());

				    });

				    ss::set_stream_throughput_mb_per_sec.set(r, [](std::unique_ptr<request> req) {

				@@ -659,16 +677,59 @@ void set_storage_service(http_context& ctx, routes& r) {

				    });

				    ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				        auto probability = req->get_query_param("probability");

				        return make_ready_future<json::json_return_type>(json_void());

				        return futurize<json::json_return_type>::apply([probability] {

				            double real_prob = std::stod(probability.c_str());

				            return tracing::tracing::tracing_instance().invoke_on_all([real_prob] (auto& local_tracing) {

				                local_tracing.set_trace_probability(real_prob);

				            }).then([] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				        }).then_wrapped([probability] (auto&& f) {

				            try {

				                f.get();

				                return make_ready_future<json::json_return_type>(json_void());

				            } catch (std::out_of_range& e) {

				                throw httpd::bad_param_exception(e.what());

				            } catch (std::invalid_argument&){

				                throw httpd::bad_param_exception(sprint("Bad format in a probability value: \"%s\"", probability.c_str()));

				            }

				        });

				    });

				    ss::get_trace_probability.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				        return make_ready_future<json::json_return_type>(0);

				        return make_ready_future<json::json_return_type>(tracing::tracing::get_local_tracing_instance().get_trace_probability());

				    });

				    ss::get_slow_query_info.set(r, [](const_req req) {

				        ss::slow_query_info res;

				        res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();

				        res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;

				        res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();

				        return res;

				    });

				    ss::set_slow_query.set(r, [](std::unique_ptr<request> req) {

				        auto enable = req->get_query_param("enable");

				        auto ttl = req->get_query_param("ttl");

				        auto threshold = req->get_query_param("threshold");

				        try {

				            return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {

				                if (threshold != "") {

				                    local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));

				                }

				                if (ttl != "") {

				                    local_tracing.set_slow_query_record_ttl(std::chrono::seconds(std::stol(ttl.c_str())));

				                }

				                if (enable != "") {

				                    local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);

				                }

				            }).then([] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				        } catch (...) {

				            throw httpd::bad_param_exception(sprint("Bad format value: "));

				        }

				    });

				    ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
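
Review note: the new `set_trace_probability` handler parses the query parameter with `std::stod` and maps both failure modes to a bad-parameter error. A minimal standalone sketch of that mapping, assuming a hypothetical `bad_param_error` type in place of `httpd::bad_param_exception` and a synchronous function in place of the future chain:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for httpd::bad_param_exception.
struct bad_param_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Parses a probability given as text. Both std::stod failure modes are
// translated into bad_param_error so the caller can report a client error.
inline double parse_probability(const std::string& text) {
    try {
        return std::stod(text);
    } catch (const std::out_of_range& e) {
        // The value does not fit in a double.
        throw bad_param_error(e.what());
    } catch (const std::invalid_argument&) {
        // No numeric prefix could be parsed at all.
        throw bad_param_error("Bad format in a probability value: \"" + text + "\"");
    }
}
```

Note that `std::stod` stops at the first character it cannot parse, so an input such as `0.5xyz` yields `0.5` without throwing, and it does not check that the result lies in [0, 1]. Whether the tracing layer validates the range is not visible in this diff.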

				@@ -748,10 +809,8 @@ void set_storage_service(http_context& ctx, routes& r) {

				        return make_ready_future<json::json_return_type>(json_void());

				    });

				    ss::get_metrics_load.set(r, [](std::unique_ptr<request> req) {

				        //TBD

				        unimplemented();

				        return make_ready_future<json::json_return_type>(0);

				    ss::get_metrics_load.set(r, [&ctx](std::unique_ptr<request> req) {

				        return get_cf_stats(ctx, &column_family::stats::live_disk_space_used);

				    });

				    ss::get_exceptions.set(r, [](const_req req) {

									
										2

api/storage_service.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/stream_manager.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/stream_manager.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/system.cc
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										2

api/system.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										164

atomic_cell.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 Cloudius Systems, Ltd.

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -28,11 +28,12 @@

				#include "utils/managed_bytes.hh"

				#include "net/byteorder.hh"

				#include <cstdint>

				#include <iostream>

				#include <iosfwd>

				#include <seastar/util/gcc6-concepts.hh>

				template<typename T>

				template<typename T, typename Input>

				static inline

				void set_field(managed_bytes& v, unsigned offset, T val) {

				void set_field(Input& v, unsigned offset, T val) {

				    reinterpret_cast<net::packed<T>*>(v.begin() + offset)->raw = net::hton(val);

				}

				@@ -54,9 +55,11 @@ class atomic_cell_or_collection;

				 */

				class atomic_cell_type final {

				private:

				    static constexpr int8_t DEAD_FLAGS = 0;

				    static constexpr int8_t LIVE_FLAG = 0x01;

				    static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells

				    static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.

				    static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.

				    static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;

				    static constexpr unsigned flags_size = 1;

				    static constexpr unsigned timestamp_offset = flags_size;

				    static constexpr unsigned timestamp_size = 8;

				@@ -66,27 +69,62 @@ private:

				    static constexpr unsigned deletion_time_size = 4;

				    static constexpr unsigned ttl_offset = expiry_offset + expiry_size;

				    static constexpr unsigned ttl_size = 4;

				    friend class counter_cell_builder;

				private:

				    static bool is_counter_update(bytes_view cell) {

				        return cell[0] & COUNTER_UPDATE_FLAG;

				    }

				    static bool is_revert_set(bytes_view cell) {

				        return cell[0] & REVERT_FLAG;

				    }

				    static bool is_counter_in_place_revert_set(bytes_view cell) {

				        return cell[0] & COUNTER_IN_PLACE_REVERT;

				    }

				    template<typename BytesContainer>

				    static void set_revert(BytesContainer& cell, bool revert) {

				        cell[0] = (cell[0] & ~REVERT_FLAG) | (revert * REVERT_FLAG);

				    }

				    template<typename BytesContainer>

				    static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {

				        cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);

				    }

				    static bool is_live(const bytes_view& cell) {

				        return cell[0] != DEAD_FLAGS;

				        return cell[0] & LIVE_FLAG;

				    }

				    static bool is_live_and_has_ttl(const bytes_view& cell) {

				        return cell[0] & EXPIRY_FLAG;

				    }

				    static bool is_dead(const bytes_view& cell) {

				        return cell[0] == DEAD_FLAGS;

				        return !is_live(cell);

				    }

				    // Can be called on live and dead cells

				    static api::timestamp_type timestamp(const bytes_view& cell) {

				        return get_field<api::timestamp_type>(cell, timestamp_offset);

				    }

				    template<typename BytesContainer>

				    static void set_timestamp(BytesContainer& cell, api::timestamp_type ts) {

				        set_field(cell, timestamp_offset, ts);

				    }

				    // Can be called on live cells only

				    static bytes_view value(bytes_view cell) {

				private:

				    template<typename BytesView>

				    static BytesView do_get_value(BytesView cell) {

				        auto expiry_field_size = bool(cell[0] & EXPIRY_FLAG) * (expiry_size + ttl_size);

				        auto value_offset = flags_size + timestamp_size + expiry_field_size;

				        cell.remove_prefix(value_offset);

				        return cell;

				    }

				public:

				    static bytes_view value(bytes_view cell) {

				        return do_get_value(cell);

				    }

				    static bytes_mutable_view value(bytes_mutable_view cell) {

				        return do_get_value(cell);

				    }

				    // Can be called on live counter update cells only

				    static int64_t counter_update_value(bytes_view cell) {

				        return get_field<int64_t>(cell, flags_size + timestamp_size);

				    }

				    // Can be called only when is_dead() is true.

				    static gc_clock::time_point deletion_time(const bytes_view& cell) {

				        assert(is_dead(cell));
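
Review note: this hunk replaces the `cell[0] != DEAD_FLAGS` liveness test with `cell[0] & LIVE_FLAG`. With the new transient flags, a dead cell can carry non-zero flag bits, so "any bit set" no longer means "live". A minimal standalone sketch of the difference, assuming a bare flags byte instead of a `bytes_view` and `std::uint8_t` instead of the header's `int8_t`; `is_live_old` and `is_live_new` are illustrative names, not functions from the source:

```cpp
#include <cstdint>

// Flag values copied from the hunk above.
constexpr std::uint8_t LIVE_FLAG   = 0x01;
constexpr std::uint8_t EXPIRY_FLAG = 0x02;
constexpr std::uint8_t REVERT_FLAG = 0x04;

// Old check: any set bit counted as "live".
inline bool is_live_old(std::uint8_t flags) { return flags != 0; }

// New check: only the dedicated bit decides liveness.
inline bool is_live_new(std::uint8_t flags) { return flags & LIVE_FLAG; }

// Same branch-free update as set_revert() in the hunk: clear the bit,
// then OR it back in when `revert` is true (bool promotes to 0 or 1).
inline std::uint8_t set_revert(std::uint8_t flags, bool revert) {
    return static_cast<std::uint8_t>((flags & ~REVERT_FLAG) | (revert * REVERT_FLAG));
}
```

For a dead cell with the revert bit set (flags `0x04`), the old test reports live and the new one reports dead.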

				@@ -106,7 +144,7 @@ private:

				    }

				    static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {

				        managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);

				        b[0] = DEAD_FLAGS;

				        b[0] = 0;

				        set_field(b, timestamp_offset, timestamp);

				        set_field(b, deletion_time_offset, deletion_time.time_since_epoch().count());

				        return b;

				@@ -119,6 +157,14 @@ private:

				        std::copy_n(value.begin(), value.size(), b.begin() + value_offset);

				        return b;

				    }

				    static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {

				        auto value_offset = flags_size + timestamp_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));

				        b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;

				        set_field(b, timestamp_offset, timestamp);

				        set_field(b, value_offset, value);

				        return b;

				    }

				    static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {

				        auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());

				@@ -129,6 +175,31 @@ private:

				        std::copy_n(value.begin(), value.size(), b.begin() + value_offset);

				        return b;

				    }

				    // make_live_from_serializer() is intended for users that need to serialise

				    // some object or objects to the format used in atomic_cell::value().

				    // With just make_live() the pattern would look as follows:

				    // 1. allocate a buffer and write to it serialised objects

				    // 2. pass that buffer to make_live()

				    // 3. make_live() needs to prepend some metadata to the cell value so it

				    //    allocates a new buffer and copies the content of the original one

				    //

				    // The allocation and copy of a buffer can be avoided.

				    // make_live_from_serializer() allows the user code to specify the timestamp

				    // and size of the cell value as well as provide the serialiser function

				    // object, which would write the serialised value of the cell to the buffer

				    // given to it by make_live_from_serializer().

				    template<typename Serializer>

				    GCC6_CONCEPT(requires requires(Serializer serializer, bytes::iterator it) {

				        serializer(it);

				    })

				    static managed_bytes make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {

				        auto value_offset = flags_size + timestamp_size;

				        managed_bytes b(managed_bytes::initialized_later(), value_offset + size);

				        b[0] = LIVE_FLAG;

				        set_field(b, timestamp_offset, timestamp);

				        serializer(b.begin() + value_offset);

				        return b;

				    }

				    template<typename ByteContainer>

				    friend class atomic_cell_base;

				    friend class atomic_cell;

				@@ -140,16 +211,25 @@ protected:

				    ByteContainer _data;

				protected:

				    atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }

				    atomic_cell_base(const ByteContainer& data) : _data(data) { }

				    friend class atomic_cell_or_collection;

				public:

				    bool is_counter_update() const {

				        return atomic_cell_type::is_counter_update(_data);

				    }

				    bool is_revert_set() const {

				        return atomic_cell_type::is_revert_set(_data);

				    }

				    bool is_counter_in_place_revert_set() const {

				        return atomic_cell_type::is_counter_in_place_revert_set(_data);

				    }

				    bool is_live() const {

				        return atomic_cell_type::is_live(_data);

				    }

				    bool is_live(tombstone t) const {

				        return is_live() && !is_covered_by(t);

				    bool is_live(tombstone t, bool is_counter) const {

				        return is_live() && !is_covered_by(t, is_counter);

				    }

				    bool is_live(tombstone t, gc_clock::time_point now) const {

				        return is_live() && !is_covered_by(t) && !has_expired(now);

				    bool is_live(tombstone t, gc_clock::time_point now, bool is_counter) const {

				        return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);

				    }

				    bool is_live_and_has_ttl() const {

				        return atomic_cell_type::is_live_and_has_ttl(_data);

				@@ -157,17 +237,24 @@ public:

				    bool is_dead(gc_clock::time_point now) const {

				        return atomic_cell_type::is_dead(_data) || has_expired(now);

				    }

				    bool is_covered_by(tombstone t) const {

				        return timestamp() <= t.timestamp;

				    bool is_covered_by(tombstone t, bool is_counter) const {

				        return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);

				    }

				    // Can be called on live and dead cells

				    api::timestamp_type timestamp() const {

				        return atomic_cell_type::timestamp(_data);

				    }

				    void set_timestamp(api::timestamp_type ts) {

				        atomic_cell_type::set_timestamp(_data, ts);

				    }

				    // Can be called on live cells only

				    bytes_view value() const {

				    auto value() const {

				        return atomic_cell_type::value(_data);

				    }

				    // Can be called on live counter update cells only

				    int64_t counter_update_value() const {

				        return atomic_cell_type::counter_update_value(_data);

				    }

				    // Can be called only when is_dead(gc_clock::time_point)

				    gc_clock::time_point deletion_time() const {

				        return !is_live() ? atomic_cell_type::deletion_time(_data) : expiry() - ttl();

				@@ -182,15 +269,21 @@ public:

				    }

				    // Can be called on live and dead cells

				    bool has_expired(gc_clock::time_point now) const {

				        return is_live_and_has_ttl() && expiry() < now;

				        return is_live_and_has_ttl() && expiry() <= now;

				    }

				    bytes_view serialize() const {

				        return _data;

				    }

				    void set_revert(bool revert) {

				        atomic_cell_type::set_revert(_data, revert);

				    }

				    void set_counter_in_place_revert(bool flag) {

				        atomic_cell_type::set_counter_in_place_revert(_data, flag);

				    }

				};

				class atomic_cell_view final : public atomic_cell_base<bytes_view> {

				    atomic_cell_view(bytes_view data) : atomic_cell_base(data) {}

				    atomic_cell_view(bytes_view data) : atomic_cell_base(std::move(data)) {}

				public:

				    static atomic_cell_view from_bytes(bytes_view data) { return atomic_cell_view(data); }

				@@ -198,6 +291,19 @@ public:

				    friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);

				};

				class atomic_cell_mutable_view final : public atomic_cell_base<bytes_mutable_view> {

				    atomic_cell_mutable_view(bytes_mutable_view data) : atomic_cell_base(std::move(data)) {}

				public:

				    static atomic_cell_mutable_view from_bytes(bytes_mutable_view data) { return atomic_cell_mutable_view(data); }

				    friend class atomic_cell;

				};

				class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {

				public:

				    atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}

				};

				class atomic_cell final : public atomic_cell_base<managed_bytes> {

				    atomic_cell(managed_bytes b) : atomic_cell_base(std::move(b)) {}

				public:

				@@ -218,11 +324,22 @@ public:

				    static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value) {

				        return atomic_cell_type::make_live(timestamp, value);

				    }

				    static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {

				        return make_live(timestamp, bytes_view(value));

				    }

				    static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value) {

				        return atomic_cell_type::make_live_counter_update(timestamp, value);

				    }

				    static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,

				        gc_clock::time_point expiry, gc_clock::duration ttl)

				    {

				        return atomic_cell_type::make_live(timestamp, value, expiry, ttl);

				    }

				    static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value,

				                                 gc_clock::time_point expiry, gc_clock::duration ttl)

				    {

				        return make_live(timestamp, bytes_view(value), expiry, ttl);

				    }

				    static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value, ttl_opt ttl) {

				        if (!ttl) {

				            return atomic_cell_type::make_live(timestamp, value);

				@@ -230,6 +347,10 @@ public:

				            return atomic_cell_type::make_live(timestamp, value, gc_clock::now() + *ttl, *ttl);

				        }

				    }

				    template<typename Serializer>

				    static atomic_cell make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {

				        return atomic_cell_type::make_live_from_serializer(timestamp, size, std::forward<Serializer>(serializer));

				    }

				    friend class atomic_cell_or_collection;

				    friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);

				};

				@@ -267,11 +388,6 @@ collection_mutation::operator collection_mutation_view() const {

				    return { data };

				}

				namespace db {

				template<typename T>

				class serializer;

				}

				class column_definition;

				int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);
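
The comment on make_live_from_serializer() in the hunk above describes writing the serialised value straight into the cell buffer instead of serialising to a temporary and copying. A minimal standalone sketch of that idea follows. It is not the Scylla implementation: std::vector stands in for managed_bytes (and, unlike managed_bytes::initialized_later(), zero-fills the buffer), the constant values are assumptions, and make_int_cell is a hypothetical caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the real layout constants; the values are assumptions.
constexpr std::size_t flags_size = 1;
constexpr std::size_t timestamp_size = sizeof(std::int64_t);
constexpr std::uint8_t live_flag = 0x01;

// Builds [flags | timestamp | value] with a single allocation. The serializer
// writes the value directly at the offset past the header, so no intermediate
// buffer holding the serialised value is needed.
template <typename Serializer>
std::vector<std::uint8_t> make_live_from_serializer(std::int64_t timestamp,
                                                    std::size_t size,
                                                    Serializer&& serializer) {
    const std::size_t value_offset = flags_size + timestamp_size;
    std::vector<std::uint8_t> b(value_offset + size);
    b[0] = live_flag;
    // Host byte order is used here for illustration only.
    std::memcpy(b.data() + flags_size, &timestamp, timestamp_size);
    serializer(b.data() + value_offset);
    return b;
}

// Hypothetical caller: serialises a 32-bit value straight into the cell buffer.
inline std::vector<std::uint8_t> make_int_cell(std::int64_t timestamp, std::int32_t v) {
    return make_live_from_serializer(timestamp, sizeof(v), [v](std::uint8_t* out) {
        std::memcpy(out, &v, sizeof(v));
    });
}
```

The caller has to know the serialised size up front, which is why the function takes both a size and a serializer rather than a finished byte range.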

									
										29

atomic_cell_hash.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 Cloudius Systems, Ltd.

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -26,16 +26,17 @@

				#include "types.hh"

				#include "atomic_cell.hh"

				#include "hashing.hh"

				#include "counters.hh"

				template<>

				struct appending_hash<collection_mutation_view> {

				    template<typename Hasher>

				    void operator()(Hasher& h, collection_mutation_view cell) const {

				    void operator()(Hasher& h, collection_mutation_view cell, const column_definition& cdef) const {

				        auto m_view = collection_type_impl::deserialize_mutation_form(cell);

				        ::feed_hash(h, m_view.tomb);

				        for (auto&& key_and_value : m_view.cells) {

				            ::feed_hash(h, key_and_value.first);

				            ::feed_hash(h, key_and_value.second);

				            ::feed_hash(h, key_and_value.second, cdef);

				        }

				    }

				};

				@@ -43,10 +44,14 @@ struct appending_hash<collection_mutation_view> {

				template<>

				struct appending_hash<atomic_cell_view> {

				    template<typename Hasher>

				    void operator()(Hasher& h, atomic_cell_view cell) const {

				    void operator()(Hasher& h, atomic_cell_view cell, const column_definition& cdef) const {

				        feed_hash(h, cell.is_live());

				        feed_hash(h, cell.timestamp());

				        if (cell.is_live()) {

				            if (cdef.is_counter()) {

				                ::feed_hash(h, counter_cell_view(cell));

				                return;

				            }

				            if (cell.is_live_and_has_ttl()) {

				                feed_hash(h, cell.expiry());

				                feed_hash(h, cell.ttl());

				@@ -57,3 +62,19 @@ struct appending_hash<atomic_cell_view> {

				        }

				    }

				};

				template<>

				struct appending_hash<atomic_cell> {

				    template<typename Hasher>

				    void operator()(Hasher& h, const atomic_cell& cell, const column_definition& cdef) const {

				        feed_hash(h, static_cast<atomic_cell_view>(cell), cdef);

				    }

				};

				template<>

				struct appending_hash<collection_mutation> {

				    template<typename Hasher>

				    void operator()(Hasher& h, const collection_mutation& cm, const column_definition& cdef) const {

				        feed_hash(h, static_cast<collection_mutation_view>(cm), cdef);

				    }

				};
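
The hunks above thread a column_definition through every appending_hash specialisation so that a cell can be hashed differently depending on its column (counter cells return early). The sketch below illustrates only that shape, under loud assumptions: recording_hasher, toy_cell and toy_column are hypothetical, the variadic feed_hash is an assumed forwarding helper, and what a real counter cell feeds into the hash is not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy hasher (hypothetical): records the bytes it is fed so inputs can be compared exactly.
struct recording_hasher {
    std::vector<std::uint8_t> bytes;
    void update(const void* p, std::size_t n) {
        auto* b = static_cast<const std::uint8_t*>(p);
        bytes.insert(bytes.end(), b, b + n);
    }
};

template <typename T> struct appending_hash;  // specialised per type

// Forwards any extra context (here: the column description) to the specialisation.
template <typename Hasher, typename T, typename... Args>
void feed_hash(Hasher& h, const T& v, Args&&... args) {
    appending_hash<T>()(h, v, std::forward<Args>(args)...);
}

template <> struct appending_hash<bool> {
    template <typename Hasher>
    void operator()(Hasher& h, bool v) const {
        const std::uint8_t b = v ? 1 : 0;
        h.update(&b, 1);
    }
};

template <> struct appending_hash<std::int64_t> {
    template <typename Hasher>
    void operator()(Hasher& h, std::int64_t v) const { h.update(&v, sizeof(v)); }
};

template <> struct appending_hash<std::string> {
    template <typename Hasher>
    void operator()(Hasher& h, const std::string& s) const { h.update(s.data(), s.size()); }
};

// Hypothetical schema and cell types, far simpler than the real ones.
struct toy_column { bool counter; };
struct toy_cell { bool live; std::int64_t timestamp; std::int64_t ttl; std::string value; };

template <> struct appending_hash<toy_cell> {
    template <typename Hasher>
    void operator()(Hasher& h, const toy_cell& c, const toy_column& col) const {
        feed_hash(h, c.live);
        feed_hash(h, c.timestamp);
        if (!c.live) {
            return;
        }
        if (col.counter) {
            feed_hash(h, c.value);  // counter column: take the early-return path, skip the TTL
            return;
        }
        feed_hash(h, c.ttl);
        feed_hash(h, c.value);
    }
};
```

Passing the column description as an argument keeps the hash functors stateless, at the cost of every caller having to supply it.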

									
										16

atomic_cell_or_collection.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 Cloudius Systems, Ltd.

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -27,6 +27,8 @@

				// A variant type that can hold either an atomic_cell, or a serialized collection.

				// Which type is stored is determined by the schema.

				// Has an "empty" state.

				// Objects moved-from are left in an empty state.

				class atomic_cell_or_collection final {

				    managed_bytes _data;

				private:

				@@ -36,10 +38,15 @@ public:

				    atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}

				    static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }

				    atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }

				    atomic_cell_ref as_atomic_cell_ref() { return { _data }; }

				    atomic_cell_mutable_view as_mutable_atomic_cell() { return atomic_cell_mutable_view::from_bytes(_data); }

				    atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}

				    explicit operator bool() const {

				        return !_data.empty();

				    }

				    bool can_use_mutable_view() const {

				        return !_data.is_fragmented();

				    }

				    static atomic_cell_or_collection from_collection_mutation(collection_mutation data) {

				        return std::move(data.data);

				    }

				@@ -55,10 +62,13 @@ public:

				    template<typename Hasher>

				    void feed_hash(Hasher& h, const column_definition& def) const {

				        if (def.is_atomic()) {

				            ::feed_hash(h, as_atomic_cell());

				            ::feed_hash(h, as_atomic_cell(), def);

				        } else {

				            ::feed_hash(as_collection_mutation(), h, def.type);

				            ::feed_hash(h, as_collection_mutation(), def);

				        }

				    }

				    size_t external_memory_usage() const {

				        return _data.external_memory_usage();

				    }

				    friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);

				};

									
										41

auth/allow_all_authenticator.cc
									
										Normal file
									
				@@ -0,0 +1,41 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "auth/allow_all_authenticator.hh"

				#include "service/migration_manager.hh"

				#include "utils/class_registrator.hh"

				namespace auth {

				const sstring& allow_all_authenticator_name() {

				    static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthenticator";

				    return name;

				}

				// To ensure correct initialization order, we unfortunately need to use a string literal.

				static const class_registrator<

				        authenticator,

				        allow_all_authenticator,

				        cql3::query_processor&,

				        ::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");

				}
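
The registration above passes a string literal rather than allow_all_authenticator_name() "to ensure correct initialization order". One plausible reading is that a namespace-scope registration object may be constructed before namespace-scope strings defined in other translation units, while a literal is always available. The sketch below shows a name-to-factory registry built that way. It is not Scylla's class_registrator, whose interface is not shown in this diff; registry(), registrator and create() are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

struct authenticator {
    virtual ~authenticator() = default;
    virtual bool require_authentication() const = 0;
};

using factory = std::function<std::unique_ptr<authenticator>()>;

// Function-local static: constructed on first use, so it is safe to call
// from the constructors of other static objects.
inline std::map<std::string, factory>& registry() {
    static std::map<std::string, factory> r;
    return r;
}

// Registers a factory for T under `name` as a side effect of construction.
template <typename T>
struct registrator {
    explicit registrator(const char* name) {
        registry()[name] = [] { return std::unique_ptr<authenticator>(new T()); };
    }
};

struct allow_all final : authenticator {
    bool require_authentication() const override { return false; }
};

// A string literal needs no dynamic initialisation, so it is valid no matter
// when this static object happens to be constructed.
static const registrator<allow_all> registration("org.apache.cassandra.auth.AllowAllAuthenticator");

// Looks up a class by its qualified name; returns nullptr if it was never registered.
inline std::unique_ptr<authenticator> create(const std::string& name) {
    auto it = registry().find(name);
    if (it == registry().end()) {
        return nullptr;
    }
    return it->second();
}
```

The accessor function allow_all_authenticator_name() in the diff uses the same first-use trick for its own static string, which makes it safe to call later at run time.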

									
										97

auth/allow_all_authenticator.hh
									
										Normal file
									
				@@ -0,0 +1,97 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <stdexcept>

				#include "auth/authenticator.hh"

				#include "auth/authenticated_user.hh"

				#include "auth/common.hh"

				namespace cql3 {

				class query_processor;

				}

				namespace service {

				class migration_manager;

				}

				namespace auth {

				const sstring& allow_all_authenticator_name();

				class allow_all_authenticator final : public authenticator {

				public:

				    allow_all_authenticator(cql3::query_processor&, ::service::migration_manager&) {

				    }

				    future<> start() override {

				        return make_ready_future<>();

				    }

				    future<> stop() override {

				        return make_ready_future<>();

				    }

				    const sstring& qualified_java_name() const override {

				        return allow_all_authenticator_name();

				    }

				    bool require_authentication() const override {

				        return false;

				    }

				    option_set supported_options() const override {

				        return option_set();

				    }

				    option_set alterable_options() const override {

				        return option_set();

				    }

				    future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {

				        return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());

				    }

				    future<> create(sstring username, const option_map& options) override {

				        return make_ready_future();

				    }

				    future<> alter(sstring username, const option_map& options) override {

				        return make_ready_future();

				    }

				    future<> drop(sstring username) override {

				        return make_ready_future();

				    }

				    const resource_ids& protected_resources() const override {

				        static const resource_ids ids;

				        return ids;

				    }

				    ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {

				        throw std::runtime_error("Should not reach");

				    }

				};

				}

									
										41

auth/allow_all_authorizer.cc
									
										Normal file
									
				@@ -0,0 +1,41 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "auth/allow_all_authorizer.hh"

				#include "auth/common.hh"

				#include "utils/class_registrator.hh"

				namespace auth {

				const sstring& allow_all_authorizer_name() {

				    static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";

				    return name;

				}

				// To ensure correct initialization order, we unfortunately need to use a string literal.

				static const class_registrator<

				    authorizer,

				    allow_all_authorizer,

				    cql3::query_processor&,

				    ::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthorizer");

				}

									
										98

auth/allow_all_authorizer.hh
									
										Normal file
									
				@@ -0,0 +1,98 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "authorizer.hh"

				#include "exceptions/exceptions.hh"

				#include "stdx.hh"

				namespace cql3 {

				class query_processor;

				}

				namespace service {

				class migration_manager;

				}

				namespace auth {

				class service;

				const sstring& allow_all_authorizer_name();

				class allow_all_authorizer final  : public authorizer {

				public:

				    allow_all_authorizer(cql3::query_processor&, ::service::migration_manager&) {

				    }

				    future<> start() override {

				        return make_ready_future<>();

				    }

				    future<> stop() override {

				        return make_ready_future<>();

				    }

				    const sstring& qualified_java_name() const override {

				        return allow_all_authorizer_name();

				    }

				    future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const override {

				        return make_ready_future<permission_set>(permissions::ALL);

				    }

				    future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {

				        throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");

				    }

				    future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {

				        throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");

				    }

				    future<std::vector<permission_details>> list(

				            service&,

				            ::shared_ptr<authenticated_user> performer,

				            permission_set,

				            stdx::optional<data_resource>,

				            stdx::optional<sstring>) const override {

				        throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");

				    }

				    future<> revoke_all(sstring dropped_user) override {

				        return make_ready_future();

				    }

				    future<> revoke_all(data_resource) override {

				        return make_ready_future();

				    }

				    const resource_ids& protected_resources() override {

				        static const resource_ids ids;

				        return ids;

				    }

				    future<> validate_configuration() const override {

				        return make_ready_future();

				    }

				};

				}

									
										306

auth/auth.cc
									
				@@ -1,306 +0,0 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 *

				 * Modified by Cloudius Systems

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include <seastar/core/sleep.hh>

				#include "auth.hh"

				#include "authenticator.hh"

				#include "database.hh"

				#include "cql3/query_processor.hh"

				#include "cql3/statements/cf_statement.hh"

				#include "cql3/statements/create_table_statement.hh"

				#include "db/config.hh"

				#include "service/migration_manager.hh"

				const sstring auth::auth::DEFAULT_SUPERUSER_NAME("cassandra");

				const sstring auth::auth::AUTH_KS("system_auth");

				const sstring auth::auth::USERS_CF("users");

				static const sstring USER_NAME("name");

				static const sstring SUPER("super");

				static logging::logger logger("auth");

				// TODO: configurable

				using namespace std::chrono_literals;

				const std::chrono::milliseconds auth::auth::SUPERUSER_SETUP_DELAY = 10000ms;

				class auth_migration_listener : public service::migration_listener {

				    void on_create_keyspace(const sstring& ks_name) override {}

				    void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}

				    void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}

				    void on_create_function(const sstring& ks_name, const sstring& function_name) override {}

				    void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}

				    void on_update_keyspace(const sstring& ks_name) override {}

				    void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}

				    void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}

				    void on_update_function(const sstring& ks_name, const sstring& function_name) override {}

				    void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}

				    void on_drop_keyspace(const sstring& ks_name) override {

				        // TODO:

				        //DatabaseDescriptor.getAuthorizer().revokeAll(DataResource.keyspace(ksName));

				    }

				    void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {

				        // TODO:

				        //DatabaseDescriptor.getAuthorizer().revokeAll(DataResource.columnFamily(ksName, cfName));

				    }

				    void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}

				    void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}

				    void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}

				};

				static auth_migration_listener auth_migration;

				/**

				 * Poor man's job scheduler. For maximum 2 jobs. Sic.

				 * Still does nothing more clever than waiting 10 seconds

				 * like origin, then runs the submitted tasks.

				 *

				 * Only difference compared to sleep (from which this

				 * borrows _heavily_) is that if tasks have not run by the time

				 * we exit (and do static clean up) we delete the promise + cont

				 *

				 * Should be abstracted to some sort of global server function

				 * probably.

				 */

				struct waiter {

				    promise<> done;

				    timer<> tmr;

				    waiter() : tmr([this] {done.set_value();})

				    {

				        tmr.arm(auth::auth::SUPERUSER_SETUP_DELAY);

				    }

				    ~waiter() {

				        if (tmr.armed()) {

				            tmr.cancel();

				            done.set_exception(std::runtime_error("shutting down"));

				        }

				        logger.trace("Deleting scheduled task");

				    }

				    void kill() {

				    }

				};

				typedef std::unique_ptr<waiter> waiter_ptr;

				static std::vector<waiter_ptr> & thread_waiters() {

				    static thread_local std::vector<waiter_ptr> the_waiters;

				    return the_waiters;

				}

				void auth::auth::schedule_when_up(scheduled_func f) {

				    logger.trace("Adding scheduled task");

				    auto & waiters = thread_waiters();

				    waiters.emplace_back(std::make_unique<waiter>());

				    auto* w = waiters.back().get();

				    w->done.get_future().finally([w] {

				        auto & waiters = thread_waiters();

				        auto i = std::find_if(waiters.begin(), waiters.end(), [w](const waiter_ptr& p) {

				                            return p.get() == w;

				                        });

				        if (i != waiters.end()) {

				            waiters.erase(i);

				        }

				    }).then([f = std::move(f)] {

				        logger.trace("Running scheduled task");

				        return f();

				    }).handle_exception([](auto ep) {

				        return make_ready_future();

				    });

				}

				bool auth::auth::is_class_type(const sstring& type, const sstring& classname) {

				    if (type == classname) {

				        return true;

				    }

				    auto i = classname.find_last_of('.');

				    return classname.compare(i + 1, sstring::npos, type) == 0;

				}

				future<> auth::auth::setup() {

				    auto& db = cql3::get_local_query_processor().db().local();

				    auto& cfg = db.get_config();

				    auto type = cfg.authenticator();

				    if (is_class_type(type, authenticator::ALLOW_ALL_AUTHENTICATOR_NAME)) {

				        return authenticator::setup(type).discard_result(); // just create the object

				    }

				    future<> f = make_ready_future();

				    if (!db.has_keyspace(AUTH_KS)) {

				        std::map<sstring, sstring> opts;

				        opts["replication_factor"] = "1";

				        auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);

				        f = service::get_local_migration_manager().announce_new_keyspace(ksm, false);

				    }

				    return f.then([] {

				        return setup_table(USERS_CF, sprint("CREATE TABLE %s.%s (%s text, %s boolean, PRIMARY KEY(%s)) WITH gc_grace_seconds=%d",

				                                        AUTH_KS, USERS_CF, USER_NAME, SUPER, USER_NAME,

				                                        90 * 24 * 60 * 60)); // 3 months.

				    }).then([type] {

				        return authenticator::setup(type).discard_result();

				    }).then([] {

				        // TODO authorizer

				    }).then([] {

				        service::get_local_migration_manager().register_listener(&auth_migration); // again, only one shard...

				        // instead of once-timer, just schedule this later

				        schedule_when_up([] {

				            // setup default super user

				            return has_existing_users(USERS_CF, DEFAULT_SUPERUSER_NAME, USER_NAME).then([](bool exists) {

				                if (!exists) {

				                    auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",

				                                    AUTH_KS, USERS_CF, USER_NAME, SUPER);

				                    cql3::get_local_query_processor().process(query, db::consistency_level::ONE, {DEFAULT_SUPERUSER_NAME, true}).then([](auto) {

				                        logger.info("Created default superuser '{}'", DEFAULT_SUPERUSER_NAME);

				                    }).handle_exception([](auto ep) {

				                        try {

				                            std::rethrow_exception(ep);

				                        } catch (exceptions::request_execution_exception&) {

				                            logger.warn("Skipped default superuser setup: some nodes were not ready");

				                        }

				                    });

				                }

				            });

				        });

				    });

				}

				future<> auth::auth::shutdown() {

				    // just make sure we don't have pending tasks.

				    // this is mostly relevant for test cases where

				    // db-env-shutdown != process shutdown

				    return smp::invoke_on_all([] {

				        thread_waiters().clear();

				    });

				}

				static db::consistency_level consistency_for_user(const sstring& username) {

				    if (username == auth::auth::DEFAULT_SUPERUSER_NAME) {

				        return db::consistency_level::QUORUM;

				    }

				    return db::consistency_level::LOCAL_ONE;

				}

				static future<::shared_ptr<cql3::untyped_result_set>> select_user(const sstring& username) {

				    // Here was a thread local, explicit cache of prepared statement. In normal execution this is

				    // fine, but since we in testing set up and tear down system over and over, we'd start using

				    // obsolete prepared statements pretty quickly.

				    // Rely on query processing caching statements instead, and let's assume

				    // that a map lookup string->statement is not gonna kill us much.

				    return cql3::get_local_query_processor().process(

				                    sprint("SELECT * FROM %s.%s WHERE %s = ?",

				                                    auth::auth::AUTH_KS, auth::auth::USERS_CF,

				                                    USER_NAME), consistency_for_user(username),

				                    { username }, true);

				}

				future<bool> auth::auth::is_existing_user(const sstring& username) {

				    return select_user(username).then(

				                    [](::shared_ptr<cql3::untyped_result_set> res) {

				                        return make_ready_future<bool>(!res->empty());

				                    });

				}

				future<bool> auth::auth::is_super_user(const sstring& username) {

				    return select_user(username).then(

				                    [](::shared_ptr<cql3::untyped_result_set> res) {

				                        return make_ready_future<bool>(!res->empty() && res->one().get_as<bool>(SUPER));

				                    });

				}

				future<> auth::auth::insert_user(const sstring& username, bool is_super)

				                throw (exceptions::request_execution_exception) {

				    return cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",

				                    AUTH_KS, USERS_CF, USER_NAME, SUPER),

				                    consistency_for_user(username), { username, is_super }).discard_result();

				}

				future<> auth::auth::delete_user(const sstring& username) throw(exceptions::request_execution_exception) {

				    return cql3::get_local_query_processor().process(sprint("DELETE FROM %s.%s WHERE %s = ?",

				                    AUTH_KS, USERS_CF, USER_NAME),

				                    consistency_for_user(username), { username }).discard_result();

				}

				future<> auth::auth::setup_table(const sstring& name, const sstring& cql) {

				    auto& qp = cql3::get_local_query_processor();

				    auto& db = qp.db().local();

				    if (db.has_schema(AUTH_KS, name)) {

				        return make_ready_future();

				    }

				    ::shared_ptr<cql3::statements::cf_statement> parsed = static_pointer_cast<

				                    cql3::statements::cf_statement>(cql3::query_processor::parse_statement(cql));

				    parsed->prepare_keyspace(AUTH_KS);

				    ::shared_ptr<cql3::statements::create_table_statement> statement =

				                    static_pointer_cast<cql3::statements::create_table_statement>(

				                                    parsed->prepare(db)->statement);

				    // Origin sets "Legacy Cf Id" for the new table. We have no need to be

				    // pre-2.1 compatible (afaik), so lets skip a whole lotta hoolaballo

				    return statement->announce_migration(qp.proxy(), false).then([statement](bool) {});

				}

				future<bool> auth::auth::has_existing_users(const sstring& cfname, const sstring& def_user_name, const sstring& name_column) {

				    auto default_user_query = sprint("SELECT * FROM %s.%s WHERE %s = ?", AUTH_KS, cfname, name_column);

				    auto all_users_query = sprint("SELECT * FROM %s.%s LIMIT 1", AUTH_KS, cfname);

				    return cql3::get_local_query_processor().process(default_user_query, db::consistency_level::ONE, { def_user_name }).then([=](::shared_ptr<cql3::untyped_result_set> res) {

				        if (!res->empty()) {

				            return make_ready_future<bool>(true);

				        }

				        return cql3::get_local_query_processor().process(default_user_query, db::consistency_level::QUORUM, { def_user_name }).then([all_users_query](::shared_ptr<cql3::untyped_result_set> res) {

				            if (!res->empty()) {

				                return make_ready_future<bool>(true);

				            }

				            return cql3::get_local_query_processor().process(all_users_query, db::consistency_level::QUORUM).then([](::shared_ptr<cql3::untyped_result_set> res) {

				                return make_ready_future<bool>(!res->empty());

				            });

				        });

				    });

				}
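The nested continuations in `has_existing_users` above escalate in three steps: look for the default user at consistency `ONE`, retry at `QUORUM`, and only then ask whether any row exists at all. A minimal synchronous sketch of that control flow follows. It assumes a hypothetical `query_fn` callback and a hypothetical `consistency` enum in place of the query processor, futures, and `db::consistency_level`; it is an illustration, not Scylla code.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for illustration only; these are not Scylla types.
enum class consistency { one, quorum };

// Runs a query at the given consistency level and reports whether any row came back.
using query_fn = std::function<bool(const std::string& query, consistency cl)>;

// Mirrors the three-step escalation: cheap local check first, then a quorum
// check for the default user, then a quorum check for any user at all.
bool has_existing_users_sketch(const query_fn& run,
                               const std::string& default_user_query,
                               const std::string& all_users_query) {
    if (run(default_user_query, consistency::one)) {
        return true;
    }
    if (run(default_user_query, consistency::quorum)) {
        return true;
    }
    return run(all_users_query, consistency::quorum);
}
```

Each later step runs only when the earlier one found nothing, so the common case (the default user already exists) costs a single query.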

									
										121

auth/auth.hh
									
											
				@@ -1,121 +0,0 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 *

				 * Modified by Cloudius Systems

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <chrono>

				#include <seastar/core/sstring.hh>

				#include <seastar/core/future.hh>

				#include "exceptions/exceptions.hh"

				namespace auth {

				class auth {

				public:

				    static const sstring DEFAULT_SUPERUSER_NAME;

				    static const sstring AUTH_KS;

				    static const sstring USERS_CF;

				    static const std::chrono::milliseconds SUPERUSER_SETUP_DELAY;

				    static bool is_class_type(const sstring& type, const sstring& classname);

				#if 0

				    public static Set<Permission> getPermissions(AuthenticatedUser user, IResource resource)

				    {

				        return permissionsCache.getPermissions(user, resource);

				    }

				#endif

				    /**

				     * Checks if the username is stored in AUTH_KS.USERS_CF.

				     *

				     * @param username Username to query.

				     * @return whether or not Cassandra knows about the user.

				     */

				    static future<bool> is_existing_user(const sstring& username);

				    /**

				     * Checks if the user is a known superuser.

				     *

				     * @param username Username to query.

				     * @return true if the user is a superuser, false if they aren't or don't exist at all.

				     */

				    static future<bool> is_super_user(const sstring& username);

				    /**

				     * Inserts the user into AUTH_KS.USERS_CF (or overwrites their superuser status as a result of an ALTER USER query).

				     *

				     * @param username Username to insert.

				     * @param isSuper User's new status.

				     * @throws RequestExecutionException

				     */

				    static future<> insert_user(const sstring& username, bool is_super) throw(exceptions::request_execution_exception);

				    /**

				     * Deletes the user from AUTH_KS.USERS_CF.

				     *

				     * @param username Username to delete.

				     * @throws RequestExecutionException

				     */

				    static future<> delete_user(const sstring& username) throw(exceptions::request_execution_exception);

				    /**

				     * Sets up Authenticator and Authorizer.

				     */

				    static future<> setup();

				    static future<> shutdown();

				    /**

				     * Sets up a table from the given CREATE TABLE statement under the system_auth keyspace, if it does not already exist.

				     *

				     * @param name name of the table

				     * @param cql CREATE TABLE statement

				     */

				    static future<> setup_table(const sstring& name, const sstring& cql);

				    static future<bool> has_existing_users(const sstring& cfname, const sstring& def_user_name, const sstring& name_column_name);

				    // For internal use. Run function "when system is up".

				    typedef std::function<future<>()> scheduled_func;

				    static void schedule_when_up(scheduled_func);

				};

				}

									
										7

auth/authenticated_user.cc
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -52,6 +52,9 @@ auth::authenticated_user::authenticated_user(sstring name)

				                : _name(name), _anon(false)

				{}

				auth::authenticated_user::authenticated_user(authenticated_user&&) = default;

				auth::authenticated_user::authenticated_user(const authenticated_user&) = default;

				const sstring& auth::authenticated_user::name() const {

				    return _anon ? ANONYMOUS_USERNAME : _name;

				}

									
										16

auth/authenticated_user.hh
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -42,6 +42,8 @@

				#pragma once

				#include <seastar/core/sstring.hh>

				#include <seastar/core/future.hh>

				#include "seastarx.hh"

				namespace auth {

				@@ -51,17 +53,11 @@ public:

				    authenticated_user();

				    authenticated_user(sstring name);

				    authenticated_user(authenticated_user&&);

				    authenticated_user(const authenticated_user&);

				    const sstring& name() const;

				    /**

				     * Checks the user's superuser status.

				     * Only a superuser is allowed to perform CREATE USER and DROP USER queries.

				     * In most cases, though not necessarily, a superuser will have Permission.ALL on every resource

				     * (depends on IAuthorizer implementation).

				     */

				    bool is_super() const;

				    /**

				     * If IAuthenticator doesn't require authentication, this method may return true.

				     */

									
										75

auth/authenticator.cc
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -41,70 +41,27 @@

				#include "authenticator.hh"

				#include "authenticated_user.hh"

				#include "common.hh"

				#include "password_authenticator.hh"

				#include "auth.hh"

				#include "cql3/query_processor.hh"

				#include "db/config.hh"

				#include "utils/class_registrator.hh"

				const sstring auth::authenticator::USERNAME_KEY("username");

				const sstring auth::authenticator::PASSWORD_KEY("password");

				const sstring auth::authenticator::ALLOW_ALL_AUTHENTICATOR_NAME("org.apache.cassandra.auth.AllowAllAuthenticator");

				/**

				 * Authenticator is assumed to be a fully state-less immutable object (note all the const).

				 * We thus store a single instance globally, since it should be safe/ok.

				 */

				static std::unique_ptr<auth::authenticator> global_authenticator;

				future<>

				auth::authenticator::setup(const sstring& type) throw (exceptions::configuration_exception) {

				    if (auth::auth::is_class_type(type, ALLOW_ALL_AUTHENTICATOR_NAME)) {

				        class allow_all_authenticator : public authenticator {

				        public:

				            const sstring& class_name() const override {

				                return ALLOW_ALL_AUTHENTICATOR_NAME;

				            }

				            bool require_authentication() const override {

				                return false;

				            }

				            option_set supported_options() const override {

				                return option_set();

				            }

				            option_set alterable_options() const override {

				                return option_set();

				            }

				            future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) override {

				                return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());

				            }

				            future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {

				                return make_ready_future();

				            }

				            future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {

				                return make_ready_future();

				            }

				            future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {

				                return make_ready_future();

				            }

				            resource_ids protected_resources() const override {

				                return resource_ids();

				            }

				            ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {

				                throw std::runtime_error("Should not reach");

				            }

				        };

				        global_authenticator = std::make_unique<allow_all_authenticator>();

				    } else if (auth::auth::is_class_type(type, password_authenticator::PASSWORD_AUTHENTICATOR_NAME)) {

				        auto pwa = std::make_unique<password_authenticator>();

				        auto f = pwa->init();

				        return f.then([pwa = std::move(pwa)]() mutable {

				            global_authenticator = std::move(pwa);

				        });

				    } else {

				        throw exceptions::configuration_exception("Invalid authenticator type: " + type);

				auth::authenticator::option auth::authenticator::string_to_option(const sstring& name) {

				    if (strcasecmp(name.c_str(), "password") == 0) {

				        return option::PASSWORD;

				    }

				    return make_ready_future();

				    throw std::invalid_argument(name);

				}

				auth::authenticator& auth::authenticator::get() {

				    assert(global_authenticator);

				    return *global_authenticator;

				sstring auth::authenticator::option_to_string(option opt) {

				    switch (opt) {

				    case option::PASSWORD:

				        return "PASSWORD";

				    default:

				        throw std::invalid_argument(sprint("Unknown option {}", opt));

				    }

				}
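The `string_to_option` and `option_to_string` functions in this diff form a small round-trip pair: parsing is case-insensitive, while printing always yields the upper-case keyword. A self-contained sketch of the same idea follows. It assumes `std::string` in place of `sstring` and a hypothetical `iequals` helper in place of `strcasecmp`; the `_sketch` names are illustrative and not part of the Scylla sources.

```cpp
#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for authenticator::option.
enum class option_sketch { PASSWORD };

// Case-insensitive comparison; a portable replacement for strcasecmp() == 0.
bool iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](unsigned char x, unsigned char y) {
               return std::tolower(x) == std::tolower(y);
           });
}

// Accepts any capitalisation of a known option name; rejects everything else.
option_sketch string_to_option_sketch(const std::string& name) {
    if (iequals(name, "password")) {
        return option_sketch::PASSWORD;
    }
    throw std::invalid_argument(name);
}

// Always prints the canonical upper-case keyword.
std::string option_to_string_sketch(option_sketch opt) {
    switch (opt) {
    case option_sketch::PASSWORD:
        return "PASSWORD";
    }
    throw std::invalid_argument("unknown option");
}
```

Keeping the two directions next to each other makes it easy to check that every enumerator that can be printed can also be parsed back.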

									
										50

auth/authenticator.hh
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -69,7 +69,6 @@ class authenticator {

				public:

				    static const sstring USERNAME_KEY;

				    static const sstring PASSWORD_KEY;

				    static const sstring ALLOW_ALL_AUTHENTICATOR_NAME;

				    /**

				     * Supported CREATE USER/ALTER USER options.

				@@ -79,32 +78,21 @@ public:

				        PASSWORD

				    };

				    static option string_to_option(const sstring&);

				    static sstring option_to_string(option);

				    using option_set = enum_set<super_enum<option, option::PASSWORD>>;

				    using option_map = std::unordered_map<option, boost::any, enum_hash<option>>;

				    using credentials_map = std::unordered_map<sstring, sstring>;

				    /**

				     * Resource id mappings, i.e. keyspace and/or column families.

				     */

				    using resource_ids = std::set<data_resource>;

				    /**

				     * Setup is called once upon system startup to initialize the IAuthenticator.

				     *

				     * For example, use this method to create any required keyspaces/column families.

				     * Note: Only call from main thread.

				     */

				    static future<> setup(const sstring& type) throw(exceptions::configuration_exception);

				    /**

				     * Returns the system authenticator. Must have called setup before calling this.

				     */

				    static authenticator& get();

				    virtual ~authenticator()

				    {}

				    virtual const sstring& class_name() const = 0;

				    virtual future<> start() = 0;

				    virtual future<> stop() = 0;

				    virtual const sstring& qualified_java_name() const = 0;

				    /**

				     * Whether or not the authenticator requires explicit login.

				@@ -131,7 +119,7 @@ public:

				     *

				     * @throws authentication_exception if credentials don't match any known user.

				     */

				    virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) = 0;

				    virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const = 0;

				    /**

				     * Called during execution of CREATE USER query (also may be called on startup, see seedSuperuserOptions method).

				@@ -143,7 +131,7 @@ public:

				     * @throws exceptions::request_validation_exception

				     * @throws exceptions::request_execution_exception

				     */

				    virtual future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;

				    virtual future<> create(sstring username, const option_map& options) = 0;

				    /**

				     * Called during execution of ALTER USER query.

				@@ -156,7 +144,7 @@ public:

				     * @throws exceptions::request_validation_exception

				     * @throws exceptions::request_execution_exception

				     */

				    virtual future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;

				    virtual future<> alter(sstring username, const option_map& options) = 0;

				    /**

				@@ -166,7 +154,7 @@ public:

				     * @throws exceptions::request_validation_exception

				     * @throws exceptions::request_execution_exception

				     */

				    virtual future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;

				    virtual future<> drop(sstring username) = 0;

				     /**

				     * Set of resources that should be made inaccessible to users and only accessible internally.

				@@ -174,14 +162,14 @@ public:

				     * @return Keyspaces, column families that will be unmodifiable by users; other resources.

				     * @see resource_ids

				     */

				    virtual resource_ids protected_resources() const = 0;

				    virtual const resource_ids& protected_resources() const = 0;

				    class sasl_challenge {

				    public:

				        virtual ~sasl_challenge() {}

				        virtual bytes evaluate_response(bytes_view client_response) throw(exceptions::authentication_exception) = 0;

				        virtual bytes evaluate_response(bytes_view client_response) = 0;

				        virtual bool is_complete() const = 0;

				        virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const throw(exceptions::authentication_exception) = 0;

				        virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const = 0;

				    };

				    /**

				@@ -194,5 +182,9 @@ public:

				    virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const = 0;

				};

				inline std::ostream& operator<<(std::ostream& os, authenticator::option opt) {

				    return os << authenticator::option_to_string(opt);

				}

				}

									
										118

auth/authorizer.cc
									
										Normal file
									
												
				@@ -0,0 +1,118 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "authorizer.hh"

				#include "authenticated_user.hh"

				#include "common.hh"

				#include "default_authorizer.hh"

				#include "auth.hh"

				#include "cql3/query_processor.hh"

				#include "db/config.hh"

				#include "utils/class_registrator.hh"

				const sstring& auth::allow_all_authorizer_name() {

				    static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";

				    return name;

				}

				/**

				 * Authorizer is assumed to be a fully stateless immutable object (note all the const).

				 * We thus store a single instance globally, since it should be safe/ok.

				 */

				static std::unique_ptr<auth::authorizer> global_authorizer;

				using authorizer_registry = class_registry<auth::authorizer, cql3::query_processor&>;

				future<>

				auth::authorizer::setup(const sstring& type) {

				    if (type == allow_all_authorizer_name()) {

				        class allow_all_authorizer : public authorizer {

				        public:

				            future<> start() override {

				                return make_ready_future<>();

				            }

				            future<> stop() override {

				                return make_ready_future<>();

				            }

				            const sstring& qualified_java_name() const override {

				                return allow_all_authorizer_name();

				            }

				            future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const override {

				                return make_ready_future<permission_set>(permissions::ALL);

				            }

				            future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {

				                throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");

				            }

				            future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {

				                throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");

				            }

				            future<std::vector<permission_details>> list(::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const override {

				                throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");

				            }

				            future<> revoke_all(sstring dropped_user) override {

				                return make_ready_future();

				            }

				            future<> revoke_all(data_resource) override {

				                return make_ready_future();

				            }

				            const resource_ids& protected_resources() override {

				                static const resource_ids ids;

				                return ids;

				            }

				            future<> validate_configuration() const override {

				                return make_ready_future();

				            }

				        };

				        global_authorizer = std::make_unique<allow_all_authorizer>();

				        return make_ready_future();

				    } else {

				        auto a = authorizer_registry::create(type, cql3::get_local_query_processor());

				        auto f = a->start();

				        return f.then([a = std::move(a)]() mutable {

				            global_authorizer = std::move(a);

				        });

				    }

				}

				auth::authorizer& auth::authorizer::get() {

				    assert(global_authorizer);

				    return *global_authorizer;

				}
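`authorizer::setup` above picks an implementation by name: the allow-all case is built in, and any other name is looked up in a registry that constructs the matching class. The sketch below shows the general shape of such a name-to-factory registry. It is an assumption-laden illustration: `registry_sketch`, `authorizer_sketch`, and `allow_all_sketch` are hypothetical names, and the real `class_registry` in `utils/class_registrator.hh` may differ in interface and behaviour.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical minimal interface; the real authorizer has many more methods.
struct authorizer_sketch {
    virtual ~authorizer_sketch() = default;
    virtual std::string qualified_name() const = 0;
};

struct allow_all_sketch : authorizer_sketch {
    std::string qualified_name() const override {
        return "org.apache.cassandra.auth.AllowAllAuthorizer";
    }
};

// Maps a class name to a factory that builds the implementation.
class registry_sketch {
    using factory = std::function<std::unique_ptr<authorizer_sketch>()>;
    std::map<std::string, factory> _factories;
public:
    void add(const std::string& name, factory f) {
        _factories[name] = std::move(f);
    }
    // Throws if no implementation was registered under the given name.
    std::unique_ptr<authorizer_sketch> create(const std::string& name) const {
        auto it = _factories.find(name);
        if (it == _factories.end()) {
            throw std::invalid_argument("unknown authorizer: " + name);
        }
        return it->second();
    }
};
```

With a registry like this, adding a new authorizer only requires registering a factory under its name; the setup code does not need another `else if` branch.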

									
										167

auth/authorizer.hh
									
										Normal file
									
												
				@@ -0,0 +1,167 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <vector>

				#include <tuple>

				#include <experimental/optional>

				#include <seastar/core/future.hh>

				#include <seastar/core/shared_ptr.hh>

				#include "permission.hh"

				#include "data_resource.hh"

				#include "seastarx.hh"

				namespace auth {

				class service;

				class authenticated_user;

				struct permission_details {

				    sstring user;

				    data_resource resource;

				    permission_set permissions;

				    bool operator<(const permission_details& v) const {

				        return std::tie(user, resource, permissions) < std::tie(v.user, v.resource, v.permissions);

				    }

				};

				using std::experimental::optional;

				class authorizer {

				public:

				    virtual ~authorizer() {}

				    virtual future<> start() = 0;

				    virtual future<> stop() = 0;

				    virtual const sstring& qualified_java_name() const = 0;

				    /**

				     * The primary Authorizer method. Returns a set of permissions of a user on a resource.

				     *

				     * @param user Authenticated user requesting authorization.

				     * @param resource Resource for which the authorization is being requested. @see DataResource.

				     * @return Set of permissions of the user on the resource. Should never return empty. Use permission.NONE instead.

				     */

				    virtual future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const = 0;

				    /**

				     * Grants a set of permissions on a resource to a user.

				     * The opposite of revoke().

				     *

				     * @param performer User who grants the permissions.

				     * @param permissions Set of permissions to grant.

				     * @param to Grantee of the permissions.

				     * @param resource Resource on which to grant the permissions.

				     *

				     * @throws RequestValidationException

				     * @throws RequestExecutionException

				     */

				    virtual future<> grant(::shared_ptr<authenticated_user> performer, permission_set, data_resource, sstring to) = 0;

				    /**

				     * Revokes a set of permissions on a resource from a user.

				     * The opposite of grant().

				     *

				     * @param performer User who revokes the permissions.

				     * @param permissions Set of permissions to revoke.

				     * @param from Revokee of the permissions.

				     * @param resource Resource on which to revoke the permissions.

				     *

				     * @throws RequestValidationException

				     * @throws RequestExecutionException

				     */

				    virtual future<> revoke(::shared_ptr<authenticated_user> performer, permission_set, data_resource, sstring from) = 0;

				    /**

				     * Returns a list of permissions on a resource of a user.

				     *

				     * @param performer User who wants to see the permissions.

				     * @param permissions Set of Permission values the user is interested in. The result should only include the matching ones.

				     * @param resource The resource on which permissions are requested. Can be null, in which case permissions on all resources

				     *                 should be returned.

				     * @param of The user whose permissions are requested. Can be null, in which case permissions of every user should be returned.

				     *

				     * @return All of the matching permissions that the requesting user is authorized to know about.

				     *

				     * @throws RequestValidationException

				     * @throws RequestExecutionException

				     */

				    virtual future<std::vector<permission_details>> list(service&, ::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const = 0;

				    /**

				     * This method is called before deleting a user with DROP USER query so that a new user with the same

				     * name wouldn't inherit permissions of the deleted user in the future.

				     *

				     * @param droppedUser The user to revoke all permissions from.

				     */

				    virtual future<> revoke_all(sstring dropped_user) = 0;

				    /**

				     * This method is called after a resource is removed (i.e. keyspace or a table is dropped).

				     *

				     * @param droppedResource The resource to revoke all permissions on.

				     */

				    virtual future<> revoke_all(data_resource) = 0;

				    /**

				     * Set of resources that should be made inaccessible to users and only accessible internally.

				     *

				     * @return Keyspaces, column families that will be unmodifiable by users; other resources.

				     */

				    virtual const resource_ids& protected_resources() = 0;

				    /**

				     * Validates configuration of IAuthorizer implementation (if configurable).

				     *

				     * @throws ConfigurationException when there is a configuration error.

				     */

				    virtual future<> validate_configuration() const = 0;

				};

				}
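`permission_details::operator<` in this header compares `user`, then `resource`, then `permissions` by packing the fields into tuples with `std::tie`. The sketch below reproduces that pattern with plain standard types so the resulting order can be checked in isolation. `details_sketch` is a hypothetical stand-in: it uses `std::string` and `unsigned` where the real struct uses `sstring`, `data_resource`, and `permission_set`.

```cpp
#include <set>
#include <string>
#include <tuple>

// Hypothetical stand-in for permission_details, using only standard types.
struct details_sketch {
    std::string user;
    std::string resource;  // stands in for data_resource
    unsigned permissions;  // stands in for permission_set

    // Lexicographic order: user first, then resource, then permissions.
    bool operator<(const details_sketch& v) const {
        return std::tie(user, resource, permissions)
             < std::tie(v.user, v.resource, v.permissions);
    }
};

// A strict weak ordering is what lets the type live in ordered containers,
// which also collapse exact duplicates.
using details_set_sketch = std::set<details_sketch>;
```

Because `std::tie` builds tuples of references, the comparison copies nothing and stays correct when a field is added, provided the field is appended to both tuples.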

									
										70

auth/common.cc
									
										Normal file
									
												
				@@ -0,0 +1,70 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "auth/common.hh"

				#include <seastar/core/shared_ptr.hh>

				#include "cql3/query_processor.hh"

				#include "cql3/statements/create_table_statement.hh"

				#include "schema_builder.hh"

				#include "service/migration_manager.hh"

				namespace auth {

				namespace meta {

				const sstring DEFAULT_SUPERUSER_NAME("cassandra");

				const sstring AUTH_KS("system_auth");

				const sstring USERS_CF("users");

				const sstring AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");

				}

				future<> create_metadata_table_if_missing(

				        const sstring& table_name,

				        cql3::query_processor& qp,

				        const sstring& cql,

				        ::service::migration_manager& mm) {

				    auto& db = qp.db().local();

				    if (db.has_schema(meta::AUTH_KS, table_name)) {

				        return make_ready_future<>();

				    }

				    auto parsed_statement = static_pointer_cast<cql3::statements::raw::cf_statement>(

				            cql3::query_processor::parse_statement(cql));

				    parsed_statement->prepare_keyspace(meta::AUTH_KS);

				    auto statement = static_pointer_cast<cql3::statements::create_table_statement>(

				            parsed_statement->prepare(db, qp.get_cql_stats())->statement);

				    const auto schema = statement->get_cf_meta_data();

				    const auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());

				    schema_builder b(schema);

				    b.set_uuid(uuid);

				    return mm.announce_new_column_family(b.build(), false);

				}

				}

									
										74

auth/common.hh
									
										Normal file
									
												
				@@ -0,0 +1,74 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <chrono>

				#include <seastar/core/future.hh>

				#include <seastar/core/reactor.hh>

				#include <seastar/core/resource.hh>

				#include <seastar/core/sstring.hh>

				#include "delayed_tasks.hh"

				#include "seastarx.hh"

				namespace service {

				class migration_manager;

				}

				namespace cql3 {

				class query_processor;

				}

				namespace auth {

				namespace meta {

				extern const sstring DEFAULT_SUPERUSER_NAME;

				extern const sstring AUTH_KS;

				extern const sstring USERS_CF;

				extern const sstring AUTH_PACKAGE_NAME;

				}

				template <class Task>

				future<> once_among_shards(Task&& f) {

				    if (engine().cpu_id() == 0u) {

				        return f();

				    }

				    return make_ready_future<>();

				}

				template <class Task, class Clock>

				void delay_until_system_ready(delayed_tasks<Clock>& ts, Task&& f) {

				    static const typename std::chrono::milliseconds delay_duration(10000);

				    ts.schedule_after(delay_duration, std::forward<Task>(f));

				}

				future<> create_metadata_table_if_missing(

				        const sstring& table_name,

				        cql3::query_processor&,

				        const sstring& cql,

				        ::service::migration_manager&);

				}

									
										36

auth/data_resource.cc
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -47,11 +47,8 @@

				const sstring auth::data_resource::ROOT_NAME("data");

				auth::data_resource::data_resource(level l, const sstring& ks, const sstring& cf)

				    : _ks(ks), _cf(cf)

				    : _level(l), _ks(ks), _cf(cf)

				{

				    if (l != get_level()) {

				        throw std::invalid_argument("level/keyspace/column mismatch");

				    }

				}

				auth::data_resource::data_resource()

				@@ -67,14 +64,7 @@ auth::data_resource::data_resource(const sstring& ks, const sstring& cf)

				{}

				auth::data_resource::level auth::data_resource::get_level() const {

				    if (!_cf.empty()) {

				        assert(!_ks.empty());

				        return level::COLUMN_FAMILY;

				    }

				    if (!_ks.empty()) {

				        return level::KEYSPACE;

				    }

				    return level::ROOT;

				    return _level;

				}

				auth::data_resource auth::data_resource::from_name(

				@@ -125,16 +115,14 @@ auth::data_resource auth::data_resource::get_parent() const {

				    }

				}

				const sstring& auth::data_resource::keyspace() const

				                throw (std::invalid_argument) {

				const sstring& auth::data_resource::keyspace() const {

				    if (is_root_level()) {

				        throw std::invalid_argument("ROOT data resource has no keyspace");

				    }

				    return _ks;

				}

				const sstring& auth::data_resource::column_family() const

				                throw (std::invalid_argument) {

				const sstring& auth::data_resource::column_family() const {

				    if (!is_column_family_level()) {

				        throw std::invalid_argument(sprint("%s data resource has no column family", name()));

				    }

				@@ -158,7 +146,15 @@ bool auth::data_resource::exists() const {

				}

				sstring auth::data_resource::to_string() const {

				    return name();

				    switch (get_level()) {

				        case level::ROOT:

				            return "<all keyspaces>";

				        case level::KEYSPACE:

				            return sprint("<keyspace %s>", _ks);

				        case level::COLUMN_FAMILY:

				        default:

				            return sprint("<table %s.%s>", _ks, _cf);

				    }

				}

				bool auth::data_resource::operator==(const data_resource& v) const {

				@@ -170,6 +166,6 @@ bool auth::data_resource::operator<(const data_resource& v) const {

				}

				std::ostream& auth::operator<<(std::ostream& os, const data_resource& r) {

				    return os << r.name();

				    return os << r.to_string();

				}
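Review note: the `to_string()` change above replaces the raw resource name with a human-readable label chosen by level. Below is a minimal, self-contained sketch of that mapping. `resource_sketch` and its fields are hypothetical stand-ins for `auth::data_resource`, and plain `std::string` concatenation is used in place of `sprint`; it is not the Scylla implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for auth::data_resource::level.
enum class level { ROOT, KEYSPACE, COLUMN_FAMILY };

// Hypothetical stand-in for auth::data_resource; only the fields used by
// to_string() are modelled.
struct resource_sketch {
    level lvl;
    std::string ks;
    std::string cf;

    // Same label shapes as the new to_string() in the diff.
    std::string to_string() const {
        switch (lvl) {
        case level::ROOT:
            return "<all keyspaces>";
        case level::KEYSPACE:
            return "<keyspace " + ks + ">";
        case level::COLUMN_FAMILY:
        default:
            return "<table " + ks + "." + cf + ">";
        }
    }
};

// Small self-check that can be called from any main().
inline void to_string_demo() {
    resource_sketch r{level::KEYSPACE, "ks1", ""};
    assert(r.to_string() == "<keyspace ks1>");
}
```

Because `operator<<` now forwards to `to_string()`, log lines such as the authorizer warnings would print these labels rather than the stored resource name.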

									
										21

auth/data_resource.hh
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -41,8 +41,11 @@

				#pragma once

				#include "utils/hash.hh"

				#include <iosfwd>

				#include <set>

				#include <seastar/core/sstring.hh>

				#include "seastarx.hh"

				namespace auth {

				@@ -54,6 +57,7 @@ private:

				    static const sstring ROOT_NAME;

				    level _level;

				    sstring _ks;

				    sstring _cf;

				@@ -114,13 +118,13 @@ public:

				     * @return keyspace of the resource.

				     * @throws std::invalid_argument if it's the root-level resource.

				     */

				    const sstring& keyspace() const throw(std::invalid_argument);

				    const sstring& keyspace() const;

				    /**

				     * @return column family of the resource.

				     * @throws std::invalid_argument if it's not a cf-level resource.

				     */

				    const sstring& column_family() const throw(std::invalid_argument);

				    const sstring& column_family() const;

				    /**

				     * @return Whether or not the resource has a parent in the hierarchy.

				@@ -136,8 +140,17 @@ public:

				    bool operator==(const data_resource&) const;

				    bool operator<(const data_resource&) const;

				    size_t hash_value() const {

				        return utils::tuple_hash()(_ks, _cf);

				    }

				};

				/**

				 * Resource id mappings, i.e. keyspace and/or column families.

				 */

				using resource_ids = std::set<data_resource>;

				std::ostream& operator<<(std::ostream&, const data_resource&);

				}

									
										257

auth/default_authorizer.cc
									
										Normal file
									
												
				@@ -0,0 +1,257 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include <unistd.h>

				#include <crypt.h>

				#include <random>

				#include <chrono>

				#include <seastar/core/reactor.hh>

				#include "common.hh"

				#include "default_authorizer.hh"

				#include "authenticated_user.hh"

				#include "permission.hh"

				#include "cql3/query_processor.hh"

				#include "cql3/untyped_result_set.hh"

				#include "exceptions/exceptions.hh"

				#include "log.hh"

				const sstring& auth::default_authorizer_name() {

				    static const sstring name = meta::AUTH_PACKAGE_NAME + "CassandraAuthorizer";

				    return name;

				}

				static const sstring USER_NAME = "username";

				static const sstring RESOURCE_NAME = "resource";

				static const sstring PERMISSIONS_NAME = "permissions";

				static const sstring PERMISSIONS_CF = "permissions";

				static logging::logger alogger("default_authorizer");

				// To ensure correct initialization order, we unfortunately need to use a string literal.

				static const class_registrator<

				        auth::authorizer,

				        auth::default_authorizer,

				        cql3::query_processor&,

				        ::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.CassandraAuthorizer");

				auth::default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)

				        : _qp(qp)

				        , _migration_manager(mm) {

				}

				auth::default_authorizer::~default_authorizer() {

				}

				future<> auth::default_authorizer::start() {

				    static const sstring create_table = sprint("CREATE TABLE %s.%s ("

				                    "%s text,"

				                    "%s text,"

				                    "%s set<text>,"

				                    "PRIMARY KEY(%s, %s)"

				                    ") WITH gc_grace_seconds=%d", meta::AUTH_KS,

				                    PERMISSIONS_CF, USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME,

				                    USER_NAME, RESOURCE_NAME, 90 * 24 * 60 * 60); // 3 months.

				    return auth::once_among_shards([this] {

				        return auth::create_metadata_table_if_missing(

				                PERMISSIONS_CF,

				                _qp,

				                create_table,

				                _migration_manager);

				    });

				}

				future<> auth::default_authorizer::stop() {

				    return make_ready_future<>();

				}

				future<auth::permission_set> auth::default_authorizer::authorize(

				                service& ser, ::shared_ptr<authenticated_user> user, data_resource resource) const {

				    return auth::is_super_user(ser, *user).then([this, user, resource = std::move(resource)](bool is_super) {

				        if (is_super) {

				            return make_ready_future<permission_set>(permissions::ALL);

				        }

				        /**

				         * TODO: could create actual data type for permission (translating string<->perm),

				         * but this seems overkill right now. We still must store strings so...

				         */

				        auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? AND %s = ?"

				                        , PERMISSIONS_NAME, meta::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);

				        return _qp.process(query, db::consistency_level::LOCAL_ONE, {user->name(), resource.name() })

				                        .then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {

				            try {

				                auto res = f.get0();

				                if (res->empty() || !res->one().has(PERMISSIONS_NAME)) {

				                    return make_ready_future<permission_set>(permissions::NONE);

				                }

				                return make_ready_future<permission_set>(permissions::from_strings(res->one().get_set<sstring>(PERMISSIONS_NAME)));

				            } catch (exceptions::request_execution_exception& e) {

				                alogger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);

				                return make_ready_future<permission_set>(permissions::NONE);

				            }

				        });

				    });

				}

				#include <boost/range.hpp>

				future<> auth::default_authorizer::modify(

				                ::shared_ptr<authenticated_user> performer, permission_set set,

				                data_resource resource, sstring user, sstring op) {

				    // TODO: why does this not check super user?

				    auto query = sprint("UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",

				                    meta::AUTH_KS, PERMISSIONS_CF, PERMISSIONS_NAME,

				                    PERMISSIONS_NAME, op, USER_NAME, RESOURCE_NAME);

				    return _qp.process(query, db::consistency_level::ONE, {

				                    permissions::to_strings(set), user, resource.name() }).discard_result();

				}

				future<> auth::default_authorizer::grant(

				                ::shared_ptr<authenticated_user> performer, permission_set set,

				                data_resource resource, sstring to) {

				    return modify(std::move(performer), std::move(set), std::move(resource), std::move(to), "+");

				}

				future<> auth::default_authorizer::revoke(

				                ::shared_ptr<authenticated_user> performer, permission_set set,

				                data_resource resource, sstring from) {

				    return modify(std::move(performer), std::move(set), std::move(resource), std::move(from), "-");

				}

				future<std::vector<auth::permission_details>> auth::default_authorizer::list(

				                service& ser, ::shared_ptr<authenticated_user> performer, permission_set set,

				                optional<data_resource> resource, optional<sstring> user) const {

				    return auth::is_super_user(ser, *performer).then([this, performer, set = std::move(set), resource = std::move(resource), user = std::move(user)](bool is_super) {

				        if (!is_super && (!user || performer->name() != *user)) {

				            throw exceptions::unauthorized_exception(sprint("You are not authorized to view %s's permissions", user ? *user : "everyone"));

				        }

				        auto query = sprint("SELECT %s, %s, %s FROM %s.%s", USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME, meta::AUTH_KS, PERMISSIONS_CF);

				        // Oh, look, it is a case where it does not pay off to have

				        // parameters to process in an initializer list.

				        future<::shared_ptr<cql3::untyped_result_set>> f = make_ready_future<::shared_ptr<cql3::untyped_result_set>>();

				        if (resource && user) {

				            query += sprint(" WHERE %s = ? AND %s = ?", USER_NAME, RESOURCE_NAME);

				            f = _qp.process(query, db::consistency_level::ONE, {*user, resource->name()});

				        } else if (resource) {

				            query += sprint(" WHERE %s = ? ALLOW FILTERING", RESOURCE_NAME);

				            f = _qp.process(query, db::consistency_level::ONE, {resource->name()});

				        } else if (user) {

				            query += sprint(" WHERE %s = ?", USER_NAME);

				            f = _qp.process(query, db::consistency_level::ONE, {*user});

				        } else {

				            f = _qp.process(query, db::consistency_level::ONE, {});

				        }

				        return f.then([set](::shared_ptr<cql3::untyped_result_set> res) {

				            std::vector<permission_details> result;

				            for (auto& row : *res) {

				                if (row.has(PERMISSIONS_NAME)) {

				                    auto username = row.get_as<sstring>(USER_NAME);

				                    auto resource = data_resource::from_name(row.get_as<sstring>(RESOURCE_NAME));

				                    auto ps = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));

				                    ps = permission_set::from_mask(ps.mask() & set.mask());

				                    result.emplace_back(permission_details {username, resource, ps});

				                }

				            }

				            return make_ready_future<std::vector<permission_details>>(std::move(result));

				        });

				    });

				}

				future<> auth::default_authorizer::revoke_all(sstring dropped_user) {

				    auto query = sprint("DELETE FROM %s.%s WHERE %s = ?", meta::AUTH_KS,

				                    PERMISSIONS_CF, USER_NAME);

				    return _qp.process(query, db::consistency_level::ONE, { dropped_user }).discard_result().handle_exception(

				                    [dropped_user](auto ep) {

				                        try {

				                            std::rethrow_exception(ep);

				                        } catch (exceptions::request_execution_exception& e) {

				                            alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);

				                        }

				                    });

				}

				future<> auth::default_authorizer::revoke_all(data_resource resource) {

				    auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",

				                    USER_NAME, meta::AUTH_KS, PERMISSIONS_CF, RESOURCE_NAME);

				    return _qp.process(query, db::consistency_level::LOCAL_ONE, { resource.name() })

				                    .then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {

				        try {

				            auto res = f.get0();

				            return parallel_for_each(res->begin(), res->end(), [this, res, resource](const cql3::untyped_result_set::row& r) {

				                auto query = sprint("DELETE FROM %s.%s WHERE %s = ? AND %s = ?"

				                                , meta::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);

				                return _qp.process(query, db::consistency_level::LOCAL_ONE, { r.get_as<sstring>(USER_NAME), resource.name() })

				                                .discard_result().handle_exception([resource](auto ep) {

				                    try {

				                        std::rethrow_exception(ep);

				                    } catch (exceptions::request_execution_exception& e) {

				                        alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);

				                    }

				                });

				            });

				        } catch (exceptions::request_execution_exception& e) {

				            alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);

				            return make_ready_future();

				        }

				    });

				}

				const auth::resource_ids& auth::default_authorizer::protected_resources() {

				    static const resource_ids ids({ data_resource(meta::AUTH_KS, PERMISSIONS_CF) });

				    return ids;

				}

				future<> auth::default_authorizer::validate_configuration() const {

				    return make_ready_future();

				}
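Review note: `grant()` and `revoke()` in the file above both delegate to `modify()`, differing only in the operator spliced into the `UPDATE` statement. The sketch below reproduces just the string assembly so the resulting CQL text is visible. The constants are copied from this diff; `modify_query` is a hypothetical helper name, and `std::string` concatenation stands in for `sprint`.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Literal values taken from auth/common.cc and auth/default_authorizer.cc.
const std::string AUTH_KS = "system_auth";
const std::string PERMISSIONS_CF = "permissions";
const std::string USER_NAME = "username";
const std::string RESOURCE_NAME = "resource";
const std::string PERMISSIONS_NAME = "permissions";

// Hypothetical helper: builds the statement text that modify() hands to the
// query processor. `op` is "+" for grant and "-" for revoke.
std::string modify_query(const std::string& op) {
    return "UPDATE " + AUTH_KS + "." + PERMISSIONS_CF +
           " SET " + PERMISSIONS_NAME + " = " + PERMISSIONS_NAME + " " + op + " ?" +
           " WHERE " + USER_NAME + " = ? AND " + RESOURCE_NAME + " = ?";
}

// Small self-check that can be called from any main().
inline void modify_query_demo() {
    assert(modify_query("+") ==
           "UPDATE system_auth.permissions SET permissions = permissions + ? "
           "WHERE username = ? AND resource = ?");
}

} // namespace sketch
```

Since `permissions` is a `set<text>` column, `permissions + ?` adds the bound elements and `permissions - ?` removes them, which is why one statement shape can serve both operations.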

									
										92

auth/default_authorizer.hh
									
										Normal file
									
												
				@@ -0,0 +1,92 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <functional>

				#include "authorizer.hh"

				#include "cql3/query_processor.hh"

				#include "service/migration_manager.hh"

				namespace auth {

				const sstring& default_authorizer_name();

				class default_authorizer : public authorizer {

				    cql3::query_processor& _qp;

				    ::service::migration_manager& _migration_manager;

				public:

				    default_authorizer(cql3::query_processor&, ::service::migration_manager&);

				    ~default_authorizer();

				    future<> start() override;

				    future<> stop() override;

				    const sstring& qualified_java_name() const override {

				        return default_authorizer_name();

				    }

				    future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const override;

				    future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;

				    future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;

				    future<std::vector<permission_details>> list(service&, ::shared_ptr<authenticated_user>, permission_set, optional<data_resource>, optional<sstring>) const override;

				    future<> revoke_all(sstring) override;

				    future<> revoke_all(data_resource) override;

				    const resource_ids& protected_resources() override;

				    future<> validate_configuration() const override;

				private:

				    future<> modify(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring, sstring);

				};

				} /* namespace auth */

									
										229

auth/password_authenticator.cc
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -46,28 +46,42 @@

				#include <seastar/core/reactor.hh>

				#include "auth.hh"

				#include "common.hh"

				#include "password_authenticator.hh"

				#include "authenticated_user.hh"

				#include "cql3/query_processor.hh"

				#include "cql3/untyped_result_set.hh"

				#include "log.hh"

				#include "service/migration_manager.hh"

				#include "utils/class_registrator.hh"

				const sstring auth::password_authenticator::PASSWORD_AUTHENTICATOR_NAME("org.apache.cassandra.auth.PasswordAuthenticator");

				const sstring& auth::password_authenticator_name() {

				    static const sstring name = meta::AUTH_PACKAGE_NAME + "PasswordAuthenticator";

				    return name;

				}

				// name of the hash column.

				static const sstring SALTED_HASH = "salted_hash";

				static const sstring USER_NAME = "username";

				static const sstring DEFAULT_USER_NAME = auth::auth::DEFAULT_SUPERUSER_NAME;

				static const sstring DEFAULT_USER_PASSWORD = auth::auth::DEFAULT_SUPERUSER_NAME;

				static const sstring DEFAULT_USER_NAME = auth::meta::DEFAULT_SUPERUSER_NAME;

				static const sstring DEFAULT_USER_PASSWORD = auth::meta::DEFAULT_SUPERUSER_NAME;

				static const sstring CREDENTIALS_CF = "credentials";

				static logging::logger logger("password_authenticator");

				static logging::logger plogger("password_authenticator");

				// To ensure correct initialization order, we unfortunately need to use a string literal.

				static const class_registrator<

				        auth::authenticator,

				        auth::password_authenticator,

				        cql3::query_processor&,

				        ::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");

				auth::password_authenticator::~password_authenticator()

				{}

				auth::password_authenticator::password_authenticator()

				{}

				auth::password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)

				    : _qp(qp)

				    , _migration_manager(mm) {

				}

				// TODO: blowfish

				// Origin uses Java bcrypt library, i.e. blowfish salt

				@@ -88,12 +102,10 @@ auth::password_authenticator::password_authenticator()

				// and some old-fashioned random salt generation.

				static constexpr size_t rand_bytes = 16;

				static thread_local crypt_data tlcrypt = { 0, };

				static sstring hashpw(const sstring& pass, const sstring& salt) {

				    // crypt_data is huge. should this be a thread_local static?

				    auto tmp = std::make_unique<crypt_data>();

				    tmp->initialized = 0;

				    auto res = crypt_r(pass.c_str(), salt.c_str(), tmp.get());

				    auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);

				    if (res == nullptr) {

				        throw std::system_error(errno, std::system_category());

				    }

				@@ -122,17 +134,16 @@ static sstring gensalt() {

				    sstring salt;

				    if (!prefix.empty()) {

				        return prefix + salt;

				        return prefix + input;

				    }

				    auto tmp = std::make_unique<crypt_data>();

				    tmp->initialized = 0;

				    // Try in order:

				    // blowfish 2011 fix, blowfish, sha512, sha256, md5

				    for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {

				        salt = pfx + input;

				        if (crypt_r("fisk", salt.c_str(), tmp.get())) {

				        const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);

				        if (e && (e[0] != '*')) {

				            prefix = pfx;

				            return salt;

				        }

				@@ -144,39 +155,52 @@ static sstring hashpw(const sstring& pass) {

				    return hashpw(pass, gensalt());

				}

				future<> auth::password_authenticator::init() {

				    gensalt(); // do this once to determine usable hashing

				future<> auth::password_authenticator::start() {

				    return auth::once_among_shards([this] {

				        gensalt(); // do this once to determine usable hashing

				    sstring create_table = sprint(

				                    "CREATE TABLE %s.%s ("

				                                    "%s text,"

				                                    "%s text," // salt + hash + number of rounds

				                                    "options map<text,text>,"// for future extensions

				                                    "PRIMARY KEY(%s)"

				                                    ") WITH gc_grace_seconds=%d",

				                    auth::auth::AUTH_KS,

				                    CREDENTIALS_CF, USER_NAME, SALTED_HASH, USER_NAME,

				                    90 * 24 * 60 * 60); // 3 months.

				        static const sstring create_table = sprint(

				                "CREATE TABLE %s.%s ("

				                "%s text,"

				                "%s text," // salt + hash + number of rounds

				                "options map<text,text>,"// for future extensions

				                "PRIMARY KEY(%s)"

				                ") WITH gc_grace_seconds=%d",

				                meta::AUTH_KS,

				                CREDENTIALS_CF, USER_NAME, SALTED_HASH, USER_NAME,

				                90 * 24 * 60 * 60); // 3 months.

				    return auth::setup_table(CREDENTIALS_CF, create_table).then([this] {

				        // instead of once-timer, just schedule this later

				        auth::schedule_when_up([] {

				            return auth::has_existing_users(CREDENTIALS_CF, DEFAULT_USER_NAME, USER_NAME).then([](bool exists) {

				                if (!exists) {

				                    cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",

				                                                    auth::AUTH_KS,

				                                                    CREDENTIALS_CF,

				                                                    USER_NAME, SALTED_HASH

				                                    ),

				                                    db::consistency_level::ONE, {DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD)}).then([](auto) {

				                                        logger.info("Created default user '{}'", DEFAULT_USER_NAME);

				                                    });

				                }

				        return auth::create_metadata_table_if_missing(

				                CREDENTIALS_CF,

				                _qp,

				                create_table,

				                _migration_manager).then([this] {

				            auth::delay_until_system_ready(_delayed, [this] {

				                return has_existing_users().then([this](bool existing) {

				                    if (!existing) {

				                        return _qp.process(

				                                sprint(

				                                        "INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",

				                                        meta::AUTH_KS,

				                                        CREDENTIALS_CF,

				                                        USER_NAME, SALTED_HASH),

				                                db::consistency_level::ONE,

				                                { DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD) }).then([](auto) {

				                            plogger.info("Created default user '{}'", DEFAULT_USER_NAME);

				                        });

				                    }

				                    return make_ready_future<>();

				                });

				            });

				        });

				    });

				}

				future<> auth::password_authenticator::stop() {

				    return make_ready_future<>();

				}

				db::consistency_level auth::password_authenticator::consistency_for_user(const sstring& username) {

				    if (username == DEFAULT_USER_NAME) {

				        return db::consistency_level::QUORUM;

				@@ -184,8 +208,8 @@ db::consistency_level auth::password_authenticator::consistency_for_user(const s

				    return db::consistency_level::LOCAL_ONE;

				}

				const sstring& auth::password_authenticator::class_name() const {

				    return PASSWORD_AUTHENTICATOR_NAME;

				const sstring& auth::password_authenticator::qualified_java_name() const {

				    return password_authenticator_name();

				}

				bool auth::password_authenticator::require_authentication() const {

				@@ -201,8 +225,7 @@ auth::authenticator::option_set auth::password_authenticator::alterable_options(

				}

				future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::authenticate(

				                const credentials_map& credentials) const

				                                throw (exceptions::authentication_exception) {

				                const credentials_map& credentials) const {

				    if (!credentials.count(USERNAME_KEY)) {

				        throw exceptions::authentication_exception(sprint("Required key '%s' is missing", USERNAME_KEY));

				    }

				@@ -218,12 +241,11 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au

				    // obsolete prepared statements pretty quickly.

				    // Rely on the query processor caching statements instead, and let's assume

				    // that a string->statement map lookup will not cost us much.

				    auto& qp = cql3::get_local_query_processor();

				    return qp.process(

				                    sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,

				                                    auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),

				                    consistency_for_user(username), { username }, true).then_wrapped(

				                    [=](future<::shared_ptr<cql3::untyped_result_set>> f) {

				    return futurize_apply([this, username, password] {

				        return _qp.process(sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,

				                                        meta::AUTH_KS, CREDENTIALS_CF, USER_NAME),

				                        consistency_for_user(username), {username}, true);

				    }).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {

				        try {

				            auto res = f.get0();

				            if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {

				@@ -234,62 +256,57 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au

				            std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));

				        } catch (exceptions::request_execution_exception& e) {

				            std::throw_with_nested(exceptions::authentication_exception(e.what()));

				        } catch (...) {

				            std::throw_with_nested(exceptions::authentication_exception("authentication failed"));

				        }

				    });

				}

				future<> auth::password_authenticator::create(sstring username,

				                const option_map& options)

				                                throw (exceptions::request_validation_exception,

				                                exceptions::request_execution_exception) {

				                const option_map& options) {

				    try {

				        auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));

				        auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",

				                        auth::AUTH_KS, CREDENTIALS_CF, USER_NAME, SALTED_HASH);

				        auto& qp = cql3::get_local_query_processor();

				        return qp.process(query, consistency_for_user(username), { username, hashpw(password) }).discard_result();

				                        meta::AUTH_KS, CREDENTIALS_CF, USER_NAME, SALTED_HASH);

				        return _qp.process(query, consistency_for_user(username), { username, hashpw(password) }).discard_result();

				    } catch (std::out_of_range&) {

				        throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");

				    }

				}

				future<> auth::password_authenticator::alter(sstring username,

				                const option_map& options)

				                                throw (exceptions::request_validation_exception,

				                                exceptions::request_execution_exception) {

				                const option_map& options) {

				    try {

				        auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));

				        auto query = sprint("UPDATE %s.%s SET %s = ? WHERE %s = ?",

				                        auth::AUTH_KS, CREDENTIALS_CF, SALTED_HASH, USER_NAME);

				        auto& qp = cql3::get_local_query_processor();

				        return qp.process(query, consistency_for_user(username), { hashpw(password), username }).discard_result();

				                        meta::AUTH_KS, CREDENTIALS_CF, SALTED_HASH, USER_NAME);

				        return _qp.process(query, consistency_for_user(username), { hashpw(password), username }).discard_result();

				    } catch (std::out_of_range&) {

				        throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");

				    }

				}

				future<> auth::password_authenticator::drop(sstring username)

				                throw (exceptions::request_validation_exception,

				                exceptions::request_execution_exception) {

				future<> auth::password_authenticator::drop(sstring username) {

				    try {

				        auto query = sprint("DELETE FROM %s.%s WHERE %s = ?",

				                        auth::AUTH_KS, CREDENTIALS_CF, USER_NAME);

				        auto& qp = cql3::get_local_query_processor();

				        return qp.process(query, consistency_for_user(username), { username }).discard_result();

				                        meta::AUTH_KS, CREDENTIALS_CF, USER_NAME);

				        return _qp.process(query, consistency_for_user(username), { username }).discard_result();

				    } catch (std::out_of_range&) {

				        throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");

				    }

				}

				auth::authenticator::resource_ids auth::password_authenticator::protected_resources() const {

				    return { data_resource(auth::AUTH_KS, CREDENTIALS_CF) };

				const auth::resource_ids& auth::password_authenticator::protected_resources() const {

				    static const resource_ids ids({ data_resource(meta::AUTH_KS, CREDENTIALS_CF) });

				    return ids;

				}

				::shared_ptr<auth::authenticator::sasl_challenge> auth::password_authenticator::new_sasl_challenge() const {

				    class plain_text_password_challenge: public sasl_challenge {

				        const password_authenticator& _self;

				    public:

				        plain_text_password_challenge(const password_authenticator& a)

				                        : _authenticator(a)

				        plain_text_password_challenge(const password_authenticator& self) : _self(self)

				        {}

				        /**

				@@ -305,9 +322,8 @@ auth::authenticator::resource_ids auth::password_authenticator::protected_resour

				         * would expect

				         * @throws javax.security.sasl.SaslException

				         */

				        bytes evaluate_response(bytes_view client_response)

				                        throw (exceptions::authentication_exception) override {

				            logger.debug("Decoding credentials from client token");

				        bytes evaluate_response(bytes_view client_response) override {

				            plogger.debug("Decoding credentials from client token");

				            sstring username, password;

				@@ -344,14 +360,59 @@ auth::authenticator::resource_ids auth::password_authenticator::protected_resour

				        bool is_complete() const override {

				            return _complete;

				        }

				        future<::shared_ptr<authenticated_user>> get_authenticated_user() const

				                        throw (exceptions::authentication_exception) override {

				            return _authenticator.authenticate(_credentials);

				        future<::shared_ptr<authenticated_user>> get_authenticated_user() const override {

				            return _self.authenticate(_credentials);

				        }

				    private:

				        const password_authenticator& _authenticator;

				        credentials_map _credentials;

				        bool _complete = false;

				    };

				    return ::make_shared<plain_text_password_challenge>(*this);

				}

				//

				// Similar in structure to `auth::service::has_existing_users()`, but trying to generalize the pattern breaks all kinds

				// of module boundaries and leaks implementation details.

				//

				future<bool> auth::password_authenticator::has_existing_users() const {

				    static const sstring default_user_query = sprint(

				            "SELECT * FROM %s.%s WHERE %s = ?",

				            meta::AUTH_KS,

				            CREDENTIALS_CF,

				            USER_NAME);

				    static const sstring all_users_query = sprint(

				            "SELECT * FROM %s.%s LIMIT 1",

				            meta::AUTH_KS,

				            CREDENTIALS_CF);

				    // This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we

				    // can potentially avoid doing a range query with a high consistency level.

				    return _qp.process(

				            default_user_query,

				            db::consistency_level::ONE,

				            { meta::DEFAULT_SUPERUSER_NAME },

				            true).then([this](auto results) {

				        if (!results->empty()) {

				            return make_ready_future<bool>(true);

				        }

				        return _qp.process(

				                default_user_query,

				                db::consistency_level::QUORUM,

				                { meta::DEFAULT_SUPERUSER_NAME },

				                true).then([this](auto results) {

				            if (!results->empty()) {

				                return make_ready_future<bool>(true);

				            }

				            return _qp.process(

				                    all_users_query,

				                    db::consistency_level::QUORUM).then([](auto results) {

				                return make_ready_future<bool>(!results->empty());

				            });

				        });

				    });

				}
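The escalation order used by `has_existing_users()` above (a point lookup at `ONE`, the same lookup at `QUORUM`, then a `LIMIT 1` scan at `QUORUM`) can be sketched in isolation. This is a minimal synchronous sketch, not the code from this commit: `user_store`, `consistency`, and `make_recording_store` are hypothetical stand-ins for the query processor and `db::consistency_level`, and the real function chains Seastar futures instead of returning `bool` directly.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for db::consistency_level.
enum class consistency { one, quorum };

// Hypothetical stand-ins for the two queries: a point lookup of one user name
// and a "LIMIT 1" scan of the whole table, each run at a given consistency.
struct user_store {
    std::function<bool(const std::string&, consistency)> has_user;
    std::function<bool(consistency)> has_any_user;
};

// Cheapest check first; escalate only when the previous step found nothing.
inline bool has_existing_users(const user_store& s, const std::string& default_user) {
    if (s.has_user(default_user, consistency::one)) {
        return true;
    }
    if (s.has_user(default_user, consistency::quorum)) {
        return true;
    }
    return s.has_any_user(consistency::quorum);
}

// Builds a store that appends the name of each step it runs to `trace`
// (illustration only). `trace` must outlive the returned store.
inline user_store make_recording_store(bool at_one, bool at_quorum, bool any,
                                       std::vector<std::string>& trace) {
    user_store s;
    s.has_user = [=, &trace](const std::string&, consistency c) {
        trace.push_back(c == consistency::one ? "point@ONE" : "point@QUORUM");
        return c == consistency::one ? at_one : at_quorum;
    };
    s.has_any_user = [=, &trace](consistency) {
        trace.push_back("scan@QUORUM");
        return any;
    };
    return s;
}
```

When the default user is visible at `ONE`, only the first step runs; the range scan is reached only after both point lookups come back empty.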

									
										43

auth/password_authenticator.hh
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -42,31 +42,48 @@

				#pragma once

				#include "authenticator.hh"

				#include "cql3/query_processor.hh"

				#include "delayed_tasks.hh"

				namespace service {

				class migration_manager;

				}

				namespace auth {

				class password_authenticator : public authenticator {

				public:

				    static const sstring PASSWORD_AUTHENTICATOR_NAME;

				const sstring& password_authenticator_name();

				    password_authenticator();

				class password_authenticator : public authenticator {

				    cql3::query_processor& _qp;

				    ::service::migration_manager& _migration_manager;

				    delayed_tasks<> _delayed{};

				public:

				    password_authenticator(cql3::query_processor&, ::service::migration_manager&);

				    ~password_authenticator();

				    future<> init();

				    future<> start() override;

				    const sstring& class_name() const override;

				    future<> stop() override;

				    const sstring& qualified_java_name() const override;

				    bool require_authentication() const override;

				    option_set supported_options() const override;

				    option_set alterable_options() const override;

				    future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) override;

				    future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;

				    future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;

				    future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;

				    resource_ids protected_resources() const override;

				    future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override;

				    future<> create(sstring username, const option_map& options) override;

				    future<> alter(sstring username, const option_map& options) override;

				    future<> drop(sstring username) override;

				    const resource_ids& protected_resources() const override;

				    ::shared_ptr<sasl_challenge> new_sasl_challenge() const override;

				    static db::consistency_level consistency_for_user(const sstring& username);

				private:

				    future<bool> has_existing_users() const;

				};

				}

									
										71

auth/permission.cc
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -39,11 +39,66 @@

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include <unordered_map>

				#include <boost/algorithm/string.hpp>

				#include "permission.hh"

				const auth::permission_set auth::ALL_DATA = auth::permission_set::of

				                < auth::permission::CREATE, auth::permission::ALTER,

				                auth::permission::DROP, auth::permission::SELECT,

				                auth::permission::MODIFY, auth::permission::AUTHORIZE>();

				const auth::permission_set auth::ALL = auth::ALL_DATA;

				const auth::permission_set auth::NONE;

				const auth::permission_set auth::permissions::ALL_DATA =

				                auth::permission_set::of<auth::permission::CREATE,

				                                auth::permission::ALTER, auth::permission::DROP,

				                                auth::permission::SELECT,

				                                auth::permission::MODIFY,

				                                auth::permission::AUTHORIZE>();

				const auth::permission_set auth::permissions::ALL = auth::permissions::ALL_DATA;

				const auth::permission_set auth::permissions::NONE;

				const auth::permission_set auth::permissions::ALTERATIONS =

				                auth::permission_set::of<auth::permission::CREATE,

				                                auth::permission::ALTER, auth::permission::DROP>();

				static const std::unordered_map<sstring, auth::permission> permission_names({

				    { "READ", auth::permission::READ },

				    { "WRITE", auth::permission::WRITE  },

				    { "CREATE", auth::permission::CREATE },

				    { "ALTER", auth::permission::ALTER },

				    { "DROP", auth::permission::DROP },

				    { "SELECT", auth::permission::SELECT  },

				    { "MODIFY", auth::permission::MODIFY   },

				    { "AUTHORIZE", auth::permission::AUTHORIZE },

				});

				const sstring& auth::permissions::to_string(permission p) {

				    for (auto& v : permission_names) {

				        if (v.second == p) {

				            return v.first;

				        }

				    }

				    throw std::out_of_range("unknown permission");

				}

				auth::permission auth::permissions::from_string(const sstring& s) {

				    sstring upper(s);

				    boost::to_upper(upper);

				    return permission_names.at(upper);

				}

				std::unordered_set<sstring> auth::permissions::to_strings(const permission_set& set) {

				    std::unordered_set<sstring> res;

				    for (auto& v : permission_names) {

				        if (set.contains(v.second)) {

				            res.emplace(v.first);

				        }

				    }

				    return res;

				}

				auth::permission_set auth::permissions::from_strings(const std::unordered_set<sstring>& set) {

				    permission_set res = auth::permissions::NONE;

				    for (auto& s : set) {

				        res.set(from_string(s));

				    }

				    return res;

				}

				bool auth::operator<(const permission_set& p1, const permission_set& p2) {

				    return p1.mask() < p2.mask();

				}
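The name mapping added in `auth/permission.cc` can be exercised standalone. The following is a sketch under assumptions, not the file's code: it uses `std::string` in place of `sstring`, replaces `boost::to_upper` with `std::toupper`, and wraps the table in a hypothetical `permission_names()` accessor so the snippet has no global initialization order concerns.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class permission { READ, WRITE, CREATE, ALTER, DROP, SELECT, MODIFY, AUTHORIZE };

// Single table used for both directions of the conversion.
inline const std::unordered_map<std::string, permission>& permission_names() {
    static const std::unordered_map<std::string, permission> names{
        {"READ", permission::READ},     {"WRITE", permission::WRITE},
        {"CREATE", permission::CREATE}, {"ALTER", permission::ALTER},
        {"DROP", permission::DROP},     {"SELECT", permission::SELECT},
        {"MODIFY", permission::MODIFY}, {"AUTHORIZE", permission::AUTHORIZE},
    };
    return names;
}

// Reverse lookup is a linear scan, which is acceptable for eight entries.
inline const std::string& to_string(permission p) {
    for (const auto& v : permission_names()) {
        if (v.second == p) {
            return v.first;
        }
    }
    throw std::out_of_range("unknown permission");
}

// Case-insensitive: the input is upper-cased before the lookup.
// Unknown names throw std::out_of_range from unordered_map::at.
inline permission from_string(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return permission_names().at(s);
}
```

Keeping one table for both directions means a new permission only has to be added in one place, at the cost of a linear scan in `to_string`.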

									
										22

auth/permission.hh
									
												
				@@ -17,9 +17,9 @@

				 */

				/*

				 * Copyright 2016 Cloudius Systems

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by Cloudius Systems

				 * Modified by ScyllaDB

				 */

				/*

				@@ -41,6 +41,10 @@

				#pragma once

				#include <unordered_set>

				#include <seastar/core/sstring.hh>

				#include "seastarx.hh"

				#include "enum_set.hh"

				namespace auth {

				@@ -74,8 +78,22 @@ typedef enum_set<super_enum<permission,

				                permission::MODIFY,

				                permission::AUTHORIZE>> permission_set;

				bool operator<(const permission_set&, const permission_set&);

				namespace permissions {

				extern const permission_set ALL_DATA;

				extern const permission_set ALL;

				extern const permission_set NONE;

				extern const permission_set ALTERATIONS;

				const sstring& to_string(permission);

				permission from_string(const sstring&);

				std::unordered_set<sstring> to_strings(const permission_set&);

				permission_set from_strings(const std::unordered_set<sstring>&);

				}

				}

									
										51

auth/permissions_cache.cc
									
										Normal file
									
												
				@@ -0,0 +1,51 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "auth/permissions_cache.hh"

				#include "auth/authorizer.hh"

				#include "auth/common.hh"

				#include "auth/service.hh"

				#include "db/config.hh"

				namespace auth {

				permissions_cache_config permissions_cache_config::from_db_config(const db::config& dc) {

				    permissions_cache_config c;

				    c.max_entries = dc.permissions_cache_max_entries();

				    c.validity_period = std::chrono::milliseconds(dc.permissions_validity_in_ms());

				    c.update_period = std::chrono::milliseconds(dc.permissions_update_interval_in_ms());

				    return c;

				}

				permissions_cache::permissions_cache(const permissions_cache_config& c, service& ser, logging::logger& log)

				        : _cache(c.max_entries, c.validity_period, c.update_period, log, [&ser, &log](const key_type& k) {

				              log.debug("Refreshing permissions for {}", k.first.name());

				              return ser.underlying_authorizer().authorize(ser, ::make_shared<authenticated_user>(k.first), k.second);

				          }) {

				}

				future<permission_set> permissions_cache::get(::shared_ptr<authenticated_user> user, data_resource r) {

				    return _cache.get(key_type(*user, r));

				}

				}
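The idea behind `permissions_cache` (a cache keyed by user and resource that calls a loader on a miss or after a validity period) can be illustrated with a much simpler structure. This sketch is an assumption-laden toy, not `utils::loading_cache`: `simple_loading_cache` is a hypothetical name, it is synchronous, unbounded, has no background refresh, and stores an `int` where the real cache stores a `permission_set` behind a future.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <utility>

using clock_type = std::chrono::steady_clock;
using key_type = std::pair<std::string, std::string>; // (user name, resource name)

class simple_loading_cache {
    struct entry {
        int value;
        clock_type::time_point loaded_at;
    };
    std::map<key_type, entry> _entries;
    std::chrono::milliseconds _validity;
    std::function<int(const key_type&)> _load;
public:
    simple_loading_cache(std::chrono::milliseconds validity,
                         std::function<int(const key_type&)> load)
        : _validity(validity), _load(std::move(load)) {}

    // Returns the cached value, calling the loader when the key is missing
    // or the stored entry is at least `_validity` old. `now` is a parameter
    // so that expiry can be demonstrated without sleeping.
    int get(const key_type& k, clock_type::time_point now = clock_type::now()) {
        auto it = _entries.find(k);
        if (it != _entries.end() && now - it->second.loaded_at < _validity) {
            return it->second.value;
        }
        entry e{_load(k), now};
        _entries[k] = e;
        return e.value;
    }
};
```

A caller constructs the cache with a loader, for example a lambda that asks an authorizer, and then calls `get` with a `(user, resource)` pair; repeated calls inside the validity window do not reach the loader.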

									
										99

auth/permissions_cache.hh
									
										Normal file
									
												
				@@ -0,0 +1,99 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <chrono>

				#include <functional>

				#include <iostream>

				#include <utility>

				#include <seastar/core/future.hh>

				#include <seastar/core/shared_ptr.hh>

				#include "auth/authenticated_user.hh"

				#include "auth/data_resource.hh"

				#include "auth/permission.hh"

				#include "log.hh"

				#include "utils/loading_cache.hh"

				namespace std {

				template <>

				struct hash<auth::data_resource> final {

				    size_t operator()(const auth::data_resource & v) const {

				        return v.hash_value();

				    }

				};

				template <>

				struct hash<auth::authenticated_user> final {

				    size_t operator()(const auth::authenticated_user & v) const {

				        return utils::tuple_hash()(v.name(), v.is_anonymous());

				    }

				};

				inline std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p) {

				    os << "{user: " << p.first.name() << ", data_resource: " << p.second << "}";

				    return os;

				}

				}

				namespace db {

				class config;

				}

				namespace auth {

				class service;

				struct permissions_cache_config final {

				    static permissions_cache_config from_db_config(const db::config&);

				    std::size_t max_entries;

				    std::chrono::milliseconds validity_period;

				    std::chrono::milliseconds update_period;

				};

				class permissions_cache final {

				    using cache_type = utils::loading_cache<

				            std::pair<authenticated_user, data_resource>,

				            permission_set,

				            utils::loading_cache_reload_enabled::yes,

				            utils::simple_entry_size<permission_set>,

				            utils::tuple_hash>;

				    using key_type = typename cache_type::key_type;

				    cache_type _cache;

				public:

				    explicit permissions_cache(const permissions_cache_config&, service&, logging::logger&);

				    future <> stop() {

				        return _cache.stop();

				    }

				    future<permission_set> get(::shared_ptr<authenticated_user>, data_resource);

				};

				}

									
										355

auth/service.cc
									
										Normal file
									
												
				@@ -0,0 +1,355 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "auth/service.hh"

				#include <map>

				#include <seastar/core/future-util.hh>

				#include <seastar/core/shared_ptr.hh>

				#include "auth/allow_all_authenticator.hh"

				#include "auth/allow_all_authorizer.hh"

				#include "auth/common.hh"

				#include "cql3/query_processor.hh"

				#include "cql3/untyped_result_set.hh"

				#include "db/config.hh"

				#include "db/consistency_level.hh"

				#include "exceptions/exceptions.hh"

				#include "log.hh"

				#include "service/migration_listener.hh"

				#include "utils/class_registrator.hh"

				namespace auth {

				namespace meta {

				static const sstring user_name_col_name("name");

				static const sstring superuser_col_name("super");

				}

				static logging::logger log("auth_service");

				class auth_migration_listener final : public ::service::migration_listener {

				    authorizer& _authorizer;

				public:

				    explicit auth_migration_listener(authorizer& a) : _authorizer(a) {

				    }

				private:

				    void on_create_keyspace(const sstring& ks_name) override {}

				    void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}

				    void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}

				    void on_create_function(const sstring& ks_name, const sstring& function_name) override {}

				    void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}

				    void on_create_view(const sstring& ks_name, const sstring& view_name) override {}

				    void on_update_keyspace(const sstring& ks_name) override {}

				    void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}

				    void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}

				    void on_update_function(const sstring& ks_name, const sstring& function_name) override {}

				    void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}

				    void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}

				    void on_drop_keyspace(const sstring& ks_name) override {

				        _authorizer.revoke_all(auth::data_resource(ks_name));

				    }

				    void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {

				        _authorizer.revoke_all(auth::data_resource(ks_name, cf_name));

				    }

				    void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}

				    void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}

				    void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}

				    void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}

				};

				static db::consistency_level consistency_for_user(const sstring& name) {

				    if (name == meta::DEFAULT_SUPERUSER_NAME) {

				        return db::consistency_level::QUORUM;

				    } else {

				        return db::consistency_level::LOCAL_ONE;

				    }

				}

				static future<::shared_ptr<cql3::untyped_result_set>> select_user(cql3::query_processor& qp, const sstring& name) {

				    // There used to be an explicit thread-local cache of prepared statements here. In normal execution that is

				    // fine, but since in testing we set up and tear down the system over and over, we would start using

				    // obsolete prepared statements pretty quickly.

				    // Rely on the query processor caching statements instead, and let's assume

				    // that a string->statement map lookup will not cost us much.

				    return qp.process(

				            sprint(

				                    "SELECT * FROM %s.%s WHERE %s = ?",

				                    meta::AUTH_KS,

				                    meta::USERS_CF,

				                    meta::user_name_col_name),

				            consistency_for_user(name),

				            { name },

				            true);

				}

				service_config service_config::from_db_config(const db::config& dc) {

				    const qualified_name qualified_authorizer_name(meta::AUTH_PACKAGE_NAME, dc.authorizer());

				    const qualified_name qualified_authenticator_name(meta::AUTH_PACKAGE_NAME, dc.authenticator());

				    service_config c;

				    c.authorizer_java_name = qualified_authorizer_name;

				    c.authenticator_java_name = qualified_authenticator_name;

				    return c;

				}

				service::service(

				        permissions_cache_config c,

				        cql3::query_processor& qp,

				        ::service::migration_manager& mm,

				        std::unique_ptr<authorizer> a,

				        std::unique_ptr<authenticator> b)

				            : _permissions_cache_config(std::move(c))

				            , _permissions_cache(nullptr)

				            , _qp(qp)

				            , _migration_manager(mm)

				            , _authorizer(std::move(a))

				            , _authenticator(std::move(b))

				            , _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {

				}

				service::service(

				        permissions_cache_config cache_config,

				        cql3::query_processor& qp,

				        ::service::migration_manager& mm,

				        const service_config& sc)

				            : service(

				                      std::move(cache_config),

				                      qp,

				                      mm,

				                      create_object<authorizer>(sc.authorizer_java_name, qp, mm),

				                      create_object<authenticator>(sc.authenticator_java_name, qp, mm)) {

				}

				bool service::should_create_metadata() const {

				    const bool null_authorizer = _authorizer->qualified_java_name() == allow_all_authorizer_name();

				    const bool null_authenticator = _authenticator->qualified_java_name() == allow_all_authenticator_name();

				    return !null_authorizer || !null_authenticator;

				}

				future<> service::create_metadata_if_missing() {

				    auto& db = _qp.db().local();

				    auto f = make_ready_future<>();

				    if (!db.has_keyspace(meta::AUTH_KS)) {

				        std::map<sstring, sstring> opts{{"replication_factor", "1"}};

				        auto ksm = keyspace_metadata::new_keyspace(

				                meta::AUTH_KS,

				                "org.apache.cassandra.locator.SimpleStrategy",

				                opts,

				                true);

				        // We use min_timestamp so that the default keyspace metadata will lose to any manual adjustments.

				        // See issue #2129.

				        f = _migration_manager.announce_new_keyspace(ksm, api::min_timestamp, false);

				    }

				    return f.then([this] {

				        // 3 months.

				        static const auto gc_grace_seconds = 90 * 24 * 60 * 60;

				        static const sstring users_table_query = sprint(

				                "CREATE TABLE %s.%s (%s text, %s boolean, PRIMARY KEY (%s)) WITH gc_grace_seconds=%s",

				                meta::AUTH_KS,

				                meta::USERS_CF,

				                meta::user_name_col_name,

				                meta::superuser_col_name,

				                meta::user_name_col_name,

				                gc_grace_seconds);

				        return create_metadata_table_if_missing(

				                meta::USERS_CF,

				                _qp,

				                users_table_query,

				                _migration_manager);

				    }).then([this] {

				        delay_until_system_ready(_delayed, [this] {

				            return has_existing_users().then([this](bool existing) {

				                if (!existing) {

				                    //

				                    // Create default superuser.

				                    //

				                    static const sstring query = sprint(

				                            "INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",

				                            meta::AUTH_KS,

				                            meta::USERS_CF,

				                            meta::user_name_col_name,

				                            meta::superuser_col_name);

				                    return _qp.process(

				                            query,

				                            db::consistency_level::ONE,

				                            { meta::DEFAULT_SUPERUSER_NAME, true }).then([](auto&&) {

				                        log.info("Created default superuser '{}'", meta::DEFAULT_SUPERUSER_NAME);

				                    }).handle_exception([](auto exn) {

				                        try {

				                            std::rethrow_exception(exn);

				                        } catch (const exceptions::request_execution_exception&) {

				                            log.warn("Skipped default superuser setup: some nodes were not ready");

				                        }

				                    }).discard_result();

				                }

				                return make_ready_future<>();

				            });

				        });

				        return make_ready_future<>();

				    });

				}

				future<> service::start() {

				    return once_among_shards([this] {

				        if (should_create_metadata()) {

				            return create_metadata_if_missing();

				        }

				        return make_ready_future<>();

				    }).then([this] {

				        return when_all_succeed(_authorizer->start(), _authenticator->start());

				    }).then([this] {

				        _permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);

				    }).then([this] {

				        return once_among_shards([this] {

				            _migration_manager.register_listener(_migration_listener.get());

				            return make_ready_future<>();

				        });

				    });

				}

				future<> service::stop() {

				    return once_among_shards([this] {

				        _delayed.cancel_all();

				        return make_ready_future<>();

				    }).then([this] {

				        return _permissions_cache->stop();

				    }).then([this] {

				        return when_all_succeed(_authorizer->stop(), _authenticator->stop());

				    });

				}

				future<bool> service::has_existing_users() const {

				    static const sstring default_user_query = sprint(

				            "SELECT * FROM %s.%s WHERE %s = ?",

				            meta::AUTH_KS,

				            meta::USERS_CF,

				            meta::user_name_col_name);

				    static const sstring all_users_query = sprint(

				            "SELECT * FROM %s.%s LIMIT 1",

				            meta::AUTH_KS,

				            meta::USERS_CF);

				    // This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we

				    // can potentially avoid doing a range query with a high consistency level.

				    return _qp.process(

				            default_user_query,

				            db::consistency_level::ONE,

				            { meta::DEFAULT_SUPERUSER_NAME },

				            true).then([this](auto results) {

				        if (!results->empty()) {

				            return make_ready_future<bool>(true);

				        }

				        return _qp.process(

				                default_user_query,

				                db::consistency_level::QUORUM,

				                { meta::DEFAULT_SUPERUSER_NAME },

				                true).then([this](auto results) {

				            if (!results->empty()) {

				                return make_ready_future<bool>(true);

				            }

				            return _qp.process(

				                    all_users_query,

				                    db::consistency_level::QUORUM).then([](auto results) {

				                return make_ready_future<bool>(!results->empty());

				            });

				        });

				    });

				}

				future<bool> service::is_existing_user(const sstring& name) const {

				    return select_user(_qp, name).then([](auto results) {

				        return !results->empty();

				    });

				}

				future<bool> service::is_super_user(const sstring& name) const {

				    return select_user(_qp, name).then([](auto results) {

				        return !results->empty() && results->one().template get_as<bool>(meta::superuser_col_name);

				    });

				}

				future<> service::insert_user(const sstring& name, bool is_superuser) {

				    return _qp.process(

				            sprint(

				                    "INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",

				                    meta::AUTH_KS,

				                    meta::USERS_CF,

				                    meta::user_name_col_name,

				                    meta::superuser_col_name),

				            consistency_for_user(name),

				            { name, is_superuser }).discard_result();

				}

				future<> service::delete_user(const sstring& name) {

				    return _qp.process(

				            sprint(

				                    "DELETE FROM %s.%s WHERE %s = ?",

				                    meta::AUTH_KS,

				                    meta::USERS_CF,

				                    meta::user_name_col_name),

				            consistency_for_user(name),

				            { name }).discard_result();

				}

				future<permission_set> service::get_permissions(::shared_ptr<authenticated_user> u, data_resource r) const {

				    return _permissions_cache->get(std::move(u), std::move(r));

				}

				//

				// Free functions.

				//

				future<bool> is_super_user(const service& ser, const authenticated_user& u) {

				    if (u.is_anonymous()) {

				        return make_ready_future<bool>(false);

				    }

				    return ser.is_super_user(u.name());

				}

				}
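The lookup order in `has_existing_users()` (default user at `ONE`, then default user at `QUORUM`, then a range query at `QUORUM`) can be sketched without Seastar. This is a minimal synchronous illustration, not the code from the commit: `consistency`, `user_queries`, and the two callbacks are hypothetical stand-ins for the `_qp.process(...)` calls.

```cpp
#include <functional>

// Hypothetical stand-in for db::consistency_level.
enum class consistency { one, quorum };

// Hypothetical stand-ins for the two CQL queries issued through the query processor.
struct user_queries {
    std::function<bool(consistency)> default_user_exists; // point query on the default superuser
    std::function<bool(consistency)> any_user_exists;     // range query with LIMIT 1
};

// Try the cheapest check first and escalate only when it finds nothing.
bool has_existing_users(const user_queries& q) {
    if (q.default_user_exists(consistency::one)) {
        return true;
    }
    if (q.default_user_exists(consistency::quorum)) {
        return true;
    }
    return q.any_user_exists(consistency::quorum);
}
```

When the default superuser is visible at `ONE`, only a single point query runs; the range query is reached only if both point queries come back empty.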

									
										133

auth/service.hh
									
										Normal file
									
				@@ -0,0 +1,133 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <memory>

				#include <seastar/core/future.hh>

				#include <seastar/core/sstring.hh>

				#include "auth/authenticator.hh"

				#include "auth/authorizer.hh"

				#include "auth/authenticated_user.hh"

				#include "auth/permission.hh"

				#include "auth/permissions_cache.hh"

				#include "delayed_tasks.hh"

				#include "seastarx.hh"

				namespace cql3 {

				class query_processor;

				}

				namespace db {

				class config;

				}

				namespace service {

				class migration_manager;

				class migration_listener;

				}

				namespace auth {

				class authenticator;

				class authorizer;

				struct service_config final {

				    static service_config from_db_config(const db::config&);

				    sstring authorizer_java_name;

				    sstring authenticator_java_name;

				};

				class service final {

				    permissions_cache_config _permissions_cache_config;

				    std::unique_ptr<permissions_cache> _permissions_cache;

				    cql3::query_processor& _qp;

				    ::service::migration_manager& _migration_manager;

				    std::unique_ptr<authorizer> _authorizer;

				    std::unique_ptr<authenticator> _authenticator;

				    // Only one of these should be registered, so we end up with some unused instances. Not the end of the world.

				    std::unique_ptr<::service::migration_listener> _migration_listener;

				    delayed_tasks<> _delayed{};

				public:

				    service(

				            permissions_cache_config,

				            cql3::query_processor&,

				            ::service::migration_manager&,

				            std::unique_ptr<authorizer>,

				            std::unique_ptr<authenticator>);

				    service(

				            permissions_cache_config,

				            cql3::query_processor&,

				            ::service::migration_manager&,

				            const service_config&);

				    future<> start();

				    future<> stop();

				    future<bool> is_existing_user(const sstring& name) const;

				    future<bool> is_super_user(const sstring& name) const;

				    future<> insert_user(const sstring& name, bool is_superuser);

				    future<> delete_user(const sstring& name);

				    future<permission_set> get_permissions(::shared_ptr<authenticated_user>, data_resource) const;

				    authenticator& underlying_authenticator() {

				        return *_authenticator;

				    }

				    const authenticator& underlying_authenticator() const {

				        return *_authenticator;

				    }

				    authorizer& underlying_authorizer() {

				        return *_authorizer;

				    }

				    const authorizer& underlying_authorizer() const {

				        return *_authorizer;

				    }

				private:

				    future<bool> has_existing_users() const;

				    bool should_create_metadata() const;

				    future<> create_metadata_if_missing();

				};

				future<bool> is_super_user(const service&, const authenticated_user&);

				}

									
										232

auth/transitional.cc
									
										Normal file
									
				@@ -0,0 +1,232 @@

				/*

				 * Licensed to the Apache Software Foundation (ASF) under one

				 * or more contributor license agreements.  See the NOTICE file

				 * distributed with this work for additional information

				 * regarding copyright ownership.  The ASF licenses this file

				 * to you under the Apache License, Version 2.0 (the

				 * "License"); you may not use this file except in compliance

				 * with the License.  You may obtain a copy of the License at

				 *

				 *     http://www.apache.org/licenses/LICENSE-2.0

				 *

				 * Unless required by applicable law or agreed to in writing, software

				 * distributed under the License is distributed on an "AS IS" BASIS,

				 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

				 * See the License for the specific language governing permissions and

				 * limitations under the License.

				 */

				/*

				 * Copyright (C) 2017 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#include "authenticator.hh"

				#include "authenticated_user.hh"

				#include "authenticator.hh"

				#include "authorizer.hh"

				#include "password_authenticator.hh"

				#include "default_authorizer.hh"

				#include "permission.hh"

				#include "db/config.hh"

				#include "utils/class_registrator.hh"

				namespace auth {

				class service;

				static const sstring PACKAGE_NAME("com.scylladb.auth.");

				static const sstring& transitional_authenticator_name() {

				    static const sstring name = PACKAGE_NAME + "TransitionalAuthenticator";

				    return name;

				}

				static const sstring& transitional_authorizer_name() {

				    static const sstring name = PACKAGE_NAME + "TransitionalAuthorizer";

				    return name;

				}

				class transitional_authenticator : public authenticator {

				    std::unique_ptr<authenticator> _authenticator;

				public:

				    static const sstring PASSWORD_AUTHENTICATOR_NAME;

				    transitional_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)

				            : transitional_authenticator(std::make_unique<password_authenticator>(qp, mm))

				    {}

				    transitional_authenticator(std::unique_ptr<authenticator> a)

				        : _authenticator(std::move(a))

				    {}

				    future<> start() override {

				        return _authenticator->start();

				    }

				    future<> stop() override {

				        return _authenticator->stop();

				    }

				    const sstring& qualified_java_name() const override {

				        return transitional_authenticator_name();

				    }

				    bool require_authentication() const override {

				        return true;

				    }

				    option_set supported_options() const override {

				        return _authenticator->supported_options();

				    }

				    option_set alterable_options() const override {

				        return _authenticator->alterable_options();

				    }

				    future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {

				        auto i = credentials.find(authenticator::USERNAME_KEY);

				        if ((i == credentials.end() || i->second.empty()) && (!credentials.count(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {

				            // return anon user

				            return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());

				        }

				        return make_ready_future().then([this, &credentials] {

				            return _authenticator->authenticate(credentials);

				        }).handle_exception([](auto ep) {

				            try {

				                std::rethrow_exception(ep);

				            } catch (exceptions::authentication_exception&) {

				                // return anon user

				                return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());

				            }

				        });

				    }

				    future<> create(sstring username, const option_map& options) override {

				        return _authenticator->create(username, options);

				    }

				    future<> alter(sstring username, const option_map& options) override {

				        return _authenticator->alter(username, options);

				    }

				    future<> drop(sstring username) override {

				        return _authenticator->drop(username);

				    }

				    const resource_ids& protected_resources() const override {

				        return _authenticator->protected_resources();

				    }

				    ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {

				        class sasl_wrapper : public sasl_challenge {

				        public:

				            sasl_wrapper(::shared_ptr<sasl_challenge> sasl)

				                : _sasl(std::move(sasl))

				            {}

				            bytes evaluate_response(bytes_view client_response) override {

				                try {

				                    return _sasl->evaluate_response(client_response);

				                } catch (exceptions::authentication_exception&) {

				                    _complete = true;

				                    return {};

				                }

				            }

				            bool is_complete() const {

				                return _complete || _sasl->is_complete();

				            }

				            future<::shared_ptr<authenticated_user>> get_authenticated_user() const {

				                return futurize_apply([this] {

				                    return _sasl->get_authenticated_user().handle_exception([](auto ep) {

				                        try {

				                            std::rethrow_exception(ep);

				                        } catch (exceptions::authentication_exception&) {

				                            // return anon user

				                            return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());

				                        }

				                    });

				                });

				            }

				        private:

				            ::shared_ptr<sasl_challenge> _sasl;

				            bool _complete = false;

				        };

				        return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());

				    }

				};

				class transitional_authorizer : public authorizer {

				    std::unique_ptr<authorizer> _authorizer;

				public:

				    transitional_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)

				        : transitional_authorizer(std::make_unique<default_authorizer>(qp, mm))

				    {}

				    transitional_authorizer(std::unique_ptr<authorizer> a)

				        : _authorizer(std::move(a))

				    {}

				    ~transitional_authorizer()

				    {}

				    future<> start() override {

				        return _authorizer->start();

				    }

				    future<> stop() override {

				        return _authorizer->stop();

				    }

				    const sstring& qualified_java_name() const override {

				        return transitional_authorizer_name();

				    }

				    future<permission_set> authorize(service& ser, ::shared_ptr<authenticated_user> user, data_resource resource) const override {

				        return is_super_user(ser, *user).then([](bool s) {

				            static const permission_set transitional_permissions =

				                            permission_set::of<permission::CREATE,

				                                            permission::ALTER, permission::DROP,

				                                            permission::SELECT, permission::MODIFY>();

				            return make_ready_future<permission_set>(s ? permissions::ALL : transitional_permissions);

				        });

				    }

				    future<> grant(::shared_ptr<authenticated_user> user, permission_set ps, data_resource r, sstring s) override {

				        return _authorizer->grant(std::move(user), std::move(ps), std::move(r), std::move(s));

				    }

				    future<> revoke(::shared_ptr<authenticated_user> user, permission_set ps, data_resource r, sstring s) override {

				        return _authorizer->revoke(std::move(user), std::move(ps), std::move(r), std::move(s));

				    }

				    future<std::vector<permission_details>> list(service& ser, ::shared_ptr<authenticated_user> user, permission_set ps, optional<data_resource> r, optional<sstring> s) const override {

				        return _authorizer->list(ser, std::move(user), std::move(ps), std::move(r), std::move(s));

				    }

				    future<> revoke_all(sstring s) override {

				        return _authorizer->revoke_all(std::move(s));

				    }

				    future<> revoke_all(data_resource r) override {

				        return _authorizer->revoke_all(std::move(r));

				    }

				    const resource_ids& protected_resources() override {

				        return _authorizer->protected_resources();

				    }

				    future<> validate_configuration() const override {

				        return _authorizer->validate_configuration();

				    }

				};

				}

				//

				// To ensure correct initialization order, we unfortunately need to use string literals.

				//

				static const class_registrator<

				        auth::authenticator,

				        auth::transitional_authenticator,

				        cql3::query_processor&,

				        ::service::migration_manager&> transitional_authenticator_reg("com.scylladb.auth.TransitionalAuthenticator");

				static const class_registrator<

				        auth::authorizer,

				        auth::transitional_authorizer,

				        cql3::query_processor&,

				        ::service::migration_manager&> transitional_authorizer_reg("com.scylladb.auth.TransitionalAuthorizer");
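The fallback in `transitional_authenticator::authenticate` can be shown in isolation: empty credentials and rejected credentials both yield the anonymous user, while any other error still propagates. The following is a minimal synchronous sketch under stated assumptions; `authentication_error`, `user`, `authenticate_fn`, and `transitional_authenticate` are hypothetical names, not part of Scylla.

```cpp
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for exceptions::authentication_exception.
struct authentication_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for authenticated_user; an empty name models the anonymous user.
struct user {
    std::optional<std::string> name;
    bool is_anonymous() const { return !name.has_value(); }
};

// Hypothetical signature for the wrapped (strict) authenticator.
using authenticate_fn = std::function<user(const std::string&, const std::string&)>;

user transitional_authenticate(const authenticate_fn& inner,
                               const std::string& username,
                               const std::string& password) {
    // No credentials supplied at all: treat the client as anonymous.
    if (username.empty() && password.empty()) {
        return user{};
    }
    try {
        return inner(username, password);
    } catch (const authentication_error&) {
        // Rejected credentials are downgraded to anonymous access instead of failing.
        return user{};
    }
}
```

Catching only the authentication error keeps unrelated failures visible to the caller, which mirrors the `catch (exceptions::authentication_exception&)` clause in the diff.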

									
										2

bytes.cc
									
				@@ -1,5 +1,5 @@

				/*

-				 * Copyright (C) 2014 Cloudius Systems, Ltd.

+				 * Copyright (C) 2014 ScyllaDB

				 */

				/*

									
										5

bytes.hh
									
				@@ -1,5 +1,5 @@

				/*

-				 * Copyright (C) 2015 Cloudius Systems, Ltd.

+				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -21,14 +21,17 @@

				#pragma once

				#include "seastarx.hh"

				#include "core/sstring.hh"

				#include "hashing.hh"

				#include <experimental/optional>

				#include <iosfwd>

				#include <functional>

				#include "utils/mutable_view.hh"

				using bytes = basic_sstring<int8_t, uint32_t, 31>;

				using bytes_view = std::experimental::basic_string_view<int8_t>;

				using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;

				using bytes_opt = std::experimental::optional<bytes>;

				using sstring_view = std::experimental::string_view;

									
										112

bytes_ostream.hh
									
				@@ -1,5 +1,5 @@

				/*

-				 * Copyright 2015 Cloudius Systems

+				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -38,6 +38,7 @@ class bytes_ostream {

				public:

				    using size_type = bytes::size_type;

				    using value_type = bytes::value_type;

				    static constexpr size_type max_chunk_size() { return 16 * 1024; }

				private:

				    static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");

				    struct chunk {

				@@ -58,7 +59,6 @@ private:

				    };

				    // FIXME: consider increasing chunk size as the buffer grows

				    static constexpr size_type chunk_size{512};

				    static constexpr size_type usable_chunk_size{chunk_size - sizeof(chunk)};

				private:

				    std::unique_ptr<chunk> _begin;

				    chunk* _current;

				@@ -99,6 +99,19 @@ private:

				        }

				        return _current->size - _current->offset;

				    }

				    // Figure out next chunk size.

				    //   - must be enough for data_size

				    //   - must be at least chunk_size

				    //   - try to double each time to prevent too many allocations

				    //   - do not exceed max_chunk_size

				    size_type next_alloc_size(size_t data_size) const {

				        auto next_size = _current

				                ? _current->size * 2

				                : chunk_size;

				        next_size = std::min(next_size, max_chunk_size());

				        // FIXME: check for overflow?

				        return std::max<size_type>(next_size, data_size + sizeof(chunk));

				    }

				    // Makes room for a contiguous region of given size.

				    // The region is accounted for as already written.

				    // size must not be zero.

				@@ -109,7 +122,7 @@ private:

				            _size += size;

				            return ret;

				        } else {

-				            auto alloc_size = size <= usable_chunk_size ? chunk_size : (size + sizeof(chunk));

+				            auto alloc_size = next_alloc_size(size);

				            auto space = malloc(alloc_size);

				            if (!space) {

				                throw std::bad_alloc();

				@@ -153,19 +166,18 @@ public:

				    }

				    bytes_ostream& operator=(const bytes_ostream& o) {

-				        _size = 0;

-				        _current = nullptr;

-				        _begin = {};

-				        append(o);

+				        if (this != &o) {

+				            auto x = bytes_ostream(o);

+				            *this = std::move(x);

+				        }

				        return *this;

				    }

				    bytes_ostream& operator=(bytes_ostream&& o) noexcept {

-				        _size = o._size;

-				        _begin = std::move(o._begin);

-				        _current = o._current;

-				        o._current = nullptr;

-				        o._size = 0;

+				        if (this != &o) {

+				            this->~bytes_ostream();

+				            new (this) bytes_ostream(std::move(o));

+				        }

				        return *this;

				    }

				@@ -174,7 +186,7 @@ public:

				        value_type* ptr;

				        // makes the place_holder look like a stream

				        seastar::simple_output_stream get_stream() {

-				            return seastar::simple_output_stream{reinterpret_cast<char*>(ptr)};

+				            return seastar::simple_output_stream(reinterpret_cast<char*>(ptr), sizeof(T));

				        }

				    };

				@@ -195,19 +207,19 @@ public:

				        if (v.empty()) {

				            return;

				        }

-				        auto space_left = current_space_left();

-				        if (v.size() <= space_left) {

-				            memcpy(_current->data + _current->offset, v.begin(), v.size());

-				            _current->offset += v.size();

-				            _size += v.size();

-				        } else {

-				            if (space_left) {

-				                memcpy(_current->data + _current->offset, v.begin(), space_left);

-				                _current->offset += space_left;

-				                _size += space_left;

-				                v.remove_prefix(space_left);

-				            }

-				            memcpy(alloc(v.size()), v.begin(), v.size());

+				        auto this_size = std::min(v.size(), size_t(current_space_left()));

+				        if (this_size) {

+				            memcpy(_current->data + _current->offset, v.begin(), this_size);

+				            _current->offset += this_size;

+				            _size += this_size;

+				            v.remove_prefix(this_size);

+				        }

+				        while (!v.empty()) {

+				            auto this_size = std::min(v.size(), size_t(max_chunk_size()));

+				            std::copy_n(v.begin(), this_size, alloc(this_size));

+				            v.remove_prefix(this_size);

				        }

				    }

				@@ -272,13 +284,8 @@ public:

				    }

				    void append(const bytes_ostream& o) {

-				        if (o.size() > 0) {

-				            auto dst = alloc(o.size());

-				            auto r = o._begin.get();

-				            while (r) {

-				                dst = std::copy_n(r->data, r->offset, dst);

-				                r = r->next.get();

-				            }

+				        for (auto&& bv : o.fragments()) {

+				            write(bv);

				        }

				    }

				@@ -328,6 +335,45 @@ public:

				        _current->next = nullptr;

				        _current->offset = pos._offset;

				    }

				    void reduce_chunk_count() {

				        // FIXME: This is a simplified version. It linearizes the whole buffer

				        // if its size is below max_chunk_size. We probably could also gain

				        // some read performance by doing "real" reduction, i.e. merging

				        // all chunks until all but the last one is max_chunk_size.

				        if (size() < max_chunk_size()) {

				            linearize();

				        }

				    }

				    bool operator==(const bytes_ostream& other) const {

				        auto as = fragments().begin();

				        auto as_end = fragments().end();

				        auto bs = other.fragments().begin();

				        auto bs_end = other.fragments().end();

				        auto a = *as++;

				        auto b = *bs++;

				        while (!a.empty() || !b.empty()) {

				            auto now = std::min(a.size(), b.size());

				            if (!std::equal(a.begin(), a.begin() + now, b.begin(), b.begin() + now)) {

				                return false;

				            }

				            a.remove_prefix(now);

				            if (a.empty() && as != as_end) {

				                a = *as++;

				            }

				            b.remove_prefix(now);

				            if (b.empty() && bs != bs_end) {

				                b = *bs++;

				            }

				        }

				        return true;

				    }

				    bool operator!=(const bytes_ostream& other) const {

				        return !(*this == other);

				    }

				};

				template<>

									
										661

cache_flat_mutation_reader.hh
									
										Normal file
									
				@@ -0,0 +1,661 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <vector>

				#include "row_cache.hh"

				#include "mutation_reader.hh"

				#include "streamed_mutation.hh"

				#include "partition_version.hh"

				#include "utils/logalloc.hh"

				#include "query-request.hh"

				#include "partition_snapshot_reader.hh"

				#include "partition_snapshot_row_cursor.hh"

				#include "read_context.hh"

				#include "flat_mutation_reader.hh"

				namespace cache {

				extern logging::logger clogger;

				class cache_flat_mutation_reader final : public flat_mutation_reader::impl {

				    enum class state {

				        before_static_row,

				        // Invariants:

				        //  - position_range(_lower_bound, _upper_bound) covers all not yet emitted positions from current range

				        //  - if _next_row has valid iterators:

				        //    - _next_row points to the nearest row in cache >= _lower_bound

				        //    - _next_row_in_range = _next.position() < _upper_bound

				        //  - if _next_row doesn't have valid iterators, it has no meaning.

				        reading_from_cache,

				        // Starts reading from underlying reader.

				        // The range to read is position_range(_lower_bound, min(_next_row.position(), _upper_bound)).

				        // Invariants:

				        //  - _next_row_in_range = _next.position() < _upper_bound

				        move_to_underlying,

				        // Invariants:

				        // - Upper bound of the read is min(_next_row.position(), _upper_bound)

				        // - _next_row_in_range = _next.position() < _upper_bound

				        // - _last_row points at a direct predecessor of the next row which is going to be read.

				        //   Used for populating continuity.

				        // - _population_range_starts_before_all_rows is set accordingly

				        reading_from_underlying,

				        end_of_stream

				    };

				    lw_shared_ptr<partition_snapshot> _snp;

				    position_in_partition::tri_compare _position_cmp;

				    query::clustering_key_filter_ranges _ck_ranges;

				    query::clustering_row_ranges::const_iterator _ck_ranges_curr;

				    query::clustering_row_ranges::const_iterator _ck_ranges_end;

				    lsa_manager _lsa_manager;

				    partition_snapshot_row_weakref _last_row;

				    // We need to be prepared that we may get overlapping and out of order

				    // range tombstones. We must emit fragments with strictly monotonic positions,

				    // so we can't just trim such tombstones to the position of the last fragment.

				    // To solve that, range tombstones are accumulated first in a range_tombstone_stream

				    // and emitted once we have a fragment with a larger position.

				    range_tombstone_stream _tombstones;

				    // Holds the lower bound of a position range which hasn't been processed yet.

				    // Only fragments with positions < _lower_bound have been emitted.

				    //

				    // It is assumed that !_lower_bound.is_clustering_row(). We depend on this when

				    // calling range_tombstone::trim_front() and when inserting dummy entries. Dummy

				    // entries are assumed to be only at !is_clustering_row() positions.

				    position_in_partition _lower_bound;

				    position_in_partition_view _upper_bound;

				    state _state = state::before_static_row;

				    lw_shared_ptr<read_context> _read_context;

				    partition_snapshot_row_cursor _next_row;

				    bool _next_row_in_range = false;

				    // True iff the current population interval, which begins after the previous clustering row,

				    // starts before all clustered rows. We cannot just look at _lower_bound: emitting range

				    // tombstones changes _lower_bound, and since clustering intervals are marked as continuous

				    // when a clustering_row is consumed, relying on it would prevent us from marking the

				    // interval as continuous.

				    // Valid when _state == reading_from_underlying.

				    bool _population_range_starts_before_all_rows;

				    future<> do_fill_buffer();

				    void copy_from_cache_to_buffer();

				    future<> process_static_row();

				    void move_to_end();

				    void move_to_next_range();

				    void move_to_range(query::clustering_row_ranges::const_iterator);

				    void move_to_next_entry();

				    // Emits all delayed range tombstones with positions smaller than upper_bound.

				    void drain_tombstones(position_in_partition_view upper_bound);

				    // Emits all delayed range tombstones.

				    void drain_tombstones();

				    void add_to_buffer(const partition_snapshot_row_cursor&);

				    void add_clustering_row_to_buffer(mutation_fragment&&);

				    void add_to_buffer(range_tombstone&&);

				    void add_to_buffer(mutation_fragment&&);

				    future<> read_from_underlying();

				    void start_reading_from_underlying();

				    bool after_current_range(position_in_partition_view position);

				    bool can_populate() const;

				    void maybe_update_continuity();

				    void maybe_add_to_cache(const mutation_fragment& mf);

				    void maybe_add_to_cache(const clustering_row& cr);

				    void maybe_add_to_cache(const range_tombstone& rt);

				    void maybe_add_to_cache(const static_row& sr);

				    void maybe_set_static_row_continuous();

				    void finish_reader() {

				        push_mutation_fragment(partition_end());

				        _end_of_stream = true;

				        _state = state::end_of_stream;

				    }

				public:

				    cache_flat_mutation_reader(schema_ptr s,

				                               dht::decorated_key dk,

				                               query::clustering_key_filter_ranges&& crr,

				                               lw_shared_ptr<read_context> ctx,

				                               lw_shared_ptr<partition_snapshot> snp,

				                               row_cache& cache)

				        : flat_mutation_reader::impl(std::move(s))

				        , _snp(std::move(snp))

				        , _position_cmp(*_schema)

				        , _ck_ranges(std::move(crr))

				        , _ck_ranges_curr(_ck_ranges.begin())

				        , _ck_ranges_end(_ck_ranges.end())

				        , _lsa_manager(cache)

				        , _tombstones(*_schema)

				        , _lower_bound(position_in_partition::before_all_clustered_rows())

				        , _upper_bound(position_in_partition_view::before_all_clustered_rows())

				        , _read_context(std::move(ctx))

				        , _next_row(*_schema, *_snp)

				    {

				        clogger.trace("csm {}: table={}.{}", this, _schema->ks_name(), _schema->cf_name());

				        push_mutation_fragment(partition_start(std::move(dk), _snp->partition_tombstone()));

				    }

				    cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;

				    cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;

				    virtual future<> fill_buffer() override;

				    virtual ~cache_flat_mutation_reader() {

				        maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());

				    }

				    virtual void next_partition() override {

				        clear_buffer_to_next_partition();

				        if (is_buffer_empty()) {

				            _end_of_stream = true;

				        }

				    }

				    virtual future<> fast_forward_to(const dht::partition_range&) override {

				        clear_buffer();

				        _end_of_stream = true;

				        return make_ready_future<>();

				    }

				    virtual future<> fast_forward_to(position_range pr) override {

				        throw std::bad_function_call();

				    }

				};

				inline

				future<> cache_flat_mutation_reader::process_static_row() {

				    if (_snp->version()->partition().static_row_continuous()) {

				        _read_context->cache().on_row_hit();

				        row sr = _lsa_manager.run_in_read_section([this] {

				            return _snp->static_row();

				        });

				        if (!sr.empty()) {

				            push_mutation_fragment(mutation_fragment(static_row(std::move(sr))));

				        }

				        return make_ready_future<>();

				    } else {

				        _read_context->cache().on_row_miss();

				        return _read_context->get_next_fragment().then([this] (mutation_fragment_opt&& sr) {

				            if (sr) {

				                assert(sr->is_static_row());

				                maybe_add_to_cache(sr->as_static_row());

				                push_mutation_fragment(std::move(*sr));

				            }

				            maybe_set_static_row_continuous();

				        });

				    }

				}

				inline

				future<> cache_flat_mutation_reader::fill_buffer() {

				    if (_state == state::before_static_row) {

				        auto after_static_row = [this] {

				            if (_ck_ranges_curr == _ck_ranges_end) {

				                finish_reader();

				                return make_ready_future<>();

				            }

				            _state = state::reading_from_cache;

				            _lsa_manager.run_in_read_section([this] {

				                move_to_range(_ck_ranges_curr);

				            });

				            return fill_buffer();

				        };

				        if (_schema->has_static_columns()) {

				            return process_static_row().then(std::move(after_static_row));

				        } else {

				            return after_static_row();

				        }

				    }

				    clogger.trace("csm {}: fill_buffer(), range={}, lb={}", this, *_ck_ranges_curr, _lower_bound);

				    return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this] {

				        return do_fill_buffer();

				    });

				}

				inline

				future<> cache_flat_mutation_reader::do_fill_buffer() {

				    if (_state == state::move_to_underlying) {

				        _state = state::reading_from_underlying;

				        _population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);

				        auto end = _next_row_in_range ? position_in_partition(_next_row.position())

				                                      : position_in_partition(_upper_bound);

				        return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {

				            return read_from_underlying();

				        });

				    }

				    if (_state == state::reading_from_underlying) {

				        return read_from_underlying();

				    }

				    // assert(_state == state::reading_from_cache)

				    return _lsa_manager.run_in_read_section([this] {

				        auto next_valid = _next_row.iterators_valid();

				        clogger.trace("csm {}: reading_from_cache, range=[{}, {}), next={}, valid={}", this, _lower_bound,

				            _upper_bound, _next_row.position(), next_valid);

				        // We assume that if there was eviction, and thus the range may

				        // no longer be continuous, the cursor was invalidated.

				        if (!next_valid) {

				            auto adjacent = _next_row.advance_to(_lower_bound);

				            _next_row_in_range = !after_current_range(_next_row.position());

				            if (!adjacent && !_next_row.continuous()) {

				                _last_row = nullptr; // We could insert a dummy here, but this path is unlikely.

				                start_reading_from_underlying();

				                return make_ready_future<>();

				            }

				        }

				        _next_row.maybe_refresh();

				        clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());

				        while (!is_buffer_full() && _state == state::reading_from_cache) {

				            copy_from_cache_to_buffer();

				            if (need_preempt()) {

				                break;

				            }

				        }

				        return make_ready_future<>();

				    });

				}

				inline

				future<> cache_flat_mutation_reader::read_from_underlying() {

				    return consume_mutation_fragments_until(_read_context->underlying().underlying(),

				        [this] { return _state != state::reading_from_underlying || is_buffer_full(); },

				        [this] (mutation_fragment mf) {

				            _read_context->cache().on_row_miss();

				            maybe_add_to_cache(mf);

				            add_to_buffer(std::move(mf));

				        },

				        [this] {

				            _state = state::reading_from_cache;

				            _lsa_manager.run_in_update_section([this] {

				                auto same_pos = _next_row.maybe_refresh();

				                if (!same_pos) {

				                    _read_context->cache().on_mispopulate(); // FIXME: Insert dummy entry at _upper_bound.

				                    _next_row_in_range = !after_current_range(_next_row.position());

				                    if (!_next_row.continuous()) {

				                        start_reading_from_underlying();

				                    }

				                    return;

				                }

				                if (_next_row_in_range) {

				                    maybe_update_continuity();

				                    _last_row = _next_row;

				                    add_to_buffer(_next_row);

				                    try {

				                        move_to_next_entry();

				                    } catch (const std::bad_alloc&) {

				                        // We cannot reenter the section, since we may have moved to the new range, and

				                        // because add_to_buffer() should not be repeated.

				                        _snp->region().allocator().invalidate_references(); // Invalidates _next_row

				                    }

				                } else {

				                    if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {

				                        this->maybe_update_continuity();

				                    } else if (can_populate()) {

				                        rows_entry::compare less(*_schema);

				                        auto& rows = _snp->version()->partition().clustered_rows();

				                        if (query::is_single_row(*_schema, *_ck_ranges_curr)) {

				                            with_allocator(_snp->region().allocator(), [&] {

				                                auto e = alloc_strategy_unique_ptr<rows_entry>(

				                                    current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));

				                                // Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.

				                                auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);

				                                auto inserted = insert_result.second;

				                                auto it = insert_result.first;

				                                if (inserted) {

				                                    e.release();

				                                    auto next = std::next(it);

				                                    it->set_continuous(next->continuous());

				                                    clogger.trace("csm {}: inserted dummy at {}, cont={}", this, it->position(), it->continuous());

				                                }

				                            });

				                        } else if (!_ck_ranges_curr->start() || _last_row.refresh(*_snp)) {

				                            with_allocator(_snp->region().allocator(), [&] {

				                                auto e = alloc_strategy_unique_ptr<rows_entry>(

				                                    current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));

				                                // Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.

				                                auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);

				                                auto inserted = insert_result.second;

				                                if (inserted) {

				                                    clogger.trace("csm {}: inserted dummy at {}", this, _upper_bound);

				                                    e.release();

				                                } else {

				                                    clogger.trace("csm {}: mark {} as continuous", this, insert_result.first->position());

				                                    insert_result.first->set_continuous(true);

				                                }

				                            });

				                        }

				                    } else {

				                        _read_context->cache().on_mispopulate();

				                    }

				                    try {

				                        move_to_next_range();

				                    } catch (const std::bad_alloc&) {

				                        // We cannot reenter the section, since we may have moved to the new range

				                        _snp->region().allocator().invalidate_references(); // Invalidates _next_row

				                    }

				                }

				            });

				            return make_ready_future<>();

				        });

				}

				inline

				void cache_flat_mutation_reader::maybe_update_continuity() {

				    if (can_populate() && (_population_range_starts_before_all_rows || _last_row.refresh(*_snp))) {

				        if (_next_row.is_in_latest_version()) {

				            clogger.trace("csm {}: mark {} continuous", this, _next_row.get_iterator_in_latest_version()->position());

				            _next_row.get_iterator_in_latest_version()->set_continuous(true);

				        } else {

				            // Cover entry from older version

				            with_allocator(_snp->region().allocator(), [&] {

				                auto& rows = _snp->version()->partition().clustered_rows();

				                rows_entry::compare less(*_schema);

				                auto e = alloc_strategy_unique_ptr<rows_entry>(

				                    current_allocator().construct<rows_entry>(*_schema, _next_row.position(), is_dummy(_next_row.dummy()), is_continuous::yes));

				                auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);

				                auto inserted = insert_result.second;

				                if (inserted) {

				                    clogger.trace("csm {}: inserted dummy at {}", this, e->position());

				                    e.release();

				                }

				            });

				        }

				    } else {

				        _read_context->cache().on_mispopulate();

				    }

				}

				inline

				void cache_flat_mutation_reader::maybe_add_to_cache(const mutation_fragment& mf) {

				    if (mf.is_range_tombstone()) {

				        maybe_add_to_cache(mf.as_range_tombstone());

				    } else {

				        assert(mf.is_clustering_row());

				        const clustering_row& cr = mf.as_clustering_row();

				        maybe_add_to_cache(cr);

				    }

				}

				inline

				void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {

				    if (!can_populate()) {

				        _last_row = nullptr;

				        _population_range_starts_before_all_rows = false;

				        _read_context->cache().on_mispopulate();

				        return;

				    }

				    clogger.trace("csm {}: populate({})", this, cr);

				    _lsa_manager.run_in_update_section_with_allocator([this, &cr] {

				        mutation_partition& mp = _snp->version()->partition();

				        rows_entry::compare less(*_schema);

				        auto new_entry = alloc_strategy_unique_ptr<rows_entry>(

				            current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));

				        new_entry->set_continuous(false);

				        auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()

				                                              : mp.clustered_rows().lower_bound(cr.key(), less);

				        auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);

				        if (insert_result.second) {

				            _read_context->cache().on_row_insert();

				            new_entry.release();

				        }

				        it = insert_result.first;

				        rows_entry& e = *it;

				        if (!_ck_ranges_curr->start() || _last_row.refresh(*_snp)) {

				            clogger.trace("csm {}: set_continuous({})", this, e.position());

				            e.set_continuous(true);

				        } else {

				            _read_context->cache().on_mispopulate();

				        }

				        with_allocator(standard_allocator(), [&] {

				            _last_row = partition_snapshot_row_weakref(*_snp, it);

				        });

				        _population_range_starts_before_all_rows = false;

				    });

				}

				inline

				bool cache_flat_mutation_reader::after_current_range(position_in_partition_view p) {

				    return _position_cmp(p, _upper_bound) >= 0;

				}

				inline

				void cache_flat_mutation_reader::start_reading_from_underlying() {

				    clogger.trace("csm {}: start_reading_from_underlying(), range=[{}, {})", this, _lower_bound, _next_row_in_range ? _next_row.position() : _upper_bound);

				    _state = state::move_to_underlying;

				}

				inline

				void cache_flat_mutation_reader::copy_from_cache_to_buffer() {

				    clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", this, _next_row.position(), _next_row_in_range);

				    position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());

				    for (auto&& rts : _snp->range_tombstones(*_schema, _lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {

				        add_to_buffer(std::move(rts));

				        if (is_buffer_full()) {

				            return;

				        }

				    }

				    if (_next_row_in_range) {

				        _last_row = _next_row;

				        add_to_buffer(_next_row);

				        move_to_next_entry();

				    } else {

				        move_to_next_range();

				    }

				}

				inline

				void cache_flat_mutation_reader::move_to_end() {

				    drain_tombstones();

				    finish_reader();

				    clogger.trace("csm {}: eos", this);

				}

				inline

				void cache_flat_mutation_reader::move_to_next_range() {

				    auto next_it = std::next(_ck_ranges_curr);

				    if (next_it == _ck_ranges_end) {

				        move_to_end();

				        _ck_ranges_curr = next_it;

				    } else {

				        move_to_range(next_it);

				    }

				}

				inline

				void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::const_iterator next_it) {

				    auto lb = position_in_partition::for_range_start(*next_it);

				    auto ub = position_in_partition_view::for_range_end(*next_it);

				    _last_row = nullptr;

				    _lower_bound = std::move(lb);

				    _upper_bound = std::move(ub);

				    _ck_ranges_curr = next_it;

				    auto adjacent = _next_row.advance_to(_lower_bound);

				    _next_row_in_range = !after_current_range(_next_row.position());

				    clogger.trace("csm {}: move_to_range(), range={}, lb={}, ub={}, next={}", this, *_ck_ranges_curr, _lower_bound, _upper_bound, _next_row.position());

				    if (!adjacent && !_next_row.continuous()) {

				        // FIXME: We don't insert a dummy for a singular range to avoid allocating 3 entries

				        // for a hit (before, at and after). If we supported the concept of an incomplete row,

				        // we could insert such a row for the lower bound if it's full instead, for both singular and

				        // non-singular ranges.

				        if (_ck_ranges_curr->start() && !query::is_single_row(*_schema, *_ck_ranges_curr)) {

				            // Insert dummy for lower bound

				            if (can_populate()) {

				                // FIXME: _lower_bound could be adjacent to the previous row, in which case we could skip this

				                clogger.trace("csm {}: insert dummy at {}", this, _lower_bound);

				                auto it = with_allocator(_lsa_manager.region().allocator(), [&] {

				                    auto& rows = _snp->version()->partition().clustered_rows();

				                    auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);

				                    return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);

				                });

				                _last_row = partition_snapshot_row_weakref(*_snp, it);

				            } else {

				                _read_context->cache().on_mispopulate();

				            }

				        }

				        start_reading_from_underlying();

				    }

				}

				// _next_row must be inside the range.

				inline

				void cache_flat_mutation_reader::move_to_next_entry() {

				    clogger.trace("csm {}: move_to_next_entry(), curr={}", this, _next_row.position());

				    if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {

				        move_to_next_range();

				    } else {

				        if (!_next_row.next()) {

				            move_to_end();

				            return;

				        }

				        _next_row_in_range = !after_current_range(_next_row.position());

				        clogger.trace("csm {}: next={}, cont={}, in_range={}", this, _next_row.position(), _next_row.continuous(), _next_row_in_range);

				        if (!_next_row.continuous()) {

				            start_reading_from_underlying();

				        }

				    }

				}

				inline

				void cache_flat_mutation_reader::drain_tombstones(position_in_partition_view pos) {

				    while (true) {

				        reserve_one();

				        auto mfo = _tombstones.get_next(pos);

				        if (!mfo) {

				            break;

				        }

				        push_mutation_fragment(std::move(*mfo));

				    }

				}

				inline

				void cache_flat_mutation_reader::drain_tombstones() {

				    while (true) {

				        reserve_one();

				        auto mfo = _tombstones.get_next();

				        if (!mfo) {

				            break;

				        }

				        push_mutation_fragment(std::move(*mfo));

				    }

				}

				inline

				void cache_flat_mutation_reader::add_to_buffer(mutation_fragment&& mf) {

				    clogger.trace("csm {}: add_to_buffer({})", this, mf);

				    if (mf.is_clustering_row()) {

				        add_clustering_row_to_buffer(std::move(mf));

				    } else {

				        assert(mf.is_range_tombstone());

				        add_to_buffer(std::move(mf).as_range_tombstone());

				    }

				}

				inline

				void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_cursor& row) {

				    if (!row.dummy()) {

				        _read_context->cache().on_row_hit();

				        add_clustering_row_to_buffer(row.row());

				    }

				}

				// Maintains the following invariants, also in case of exception:

				//   (1) no fragment with position >= _lower_bound was pushed yet

				//   (2) If _lower_bound > mf.position(), mf was emitted

				inline

				void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&& mf) {

				    clogger.trace("csm {}: add_clustering_row_to_buffer({})", this, mf);

				    auto& row = mf.as_clustering_row();

				    auto key = row.key();

				    try {

				        drain_tombstones(row.position());

				        push_mutation_fragment(std::move(mf));

				        _lower_bound = position_in_partition::after_key(std::move(key));

				    } catch (...) {

				        // We may have emitted some of the range tombstones which start after the old _lower_bound

				        _lower_bound = position_in_partition::for_key(std::move(key));

				        throw;

				    }

				}

				inline

				void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {

				    clogger.trace("csm {}: add_to_buffer({})", this, rt);

				    // This guarantees that rt starts after any emitted clustering_row

				    if (!rt.trim_front(*_schema, _lower_bound)) {

				        return;

				    }

				    _lower_bound = position_in_partition(rt.position());

				    _tombstones.apply(std::move(rt));

				    drain_tombstones(_lower_bound);

				}

				inline

				void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {

				    if (can_populate()) {

				        clogger.trace("csm {}: maybe_add_to_cache({})", this, rt);

				        _lsa_manager.run_in_update_section_with_allocator([&] {

				            _snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);

				        });

				    } else {

				        _read_context->cache().on_mispopulate();

				    }

				}

				inline

				void cache_flat_mutation_reader::maybe_add_to_cache(const static_row& sr) {

				    if (can_populate()) {

				        clogger.trace("csm {}: populate({})", this, sr);

				        _read_context->cache().on_row_insert();

				        _lsa_manager.run_in_update_section_with_allocator([&] {

				            _snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());

				        });

				    } else {

				        _read_context->cache().on_mispopulate();

				    }

				}

				inline

				void cache_flat_mutation_reader::maybe_set_static_row_continuous() {

				    if (can_populate()) {

				        clogger.trace("csm {}: set static row continuous", this);

				        _snp->version()->partition().set_static_row_continuous(true);

				    } else {

				        _read_context->cache().on_mispopulate();

				    }

				}

				inline

				bool cache_flat_mutation_reader::can_populate() const {

				    return _snp->at_latest_version() && _read_context->cache().phase_of(_read_context->key()) == _read_context->phase();

				}

				} // namespace cache

				inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,

				                                                            dht::decorated_key dk,

				                                                            query::clustering_key_filter_ranges crr,

				                                                            row_cache& cache,

				                                                            lw_shared_ptr<cache::read_context> ctx,

				                                                            lw_shared_ptr<partition_snapshot> snp)

				{

				    return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(

				        std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);

				}
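
The reader above buffers range tombstones in `_tombstones` and only emits them once a fragment with a larger position is about to be pushed (see `drain_tombstones()` and `add_clustering_row_to_buffer()`). The following is a minimal sketch of that delayed-emission idea, not the Scylla implementation: positions are plain integers, `tombstone_sketch`, `delayed_tombstones` and `emit_order` are hypothetical names, and unlike the real `range_tombstone_stream` the sketch only orders tombstones by start position and does not split or merge overlapping ones.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a range tombstone covering [start, end).
struct tombstone_sketch {
    int start;
    int end;
};

// Hypothetical stand-in for range_tombstone_stream: buffers tombstones,
// ordered by start position, until the caller asks for them.
class delayed_tombstones {
    std::multimap<int, tombstone_sketch> _pending;
public:
    void apply(tombstone_sketch t) {
        _pending.emplace(t.start, t);
    }

    // Returns the next buffered tombstone that starts before `pos`, if any.
    std::optional<tombstone_sketch> get_next(int pos) {
        auto it = _pending.begin();
        if (it == _pending.end() || it->first >= pos) {
            return std::nullopt;
        }
        tombstone_sketch t = it->second;
        _pending.erase(it);
        return t;
    }

    // Returns the next buffered tombstone regardless of position, if any.
    std::optional<tombstone_sketch> get_next() {
        auto it = _pending.begin();
        if (it == _pending.end()) {
            return std::nullopt;
        }
        tombstone_sketch t = it->second;
        _pending.erase(it);
        return t;
    }
};

// Feeds tombstones in arbitrary order, then emits a row at `row_pos`.
// Returns the positions in the order they would reach the output buffer.
std::vector<int> emit_order(const std::vector<tombstone_sketch>& input, int row_pos) {
    delayed_tombstones ts;
    for (const auto& t : input) {
        ts.apply(t);
    }
    std::vector<int> out;
    // Drain everything that starts before the row, then push the row itself.
    while (auto t = ts.get_next(row_pos)) {
        out.push_back(t->start);
    }
    out.push_back(row_pos);
    // At end of stream the remaining tombstones are drained unconditionally.
    while (auto t = ts.get_next()) {
        out.push_back(t->start);
    }
    return out;
}
```

Calling `emit_order` with tombstones starting at 5, 2 and 12 and a row at position 10 yields increasing positions even though the input was out of order, which is the property the reader's comment asks for.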

									
										43

caching_options.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 Cloudius Systems, Ltd.

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -24,6 +24,7 @@

				#include <boost/lexical_cast.hpp>

				#include "exceptions/exceptions.hh"

				#include "json.hh"

				#include "seastarx.hh"

				class schema;

				@@ -58,30 +59,34 @@ class caching_options {

				    caching_options() : _key_cache(default_key), _row_cache(default_row) {}

				public:

				    sstring to_sstring() const {

				        return json::to_json(std::map<sstring, sstring>({{ "keys", _key_cache }, { "rows_per_partition", _row_cache }}));

				    std::map<sstring, sstring> to_map() const {

				        return {{ "keys", _key_cache }, { "rows_per_partition", _row_cache }};

				    }

				    static caching_options from_sstring(const sstring& str) {

				        auto map = json::to_map(str);

				        if (map.size() > 2) {

				            throw exceptions::configuration_exception("Invalid map: " + str); 

				        }

				        sstring k;

				        sstring r;

				        if (map.count("keys")) {

				            k = map.at("keys");

				        } else {

				            k = default_key;

				        }

				    sstring to_sstring() const {

				        return json::to_json(to_map());

				    }

				        if (map.count("rows_per_partition")) {

				            r = map.at("rows_per_partition");

				        } else {

				            r = default_row;

				    template<typename Map>

				    static caching_options from_map(const Map & map) {

				        sstring k = default_key;

				        sstring r = default_row;

				        for (auto& p : map) {

				            if (p.first == "keys") {

				                k = p.second;

				            } else if (p.first == "rows_per_partition") {

				                r = p.second;

				            } else {

				                throw exceptions::configuration_exception("Invalid caching option: " + p.first);

				            }

				        }

				        return caching_options(k, r);

				    }

				    static caching_options from_sstring(const sstring& str) {

				        return from_map(json::to_map(str));

				    }

				    bool operator==(const caching_options& other) const {

				        return _key_cache == other._key_cache && _row_cache == other._row_cache;

				    }
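
The new `from_map()` in the hunk above starts from the defaults, overrides them for the two recognized keys, and rejects any other key. The following is a minimal standalone sketch of that parsing pattern, under stated assumptions: it uses `std::string` and `std::invalid_argument` in place of `sstring` and `exceptions::configuration_exception`, the names `caching_options_sketch` and `from_map_sketch` are hypothetical, and the default value `"ALL"` is a placeholder rather than a value taken from this diff.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for caching_options; the defaults are placeholders.
struct caching_options_sketch {
    std::string keys = "ALL";
    std::string rows_per_partition = "ALL";
};

// Accepts any container of (name, value) string pairs, as the templated
// from_map() in the diff does. Unknown option names are rejected.
template <typename Map>
caching_options_sketch from_map_sketch(const Map& map) {
    caching_options_sketch opts;
    for (const auto& p : map) {
        if (p.first == "keys") {
            opts.keys = p.second;
        } else if (p.first == "rows_per_partition") {
            opts.rows_per_partition = p.second;
        } else {
            throw std::invalid_argument("Invalid caching option: " + p.first);
        }
    }
    return opts;
}
```

Compared with the removed `from_sstring()` body, which only checked that the map had at most two entries, iterating over the entries reports the offending key by name and lets the same code serve both JSON strings and maps.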

									
										3

canonical_mutation.cc
									
				@@ -22,6 +22,7 @@

				#include "canonical_mutation.hh"

				#include "mutation.hh"

				#include "mutation_partition_serializer.hh"

				#include "counters.hh"

				#include "converting_mutation_partition_applier.hh"

				#include "hashing_partition_visitor.hh"

				#include "utils/UUID.hh"

				@@ -44,7 +45,7 @@ canonical_mutation::canonical_mutation(const mutation& m)

				    mutation_partition_serializer part_ser(*m.schema(), m.partition());

				    bytes_ostream out;

				    ser::writer_of_canonical_mutation wr(out);

				    ser::writer_of_canonical_mutation<bytes_ostream> wr(out);

				    std::move(wr).write_table_id(m.schema()->id())

				                 .write_schema_version(m.schema()->version())

				                 .write_key(m.key())

									
										2

cartesian_product.hh
									
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 *

				 */

									
										566

cell_locking.hh
									
										Normal file
									
				@@ -0,0 +1,566 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <boost/intrusive/unordered_set.hpp>

				#if __has_include(<boost/container/small_vector.hpp>)

				#include <boost/container/small_vector.hpp>

				template <typename T, size_t N>

				using small_vector = boost::container::small_vector<T, N>;

				#else

				#include <vector>

				template <typename T, size_t N>

				using small_vector = std::vector<T>;

				#endif

				#include "fnv1a_hasher.hh"

				#include "streamed_mutation.hh"

				#include "mutation_partition.hh"

				class cells_range {

				    using ids_vector_type = small_vector<column_id, 5>;

				    position_in_partition_view _position;

				    ids_vector_type _ids;

				public:

				    using iterator = ids_vector_type::iterator;

				    using const_iterator = ids_vector_type::const_iterator;

				    cells_range()

				        : _position(position_in_partition_view(position_in_partition_view::static_row_tag_t())) { }

				    explicit cells_range(position_in_partition_view pos, const row& cells)

				        : _position(pos)

				    {

				        _ids.reserve(cells.size());

				        cells.for_each_cell([this] (auto id, auto&&) {

				            _ids.emplace_back(id);

				        });

				    }

				    position_in_partition_view position() const { return _position; }

				    bool empty() const { return _ids.empty(); }

				    auto begin() const { return _ids.begin(); }

				    auto end() const { return _ids.end(); }

				};

				class partition_cells_range {

				    const mutation_partition& _mp;

				public:

				    class iterator {

				        const mutation_partition& _mp;

				        stdx::optional<mutation_partition::rows_type::const_iterator> _position;

				        cells_range _current;

				    public:

				        explicit iterator(const mutation_partition& mp)

				            : _mp(mp)

				            , _current(position_in_partition_view(position_in_partition_view::static_row_tag_t()), mp.static_row())

				        { }

				        iterator(const mutation_partition& mp, mutation_partition::rows_type::const_iterator it)

				            : _mp(mp)

				            , _position(it)

				        { }

				        iterator& operator++() {

				            if (!_position) {

				                _position = _mp.clustered_rows().begin();

				            } else {

				                ++(*_position);

				            }

				            if (_position != _mp.clustered_rows().end()) {

				                auto it = *_position;

				                _current = cells_range(position_in_partition_view(position_in_partition_view::clustering_row_tag_t(), it->key()),

				                        it->row().cells());

				            }

				            return *this;

				        }

				        iterator operator++(int) {

				            iterator it(*this);

				            operator++();

				            return it;

				        }

				        cells_range& operator*() {

				            return _current;

				        }

				        cells_range* operator->() {

				            return &_current;

				        }

				        bool operator==(const iterator& other) const {

				            return _position == other._position;

				        }

				        bool operator!=(const iterator& other) const {

				            return !(*this == other);

				        }

				    };

				public:

				    explicit partition_cells_range(const mutation_partition& mp) : _mp(mp) { }

				    iterator begin() const {

				        return iterator(_mp);

				    }

				    iterator end() const {

				        return iterator(_mp, _mp.clustered_rows().end());

				    }

				};

				class locked_cell;

				struct cell_locker_stats {

				    uint64_t lock_acquisitions = 0;

				    uint64_t operations_waiting_for_lock = 0;

				};

				class cell_locker {

				public:

				    using timeout_clock = lowres_clock;

				private:

				    using semaphore_type = basic_semaphore<default_timeout_exception_factory, timeout_clock>;

				    class partition_entry;

				    struct cell_address {

				        position_in_partition position;

				        column_id id;

				    };

				    class cell_entry : public bi::unordered_set_base_hook<bi::link_mode<bi::auto_unlink>>,

				                       public enable_lw_shared_from_this<cell_entry> {

				        partition_entry& _parent;

				        cell_address _address;

				        semaphore_type _semaphore { 0 };

				        friend class cell_locker;

				    public:

				        cell_entry(partition_entry& parent, position_in_partition position, column_id id)

				            : _parent(parent)

				            , _address { std::move(position), id }

				        { }

				        // Upgrades cell_entry to another schema.

				        // Changes the value of cell_address, so cell_entry has to be

				        // temporarily removed from its parent partition_entry.

				        // Returns true if the cell_entry still exists in the new schema and

				        // should be reinserted.

				        bool upgrade(const schema& from, const schema& to, column_kind kind) noexcept {

				            auto& old_column_mapping = from.get_column_mapping();

				            auto& column = old_column_mapping.column_at(kind, _address.id);

				            auto cdef = to.get_column_definition(column.name());

				            if (!cdef) {

				                return false;

				            }

				            _address.id = cdef->id;

				            return true;

				        }

				        const position_in_partition& position() const {

				            return _address.position;

				        }

				        future<> lock(timeout_clock::time_point _timeout) {

				            return _semaphore.wait(_timeout);

				        }

				        void unlock() {

				            _semaphore.signal();

				        }

				        ~cell_entry() {

				            if (!is_linked()) {

				                return;

				            }

				            unlink();

				            if (!--_parent._cell_count) {

				                delete &_parent;

				            }

				        }

				        class hasher {

				            const schema* _schema; // pointer instead of reference for default assignment

				        public:

				            explicit hasher(const schema& s) : _schema(&s) { }

				            size_t operator()(const cell_address& ca) const {

				                fnv1a_hasher hasher;

				                ca.position.feed_hash(hasher, *_schema);

				                ::feed_hash(hasher, ca.id);

				                return hasher.finalize();

				            }

				            size_t operator()(const cell_entry& ce) const {

				                return operator()(ce._address);

				            }

				        };

				        class equal_compare {

				            position_in_partition::equal_compare _cmp;

				        private:

				            bool do_compare(const cell_address& a, const cell_address& b) const {

				                return a.id == b.id && _cmp(a.position, b.position);

				            }

				        public:

				            explicit equal_compare(const schema& s) : _cmp(s) { }

				            bool operator()(const cell_address& ca, const cell_entry& ce) const {

				                return do_compare(ca, ce._address);

				            }

				            bool operator()(const cell_entry& ce, const cell_address& ca) const {

				                return do_compare(ca, ce._address);

				            }

				            bool operator()(const cell_entry& a, const cell_entry& b) const {

				                return do_compare(a._address, b._address);

				            }

				        };

				    };

				    class partition_entry : public bi::unordered_set_base_hook<bi::link_mode<bi::auto_unlink>> {

				        using cells_type = bi::unordered_set<cell_entry,

				                                             bi::equal<cell_entry::equal_compare>,

				                                             bi::hash<cell_entry::hasher>,

				                                             bi::constant_time_size<false>>;

				        static constexpr size_t initial_bucket_count = 16;

				        using max_load_factor = std::ratio<3, 4>;

				        dht::decorated_key _key;

				        cell_locker& _parent;

				        size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);

				        std::unique_ptr<cells_type::bucket_type[]> _buckets; // TODO: start with internal storage?

				        size_t _cell_count = 0; // cells_type::empty() is not O(1) if the hook is auto-unlink

				        cells_type::bucket_type _internal_buckets[initial_bucket_count];

				        cells_type _cells;

				        schema_ptr _schema;

				        friend class cell_entry;

				    private:

				        static constexpr size_t compute_rehash_at_size(size_t bucket_count) {

				            return bucket_count * max_load_factor::num / max_load_factor::den;

				        }

				        void maybe_rehash() {

				            if (_cell_count >= _rehash_at_size) {

				                auto new_bucket_count = std::min(_cells.bucket_count() * 2, _cells.bucket_count() + 1024);

				                auto buckets = std::make_unique<cells_type::bucket_type[]>(new_bucket_count);

				                _cells.rehash(cells_type::bucket_traits(buckets.get(), new_bucket_count));

				                _buckets = std::move(buckets);

				                _rehash_at_size = compute_rehash_at_size(new_bucket_count);

				            }

				        }

				    public:

				        partition_entry(schema_ptr s, cell_locker& parent, const dht::decorated_key& dk)

				            : _key(dk)

				            , _parent(parent)

				            , _cells(cells_type::bucket_traits(_internal_buckets, initial_bucket_count),

				                     cell_entry::hasher(*s), cell_entry::equal_compare(*s))

				            , _schema(s)

				        { }

				        ~partition_entry() {

				            if (is_linked()) {

				                _parent._partition_count--;

				            }

				        }

				        // Upgrades partition entry to new schema. Returns false if all

				        // cell_entries have been removed during the upgrade.

				        bool upgrade(schema_ptr new_schema);

				        void insert(lw_shared_ptr<cell_entry> cell) {

				            _cells.insert(*cell);

				            _cell_count++;

				            maybe_rehash();

				        }

				        cells_type& cells() {

				            return _cells;

				        }

				        struct hasher {

				            size_t operator()(const dht::decorated_key& dk) const {

				                return std::hash<dht::decorated_key>()(dk);

				            }

				            size_t operator()(const partition_entry& pe) const {

				                return operator()(pe._key);

				            }

				        };

				        class equal_compare {

				            dht::decorated_key_equals_comparator _cmp;

				        public:

				            explicit equal_compare(const schema& s) : _cmp(s) { }

				            bool operator()(const dht::decorated_key& dk, const partition_entry& pe) {

				                return _cmp(dk, pe._key);

				            }

				            bool operator()(const partition_entry& pe, const dht::decorated_key& dk) {

				                return _cmp(dk, pe._key);

				            }

				            bool operator()(const partition_entry& a, const partition_entry& b) {

				                return _cmp(a._key, b._key);

				            }

				        };

				    };

				    using partitions_type = bi::unordered_set<partition_entry,

				                                              bi::equal<partition_entry::equal_compare>,

				                                              bi::hash<partition_entry::hasher>,

				                                              bi::constant_time_size<false>>;

				    static constexpr size_t initial_bucket_count = 4 * 1024;

				    using max_load_factor = std::ratio<3, 4>;

				    std::unique_ptr<partitions_type::bucket_type[]> _buckets;

				    partitions_type _partitions;

				    size_t _partition_count = 0;

				    size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);

				    schema_ptr _schema;

				    // partitions_type uses an equality comparator that keeps a reference to the

				    // original schema, so we must ensure that the schema outlives it.

				    schema_ptr _original_schema;

				    cell_locker_stats& _stats;

				    friend class locked_cell;

				private:

				    struct locker;

				    static constexpr size_t compute_rehash_at_size(size_t bucket_count) {

				        return bucket_count * max_load_factor::num / max_load_factor::den;

				    }

				    void maybe_rehash() {

				        if (_partition_count >= _rehash_at_size) {

				            auto new_bucket_count = std::min(_partitions.bucket_count() * 2, _partitions.bucket_count() + 64 * 1024);

				            auto buckets = std::make_unique<partitions_type::bucket_type[]>(new_bucket_count);

				            _partitions.rehash(partitions_type::bucket_traits(buckets.get(), new_bucket_count));

				            _buckets = std::move(buckets);

				            _rehash_at_size = compute_rehash_at_size(new_bucket_count);

				        }

				    }

				public:

				    explicit cell_locker(schema_ptr s, cell_locker_stats& stats)

				        : _buckets(std::make_unique<partitions_type::bucket_type[]>(initial_bucket_count))

				        , _partitions(partitions_type::bucket_traits(_buckets.get(), initial_bucket_count),

				                      partition_entry::hasher(), partition_entry::equal_compare(*s))

				        , _schema(s)

				        , _original_schema(std::move(s))

				        , _stats(stats)

				    { }

				    ~cell_locker() {

				        assert(_partitions.empty());

				    }

				    void set_schema(schema_ptr s) {

				        _schema = s;

				    }

				    schema_ptr schema() const {

				        return _schema;

				    }

				    // partition_cells_range is required to be in cell_locker::schema()

				    future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range,

				                                                timeout_clock::time_point timeout);

				};

				class locked_cell {

				    lw_shared_ptr<cell_locker::cell_entry> _entry;

				public:

				    explicit locked_cell(lw_shared_ptr<cell_locker::cell_entry> entry)

				        : _entry(std::move(entry)) { }

				    locked_cell(const locked_cell&) = delete;

				    locked_cell(locked_cell&&) = default;

				    ~locked_cell() {

				        if (_entry) {

				            _entry->unlock();

				        }

				    }

				};

				struct cell_locker::locker {

				    cell_entry::hasher _hasher;

				    cell_entry::equal_compare _eq_cmp;

				    partition_entry& _partition_entry;

				    partition_cells_range _range;

				    partition_cells_range::iterator _current_ck;

				    cells_range::const_iterator _current_cell;

				    timeout_clock::time_point _timeout;

				    std::vector<locked_cell> _locks;

				    cell_locker_stats& _stats;

				private:

				    void update_ck() {

				        if (!is_done()) {

				            _current_cell = _current_ck->begin();

				        }

				    }

				    future<> lock_next();

				    bool is_done() const { return _current_ck == _range.end(); }

				public:

				    explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, timeout_clock::time_point timeout)

				        : _hasher(s)

				        , _eq_cmp(s)

				        , _partition_entry(pe)

				        , _range(std::move(range))

				        , _current_ck(_range.begin())

				        , _timeout(timeout)

				        , _stats(st)

				    {

				        update_ck();

				    }

				    locker(const locker&) = delete;

				    locker(locker&&) = delete;

				    future<> lock_all() {

				        // Cannot defer before first call to lock_next().

				        return lock_next().then([this] {

				            return do_until([this] { return is_done(); }, [this] {

				                return lock_next();

				            });

				        });

				    }

				    std::vector<locked_cell> get() && { return std::move(_locks); }

				};

				inline

				future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, timeout_clock::time_point timeout) {

				    partition_entry::hasher pe_hash;

				    partition_entry::equal_compare pe_eq(*_schema);

				    auto it = _partitions.find(dk, pe_hash, pe_eq);

				    std::unique_ptr<partition_entry> partition;

				    if (it == _partitions.end()) {

				        partition = std::make_unique<partition_entry>(_schema, *this, dk);

				    } else if (!it->upgrade(_schema)) {

				        partition = std::unique_ptr<partition_entry>(&*it);

				        _partition_count--;

				        _partitions.erase(it);

				    }

				    if (partition) {

				        std::vector<locked_cell> locks;

				        for (auto&& r : range) {

				            if (r.empty()) {

				                continue;

				            }

				            for (auto&& c : r) {

				                auto cell = make_lw_shared<cell_entry>(*partition, position_in_partition(r.position()), c);

				                _stats.lock_acquisitions++;

				                partition->insert(cell);

				                locks.emplace_back(std::move(cell));

				            }

				        }

				        if (!locks.empty()) {

				            _partitions.insert(*partition.release());

				            _partition_count++;

				            maybe_rehash();

				        }

				        return make_ready_future<std::vector<locked_cell>>(std::move(locks));

				    }

				    auto l = std::make_unique<locker>(*_schema, _stats, *it, std::move(range), timeout);

				    auto f = l->lock_all();

				    return f.then([l = std::move(l)] {

				        return std::move(*l).get();

				    });

				}

				inline

				future<> cell_locker::locker::lock_next() {

				    while (!is_done()) {

				        if (_current_cell == _current_ck->end()) {

				            ++_current_ck;

				            update_ck();

				            continue;

				        }

				        auto cid = *_current_cell++;

				        cell_address ca { position_in_partition(_current_ck->position()), cid };

				        auto it = _partition_entry.cells().find(ca, _hasher, _eq_cmp);

				        if (it != _partition_entry.cells().end()) {

				            _stats.operations_waiting_for_lock++;

				            return it->lock(_timeout).then([this, ce = it->shared_from_this()] () mutable {

				                _stats.operations_waiting_for_lock--;

				                _stats.lock_acquisitions++;

				                _locks.emplace_back(std::move(ce));

				            });

				        }

				        auto cell = make_lw_shared<cell_entry>(_partition_entry, position_in_partition(_current_ck->position()), cid);

				        _stats.lock_acquisitions++;

				        _partition_entry.insert(cell);

				        _locks.emplace_back(std::move(cell));

				    }

				    return make_ready_future<>();

				}

				inline

				bool cell_locker::partition_entry::upgrade(schema_ptr new_schema) {

				    if (_schema == new_schema) {

				        return true;

				    }

				    auto buckets = std::make_unique<cells_type::bucket_type[]>(_cells.bucket_count());

				    auto cells = cells_type(cells_type::bucket_traits(buckets.get(), _cells.bucket_count()),

				                            cell_entry::hasher(*new_schema), cell_entry::equal_compare(*new_schema));

				    _cells.clear_and_dispose([&] (cell_entry* cell_ptr) noexcept {

				        auto& cell = *cell_ptr;

				        auto kind = cell.position().is_static_row() ? column_kind::static_column

				                                                    : column_kind::regular_column;

				        auto reinsert = cell.upgrade(*_schema, *new_schema, kind);

				        if (reinsert) {

				            cells.insert(cell);

				        } else {

				            _cell_count--;

				        }

				    });

				    // bi::unordered_set move assignment is actually a swap.

				    // The original _buckets must outlive the container that uses them, so we

				    // explicitly make sure the original _cells is gone before replacing _buckets.

				    _cells = std::move(cells);

				    auto destroy = [] (auto) { };

				    destroy(std::move(cells));

				    _buckets = std::move(buckets);

				    _schema = new_schema;

				    return _cell_count;

				}
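The two maybe_rehash() methods in cell_locking.hh follow the same policy: rehash once the element count reaches 3/4 of the bucket count, and grow the bucket array by doubling, capped at a fixed increment (1024 buckets per partition_entry, 64 * 1024 for the cell_locker itself). The following is a minimal standalone sketch of that arithmetic only. It uses no Boost.Intrusive containers, and next_bucket_count is a hypothetical helper name that does not appear in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ratio>

// Same load factor as in the header: rehash at 3/4 occupancy.
using max_load_factor = std::ratio<3, 4>;

// Mirrors compute_rehash_at_size() from the diff.
constexpr std::size_t compute_rehash_at_size(std::size_t bucket_count) {
    return bucket_count * max_load_factor::num / max_load_factor::den;
}

// Hypothetical helper: the growth step used inside maybe_rehash().
// growth_cap is 1024 for cell tables and 64 * 1024 for the partition table.
inline std::size_t next_bucket_count(std::size_t current, std::size_t growth_cap) {
    return std::min(current * 2, current + growth_cap);
}

inline void rehash_policy_example() {
    // A partition_entry starts with 16 buckets and rehashes at 12 cells.
    assert(compute_rehash_at_size(16) == 12);
    // Small tables double.
    assert(next_bucket_count(16, 1024) == 32);
    // Once doubling would add more than the cap, growth becomes linear.
    assert(next_bucket_count(2048, 1024) == 3072);
    // The cell_locker starts with 4 * 1024 buckets and rehashes at 3072 partitions.
    assert(compute_rehash_at_size(4 * 1024) == 3072);
}
```

Capping the increment keeps a very large table from doubling its memory footprint in a single step, at the cost of more frequent rehashes once the cap is reached.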

									
										151

checked-file-impl.hh
									
										Normal file
									
				@@ -0,0 +1,151 @@

				/*

				 * Copyright (C) 2016 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "seastar/core/file.hh"

				#include "disk-error-handler.hh"

				class checked_file_impl : public file_impl {

				public:

				    checked_file_impl(const io_error_handler& error_handler, file f)

				            : _error_handler(error_handler), _file(f) {

				        _memory_dma_alignment = f.memory_dma_alignment();

				        _disk_read_dma_alignment = f.disk_read_dma_alignment();

				        _disk_write_dma_alignment = f.disk_write_dma_alignment();

				    }

				    virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->write_dma(pos, buffer, len, pc);

				        });

				    }

				    virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->write_dma(pos, iov, pc);

				        });

				    }

				    virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->read_dma(pos, buffer, len, pc);

				        });

				    }

				    virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->read_dma(pos, iov, pc);

				        });

				    }

				    virtual future<> flush(void) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->flush();

				        });

				    }

				    virtual future<struct stat> stat(void) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->stat();

				        });

				    }

				    virtual future<> truncate(uint64_t length) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->truncate(length);

				        });

				    }

				    virtual future<> discard(uint64_t offset, uint64_t length) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->discard(offset, length);

				        });

				    }

				    virtual future<> allocate(uint64_t position, uint64_t length) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->allocate(position, length);

				        });

				    }

				    virtual future<uint64_t> size(void) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->size();

				        });

				    }

				    virtual future<> close() override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->close();

				        });

				    }

				    // Returns a handle for the plain (unchecked) file, so make_checked_file()

				    // should be called on any file obtained from that handle.

				    virtual std::unique_ptr<seastar::file_handle_impl> dup() override {

				        return get_file_impl(_file)->dup();

				    }

				    virtual subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->list_directory(next);

				        });

				    }

				    virtual future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override {

				        return do_io_check(_error_handler, [&] {

				            return get_file_impl(_file)->dma_read_bulk(offset, range_size, pc);

				        });

				    }

				private:

				    const io_error_handler& _error_handler;

				    file _file;

				};

				inline file make_checked_file(const io_error_handler& error_handler, file f)

				{

				    return file(::make_shared<checked_file_impl>(error_handler, f));

				}

				future<file>

				inline open_checked_file_dma(const io_error_handler& error_handler,

				                             sstring name, open_flags flags,

				                             file_open_options options = {})

				{

				    return do_io_check(error_handler, [&] {

				        return open_file_dma(name, flags, options).then([&] (file f) {

				            return make_ready_future<file>(make_checked_file(error_handler, f));

				        });

				    });

				}

				future<file>

				inline open_checked_directory(const io_error_handler& error_handler,

				                              sstring name)

				{

				    return do_io_check(error_handler, [&] {

				        return engine().open_directory(name).then([&] (file f) {

				            return make_ready_future<file>(make_checked_file(error_handler, f));

				        });

				    });

				}
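Every method of checked_file_impl forwards to the wrapped file through do_io_check(), which is declared in disk-error-handler.hh and not shown in this diff. The sketch below is a synchronous analogue of that wrapping pattern, under the assumption that the helper reports a failure to the handler and then lets the original exception propagate. The names checked_call and error_handler_fn are hypothetical, and no Seastar futures are involved.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for io_error_handler.
using error_handler_fn = std::function<void(std::exception_ptr)>;

// Runs func; on failure, notifies the handler and rethrows the same exception.
template <typename Func>
auto checked_call(const error_handler_fn& handler, Func&& func) -> decltype(func()) {
    try {
        return std::forward<Func>(func)();
    } catch (...) {
        handler(std::current_exception());
        throw;
    }
}

inline int checked_call_example() {
    int reported = 0;
    error_handler_fn count_errors = [&reported](std::exception_ptr) { ++reported; };

    // A successful call passes its result through and reports nothing.
    int ok = checked_call(count_errors, [] { return 42; });
    assert(ok == 42 && reported == 0);

    // A failing call is reported once and still reaches the caller.
    try {
        checked_call(count_errors, []() -> int { throw std::runtime_error("disk error"); });
    } catch (const std::runtime_error&) {
    }
    assert(reported == 1);
    return reported;
}
```

Centralising the check in one wrapper is what lets each override in the class stay a three-line forwarder.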

									
										4

gc_clock.cc → clocks-impl.cc
									
				@@ -1,5 +1,5 @@

				/*

				- * Copyright 2015 Cloudius Systems

				+ * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -19,6 +19,6 @@

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				-#include "gc_clock.hh"

				+#include "clocks-impl.hh"

				std::atomic<int64_t> clocks_offset;

									
										49

clocks-impl.hh
									
										Normal file
									
				@@ -0,0 +1,49 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include <algorithm>

				#include <atomic>

				#include <chrono>

				#include <cstdint>

				extern std::atomic<int64_t> clocks_offset;

				template<typename Duration>

				static inline void forward_jump_clocks(Duration delta)

				{

				    auto d = std::chrono::duration_cast<std::chrono::seconds>(delta).count();

				    clocks_offset.fetch_add(d, std::memory_order_relaxed);

				}

				static inline std::chrono::seconds get_clocks_offset()

				{

				    auto off = clocks_offset.load(std::memory_order_relaxed);

				    return std::chrono::seconds(off);

				}

				// Returns a time point which is earlier than t by d, or the minimum time point if the result cannot be represented.

				template<typename Clock, typename Duration, typename Rep, typename Period>

				inline

				auto saturating_subtract(std::chrono::time_point<Clock, Duration> t, std::chrono::duration<Rep, Period> d) -> decltype(t) {

				    return std::max(t, decltype(t)::min() + d) - d;

				}
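saturating_subtract() avoids underflow by first clamping t up to min() + d, so the subtraction can never go below the minimum representable time point. A small standalone sketch of that behaviour follows. The helper is copied in the same shape as in the header, while test_clock, seconds_point and the example function are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Same shape as the helper in clocks-impl.hh.
template<typename Clock, typename Duration, typename Rep, typename Period>
auto saturating_subtract(std::chrono::time_point<Clock, Duration> t,
                         std::chrono::duration<Rep, Period> d) -> decltype(t) {
    return std::max(t, decltype(t)::min() + d) - d;
}

// Hypothetical aliases: a time point with one-second resolution.
using test_clock = std::chrono::system_clock;
using seconds_point = std::chrono::time_point<test_clock, std::chrono::seconds>;

inline void saturating_subtract_example() {
    using std::chrono::seconds;

    // Ordinary case: plain subtraction.
    seconds_point t{seconds(100)};
    assert(saturating_subtract(t, seconds(30)) == seconds_point{seconds(70)});

    // Near the lower limit the result saturates at min() instead of wrapping.
    seconds_point near_min = seconds_point::min() + seconds(5);
    assert(saturating_subtract(near_min, seconds(30)) == seconds_point::min());
}
```

Note that std::max needs both arguments to have the same type, so in this form d should use the same duration type as the time point (or one that converts to it without changing the common type).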

									
										156

clustering_bounds_comparator.hh
									
										Normal file
									
				@@ -0,0 +1,156 @@

				/*

				 * Copyright (C) 2016 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "keys.hh"

				#include "schema.hh"

				#include "range.hh"

				/**

				 * Represents the kind of bound in a range tombstone.

				 */

				enum class bound_kind : uint8_t {

				    excl_end = 0,

				    incl_start = 1,

				    // values 2 to 5 are reserved for forward Origin compatibility

				    incl_end = 6,

				    excl_start = 7,

				};

				std::ostream& operator<<(std::ostream& out, const bound_kind k);

				bound_kind invert_kind(bound_kind k);

				int32_t weight(bound_kind k);

				class bound_view {

				public:

				    const static thread_local clustering_key empty_prefix;

				    const clustering_key_prefix& prefix;

				    bound_kind kind;

				    bound_view(const clustering_key_prefix& prefix, bound_kind kind)

				        : prefix(prefix)

				        , kind(kind)

				    { }

				    bound_view(const bound_view& other) noexcept = default;

				    bound_view& operator=(const bound_view& other) noexcept {

				        if (this != &other) {

				            this->~bound_view();

				            new (this) bound_view(other);

				        }

				        return *this;

				    }

				    struct tri_compare {

				        // To make it assignable and to avoid taking a schema_ptr, we

				        // wrap the schema reference.

				        std::reference_wrapper<const schema> _s;

				        tri_compare(const schema& s) : _s(s)

				        { }

				        int operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {

				            auto type = _s.get().clustering_key_prefix_type();

				            auto res = prefix_equality_tri_compare(type->types().begin(),

				                type->begin(p1), type->end(p1),

				                type->begin(p2), type->end(p2),

				                ::tri_compare);

				            if (res) {

				                return res;

				            }

				            auto d1 = p1.size(_s);

				            auto d2 = p2.size(_s);

				            if (d1 == d2) {

				                return w1 - w2;

				            }

				            return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));

				        }

				        int operator()(const bound_view b, const clustering_key_prefix& p) const {

				            return operator()(b.prefix, weight(b.kind), p, 0);

				        }

				        int operator()(const clustering_key_prefix& p, const bound_view b) const {

				            return operator()(p, 0, b.prefix, weight(b.kind));

				        }

				        int operator()(const bound_view b1, const bound_view b2) const {

				            return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));

				        }

				    };

				    struct compare {

				        // Delegates to tri_compare, which wraps the schema reference and

				        // therefore keeps this comparator assignable.

				        tri_compare _cmp;

				        compare(const schema& s) : _cmp(s)

				        { }

				        bool operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {

				            return _cmp(p1, w1, p2, w2) < 0;

				        }

				        bool operator()(const bound_view b, const clustering_key_prefix& p) const {

				            return operator()(b.prefix, weight(b.kind), p, 0);

				        }

				        bool operator()(const clustering_key_prefix& p, const bound_view b) const {

				            return operator()(p, 0, b.prefix, weight(b.kind));

				        }

				        bool operator()(const bound_view b1, const bound_view b2) const {

				            return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));

				        }

				    };

				    bool equal(const schema& s, const bound_view other) const {

				        return kind == other.kind && prefix.equal(s, other.prefix);

				    }

				    bool adjacent(const schema& s, const bound_view other) const {

				        return invert_kind(other.kind) == kind && prefix.equal(s, other.prefix);

				    }

				    static bound_view bottom() {

				        return {empty_prefix, bound_kind::incl_start};

				    }

				    static bound_view top() {

				        return {empty_prefix, bound_kind::incl_end};

				    }

				    template<template<typename> typename R>

				    GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )

				    static bound_view from_range_start(const R<clustering_key_prefix>& range) {

				        return range.start()

				               ? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start)

				               : bottom();

				    }

				    template<template<typename> typename R>

				    GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )

				    static bound_view from_range_end(const R<clustering_key_prefix>& range) {

				        return range.end()

				               ? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end)

				               : top();

				    }

				    template<template<typename> typename R>

				    GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )

				    static std::pair<bound_view, bound_view> from_range(const R<clustering_key_prefix>& range) {

				        return {from_range_start(range), from_range_end(range)};

				    }

				    template<template<typename> typename R>

				    GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )

				    static stdx::optional<typename R<clustering_key_prefix_view>::bound> to_range_bound(const bound_view& bv) {

				        if (&bv.prefix == &empty_prefix) {

				            return {};

				        }

				        bool inclusive = bv.kind != bound_kind::excl_end && bv.kind != bound_kind::excl_start;

				        return {typename R<clustering_key_prefix_view>::bound(bv.prefix.view(), inclusive)};

				    }

				    friend std::ostream& operator<<(std::ostream& out, const bound_view& b) {

				        return out << "{bound: prefix=" << b.prefix << ", kind=" << b.kind << "}";

				    }

				};
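The weighted comparison in `bound_view::tri_compare` can be tried out in isolation. The following is a minimal sketch, assuming integer components in place of real clustering key types. `prefix`, `common_prefix_tri_compare` and `bound_tri_compare` are hypothetical names, and the weights used here (-1 for a position just before a prefix, 0 for a key, +1 for a position just after) are illustrative, not necessarily the values returned by `weight()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for clustering_key_prefix: a list of int components.
using prefix = std::vector<int>;

// Compares only the components both prefixes have in common.
// Returns 0 when one prefix is a (possibly equal) prefix of the other.
inline int common_prefix_tri_compare(const prefix& a, const prefix& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;
        }
    }
    return 0;
}

// Same decision structure as the four-argument tri_compare::operator() above.
inline int bound_tri_compare(const prefix& p1, std::int32_t w1,
                             const prefix& p2, std::int32_t w2) {
    if (int res = common_prefix_tri_compare(p1, p2)) {
        return res;
    }
    if (p1.size() == p2.size()) {
        return w1 - w2;  // same prefix: the weights alone decide
    }
    // The shorter prefix covers every key it is a prefix of: with a
    // non-positive weight it sorts before all of them, otherwise after.
    return p1.size() < p2.size() ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}

// Small self-check: where does the key {1, 5} fall relative to bounds on {1}?
inline void check_bound_ordering() {
    assert(bound_tri_compare({1}, -1, {1, 5}, 0) < 0);  // start bound sorts before the key
    assert(bound_tri_compare({1}, +1, {1, 5}, 0) > 0);  // end bound sorts after the key
}
```

Under these assumptions a bound on the one-component prefix `{1}` brackets every longer key that starts with `1`, which is what lets a single comparator order keys and bounds together.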

									
										68

clustering_key_filter.hh
									
										Normal file
									
												
				@@ -0,0 +1,68 @@

				/*

				 * Copyright (C) 2016 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "schema.hh"

				#include "query-request.hh"

				namespace query {

				class clustering_key_filter_ranges {

				    clustering_row_ranges _storage;

				    const clustering_row_ranges& _ref;

				public:

				    clustering_key_filter_ranges(const clustering_row_ranges& ranges) : _ref(ranges) { }

				    struct reversed { };

				    clustering_key_filter_ranges(reversed, const clustering_row_ranges& ranges)

				        : _storage(ranges.rbegin(), ranges.rend()), _ref(_storage) { }

				    clustering_key_filter_ranges(clustering_key_filter_ranges&& other) noexcept

				        : _storage(std::move(other._storage))

				        , _ref(&other._ref == &other._storage ? _storage : other._ref)

				    { }

				    clustering_key_filter_ranges& operator=(clustering_key_filter_ranges&& other) noexcept {

				        if (this != &other) {

				            this->~clustering_key_filter_ranges();

				            new (this) clustering_key_filter_ranges(std::move(other));

				        }

				        return *this;

				    }

				    auto begin() const { return _ref.begin(); }

				    auto end() const { return _ref.end(); }

				    bool empty() const { return _ref.empty(); }

				    size_t size() const { return _ref.size(); }

				    const clustering_row_ranges& ranges() const { return _ref; }

				    static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {

				        const query::clustering_row_ranges& ranges = slice.row_ranges(schema, key);

				        if (slice.options.contains(query::partition_slice::option::reversed)) {

				            return clustering_key_filter_ranges(clustering_key_filter_ranges::reversed{}, ranges);

				        }

				        return clustering_key_filter_ranges(ranges);

				    }

				};

				}
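The move constructor of `clustering_key_filter_ranges` has to handle a reference member that may point at the object's own `_storage`. The sketch below reproduces that pattern with a plain `std::vector<int>` standing in for `clustering_row_ranges`; `filter_ranges`, `owns()` and `get()` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for query::clustering_row_ranges.
using ranges = std::vector<int>;

class filter_ranges {
    ranges _storage;      // filled only when a reversed copy is owned
    const ranges& _ref;   // refers either to _storage or to the caller's ranges
public:
    explicit filter_ranges(const ranges& r) : _ref(r) {}

    struct reversed {};
    filter_ranges(reversed, const ranges& r)
        : _storage(r.rbegin(), r.rend()), _ref(_storage) {}

    // If the source referred to its own storage, the new object must refer
    // to its own storage too; otherwise it keeps borrowing the same ranges.
    filter_ranges(filter_ranges&& o) noexcept
        : _storage(std::move(o._storage))
        , _ref(&o._ref == &o._storage ? _storage : o._ref) {}

    bool owns() const { return &_ref == &_storage; }
    const ranges& get() const { return _ref; }
};

// Moving an owning object must not leave the result pointing into the source.
inline void check_move_keeps_ownership() {
    ranges r{1, 2, 3};
    filter_ranges a(filter_ranges::reversed{}, r);
    filter_ranges b(std::move(a));
    assert(b.owns());
}
```

Without the address check, a moved-to object built from a reversed copy would keep referring to the moved-from object's `_storage`, which is empty after the move and may be destroyed before the new object is.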

									
										219

clustering_ranges_walker.hh
									
										Normal file
									
												
				@@ -0,0 +1,219 @@

				/*

				 * Copyright (C) 2017 ScyllaDB

				 *

				 * Modified by ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "schema.hh"

				#include "query-request.hh"

				#include "streamed_mutation.hh"

				// Utility for in-order checking of overlap with position ranges.

				class clustering_ranges_walker {

				    const schema& _schema;

				    const query::clustering_row_ranges& _ranges;

				    query::clustering_row_ranges::const_iterator _current;

				    query::clustering_row_ranges::const_iterator _end;

				    bool _in_current; // next position is known to be >= _current_start

				    bool _with_static_row;

				    position_in_partition_view _current_start;

				    position_in_partition_view _current_end;

				    stdx::optional<position_in_partition> _trim;

				    size_t _change_counter = 1;

				private:

				    bool advance_to_next_range() {

				        _in_current = false;

				        if (!_current_start.is_static_row()) {

				            if (_current == _end) {

				                return false;

				            }

				            ++_current;

				        }

				        ++_change_counter;

				        if (_current == _end) {

				            _current_end = _current_start = position_in_partition_view::after_all_clustered_rows();

				            return false;

				        }

				        _current_start = position_in_partition_view::for_range_start(*_current);

				        _current_end = position_in_partition_view::for_range_end(*_current);

				        return true;

				    }

				public:

				    clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)

				        : _schema(s)

				        , _ranges(ranges)

				        , _current(ranges.begin())

				        , _end(ranges.end())

				        , _in_current(with_static_row)

				        , _with_static_row(with_static_row)

				        , _current_start(position_in_partition_view::for_static_row())

				        , _current_end(position_in_partition_view::before_all_clustered_rows())

				    {

				        if (!with_static_row) {

				            if (_current == _end) {

				                _current_start = position_in_partition_view::before_all_clustered_rows();

				            } else {

				                _current_start = position_in_partition_view::for_range_start(*_current);

				                _current_end = position_in_partition_view::for_range_end(*_current);

				            }

				        }

				    }

				    clustering_ranges_walker(clustering_ranges_walker&& o) noexcept

				        : _schema(o._schema)

				        , _ranges(o._ranges)

				        , _current(o._current)

				        , _end(o._end)

				        , _in_current(o._in_current)

				        , _with_static_row(o._with_static_row)

				        , _current_start(o._current_start)

				        , _current_end(o._current_end)

				        , _trim(std::move(o._trim))

				        , _change_counter(o._change_counter)

				    { }

				    clustering_ranges_walker& operator=(clustering_ranges_walker&& o) {

				        if (this != &o) {

				            this->~clustering_ranges_walker();

				            new (this) clustering_ranges_walker(std::move(o));

				        }

				        return *this;

				    }

				    // Excludes positions smaller than pos from the ranges.

				    // pos should be monotonic.

				    // No constraints between pos and positions passed to advance_to().

				    //

				    // After the invocation, when !out_of_range(), lower_bound() returns the smallest position still contained.

				    void trim_front(position_in_partition pos) {

				        position_in_partition::less_compare less(_schema);

				        do {

				            if (!less(_current_start, pos)) {

				                break;

				            }

				            if (less(pos, _current_end)) {

				                _trim = std::move(pos);

				                _current_start = *_trim;

				                _in_current = false;

				                ++_change_counter;

				                break;

				            }

				        } while (advance_to_next_range());

				    }

				    // Returns true if given position is contained.

				    // Must be called with monotonic positions.

				    // Idempotent.

				    bool advance_to(position_in_partition_view pos) {

				        position_in_partition::less_compare less(_schema);

				        do {

				            if (!_in_current && less(pos, _current_start)) {

				                break;

				            }

				            // All subsequent clustering keys are larger than the start of this

				            // range so there is no need to check that again.

				            _in_current = true;

				            if (less(pos, _current_end)) {

				                return true;

				            }

				        } while (advance_to_next_range());

				        return false;

				    }

				    // Returns true if the range expressed by start and end (as in position_range) overlaps

				    // with clustering ranges.

				    // Must be called with monotonic start position. That position must also be greater than

				    // the last position passed to the other advance_to() overload.

				    // Idempotent.

				    bool advance_to(position_in_partition_view start, position_in_partition_view end) {

				        position_in_partition::less_compare less(_schema);

				        do {

				            if (!less(_current_start, end)) {

				                break;

				            }

				            if (less(start, _current_end)) {

				                return true;

				            }

				        } while (advance_to_next_range());

				        return false;

				    }

				    // Returns true if the range tombstone expressed by start and end (as in position_range) overlaps

				    // with clustering ranges.

				    // No monotonicity restrictions on argument values across calls.

				    // Does not affect lower_bound().

				    // Idempotent.

				    bool contains_tombstone(position_in_partition_view start, position_in_partition_view end) const {

				        position_in_partition::less_compare less(_schema);

				        if (_trim && !less(*_trim, end)) {

				            return false;

				        }

				        auto i = _current;

				        while (i != _end) {

				            auto range_start = position_in_partition_view::for_range_start(*i);

				            if (!less(range_start, end)) {

				                return false;

				            }

				            auto range_end = position_in_partition_view::for_range_end(*i);

				            if (less(start, range_end)) {

				                return true;

				            }

				            ++i;

				        }

				        return false;

				    }

				    // Returns true if advanced past all contained positions. Any later advance_to() until reset() will return false.

				    bool out_of_range() const {

				        return !_in_current && _current == _end;

				    }

				    // Resets the state of the walker so that advance_to() can now be called for a new sequence of positions.

				    // Any range trimmings still hold after this.

				    void reset() {

				        auto trim = std::move(_trim);

				        auto ctr = _change_counter;

				        *this = clustering_ranges_walker(_schema, _ranges, _with_static_row);

				        _change_counter = ctr + 1;

				        if (trim) {

				            trim_front(std::move(*trim));

				        }

				    }

				    // Can be called only when !out_of_range()

				    position_in_partition_view lower_bound() const {

				        return _current_start;

				    }

				    // Changes whenever lower_bound() changes.

				    // Always > 0.

				    size_t lower_bound_change_counter() const {

				        return _change_counter;

				    }

				};
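The core idea of `clustering_ranges_walker::advance_to()` is that positions arrive in non-decreasing order, so ranges that have been passed never need to be examined again. A minimal sketch of that idea follows, assuming sorted, non-overlapping half-open integer ranges. `ranges_walker` is a hypothetical simplification that leaves out the static row, trimming and the change counter.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open [first, second) ranges, assumed sorted and non-overlapping.
using range = std::pair<int, int>;

class ranges_walker {
    const std::vector<range>& _ranges;
    std::size_t _current = 0;
public:
    explicit ranges_walker(const std::vector<range>& r) : _ranges(r) {}

    // Returns true if pos lies inside some range.
    // pos must not decrease between calls.
    bool advance_to(int pos) {
        while (_current < _ranges.size()) {
            if (pos < _ranges[_current].first) {
                return false;  // pos falls in the gap before the current range
            }
            if (pos < _ranges[_current].second) {
                return true;
            }
            ++_current;  // pos is past this range, so later positions are too
        }
        return false;
    }

    // True once every range has been passed.
    bool out_of_range() const { return _current == _ranges.size(); }
};

// Walks two ranges with increasing positions.
inline void check_walker() {
    std::vector<range> rs{{1, 3}, {5, 8}};
    ranges_walker w(rs);
    assert(w.advance_to(2));
    assert(!w.advance_to(4));
    assert(w.advance_to(5));
}
```

Because the cursor only moves forward, checking a whole sorted stream of positions costs time proportional to the number of positions plus the number of ranges.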

									
										3

coding-style.md
									
										Normal file
									
												
				@@ -0,0 +1,3 @@

				# Scylla Coding Style

				Please see the [Seastar style document](https://github.com/scylladb/seastar/blob/master/coding-style.md).

									
										2

combine.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 Cloudius Systems, Ltd.

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

									
										39

compaction_strategy.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -21,7 +21,12 @@

				#pragma once

				#include "sstables/shared_sstable.hh"

				#include "exceptions/exceptions.hh"

				class column_family;

				class schema;

				using schema_ptr = lw_shared_ptr<const schema>;

				namespace sstables {

				@@ -30,12 +35,15 @@ enum class compaction_strategy_type {

				    major,

				    size_tiered,

				    leveled,

				    // FIXME: Add support to DateTiered.

				    date_tiered,

				    time_window,

				};

				class compaction_strategy_impl;

				class sstable;

				class sstable_set;

				struct compaction_descriptor;

				struct resharding_descriptor;

				class compaction_strategy {

				    ::shared_ptr<compaction_strategy_impl> _compaction_strategy_impl;

				@@ -49,7 +57,22 @@ public:

				    compaction_strategy& operator=(compaction_strategy&&);

				    // Return a list of sstables to be compacted after applying the strategy.

				    compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<lw_shared_ptr<sstable>> candidates);

				    compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<shared_sstable> candidates);

				    std::vector<resharding_descriptor> get_resharding_jobs(column_family& cf, std::vector<shared_sstable> candidates);

				    // Some strategies may look at the compacted and resulting sstables to

				    // get some useful information for subsequent compactions.

				    void notify_completion(const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added);

				    // Return whether parallel compaction is allowed by the strategy.

				    bool parallel_compaction() const;

				    // Return whether the optimization that rules out sstables based on the clustering key filter should be applied.

				    bool use_clustering_key_filter() const;

				    // An estimate of the number of compactions needed for the strategy to be satisfied.

				    int64_t estimated_pending_compactions(column_family& cf) const;

				    static sstring name(compaction_strategy_type type) {

				        switch (type) {

				@@ -61,6 +84,10 @@ public:

				            return "SizeTieredCompactionStrategy";

				        case compaction_strategy_type::leveled:

				            return "LeveledCompactionStrategy";

				        case compaction_strategy_type::date_tiered:

				            return "DateTieredCompactionStrategy";

				        case compaction_strategy_type::time_window:

				            return "TimeWindowCompactionStrategy";

				        default:

				            throw std::runtime_error("Invalid Compaction Strategy");

				        }

				@@ -77,6 +104,10 @@ public:

				            return compaction_strategy_type::size_tiered;

				        } else if (short_name == "LeveledCompactionStrategy") {

				            return compaction_strategy_type::leveled;

				        } else if (short_name == "DateTieredCompactionStrategy") {

				            return compaction_strategy_type::date_tiered;

				        } else if (short_name == "TimeWindowCompactionStrategy") {

				            return compaction_strategy_type::time_window;

				        } else {

				            throw exceptions::configuration_exception(sprint("Unable to find compaction strategy class '%s'", name));

				        }

				@@ -87,6 +118,8 @@ public:

				    sstring name() const {

				        return name(type());

				    }

				    sstable_set make_sstable_set(schema_ptr schema) const;

				};

				// Creates a compaction_strategy object from one of the strategies available.
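The two switch/if chains in this hunk map between enum values and class names. A minimal round-trip sketch is shown below, assuming only the four strategies whose names appear in the diff. `strategy_type`, `strategy_name` and `strategy_from_name` are hypothetical names, and `std::string` and `std::invalid_argument` stand in for `sstring` and `exceptions::configuration_exception`.

```cpp
#include <cassert>
#include <initializer_list>
#include <stdexcept>
#include <string>

// Hypothetical subset of compaction_strategy_type.
enum class strategy_type { size_tiered, leveled, date_tiered, time_window };

inline std::string strategy_name(strategy_type t) {
    switch (t) {
    case strategy_type::size_tiered: return "SizeTieredCompactionStrategy";
    case strategy_type::leveled:     return "LeveledCompactionStrategy";
    case strategy_type::date_tiered: return "DateTieredCompactionStrategy";
    case strategy_type::time_window: return "TimeWindowCompactionStrategy";
    }
    throw std::runtime_error("Invalid Compaction Strategy");
}

// Reverse lookup built on strategy_name(), so the two directions cannot drift apart.
inline strategy_type strategy_from_name(const std::string& name) {
    for (auto t : {strategy_type::size_tiered, strategy_type::leveled,
                   strategy_type::date_tiered, strategy_type::time_window}) {
        if (strategy_name(t) == name) {
            return t;
        }
    }
    throw std::invalid_argument("Unable to find compaction strategy class '" + name + "'");
}

// Every known type survives a name round trip.
inline void check_round_trip() {
    assert(strategy_from_name(strategy_name(strategy_type::time_window)) == strategy_type::time_window);
}
```

Deriving the reverse lookup from the forward one is a design choice of this sketch, not of the header: a newly added enumerator then needs only one new `case`.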

									
										67

compatible_ring_position.hh
									
										Normal file
									
												
				@@ -0,0 +1,67 @@

				/*

				 * Copyright (C) 2016 ScyllaDB

				 */

				/*

				 * This file is part of Scylla.

				 *

				 * Scylla is free software: you can redistribute it and/or modify

				 * it under the terms of the GNU Affero General Public License as published by

				 * the Free Software Foundation, either version 3 of the License, or

				 * (at your option) any later version.

				 *

				 * Scylla is distributed in the hope that it will be useful,

				 * but WITHOUT ANY WARRANTY; without even the implied warranty of

				 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

				 * GNU General Public License for more details.

				 *

				 * You should have received a copy of the GNU General Public License

				 * along with Scylla.  If not, see <http://www.gnu.org/licenses/>.

				 */

				#pragma once

				#include "query-request.hh"

				#include <experimental/optional>

				// Wraps ring_position so it is compatible with old-style C++: default constructor,

				// stateless comparators, yada yada

				class compatible_ring_position {

				    const schema* _schema = nullptr;

				    // optional to supply a default constructor, no more

				    std::experimental::optional<dht::ring_position> _rp;

				public:

				    compatible_ring_position() noexcept = default;

				    compatible_ring_position(const schema& s, const dht::ring_position& rp)

				            : _schema(&s), _rp(rp) {

				    }

				    compatible_ring_position(const schema& s, dht::ring_position&& rp)

				            : _schema(&s), _rp(std::move(rp)) {

				    }

				    const dht::token& token() const {

				        return _rp->token();

				    }

				    friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return x._rp->tri_compare(*x._schema, *y._rp);

				    }

				    friend bool operator<(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return tri_compare(x, y) < 0;

				    }

				    friend bool operator<=(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return tri_compare(x, y) <= 0;

				    }

				    friend bool operator>(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return tri_compare(x, y) > 0;

				    }

				    friend bool operator>=(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return tri_compare(x, y) >= 0;

				    }

				    friend bool operator==(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return tri_compare(x, y) == 0;

				    }

				    friend bool operator!=(const compatible_ring_position& x, const compatible_ring_position& y) {

				        return tri_compare(x, y) != 0;

				    }

				};
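The wrapper above exists so that a type without a default constructor can be used where one is required, with all relational operators derived from a single three-way comparison. A reduced sketch of the same shape follows, assuming C++17 `std::optional` instead of `std::experimental::optional` and an integer token. `position` and `compatible_position` are hypothetical stand-ins, and the schema pointer is omitted.

```cpp
#include <cassert>
#include <optional>
#include <set>

// Hypothetical stand-in for dht::ring_position: no default constructor.
struct position {
    int token;
    explicit position(int t) : token(t) {}
};

class compatible_position {
    std::optional<position> _p;  // optional only to allow a default constructor
public:
    compatible_position() noexcept = default;
    explicit compatible_position(position p) : _p(p) {}

    int token() const { return _p->token; }

    // Both operands must hold a value, as in the original wrapper.
    friend int tri_compare(const compatible_position& x, const compatible_position& y) {
        return (x._p->token > y._p->token) - (x._p->token < y._p->token);
    }
    friend bool operator<(const compatible_position& x, const compatible_position& y) {
        return tri_compare(x, y) < 0;
    }
    friend bool operator==(const compatible_position& x, const compatible_position& y) {
        return tri_compare(x, y) == 0;
    }
};

// The wrapper works as a key of an ordered container with the default comparator.
inline void check_set_ordering() {
    std::set<compatible_position> s;
    s.insert(compatible_position(position(3)));
    s.insert(compatible_position(position(1)));
    assert(s.begin()->token() == 1);
}
```

As in the original, comparing a default-constructed instance dereferences an empty optional, so default construction is only safe for slots that get assigned before they are compared.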

									
										11

compound.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright (C) 2015 Cloudius Systems, Ltd.

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -22,7 +22,7 @@

				#pragma once

				#include "types.hh"

				#include <iostream>

				#include <iosfwd>

				#include <algorithm>

				#include <vector>

				#include <boost/range/iterator_range.hpp>

				@@ -130,10 +130,10 @@ public:

				    bytes decompose_value(const value_type& values) {

				        return serialize_value(values);

				    }

				    class iterator : public std::iterator<std::input_iterator_tag, bytes_view> {

				    class iterator : public std::iterator<std::input_iterator_tag, const bytes_view> {

				    private:

				        bytes_view _v;

				        value_type _current;

				        bytes_view _current;

				    private:

				        void read_current() {

				            size_type len;

				@@ -220,6 +220,9 @@ public:

				        assert(AllowPrefixes == allow_prefixes::yes);

				        return std::distance(begin(v), end(v)) == (ssize_t)_types.size();

				    }

				    bool is_empty(bytes_view v) const {

				        return begin(v) == end(v);

				    }

				    void validate(bytes_view v) {

				        // FIXME: implement

				        warn(unimplemented::cause::VALIDATION);

									
										424

compound_compat.hh
									
												
				@@ -1,5 +1,5 @@

				/*

				 * Copyright 2015 Cloudius Systems

				 * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -21,7 +21,10 @@

				#pragma once

				#include <boost/range/algorithm/copy.hpp>

				#include <boost/range/adaptor/transformed.hpp>

				#include "compound.hh"

				#include "schema.hh"

				//

				// This header provides adaptors between the representation used by our compound_type<>

				@@ -180,3 +183,422 @@ bytes to_legacy(CompoundType& type, bytes_view packed) {

				    std::copy(lv.begin(), lv.end(), legacy_form.begin());

				    return legacy_form;

				}
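The `composite` class that follows serializes each component as `<length> <value> <EOC>`. The sketch below illustrates that layout on plain strings. `encode_composite` and `decode_composite` are hypothetical helpers, the big-endian length is an assumption that the diff does not state, and the static-column prefix and the tighter limit on the first component are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Encodes each component as <length:uint16><value><EOC:int8>.
// Every EOC is 0 except the last one, which is set to last_eoc.
inline std::string encode_composite(const std::vector<std::string>& components,
                                    std::int8_t last_eoc = 0) {
    std::string out;
    for (const auto& c : components) {
        if (c.size() > 0xFFFF) {
            throw std::runtime_error("Component size too large");
        }
        out.push_back(static_cast<char>(c.size() >> 8));    // high length byte (assumed big-endian)
        out.push_back(static_cast<char>(c.size() & 0xFF));  // low length byte
        out += c;
        out.push_back('\0');                                 // EOC placeholder
    }
    if (!out.empty()) {
        out.back() = static_cast<char>(last_eoc);
    }
    return out;
}

// Splits an encoded composite back into its component values, dropping the EOC bytes.
inline std::vector<std::string> decode_composite(const std::string& in) {
    std::vector<std::string> result;
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in.size() - pos < 2) {
            throw std::runtime_error("truncated length");
        }
        const std::size_t len = (static_cast<unsigned char>(in[pos]) << 8)
                              | static_cast<unsigned char>(in[pos + 1]);
        pos += 2;
        if (in.size() - pos < len + 1) {
            throw std::runtime_error("truncated component");
        }
        result.emplace_back(in, pos, len);
        pos += len + 1;  // skip the value and its EOC byte
    }
    return result;
}

// "foo" takes 2 + 3 + 1 bytes on the wire.
inline void check_single_component() {
    assert(encode_composite({std::string("foo")}).size() == 6);
}
```

Overwriting only the final byte mirrors how the public `serialize_value` applies its `marker` argument after all components have been written with a zero EOC.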

				class composite_view;

				// Represents a value serialized according to Origin's CompositeType.

				// If is_compound is true, then the value is one or more components encoded as:

				//

				//   <representation> ::= ( <component> )+

				//   <component>      ::= <length> <value> <EOC>

				//   <length>         ::= <uint16_t>

				//   <EOC>            ::= <uint8_t>

				//

				// If false, then it encodes a single value, without a prefix length or a suffix EOC.

				class composite final {

				    bytes _bytes;

				    bool _is_compound;

				public:

				    composite(bytes&& b, bool is_compound)

				            : _bytes(std::move(b))

				            , _is_compound(is_compound)

				    { }

				    explicit composite(bytes&& b)

				            : _bytes(std::move(b))

				            , _is_compound(true)

				    { }

				    composite()

				            : _bytes()

				            , _is_compound(true)

				    { }

				    using size_type = uint16_t;

				    using eoc_type = int8_t;

				    /*

				     * The 'end-of-component' byte should always be 0 for an actual column name.

				     * However, it can be set to 1 for query bounds. This allows querying for the

				     * equivalent of 'give me the full range'. That is, if a slice query is:

				     *   start = <3><"foo".getBytes()><0>

				     *   end   = <3><"foo".getBytes()><1>

				     * then we'll return *all* the columns whose first component is "foo".

				     * If the 'end-of-component' of a component is != 0, no component may

				     * follow it. The end-of-component can also be -1 to allow a

				     * non-inclusive query. For instance:

				     *   end = <3><"foo".getBytes()><-1>

				     * allows querying everything that is smaller than <3><"foo".getBytes()>, but

				     * not <3><"foo".getBytes()> itself.

				     */

				    enum class eoc : eoc_type {

				        start = -1,

				        none = 0,

				        end = 1

				    };

				    using component = std::pair<bytes, eoc>;

				    using component_view = std::pair<bytes_view, eoc>;

				private:

				    // std::decay_t strips cv-qualifiers, so compare against plain data_value.
				    template<typename Value, typename = std::enable_if_t<!std::is_same<data_value, std::decay_t<Value>>::value>>

				    static size_t size(const Value& val) {

				        return val.size();

				    }

				    static size_t size(const data_value& val) {

				        return val.serialized_size();

				    }

				    template<typename Value, typename = std::enable_if_t<!std::is_same<data_value, std::decay_t<Value>>::value>>

				    static void write_value(Value&& val, bytes::iterator& out) {

				        out = std::copy(val.begin(), val.end(), out);

				    }

				    static void write_value(const data_value& val, bytes::iterator& out) {

				        val.serialize(out);

				    }

				    template<typename RangeOfSerializedComponents>

				    static void serialize_value(RangeOfSerializedComponents&& values, bytes::iterator& out, bool is_compound) {

				        if (!is_compound) {

				            auto it = values.begin();

				            write_value(std::forward<decltype(*it)>(*it), out);

				            return;

				        }

				        for (auto&& val : values) {

				            write<size_type>(out, static_cast<size_type>(size(val)));

				            write_value(std::forward<decltype(val)>(val), out);

				            // Range tombstones are not keys. For collections, only frozen

				            // values can be keys. Therefore, for as long as it is safe to

				            // assume that this code will be used to create keys, it is safe

				            // to assume the trailing byte is always zero.

				            write<eoc_type>(out, eoc_type(eoc::none));

				        }

				    }

				    template <typename RangeOfSerializedComponents>

				    static size_t serialized_size(RangeOfSerializedComponents&& values, bool is_compound) {

				        size_t len = 0;

				        auto it = values.begin();

				        if (it != values.end()) {

				            // CQL3 uses a specific prefix (0xFFFF) to encode "static columns"

				            // (CASSANDRA-6561). This does mean the maximum size of the first component of a

				            // composite is 65534, not 65535 (or we wouldn't be able to detect if the first 2

				            // bytes is the static prefix or not).

				            auto value_size = size(*it);

				            if (value_size > static_cast<size_type>(std::numeric_limits<size_type>::max() - uint8_t(is_compound))) {

				                throw std::runtime_error(sprint("First component size too large: %d > %d", value_size, std::numeric_limits<size_type>::max() - is_compound));

				            }

				            if (!is_compound) {

				                return value_size;

				            }

				            len += sizeof(size_type) + value_size + sizeof(eoc_type);

				            ++it;

				        }

				        for ( ; it != values.end(); ++it) {

				            auto value_size = size(*it);

				            if (value_size > std::numeric_limits<size_type>::max()) {

				                throw std::runtime_error(sprint("Component size too large: %d > %d", value_size, std::numeric_limits<size_type>::max()));

				            }

				            len += sizeof(size_type) + value_size + sizeof(eoc_type);

				        }

				        return len;

				    }

				public:

				    template <typename Describer>

				    auto describe_type(Describer f) const {

				        return f(const_cast<bytes&>(_bytes));

				    }

				    // marker is ignored if !is_compound

				    template<typename RangeOfSerializedComponents>

				    static composite serialize_value(RangeOfSerializedComponents&& values, bool is_compound = true, eoc marker = eoc::none) {

				        auto size = serialized_size(values, is_compound);

				        bytes b(bytes::initialized_later(), size);

				        auto i = b.begin();

				        serialize_value(std::forward<decltype(values)>(values), i, is_compound);

				        if (is_compound && !b.empty()) {

				            b.back() = eoc_type(marker);

				        }

				        return composite(std::move(b), is_compound);

				    }

				    template<typename RangeOfSerializedComponents>

				    static composite serialize_static(const schema& s, RangeOfSerializedComponents&& values) {

				        // FIXME: Optimize

				        auto b = bytes(size_t(2), bytes::value_type(0xff));

				        std::vector<bytes_view> sv(s.clustering_key_size());

				        b += composite::serialize_value(boost::range::join(sv, std::forward<RangeOfSerializedComponents>(values)), true).release_bytes();

				        return composite(std::move(b));

				    }

				    static eoc to_eoc(int8_t eoc_byte) {

				        return eoc_byte == 0 ? eoc::none : (eoc_byte < 0 ? eoc::start : eoc::end);

				    }

				    class iterator : public std::iterator<std::input_iterator_tag, const component_view> {

				        bytes_view _v;

				        component_view _current;

				    private:

				        void read_current() {

				            size_type len;

				            {

				                if (_v.empty()) {

				                    _v = bytes_view(nullptr, 0);

				                    return;

				                }

				                len = read_simple<size_type>(_v);

				                if (_v.size() < len) {

				                    throw marshal_exception();

				                }

				            }

				            auto value = bytes_view(_v.begin(), len);

				            _v.remove_prefix(len);

				            _current = component_view(std::move(value), to_eoc(read_simple<eoc_type>(_v)));

				        }

				    public:

				        struct end_iterator_tag {};

				        iterator(const bytes_view& v, bool is_compound, bool is_static)

				                : _v(v) {

				            if (is_static) {

				                _v.remove_prefix(2);

				            }

				            if (is_compound) {

				                read_current();

				            } else {

				                _current = component_view(_v, eoc::none);

				                _v.remove_prefix(_v.size());

				            }

				        }

				        iterator(end_iterator_tag) : _v(nullptr, 0) {}

				        iterator& operator++() {

				            read_current();

				            return *this;

				        }

				        iterator operator++(int) {

				            iterator i(*this);

				            ++(*this);

				            return i;

				        }

				        const value_type& operator*() const { return _current; }

				        const value_type* operator->() const { return &_current; }

				        bool operator!=(const iterator& i) const { return _v.begin() != i._v.begin(); }

				        bool operator==(const iterator& i) const { return _v.begin() == i._v.begin(); }

				    };

				    iterator begin() const {

				        return iterator(_bytes, _is_compound, is_static());

				    }

				    iterator end() const {

				        return iterator(iterator::end_iterator_tag());

				    }

				    boost::iterator_range<iterator> components() const & {

				        return { begin(), end() };

				    }

				    auto values() const & {

				        return components() | boost::adaptors::transformed([](auto&& c) { return c.first; });

				    }

				    std::vector<component> components() const && {

				        std::vector<component> result;

				        std::transform(begin(), end(), std::back_inserter(result), [](auto&& p) {

				            return component(bytes(p.first.begin(), p.first.end()), p.second);

				        });

				        return result;

				    }

				    std::vector<bytes> values() const && {

				        std::vector<bytes> result;

				        boost::copy(components() | boost::adaptors::transformed([](auto&& c) { return to_bytes(c.first); }), std::back_inserter(result));

				        return result;

				    }

				    const bytes& get_bytes() const {

				        return _bytes;

				    }

				    bytes release_bytes() && {

				        return std::move(_bytes);

				    }

				    size_t size() const {

				        return _bytes.size();

				    }

				    bool empty() const {

				        return _bytes.empty();

				    }

				    static bool is_static(bytes_view bytes, bool is_compound) {

				        return is_compound && bytes.size() > 2 && (bytes[0] & bytes[1] & 0xff) == 0xff;

				    }

				    bool is_static() const {

				        return is_static(_bytes, _is_compound);

				    }

				    bool is_compound() const {

				        return _is_compound;

				    }

				    template <typename ClusteringElement>

				    static composite from_clustering_element(const schema& s, const ClusteringElement& ce) {

				        return serialize_value(ce.components(s), s.is_compound());

				    }

				    static composite from_exploded(const std::vector<bytes_view>& v, bool is_compound, eoc marker = eoc::none) {

				        if (v.size() == 0) {

				            return composite(bytes(size_t(1), bytes::value_type(marker)), is_compound);

				        }

				        return serialize_value(v, is_compound, marker);

				    }

				    static composite static_prefix(const schema& s) {

				        return serialize_static(s, std::vector<bytes_view>());

				    }

				    explicit operator bytes_view() const {

				        return _bytes;

				    }

				    template <typename Component>

				    friend inline std::ostream& operator<<(std::ostream& os, const std::pair<Component, eoc>& c) {

				        return os << "{value=" << c.first << "; eoc=" << sprint("0x%02x", eoc_type(c.second) & 0xff) << "}";

				    }

				    friend std::ostream& operator<<(std::ostream& os, const composite& v);

				    struct tri_compare {

				        const std::vector<data_type>& _types;

				        tri_compare(const std::vector<data_type>& types) : _types(types) {}

				        int operator()(const composite&, const composite&) const;

				        int operator()(composite_view, composite_view) const;

				    };

				};

				class composite_view final {

				    bytes_view _bytes;

				    bool _is_compound;

				public:

				    composite_view(bytes_view b, bool is_compound = true)

				            : _bytes(b)

				            , _is_compound(is_compound)

				    { }

				    composite_view(const composite& c)

				            : composite_view(static_cast<bytes_view>(c), c.is_compound())

				    { }

				    composite_view()

				            : _bytes(nullptr, 0)

				            , _is_compound(true)

				    { }

				    std::vector<bytes_view> explode() const {

				        if (!_is_compound) {

				            return { _bytes };

				        }

				        std::vector<bytes_view> ret;

				        ret.reserve(8);

				        for (auto it = begin(), e = end(); it != e; ) {

				            ret.push_back(it->first);

				            auto marker = it->second;

				            ++it;

				            if (it != e && marker != composite::eoc::none) {

				                throw runtime_exception(sprint("non-zero component divider found (%d) mid", sprint("0x%02x", composite::eoc_type(marker) & 0xff)));

				            }

				        }

				        return ret;

				    }

				    composite::iterator begin() const {

				        return composite::iterator(_bytes, _is_compound, is_static());

				    }

				    composite::iterator end() const {

				        return composite::iterator(composite::iterator::end_iterator_tag());

				    }

				    boost::iterator_range<composite::iterator> components() const {

				        return { begin(), end() };

				    }

				    composite::eoc last_eoc() const {

				        if (!_is_compound || _bytes.empty()) {

				            return composite::eoc::none;

				        }

				        bytes_view v(_bytes);

				        v.remove_prefix(v.size() - 1);

				        return composite::to_eoc(read_simple<composite::eoc_type>(v));

				    }

				    auto values() const {

				        return components() | boost::adaptors::transformed([](auto&& c) { return c.first; });

				    }

				    size_t size() const {

				        return _bytes.size();

				    }

				    bool empty() const {

				        return _bytes.empty();

				    }

				    bool is_static() const {

				        return composite::is_static(_bytes, _is_compound);

				    }

				    explicit operator bytes_view() const {

				        return _bytes;

				    }

				    bool operator==(const composite_view& k) const { return k._bytes == _bytes && k._is_compound == _is_compound; }

				    bool operator!=(const composite_view& k) const { return !(k == *this); }

				    friend inline std::ostream& operator<<(std::ostream& os, composite_view v) {

				        return os << "{" << ::join(", ", v.components()) << ", compound=" << v._is_compound << ", static=" << v.is_static() << "}";

				    }

				};

				inline

				std::ostream& operator<<(std::ostream& os, const composite& v) {

				    return os << composite_view(v);

				}

				inline

				int composite::tri_compare::operator()(const composite& v1, const composite& v2) const {

				    return (*this)(composite_view(v1), composite_view(v2));

				}

				inline

				int composite::tri_compare::operator()(composite_view v1, composite_view v2) const {

				    // See org.apache.cassandra.db.composites.AbstractCType#compare

				    if (v1.empty()) {

				        return v2.empty() ? 0 : -1;

				    }

				    if (v2.empty()) {

				        return 1;

				    }

				    if (v1.is_static() != v2.is_static()) {

				        return v1.is_static() ? -1 : 1;

				    }

				    auto a_values = v1.components();

				    auto b_values = v2.components();

				    auto cmp = [&](const data_type& t, component_view c1, component_view c2) {

				        // First by value, then by EOC

				        auto r = t->compare(c1.first, c2.first);

				        if (r) {

				            return r;

				        }

				        return static_cast<int>(c1.second) - static_cast<int>(c2.second);

				    };

				    return lexicographical_tri_compare(_types.begin(), _types.end(),

				        a_values.begin(), a_values.end(),

				        b_values.begin(), b_values.end(),

				        cmp);

				}
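The byte layout that `composite::iterator::read_current()` walks can be sketched outside the codebase. The following is a minimal, standalone illustration rather than the project's implementation. It assumes that `size_type` is a 16-bit big-endian length and that `eoc` takes the values -1, 0 and 1 (neither definition appears in this hunk), and the names `byte_string`, `encode_compound` and `decode_compound` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the types used in the hunk above.
using byte_string = std::string;
enum class eoc : std::int8_t { start = -1, none = 0, end = 1 };  // assumed values

// Encode each component as <length><value><end-of-component byte>.
// Assumption: the length is 16 bits, big-endian.
// As in serialize_value(), only the last component carries the marker.
byte_string encode_compound(const std::vector<byte_string>& values, eoc last = eoc::none) {
    byte_string out;
    for (const auto& v : values) {
        if (v.size() > 0xffff) {
            throw std::length_error("component too large");
        }
        out.push_back(static_cast<char>((v.size() >> 8) & 0xff));
        out.push_back(static_cast<char>(v.size() & 0xff));
        out += v;
        out.push_back(static_cast<char>(eoc::none));
    }
    if (!out.empty()) {
        out.back() = static_cast<char>(last);
    }
    return out;
}

// Decode the layout back into (value, eoc) pairs, mirroring read_current():
// read the length, slice the value, then read one end-of-component byte.
std::vector<std::pair<byte_string, eoc>> decode_compound(const byte_string& in) {
    std::vector<std::pair<byte_string, eoc>> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        if (in.size() - pos < 2) {
            throw std::runtime_error("truncated length");
        }
        std::size_t len = (static_cast<std::size_t>(static_cast<unsigned char>(in[pos])) << 8)
                        | static_cast<unsigned char>(in[pos + 1]);
        pos += 2;
        if (in.size() - pos < len + 1) {
            throw std::runtime_error("truncated component");
        }
        byte_string value = in.substr(pos, len);
        pos += len;
        auto b = static_cast<std::int8_t>(in[pos++]);
        // Same mapping as composite::to_eoc(): 0 is none, negative is start, positive is end.
        out.emplace_back(std::move(value), b == 0 ? eoc::none : (b < 0 ? eoc::start : eoc::end));
    }
    return out;
}
```

Under these assumptions, encoding the components `"ab"` and `"c"` with an `end` marker yields nine bytes, and decoding them returns the two values with `none` on the first component and `end` on the last. The static prefix (two `0xff` bytes) and the non-compound case are left out of the sketch.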

									
10 compress.hh
				@@ -1,5 +1,5 @@

				/*

				- * Copyright (C) 2015 Cloudius Systems, Ltd.

				+ * Copyright (C) 2015 ScyllaDB

				 */

				/*

				@@ -32,24 +32,24 @@ enum class compressor {

				class compression_parameters {

				public:

				-    static constexpr int32_t DEFAULT_CHUNK_LENGTH = 64 * 1024;

				+    static constexpr int32_t DEFAULT_CHUNK_LENGTH = 4 * 1024;

				    static constexpr double DEFAULT_CRC_CHECK_CHANCE = 1.0;

				    static constexpr auto SSTABLE_COMPRESSION = "sstable_compression";

				    static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";

				    static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";

				private:

				-    compressor _compressor = compressor::none;

				+    compressor _compressor;

				    std::experimental::optional<int> _chunk_length;

				    std::experimental::optional<double> _crc_check_chance;

				public:

				-    compression_parameters() = default;

				-    compression_parameters(compressor c) : _compressor(c) { }

				+    compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }

				    compression_parameters(const std::map<sstring, sstring>& options) {

				        validate_options(options);

				        auto it = options.find(SSTABLE_COMPRESSION);

				        if (it == options.end() || it->second.empty()) {

				            _compressor = compressor::none;

				            return;

				        }

				        const auto& compressor_class = it->second;
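The visible part of the options constructor treats a missing or empty `sstable_compression` entry as "no compression". A minimal sketch of just that lookup follows, using `std::map<std::string, std::string>` and C++17 `std::optional` in place of the project's `sstring` and `std::experimental::optional`. The function name `configured_compressor_class` is hypothetical, and the mapping from class name to compressor is not shown in this hunk, so the sketch does not attempt it.

```cpp
#include <map>
#include <optional>
#include <string>

// Returns the configured compressor class name, or std::nullopt when the
// "sstable_compression" option is absent or empty (the case in which the
// constructor above selects compressor::none and returns early).
std::optional<std::string> configured_compressor_class(
        const std::map<std::string, std::string>& options) {
    auto it = options.find("sstable_compression");
    if (it == options.end() || it->second.empty()) {
        return std::nullopt;
    }
    return it->second;
}
```

Returning an optional keeps the "compression disabled" case distinct from any real class name, which is the distinction the early `return` in the hunk makes.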

Compare commits

4909 Commits branch-1.0 ... next-2.1

9 .github/ISSUE_TEMPLATE.md vendored Normal file

9 .gitignore vendored

2 .gitmodules vendored

140 CMakeLists.txt Normal file

11 CONTRIBUTING.md Normal file

233 HACKING.md Normal file

42 README.md

9 SCYLLA-VERSION-GEN

90 api/api-doc/cache_service.json

88 api/api-doc/collectd.json

256 api/api-doc/column_family.json

8 api/api-doc/endpoint_snitch_info.json

33 api/api-doc/failure_detector.json

137 api/api-doc/storage_proxy.json

108 api/api-doc/storage_service.json

39 api/api-doc/utils.json

12 api/api.cc

88 api/api.hh

1 api/api_init.hh

68 api/cache_service.cc

2 api/cache_service.hh

58 api/collectd.cc

2 api/collectd.hh

358 api/column_family.cc

37 api/column_family.hh

2 api/commitlog.cc

2 api/commitlog.hh

9 api/compaction_manager.cc

2 api/compaction_manager.hh

16 api/endpoint_snitch.cc

2 api/endpoint_snitch.hh

16 api/failure_detector.cc

2 api/failure_detector.hh

2 api/gossiper.cc

2 api/gossiper.hh

3 api/hinted_handoff.cc

2 api/hinted_handoff.hh

6 api/lsa.cc

2 api/lsa.hh

8 api/messaging_service.cc

2 api/messaging_service.hh

87 api/storage_proxy.cc

2 api/storage_proxy.hh

113 api/storage_service.cc

2 api/storage_service.hh

2 api/stream_manager.cc

2 api/stream_manager.hh

2 api/system.cc

2 api/system.hh

164 atomic_cell.hh

29 atomic_cell_hash.hh

16 atomic_cell_or_collection.hh

41 auth/allow_all_authenticator.cc Normal file

97 auth/allow_all_authenticator.hh Normal file

41 auth/allow_all_authorizer.cc Normal file

98 auth/allow_all_authorizer.hh Normal file

306 auth/auth.cc

121 auth/auth.hh

7 auth/authenticated_user.cc

16 auth/authenticated_user.hh

75 auth/authenticator.cc

50 auth/authenticator.hh

118 auth/authorizer.cc Normal file

167 auth/authorizer.hh Normal file

70 auth/common.cc Normal file

74 auth/common.hh Normal file

36 auth/data_resource.cc

21 auth/data_resource.hh

257 auth/default_authorizer.cc Normal file

92 auth/default_authorizer.hh Normal file

229 auth/password_authenticator.cc

43 auth/password_authenticator.hh

71 auth/permission.cc

22 auth/permission.hh

51 auth/permissions_cache.cc Normal file

99 auth/permissions_cache.hh Normal file

355 auth/service.cc Normal file

133 auth/service.hh Normal file

232 auth/transitional.cc Normal file