Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit the number of
outstanding view updates to 100. The limit is global per shard, not per
destination view replica, and it is static.
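A minimal sketch of the approach using seastar::with_semaphore; the
semaphore name and the callable are illustrative, not the actual Scylla
code:
```
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <utility>

// One semaphore per shard (thread_local), allowing at most 100
// outstanding view updates, regardless of destination replica.
static thread_local seastar::semaphore view_update_sem{100};

// Holds one unit for as long as the update future is pending, so a
// slow view replica backpressures the base replica.
template <typename SendUpdate>
seastar::future<> send_view_update_limited(SendUpdate send) {
    return seastar::with_semaphore(view_update_sem, 1, std::move(send));
}
```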
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
(cherry picked from commit 4b3562c3f5)
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit dc44a08370)
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:
storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)
Fixes #3393.
Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 423712f1fe)
After removal of the deletion manager, the caller is now responsible for
properly submitting the deletion of a shared sstable; the deletion manager
used to hold the deletion until all owners agreed on it.
Resharding, for example, was changed to delete the shared sstables at the
end, but truncate wasn't, so a race condition could happen when deleting
the same sstable on more than one shard in parallel. Change the operation
to submit a shared sstable for deletion from only one owner.
Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.
This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.
This may impact latency of writes and reads.
The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.
Fixes #3394
Tests:
- dtest: schema_test.py, schema_management_test.py
- reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
- unit (release)
Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b1465291cf)
We were feeding the total estimated partition count of an input shared
sstable to the output unshared ones.
So the sstable writer assumes, based on that estimate, that each sstable
created by resharding will hold the same amount of data as the shared
sstable it is created from. That's a problem because the estimate is fed
to bloom filter creation, which directly influences its size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.
Partition count estimation for a shard S will now be done as follows:
//
// TE, the total estimated partition count for a shard S, is defined as
// TE = Sum(i = 0...N) { Ei / Si }.
//
// where i is an input sstable that belongs to shard S,
// Ei is the estimated partition count for sstable i,
// Si is the total number of shards that own sstable i.
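A sketch of that computation; estimated_partition_count() and
owner_shard_count() are hypothetical accessors standing in for the real
sstable API:
```
#include <cstdint>
#include <vector>

// TE = Sum(i = 0...N) { Ei / Si } over the input sstables that belong
// to shard S (see the formula above).
template <typename SSTablePtr>
uint64_t estimated_partitions_for_shard(const std::vector<SSTablePtr>& inputs) {
    uint64_t te = 0;
    for (const auto& sst : inputs) {
        // Ei: estimated partition count of input sstable i.
        // Si: number of shards that own sstable i.
        te += sst->estimated_partition_count() / sst->owner_shard_count();
    }
    return te;
}
```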
Fixes #2672.
Refs #3302.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
(cherry picked from commit 11940ca39e)
Debian 8 reports "Invalid argument" when AmbientCapabilities is used in
the systemd unit file, so drop that line when we build the .deb package
for Debian 8. Other distributions keep using the feature.
Fixes #3344
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
(cherry picked from commit 7b92c3fd3f)
When 'always_set_home' is specified in /etc/sudoers, pbuilder won't read
.pbuilderrc from the current user's home directory, and we don't have a way
to change that behaviour via a sudo command parameter.
So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo is
executed; this works in environments both with and without always_set_home
specified.
Fixes #3366
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ace44784e8)
* seastar bcfbe0c...491f994 (3):
> tls: Ensure we always pass through semaphores on shutdown
> cpu scheduler: don't penalize first group to run
> reactor: fix sleep mode
Fixes #3350.
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the
notification list, since the attempt to unregister the connection happens
too early.
The fix is to move notifier unregistration to after the connection's gate
is closed, which ensures that there is no outstanding registration
request. But this means that a connection with a closed gate can now be in
the notifier list, so with_gate() may throw and abort the notifier loop. Fix
that by replacing with_gate() with a call to is_closed().
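An illustrative sketch of the notifier loop after the change (the
connection shape and notification hook are hypothetical); with_gate()
throws gate_closed_exception on a closed gate, whereas checking
is_closed() lets the loop simply skip such connections:
```
#include <seastar/core/gate.hh>
#include <vector>

struct connection {
    seastar::gate _gate;
    void on_event(int event); // hypothetical notification hook
};

// A connection whose gate is already closed is being torn down and will
// unregister itself once the gate closes, so skip it instead of calling
// with_gate(), which would throw and abort the whole loop.
void notify_all(std::vector<connection*>& connections, int event) {
    for (auto* conn : connections) {
        if (conn->_gate.is_closed()) {
            continue;
        }
        conn->on_event(event);
    }
}
```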
Fixes: #3355
Tests: unit(release)
Message-Id: <20180412134744.GB22593@scylladb.com>
(cherry picked from commit 1a9aaece3e)
After the change to serialize compactions on compaction weight (eff62bc61e),
the LCS invariant may break because a parallel compaction can start, which
is not currently supported for LCS.
The condition is that the weight is deregistered right before the last
sstable of a leveled compaction is sealed, so a new compaction may start
for the same column family in the meantime and promote an sstable to
an overlapping token range.
That leads to the strategy restoring the invariant when it finds the
overlap, which wastes resources.
The fix removes a fast-path check, which is now incorrect because we
release the weight early, and also fixes a check for ongoing compaction
which prevented compaction from starting for LCS whenever the weight
tracker was not empty.
Fixes #3279.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
(cherry picked from commit 638a647b7d)
After f59f423f3c, an sstable is loaded only on the shards
that own it, so as to reduce the sstable load overhead.
The problem is that an sstable may no longer be forwarded to a shard that
needs to be aware of its existence, which would result in that sstable's
generation being reallocated for a write request.
That would result in a failure as follows:
"SSTable write failed due to existence of TOC file for generation..."
This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as follows:
s = generation % smp::count
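A sketch of the resulting forwarding set (the owner-shard accessor is a
hypothetical stand-in for the real sstable API):
```
#include <seastar/core/smp.hh>
#include <unordered_set>

// Forward an sstable to every shard that owns it, plus the shard
// responsible for its generation, so the generation cannot be
// reallocated for a new write request.
template <typename SSTable>
std::unordered_set<unsigned> shards_to_forward(const SSTable& sst) {
    std::unordered_set<unsigned> shards;
    for (unsigned owner : sst.owner_shards()) { // hypothetical accessor
        shards.insert(owner);
    }
    shards.insert(sst.generation() % seastar::smp::count);
    return shards;
}
```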
Fixes #3273.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
(cherry picked from commit 30b6c9b4cd)
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is a base table regular column, a
base table clustering key column, or a base table partition key column
that is part of a composite key.
Fixes #3262
Refs CASSANDRA-14345
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
(cherry picked from commit ec8960df45)
Overprovisioned mode is not enabled on docker unless it is explicitly
set. I have come to believe that this is a mistake.
If the user is running alone on the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.
But everywhere else, it is not: even if a user runs 2 docker containers
on the same machine and statically partitions CPUs with --smp (but
without cpuset), the docker containers will pin themselves to the same
set of CPUs, as they are totally unaware of each other.
It is also very common, especially in some virtualized environments, for
interrupts not to be properly distributed - they are particularly keen on
being delivered on CPU0, a CPU which Scylla will pin by default.
Lastly, environments like Kubernetes simply don't support pinning at the
moment.
This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.
Fixes #3336.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
(cherry picked from commit ef84780c27)
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be passed via
YAML, for Cassandra compatibility.
That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.
While this problem is common to all Unused options, the likelihood of
any other unused option appearing on the command line is near zero,
except for those two - since we put them there ourselves.
There are two ways to handle this issue:
1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.
The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discreet message saying that those options are,
from now on, ignored.
v2: mark set as const (Botond)
v3: rebase on top of master, indentation suggested by Duarte.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
(cherry picked from commit a9ef72537f)
start nodes 1, 2, 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node
I saw node2 could not bootstrap; it was stuck waiting for schema
information to complete forever.
On node1, node3:
[shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704
On node2:
[shard 0] storage_service - JOINING: waiting for schema information to complete
This is because during the nodetool removenode operation, the generation
of node2, as recorded by the other nodes, was increased from 0 to 2:
gossiper::advertise_removing() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
Each force_newer_generation_unsafe() call increases the generation by 1.
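A sketch of the effect described above (the real member layout may
differ):
```
#include <cstdint>

// removenode calls force_newer_generation_unsafe() twice for the removed
// endpoint, taking its locally-recorded generation from 0 to 2.
class heart_beat_state {
    int32_t _generation = 0;
public:
    void force_newer_generation_unsafe() {
        _generation += 1;
    }
    int32_t get_generation() const { return _generation; }
};
```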
Here is an example:
Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
    "addrs": "127.0.0.2",
    "generation": 0,
    "is_alive": false,
    "update_time": 1521778757334,
    "version": 0
},
```
After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
    "addrs": "127.0.0.2",
    "application_state": [
        {
            "application_state": 0,
            "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
            "version": 214
        },
        {
            "application_state": 6,
            "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
            "version": 153
        }
    ],
    "generation": 2,
    "is_alive": false,
    "update_time": 1521779276246,
    "version": 0
},
```
In gossiper::apply_state_locally, we have this check:
```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation);
}
```
to skip the gossip update.
To fix this, we relax the max generation difference check to allow the
generation of a removed node.
After this patch, the removed node bootstraps successfully.
Tests: dtest:update_cluster_layout_tests.py
Fixes #3331
Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
(cherry picked from commit f539e993d3)
Since we just keep retrying, this can cause Scylla to not shut down for
a while.
The data will be safe in the commit log.
Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.
Ref #3318.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
(cherry picked from commit a985ea0fcb)
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(). If that fails, the sstable is left in the
compaction_backlog_tracker::_ongoing_writes map, which wastes disk
space, and that sstable maps to a dangling pointer to a destroyed
database_sstable_write_monitor, which causes a segfault when accessed
(for example, through the backlog_controller, which accounts for
_ongoing_writes when calculating the backlog).
Fix this by increasing the scope of handle_exception().
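A schematic sketch of the wider scope, with hypothetical stand-ins for
the write, the follow-up step, and the removal of the sstable from
_ongoing_writes; the point is that handle_exception() is attached to the
whole chain, so cleanup runs on any failure:
```
#include <seastar/core/future.hh>
#include <exception>
#include <utility>

template <typename WriteFn, typename OpenFn, typename ForgetFn>
seastar::future<> flush_with_cleanup(WriteFn write, OpenFn open, ForgetFn forget) {
    return write()                                   // write_memtable_to_sstable()
        .then([open = std::move(open)] () mutable {
            return open();                           // later steps can fail too
        })
        .handle_exception([forget = std::move(forget)] (std::exception_ptr ep) mutable {
            forget();                                // always drop from _ongoing_writes
            return seastar::make_exception_future<>(std::move(ep));
        });
}
```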
Fixes #3315
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
(cherry picked from commit 50ad37d39b)
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.
"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.
"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.
Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.
Tests: unit (release), dtest (PENDING)
"
* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
Roles are implemented
auth: Increase delay before background tasks start
auth: Remove ordering dependence
auth: Don't warn on rescheduled task
auth: Wait for schema agreement
Single-node clusters can agree on schema
(cherry picked from commit 999df41a49)
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").
The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.
The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).
Tests: unit (release)
* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
auth: Grant all permissions to object creator
auth: Unify handling for unsupported errors
auth: Fix life-time issue with parameter
auth: Fix `const` correctness
(cherry picked from commit 934d805b4b)
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such a case, the
sstable_mutation_reader will set up the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later, when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip past the
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes #3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
(cherry picked from commit 054854839a)
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.
This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).
Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.
Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one was
iterating over.
Fixes #3299.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
Currently the Ubuntu 18.04 package builds with the distribution-provided
g++ and boost, but it's easier to maintain the Scylla package when it
builds with the same toolchain/library versions as the other
distributions, so switch to those.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 945e6ec4f6)
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
dist/debian: build only scylla, iotune
dist/debian: switch to boost-1.65
dist/debian: switch to gcc-7.3
(cherry picked from commit bb4b1f0e91)
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plans to be created to stream the
data, each carrying very little data. This gives each stream plan a low
transfer bandwidth, so the total time to complete the streaming increases.
It makes more sense to send a percentage of the total ranges per stream
plan than a fixed number of ranges (see the sketch below).
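A sketch of the sizing rule; names are illustrative and the 10% figure
is taken from the example that follows:
```
#include <algorithm>
#include <cstddef>

// Number of ranges per stream plan: a percentage of the total rather
// than a fixed count, so plan size scales with the workload.
size_t ranges_per_stream_plan(size_t nr_ranges_total, unsigned percent = 10) {
    return std::max<size_t>(1, nr_ranges_total * percent / 100);
}
```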
Here is an example of streaming a keyspace with 513 ranges in
total, 10 ranges vs. 10% of the ranges:
Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds
After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
This reverts commit f792c78c96.
With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.
Reduce the time-to-recover from 5 hours to 10 minutes per stream session.
Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.
Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
"
Terms
-----
querier: A class encapsulating all the logic and state needed to fill a
page. This includes the reader, the compact_mutation object and all
associated state.
Preamble
--------
Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page, and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latency, as from
the point of view of a replica each page is as much work as a fresh
query.
Solution
--------
The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved in creating them only once.
On each page the coordinator will generate a UUID that identifies the
page. This UUID is used as the key under which the contributing
queriers will be saved in the cache. On the next page, the UUID from the
previous page will be used to look up saved queriers, and the one from
the current page to save them afterwards (if the query isn't finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
Cached queriers will be evicted after a certain period of time to avoid
unnecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.
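A highly simplified sketch of the save/lookup cycle (this querier_cache
shape is hypothetical; the real one also handles the TTL-, resource- and
memory-based eviction described above, and uses proper UUID keys):
```
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

template <typename Querier>
class querier_cache {
    // Keyed by the page UUID generated by the coordinator; a plain
    // string stands in for the UUID type here.
    std::unordered_map<std::string, Querier> _entries;
public:
    // Called after filling a page, under the reader_save_uuid.
    void save(std::string key, Querier q) {
        _entries.emplace(std::move(key), std::move(q));
    }
    // Called on the next page with the reader_recall_uuid; a miss means
    // the querier was evicted and must be rebuilt from scratch.
    std::optional<Querier> lookup(const std::string& key) {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        auto q = std::move(it->second);
        _entries.erase(it);
        return q;
    }
};
```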
Splitting up the work
---------------------
This series only fixes the singular-mutation query path, that is, queries
that either fetch a single partition or several single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series; however, much of the infrastructure needed for
general querier reuse is already introduced by this series.
Ref #1865
Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)
Benchmarking summary (read-from-disk)
-------------------------------------
1) Latency
BEFORE
latency mean : 58.0
latency median : 57.4
latency 95th percentile : 68.8
latency 99th percentile : 79.9
latency 99.9th percentile : 93.6
latency max : 93.6
AFTER
latency mean : 41.3
latency median : 40.5
latency 95th percentile : 50.8
latency 99th percentile : 68.9
latency 99.9th percentile : 89.2
latency max : 89.2
2) Throughput (single partition query)
sum(scylla_cql_reads):
BEFORE: 173'567
AFTER: 427'774
+246%
3) Throughput (IN query, 2 partitions)
sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431
+148%
"
* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
Add unit test for resource based cache eviction
Add unit tests for querier_cache
Add counters to monitor querier-cache efficiency
Memory based cache eviction
Add buffer_size() to flat_mutation_reader
Resource-based cache eviction
Time-based cache eviction
Save and restore queriers in mutation_query() and data_query()
Add the querier_cache_context helper
Add querier_cache
Add querier
Add are_limits_reached() compact_mutation_state
Add start_new_page() to compact_mutation_state
Save last key of the page and method to query it
Make compact_mutation reusable
Add the CompactedFragmentsConsumer
Use the last_replicas stored in the page_state
query_singular(): return the used replicas
Consider preferred replicas when choosing endpoints for query_singular()
Add preferred and last replicas to the signature of query()
...
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of its own.
"
This patchset is part of a bigger effort to bring our microbenchmarking
tests from the source tree into use for regression testing with CI.
It is now possible to export the results of test runs into JSON format,
which can be stored in ElasticSearch and compared among runs to detect
performance degradation should it happen.
Example of JSON output (formatted for readability):
{
    "results" :
    {
        "parameters" :
        {
            "read" : "64",
            "read,skip,test_run_count" : "64,256,1",
            "skip" : "256",
            "test_run_count" : 1
        },
        "stats" :
        {
            "(KiB)" : 126960,
            "aio" : 993,
            "blocked" : 208,
            "c blk" : 1,
            "c hit" : 0,
            "c miss" : 1,
            "cpu" : 99.779365539550781,
            "dropped" : 0,
            "frag/s" : 311939.61559016741,
            "frags" : 200000,
            "idx blk" : 0,
            "idx hit" : 0,
            "idx miss" : 0,
            "time (s)" : 0.641149729
        }
    },
    "test_group_properties" :
    {
        "message" : "Testing scanning large partition with skips.\nReads whole range interleaving reads with skips according to read-skip pattern",
        "name" : "large-partition-skips",
        "needs_cache" : false,
        "partition_type" : "large"
    },
    "versions" :
    {
        "scylla-server" :
        {
            "commit_id" : "4acfa17f4",
            "date" : "20180306",
            "run_date_time" : "2018-16-06 12:16:41",
            "version" : "666.development"
        }
    }
}
"
* 'issues/2947/v6' of https://github.com/argenet/scylla:
Add support for JSON output format for perf_fast_forward results.
Wrap output for customization. Move all output handling to a single managing class.
Add the following counters:
(1) querier_cache_lookups
(2) querier_cache_misses
(3) querier_cache_drops
(4) querier_cache_time_based_evictions
(5) querier_cache_resource_based_evictions
(6) querier_cache_memory_based_evictions
(7) querier_cache_population
(1) counts the total number of querier cache lookups. Not all
page-fetches will result in a querier lookup. For example the first page
of a query will not do a lookup as there was no previous page to reuse
the querier from. The second and all subsequent pages, however, should
attempt to reuse the querier from the previous page.
(2) counts the subset of (1) where the read has missed the querier
cache (failed to find a matching saved querier).
(3) counts the subset of (1) where the querier was recalled and dropped
immediately. This can happen for example if the querier was at the wrong
position.
(4) counts the cached queriers that were evicted due to their TTL
expiring.
(5) counts the cached queriers that were evicted due to a shortage of
reader resources (those limited by reader-concurrency limits).
(6) counts the cached queriers that were evicted due to reaching the
cache's memory limit (currently set to 4% of the shard's memory).
(7) is the current number of entries in the cache.
Note:
* The count of cache hits can be derived from these counters as
(1) - (2).
* cache_drop (3) also implies a cache hit (see above). This means that
the number of actually reused queriers is:
(1) - (2) - (3)
To bound the memory consumption of the querier-cache, the total memory
consumption of the cached queriers is limited to 4% of the shard's total
memory.
When inserting a new querier, it is first checked whether its insertion
would cause the limit to be crossed. If so, existing entries are evicted
until the memory consumption is sufficiently reduced, so that after
inserting the querier it stays below the limit.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
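A sketch of the insertion-time check (names and container shapes are
illustrative; seastar::memory::stats().total_memory() gives the shard's
total memory):
```
#include <cstddef>
#include <seastar/core/memory.hh>

// Evict LRU entries until the new querier fits under the 4% budget.
template <typename Querier, typename Lru, typename EvictFn>
void make_room_for(const Querier& q, Lru& lru, size_t& total_buffer_size, EvictFn evict) {
    const size_t limit = seastar::memory::stats().total_memory() * 4 / 100;
    while (!lru.empty() && total_buffer_size + q.buffer_size() > limit) {
        evict(lru.front()); // oldest first; must reduce total_buffer_size
    }
    // the caller then inserts q and accounts q.buffer_size()
}
```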
To calculate the memory consumption of the cached queriers,
flat_mutation_reader::buffer_size() is used. While this is not very
precise, as it doesn't include object sizes and member containers, it
gives a good picture of the memory consumption of the queriers.
Memory-based cache eviction overlaps with resource-based cache eviction,
but only to some degree, as the latter only accounts for the memory
consumption of sstable readers.
buffer_size() exposes the collective size of the external memory
consumed by the mutation fragments in the flat reader's buffer. This
provides a basis on which to build basic memory accounting. Although
this is not the entire memory consumption of any given reader, it is
the most volatile component and usually by far the largest one too.
Readers serving user-reads need to obtain a permit to start reading.
There is a restriction on how many active readers can be admitted,
based on their count and their memory consumption.
Since the saved readers of cached queriers are technically active (they
hold a permit), they can block new readers from obtaining a permit.
New readers have a higher priority, because a cached reader might be
abandoned, or at best used later, so in the face of memory pressure we
evict cached readers to free up permits for new readers.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
Cached queriers should not sit in the cache indefinitely, otherwise
abandoned reads would cause excess and unnecessary resource usage. Attach
an expiry timer to each cache entry which evicts it after the TTL
passes.
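A sketch of the per-entry expiry (the entry shape and the eviction hook
are illustrative), using a seastar timer:
```
#include <chrono>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/timer.hh>
#include <utility>

// Each cache entry owns a timer; when the TTL elapses the callback
// fires on the owning shard and evicts the entry.
template <typename Querier>
struct cache_entry {
    Querier q;
    seastar::timer<seastar::lowres_clock> expiry;

    template <typename EvictFn>
    void arm_expiry(std::chrono::seconds ttl, EvictFn evict) {
        expiry.set_callback(std::move(evict));
        expiry.arm(ttl);
    }
};
```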