scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 12:36:56 +00:00

Author	SHA1	Message	Date
Pekka Enberg	2f519b9b34	tests/gossip_test: Fix messaging service stop This fixes gossip test shutdown similar to what commit `13ce48e` ("tests: Fix stop of storage_service in cql_test_env") did for CQL tests: gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed. Running 1 test case... [snip] unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested) seastar/tests/test-utils.cc(32): last checkpoint Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>	2016-03-16 13:15:18 +02:00
Asias He	2d50c71ca3	streaming: Handle cf is deleted after the deletion check The cf can be deleted after the cf deletion check. Handle this case as well. Use "warn" level to log if cf is missing. Although we can handle the case, it is good to distinguish whether the receiver of streaming applied all the stream mutations or not. We believe that the cf is missing because it was dropped, but it could be missing because of a bug or something we didn't anticipate here. Related patch: "streaming: Handle cf is deleted when sending STREAM_MUTATION_DONE" Fixes simple_add_new_node_while_schema_changes_test failure. Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>	2016-03-16 09:46:41 +01:00
Asias He	13ce48e775	tests: Fix stop of storage_service in cql_test_env In stop() of storage_service, it unregisters the verb handler. In the test, we stop messaging_service before storage_service. Fix it by deferring stop of messaging_service. Message-Id: <c71f7b5b46e475efe2fac4c1588460406f890176.1458086329.git.asias@scylladb.com>	2016-03-16 08:32:01 +02:00
Asias He	83ffae1568	storage_service: Drop block_until_update_pending_ranges_finished It is a legacy API from c*. Since we can wait for the update_pending_ranges to complete, we can wait for it directly instead of calling block_until_update_pending_ranges_finished to do so. Also, change do_update_pending_ranges to be private. Message-Id: <ac79b2879ec08fdcd3b2278ff68962cc71492f12.1458040608.git.asias@scylladb.com>	2016-03-15 15:18:45 +02:00
Avi Kivity	cc3e49e16f	Merge seastar upstream * seastar 0739576...6a207e1 (3): > file: allow custom file_impl implementations > Dockerfile update > tcp: Fix a typo in input_handle_other_state	2016-03-15 15:06:35 +02:00
Gleb Natapov	c6157dd99e	enable rpc_keepalive parameter Fixes #1044 Message-Id: <20160315104609.GV6117@scylladb.com>	2016-03-15 12:51:12 +02:00
Paweł Dziepak	9f3893980a	move SCHEMA_CHECK registration to migration_manager The verb is just for reporting and debugging purposes, but it is better not to register it until it can return a meaningful value. Besides, it really belongs to the migration manager subsystem anyway. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>	2016-03-15 12:24:37 +02:00
Asias He	d79dbfd4e8	main: Defer initialization of streaming Streaming is used by bootstrap and repair. Streaming uses storage_proxy class to apply the frozen_mutation and db/column_family class to invalidate row cache. Defer the initialization until just before repair and bootstrap init. Message-Id: <8e99cf443239dd8e17e6b6284dab171f7a12365c.1458034320.git.asias@scylladb.com>	2016-03-15 11:56:34 +02:00
Pekka Enberg	eb13f65949	main: Defer REPAIR_CHECKSUM_RANGE RPC verb registration after commitlog replay Register the REPAIR_CHECKSUM_RANGE messaging service verb handler after we have replayed the commitlog to avoid responding with bogus checksums. Message-Id: <1458027934-8546-1-git-send-email-penberg@scylladb.com>	2016-03-15 11:56:18 +02:00
Pekka Enberg	917ed4adbe	Merge "verb init/handler for gossip and storage_service" from Asias "- ignore ack2 msg if gossip is not enabled - move REPLICATION_FINISHED to where it belongs - add comments for gossip runtime dependency"	2016-03-15 11:12:10 +02:00
Avi Kivity	ad26e81444	Merge "Update pending ranges when ks is changed" from Asias "At the moment, the migration_listener callbacks return void, so it is impossible to wait for the callbacks to complete. Make the callbacks run inside a seastar thread, so if we need to wait for the callback, we can make it call foo_operation().get() in the callback. It is easier than making the callbacks return future<>. Fixes #1000."	2016-03-15 10:50:07 +02:00
Asias He	883d8cb8fd	storage_service: Move REPLICATION_FINISHED verb to storage_service It belongs to storage_service not storage_proxy.	2016-03-15 16:13:22 +08:00
Asias He	fb4d292d5c	storage_service: Drop unused debug code	2016-03-15 16:13:21 +08:00
Asias He	16af12ca47	gossip: Add comments on external runtime dependency needed by gossip	2016-03-15 16:13:13 +08:00
Asias He	1034dd0aff	gossip: Ignore ack2 message if gossip is not enabled yet	2016-03-15 16:09:43 +08:00
Asias He	1bf0412e7a	gossip: Introduce handle_shutdown_msg helper	2016-03-15 16:09:43 +08:00
Asias He	54d8ac16b5	gossip: Introduce handle_echo_msg helper	2016-03-15 16:09:42 +08:00
Asias He	1f64f4bfcb	gossip: Introduce handle_ack2_msg helper	2016-03-15 16:09:42 +08:00
Asias He	d63281b256	storage_service: Update pending ranges when keyspace is changed If a keyspace is created after we calculate the pending ranges during bootstrap, we will ignore the keyspace in pending ranges when handling write requests for that keyspace, which will cause data loss if rf = 1. Fixes #1000	2016-03-15 15:41:23 +08:00
Asias He	93015bcc54	migration_manager: Make the migration callbacks run inside seastar thread At the moment, the callbacks return void, so it is impossible to wait for the callbacks to complete. Make the callbacks run inside a seastar thread, so if we need to wait for the callback, we can make it call foo_operation().get() in the callback. It is easier than making the callbacks return future<>.	2016-03-15 15:41:23 +08:00
Gleb Natapov	5076f4878b	main: Defer storage proxy RPC verb registration after commitlog replay Message-Id: <20160315071229.GM6117@scylladb.com>	2016-03-15 09:18:12 +02:00
Gleb Natapov	e228ef1bd9	messaging: enable keepalive tcp option for inter-node communication Some network equipment that does TCP session tracking tends to drop TCP sessions after a period of inactivity. Use the keepalive mechanism to prevent this from happening for our inter-node communication. Message-Id: <20160314173344.GI31837@scylladb.com>	2016-03-14 19:39:39 +02:00
Avi Kivity	7ae2298081	Merge seastar upstream * seastar 88cc232...0739576 (4): > rpc: allow configuring keepalive for rpc client > net: add keepalive configuration to socket interface > iotune: refuse to run if there is not enough space available > rpc: make client connection error more clear	2016-03-14 19:38:54 +02:00
Pekka Enberg	1429213b4c	main: Defer migration manager RPC verb registration after commitlog replay Defer registering migration manager RPC verbs until after the commitlog has been replayed so that our own schema is fully loaded before other nodes start querying it or sending schema updates. Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>	2016-03-14 18:03:16 +01:00
Pekka Enberg	16f947dcb3	message/messaging_service: Remove init_messaging_service() declaration The function no longer exists so drop the function declaration. Message-Id: <1457694134-25600-1-git-send-email-penberg@scylladb.com>	2016-03-14 13:54:53 +02:00
Vlad Zolotarov	ce47fcb1ba	sstables: properly account removal requests The same shard may create an sstables::sstable object for the same SStable that doesn't belong to it more than once and mark it for deletion (e.g. in a 'nodetool refresh' flow). In that case the destructor of sstables::sstable accounted the deletion requests from the same shard more than once, since it was a simple counter incremented each time there was a deletion request, while it should account requests from the same shard as a single request. This is because the removal logic waited for all shards to agree on a removal of a specific SStable by comparing the counter mentioned above to the total number of shards, and once they were equal the SStable files were actually removed. This patch fixes this by replacing the counter with an std::unordered_set<unsigned> that stores the ids of the shards requesting the deletion of the sstable object, and compares the size() of this set to smp::count in order to decide whether to actually delete the corresponding SStable files. Fixes #1004 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>	2016-03-14 11:45:08 +02:00
Raphael S. Carvalho	1ff7d32272	sstables: make write_simple() safer by using exclusive flag We should guarantee that write_simple() will not try to overwrite an existing file. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <194bd055f1f2dc1bb9766a67225ec38c88e7b005.1457818073.git.raphaelsc@scylladb.com>	2016-03-14 11:45:00 +02:00
Raphael S. Carvalho	0af786f3ea	sstables: fix race condition when writing to the same sstable in parallel When we are about to write a new sstable, we check if the sstable exists by checking if the respective TOC exists. That check was added to handle a possible attempt to write a new sstable with a generation that is already in use. Gleb was worried that a TOC could appear after the check, and that's indeed possible if there is an ongoing sstable write that uses the same generation (running in parallel). If the TOC appears after the check, we would again clobber an existing sstable with a temporary TOC, and the user wouldn't be able to boot scylla anymore without manual intervention. Then Nadav proposed the following solution: "We could do this by the following variant of Raphael's idea: 1. create .txt.tmp unconditionally, as before the commit `031bf57c1` (if we can't create it, fail). 2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp we just created and fail. 3. continue as usual 4. and at the end, as before, rename .txt.tmp to .txt. The key to solving the race is step 1: Since we created .txt.tmp in step 1 and know this creation succeeded, we know that we cannot be running in parallel with another writer - because such a writer too would have tried to create the same file, and kept it existing until the very last step of its work (step 4)." This patch implements the solution described above. Let me also say that the race is theoretical and scylla wasn't affected by it so far. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>	2016-03-14 11:44:51 +02:00
Avi Kivity	7278d0343b	Merge seastar upstream * seastar 906b562...88cc232 (2): > reactor: fix work item leak in syscall work queue > rpc_test: add missing header	2016-03-14 11:15:42 +02:00
Asias He	9f64c36a08	storage_service: Fix pending_range_calculator_service Since calculate_pending_ranges will modify token_metadata, we need to replicate to other shards. With this patch, when we call calculate_pending_ranges, token_metadata will be replicated to other non-zero shards. In addition, it is not useful as a standalone class. We can merge it into the storage_service. Kill one singleton class. Fixes #1033 Refs #962 Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>	2016-03-14 10:14:22 +02:00
Pekka Enberg	d4b4baad98	Merge "Add more information to query result digest" from Paweł "This series adds more information (i.e. keys and tombstones) to the query result digest in order to ensure correctness and increase the chances of early detection of disagreement between replicas. The digest is no longer computed by hashing query::result but built using the query result builder. That is necessary since the query result itself doesn't contain all information required to compute the digest. Another consequence of this is that now replicas asked for a result need to send both the result and the digest to the coordinator as it won't be able to compute the digest itself. Unfortunately, these patches change our on-wire communication: 1) hash computation is different 2) format of query::result is changed (and it is made non-final) Fixes #182."	2016-03-14 08:22:05 +02:00
Paweł Dziepak	72970c9c90	query: add query::result::_digest to pretty printer Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:17 +00:00
Paweł Dziepak	82d2a2dccb	specify whether query::result, result_digest or both are needed Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	21e2ebcf8c	query: build only result, only digest or both Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	46079f763b	query: add keys and tombstones to result digest Query result digest is used to verify that all replicas have the same data. Therefore, it needs to contain more information than the query result itself in order to ensure proper detection of disagreements. Generally, adding clustering keys to the digest regardless of whether the client asked for them will guarantee correctness. However, adding tombstones as well improves the chances of early detection of nodes containing stale data. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	15fd3e96ff	md5_hasher: add finalize_array() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	3efb10bd08	result.idl: keep digest together with result Result digest is going to be computed in query result builder and requires information not available in the query result. That's why the digest now needs to be sent to the other nodes together with the result, as they won't be able to compute it on their own. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	86ba96622e	atomic_cell: do not require type to hash collection cell Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	23ee493d91	types: make collection_type_impl::deserialize_mutation_form static Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	c1f7f11d54	mutation_partition: do not add ck to result when not asked to Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	77dbe3c12f	storage_proxy: fix reconciliation with limits Currently, if there is a disagreement between replicas we get mutations from all of them, merge these mutations and send the result to the client; the difference between the result and the mutation sent by a particular replica is sent back to repair it. Unfortunately, that may not suffice to provide the user with correct results in case of disagreements. Consider the following scenario: create table cf(p int, c int, r int, primary key(p, c)); node1: p=0, c=1, r=1 (timestamp = 1) p=0, c=2, r=2 (timestamp = 2) node2: p=0, c=1, r=tombstone (timestamp = 2) p=0, c=2, r=1 (timestamp = 1) query: select r from cf limit 1; Let's assume there are no row markers. node1 will send only the outdated cell (p=0, c=1, r=1) while node2 will send both the tombstone for c=1 and the outdated cell (p=0, c=2, r=1). A disagreement will be detected, the replies will be merged and the coordinator will respond to the client with result r=1, while the correct answer is r=2. The solution proposed in this patch is to attempt to detect cases when the problem may occur and retry queries with a larger limit, which results in replicas providing more information. The detection logic is simple: the partition key and clustering key of the last row in the reconciled result are compared with the partition keys and clustering keys of the last rows of replies from replicas (except short reads). If the (pk, ck) of the replica's last row is smaller than the (pk, ck) of the reconciled result, the query is retried. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:26:33 +00:00
Asias He	f747df2aff	streaming: Fix rethrow in stream_transfer_task Fix bootstrap_test.py:TestBootstrap.failed_bootstap_wiped_node_can_join_test Logs on node 1: INFO 2016-03-11 15:53:43,287 [shard 0] gossip - FatClient 127.0.0.2 has been silent for 30000ms, removing from gossip INFO 2016-03-11 15:53:43,287 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.2 in on_remove WARN 2016-03-11 15:53:43,498 [shard 0] stream_session - [Stream #4e411ba0-e75e-11e5-81f8-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION_DONE to 127.0.0.2:0: std::runtime_error ([Stream #4e411ba0-e75e-11e5-81f8-000000000000] GOT STREAM_MUTATION_DONE 127.0.0.1: Can not find stream_manager) terminate called without an active exception Backtrace on node 1: #0 0x00007fb74723da98 in raise () from /lib64/libc.so.6 #1 0x00007fb74723f69a in abort () from /lib64/libc.so.6 #2 0x00007fb74ab84aed in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6 #3 0x00007fb74ab82936 in ?? () from /lib64/libstdc++.so.6 #4 0x00007fb74ab82981 in std::terminate() () from /lib64/libstdc++.so.6 #5 0x00007fb74ab82be9 in __cxa_rethrow () from /lib64/libstdc++.so.6 #6 0x0000000000f3521e in streaming::stream_transfer_task::<lambda()>::<lambda(auto:44)>::operator()<std::__exception_ptr::exception_ptr> (ep=..., __closure=0x7ffce74d8630) at streaming/stream_transfer_task.cc:169 #7 do_void_futurize_apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1142 #8 futurize<void>::apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1190 #9 future<>::<lambda(auto:7&&)>::operator()<future<> > ( fut=fut@entry=<unknown type in /home/asias/src/cloudius-systems/scylla/build/release/scylla, CU 0xec84d00, DIE 0xee2561d>, __closure=__closure@entry=0x7ffce74d8630) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1014 Message-Id: <1457684884-4776-2-git-send-email-asias@scylladb.com>	2016-03-11 11:14:05 +02:00
Asias He	bcdd3dbb3e	messaging_service: Add missed throw It is missed somehow. Message-Id: <1457684884-4776-1-git-send-email-asias@scylladb.com>	2016-03-11 11:01:24 +02:00
Raphael S. Carvalho	031bf57c19	sstables: bail out if toc exists for generation used by write_components Currently, if sstable::write_components() is called to write a new sstable using the same generation of a sstable that exists, a temporary TOC will be unconditionally created. Afterwards, the same sstable::write_components() will fail when it reaches sstable::create_data(). The reason is obvious because data component exists for that generation (in this scenario). After that, user will not be able to boot scylla anymore because there is a generation with both a TOC and a temporary TOC. We cannot simply remove a generation with TOC and temporary TOC because user data will be lost (again, in this scenario). After all, the temporary TOC was only created because sstable::write_components() was wrongly called with the generation of a sstable that exists. Solution proposed by this patch is to trigger exception if a TOC file exists for the generation used. Some SSTable unit tests were also changed to guarantee that we don't try to overwrite components of an existing sstable. Refs #1014. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>	2016-03-11 09:22:51 +02:00
Nadav Har'El	1b4f8842ee	sstable: fix compressed data file overread Since commit `2f56577` ("sstables: more efficient read of compressed data file"), the compressed_file_input_stream uses a file_input_stream to efficiently read the compressed data in chunks of some desired size (128 KB is our default) instead of in smaller compressed chunks. However, I had a bug where I mis-calculated the desired length of the read (giving the end byte instead of the length!) and as a result file_input_stream did not know where the read was supposed to stop, and always read 128 KB buffers. The results were not incorrect, because the sstable reader stops when it needs to, even if given too much data. But it was inefficient because too much data was read in the last buffer. With this patch, the length is correctly given to the input stream, and it can read a much smaller buffer at the end of the read, not the full 128 KB. I tested that this actually happens. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1457633616-15193-1-git-send-email-nyh@scylladb.com>	2016-03-11 09:17:50 +02:00
Pekka Enberg	987e8579d7	Merge "general robustness improvements for SSTables code" from Glauber "As described in issue #1014, we have found ourselves in a situation where SSTables can be written too early, and that causes problems for the existing SSTables. While this shouldn't happen (and Pekka's recent patch to move populate() a lot earlier in initialization should fix that), when it did happen, what we had was not enough to prevent it from overwriting existing tables. We should do a much better job of protecting against that. Also, some of the exceptions that are generated are totally inconclusive. This series also aims at making some of the exceptions more descriptive."	2016-03-11 09:03:05 +02:00
Glauber Costa	a339296385	database: turn sstable generation number into an optional This patch makes sure that every time we need to create a new generation number - the very first step in the creation of a new SSTable, the respective CF is already initialized and populated. Failure to do so can lead to data being overwritten. Extensive details about why this is important can be found in Scylla's Github Issue #1014 Nothing should be writing to SSTables before we have the chance to populate the existing SSTables and calculate what should the next generation number be. However, if that happens, we want to protect against it in a way that does not involve overwriting existing tables. This is one of the ways to do it: every column family starts in an unwriteable state, and when it can finally be written to, we mark it as writeable. Note that this cannot be a part of add_column_family. That adds a column family to a db in memory only, and if anybody is about to write to a CF, that was most likely already called. We need to call this explicitly when we are sure we're ready to issue disk operations safely. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:06:05 -05:00
Glauber Costa	f2a8bcabc2	sstables: improve error messages The standard C++ exception messages that will be thrown if there is anything wrong writing the file, are suboptimal: they barely tell us the name of the failing file. Use a specialized create function so that we can capture that better. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:06:05 -05:00
Glauber Costa	6c4e31bbdb	main: when scanning SSTables, run shard 0 first Deletion of previous stale, temporary SSTables is done by Shard0. Therefore, let's run Shard0 first. Technically, we could just have all shards agree on the deletion and just delete it later, but that is prone to races. Those races are not supposed to happen during normal operation, but if we have bugs, they can. Scylla's Github Issue #1014 is an example of a situation where that can happen, making existing problems worse. So running a single shard first and making sure that all temporary tables are deleted provides extra protection against such situations. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:06:05 -05:00
Glauber Costa	8eb4e69053	database: remove unused parameter We are no longer using the in_flight_seals gate, but forgot to remove it. To guarantee that all seal operations will have finished when we're done, we are using the memtable_flush_queue, which also guarantees order. But that gate was never removed. The FIXME code should also be removed, since such interface does exist now. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:05:54 -05:00
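
The four-step TOC protocol quoted in commit 0af786f3ea can be sketched roughly as follows. This is only an illustration using POSIX open() with O_EXCL and std::filesystem, not the Seastar-based code in the patch; the function name write_toc_exclusively and the ".tmp" suffix handling are hypothetical and chosen for the example.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <unistd.h>

namespace fs = std::filesystem;

// Hypothetical helper (not Scylla's API): writes `contents` to `toc`
// following the create-temporary-first protocol from the commit message.
void write_toc_exclusively(const fs::path& toc, const std::string& contents) {
    fs::path tmp = toc;
    tmp += ".tmp";

    // Step 1: create the temporary file. O_EXCL makes this fail if another
    // writer already holds a temporary file for the same generation.
    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) {
        throw std::runtime_error("cannot create temporary TOC: " + tmp.string());
    }

    // Step 2: now that we own the temporary file, confirm the final TOC
    // does not exist. If it does, undo step 1 and fail.
    if (fs::exists(toc)) {
        ::close(fd);
        fs::remove(tmp);
        throw std::runtime_error("TOC already exists: " + toc.string());
    }

    // Step 3: continue as usual (here: just write the contents).
    ssize_t written = ::write(fd, contents.data(), contents.size());
    ::close(fd);
    if (written != static_cast<ssize_t>(contents.size())) {
        fs::remove(tmp);
        throw std::runtime_error("short write to " + tmp.string());
    }

    // Step 4: publish the sstable by renaming the temporary TOC.
    fs::rename(tmp, toc);
}
```

Calling the helper twice with the same path should succeed the first time and throw the second time, leaving no temporary file behind in either case.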


11716 Commits
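
The accounting change described in commit ce47fcb1ba (replace a plain counter with a set of shard ids) can be illustrated with a small standalone sketch. The class name deletion_tracker and its interface are hypothetical; in the actual patch the comparison is made against smp::count, which is passed in here as an ordinary constructor argument.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical illustration: tracks which shards have requested deletion
// of one sstable. Repeated requests from the same shard count once.
class deletion_tracker {
    std::unordered_set<unsigned> _shards;  // ids of shards that asked for deletion
    std::size_t _shard_count;              // stands in for smp::count
public:
    explicit deletion_tracker(std::size_t shard_count)
        : _shard_count(shard_count) {}

    // Records the request and returns true once every shard has asked
    // at least once, i.e. when the files may actually be removed.
    bool request_deletion(unsigned shard_id) {
        _shards.insert(shard_id);
        return _shards.size() == _shard_count;
    }
};
```

With a plain counter, two requests from shard 0 on a two-shard node would already reach the threshold; with the set, the threshold is reached only after shard 1 also asks.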