scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 18:40:38 +00:00

Author	SHA1	Message	Date
Pekka Enberg	972fc6e014	main: Defer API server hooks until commitlog replay Defer registering services to the API server until commitlog has been replayed to ensure that nobody is able to trigger sstable operations via 'nodetool' before we are ready for them. Message-Id: <1458116227-4671-1-git-send-email-penberg@scylladb.com>	2016-03-17 10:04:35 +02:00
Takuya ASADA	95161d5db7	dist: add scylla-gdb.py on Ubuntu dbg package Fixes #969 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458150248-10632-1-git-send-email-syuu@scylladb.com>	2016-03-17 09:03:00 +02:00
Pekka Enberg	303dd76205	Merge "Fix debug messages for streaming session" from Glauber "One of the messages is printed twice, and one of the verbs is missing a message. That makes it hard to debug the session."	2016-03-17 08:11:50 +02:00
Glauber Costa	a3ebf640c6	stream_session: print debug message for STREAM_MUTATION For this verb(), we don't call get_session - and it doesn't look like we will. We currently have no debug message for this one, which makes it harder to debug the stream of messages. Print it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-16 22:09:46 -04:00
Glauber Costa	0ab4275893	stream_session: remove duplicated debug message Whenever we call get_session, that will print a debug message about the arrival of this new verb. Because we also print that explicitly in PREPARE_DONE, that message gets duplicated. That confuses poor developers who are, for a while, left wondering why is it that the sender is sending the message twice. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-16 22:04:25 -04:00
Glauber Costa	6a3872b355	sstables: do not assume mutation_reader will be kept alive Our sstables::mutation_reader has a specialization in which start and end ranges are passed as futures. That is needed because we may have to read the index file for those. This works well under the assumption that every time a mutation_reader will be created it will be used, since whoever is using it will surely keep the state of the reader alive. However, that assumption is no longer true - for a while. We use a reader interface for reading everything from mutations and sstables to cache entries, and when we create an sstable mutation_reader, that does not mean we'll use it. In fact we won't, if the read can be serviced first by a higher level entity. If that happens to be the case, the reader will be destructed. However, since it may take more time than that for the start and end futures to resolve, by the time they are resolved the state of the mutation reader will no longer be valid. The proposed fix for that is to only resolve the future inside mutation_reader's read() function. If that function is called, we can have a reasonable expectation that the caller object is being kept alive. A second way to fix this would be to force the mutation reader to be kept alive by transforming it into a shared pointer and acquiring a reference to itself. However, because the reader may turn out not to be used, the delayed read actually has the advantage of not even reading anything from the disk if there is no need for it. Also, because sstables can be compacted, we can't guarantee that the sst object itself, used in the resolution of start and end, can be alive and that has the same problem. If we delay the calling of those, we will also solve a similar problem. We assume here that the outer reader is keeping the SSTable object alive. I must note that I have not reproduced this problem. What goes above is the result of the analysis we have made in #1036. That being the case, a thorough review is appreciated. Fixes #1036 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a7e4e722f76774d0b1f263d86c973061fb7fe2f2.1458135770.git.glauber@scylladb.com>	2016-03-16 17:51:02 +02:00
Nadav Har'El	02ba8ffbe8	Allow uncompression at end of file Asking to read from byte 100 when a file has 50 bytes is an obvious error. But what if we ask to read from byte 50? What if we ask to read 0 bytes at byte 50? :-) Before this patch, code which asked to read from the EOF position would get an exception. After this patch, it would simply read nothing, without error. This allows, for example, reading 0 bytes from position 0 on a file with 0 bytes, which apparently happened in issue #1039... A read which starts at a position higher than the EOF position still generates an exception. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458137867-10998-1-git-send-email-nyh@scylladb.com>	2016-03-16 17:50:23 +02:00
Nadav Har'El	73297c7872	Fix out-of-range exception when uncompressing 0 bytes The uncompression code reads the compressed chunks containing the bytes pos through pos + len - 1. This, however, is not correct when len==0, and pos + len - 1 may even be -1, causing an out-of-range exception when calling locate() to find the chunks containing this byte position. So we need to treat len==0 specially, and in this case we don't read anything, and don't need to locate() the chunks to read. Refs #1039. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1458135987-10200-1-git-send-email-nyh@scylladb.com>	2016-03-16 15:54:48 +02:00
Takuya ASADA	f1d18e9980	dist: do not auto-start scylla-server job on Ubuntu package install time Fixes #1017 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458122424-22889-1-git-send-email-syuu@scylladb.com>	2016-03-16 13:55:12 +02:00
Pekka Enberg	2f519b9b34	tests/gossip_test: Fix messaging service stop This fixes gossip test shutdown similar to what commit `13ce48e` ("tests: Fix stop of storage_service in cql_test_env") did for CQL tests: gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed. Running 1 test case... [snip] unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested) seastar/tests/test-utils.cc(32): last checkpoint Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>	2016-03-16 13:15:18 +02:00
Asias He	2d50c71ca3	streaming: Handle cf is deleted after the deletion check The cf can be deleted after the cf deletion check. Handle this case as well. Use "warn" level to log if cf is missing. Although we can handle the case, it is good to distinguish whether the receiver of streaming applied all the stream mutations or not. We believe that the cf is missing because it was dropped, but it could be missing because of a bug or something we didn't anticipate here. Related patch: "streaming: Handle cf is deleted when sending STREAM_MUTATION_DONE" Fixes simple_add_new_node_while_schema_changes_test failure. Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>	2016-03-16 09:46:41 +01:00
Asias He	13ce48e775	tests: Fix stop of storage_service in cql_test_env In stop() of storage_service, it unregisters the verb handler. In the test, we stop messaging_service before storage_service. Fix it by deferring stop of messaging_service. Message-Id: <c71f7b5b46e475efe2fac4c1588460406f890176.1458086329.git.asias@scylladb.com>	2016-03-16 08:32:01 +02:00
Asias He	83ffae1568	storage_service: Drop block_until_update_pending_ranges_finished It is a legacy API from c*. Since we can wait for the update_pending_ranges to complete, we can wait for it directly instead of calling block_until_update_pending_ranges_finished to do so. Also, change do_update_pending_ranges to be private. Message-Id: <ac79b2879ec08fdcd3b2278ff68962cc71492f12.1458040608.git.asias@scylladb.com>	2016-03-15 15:18:45 +02:00
Avi Kivity	cc3e49e16f	Merge seastar upstream * seastar 0739576...6a207e1 (3): > file: allow custom file_impl implementations > Dockerfile update > tcp: Fix a typo in input_handle_other_state	2016-03-15 15:06:35 +02:00
Gleb Natapov	c6157dd99e	enable rpc_keepalive parameter Fixes #1044 Message-Id: <20160315104609.GV6117@scylladb.com>	2016-03-15 12:51:12 +02:00
Paweł Dziepak	9f3893980a	move SCHEMA_CHECK registration to migration_manager The verb is just for reporting and debugging purposes, but it is better not to register it until it can return a meaningful value. Besides, it really belongs to the migration manager subsystem anyway. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>	2016-03-15 12:24:37 +02:00
Asias He	d79dbfd4e8	main: Defer initialization of streaming Streaming is used by bootstrap and repair. Streaming uses storage_proxy class to apply the frozen_mutation and db/column_family class to invalidate row cache. Defer the initialization just before repair and bootstrap init. Message-Id: <8e99cf443239dd8e17e6b6284dab171f7a12365c.1458034320.git.asias@scylladb.com>	2016-03-15 11:56:34 +02:00
Pekka Enberg	eb13f65949	main: Defer REPAIR_CHECKSUM_RANGE RPC verb registration after commitlog replay Register the REPAIR_CHECKSUM_RANGE messaging service verb handler after we have replayed the commitlog to avoid responding with bogus checksums. Message-Id: <1458027934-8546-1-git-send-email-penberg@scylladb.com>	2016-03-15 11:56:18 +02:00
Pekka Enberg	917ed4adbe	Merge "verb init/handler for gossip and storage_service" from Asias "- ignore ack2 msg if gossip is not enabled - move REPLICATION_FINISHED to where it belongs - add comments for gossip runtime dependency"	2016-03-15 11:12:10 +02:00
Avi Kivity	ad26e81444	Merge "Update pending ranges when ks is changed" from Asias "At the moment, the migration_listener callbacks return void, so it is impossible to wait for the callbacks to complete. Make the callbacks run inside seastar thread, so if we need to wait for the callback, we can make it call foo_operation().get() in the callback. It is easier than making the callbacks return future<>. Fixes #1000."	2016-03-15 10:50:07 +02:00
Asias He	883d8cb8fd	storage_service: Move REPLICATION_FINISHED verb to storage_service It belongs to storage_service not storage_proxy.	2016-03-15 16:13:22 +08:00
Asias He	fb4d292d5c	storage_service: Drop unused debug code	2016-03-15 16:13:21 +08:00
Asias He	16af12ca47	gossip: Add comments on external runtime dependency needed by gossip	2016-03-15 16:13:13 +08:00
Asias He	1034dd0aff	gossip: Ignore ack2 message if gossip is not enabled yet	2016-03-15 16:09:43 +08:00
Asias He	1bf0412e7a	gossip: Introduce handle_shutdown_msg helper	2016-03-15 16:09:43 +08:00
Asias He	54d8ac16b5	gossip: Introduce handle_echo_msg helper	2016-03-15 16:09:42 +08:00
Asias He	1f64f4bfcb	gossip: Introduce handle_ack2_msg helper	2016-03-15 16:09:42 +08:00
Asias He	d63281b256	storage_service: Update pending ranges when keyspace is changed If a keyspace is created after we calculate the pending ranges during bootstrap, we will ignore the keyspace in pending ranges when handling write requests for that keyspace, which will cause data loss if rf = 1. Fixes #1000	2016-03-15 15:41:23 +08:00
Asias He	93015bcc54	migration_manager: Make the migration callbacks run inside seastar thread At the moment, the callbacks return void, so it is impossible to wait for the callbacks to complete. Make the callbacks run inside seastar thread, so if we need to wait for the callback, we can make it call foo_operation().get() in the callback. It is easier than making the callbacks return future<>.	2016-03-15 15:41:23 +08:00
Gleb Natapov	5076f4878b	main: Defer storage proxy RPC verb registration after commitlog replay Message-Id: <20160315071229.GM6117@scylladb.com>	2016-03-15 09:18:12 +02:00
Gleb Natapov	e228ef1bd9	messaging: enable keepalive tcp option for inter-node communication Some network equipment that does TCP session tracking tends to drop TCP sessions after a period of inactivity. Use keepalive mechanism to prevent this from happening for our inter-node communication. Message-Id: <20160314173344.GI31837@scylladb.com>	2016-03-14 19:39:39 +02:00
Avi Kivity	7ae2298081	Merge seastar upstream * seastar 88cc232...0739576 (4): > rpc: allow configuring keepalive for rpc client > net: add keepalive configuration to socket interface > iotune: refuse to run if there is not enough space available > rpc: make client connection error more clear	2016-03-14 19:38:54 +02:00
Pekka Enberg	1429213b4c	main: Defer migration manager RPC verb registration after commitlog replay Defer registering migration manager RPC verbs after commitlog has been replayed so that our own schema is fully loaded before other nodes start querying it or sending schema updates. Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>	2016-03-14 18:03:16 +01:00
Pekka Enberg	16f947dcb3	message/messaging_service: Remove init_messaging_service() declaration The function no longer exists so drop the function declaration. Message-Id: <1457694134-25600-1-git-send-email-penberg@scylladb.com>	2016-03-14 13:54:53 +02:00
Vlad Zolotarov	ce47fcb1ba	sstables: properly account removal requests The same shard may create an sstables::sstable object for the same SStable that doesn't belong to it more than once and mark it for deletion (e.g. in a 'nodetool refresh' flow). In that case the destructor of sstables::sstable accounted the deletion requests from the same shard more than once since it was a simple counter incremented each time there was a deletion request while it should account request from the same shard as a single request. This is because the removal logic waited for all shards to agree on a removal of a specific SStable by comparing the counter mentioned above to the total number of shards and once they were equal the SStable files were actually removed. This patch fixes this by replacing the counter by an std::unordered_set<unsigned> that will store a shard ids of the shards requesting the deletion of the sstable object and will compare the size() of this set to smp::count in order to decide whether to actually delete the corresponding SStable files. Fixes #1004 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>	2016-03-14 11:45:08 +02:00
Raphael S. Carvalho	1ff7d32272	sstables: make write_simple() safer by using exclusive flag We should guarantee that write_simple() will not try to overwrite an existing file. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <194bd055f1f2dc1bb9766a67225ec38c88e7b005.1457818073.git.raphaelsc@scylladb.com>	2016-03-14 11:45:00 +02:00
Raphael S. Carvalho	0af786f3ea	sstables: fix race condition when writing to the same sstable in parallel When we are about to write a new sstable, we check if the sstable exists by checking if respective TOC exists. That check was added to handle a possible attempt to write a new sstable with a generation being used. Gleb was worried that a TOC could appear after the check, and that's indeed possible if there is an ongoing sstable write that uses the same generation (running in parallel). If TOC appears after the check, we would again crap an existing sstable with a temporary, and user wouldn't be able to boot scylla anymore without manual intervention. Then Nadav proposed the following solution: "We could do this by the following variant of Raphael's idea: 1. create .txt.tmp unconditionally, as before the commit `031bf57c1` (if we can't create it, fail). 2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp we just created and fail. 3. continue as usual 4. and at the end, as before, rename .txt.tmp to .txt. The key to solving the race is step 1: Since we created .txt.tmp in step 1 and know this creation succeeded, we know that we cannot be running in parallel with another writer - because such a writer too would have tried to create the same file, and kept it existing until the very last step of its work (step 4)." This patch implements the solution described above. Let me also say that the race is theoretical and scylla wasn't affected by it so far. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>	2016-03-14 11:44:51 +02:00
Avi Kivity	7278d0343b	Merge seastar upstream * seastar 906b562...88cc232 (2): > reactor: fix work item leak in syscall work queue > rpc_test: add missing header	2016-03-14 11:15:42 +02:00
Asias He	9f64c36a08	storage_service: Fix pending_range_calculator_service Since calculate_pending_ranges will modify token_metadata, we need to replicate to other shards. With this patch, when we call calculate_pending_ranges, token_metadata will be replicated to other non-zero shards. In addition, it is not useful as a standalone class. We can merge it into the storage_service. Kill one singleton class. Fixes #1033 Refs #962 Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>	2016-03-14 10:14:22 +02:00
Pekka Enberg	d4b4baad98	Merge "Add more information to query result digest" from Paweł "This series adds more information (i.e. keys and tombstones) to the query result digest in order to ensure correctness and increase the chances of early detection of disagreement between replicas. The digest is no longer computed by hashing query::result but built using the query result builder. That is necessary since the query result itself doesn't contain all information required to compute the digest. Another consequence of this is that now replicas asked for a result need to send both the result and the digest to the coordinator as it won't be able to compute the digest itself. Unfortunately, these patches change our on wire communication: 1) hash computation is different 2) format of query::result is changed (and it is made non-final) Fixes #182."	2016-03-14 08:22:05 +02:00
Paweł Dziepak	72970c9c90	query: add query::result::_digest to pretty printer Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:17 +00:00
Paweł Dziepak	82d2a2dccb	specify whether query::result, result_digest or both are needed Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	21e2ebcf8c	query: build only result, only digest or both Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	46079f763b	query: add keys and tombstones to result digest Query result digest is used to verify that all replicas have the same data. Therefore, it needs to contain more information than the query result itself in order to ensure proper detection of disagreements. Generally, adding clustering keys to the digest regardless of whether the client asked for them will guarantee correctness. However, adding tombstones as well improves the chances of early detection of nodes containing stale data. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	15fd3e96ff	md5_hasher: add finalize_array() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	3efb10bd08	result.idl: keep digest together with result Result digest is going to be computed in query result builder and require information not available in the query result. That's why the digest now needs to be sent to the other nodes together with the result as they won't be able to compute it on their own. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	86ba96622e	atomic_cell: do not require type to hash collection cell Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	23ee493d91	types: make collection_type_impl::deserialize_mutation_form static Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	c1f7f11d54	mutation_partition: do not add ck to result when not asked to Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:27:13 +00:00
Paweł Dziepak	77dbe3c12f	storage_proxy: fix reconciliation with limits Currently, if there is a disagreement between replicas we get mutations from all of them, merge these mutations and send the result to the client; the difference between the result and the mutation sent by a particular replica is sent back to repair it. Unfortunately, that may not suffice to provide user with correct results in case of disagreements. Consider the following scenario: create table cf(p int, c int, r int, primary key(p, c)); node1: p=0, c=1, r=1 (timestamp = 1) p=0, c=2, r=2 (timestamp = 2) node2: p=0, c=1, r=tombstone (timestamp = 2) p=0, c=2, r=1 (timestamp = 1) query: select r from cf limit 1; Let's assume there are no row markers. node1 will send only outdated cell (p=0, c=1, r=1) while node2 will send both tombstone for c=1 and outdated cell (p=0, c=2, r=1). A disagreement will be detected, the replies will be merged and the coordinator will respond to the client with result r=1, while the correct answer is r=2. The solution proposed in this patch is to attempt to detect cases when the problem may occur and retry queries with a larger limit, which results in replicas providing more information. The detection logic is simple: the partition key and clustering key of the last row in the reconciled result are compared with the partition keys and clustering keys of the last rows of replies from replicas (except short reads). If the (pk, ck) of the replica last row is smaller than the (pk, ck) of the reconciled result the query is retried. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-03-11 18:26:33 +00:00
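
The detection logic described in commit 77dbe3c12f can be illustrated with a minimal sketch. It is not the code from the patch: `row_position`, `replica_reply` and `needs_retry` are hypothetical names, keys are simplified to plain integers compared lexicographically, and replies without any rows are assumed to be skipped.

```cpp
#include <cassert>
#include <optional>
#include <tuple>
#include <vector>

// Hypothetical simplification: a row position is a (pk, ck) pair of ints.
struct row_position {
    int pk;
    int ck;
};

// Lexicographic comparison on (pk, ck).
inline bool operator<(const row_position& a, const row_position& b) {
    return std::tie(a.pk, a.ck) < std::tie(b.pk, b.ck);
}

// Hypothetical summary of one replica's reply.
struct replica_reply {
    std::optional<row_position> last_row; // empty if the reply had no rows
    bool short_read;                      // short reads are excluded from the check
};

// Returns true if any non-short-read reply ends before the last row of the
// reconciled result, i.e. the query should be retried with a larger limit.
inline bool needs_retry(const row_position& reconciled_last,
                        const std::vector<replica_reply>& replies) {
    for (const auto& reply : replies) {
        if (reply.short_read || !reply.last_row) {
            continue; // assumption: replies without rows are skipped as well
        }
        if (*reply.last_row < reconciled_last) {
            return true;
        }
    }
    return false;
}
```

In the scenario from the commit message, node1's reply ends at (p=0, c=1) while the reconciled result ends at (p=0, c=2), so this check would request a retry.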

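The four-step scheme quoted in commit 0af786f3ea (create the temporary file exclusively, confirm the final file is absent, write, rename) can be sketched as follows. This is only an illustration under stated assumptions: `write_file_exclusively` is a hypothetical helper built on POSIX `open()` with `O_EXCL` and `std::filesystem`, not the sstable writer from the patch, and it uses blocking I/O for brevity.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <fcntl.h>
#include <unistd.h>

namespace fs = std::filesystem;

// Hypothetical helper: writes `contents` to `final_path` via `final_path + ".tmp"`.
// Returns false if another writer holds the temporary file or the final file exists.
inline bool write_file_exclusively(const fs::path& final_path, const std::string& contents) {
    fs::path tmp_path = final_path;
    tmp_path += ".tmp";

    // Step 1: create the temporary file; O_EXCL makes this fail if it already exists.
    int fd = ::open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) {
        return false;
    }

    // Step 2: confirm the final file does not exist; otherwise undo step 1 and fail.
    if (fs::exists(final_path)) {
        ::close(fd);
        fs::remove(tmp_path);
        return false;
    }

    // Step 3: continue as usual (here: write the contents).
    ssize_t written = ::write(fd, contents.data(), contents.size());
    ::close(fd);
    if (written != static_cast<ssize_t>(contents.size())) {
        fs::remove(tmp_path);
        return false;
    }

    // Step 4: rename the temporary file to its final name.
    fs::rename(tmp_path, final_path);
    return true;
}
```

Because the temporary file exists from step 1 until step 4, a second writer targeting the same name fails at its own step 1 rather than overwriting anything.
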
8875 Commits