scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 11:10:40 +00:00

Author	SHA1	Message	Date
Paweł Dziepak	7e89dc3bbf	tests/sstables: add storage_service_for_tests to counter write test Writing counters to an sstable is going to require cluster feature information, which requires accessing some singletons.	2017-09-05 13:49:01 +01:00
Paweł Dziepak	2cdcaeba6e	tests/sstables: add test for reading wrong-order counter cells	2017-09-05 13:49:01 +01:00
Paweł Dziepak	55cb0cafa8	sstables: do not expect counter shards to be sorted	2017-09-05 13:49:01 +01:00
Paweł Dziepak	660572e85c	storage_service: introduce CORRECT_COUNTER_ORDER feature Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix this problem a new feature is introduced that will be used to determine when nodes with that bug fixed can start sending counter shards in the correct order.	2017-09-05 13:49:01 +01:00
Paweł Dziepak	b86da0c479	tests/counter: test 1.7.4 compatible shard ordering	2017-09-05 13:49:01 +01:00
Paweł Dziepak	b1b8599b1a	counters: add helper for retrieving shards in 1.7.4 order	2017-09-05 13:49:00 +01:00
Paweł Dziepak	89c037dfc8	tests/counter: add tests for 1.7.4 counter shard order	2017-09-05 13:49:00 +01:00
Paweł Dziepak	25eec66935	counters: add counter id comparator compatible with Scylla 1.7.4	2017-09-05 13:49:00 +01:00
Paweł Dziepak	b5787ca640	tests/counter: verify order of counter shards	2017-09-05 13:49:00 +01:00
Paweł Dziepak	838dbd98ac	tests/counter: add test for sorting and deduplicating shards	2017-09-05 13:49:00 +01:00
Paweł Dziepak	022c2ff53a	counters: add function for sorting and deduplicating counter cells Due to a bug in an implementation of UUID less compare some Scylla versions sort counter shards in an incorrect order. Moreover, when dealing with imported correct data the inconsistencies in ordering caused some counter shards to become duplicated.	2017-09-05 13:49:00 +01:00
Paweł Dziepak	b7c27d73d8	counters: add more comparison operators	2017-09-05 13:49:00 +01:00
Vlad Zolotarov	bdc0ca7064	service::storage_service: initialize auth and tracing after we joined the ring Initialize the system_auth and system_traces keyspaces and their tables after the node joins the token ring, because system_auth initialization is going to issue SELECT and possibly INSERT CQL statements. This patch effectively reverts the `d3b8b67` patch and brings the initialization order to how it was before that patch. Fixes #2273 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com> (cherry picked from commit `e98adb13d5`)	2017-08-30 09:33:33 +02:00
Calle Wilund	34260ce471	utils::UUID: operator< should behave as comparison of hex strings/bytes I.e. need to be unsigned comparison. Message-Id: <1487683665-23426-1-git-send-email-calle@scylladb.com> (cherry picked from commit `0d87f3dd7d`)	2017-08-24 14:18:55 +01:00
Avi Kivity	cffe57bcc7	Merge "repair: Do not allow repair until node is in NORMAL status" from Asias Fixes #2723. * tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev: repair: Do not allow repair until node is in NORMAL status gossip: Add is_normal helper (cherry picked from commit `2f41ed8493`)	2017-08-23 09:45:54 +03:00
Paweł Dziepak	adb9ce7f38	lsa: avoid unnecessary segment migrations during reclaim segment_zone::migrate_all_segments() was trying to migrate all segments inside a zone to the other one hoping that the original one could be completely freed. This was an attempt to optimise for throughput. However, this may unnecessarily hurt latency if the zone is large, but only a few segments are required to satisfy the reclaimer's demands. Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com> (cherry picked from commit `0318dccafd`)	2017-08-22 09:29:05 +02:00
Tomasz Grabiec	5f1fd7a0b1	schema_registry: Ensure schema_ptr is always synced on the other core global_schema_ptr ensures that schema object is replicated to other cores on access. It was replicating the "synced" state as well, but only when the shard didn't know about the schema. It could happen that the other shard has the entry, but it's not yet synced, in which case we would fail to replicate the "synced" state. This will result in exception from mutate(), which rejects attempts to mutate using an unsynced schema. The fix is to always replicate the "synced" state. If the entry is syncing, we will preemptively mark it as synced earlier. The syncing code is already prepared for this. Refs #2617. Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `65c64614aa`)	2017-08-17 17:15:12 +02:00
Avi Kivity	d1f06633e0	Update seastar submodule * seastar a4d924e...949b710 (1): > fstream: do not ignore unresolved future Fixes #2697.	2017-08-16 15:12:45 +03:00
Avi Kivity	b54ea3f6cf	dist: use correct repository for third-party RPMs	2017-08-16 11:24:42 +03:00
Avi Kivity	63fd65414a	Update seastar submodule * seastar e5825b5...a4d924e (1): > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb Fixes #2690	2017-08-14 16:25:03 +03:00
Avi Kivity	9790c2d229	Update seastar submodule * seastar 8d9fd92...e5825b5 (1): > tls: Only recurse once in shutdown code Fixes #2691	2017-08-14 15:12:01 +03:00
Raphael S. Carvalho	7728a8dec5	sstables: close index file when sstable writer fails index's file output stream uses write behind but it's not closed when sstable write fails and that may lead to crash. It happened before for data file (which is obviously easier to reproduce for it) and was fixed by `0977f4fdf8`. Fixes #2673. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com> (cherry picked from commit `dddbd34b52`)	2017-08-08 09:59:10 +03:00
Duarte Nunes	1fd4a3ed34	tests/sstable_mutation_test: Don't use moved-from object Fix a bug introduced in `dbbb9e93d` and exposed by gcc6 by not using a moved-from object. Twice. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170802161033.4213-1-duarte@scylladb.com> (cherry picked from commit `4c9206ba2f`)	2017-08-03 09:46:33 +03:00
Avi Kivity	0b48863a7e	Merge "Ensure correct EOC for PI block cell names" from Duarte "This series ensures the always write correct cell names to promoted index cell blocks, taking into account the eoc of range tombstones. Fixes #2333" * 'pi-cell-name/v1' of github.com:duarten/scylla: tests/sstable_mutation_test: Test promoted index blocks are monotonic sstables: Consider eoc when flushing pi block sstables: Extract out converting bound_kind to eoc (cherry picked from commit `db7329b1cb`)	2017-08-01 18:13:19 +03:00
Gleb Natapov	aec94b926c	cql transport: run accept loop in the foreground It was meant to be run in the foreground since it is waited upon during stop(), but as it is now from the stop() perspective it is completed after first connection is accepted. Fixes #2652 Message-Id: <20170801125558.GS20001@scylladb.com> (cherry picked from commit `1da4d5c5ee`)	2017-08-01 17:07:55 +03:00
Tomasz Grabiec	0ac2c388b6	row_cache: Avoid deadlock/timeout due to sstable read concurrency limit database::make_sstable_reader() creates a reader which will need to obtain a semaphore permit when invoked, so that there is a limit on sstable read concurrency (`edeef03`). Therefore, each read may create at most one such reader in order to be guaranteed to make progress. Otherwise, the creation of the second reader may deadlock (in case of system tables) or timeout (non-system tables), if enough number of such readers tries to do the same thing at the same time. One instance of the problem fixed by this patch is in cache populating reader (`98c12dc`) when we reach partition size limit (max_cached_partition_size_in_kb). In that case population is abandoned and a second read is created, while still keeping the old one alive. We saw this causing deadlocks during schema tables parsing when system.schema_columns contained large partitions. Fixes #2623. Another case when this can potentially happen is when populating readers are recreated by cache. We replace the reader there, but using assignment, so the old reader is still alive when the new one is created. This patch fixes two out of three of such cases. The third one (in a scanning read) is not that easy to fix. That problem doesn't exist in version 2.0 and master, where the cache is reworked for row granularity. Refs #2644. Message-Id: <1501160300-18097-1-git-send-email-tgrabiec@scylladb.com>	2017-08-01 12:10:39 +03:00
Takuya ASADA	09ac5b57aa	dist/redhat: limit metapackage dependencies to specific version of scylla packages When we install scylla metapackage with version (ex: scylla-1.7.1), it just always install newest scylla-server/-jmx/-tools on the repo, instead of installing specified version of packages. To install same version packages with the metapackage, limited dependencies to current package version. Fixes #2642 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20170726193321.7399-1-syuu@scylladb.com> (cherry picked from commit `91a75f141b`)	2017-07-27 14:22:06 +03:00
Shlomi Livne	ff643e3e40	release: prepare for 1.7.4 Signed-off-by: Shlomi Livne <shlomi@scylladb.com> scylla-1.7.4	2017-07-26 17:26:33 +03:00
Asias He	a7b8d89de8	gossip: Fix nr_live_nodes calculation We need to consider the _live_endpoints size. The nr_live_nodes should not be larger than _live_endpoints size, otherwise the loop to collect the live node can run forever. It is a regression introduced in commit `437899909d` (gossip: Talk to more live nodes in each gossip round). Fixes #2637 Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com> (cherry picked from commit `515a744303`)	2017-07-26 16:49:11 +03:00
Duarte Nunes	013fa3da14	schema: Calculate default validator Fixes #2605 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170719105131.21455-3-duarte@scylladb.com>	2017-07-20 10:58:29 +02:00
Duarte Nunes	259cfaf8f9	thrift: Set default validator for static CFs Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170719105131.21455-2-duarte@scylladb.com>	2017-07-20 10:58:29 +02:00
Duarte Nunes	6501bf8e54	schema_tables: Recover comparator type Fixes #2573 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170718125450.3727-1-duarte@scylladb.com>	2017-07-19 10:58:43 +02:00
Pekka Enberg	41b4055911	release: prepare for 1.7.3 scylla-1.7.3	2017-07-18 17:34:46 +03:00
Nadav Har'El	b594f21f91	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means (a) we will never read-ahead beyond last_end, and (b) fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() to a different partition range. We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not help if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170718110643.8667-1-nyh@scylladb.com>	2017-07-18 16:54:11 +03:00
Avi Kivity	bcd2e6249f	dist: tolerate sysctl failures sysctl may fail in a container environment if /proc is not virtualized properly. Fixes #1990 Message-Id: <20170625145930.31619-1-avi@scylladb.com> (cherry picked from commit `08488a75e0`)	2017-07-18 15:47:10 +03:00
Takuya ASADA	4c79add7b0	dist/debian: skip tunables when kernel = 3.13.0--generic, to prevent kernel panic bug There is kernel panic bug on kernel = 3.13.0--generic(Ubuntu 14.04), we have to skip tunables. Fixes #1724 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1493196636-25645-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `abf65cb485`)	2017-07-18 15:47:03 +03:00
Asias He	00f6ccb75d	gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option It is useful for larger clusters with larger gossip message latency. By default the fd_max_interval_ms is 2 seconds, which means the failure_detector will ignore any gossip message update interval larger than 2 seconds. However, in larger clusters, the gossip message update interval can be larger than 2 seconds. Fixes #2603. Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com> (cherry picked from commit `adc5f0bd21`)	2017-07-17 13:30:34 +03:00
Avi Kivity	77ac5a63db	Update seastar submodule * seastar fc69677...8d9fd92 (1): > rpc: start server's send loop only after protocol negotiation Fixes #2600.	2017-07-17 10:43:12 +03:00
Pekka Enberg	eb9de1a807	Merge "Repair backport for 1.7 branch" from Asias "This series backports all the repair related fixes to enterprise branch and updates the scylla_repair to send ranges to repair to all the shards in parallel, independently. With this series, repair can utilize all the CPUs and is much more efficient." * tag 'asias/repair-backport-branch-1.7.3-v1' of github.com:cloudius-systems/seastar-dev: repair: Use selective_token_range_sharder tests: Add test_selective_token_range_sharder dht: Add selective_token_range_sharder repair: further limit parallelism of checksum calculation repair: Do not store the failed ranges repair: Prefer nodes in local dc when streaming repair: Repair on all shards repair: Allow one stream plan in flight	2017-07-14 13:02:26 +03:00
Duarte Nunes	643a777067	storage_proxy: Preserve replica order across mutations In storage_proxy we arrange the mutations sent by the replicas in a vector of vectors, such that each row corresponds to a partition key and each column contains the mutation, possibly empty, as sent by a particular replica. There is reconciliation-related code that assumes that all the mutations sent by a particular replica can be found in a single column, but that isn't guaranteed by the way we initially arrange the mutations. This patch fixes this and enforces the expected order. Fixes #2531 Fixes #2593 Signed-off-by: Gleb Natapov <gleb@scylladb.com> Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170713162014.15343-1-duarte@scylladb.com> (cherry picked from commit `b8235f2e88`)	2017-07-14 12:12:09 +03:00
Avi Kivity	6f91939650	Update seastar submodule * seastar 8e2f629...fc69677 (1): > tls: Wrap all IO in semaphore (Fixes #2575)	2017-07-12 10:24:04 +03:00
Gleb Natapov	15da71266d	consistency_level: report less live endpoints in Unavailable exception if there are pending nodes DowngradingConsistencyRetryPolicy uses live replicas count from Unavailable exception to adjust CL for retry, but when there are pending nodes CL is increased internally by a coordinator and that may prevent retried query from succeeding. Adjust live replica count in case of pending node presence so that retried query will be able to proceed. Fixes #2535 Message-Id: <20170710085238.GY2324@scylladb.com> (cherry picked from commit `739dd878e3`)	2017-07-11 17:16:58 +03:00
Botond Dénes	9cd36ade00	Fix crash in the out-of-order restrictions error msg composition Use name of the existing preceding column with restriction (last_column) instead of assuming that the column right after the current column already has restrictions. This will yield an error message that is different from that of Cassandra, albeit still a correct one. Fixes #2421 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com> (cherry picked from commit `33bc62a9cf`)	2017-07-11 17:16:01 +03:00
Asias He	6f58a1372e	repair: Use selective_token_range_sharder With this change, we ask all the shards to handle the ranges provided by the user, and we use selective_token_range_sharder to split the ranges and ignore the ranges that do not belong to the current shard. (cherry picked from commit `b10e961a64`) Conflicts: repair/repair.cc	2017-07-11 08:40:49 +08:00
Asias He	0a9d26de4a	tests: Add test_selective_token_range_sharder (cherry picked from commit `2a794db61b`)	2017-07-11 08:40:49 +08:00
Asias He	35cd63e1f7	dht: Add selective_token_range_sharder It is like ring_position_range_sharder but it works with dht::token_range. This sharder will return the ranges belonging to a selected shard. (cherry picked from commit `d835cf2748`)	2017-07-11 08:40:49 +08:00
Nadav Har'El	2ada799e07	repair: further limit parallelism of checksum calculation Repair today has a semaphore limiting the number of ongoing checksum comparisons running in parallel (on one shard) to 100. We needed this number to be fairly high, because a "checksum comparison" can involve high latency operations - namely, sending an RPC request to another node in a remote DC and waiting for it to calculate a checksum there, and while waiting for a response we need to proceed calculating checksums in parallel. But as a consequence, in the current code, we can end up with as many as 100 fibers all at the same stage of reading partitions to checksum from sstables. This requires tons of memory, to hold at least 128K of buffer (even more with read-ahead) for each of these fibers, plus partition data for each. But doing 100 reads in parallel is pointless - one (or very few) should be enough. So this patch adds another semaphore to limit the number of checksum calculations (including the read and checksum calculation) on each shard to just 2. There may still be 100 ongoing checksum comparisons, in other stages of the comparisons (sending the checksum requests to other and waiting for them to return), but only 2 will ever be in the stage of reading from disk and checksumming them. The limit of 2 checksum calculations (per shard) applies on the repair slave, not just to the master: The slave may receive many checksum requests in parallel, but will only actually work on 2 at a time. Because the parallelism=100 now rate-limits operations which use very little memory, in the future we can safely increase it even more, to support situations where the disk is very fast but the link between nodes has very high latency. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170703151329.25716-1-nyh@scylladb.com> (cherry picked from commit `d177ec05cb`)	2017-07-11 08:40:49 +08:00
Asias He	b71037ac55	repair: Do not store the failed ranges The number of failed ranges can be large so it can consume a lot of memory. We already logged the failed ranges in the log. No need to store them in memory. Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com> (cherry picked from commit `b2a2fbcf73`) Conflicts: repair/repair.cc	2017-07-11 08:40:49 +08:00
Asias He	8639f32efd	repair: Prefer nodes in local dc when streaming When peer nodes have the same partition data, i.e., with the same checksum, we currently choose to stream from any of them randomly. To improve streaming performance, select the peer within the same DC. This patch is supposed to improve repair performance with multiple DCs. Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com> (cherry picked from commit `cc02a62756`)	2017-07-11 08:40:48 +08:00
Asias He	a0dce7c922	repair: Repair on all shards Currently, shard zero is the coordinator of the repair. All the work of checksumming of the local node and sending of the repair checksum rpc verb is done on shard zero only. This causes other shards to be underutilized. With this patch, we split the ranges that need to be repaired into at least smp::count ranges, so sizeof(ranges) / smp::count will be assigned to each shard. For example, if we have 8 shards and 256 ranges, each shard will repair 32 ranges. Each shard will repair the 32 ranges sequentially. There will be at most 8 (smp::count) ranges of repair in parallel. (cherry picked from commit `47345078ec`) Conflicts: repair/repair.cc	2017-07-11 08:40:48 +08:00
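
Several entries above (`34260ce471`, `022c2ff53a`, `25eec66935`) concern UUID ordering: the `utils::UUID` fix states that `operator<` must behave like a comparison of hex strings or bytes, that is, an unsigned comparison. The following Python sketch illustrates how a signed comparison can disagree with a byte-wise one. It is not Scylla code. It assumes the UUID is compared as two 64-bit halves, and `signed_key` and `unsigned_key` are hypothetical helper names introduced only for this illustration.

```python
import uuid


def signed_key(u: uuid.UUID):
    """Sort key that reads each 64-bit half as a SIGNED integer.

    Assumption for illustration: this models the pre-fix behaviour.
    """
    raw = u.bytes
    return (
        int.from_bytes(raw[:8], "big", signed=True),
        int.from_bytes(raw[8:], "big", signed=True),
    )


def unsigned_key(u: uuid.UUID):
    """Sort key equal to comparing the 16 raw bytes (same as hex strings)."""
    return u.bytes


# Two ids that differ in the top bit of the most significant half.
a = uuid.UUID("00000000-0000-0000-0000-000000000001")
b = uuid.UUID("80000000-0000-0000-0000-000000000000")

signed_order = sorted([a, b], key=signed_key)      # b first: top bit reads as negative
unsigned_order = sorted([a, b], key=unsigned_key)  # a first: byte-wise order
```

Any id whose most significant bit is set sorts before all others under the signed key, which is the kind of disagreement that would let two nodes order the same counter shards differently.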


11439 Commits
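
The commit `2ada799e07` listed above describes two nested limits in repair: up to 100 checksum comparisons in flight per shard, of which only 2 may be in the disk-reading and checksumming stage at once. A minimal Python `asyncio` sketch of that two-semaphore pattern follows. It is a simplified model, not the Seastar implementation: `run_repair`, `local_checksum`, `remote_checksum`, and the placeholder checksum arithmetic are all hypothetical.

```python
import asyncio


async def run_repair(num_ranges, comparison_limit=100, calculation_limit=2):
    # Outer limit: comparisons in flight (mostly waiting on remote peers).
    comparisons = asyncio.Semaphore(comparison_limit)
    # Inner limit: memory-heavy read-and-checksum stage.
    calculations = asyncio.Semaphore(calculation_limit)
    active = 0  # calculations currently running
    peak = 0    # highest value of `active` observed

    async def local_checksum(r):
        nonlocal active, peak
        async with calculations:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # stands in for reading sstables and hashing
            active -= 1
            return r * 31           # placeholder checksum

    async def remote_checksum(r):
        await asyncio.sleep(0.001)  # stands in for a high-latency RPC
        return r * 31               # placeholder checksum

    async def compare(r):
        async with comparisons:
            local, remote = await asyncio.gather(
                local_checksum(r), remote_checksum(r)
            )
            return local == remote

    results = await asyncio.gather(*(compare(r) for r in range(num_ranges)))
    return peak, results


peak, results = asyncio.run(run_repair(50))
```

Because the outer semaphore only gates tasks that are mostly waiting, its limit can stay high without increasing buffer memory, which is the trade-off the commit message describes.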