scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 14:33:08 +00:00

Author	SHA1	Message	Date
Dejan Mircevski	28e92dfa4c	cql3: Pass schema reference not pointer ... to get_single_column_clustering_bounds(). No need for the pointer; a reference is simpler and cleaner. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	de17b5449b	cql3: Replace count_if with find_atom count_if finds all matching atoms, which is redundant when we only want to find one. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	3a149daab5	cql3: Fix _partition_range_is_simple calculation Was updated for every restriction instead of only for partition ones. The only impact is on performance. The bug was introduced in `4661aa0` "cql3: Track IN partition-key restrictions". Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:31:33 +02:00
Dejan Mircevski	53f376b83f	cql3: Add test cases for indexed partition column We didn't have a case when a global index exists on a partition column and the SELECT statement specifies the full partition key. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-07-02 17:28:56 +02:00
Takuya ASADA	edd54a9463	reloc: add arch to relocatable package filename Add architecture name for relocatable packages, to support distributing both x86_64 version and aarch64 version. Also create symlink from new filename to old filename to keep compatibility with older scripts. Fixes #8675 Closes #8709 [update tools/python3 submodule: * tools/python3 ad04e8e...afe2e7f (1): > reloc: add arch to relocatable package filename ]	2021-06-28 15:01:09 +03:00
Avi Kivity	f660726773	Update seastar submodule * seastar 0e48ba883...eaa00e761 (3): > memory: reduce statistics TLS initialization even more > Merge "Sanitize io-topology creation on start" from Pavel E > doc/prometheus: note that metric family is passed by query name	2021-06-28 11:52:36 +03:00
Botond Dénes	09309f5dbf	reader_concurrency_semaphore: on_permit_created(): remove noexcept The permit creation path enters the semaphore's permit gate in on_permit_created(). Entering this gate can throw so this method is not noexcept. Remove the noexcept specifier accordingly. Also enter the gate before adding the permit to the permit list, to save some work when this fails. Fixes: #8933 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210628074941.32878-1-bdenes@scylladb.com>	2021-06-28 11:04:38 +03:00
Avi Kivity	c0c1e26014	Merge 'Remove code writing LA/KA sstables' from Piotr Jastrzębski Now that all supported versions write mc/md sstables, we can deprecate the MC_SSTABLE feature bit and consider it implicitly true, and with it the ability to write la/ka sstables. We still need to support reading them, e.g. from restoring old snapshots or migrating data from legacy clusters. Test: unit(dev, debug) Fixes #8352 Closes #8884 * github.com:scylladb/scylla: compress: Remove unused make_compressed_file_k_l_format_output_stream sstables: move sstable_writer to separate header sstable_writer: remove get_metadata_collector sstables: stop including metadata_collector.hh in sstables.hh sstables: Remove duplicated friend declaration sstables: remove unused KL writer sstables: Always use MC/MD writer sstable_datafile_test: switch tests to use latest sstables format sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version test_offstrategy_sstable_compaction: test all writable sstables compaction_with_fully_expired_table: Remove some LA specific code sstable_mutation_test: test latest sstable format instead of LA sstable_test: Test MX sstables instead of KA/LA sstable_datafile_test: Fix schema used by check_compacted_sstables sstables: Remove LA/KA sstable writting tests that check exact format sstables: define writable_sstable_versions features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled	2021-06-27 20:50:51 +03:00
Avi Kivity	121971ec0f	Merge "storage_proxy: specialize query_singular() for non-IN queries" from Gleb " query_singular() accepts a partition_range_vector, corresponding to an IN query. But such queries are rare compared to single-partition queries. Co-routinise the code and special case non-IN queries by avoiding the call to map_reduce. Also replace executers array with small_vector to avoid an allocation in the common case. perf_simple_query --smp 1 --operations-per-shard 1000000 --task-quota-ms 10: before: median 204545.04 tps ( 81.1 allocs/op, 15.1 tasks/op, 48828 insns/op) after: median 219769.97 tps ( 74.1 allocs/op, 12.1 tasks/op, 46495 insns/op) So, a ~7% improvement in tps and 5% improvement in instructions per op. Also large reduction in tasks and allocations. This is an alternative proposal to https://github.com/scylladb/scylla/pull/8909. The benefit of this one is that it does not duplicate any code (almost). " * 'query_singular-coroutine' of github.com:scylladb/scylla-dev: storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried storage_proxy: use small_vector in storage_proxy::query_singular to store executors storage_proxy: co-routinize storage_proxy::query_singular()	2021-06-27 16:30:19 +03:00
Piotr Jastrzebski	10228b35c5	compress: Remove unused make_compressed_file_k_l_format_output_stream Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	430fd5cfa9	sstables: move sstable_writer to separate header This class is used in only few places and does not have to be included everywhere sstable class is needed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	9e7144f719	sstable_writer: remove get_metadata_collector This function is only called internally so it does not have to be exposed and can be inlined instead. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	2d6608bb88	sstables: stop including metadata_collector.hh in sstables.hh metadata collector is rarely used so it's better to include it only in those few places. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:31 +02:00
Piotr Jastrzebski	39851f76fc	sstables: Remove duplicated friend declaration Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	c7096470bf	sstables: remove unused KL writer Previous two patches removed the usage of KL writer so the code is now dead and can be safely removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	8293384189	sstables: Always use MC/MD writer Previous patch made MC the lowest sstables format in use so the removed check is always true now and we can remove the else part. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	314bc0e8a5	sstable_datafile_test: switch tests to use latest sstables format instead of LA. Ability to write LA and KA sstables will be removed by the following patches so we need to switch all the tests to write newer sstables. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	f03ed9b9a7	sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:30 +02:00
Piotr Jastrzebski	1ed298b08b	test_offstrategy_sstable_compaction: test all writable sstables Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-27 15:12:12 +02:00
Gleb Natapov	e5154e77d3	storage_proxy: avoid map_reduce in storage_proxy::query_singular if only one pk is queried Not using IN queries is a common case, so avoid map_reduce overhead for them.	2021-06-27 14:49:44 +03:00
Gleb Natapov	8018eb4612	storage_proxy: use small_vector in storage_proxy::query_singular to store executors Having only one pk to query is the common case, so avoid allocating executor vector for that case.	2021-06-27 14:48:15 +03:00
Gleb Natapov	d908912dbd	storage_proxy: co-routinize storage_proxy::query_singular() The code becomes much easier to understand and allocation for used_replicas can be dropped.	2021-06-27 14:47:19 +03:00
Avi Kivity	b7cb687d36	Merge "Harden storage_service::stop_transport" from Pavel E " Stopping transport (cql, thrift, messaging, etc.) can happen from several places -- drain, decommission, stop, isolation. Some of them can even run in parallel. This patch makes transport stopping bulletproof. tests: unit(dev) start-stop (dev) start-drain-stop (dev) fixes: #8911 " * 'br-stop-transport-races' of https://github.com/xemul/scylla: storage_service: Indentation fix storage_service: Make stop_transport re-entrable storage_service: Stop transport on decommission	2021-06-27 14:46:46 +03:00
Pavel Emelyanov	7014da9404	storage_service: Unregister disk error handlers on stop Storage service install disk error handlers in constructor and these connections are not unregistered. It's not a problem in real life, because storage service is not stopped, but in some tests this can lead to use-after-frees. The sstables_datafile_test runs some of the testcases in cql_test_env which starts and (!) stops the storage service. Other testcases are run in a lightweight sstables_test_env which does not mess with the storage service at all. Now, if a case of the 2nd kind is run after the one of the 1st and (for whatever reason) generates a disk error it will trigger use-after-free -- after the 1st testcase the storage service disk error registration would remain, but the storage service itself would already be stopped, thus triggering the disk error will try to access stopped sharded storage service inside the .isolate(). The fix is to keep the scoped connection on the storage service list of various listeners. On stop it will go away automagically. tests: unit(dev), sstables_datafile_test with forced disk error Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210625062648.27812-1-xemul@scylladb.com>	2021-06-27 14:41:55 +03:00
Avi Kivity	6676ceabde	Merge 'Prevent reactor stall in utils::merge_to_gently' from Benny Halevy std::copy_if runs without yielding. See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480 Also, eliminate extraneous loop on merge first1 will point to the inserted value which is a copy of first2. Since list2 is sorted in ascending order, the next item from list2 will never be less than the one we've just inserted, so we waste an iteration to merely increment first1 again. Fixes #8897 Test: unit(dev), stall_free_test(debug) DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev) Closes #8925 github.com:scylladb/scylla: utils: merge_to_gently: eliminate extraneous loop on merge utils: merge_to_gently: prevent stall in std::copy_if	2021-06-27 13:56:32 +03:00
Raphael S. Carvalho	29c93ae592	compaction: Reduce backlog of compacting SSTable properly It was observed that as compaction progresses the backlog of compacting SSTable is being reduced very slowly, which causes shares to be higher than needed, and consequently compaction acts much more aggressively than it has to. https://user-images.githubusercontent.com/1409139/120237819-93dfc080-c232-11eb-9042-68114e285ea0.png The graph above shows the amount of backlog that is reduced from a SSTable being compacted. The red line denotes the total backlog of the SSTable, before it's selected for compaction. The expectation is that the more a SSTable is compacted the more backlog will be reduced from it. However, in the current implementation, it can be seen that the backlog to be reduced, from the SSTable being compacted, starts being inversely proportional to the amount of data already compacted. Turns out that this problem happens because the implementation of backlog formula becomes incorrect when the SSTable is being compacted. Backlog for a sstable is currently defined as: Bi = Ei * log (T / Ei) where Ei = Si - Ci (bytes left to be compacted) and Si = size of SStable and Ci = total bytes compacted and T = total size of table The formula above can also be rewritten as follows: Bi = Ei * log (T) - Ei * log (Ei) the second term `Ei * log (Ei)` can be rewritten as: = (Si - Ci) * log (Ei) = Si * log (Ei) - Ci * log (Ei) However, digging backlog implementation, turns out that we're incorrectly implementing that second term as: = Si * log (Si) - Ci * log (Ei) Given that Si > Ei, for a SSTable being compacted, the backlog will be higher than it should. the following table shows how the backlog of a SSTable being compacted behaves now versus how it's supposed to behave: https://gist.github.com/raphaelsc/42e14be0d7d4ed264e538c2d217c8f95 Turns out that this is not the only problem. It was a mistake to change the formula from `Ei * log(T / Si)` to `Ei * log(T / Ei)`, when fixing the shrinking table issue, because that also causes the backlog of a compacting SSTable to be incorrectly reduced. With the formula rewritten as follows: Bi = Ei * log (T) - Ei * log (Ei) It becomes clear that the more a SSTable is compacted, the slower it becomes for backlog to be reduced, as T / Ei can increase considerably over time. So we're reverting the formula back to `Ei * log(T / Si)`. The graph below shows a better backlog behavior when table is shrinking: https://user-images.githubusercontent.com/1409139/123495186-06a54700-d5f9-11eb-9386-3fcf4dd8e4d3.png While analyzing the problem when table is shrinking, realized that it's because T in the formula is implemented as the effective size (total + partial - compacted). With the new formula rewritten as follows: Bi = Ei * log (T) - Ei * log (Si) It becomes clearer that T cannot be lower than Si whatsoever, otherwise the backlog becomes negative. Also, while table is shrinking, it can happen that the backlog will be so low that compaction will barely make any progress. To fix both issues, let's implement T as total size (sum of all Si) rather than effective size (sum of all Ei). The graph below shows that this change prevents the backlog from going negative while still providing similar and expected behavior as before, see: https://user-images.githubusercontent.com/1409139/123495185-060cb080-d5f9-11eb-89f7-ed445729702a.png Fixes #8768. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210626003133.3011007-1-raphaelsc@scylladb.com>	2021-06-27 11:43:48 +03:00
Pavel Emelyanov	a89ae9a8e7	storage_service: Indentation fix Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-25 13:21:10 +03:00
Pavel Emelyanov	bd2a58060e	storage_service: Make stop_transport re-entrable It may happen that disk error opccurs and subsequent isolation runs in parallel with drain or decommission or shutdown. In this case the stop_transport method would be running two times in parallel. Also the drain after decommission is not disabled, so it may happen that stop_transport will be called two times in a row (however -- not in parallel). Using shared_promise solves all the possible reentrances. (the indentation is deliberately left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-25 13:18:43 +03:00
Pavel Emelyanov	b0199b005d	storage_service: Stop transport on decommission The stop_transport sequence is: - stop client services (cql, thrift, alternator) - stop gossiping - stop messaging - stop stream manager The decommissioning goes very similarly - stop client services - stop batchlog manager - stop gossiping - stop messaging So this change makes decommission stop all networking _before_ batchlog, like it's already done on drain, and additionally stop the streaming manager. This change is prerequisite for fixing race between transport stop and isolation (on disk error). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-25 13:15:38 +03:00
Piotr Jastrzebski	995eb8c274	compaction_with_fully_expired_table: Remove some LA specific code Following patches will switch all sstable writing tests to use the latest sstables format. compaction_with_fully_expired_table contains some test for a LA specific behaviour so let's remove it to make the switch possible. For more context see https://github.com/scylladb/scylla/issues/2620 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	8ff37bec17	sstable_mutation_test: test latest sstable format instead of LA Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	80f8f970e9	sstable_test: Test MX sstables instead of KA/LA Replace calls to make_compressed_file_k_l_format_input_stream with calls to make_compressed_file_m_format_input_stream. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	131a0babc0	sstable_datafile_test: Fix schema used by check_compacted_sstables check_compacted_sstables is used in compact_02 test which uses sstables created by compact_sstables. The problem is that schema used in check_compacted_sstables and compact_sstables is not the same. The type of r1 column is different. This was not a problem when the test was running on LA sstables but following patches will switch all the tests to use MC and then sstable schema becomes validated when reading the sstable and the test will fail such validation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	680e341f54	sstables: Remove LA/KA sstable writting tests that check exact format Those tests check that created sstables have exactly the expected bytes inside. This won't work with other sstable formats and writting LA/KA sstables will be removed by the following patches so there's nothing we can do with those tests but to remove them. Otherwise they will be failing after LA/KA writting capability is removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	2bd6ad1e2f	sstables: define writable_sstable_versions and use it instead of all_sstable_versions in tests that check writting of sstables. Following patches remove LA/KA writer so we want tests to be ready for that and not break by trying to write LA/KA sstables. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Piotr Jastrzebski	1bdcef6890	features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled These features have been around for over 2 years and every reasonable deployment should have them enabled. The only case when those features could be not enabled is when the user has used enable_sstables_mc_format config flag to disable MC sstable format. This case has been eliminated by removing enable_sstables_mc_format config flag. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-25 10:12:00 +02:00
Pavel Emelyanov	3552e99ce7	scylla-gdb: Bring scylla netw back to work The netw command tries to access the netw::_the_messaging_service that was removed long ago. The correct place for the messaging service is in debug:: namespace. The scylla-gdb test checks that, but the netw command sees that the ptr in question is not initialized, thinks it's not yet sharded::start()-ed and exits without errors. tests: unit(gdb) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210624135107.12375-1-xemul@scylladb.com>	2021-06-24 20:59:27 +03:00
Nadav Har'El	4d7f55a29f	cql: add configurable restriction of DateTieredCompactionStrategy DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time (users should use TimeWindowCompactionStrategy, TWCS, instead). This patch adds a new configuration option - restrict_dtcs - which can be used to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE statements. This is part of a "safe mode" effort to allow an installation to restrict operations which are un-recommended or dangerous. The new restrict_dtcs option has three values: "true", "false", and "warn": For the time being, "false" is still the default, and means DTCS is not restricted and can still be used freely. We can easily change this default in a followup patch. Setting a value of "true" means that DTCS is restricted - trying to create a a table or alter a table with it will fail with an error. Setting a value of "warn" will allow the create or alter operation, but will warn the user - both with a warning message which will immediately appear in cqlsh (for example), and with a log message. Fixes #8914. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210624122411.435361-1-nyh@scylladb.com>	2021-06-24 20:59:27 +03:00
Benny Halevy	b96eeaefe4	utils: merge_to_gently: eliminate extraneous loop on merge first1 will point to the inserted value which is a copy of first2. Since list2 is sorted in ascending order, the next item from list2 will never be less than the one we've just inserted, so we waste an iteration to merely increment first1 again. Note that the standard states that no iterators or references are invalidated on insert so we can safely keep looking at `first1` after inserting a copy of `first2` before it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-24 14:58:12 +03:00
Benny Halevy	453e7c8795	utils: merge_to_gently: prevent stall in std::copy_if std::copy_if runs without yielding. See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480 Note that the standard states that no iterators or references are invalidated on insert so we can keep inserting before last1 when merging the remainder of list2 at the tail of list1. Fixes #8897 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-24 14:47:25 +03:00
Benny Halevy	9bbe7b1482	sstables: mx_sstable_mutation_reader: enforce timeout Check if the timeout has expired before issuing I/O. Note that the sstable reader input_stream is not closed when the timeout is detected. The reader must be closed anyhow after the error bubbles up the chain of readers and before the reader is destroyed. This might already happen if the reader times out while waiting for reader_concurrency_semaphore admission. Test: unit(dev), auth_test.test_alter_with_timeouts(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210624073232.551735-1-bhalevy@scylladb.com>	2021-06-24 12:26:57 +02:00
Kamil Braun	a3f3563828	storage_service: check for existing normal token owners before bootstrapping The bootstrap procedure starts by "waiting for range setup", which means waiting for a time interval specified by the `ring_delay` parameter (30s by default) so the node can receive the tokens of other nodes before introducing its own tokens. However it may sometimes happen that the node doesn't receive the tokens. There are no explicit checks for this. But the code may crash in weird ways if the tokens-received assuption is false, and we are lucky if it does crash (instead of, for example, allowing the node to incorrectly bootstrap, causing data loss in the process). Introduce an explicit check-and-throw-if-false: a bootstrapping node now checks that there's at least one NORMAL token in the token ring, which means that it had to have contacted at least one existing node in the cluster, which means that it received the gossip application states of all nodes from that node; in particular the tokens of all nodes. Also add an assert in CDC code which relies on that assumption (and would cause weird division-by-zero errors if the assumption was false; better to crash on assert than this). Ref #8889. Closes #8896	2021-06-24 13:19:08 +03:00
Asias He	2ad8fb756e	gossip: Promote gossip quarantine over log to info level 1) Start node n1, n2, n3 2) Bootstrap n4 and kill n4 in the middle of bootstrap 3) Wipe data on n4 and start n4 again After step 2, n1, n2 and n3 will remove n4 from gossip after fat_client_timeout and put n4 in quarantine for quarantine_delay(). If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2 and n3 will ignore gossip updates from n4, and n4 will not learn gossip updates from the cluster. After PR #8896, the bootstrap will be rejected. This patch promotes the gossip quarantine over log to info level, so that dtest can wait for the log to bootstrap the node again. Refs #8889 Refs #8890 Closes #8905	2021-06-24 12:51:32 +03:00
Michael Livshin	9b9efb2b42	disable caching of the system.large_* tables The cache of system.large_{partition,rows,cells} accumulates range tombstones (https://github.com/scylladb/scylla/issues/7750), and those range tombstones can be evicted only together with their partition (https://github.com/scylladb/scylla/issues/3288). Making the system.large_* tables uncached should work around the problem until #3288 is fixed. Fixes #8874 Refs #7750 Refs #3288 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>	2021-06-24 12:26:45 +03:00
Piotr Sarna	ae9e52a774	Merge 'Cleanup and improvements for docs/alternator/alternator.md' from Nadav Har'El Make some improvements to docs/alternator.md as suggested by a user who had trouble understanding the previous version, and also a few other random cleanups. Closes #8910 * github.com:scylladb/scylla: docs/alternator/alternator.md: improve "Running Alternator" section docs/alternator/alternator.md: correct minor typos docs/alternator/alternator.md: fix link format	2021-06-24 12:03:26 +03:00
Avi Kivity	14252c8b71	Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695 ) (v3)' from Calle Wilund Fixes #8270 If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall. We need to take a few things into account: 1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here. 2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task. 3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit. 4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to issue segment prunes from end_flush() so we wake up actual file deletion/recycling 5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate 6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above. 7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way. Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything). New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test. Closes #8764 * github.com:scylladb/scylla: commitlog_test: Add test case for usage/disk size threshold mismatch commitlog_test: Improve test assertion commitlog: Add waitable future for background sync/flush commitlog: abort queues on shutdown commitlog: break out "abort" calls into member functions commitlog: Do explicit discard+delete in shutdown commitlog: Recycle or not should not depend on shutdown state commitlog: Issue discard_unused_segments on segment::flush end IFF deletable commitlog: Flush all segments if we only have one. commitlog: Always force flush if segment allocation is waiting commitlog: Include segment wasted (slack) size in footprint check commitlog: Adjust (lower) usage threshold	2021-06-24 12:03:26 +03:00
Pavel Emelyanov	a61afe8421	btree: Improve unlink_leftmost_without_rebalance() The helper is used to walk the tree key-by-key destroying it in the mean time. Current implementation of this method just uses the "regular" erasing code which actually rebalances the tree despite the name. The biggest problem with removing the rebalancing is that at some point non-balanced tree may have the left-most key on an inner node, so to make 100% rebalance-less unlink every other method of the tree would have to be prepared for that. However, there's an option to make "light rebalance" (as it's called in this patch) that only maintains this crucial property of the tree -- the left-most key is on the leaf. Some more tech details. Current rebalancer starts when the node population falls below 1/2 of its capacity and tries to - grab a key from one of the siblings if it's balanced - merge two siblings together if they are small enough The light rebalance is lighter in two ways. First, it leaves the node unbalanced until it becomes empty. And then it goes ahead and replaces it with the next sibling. This change removes ~60% of the keys movements on random test. Keys still move when the sibling replace happens because in this case the separation key needs to be placed at the right sibling 0 position which means shifting all its keys right. tests: unit(debug) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210623083836.27491-1-xemul@scylladb.com>	2021-06-24 12:03:26 +03:00
Raphael S. Carvalho	ab9d08d80e	sstables: Remove unused filtering reader from sstable_set::make_local_shard_sstable_reader() It's been a long time since table no longer accepts shared sstables, so this code which creates a filtering reader, if sstable is shared, is never used. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210618200026.1002621-2-raphaelsc@scylladb.com>	2021-06-24 12:03:26 +03:00
Raphael S. Carvalho	88119a5c81	distributed_loader: Kill table's _sstables_opened_but_not_loaded _sstables_opened_but_not_loaded was needed because the old loader would open sstables from all shards before loading them. In the new loader, introduced with reshape, make_sstables_available() is called on each shard after resharding and reshape finished, so there's no need whatsoever for that mess. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210618200026.1002621-1-raphaelsc@scylladb.com>	2021-06-24 12:03:26 +03:00
Tomasz Grabiec	ee28eb4100	Merge "test: raft: move some tests to `raft` folder" from Pavel Solodovnikov Move `raft_sys_table_storage_test` and `raft_address_map_test` to `test/raft` folder since they naturally belong here, not in `test/boost` folder. Tests: unit(dev) * manmanson/move_some_raft_tests_to_raft_folder: test: raft: move `raft_address_map_test` to `raft` folder test: raft: move `raft_sys_table_storage_test` to `raft` folder configure: add extended raft testing dependencies	2021-06-24 12:03:26 +03:00

1 2 3 4 5 ...

27122 Commits