scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 00:50:35 +00:00

Author	SHA1	Message	Date
Eliran Sinvani	dd74556ad9	service/qos: adding service level table to the distributed keyspace This patch adds the service level table and functions to manipulate it to the distributed keyspace. Message-Id: <b6cb7f311ac1ee6802d8f3d78eac9cf40fe21f68.1609161341.git.sarna@scylladb.com>	2021-04-12 15:58:09 +02:00
Benny Halevy	705f9c4f79	commitlog: segment_manager: max_size must be aligned This was triggered by the test_total_space_limit_of_commitlog dtest. When it passes a very large commitlog_segment_size_in_mb (1/6th of the free memory size, in MB), the segment_manager constructor limits max_size to std::numeric_limits<position_type>::max() which is 0xffffffff. This causes allocate_segment_ex to loop forever when writing the segment file since `dma_write` returns 0 when the count is unaligned (seen 4095). The fix here is to select a slightly smaller max_size that is aligned down to a multiple of 1MB. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>	2021-04-11 13:17:50 +03:00
Pavel Emelyanov	70c851e69b	view: Don't expect int from position_in_partition::tri_compare Now it's int, but soon will be std::strong_ordering. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 18:20:39 +03:00
Piotr Sarna	8e808a56d2	Merge 'commitlog: Fix race and edge condition in delete_segments' from Calle Wilund Fixes #8363 Fixes #8376 Delete segments has two issues when running with size-limited commit log and strict adherence to said limit. 1.) It uses parallel processing, with deferral. This means that the disk usage variables it looks at might not be fully valid - i.e. we might have already issued a file delete that will reduce disk footprint such that a segment could instead be recycled, but since vars are (and should) only updated _post_ delete, we don't know. 2.) It does not take into account edge conditions, when we only delete a single segment, and this segment is the border segment - i.e. the one pushing us over the limit, yet allocation is desperately waiting for recycling. In this case we should allow it to live on, and assume that next delete will reduce footprint. Note: to ensure exact size limit, make sure total size is a multiple of segment size. If we had an error in recycling (disk rename?), and no elements are available, we could have waiters hoping they will get segments. Abort the queue (not permanent, but wakes up waiters), and let them retry. Since we did deletions instead, disk footprint should allow for new allocs at least. Or more likely, everything is broken, but we will at least make more noise. Closes #8372 * github.com:scylladb/scylla: commitlog: Add signalling to recycle queue iff we fail to recycle commitlog: Fix race and edge condition in delete_segments commitlog: coroutinize delete_segments commitlog_test: Add test for deadlock in recycle waiter	2021-04-07 15:13:25 +02:00
Nadav Har'El	0dd6f2db8f	Merge 'CDC generations: refactors and improvements' from Kamil Braun The "most important" major changes are: 1. storage_service: simplify CDC generation management during node replace Previously, when node A replaced node B, it would obtain B's generation timestamp from its application state (gossiped by other nodes) and start gossiping it immediately on bootstrap. But that's not necessary: - if this is the timestamp of the last (current) generation, we would obtain it from other nodes anyway (every node gossips the last known timestamp), - if this is the timestamp of an earlier generation, we would forget it immediately and start gossiping the last timestamp (obtained from other nodes). This commit simplifies the bootstrap code (in node-replace case) a bit: the replacing node no longer attempts to retrieve the CDC generation timestamp from the node being replaced. 2. tree-wide: introduce cdc::generation_id type Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. 
This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: - if a piece of code doesn't care about the timestamp, it just passes the ID around - if it does care, it can access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. 3. cdc: handle missing generation case in check_and_repair_cdc_streams check_and_repair_cdc_streams assumed that there is always at least one generation being gossiped by at least one of the nodes. Otherwise it would enter undefined behavior. I'm not aware of any "real" scenario where this assumption wouldn't be satisfied at the moment where check_and_repair_cdc_streams makes it except perhaps some theoretical races. But it's best to stay on the safe side. --- Additionally the PR does some simplifications, stylistic improvements, removes some dead code, coroutinizes some functions, uncoroutinizes others (due to miscompiles), adds additional logging, updates some stale comments. Read commit messages for more details. Closes #8283 * github.com:scylladb/scylla: cdc: log a message when creating a new CDC generation cdc: handle missing generation case in check_and_repair_cdc_streams tree-wide: introduce cdc::generation_id type tree-wide: rename "cdc streams timestamp" to "cdc generation id" cdc: remove some functions from generation.hh storage_service: make set_gossip_tokens a static free-function db: system_keyspace: group cdc functions in single place cdc: get rid of "get_local_streams_timestamp" sys_dist_ks: update comment at quorum_if_many storage_service: simplify CDC generation management during node replace	2021-04-07 14:49:02 +03:00
Kamil Braun	99fd2244a3	tree-wide: introduce cdc::generation_id type This is a follow-up to the previous commit. Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: 1. if a piece of code doesn't care about the timestamp, it just passes the ID around 2. if it does care, it can simply access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. Using the occasion, we change the `do_handle_cdc_generation_intercept...` function to be a standard function, not a coroutine. It turns out that - depending on the shape of the passed-in argument - the function would sometimes miscompile (the compiled code would not copy the argument to the coroutine frame).	2021-04-07 13:47:13 +02:00
Avi Kivity	5109bf8b99	config: relax batch size warning and failure thresholds We inherited very low thresholds for warning and failing multi-partition batches, but these warnings aren't useful. The size of a batch in bytes has no impact on node stability. In fact the warnings can cause more problems if they flood the log. Fix by raising the warning threshold to 128 kiB (our magic size) and the fail threshold to 1 MiB. Fixes #8416. Closes #8417	2021-04-06 20:56:06 +03:00
Calle Wilund	d734f85280	commitlog: Add signalling to recycle queue iff we fail to recycle Fixes #8376 If a recycle should fail, we will sort of handle it by deleting the segment, so no leaks. But if we have waiter(s) on the recycle queue, we could end up deadlocked/starved because nothing is incoming there. This adds an abort of the queue iff we failed and no objects are available. This will wake up any waiter, and he should retry, and hopefully at least be able to create a new segment. We then reset the queue to a new one. So we can go on. v2: * Forgot to reset queue v3: * Nicer exception handling in allocate_segment_ex	2021-04-06 16:38:14 +00:00
Calle Wilund	15dd76f0c2	commitlog: Fix race and edge condition in delete_segments Fixes #8363 Delete segments has two issues when running with size-limited commit log and strict adherence to said limit. 1.) It uses parallel processing, with deferral. This means that the disk usage variables it looks at might not be fully valid - i.e. we might have already issued a file delete that will reduce disk footprint such that a segment could instead be recycled, but since vars are (and should) only updated _post_ delete, we don't know. 2.) It does not take into account edge conditions, when we only delete a single segment, and this segment is the border segment - i.e. the one pushing us over the limit, yet allocation is desperately waiting for recycling. In this case we should allow it to live on, and assume that next delete will reduce footprint. Note: to ensure exact size limit, make sure total size is a multiple of segment size. Fixed by a.) Doing delete serialized. It is not like being parallel here will win us speed awards. And now we can know exact footprint, and how many segments we have left to delete b.) Check if we are a block across the footprint boundary, and people might be waiting for a segment. If so, don't delete segment, but recycle. As a follow-up, we should probably instead adjust the commitlog size limit (per shard) to be a multiple of segment sizes, but there are risks in that too.	2021-04-06 16:38:14 +00:00
Calle Wilund	d9a9897892	commitlog: coroutinize delete_segments Because we like cow routines.	2021-04-06 16:38:14 +00:00
Calle Wilund	813694b617	commitlog_test: Add test for deadlock in recycle waiter Not a very good test, mind you. Nothing to verify, just see if the test times out. But try to make it at least complete for failure report.	2021-04-06 16:38:14 +00:00
Konstantin Osipov	c83cf1f965	uuid: switch the API to use std::chrono A follow up for the patch for #7611. This change was requested during review and moved out of #7611 to reduce its scope. The patch switches UUID_gen API from using plain integers to hold time units to units from std::chrono. For one, we plan to switch the entire code base to std::chrono units, to ensure type safety. Secondly, using std::chrono units allows to increase code reuse with template metaprogramming and remove a few of UUID_gen functions that became redundant as a result. * switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch min_time_UUID(), max_time_UUID(), create_time_safe() to std::chrono * remove unused variant of from_unix_timestamp() * remove unused get_time_UUID_bytes(), create_time_unsafe(), redundant get_adjusted_timestamp() * inline get_raw_UUID_bytes() * collapse two similar implementations of get_time_UUID() * switch internal constants to std::chrono * remove unnecessary unique_ptr from UUID_gen::_instance Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>	2021-04-06 17:12:54 +03:00
Kamil Braun	e486e0f759	tree-wide: rename "cdc streams timestamp" to "cdc generation id" Each CDC generation always has a timestamp, but the fact that the timestamp identifies the generation is an implementation detail. We abstract away from this detail by using a more generic naming scheme: a generation "identifier" (whatever that is - a timestamp or something else). It's possible that a CDC generation will be identified by more than a timestamp in the (near) future. The actual string gossiped by nodes in their application state is left as "CDC_STREAMS_TIMESTAMP" for backward compatibility. Some stale comments have been updated.	2021-04-06 13:15:31 +02:00
Kamil Braun	1019ff07cb	db: system_keyspace: group cdc functions in single place	2021-04-06 13:15:31 +02:00
Kamil Braun	3cebe99613	sys_dist_ks: update comment at quorum_if_many The comment mentioned tables that no longer exist: their names have changed some time ago. Update the comment to be name-agnostic. Furthermore, the second part of the comment related to a case of "joining a node without bootstrapping". Fortunately this operation is no longer possible (after #6848 which became part of Scylla 4.3) so we can shorten the comment.	2021-04-06 13:15:31 +02:00
Avi Kivity	56cd058b34	config: correct description of listen_address - it does not support using interface names - listen_interface is not supported - 0.0.0.0 will work (and is reasonable) if you set broadcast_address - empty setting is not supported Fixes #8381. Closes #8409	2021-04-05 14:06:48 +03:00
Avi Kivity	82c76832df	treewide: don't include "db/system_distributed_keyspace.hh" from headers This just causes unneeded and slower recompilations. Instead replace with forward declarations, or includes of smaller headers that were incidentally brought in by the one removed. The .cc files that really need it gain the include, but they are few. Ref #1. Closes #8403	2021-04-04 14:00:26 +03:00
Kamil Braun	641040d465	sys_dist_ks: remove dead code (expire_cdc_* functions) These functions were not used anywhere but had to be maintained anyway. When (if) the expiration algorithm actually gets implemented (see issue #7300), the functions can be added back (perhaps they will need to look differently at that time, and it's likely that the `expire` column won't be used in the expiration algorithm in the end anyway).	2021-04-04 13:12:12 +03:00
Kamil Braun	4f3f245188	sys_dist_ks: coroutinize system_distributed_keyspace::start	2021-04-04 13:10:44 +03:00
Tomasz Grabiec	307bd354d2	Merge 'hints: use token_metadata to tell if node has left the ring' from Piotr Dulikowski This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function. Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node. Tests: - unit(dev) - dtest(hintedhandoff_additional_test.py) - dtest(topology_test.py) Closes #8387 * github.com:scylladb/scylla: hints: clarify docstring comment for can_send hints: use token_metadata to tell if node is in the ring hints: slightly reorganize "if" statement in can_send storage_service: release token_metadata lock before notify_left storage_service: notify_left after token_metadata is replicated	2021-04-01 15:51:46 +02:00
Piotr Dulikowski	6a1152ea9b	hints: clarify docstring comment for can_send Now, the docstring comment next to can_send better represents the condition that is checked inside that function. The statement about returning true when destination left the NORMAL state is replaced with a statement about returning true when the destination has left the ring.	2021-04-01 03:58:29 +02:00
Piotr Dulikowski	4f90514247	hints: use token_metadata to tell if node is in the ring Now, instead of looking at the gossiper state to check if the destination node is still in the ring, we are using token_metadata as a source of truth. This results in much simpler code in can_send() as token_metadata has an is_member method which does exactly what we want.	2021-04-01 03:58:29 +02:00
Piotr Dulikowski	e7d9057d0c	hints: slightly reorganize "if" statement in can_send This commit reverses the order of if-else blocks in can_send, which makes it - in my opinion, at least - slightly easier to read.	2021-04-01 03:58:29 +02:00
Piotr Jastrzebski	57c7964d6c	config: ignore enable_sstables_mc_format flag Don't allow users to disable MC sstables format any more. We would like to retire some old cluster features that have been around for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this we first have to make sure that all existing clusters have them enabled. It is impossible to know that unless we stop supporting enable_sstables_mc_format flag. Test: unit(dev) Refs #8352 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Closes #8360	2021-03-31 12:23:59 +03:00
Calle Wilund	c0666ea89b	commitlog: Fix inner loop condition in allocation pre-fill Fixes #8369 This was originally found (and fixed) by @gleb-cloudius, but the patch set with the fix was reverted at some point, and the fix went away. Now the error remains even in new, nice coroutine code. We check the wrong var in the inner loop of the pre-fill path of allocate_segment_ex, often causing us to generate giant writev:s of more or less the whole file. Not intended. Closes #8370	2021-03-30 12:14:55 +02:00
Nadav Har'El	ccc75bfe2a	Merge 'Disable thrift by default' from Piotr Sarna The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default. Fixes #8336 Closes #8338 * github.com:scylladb/scylla: docs: mention disabling Thrift by default db,config: disable Thrift by default	2021-03-29 12:48:20 +03:00
Piotr Wojtczak	c1daf2bb24	column_family: Make toppartitions queries more generic Right now toppartitions can only be invoked on one column family at a time. This change introduces a natural extension to this functionality, allowing to specify a list of families. We provide three ways for filtering in the query parameter "name_list": 1. A specific column family to include in the form "ks:cf" 2. A keyspace, telling the server to include all column families in it. Specified by omitting the cf name, i.e. "ks:" 3. All column families, which is represented by an empty list The list can include any number of entries of forms 1 and 2, in any combination. Fixes #4520 Closes #7864	2021-03-24 17:54:05 +02:00
Pavel Emelyanov	37bec6fb76	commitlog: Open files with append_is_unlikely This open option tells seastar that the file in question will be truncated to the needed size right at once and all the subsequent writes will happen within this size. This hint turns off append optimization in seastar that's not that cheap and helps to save a few CPU cycles. The option was introduced in seastar by 8bec57bc. tests: unit(dev), dtest(commitlog: test_batch_commitlog, test_periodic_commitlog, test_commitlog_replay_on_startup) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210323115409.31215-1-xemul@scylladb.com>	2021-03-24 13:05:33 +02:00
Piotr Sarna	e2443337d9	db,config: disable Thrift by default It will still be possible to use Thrift once it's enabled in the yaml file, but it's better to not open this port by default, since Thrift is definitely not the first choice for Scylla users. Fixes #8336	2021-03-22 10:54:26 +01:00
Avi Kivity	972ea9900c	Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund Refs #7794 Iff we need to pre-fill segment file in O_DSYNC mode, we should drop this for the pre-fill, to avoid issuing flushes until the file is filled. Done by temporarily closing, re-opening in "normal" mode, filling, then re-opening. Closes #8250 * github.com:scylladb/scylla: commitlog: Make pre-allocation drop O_DSYNC while pre-filling commitlog: coroutinize allocate_segment_ex	2021-03-17 09:59:22 +02:00
Calle Wilund	48ca01c3ab	commitlog: Make pre-allocation drop O_DSYNC while pre-filling Refs #7794 Iff we need to pre-fill segment file in O_DSYNC mode, we should drop this for the pre-fill, to avoid issuing flushes until the file is filled. Done by temporarily closing, re-opening in "normal" mode, filling, then re-opening. v2: * More comment v3: * Add missing flush v4: * comment v5: * Split coroutine and fix into separate patches	2021-03-15 09:35:45 +00:00
Calle Wilund	ae3b8e6fdf	commitlog: coroutinize allocate_segment_ex To make further changes here easier to write and read.	2021-03-15 09:35:37 +00:00
Calle Wilund	f44420f2c9	snapshot: Add filter to check for existing snapshot Fixes #8212 Some snapshotting operations call in on a single table at a time. When checking for existing snapshots in this case, we should not bother with snapshots in other tables. Add an optional "filter" to check routine, which if non-empty includes tables to check. Use case is "scrub" which calls with a limited set of tables to snapshot. Closes #8240	2021-03-10 20:21:38 +02:00
Eliran Sinvani	9162748b18	materialized views: create view schemas with proper base table reference. Newly created view schemas don't always have their base info, this is bad since such schemas don't support read nor write. This leaves us vulnerable to a race condition where there is an attempt to use this schema for read or write. Here we initialize the base reference and also reconfigure the view to conform to the new computed column type, which makes it usable for write and not only reads. We do it for views created in the migration manager following announcements and also for copied schemas.	2021-03-07 12:50:42 +02:00
Eliran Sinvani	39cd9dae4e	materialized views: Extract fix legacy schema into its own logic We extract the logic for fixing the view schema into its own logic as we will need to use it in more places in the code. This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since it becomes a two liner wrapper for this logic. We also remove it here and replace the call to it with the equivalent code.	2021-03-07 12:50:42 +02:00
Piotr Sarna	added53b7d	Merge 'hints: use a soft disk space limit in hints commitlog' from Piotr Dulikowski A recent change to the commitlog (`4082f57`) caused its configurable size limit to be strictly enforced - after reaching the limit, new segments wouldn't be allocated until some of the previous segments are freed. This flow can work for the regular commitlog, however the hints commitlog does not delete the segments itself - instead, hints manager recreates its commitlog every 10 seconds, picks up segments left by the previous instance and deletes each segment manually only after all hints are sent out from a segment. Because of the non-standard flow, it is possible that the hints commitlog fills up and stops accepting more hints. Hints manager uses a relatively low limit for each commitlog instance (128MB divided by shard count), so it's not hard to fill it up. What's worse, hints manager tries to acquire file_update_mutex in exclusive mode before re-creating the commitlog, while hints waiting to be written acquire this lock in shared mode - which causes hints flushing to completely deadlock and no more hints be admitted to the commitlog. The queue of hints waiting to be admitted grows very quickly and soon all writes which could result in a hint being generated are rejected with OverloadedException. To solve this problem, it is now possible to bring back the soft disk space limit by setting a flag in commitlog's configuration. Tests: - unit(dev) - wrote hints for 15 minutes in order to see if it gets stuck again Fixes #8137 Closes #8206 * github.com:scylladb/scylla: hints_manager: don't use commitlog hard space limit commitlog: add an option to allow going over size limit	2021-03-04 12:24:05 +01:00
Calle Wilund	5da0129775	system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp Since we have a table of cdc version timestamps, conveniently sorted reversed, we can just query this and get the latest known gen ts.	2021-03-03 15:44:54 +00:00
Calle Wilund	5a69250d7e	system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range With the new scheme for cdc generation management, one of the last changes was to make the time ordering of the stream timestamps reversed. However, cdc_get_versioned_streams forgot to take this into account when sifting out timestamp ranges for stream retrieval (based on low mark). Fixed by doing reverse iteration.	2021-03-03 15:41:42 +00:00
Piotr Dulikowski	376da49cf4	hints_manager: don't use commitlog hard space limit This commit disables the hard space limit applied by commitlogs created to store hints. The hard limit causes problems for hints because they use small-sized commitlogs to store hints (128MB, currently). Instead of letting the commitlog delete the segments itself, it recreates the commitlog every 10 seconds and manually deletes old segments after all hints are sent out from them. If the 128MB limit is reached, the hints manager will get stuck. A future which puts hint into commitlog holds a shared lock, and commitlog recreation needs to get an exclusive lock, which results in a deadlock. No more hints will be admitted, and eventually we will start rejecting writes with OverloadedException due to too many hints waiting to be admitted to the commitlog. By disabling the hard limit for hints commitlog, the old behavior is brought back - commitlog becomes more conservative with the space used after going over its size limit, but does not block until some of its segments are deleted.	2021-03-02 16:53:50 +01:00
Avi Kivity	5f4bf18387	Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros" This reverts commit `31909515b3`, reversing changes made to `ef97adc72a`. It shows many serious regressions in dtest. Fixes #8197.	2021-03-02 13:21:22 +02:00
Benny Halevy	baf5d05631	storage_service: use atomic_vector for lifecycle_subscribers So it can be modified while walked to dispatch subscribed event notifications. In #8143, there is a race between scylla shutdown and notify_down(), causing use-after-free of cql_server. Using an atomic vector instead and futurizing unregister_subscriber allows deleting from _lifecycle_subscribers while walked using atomic_vector::for_each. Fixes #8143 Test: unit(release) DTest: update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release) materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>	2021-03-01 20:34:42 +02:00
Avi Kivity	8747c684e0	Merge 'Move timeouts to client state' from Piotr Sarna This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests. The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867). Closes #8140 * github.com:scylladb/scylla: treewide: remove timeout config from query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options cql3: use timeout config from client state instead of query options service: add timeout config to client state	2021-03-01 20:34:35 +02:00
Piotr Dulikowski	aa2df75321	commitlog: add an option to allow going over size limit This commit adds an option which, when turned on, allows the commitlog to go over configured size limit. After reaching the limit, commitlog will be more conservative with its usage of the disk space - for example, it won't increase the segment reserve size or reuse recycled segments. Most importantly, it won't block writes until the space used by the commitlog goes down. This change is necessary for hinted handoff to keep its current behavior. Hinted handoff does not let the commitlog free segments itself - instead, it re-creates it every 10 seconds and manually deletes segments after all hints are sent from a segment.	2021-03-01 14:16:05 +01:00
Avi Kivity	31909515b3	Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros Currently, the sstable_set in a table is copied before every change to allow accessing the unchanged version by existing sstable readers. This patch changes the sstable_set to a structure that keeps all its versions that are referenced somewhere and provides a way of getting a reference to an immutable version of the set. Each sstable in the set is associated with the versions it is alive in, and is removed when all such versions don't have references anymore. To avoid copying, the object holding all sstables in the set version is changed to a new structure, sstable_list, which was previously an alias for std::unordered_set<shared_sstable>, and which implements most of the methods of an unordered_set, but its iterator uses the actual set with all sstables from all referenced versions and iterates over those sstables that belong to the captured version. The methods that modify the set's contents give a strong exception guarantee by trying to insert new sstables to its containers, and erasing them in the case of a caught exception. To release shared_sstables as soon as possible (i.e. when all references to versions that contain them die), each time a version is removed, all sstables that were referenced exclusively by this version are erased. We are able to find these sstables efficiently by storing, for each version, all sstables that were added and erased in it, and, when a version is removed, merging it with the next one. When a version that adds an sstable gets merged with a version that removes it, this sstable is erased. 
Fixes #2622 Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.com Closes #8111 * github.com:scylladb/scylla: sstables: add test for checking the latency of updating the sstable_set in a table sstables: move column_family_test class from test/boost to test/lib sstables: use fast copying of the sstable_set instead of rebuilding it sstables: replace the sstable_set with a versioned structure sstables: remove potential ub sstables: make sstable_set constructor less error-prone	2021-03-01 14:16:36 +02:00
Piotr Sarna	7936652322	db,view: improve verbosity of errors coming from view updates The error now contains information about the view table that failed, as well as base and view tokens. Example: view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index, base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error) Fixes #8177 Closes #8178	2021-03-01 10:46:14 +02:00
Piotr Sarna	c5214eb096	treewide: remove timeout config from query options Timeout config is now stored in each connection, so there's no point in tracking it inside each query as well. This patch removes timeout_config from query_options and follows by removing now unnecessary parameters of many functions and constructors.	2021-02-25 17:20:27 +01:00
Kamil Braun	841f07e9b7	cdc: add config option to disable streams rewriting Rewriting stream descriptions is a long, expensive, and prone-to-failure operation. Due to #8061 it may consume a lot of memory. In general, it may keep failing (and being retried) endlessly, straining the cluster. As a backdoor we add this flag for potential future needs of admins or field engineers. I don't expect it will ever be used, but it won't hurt and may save us some work in the worst case scenario.	2021-02-18 11:44:59 +01:00
Kamil Braun	9bdd000e97	cdc: rewrite streams to the new description table Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. This commit adds an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. when some CDC log tables may contain data in these generations).	2021-02-18 11:44:59 +01:00
Kamil Braun	4ef736a0a3	cql3: query_processor: improve internal paged query API The `query_processor::query` method allowed internal paged queries. However, it was quite limited, hardcoding a number of parameters: consistency level, timeout config, page size. This commit makes the following improvements: 1. Rename `query` to `query_internal` to make it obvious that this API is supposed to be used for internal queries only 2. Extend the method to take consistency level, timeout config, and page size as parameters 3. Remove unused overloads of `query_internal` 4. Fix a bunch of typos / grammar issues in the docstring	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
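
The alignment fix described in commit 705f9c4f79 (clamp max_size to a value that fits in position_type and is a multiple of 1MB) can be sketched roughly as follows. This is a minimal illustration, not the Scylla source: `align_down`, `clamp_segment_size`, the `MiB` constant and the assumption that `position_type` is a 32-bit unsigned integer are all hypothetical names and choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption for this sketch: position_type is a 32-bit unsigned integer,
// which matches the 0xffffffff maximum quoted in the commit message.
using position_type = uint32_t;

constexpr uint64_t MiB = 1024 * 1024;

// Hypothetical helper: round value down to the nearest multiple of alignment.
constexpr uint64_t align_down(uint64_t value, uint64_t alignment) {
    return value - (value % alignment);
}

// Hypothetical helper: convert a requested segment size in MB to bytes and
// clamp it to the largest 1 MiB-aligned value representable in position_type.
constexpr uint64_t clamp_segment_size(uint64_t requested_mb) {
    constexpr uint64_t limit =
        align_down(std::numeric_limits<position_type>::max(), MiB);
    return std::min(requested_mb * MiB, limit);
}
```

With these assumptions, a small request such as 32 MB passes through unchanged, while an oversized request is clamped to 0xfff00000 rather than the unaligned 0xffffffff, so every write count derived from it stays a multiple of 1 MiB.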

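The three "name_list" filter forms described in commit c1daf2bb24 ("ks:cf", "ks:", and the empty list) can be illustrated with a small matcher. This is only a sketch under stated assumptions: the function name `selected` and the choice to skip entries without a colon are hypothetical and are not taken from the Scylla API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not the Scylla API: reports whether table ks.cf is
// selected by a name_list following the three forms in the commit message.
bool selected(const std::vector<std::string>& name_list,
              const std::string& ks, const std::string& cf) {
    if (name_list.empty()) {
        return true;  // form 3: an empty list selects every column family
    }
    for (const auto& entry : name_list) {
        const auto colon = entry.find(':');
        if (colon == std::string::npos) {
            continue;  // assumption: entries without ':' are skipped here
        }
        const std::string entry_ks = entry.substr(0, colon);
        const std::string entry_cf = entry.substr(colon + 1);
        // form 2 ("ks:") has an empty cf part and matches the whole keyspace;
        // form 1 ("ks:cf") must match both parts exactly.
        if (entry_ks == ks && (entry_cf.empty() || entry_cf == cf)) {
            return true;
        }
    }
    return false;
}
```

For example, a list holding "ks1:" and "ks2:cf" selects every table in ks1 plus ks2.cf, but not other tables in ks2.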

2003 Commits