scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 12:47:02 +00:00

Author	SHA1	Message	Date
Michał Chojnowski	cdb3e71045	sstables: add a flag for disabling long-term index caching Long-term index caching in the global cache, as introduced in 4.6, is a major pessimization for workloads where accesses to the index are (spatially) sparse. We want to have a way to disable it for the affected workloads. There is already infrastructure in place for disabling it for BYPASS CACHE queries. One way of solving the issue is hijacking that infrastructure. This patch adds a global flag (and a corresponding CLI option) which controls index caching. Setting the flag to `false` causes all index reads to behave like they would in BYPASS CACHE queries. Consequences of this choice: - The per-SSTable partition_index_cache is unused. Every index_reader has its own, and they die together. Independent reads can no longer reuse the work of other reads which hit the same index pages. This is not crucial, since partition accesses have no (natural) spatial locality. Note that the original reason for partition_index_cache -- the ability to share reads for the lower and upper bound of the query -- is unaffected. - The per-SSTable cached_file is unused. Every index_reader has its own (uncached) input stream from the index file, and every bsearch_clustered_cursor has its own cached_file, which dies together with the cursor. Note that the cursor still can perform its binary search with caching. However, it won't be able to reuse the file pages read by index_reader. In particular, if the promoted index is small, and fits inside the same file page as its index_entry, that page will be re-read. It can also happen that index_reader will read the same index file page multiple times. When the summary is so dense that multiple index pages fit in one index file page, advancing the upper bound, which reads the next index page, will read the same index file page. Since summary:disk ratio is 1:2000, this is expected to happen for partitions with size greater than 2000 partition keys. 
Fixes #11202	2022-09-15 17:16:26 +03:00
Raphael S. Carvalho	0a8afe18ca	cql: Reject create and alter table with DateTieredCompactionStrategy It's been ~1 year (`2bf47c902e`) since we set restrict_dtcs config option to WARN, meaning users have been warned about the deprecation process of DTCS. Let's set the config to TRUE, meaning that create and alter statements specifying DTCS will be rejected at the CQL level. Existing tables will still be supported. But the next step will be about throwing DTCS code into the shadow realm, and after that, Scylla will automatically fallback to STCS (or ICS) for users which ignored the deprecation process. Refs #8914. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11458	2022-09-15 11:46:18 +03:00
Nadav Har'El	8ece63c433	Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails' This series introduces two configurable options when working with TWCS tables: - `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting - `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check. Refs: #6923 Fixes: #9029 Closes #11445 * github.com:scylladb/scylladb: tests: cql_query_test: add mixed tests for verifying TWCS guard rails tests: cql_query_test: add test for TWCS window size tests: cql_query_test: add test for TWCS tables with no TTL defined cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables cql: add max window restriction for TimeWindowCompactionStrategy time_window_compaction_strategy: reject invalid window_sizes cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr	2022-09-12 23:55:51 +03:00
Botond Dénes	5374f0edbf	Merge 'Task manager' from Aleksandra Martyniuk Task manager for observing and managing long-running, asynchronous tasks in Scylla with the interface for the user. It will allow listing of tasks, getting detailed task status and progression, waiting for their completion, and aborting them. The task manager will be configured with a “task ttl” that determines how long the task status is kept in memory after the task completes. At first it will support repair and compaction tasks, and possibly more in the future. Currently: Sharded `task_manager` is started in `main.cc` where it is further passed to `http_context` for the purpose of user interface. Task manager's tasks are implemented in two layers: the abstract and the implementation one. The latter is a pure virtual class which needs to be overridden by each module. Abstract layer provides the methods that are shared by all modules and the access to module-specific methods. Each module can access task manager, create and manage its tasks through `task_manager::module` object. This way data specific to a module can be separated from the other modules. User can access task manager rest api interface to track asynchronous tasks. The available options consist of: - getting a list of modules - getting a list of basic stats of all tasks in the requested module - getting the detailed status of the requested task - aborting the requested task - waiting for the requested task to finish To enable testing of the provided api, test specific task implementation and module are provided. Their lifetime can be simulated with the standalone test api. These components are compiled and the tests are run in all but release build modes. 
Fixes: #9809 Closes #11216 * github.com:scylladb/scylladb: test: task manager api test task_manager: test api layer implementation task_manager: add test specific classes task_manager: test api layer task_manager: api layer implementation task_manager: api layer task_manager: keep task_manager reference in http_context start sharded task manager task_manager: create task manager object	2022-09-12 09:26:46 +03:00
Felipe Mendes	7fec4fcaa6	cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth. However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we cannot simply forbid table creations without a default_time_to_live explicitly set to any value other than 0. The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn": We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database omitting the `USING TTL` keyword.	2022-09-11 16:50:42 -03:00
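The three settings described in the commit above can be sketched as follows. This is a minimal illustration, not Scylla's implementation: the `TriMode` enum, the `check_twcs_default_ttl` function, and the warning text are hypothetical names introduced for this example.

```python
from enum import Enum


class TriMode(Enum):
    """Hypothetical stand-in for a tri-mode restriction value."""
    FALSE = "false"
    WARN = "warn"
    TRUE = "true"


def check_twcs_default_ttl(mode, default_time_to_live):
    """Return (allowed, warning) for a TWCS table definition.

    Assumption: a default_time_to_live of 0 means "not set".
    """
    if default_time_to_live > 0 or mode is TriMode.FALSE:
        return True, None
    message = "TWCS table created without default_time_to_live"
    if mode is TriMode.WARN:
        # Table is accepted, but the user is notified.
        return True, message
    # TriMode.TRUE: the statement is rejected.
    return False, message
```

With the default `warn` setting, a table without a default TTL is still created and only the warning is surfaced; switching to `true` turns the same condition into a rejection.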
Felipe Mendes	a3356e866b	cql: add max window restriction for TimeWindowCompactionStrategy The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation where users of TWCS underestimate their number of window buckets. Unfortunately, scenarios in which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough. Such a configuration is known to only do harm to a workload: As more and more windows are created, the number of SSTables will grow at the same pace, and the situation will only get worse as the number of shards increases. This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check. Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.	2022-09-11 16:50:22 -03:00
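The ratio in the commit above is easy to work through by hand. The sketch below is an illustration only: the function names are hypothetical, and the use of integer division and of a strict "greater than" comparison against the threshold are assumptions, not details confirmed by the commit message.

```python
def twcs_window_count(default_ttl_seconds, window_seconds):
    """Number of potential windows: default_time_to_live / window size."""
    return default_ttl_seconds // window_seconds


def is_rejected(default_ttl_seconds, window_seconds, max_window_count=50):
    """Sketch of the guardrail described in the commit message."""
    if max_window_count == 0:
        return False  # a value of 0 skips the check
    if default_ttl_seconds == 0:
        return False  # tables with default TTL of 0 are not forbidden
    return twcs_window_count(default_ttl_seconds, window_seconds) > max_window_count


ONE_YEAR = 365 * 24 * 3600
THIRTY_MINUTES = 30 * 60
print(twcs_window_count(ONE_YEAR, THIRTY_MINUTES))
```

For the example from the commit message (a TTL of 1 year with 30-minute windows) the ratio is 17520 windows, far past the default limit of 50.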
Aleksandra Martyniuk	2439e55974	task_manager: create task manager object Implementation of a task manager that allows tracking and managing asynchronous tasks. The tasks are represented by task_manager::task class providing members common to all types of tasks. The methods that differ among tasks of different modules can be overridden in a class inheriting from task_manager::task::impl class. Each task stores its status containing parameters like id, sequence number, begin and end time, state etc. After the task finishes, it is kept in memory for configurable time or until it is unregistered. Tasks need to be created with make_task method. Each module is represented by task_manager::module type and should have access to task manager through task_manager::module methods. That makes it easy to separate and collectively manage data belonging to each module.	2022-09-09 14:29:28 +02:00
Mikołaj Grzebieluch	5b1421cc33	db: config: add BROADCAST_TABLES feature flag Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature. This feature requires raft group0, thus enabling it without RAFT will cause an error.	2022-09-05 11:11:08 +02:00
Avi Kivity	0dbcd13a0f	config: change logging::settings constructor call to use designated initializer Safer wrt reordering, and more readable too. Closes #11382	2022-08-26 06:14:01 +03:00
Botond Dénes	33f0447ba0	db/config: add config item for query tombstone limit This will be the value used to break pages, after processing the specified amount of tombstones. The page will be cut even if empty. We could maybe use the already existing tombstone_{warn,fail}_threshold instead and use them as a soft/hard limit pair, like we did with page sizes.	2022-08-09 10:00:40 +03:00
Benny Halevy	edd308c705	config: use ordered map for experimental features So that the help string will be sorted lexicographically. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11178	2022-08-01 17:40:10 +03:00
Avi Kivity	1f21c1ecc8	Merge "Add IO throttling to streaming class" from Pavel E " Same thing was done for compaction class some time ago, now it's time for streaming to keep repair-generated IO in bounds. This set mostly resembles the one for compaction IO class with the exception that boot-time reshard/reshape currently runs in streaming class, but that's not great if the class is throttled, so the set also moves boot-time IO into default IO class. " * 'br-streaming-class-throttling-2' of https://github.com/xemul/scylla: distributed_loader: Populate keyspaces in default class streaming: Maintain class bandwidth streaming: Pass db::config& to manager constructor config: Add stream_io_throughput_mb_per_sec option sstables: Keep priority class on sstable_directory	2022-07-19 17:10:25 +03:00
Pavel Emelyanov	07460761fb	Merge "Make compaction_static_shares and memtable_flush_static_shares live updateable" from Igor Ribeiro Barbosa Duarte (3): Currently, after updating the static shares it's necessary to restart the cluster. This patch series makes compaction_static_shares and memtable_flush_static_shares live updateable so that this restart isn't necessary anymore. dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_liveupdate_compaction_static_shares ci: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1412/ * https://github.com/igorribeiroduarte/scylla/tree/make_compaction_static_shares_live_updateable: memtable_flush: Make memtable_flush_static_shares liveupdateable compaction: Make compaction_static_shares liveupdateable backlog_controller: Unify backlog_controller constructors	2022-07-19 16:55:55 +03:00
Igor Ribeiro Barbosa Duarte	3b19bcf1a1	memtable_flush: Make memtable_flush_static_shares liveupdateable This patch makes memtable_flush_static_shares liveupdateable to avoid having to restart the cluster after updating this config. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte	8dd0f4672d	compaction: Make compaction_static_shares liveupdateable This patch makes compaction_static_shares liveupdateable to avoid having to restart the cluster after updating this config. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-07-19 10:10:46 -03:00
Pavel Emelyanov	85d32485d9	config: Mark compaction_throughput_mb_per_sec option as Used Otherwise it's not shown in the --help output. Should've been the part of `868c3be0` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220716085221.26634-1-xemul@scylladb.com>	2022-07-19 13:18:17 +03:00
Pavel Emelyanov	7d0110cd31	config: Add stream_io_throughput_mb_per_sec option It's going to control the bandwidth for the streaming prio class. For now it's just added but doesn't work for real Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:14:41 +03:00
Nadav Har'El	cc69177dcc	config: fix printing of experimental feature list Recently we noticed a regression where with certain versions of the fmt library, SELECT value FROM system.config WHERE name = 'experimental_features' returns string numbers, like "5", instead of feature names like "raft". It turns out that the fmt library keeps changing its overload resolution order when there are several ways to print something. For enum_option<T> we happen to have two conflicting ways to print it: 1. We have an explicit operator<<. 2. We have an implicit converter to the type held by T. We were hoping that the operator<< always wins. But in fmt 8.1, there is special logic that if the type is convertible to an int, this is used before operator<<()! For experimental_features_t, the type held in it was an old-style enum, so it is indeed convertible to int. The solution I used in this patch is to replace the old-style enum in experimental_features_t by the newer and more recommended "enum class", which does not have an implicit conversion to int. I could have fixed it in other ways, but it wouldn't have been much prettier. For example, dropping the implicit converter would require us to change a bunch of switch() statements over enum_option (and not just experimental_features_t, but other types of enum_option). Going forward, all uses of enum_option should use "enum class", not "enum". tri_mode_restriction_t was already using an enum class, and now so does experimental_features_t. I changed the examples in the comments to also use "enum class" instead of enum. This patch also adds to the existing experimental_features test a check that the feature names are words that are not numbers. Fixes #11003. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11004	2022-07-11 09:17:30 +02:00
Avi Kivity	3b20407f25	Merge 'db: Avoid memtable flush latency on schema merge' from Tomasz Grabiec Currently, applying schema mutations involves flushing all schema tables so that on restart commit log replay is performed on top of latest schema (for correctness). The downside is that schema merge is very sensitive to fdatasync latency. Flushing a single memtable involves many syncs, and we flush several of them. It was observed to take as long as 30 seconds on GCE disks under some conditions. This patch changes the schema merge to rely on a separate commit log to replay the mutations on restart. This way it doesn't have to wait for memtables to be flushed. It has to wait for the commitlog to be synced, but this cost is well amortized. We put the mutations into a separate commit log so that schema can be recovered before replaying user mutations. This is necessary because regular writes have a dependency on schema version, and replaying on top of latest schema satisfies all dependencies. Without this, we could get loss of writes if we replay a write which depends on the latest schema on top of old schema. Also, if we have a separate commit log for schema we can delay schema parsing for after the replay and avoid complexity of recognizing schema transactions in the log and invoking the schema merge logic. I reproduced bad behavior locally on my machine with a tired (high latency) SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement). Without load, it is 300ms vs 50ms. 
Fixes #8272 Fixes #8309 Fixes #1459 Closes #10333 * github.com:scylladb/scylla: config: Introduce force_schema_commit_log option config: Introduce unsafe_ignore_truncation_record db: Avoid memtable flush latency on schema merge db: Allow splitting initiatlization of system tables db: Flush system.scylla_local on change migration_manager: Do not drop system.IndexInfo on keyspace drop Introduce SCHEMA_COMMITLOG cluster feature frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations db/commitlog: Improve error messages in case of unknown column mapping db/commitlog: Fix error format string to print the version db: Introduce multi-table atomic apply()	2022-07-07 16:03:50 +03:00
Tomasz Grabiec	6622e3369a	config: Introduce force_schema_commit_log option	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	b8d20335a4	config: Introduce unsafe_ignore_truncation_record The node now refuses to boot if schema tables were truncated. This adds a config option to ignore truncation records as a workaround if user truncated them manually.	2022-07-06 22:08:56 +02:00
Pavel Emelyanov	868c3be01f	config: Tune the config option The option is used, but is not implemented. If an implementation were attached to it right away, compaction would slow down to 16MB/s on all nodes. Make it zero (unbound) by default and mark it live-updateable while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-30 09:55:52 +03:00
Pavel Emelyanov	3a753068be	Merge "Make permissions cache live updateable and add an API for resetting authorization cache" from Igor Ribeiro Barbosa Duarte Currently, for users who have permissions_cache configs set to very high values (and thus can't wait for the configured times to pass) having to restart the service every time they make a change related to permissions or prepared_statements cache (e.g. adding a user and changing their permissions) can become pretty annoying. This patch series makes permissions_validity_in_ms, permissions_update_interval_in_ms and permissions_cache_max_entries live updateable so that restarting the service is not necessary anymore for these cases. It also adds an API for flushing the cache to make it easier for users who don't want to modify their permissions_cache config. branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/ dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache * https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable: loading_cache_test: Test loading_cache::reset and loading_cache::update_config api: Add API for resetting authorization cache authorization_cache: Make permissions cache and authorized prepared statements cache live updateable auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct utils/loading_cache.hh: Add update_config method utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh utils/loading_cache.hh: Add reset method	2022-06-29 11:14:13 +03:00
Igor Ribeiro Barbosa Duarte	b9051c79bc	authorization_cache: Make permissions cache and authorized prepared statements cache live updateable Currently, for users who have permissions_cache configs set to very high values (and thus can't wait for the configured times to pass) having to restart the service every time they make a change related to permissions or prepared_statements cache (e.g. adding a user) can become pretty annoying. This patch makes permissions_validity_in_ms, permissions_update_interval_in_ms and permissions_cache_max_entries live updateable so that restarting the service is not necessary anymore for these cases. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:58:06 -03:00
Piotr Dulikowski	761a037afb	config: add add_per_partition_rate_limit_extension function for testing ...and use it in cql_test_env to enable the per_partition_rate_limit extension for all tests that use it.	2022-06-22 20:16:49 +02:00
Benny Halevy	6677028212	sstables: mx/writer: auto-scale promoted index Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB). When the promoted index (serialized) size gets to this threshold, it's halved by merging each two adjacent blocks into one and doubling the desired_block_size. Fixes #4217 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-24 13:32:35 +03:00
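The halving step described in the commit above can be sketched as follows. This is a simplified model, not the mx writer's code: blocks are represented as hypothetical `(first_key, last_key)` tuples, and `halve_promoted_index` is a name introduced for this example.

```python
def halve_promoted_index(blocks, desired_block_size):
    """Merge each two adjacent blocks into one and double the block size.

    `blocks` is a list of (first_key, last_key) tuples in index order.
    Returns the merged block list and the new desired block size.
    """
    merged = []
    for i in range(0, len(blocks), 2):
        pair = blocks[i:i + 2]  # the last pair may hold a single block
        merged.append((pair[0][0], pair[-1][1]))
    return merged, desired_block_size * 2


blocks = [(0, 9), (10, 19), (20, 29)]
print(halve_promoted_index(blocks, 64))
```

Each call roughly halves the number of index entries, so repeating it whenever the serialized size reaches the threshold keeps the promoted index bounded while the partition keeps growing.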
Michał Radwański	29e09a3292	db/config: command line arguments logger_stdout_timestamps and logger_ostream_type are no longer ignored Closes #10452	2022-05-04 14:40:52 +03:00
Calle Wilund	7dd7760e8d	commitlog: Make flush threshold a config parameter	2022-04-11 16:34:00 +00:00
Calle Wilund	d478896d46	commitlog: kill non-recycled segment management It has been default for a while now. Makes no sense to not do it. Even hints can use it (even if it makes no difference there)	2022-04-11 16:34:00 +00:00
Piotr Sarna	3272b4826f	db: add keyspace-storage-options experimental feature Specifying non-standard keyspace options is experimental, so it's going to be protected by a configuration flag.	2022-04-08 09:17:01 +02:00
Nadav Har'El	49a8164fb7	alternator: add configurable scan period to TTL expiration Before this patch, the experimental TTL (expiration time) feature in Alternator scans tables for expiration in a tight loop - starting the next scan one second after the previous one completed. In this patch we introduce a new configuration option, alternator_ttl_period_in_seconds, which determines how frequently to start the scan. The default is 24 hours - meaning that the next scan is started 24 hours after the previous one started. The tests (test/alternator/run) change this configuration back to one second, so that expiration tests finish as quickly as possible. Please note that the scan is not slowed down to fill this 24 hours - if it finishes in one hour, it will then sleep for 23 hours. Additional work would be needed to slow down the scan to not finish too quickly. One idea not yet implemented is to move the expiration service from the "maintenance" scheduling group which it uses today to a new scheduling group, and modifying the number of shares that this group gets. Another thing worth noting about the configurable period (which defaults to 24 hours) is that when TTL is enabled on an Alternator table, it can take that amount of time until its scan starts and items start expiring from it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
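The timing rule in the commit above (the period is measured from the start of the previous scan, and the scan itself is not slowed down) comes down to one subtraction. In this sketch the function name is hypothetical, and clamping to zero when a scan outlasts the period is an assumption; the commit message does not say what happens in that case.

```python
def sleep_before_next_scan(period_seconds, scan_duration_seconds):
    """Seconds to sleep after a scan so the next one starts one period
    after the previous one started."""
    # Assumption: if the scan took longer than the period, start at once.
    return max(0, period_seconds - scan_duration_seconds)


HOUR = 3600
print(sleep_before_next_scan(24 * HOUR, 1 * HOUR) // HOUR)
```

With the default 24-hour period, a scan that finishes in one hour is followed by 23 hours of sleep, matching the example in the commit message.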
Michael Livshin	3bf1e137fc	config: make the ME sstable format default Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0b1447c702	add "sstable_format" config Initialize it to "md" until ME format support is complete (i.e. storing originating host id in sstable stats metadata is implemented), so at present there is no observable change by default. Also declare "enable_sstables_md_format" unused -- the idea, going forward, being that only "sstable_format" controls the written sstable file format and that no more per-format enablement config options shall be added. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Avi Kivity	13cf66d3ef	Revert "schema_registry: Increase grace period for schema version cache" This reverts commit `23da2b5879`. It causes the node to quickly run out of memory when many schema changes are made within a small time window. Fixes #10071.	2022-02-13 19:38:24 +02:00
Nadav Har'El	fef7934a2d	config: fix some types in system.config virtual table The system.config virtual table prints each configuration variable of type T based on the JSON printer specified in the config_type_for<T> in db/config.cc. For two variable types - experimental_features and tri_mode_restriction, the specified converter was wrong: We used value_to_json<string> or value_to_json<vector<string>> on something which was not a string. Unfortunately, value_to_json silently cast the given objects into strings, and the result was garbage: For example as noted in #10047, for experimental_features instead of printing a list of features names, e.g., "raft", we got a bizarre list of one-byte strings with each feature's number (which isn't documented or even guaranteed to not change) as well as carriage-return characters (!?). So the solution is a new printable_to_json<T> which works on a type T that can be printed with operator<< - as in fact the above two types can - and the type is converted into a string or vector of strings using this operator<<, not a cast. Also added a cql-pytest test for reading system.config and in particular options of the above two types - checking that they contain sensible strings and not "garbage" like before this patch. Fixes #10047. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20220209090421.298849-1-nyh@scylladb.com>	2022-02-10 09:10:24 +03:00
Tomasz Grabiec	23da2b5879	schema_registry: Increase grace period for schema version cache If version is absent in cache, it will be fetched from the coordinator. This is not expensive, but if the version is not known, it must be also "synced". It means that the node will do a full schema pull from the coordinator. This pull is expensive and can take seconds. If the coordinator we pull from is at an old version, the pull will do nothing and current node will soon forget the old version, initiating another pull. If some nodes stay at an old version for a long time for some reason, this will make new coordinators initiate pulls frequently. Increase the expiration period to 15 minutes to reduce the impact in such scenarios. Fixes #10042. Message-Id: <20220207122317.674241-1-tgrabiec@scylladb.com>	2022-02-09 09:27:07 +02:00
Michał Sala	b439d6e710	db: config: add a flag to disable new parallelized aggregation algorithm Just in case the new algorithm turns out to be buggy, add a flag to fall-back to the old algorithm.	2022-02-01 21:26:25 +01:00
Pavel Emelyanov	a026b4ef49	config: Add option to disable config updates via CQL The system.config table allows changing config parameters, but this change doesn't survive restarts and is considered to be dangerous (sometimes). Add an option to disable the table updates. The option is LiveUpdate and can be set to false via CQL too (once). fixes #9976 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220201121114.32503-1-xemul@scylladb.com>	2022-02-01 14:30:47 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Kamil Braun	e98711cfcb	db: config: add a flag to disable new reversed reads algorithm Just in case the new algorithm turns out to be buggy, or give a performance regression, add a flag to fall-back to the old algorithm for use in the field.	2022-01-12 18:59:19 +01:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. 
A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
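The four tombstone_gc modes listed in the commit above can be summarized as a single decision function. This is a sketch under stated assumptions, not the code from the patch: `can_gc_tombstone` and its parameters are hypothetical, and the way `propagation_delay_in_seconds` is subtracted from the last repair time in `repair` mode is an assumed reading of the commit message.

```python
def can_gc_tombstone(mode, deletion_time, now, gc_grace_seconds,
                     last_repair_time=None, propagation_delay_in_seconds=0):
    """Decide whether a tombstone may be garbage-collected.

    All times are seconds on a common clock. `last_repair_time` is the
    time of the last repair covering the tombstone's range, or None.
    """
    if mode == "timeout":      # default: the classic gc_grace_seconds rule
        return deletion_time < now - gc_grace_seconds
    if mode == "disabled":     # never GC
        return False
    if mode == "immediate":    # GC right away
        return True
    if mode == "repair":       # GC only once the range has been repaired
        if last_repair_time is None:
            return False
        # Assumption: allow for writes that were still in flight when
        # the repair ran by backing off by the propagation delay.
        return deletion_time < last_repair_time - propagation_delay_in_seconds
    raise ValueError(f"unknown tombstone_gc mode: {mode}")
```

In `repair` mode the decision no longer depends on wall-clock time passing, which is the point of the change: an unrepaired range keeps its tombstones however long repair is delayed.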
Avi Kivity	9e74556413	Merge 'Support reverse reads in the row cache natively' from Tomasz Grabiec This change makes row cache support reverse reads natively so that reversing wrappers are not needed when reading from cache and thus the read can be executed efficiently, with similar cost as the forward-order read. The database is serving reverse reads from cache by default after this. Before, it was bypassing cache by default after `703aed3277`. Refs: #1413 Tests: - unit [dev] - manual query with build/dev/scylla and cache tracing on Closes #9454 * github.com:scylladb/scylla: tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries row_cache: partition_snapshot_row_cursor: Print more details about the current version vector row_cache: Improve trace-level logging config: Use cache for reversed reads by default config: Adjust reversed_reads_auto_bypass_cache description row_cache: Support reverse reads natively mvcc: partition_snapshot: Support slicing range tombstones in reverse test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition row_cache: Log produced range tombstones test: Make produces_range_tombstone() report ck_ranges tests: lib: random_mutation_generator: Extract make_random_range_tombstone() partition_snapshot_row_cursor: Support reverse iteration utils: immutable-collection: Make movable intrusive_btree: Make default-initialized iterator cast to false	2021-12-29 16:53:25 +02:00
Asias He	eba4a4fba4	repair: Allow ignoring dead nodes for replace operation Consider 1) n1, n2, n3, n4, n5 2) n2 and n3 are both down 3) start n6 to replace n2 4) start n7 to replace n3 We want to replace the dead nodes n2 and n3 so that the cluster has 5 running nodes again. The replace operation in step 3 will fail because n3 is down. We would see errors like the one below: replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed for replace operation are down. It is highly recommended to fix the down nodes and try again. In the above example, there is currently no way to replace any of the dead nodes. Users can either fix one of the dead nodes and run replace, or run a removenode operation to remove one of the dead nodes, then run replace and run bootstrap to add another node. Fixing dead nodes is always the best solution but it might not be possible. Running a removenode operation is no better than running a replace operation (with best effort by ignoring the other dead node) in terms of data consistency. In addition, users have to run a bootstrap operation to add back the removed node. So, allowing replace in such a case is a clear win. This patch adds the --ignore-dead-nodes-for-replace option to allow running the replace operation in best-effort mode. Please note: use this option only if the dead nodes are completely broken and down, and there is no way to fix them and bring them back. This also means the user has to make sure the ignored dead nodes specified are really down to avoid any data consistency issue. Fixes #9757 Closes #9758	2021-12-20 00:49:03 +02:00
Tomasz Grabiec	65a1a0247a	config: Use cache for reversed reads by default	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	9fd1120ad5	config: Adjust reversed_reads_auto_bypass_cache description Bypassing cache is no longer necessary to use native reverse readers.	2021-12-19 22:41:35 +01:00
Avi Kivity	f28552016f	Update seastar submodule * seastar f8a038a0a2...8d15e8e67a (21): > core/program_options: preserve defaultness of CLI arguments > log: Silence logger when logging > Include the core/loop.hh header inside when_all.hh header > http: Fix deprecated wrappers > foreign_ptr: Add concept > util: file: add read_entire_file > short_streams: move to util > Revert "Merge: file: util: add read_entire_file utilities" > foreign_ptr: declare destroy as a static method > Merge: file: util: add read_entire_file utilities > Merge "output_stream: handle close failure" from Benny > net: bring local_address() to seastar::connected_socket. > Merge "Allow programatically configuring seastar" from Botond > Merge 'core: clean up memory metric definitions' from John Spray > Add PopOS to debian list in install-dependencies.sh > Merge "make shared_mutex functions exception safe and noexcept" from Benny > on_internal_error: set_abort_on_internal_error: return current state > Implementation of iterator-range version of when_any > net: mark functions returning ethernet_address noexcept > net: ethernet_address: mark functions noexcept > shared_mutex: mark wake and unlock methods noexcept Contains patch from Botond Dénes <bdenes@scylladb.com>: db/config: configure logging based on app_template::seastar_options Scylla has its own config file which supports configuring aspects of logging, in addition to the built-in CLI logging options. When applying this configuration, the CLI provided option values have priority over the ones coming from the option file. To implement this scylla currently reads CLI options belonging to seastar from the boost program options variable map. The internal representation of CLI options however do not constitute an API of seastar and are thus subject to change (even if unlikely). This patch moves away from this practice and uses the new shiny C++ api: `app_template::seastar_options` to obtain the current logging options.	2021-12-08 14:21:11 +02:00
Nadav Har'El	605a2de398	config: change default prometheus_address handling, again In the very recent commit `3c0e703` fixing issue #8757, we changed the default prometheus_address setting in scylla.yaml to "localhost", to match the default listen_address in the same file. We explained in that commit how this helped developers who use an unchanged scylla.yaml, and how it didn't hurt pre-existing users who already had their own scylla.yaml. However, it was quickly noted by Tzach and Amnon that there is one use case that was hurt by that fix: Our existing documentation, such as the installation guide https://www.scylladb.com/download/?platform=centos asks the user to take our initial scylla.yaml, and modify listen_address, rpc_address, seeds, and cluster_name - and that's it. That document - and others - don't tell the user to also override prometheus_address, so users will likely forget to do so - and monitoring will not work for them. So this patch includes a different solution to #8757. What it does is: 1. The setting of prometheus_address in scylla.yaml is commented out. 2. In config.cc, prometheus_address defaults to empty. 3. In main.cc, if prometheus_address is empty (i.e., was not explicitly set by the user), the value of listen_address is used instead. In other words, the idea is that prometheus_address, if not explicitly set by the user, should default to listen_address - which is the address used to listen to the internal Scylla inter-node protocol. Because the documentation already tells the user to set listen_address and to not leave it set to localhost, setting it will also open up prometheus, thereby solving #9701. Meanwhile, developers who leave the default listen_address=localhost will also get prometheus_address=localhost, so the original #8757 is solved as well. Finally, for users who had an old scylla.yaml where prometheus_address was explicitly set to something, this setting will continue to be used. This was also a requirement of issue #8757. 
Fixes #9701. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211129155201.1000893-1-nyh@scylladb.com>	2021-12-02 19:43:30 +02:00
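The fallback described in steps 2 and 3 of the commit above amounts to "use prometheus_address if set, otherwise listen_address". A minimal Python sketch of that rule follows. The function name `effective_prometheus_address` and the dict-based config are hypothetical stand-ins for the real config.cc/main.cc code, and falling back to "localhost" when listen_address is also missing is an assumption based on the default the message mentions.

```python
def effective_prometheus_address(config):
    """Pick the address the Prometheus endpoint should listen on.

    config is a plain dict standing in for the parsed scylla.yaml
    (hypothetical representation, used only for this sketch).
    """
    # Step 2 of the commit: prometheus_address defaults to empty.
    prometheus_address = config.get("prometheus_address", "")
    if prometheus_address:
        # Explicitly set by the user: keep using it.
        return prometheus_address
    # Step 3: empty means "not set", so fall back to listen_address.
    # Assumption: listen_address itself defaults to "localhost".
    return config.get("listen_address", "localhost")
```

With this rule, a user who follows the installation guide and sets only listen_address gets a reachable Prometheus endpoint, while an untouched configuration stays on localhost.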
Nadav Har'El	8618346331	config: automate experimental_features_t::all() The experimental_features_t has an all() method, supposedly returning all values of the enum - but it's easy to forget to update it when adding a new experimental feature - and it's currently out-of-sync (it's missing the ALTERNATOR_TTL option). We already have another method, map(), where a new experimental feature must be listed otherwise it can't be used, so let's just take all()'s values from map(), automatically, instead of forcing developers to keep both lists up-to-date. Note that using the all() function to enable all experimental features is not recommended - the best practice is to enable specific experimental features, not all of them. Nevertheless, this all() function is still used in one place - in the cql_repl tool - which uses it to enable all experimental features. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211108135601.78460-1-nyh@scylladb.com>	2021-11-29 18:44:23 +02:00
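The idea in the commit above, deriving all() from map() so the two lists cannot drift apart, can be illustrated with a short Python sketch. The names here are hypothetical: only ALTERNATOR_TTL comes from the commit message, the other enum members and all the string keys are placeholders, and the real code is C++.

```python
from enum import Enum


class Feature(Enum):
    # ALTERNATOR_TTL is named in the commit message; the other two
    # members are placeholders invented for this sketch.
    EXAMPLE_A = 1
    EXAMPLE_B = 2
    ALTERNATOR_TTL = 3


def feature_map():
    """The single list every feature must appear in to be usable."""
    return {
        "example-a": Feature.EXAMPLE_A,
        "example-b": Feature.EXAMPLE_B,
        "alternator-ttl": Feature.ALTERNATOR_TTL,
        # Assumption: a second name for the same feature is allowed,
        # so the derived list has to be de-duplicated.
        "example-b-alias": Feature.EXAMPLE_B,
    }


def all_features():
    """Derive the full feature list from feature_map() automatically."""
    return sorted(set(feature_map().values()), key=lambda f: f.name)
```

Adding a new feature to `feature_map()` is then enough for `all_features()` to include it, which is the maintenance problem the commit sets out to remove.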
Calle Wilund	a8bb4dcd28	tls: Add certficate_revocation_list option for client/server encryption options Fixes #9630 Adds support for importing a CRL (certificate revocation list). This will be monitored and reloaded like certs/keys. Allows blacklisting individual certs. Closes #9655	2021-11-17 14:24:22 +02:00
Pavel Emelyanov	a62631d441	config: Enable developer-mode by default in dev/debug modes Other than looking sane, this change continues the tradition, founded by the --workdir option, of freeing the developer from the annoying necessity of typing too many options when scylla is started by hand for development purposes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211116104815.31822-1-xemul@scylladb.com>	2021-11-16 12:53:33 +02:00

194 Commits