scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	6d0c8fb6e2	config: Add constexpr value for default murmur ignore bits ... and use in some places of sstable_compaction_test. This will allow getting rid of global test_db_config thing later Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-01 13:38:15 +03:00
Nadav Har'El	2dedb5ea75	alternator: make Alternator TTL feature no longer "experimental" Until now, the Alternator TTL feature was considered "experimental", and had to be manually enabled on all nodes of the cluster to be usable. This patch removes this requirement and in essence GAs this feature. Even after this patch, Alternator TTL is still a "cluster feature", i.e., for this feature to be usable every node in the cluster needs to support it. If any of the nodes is old and does not yet support this feature, the UpdateTimeToLive request will not be accepted, so although the expiration-scanning threads may exist on the newer nodes, they will not do anything because none of the tables can be marked as having expiration enabled. This patch does not contain documentation fixes - the documentation still suggests that the Alternator TTL feature is experimental. The documentation patch will come separately. Fixes #12037 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12049	2022-11-24 17:21:39 +02:00
Avi Kivity	20bad62562	Merge 'Detect and record large collections' from Benny Halevy This series adds support for detecting collections that have too many items and recording them in `system.large_cells`. A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000. Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly. A new column was added to system.large_cells: `collection_items`. Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold. Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred. Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it. The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled. In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items. 
Closes #11449 Closes #11674 * github.com:scylladb/scylladb: sstables: mx/writer: optimize large data stats members order sstables: mx/writer: keep large data stats entry as members db: large_data_handler: dynamically update config thresholds utils/updateable_value: add transforming_value_updater db/large_data_handler: cql_table_large_data_handler: record large_collections db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler db/large_data_handler: cql_table_large_data_handler: move ctor out of line docs: large-rows-large-cells-tables: fix typos db/system_keyspace: add collection_elements column to system.large_cells gms/feature_service: add large_collection_detection cluster feature test: sstable_3_x_test: add test_sstable_too_many_collection_elements test: lib: simple_schema: add support for optional collection column test: lib: simple_schema: build schema in ctor body test: lib: simple_schema: cql: define s1 as static only if built this way db/large_data_handler: maybe_record_large_cells: consider collection_elements db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells sstables: mx/writer: add large_data_type::elements_in_collection db/large_data_handler: get the collection_elements_count_threshold db/config: add compaction_collection_elements_count_warning_threshold test: sstable_3_x_test: add test_sstable_write_large_cell test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random	2022-10-06 18:28:21 +03:00
Avi Kivity	37c6b46d26	dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty" The "virtual dirty" term is not very informative. "Virtual" means "not real", but it doesn't say in which way it isn't real. In this case, virtual dirty refers to real dirty memory, minus the portion of memtables that has been written to disk (but not yet sealed - in that case it would not be dirty in the first place). I chose to call "the portion of memtables that has been written to disk" as "spooled memory". At least the unique term will cause people to look it up and may be easier to remember. From that we have "unspooled memory". I plan to further change the accounting to account for spooled memory rather than unspooled, as that is a more natural term, but that is left for later. The documentation, config item, and metrics are adjusted. The config item is practically unused so it isn't worth keeping compatibility here.	2022-10-04 14:03:59 +03:00
Benny Halevy	167ec84eeb	db/config: add compaction_collection_elements_count_warning_threshold Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:31:10 +03:00
Piotr Sarna	481240b8b4	Merge 'Alternator: Run more TTL tests by default (and add a test for metrics)' from Nadav Har'El We had quite a few tests for Alternator TTL in test/alternator, but most of them did not run as part of the usual Jenkins test suite, because they were considered "very slow" (and require a special "--runveryslow" flag to run). In this series we enable six tests which run quickly enough to run by default, without an additional flag. We also make them even quicker - the six tests now take around 2.5 seconds. I also noticed that we don't have a test for the Alternator TTL metrics - and added one. Fixes #11374. Refs https://github.com/scylladb/scylla-monitoring/issues/1783 Closes #11384 * github.com:scylladb/scylladb: test/alternator: insert test names into Scylla logs rest api: add a new /system/log operation alternator ttl: log warning if scan took too long. alternator,ttl: allow sub-second TTL scanning period, for tests test/alternator: skip fewer Alternator TTL tests test/alternator: test Alternator TTL metrics	2022-09-22 09:47:50 +02:00
Michał Chojnowski	cdb3e71045	sstables: add a flag for disabling long-term index caching Long-term index caching in the global cache, as introduced in 4.6, is a major pessimization for workloads where accesses to the index are (spatially) sparse. We want to have a way to disable it for the affected workloads. There is already infrastructure in place for disabling it for BYPASS CACHE queries. One way of solving the issue is hijacking that infrastructure. This patch adds a global flag (and a corresponding CLI option) which controls index caching. Setting the flag to `false` causes all index reads to behave like they would in BYPASS CACHE queries. Consequences of this choice: - The per-SSTable partition_index_cache is unused. Every index_reader has its own, and they die together. Independent reads can no longer reuse the work of other reads which hit the same index pages. This is not crucial, since partition accesses have no (natural) spatial locality. Note that the original reason for partition_index_cache -- the ability to share reads for the lower and upper bound of the query -- is unaffected. - The per-SSTable cached_file is unused. Every index_reader has its own (uncached) input stream from the index file, and every bsearch_clustered_cursor has its own cached_file, which dies together with the cursor. Note that the cursor still can perform its binary search with caching. However, it won't be able to reuse the file pages read by index_reader. In particular, if the promoted index is small, and fits inside the same file page as its index_entry, that page will be re-read. It can also happen that index_reader will read the same index file page multiple times. When the summary is so dense that multiple index pages fit in one index file page, advancing the upper bound, which reads the next index page, will read the same index file page. Since summary:disk ratio is 1:2000, this is expected to happen for partitions with size greater than 2000 partition keys. 
Fixes #11202	2022-09-15 17:16:26 +03:00
Nadav Har'El	8ece63c433	Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails' This series introduces two configurable options when working with TWCS tables: - `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting - `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check. Refs: #6923 Fixes: #9029 Closes #11445 * github.com:scylladb/scylladb: tests: cql_query_test: add mixed tests for verifying TWCS guard rails tests: cql_query_test: add test for TWCS window size tests: cql_query_test: add test for TWCS tables with no TTL defined cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables cql: add max window restriction for TimeWindowCompactionStrategy time_window_compaction_strategy: reject invalid window_sizes cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr	2022-09-12 23:55:51 +03:00
Nadav Har'El	e7e9adc519	alternator,ttl: allow sub-second TTL scanning period, for tests Alternator has the "alternator_ttl_period_in_seconds" parameter for controlling how often the expiration thread looks for expired items to delete. It is usually a very large number of seconds, but for tests to finish quickly, we set it to 1 second. With 1 second expiration latency, test/alternator/test_ttl.py took 5 seconds to run. In this patch, we change the parameter to allow a floating-point number of seconds instead of just an integer. Then, this allows us to halve the TTL period used by tests to 0.5 seconds, and as a result, the run time of test_ttl.py halves to 2.5 seconds. I think this is fast enough for now. I verified that even if I change the period to 0.1, there is no noticeable slowdown to other Alternator tests, so 0.5 is definitely safe. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-09-12 10:32:56 +03:00
Botond Dénes	5374f0edbf	Merge 'Task manager' from Aleksandra Martyniuk Task manager for observing and managing long-running, asynchronous tasks in Scylla with the interface for the user. It will allow listing of tasks, getting detailed task status and progression, waiting for their completion, and aborting them. The task manager will be configured with a “task ttl” that determines how long the task status is kept in memory after the task completes. At first it will support repair and compaction tasks, and possibly more in the future. Currently: Sharded `task_manager` is started in `main.cc` where it is further passed to `http_context` for the purpose of user interface. Task manager's tasks are implemented in two layers: the abstract and the implementation one. The latter is a pure virtual class which needs to be overridden by each module. Abstract layer provides the methods that are shared by all modules and the access to module-specific methods. Each module can access task manager, create and manage its tasks through `task_manager::module` object. This way data specific to a module can be separated from the other modules. User can access task manager rest api interface to track asynchronous tasks. The available options consist of: - getting a list of modules - getting a list of basic stats of all tasks in the requested module - getting the detailed status of the requested task - aborting the requested task - waiting for the requested task to finish To enable testing of the provided api, test specific task implementation and module are provided. Their lifetime can be simulated with the standalone test api. These components are compiled and the tests are run in all but release build modes. 
Fixes: #9809 Closes #11216 * github.com:scylladb/scylladb: test: task manager api test task_manager: test api layer implementation task_manager: add test specific classes task_manager: test api layer task_manager: api layer implementation task_manager: api layer task_manager: keep task_manager reference in http_context start sharded task manager task_manager: create task manager object	2022-09-12 09:26:46 +03:00
Felipe Mendes	7fec4fcaa6	cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth. However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we cannot simply forbid table creations without a default_time_to_live explicitly set to any value other than 0. The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn": We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database omitting the `USING TTL` keyword.	2022-09-11 16:50:42 -03:00
Felipe Mendes	a3356e866b	cql: add max window restriction for TimeWindowCompactionStrategy The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation where users of TWCS underestimate their window buckets. Unfortunately, scenarios in which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough. Such a configuration is known to only harm a workload: As more and more windows are created, the number of SSTables will grow at the same pace, and the situation will only get worse as the number of shards increases. This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check. Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.	2022-09-11 16:50:22 -03:00
Aleksandra Martyniuk	2439e55974	task_manager: create task manager object Implementation of a task manager that allows tracking and managing asynchronous tasks. The tasks are represented by task_manager::task class providing members common to all types of tasks. The methods that differ among tasks of different modules can be overridden in a class inheriting from task_manager::task::impl class. Each task stores its status containing parameters like id, sequence number, begin and end time, state etc. After the task finishes, it is kept in memory for a configurable time or until it is unregistered. Tasks need to be created with make_task method. Each module is represented by task_manager::module type and should have access to task manager through task_manager::module methods. That makes it easy to separate and collectively manage data belonging to each module.	2022-09-09 14:29:28 +02:00
Mikołaj Grzebieluch	5b1421cc33	db: config: add BROADCAST_TABLES feature flag Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature. This feature requires raft group0, thus enabling it without RAFT will cause an error.	2022-09-05 11:11:08 +02:00
Benny Halevy	d295d8e280	everywhere: define locator::host_id as a strong tagged_uuid type So it can be distinguished from other uuid-based identifiers in the system. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11276	2022-08-12 06:01:44 +03:00
Botond Dénes	33f0447ba0	db/config: add config item for query tombstone limit This will be the value used to break pages, after processing the specified amount of tombstones. The page will be cut even if empty. We could maybe use the already existing tombstone_{warn,fail}_threshold instead and use them as a soft/hard limit pair, like we did with page sizes.	2022-08-09 10:00:40 +03:00
Benny Halevy	edd308c705	config: use ordered map for experimental features So that the help string will be sorted lexicographically. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11178	2022-08-01 17:40:10 +03:00
Pavel Emelyanov	7d0110cd31	config: Add stream_io_throughput_mb_per_sec option It's going to control the bandwidth for the streaming prio class. For now it's just added but doesn't work for real Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:14:41 +03:00
Nadav Har'El	cc69177dcc	config: fix printing of experimental feature list Recently we noticed a regression where with certain versions of the fmt library, SELECT value FROM system.config WHERE name = 'experimental_features' returns string numbers, like "5", instead of feature names like "raft". It turns out that the fmt library keeps changing their overload resolution order when there are several ways to print something. For enum_option<T> we happen to have two conflicting ways to print it: 1. We have an explicit operator<<. 2. We have an implicit converter to the type held by T. We were hoping that the operator<< always wins. But in fmt 8.1, there is special logic that if the type is convertible to an int, this is used before operator<<()! For experimental_features_t, the type held in it was an old-style enum, so it is indeed convertible to int. The solution I used in this patch is to replace the old-style enum in experimental_features_t by the newer and more recommended "enum class", which does not have an implicit conversion to int. I could have fixed it in other ways, but it wouldn't have been much prettier. For example, dropping the implicit converter would require us to change a bunch of switch() statements over enum_option (and not just experimental_features_t, but other types of enum_option). Going forward, all uses of enum_option should use "enum class", not "enum". tri_mode_restriction_t was already using an enum class, and now so does experimental_features_t. I changed the examples in the comments to also use "enum class" instead of enum. This patch also adds to the existing experimental_features test a check that the feature names are words that are not numbers. Fixes #11003. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11004	2022-07-11 09:17:30 +02:00
Tomasz Grabiec	6622e3369a	config: Introduce force_schema_commit_log option	2022-07-06 22:08:56 +02:00
Tomasz Grabiec	b8d20335a4	config: Introduce unsafe_ignore_truncation_record The node now refuses to boot if schema tables were truncated. This adds a config option to ignore truncation records as a workaround if user truncated them manually.	2022-07-06 22:08:56 +02:00
Avi Kivity	dab56b82fa	Merge 'Per-partition rate limiting' from Piotr Dulikowski Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode. This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension: ``` ALTER TABLE ks.tbl WITH per_partition_rate_limit = { 'max_writes_per_second': 100, 'max_reads_per_second': 200 }; ``` Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead. Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all. The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here. 
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`: - Write mode: ``` `8f690fdd47` (PR base): 129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op) This PR: 125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op) ``` - Read mode: ``` `8f690fdd47` (PR base): 150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op) This PR: 151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op) ``` Manual upgrade test: - Start 3 nodes, 4 shards each, Scylla version `8f690fdd47` - Create a keyspace with scylla-bench, RF=3 - Start reading and writing with scylla-bench with CL=QUORUM - Manually upgrade nodes one by one to the version from this PR - Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded - Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected Fixes: #4703 Closes #9810 * github.com:scylladb/scylla: storage_proxy: metrics for per-partition rate limiting of reads storage_proxy: metrics for per-partition rate limiting of writes database: add stats for per partition rate limiting tests: add per_partition_rate_limit_test config: add add_per_partition_rate_limit_extension function for testing cf_prop_defs: guard per-partition rate limit with a feature query-request: add allow_limit flag storage_proxy: add allow rate limit flag to get_read_executor storage_proxy: resultize return type of get_read_executor storage_proxy: add per partition rate limit info to read RPC storage_proxy: add per partition rate limit info to query_result_local(_digest) storage_proxy: add allow rate limit flag to mutate/mutate_result storage_proxy: add allow rate limit flag to mutate_internal storage_proxy: add allow rate limit flag to mutate_begin storage_proxy: choose the right per partition rate limit info in write handler storage_proxy: resultize return types of write handler creation path storage_proxy: add per 
partition rate limit to mutation_holders storage_proxy: add per partition rate limit info to write RPC storage_proxy: add per partition rate limit info to mutate_locally database: apply per-partition rate limiting for reads/writes database: move and rename: classify_query -> classify_request schema: add per_partition_rate_limit schema extension db: add rate_limiter storage_proxy: propagate rate_limit_exception through read RPC gms: add TYPED_ERRORS_IN_READ_RPC cluster feature storage_proxy: pass rate_limit_exception through write RPC replica: add rate_limit_exception and a simple serialization framework docs: design doc for per-partition rate limiting transport: add rate_limit_error	2022-06-24 01:32:13 +03:00
Piotr Dulikowski	761a037afb	config: add add_per_partition_rate_limit_extension function for testing ...and use it in cql_test_env to enable the per_partition_rate_limit extension for all tests that use it.	2022-06-22 20:16:49 +02:00
Pavel Emelyanov	820be06ac1	hints: Remove snitch dependency After the previous patch, the hints manager class is left with an unused dependency on snitch. While removing it, it turns out that several unrelated places get needed headers indirectly via host_filter.hh -> snitch_base.hh inclusion. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-06-22 11:47:26 +03:00
Benny Halevy	6677028212	sstables: mx/writer: auto-scale promoted index Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB). When the promoted index (serialized) size gets to this threshold, it's halved by merging each two adjacent blocks into one and doubling the desired_block_size. Fixes #4217 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-05-24 13:32:35 +03:00
Calle Wilund	7dd7760e8d	commitlog: Make flush threshold a config parameter	2022-04-11 16:34:00 +00:00
Calle Wilund	d478896d46	commitlog: kill non-recycled segment management It has been default for a while now. Makes no sense to not do it. Even hints can use it (even if it makes no difference there)	2022-04-11 16:34:00 +00:00
Piotr Sarna	3272b4826f	db: add keyspace-storage-options experimental feature Specifying non-standard keyspace options is experimental, so it's going to be protected by a configuration flag.	2022-04-08 09:17:01 +02:00
Nadav Har'El	49a8164fb7	alternator: add configurable scan period to TTL expiration Before this patch, the experimental TTL (expiration time) feature in Alternator scans tables for expiration in a tight loop - starting the next scan one second after the previous one completed. In this patch we introduce a new configuration option, alternator_ttl_period_in_seconds, which determines how frequently to start the scan. The default is 24 hours - meaning that the next scan is started 24 hours after the previous one started. The tests (test/alternator/run) change this configuration back to one second, so that expiration tests finish as quickly as possible. Please note that the scan is not slowed down to fill this 24 hours - if it finishes in one hour, it will then sleep for 23 hours. Additional work would be needed to slow down the scan to not finish too quickly. One idea not yet implemented is to move the expiration service from the "maintenance" scheduling group which it uses today to a new scheduling group, and modifying the number of shares that this group gets. Another thing worth noting about the configurable period (which defaults to 24 hours) is that when TTL is enabled on an Alternator table, it can take that amount of time until its scan starts and items start expiring from it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-02-25 07:26:11 +02:00
Michael Livshin	3fef604075	sstables_manager: add get_local_host_id() method and support Since ME sstable format includes originating host id in stats metadata, local host id needs to be made available for writing and validation. Both Scylla server (where local host id comes from the `system.local` table) and unit tests (where it is fabricated) must be accommodated. Regardless of how the host id is obtained, it is stored in the db config instance and accessed through `sstables_manager`. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michael Livshin	0b1447c702	add "sstable_format" config Initialize it to "md" until ME format support is complete (i.e. storing originating host id in sstable stats metadata is implemented), so at present there is no observable change by default. Also declare "enable_sstables_md_format" unused -- the idea, going forward, being that only "sstable_format" controls the written sstable file format and that no more per-format enablement config options shall be added. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-16 18:21:24 +02:00
Michał Sala	b439d6e710	db: config: add a flag to disable new parallelized aggregation algorithm Just in case the new algorithm turns out to be buggy, add a flag to fall-back to the old algorithm.	2022-02-01 21:26:25 +01:00
Pavel Emelyanov	a026b4ef49	config: Add option to disable config updates via CQL The system.config table allows changing config parameters, but this change doesn't survive restarts and is considered to be dangerous (sometimes). Add an option to disable the table updates. The option is LiveUpdate and can be set to false via CQL too (once). fixes #9976 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220201121114.32503-1-xemul@scylladb.com>	2022-02-01 14:30:47 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Kamil Braun	e98711cfcb	db: config: add a flag to disable new reversed reads algorithm Just in case the new algorithm turns out to be buggy, or give a performance regression, add a flag to fall-back to the old algorithm for use in the field.	2022-01-12 18:59:19 +01:00
Asias He	eba4a4fba4	repair: Allow ignoring dead nodes for replace operation Consider 1) n1, n2, n3, n4, n5 2) n2 and n3 are both down 3) start n6 to replace n2 4) start n7 to replace n3 We want to replace the dead nodes n2 and n3 to fix the cluster to have 5 running nodes. Replace operation in step 3 will fail because n3 is down. We would see errors like below: replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed for replace operation are down. It is highly recommended to fix the down nodes and try again. In the above example, currently, there is no way to replace any of the dead nodes. Users can either fix one of the dead nodes and run replace or run removenode operation to remove one of the dead nodes then run replace and run bootstrap to add another node. Fixing dead nodes is always the best solution but it might not be possible. Running removenode operation is not better than running replace operation (with best effort by ignoring the other dead node) in terms of data consistency. In addition, users have to run bootstrap operation to add back the removed node. So, allowing replacing in such case is a clear win. This patch adds the --ignore-dead-nodes-for-replace option to allow run replace operation with best effort mode. Please note, use this option only if the dead nodes are completely broken and down, and there is no way to fix the node and bring it back. This also means the user has to make sure the ignored dead nodes specified are really down to avoid any data consistency issue. Fixes #9757 Closes #9758	2021-12-20 00:49:03 +02:00
Avi Kivity	f28552016f	Update seastar submodule * seastar f8a038a0a2...8d15e8e67a (21): > core/program_options: preserve defaultness of CLI arguments > log: Silence logger when logging > Include the core/loop.hh header inside when_all.hh header > http: Fix deprecated wrappers > foreign_ptr: Add concept > util: file: add read_entire_file > short_streams: move to util > Revert "Merge: file: util: add read_entire_file utilities" > foreign_ptr: declare destroy as a static method > Merge: file: util: add read_entire_file utilities > Merge "output_stream: handle close failure" from Benny > net: bring local_address() to seastar::connected_socket. > Merge "Allow programatically configuring seastar" from Botond > Merge 'core: clean up memory metric definitions' from John Spray > Add PopOS to debian list in install-dependencies.sh > Merge "make shared_mutex functions exception safe and noexcept" from Benny > on_internal_error: set_abort_on_internal_error: return current state > Implementation of iterator-range version of when_any > net: mark functions returning ethernet_address noexcept > net: ethernet_address: mark functions noexcept > shared_mutex: mark wake and unlock methods noexcept Contains patch from Botond Dénes <bdenes@scylladb.com>: db/config: configure logging based on app_template::seastar_options Scylla has its own config file which supports configuring aspects of logging, in addition to the built-in CLI logging options. When applying this configuration, the CLI provided option values have priority over the ones coming from the option file. To implement this scylla currently reads CLI options belonging to seastar from the boost program options variable map. The internal representation of CLI options however do not constitute an API of seastar and are thus subject to change (even if unlikely). This patch moves away from this practice and uses the new shiny C++ api: `app_template::seastar_options` to obtain the current logging options.	2021-12-08 14:21:11 +02:00
Pavel Emelyanov	d513034ca4	utils: Ability to set_value(sstring) for an option There soon will appear an updateable system.config table that will push sstrings into named_value-s. Prepare for this change by adding the respective .set_value() call. Since the update only works for LiveUpdate-able options, and the inability to do so can be propagated back to the caller, make this method return true/false depending on whether the update took place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-11 15:15:05 +03:00
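The return-value contract described in the commit above can be illustrated with a minimal sketch. The class `named_value_sketch` and the `liveness` enum are hypothetical stand-ins written for this illustration; they are not the actual `utils::config_file` types, and the real method takes an `sstring` rather than `std::string`.

```cpp
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for a string-valued config option.
enum class liveness { LiveUpdate, MustRestart };

class named_value_sketch {
    std::string _name;
    std::string _value;
    liveness _liveness;
public:
    named_value_sketch(std::string name, std::string initial, liveness l)
        : _name(std::move(name)), _value(std::move(initial)), _liveness(l) {}

    // Applies the new value only when the option is live-updateable.
    // Returns true if the update took place, false otherwise, so the
    // caller can report the failure instead of silently ignoring it.
    bool set_value(std::string v) {
        if (_liveness != liveness::LiveUpdate) {
            return false;
        }
        _value = std::move(v);
        return true;
    }

    const std::string& name() const { return _name; }
    const std::string& value() const { return _value; }
};
```

A caller writing to such an option would check the returned flag and surface an error for options that require a restart.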
Pavel Emelyanov	71ce7c6e87	db.config: Verbose address resolver helper The helper works on named_value() and throws an exception containing the option name for convenient error reporting. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-11-08 17:33:27 +03:00
Nadav Har'El	e4a6569258	config: experimental flag UNUSED_CDC shouldn't be distinct from UNUSED When an experimental feature graduates from being experimental, we want to continue to allow the old "--experimental-features=..." option to work, in case some user's configuration uses it - just do nothing. The way we do it is to map in db::experimental_features_t::map() the feature's name to the UNUSED value - this way the feature's name is accepted, but doesn't change anything. When the CDC feature graduated from being experimental, a new bit UNUSED_CDC was introduced to do the same thing. This separate bit was not actually necessary - if we ever check for UNUSED_CDC bit anywhere in the code it means the flag isn't actually unused ;-) And we don't check it. So simplify the code by conflating UNUSED_CDC into UNUSED. This will also make it easy to build from db::experimental_features_t::map() a list of current experimental features - now it will simply be those that do not map to UNUSED. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20211013105107.123544-1-nyh@scylladb.com>	2021-10-20 17:54:17 +03:00
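The mapping idea in the commit above can be sketched roughly as follows. The enum, the table contents, and both function names are assumptions made for illustration; they are not the actual `db::experimental_features_t` definitions.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the experimental feature enum; the entries
// are illustrative and not the real list.
enum class feature { UNUSED, UDF, ALTERNATOR_TTL };

// Name-to-feature table. A graduated name ("cdc") is still accepted on the
// command line but maps to UNUSED, so selecting it changes nothing.
inline const std::map<std::string, feature>& feature_map() {
    static const std::map<std::string, feature> m = {
        {"cdc", feature::UNUSED},
        {"udf", feature::UDF},
        {"alternator-ttl", feature::ALTERNATOR_TTL},
    };
    return m;
}

// The current experimental features are the names that do not map to UNUSED.
inline std::vector<std::string> current_experimental_features() {
    std::vector<std::string> names;
    for (const auto& [name, f] : feature_map()) {
        if (f != feature::UNUSED) {
            names.push_back(name);
        }
    }
    return names;
}
```

With a single UNUSED value, no separate per-feature "unused" bit is needed, which is the simplification the commit message argues for.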
Tomasz Grabiec	e89b9799b8	Merge 'sstable mx reader: implement reverse single-partition reads' from Kamil Braun Until now reversed queries were implemented inside `querier::consume_page` (more precisely, inside the free function `consume_page` used by `querier::consume_page`) by wrapping the passed-in reader into `make_reversing_reader` and then consuming fragments from the resulting reversed reader. The first couple of commits change that by pushing the reversing down below the `make_combined_reader` call in `table::query`. This allows working on improving reversing for memtables independently from reversing for sstables. We then extend the `index_reader` with functions that allow reading the promoted index in reverse. We introduce `partition_reversing_data_source`, which wraps an sstable data file and returns data buffers with contents of a single chosen partition as if the rows were stored in reverse order. We use the reversing source and the extended index reader in `mx_sstable_mutation_reader` to implement efficient (at least in theory) reversed single-partition reads. The patchset disables cache for reversed reads. Fast-forwarding is not supported in the mx reader for reversed queries at this point. Details in commit messages. Read the commits in topological order for best review experience. 
Refs: #9134 (not saying "Fixes" because it's only for single-partition queries without forwarding) Closes #9281 * github.com:scylladb/scylla: table: add option to automatically bypass cache for reversed queries test: reverse sstable reader with random schema and random mutations sstables: mx: implement reversed single-partition reads sstables: mx: introduce partition_reversing_data_source sstables: index_reader: add support for iterating over clustering ranges in reverse clustering_key_filter: clustering_key_filter_ranges owning constructor flat_mutation_reader: mention reversed schema in make_reversing_reader docstring clustering_key_filter: document clustering_key_filter_ranges::get_ranges	2021-10-04 15:37:34 +02:00
Kamil Braun	703aed3277	table: add option to automatically bypass cache for reversed queries Currently the new reversing sstable algorithms do not support fast forwarding and the cache does not yet handle reversed results. This forced us to disable the cache for reversed queries if we want to guarantee bounded memory. We introduce an option that does this automatically (without specifying `bypass cache` in the query) and turn it on by default. If the user decides that they prefer to keep the cache at the cost of fetching entire partitions into memory (which may be viable if their partitions are small) during reversed queries, the option can be turned off. It is live-updateable.	2021-10-04 15:24:12 +02:00
Pavel Emelyanov	bbcf671276	config: Remove unused replacing options The --replace-token and --replace-node were added some time ago, but have never been used since then, just parsed and immediately aborted. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210930102222.16294-1-xemul@scylladb.com>	2021-09-30 14:56:04 +03:00
Nadav Har'El	4ffd8c1f2b	alternator: stub TTL operations This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive operations to Alternator. These operations can enable, disable, or inquire about, the chosen expiration-time attribute. Currently, the information about the chosen attribute is only saved, with no actual expiration of any items taking place. Some of the tests for the TTL feature start to pass, so their xfail tag is removed. Because this new feature is incomplete, it is not enabled unless the "alternator-ttl" experimental feature is enabled. Moreover, for these operations to be allowed, the entire cluster needs to support this experimental feature, because all nodes need to participate in the data expiration - if some old nodes don't support Alternator TTL, some of the data they hold won't get expired... So we don't allow enabling TTL until all the nodes in the cluster support this feature. The implementation is in a new source file, alternator/ttl.cc. This source file will continue to grow as we implement the expiration feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-09-19 21:05:21 +03:00
Avi Kivity	c5f52f9d97	schema_tables: don't flush in tests Flushing schema tables is important for crash recovery (without a flush, we might have sstables using a new schema before the commitlog entry noting the schema change has been replayed), but not important for tests that do not test crash recovery. Avoiding those flushes reduces system, user, and real time on tests running on a consumer-level SSD. before: real 8m51.347s user 7m5.743s sys 5m11.185s after: real 7m4.249s user 5m14.085s sys 2m11.197s Note real time is higher than user+sys time divided by the number of hardware threads, indicating that there is still idle time due to the disk flushing, so more work is needed. Closes #9319	2021-09-12 11:32:13 +03:00
Avi Kivity	705f957425	Merge "Generalize TLS creds builder configuration" from Pavel E " There are 4 places out there that do the same steps parsing "client_|server_encryption_options" and configuring the seastar::tls::creds_builder with the values (messaging, redis, alternator and transport). Also to make redis and transport look slimmer main() cleans the client_encryption_options by ... parsing it too. This set introduces a (coroutinized) helper to configure the creds_builder with map<string, string> and removes the options beautification from main. tests: unit(dev), dtest.internode_ssl_test(dev) " * 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla: code: Generalize tls::credentials_builder configuration transport, redis: Do not assume fixed encryption options messaging: Move encryption options parsing to ms main: Open-code internode encryption misconfig warning main, config: Move options parsing helpers	2021-09-01 14:19:19 +03:00
Avi Kivity	8b59e3a0b1	Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski Return the pre- `6773563d3` behavior of demanding ALLOW FILTERING when partition slice is requested but on potentially unlimited number of partitions. Put it on a flag defaulting to "off" for now. Fixes #7608; see comments there for justification. Tests: unit (debug, dev), dtest (cql_additional_test, paging_test) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #9126 * github.com:scylladb/scylla: cql3: Demand ALLOW FILTERING for unlimited, sliced partitions cql3: Track warnings in prepared_statement test: Use ALLOW FILTERING more strictly cql3: Add statement_restrictions::to_string	2021-08-31 18:05:26 +03:00
Dejan Mircevski	2f28f68e84	cql3: Demand ALLOW FILTERING for unlimited, sliced partitions When a query requests a partition slice but doesn't limit the number of partitions, require that it also says ALLOW FILTERING. Although do_filter() isn't invoked for such queries, the performance can still be unexpectedly slow, and we want to signal that to the user by demanding they explicitly say ALLOW FILTERING. Because we now reject queries that worked fine before, existing applications can break. Therefore, the behavior is controlled by a flag currently defaulting to off. We will default to "on" in the next Scylla version. Fixes #7608; see comments there for justification. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2021-08-31 10:45:41 -04:00
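The rule stated in the commit above reduces to a small predicate. The sketch below is only an illustration under stated assumptions: `query_shape`, its fields, and `must_reject` are hypothetical names, and the real check operates on the statement restrictions inside cql3 rather than on plain booleans.

```cpp
// Hypothetical summary of the properties the rule looks at.
struct query_shape {
    bool has_partition_slice;   // the query restricts rows within a partition
    bool partitions_limited;    // the set of partitions is bounded
    bool allow_filtering;       // the query text says ALLOW FILTERING
};

// Returns true when the query should be rejected: a partition slice over a
// potentially unlimited number of partitions, without ALLOW FILTERING, and
// only while the strict-mode flag (off by default per the commit) is on.
inline bool must_reject(const query_shape& q, bool strict_flag) {
    return strict_flag
        && q.has_partition_slice
        && !q.partitions_limited
        && !q.allow_filtering;
}
```

Keeping the flag as a separate argument mirrors the commit's point that existing applications keep working until the default is switched on.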
Pavel Solodovnikov	22794efc22	db: add experimental option for raft Introduce `raft` experimental option. Adjust the tests accordingly to accommodate the new option. It's not enabled by default when providing `--experimental=true` config option and should be requested explicitly via `--experimental-options=raft` config option. Hide the code related to `raft_group_registry` behind the switch. The service object is still constructed but no initialization is performed (`init()` is not called) if the flag is not set. Later, other raft-related things, such as raft schema changes, will also use this flag. Also, don't introduce a corresponding gossiper feature just yet, because again, it should be done after the raft schema changes API contract is stabilized. This will be done in a separate series, probably related to implementing the feature itself. Tests: unit(dev) Ref #9239. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>	2021-08-23 17:45:58 +03:00
Pavel Emelyanov	e02b39ca3d	code: Generalize tls::credentials_builder configuration All the places in code that configure the mentioned creds builder from client_|server_encryption_options now do it the same way. This patch generalizes it all in the utils:: helper. The alternator code "ignores" require_client_auth and truststore keys, but it's easy to make the generalized helper be compatible. Also make the new helper coroutinized from the beginning. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-08-20 18:05:41 +03:00


259 Commits