The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of these being
resurrected, and in the view builder generating view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.
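The detach/resume idea above can be illustrated with a toy sketch (all names here are invented for illustration, not Scylla's actual compactor API): a consumer that remembers the active partition tombstone, and a resume path that either carries that state over or, like the buggy path, primes the new consumer with only the partition key.

```cpp
#include <cassert>

// Toy compactor: rows older than the active partition tombstone are dropped.
struct toy_compactor {
    long partition_tombstone = 0; // timestamp; 0 = no tombstone

    void consume_partition_start(long tombstone_ts) {
        partition_tombstone = tombstone_ts;
    }
    // Returns true if the row survives compaction.
    bool consume_row(long row_ts) const {
        return row_ts > partition_tombstone;
    }
    // Detach the state needed to resume this partition with a new compactor.
    long detach_state() const { return partition_tombstone; }
};

// Correct resume: the recreated compactor is primed with the detached state,
// so rows shadowed by the tombstone stay dead.
inline bool resumed_row_survives(long detached_tombstone, long row_ts) {
    toy_compactor c;
    c.consume_partition_start(detached_tombstone);
    return c.consume_row(row_ts);
}

// Buggy resume: only the partition key was injected (tombstone lost),
// so shadowed rows are resurrected.
inline bool buggy_resumed_row_survives(long row_ts) {
    toy_compactor c;
    c.consume_partition_start(0);
    return c.consume_row(row_ts);
}
```

A row written at timestamp 50 under a tombstone at 100 is correctly dropped on resume with detached state, but survives (is resurrected) on the buggy key-only resume.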
Fixes: #11668
Closes #11671
The logic to reject explicit snapshot of views/indexes was improved in aa127a2dbb. However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.
This is implemented in this patch.
The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.
Fixes #11612
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11616
* github.com:scylladb/scylladb:
database: automatically take snapshot of base table views
api: storage_service: reject snapshot of views in api layer
Rather than pushing the check to
`snapshot_ctl::take_column_family_snapshot`, just check
that explicitly when taking a snapshot of a particular
table by name over the api.
Other paths that call snapshot_ctl::take_column_family_snapshot
are internal and use it to snap views already.
With that, we can get rid of the allow_view_snapshots flag
that was introduced in aab4cd850c.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When warning about a large cell/collection in a static row,
print that fact in the log warning to make it clearer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently (since f3089bf3d1),
when printing a warning to the log about large rows and/or cells
the clustering key string is concatenated to the partition key string,
rendering the warning confusing and much less useful.
This patch adds a '/' delimiter to separate the fields,
and also uses one to separate the clustering key from the column name
for large cells. In case of a static cell, the clustering key is null
hence the warning will look like: `pk//column`.
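A minimal sketch of the key formatting described above (the helper name is invented, not the actual Scylla function): the partition key, clustering key, and column name are joined with '/', and a static cell's null clustering key leaves an empty middle component.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Format the key part of a large-row/large-cell log warning as
// "pk/ck/column"; for a static cell the clustering key is absent,
// yielding "pk//column".
inline std::string format_large_cell_key(const std::string& pk,
                                         const std::optional<std::string>& ck,
                                         const std::string& column) {
    return pk + "/" + (ck ? *ck : "") + "/" + column;
}
```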
This patch does NOT change anything in the large_* system
table schema or contents. It changes only the log warning format
that need not be backward compatible.
Fixes #11620
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a Cassandra functional test - show a warning/error when tombstone_warn_threshold/tombstone_failure_threshold is reached on select, per partition. Propagate the raw query_string from the coordinator to the replicas.
Closes #11356
* github.com:scylladb/scylladb:
add utf8:validate to operator<< partition_key with_schema.
Show warn message if `tombstone_warn_threshold` reached on querier.
Found by a fragment stream validator added to the mutation-compactor (https://github.com/scylladb/scylladb/pull/11532). As that PR moves very slowly, the fixes for the issues found are split out into a PR of their own.
The first two of these issues seem benign, but it is important to remember that how benign an invalid fragment stream is depends entirely on the consumer of said stream. The present consumers may swallow the invalid stream without problems now, but any future change may cause them to enter a corrupt state.
The last one is not benign (the consumer already reacts badly), causing problems when building query results for range scans.
Closes #11604
* github.com:scylladb/scylladb:
shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
readers/mutation_readers: compacting_reader: remember injected partition-end
db/view: view_builder::execute(): only inject partition-start if needed
When the querier reads a page with more tombstones than the `tombstone_warn_threshold` limit, a warning message appears in the logs.
Setting `tombstone_warn_threshold: 0` disables the feature.
Refs scylladb#11410
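The threshold semantics described above can be sketched as a small predicate (invented helper name, not the actual querier code): warn only when a nonzero threshold is exceeded, with 0 disabling the check entirely.

```cpp
#include <cassert>

// Decide whether to emit the tombstone warning after reading a page.
// A threshold of 0 disables the warning.
inline bool should_warn_tombstones(unsigned long tombstones_seen,
                                   unsigned long warn_threshold) {
    return warn_threshold != 0 && tombstones_seen > warn_threshold;
}
```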
When resuming a build-step, the view builder injects the partition-start
fragment of the last processed partition, to bring the consumer
(compactor) into the correct state before it starts to consume the
remainder of the partition content. This results in an invalid fragment
stream when the partition was actually over or there is nothing left for
the build step. Make the injection conditional on whether the reader
contains more data for the partition.
Fixes: #11607
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).
In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.
I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.
Fixes #11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783
Closes #11384
* github.com:scylladb/scylladb:
test/alternator: insert test names into Scylla logs
rest api: add a new /system/log operation
alternator ttl: log warning if scan took too long.
alternator,ttl: allow sub-second TTL scanning period, for tests
test/alternator: skip fewer Alternator TTL tests
test/alternator: test Alternator TTL metrics
There's a bunch of helpers for the CDC generation service in db/system_keyspace.cc. All are static and use the global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods through them, thus reducing the global qctx usage.
Closes #11557
* github.com:scylladb/scylladb:
system_keyspace: De-static get_cdc_generation_id()
system_keyspace: De-static cdc_is_rewritten()
system_keyspace: De-static cdc_set_rewritten()
system_keyspace: De-static update_cdc_generation_id()
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes #11202
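The gating logic described above can be sketched in a few lines (names invented for illustration): one global flag controls index caching, and turning it off makes every index read take the same path a BYPASS CACHE read already takes.

```cpp
#include <cassert>

// Per-read options; bypass_cache corresponds to a BYPASS CACHE query.
struct read_options {
    bool bypass_cache;
};

// The global cache_index_pages flag gates use of the shared
// partition_index_cache / cached_file; when off, every index read
// behaves like a BYPASS CACHE read (private, short-lived caches only).
inline bool use_global_index_cache(bool cache_index_pages_flag,
                                   const read_options& opts) {
    return cache_index_pages_flag && !opts.bypass_cache;
}
```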
It's been ~1 year (2bf47c902e) since we set restrict_dtcs
config option to WARN, meaning users have been warned about the
deprecation process of DTCS.
Let's set the config to TRUE, meaning that create and alter statements
specifying DTCS will be rejected at the CQL level.
Existing tables will still be supported. But the next step will
be about throwing DTCS code into the shadow realm, and after that,
Scylla will automatically fall back to STCS (or ICS) for users who
ignored the deprecation process.
Refs #8914.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11458
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes #11508
Closes #11536
This series introduces two configurable options when working with TWCS tables:
- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.
Refs: #6923
Fixes: #9029
Closes #11445
* github.com:scylladb/scylladb:
tests: cql_query_test: add mixed tests for verifying TWCS guard rails
tests: cql_query_test: add test for TWCS window size
tests: cql_query_test: add test for TWCS tables with no TTL defined
cql: add configurable restriction of default_time_to_live for TimeWindowCompactionStrategy tables
cql: add max window restriction for TimeWindowCompactionStrategy
time_window_compaction_strategy: reject invalid window_sizes
cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
`feature_service` provided two sets of features: `known_feature_set` and
`supported_feature_set`. The purpose of both and the distinction between
them was unclear and undocumented.
The 'supported' features were gossiped by every node. Once a feature is
supported by every node in the cluster, it becomes 'enabled'. This means
that whatever piece of functionality is covered by the feature, it can
be used by the cluster from now on.
The 'known' set was used to perform feature checks on node start; if the
node saw that a feature is enabled in the cluster, but the node does not
'know' the feature, it would refuse to start. However, if the feature
was 'known', but wasn't 'supported', the node would not complain. This
means that we could in theory allow the following scenario:
1. all nodes support feature X.
2. X becomes enabled in the cluster.
3. the user changes the configuration of some node so feature X will
become unsupported but still known.
4. The node restarts without error.
So now we have a feature X which is enabled in the cluster, but not
every node supports it. That does not make sense.
It is not clear whether it was accidental or purposeful that we used the
'known' set instead of the 'supported' set to perform the feature check.
What I think is clear is that having two sets makes the entire thing
unnecessarily complicated and hard to think about.
Fortunately, at the base to which this patch is applied, the sets are
always the same. So we can easily get rid of one of them.
I decided that the name which should stay is 'supported', I think it's
more specific than 'known' and it matches the name of the corresponding
gossiper application state.
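The start-up check that closes the known-but-unsupported loophole can be sketched as follows (function and set names are invented for illustration): with a single 'supported' set, a node refuses to start if any feature already enabled in the cluster is missing from its own supported set.

```cpp
#include <cassert>
#include <set>
#include <string>

// A node may start only if it supports every feature that is already
// enabled cluster-wide. With one set there is no way to "know" a feature
// without supporting it, so the scenario described above cannot occur.
inline bool can_start(const std::set<std::string>& supported,
                      const std::set<std::string>& cluster_enabled) {
    for (const auto& feature : cluster_enabled) {
        if (!supported.count(feature)) {
            return false; // enabled in cluster but unsupported locally
        }
    }
    return true;
}
```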
Closes #11512
Alternator has the "alternator_ttl_period_in_seconds" parameter for
controlling how often the expiration thread looks for expired items to
delete. It is usually a very large number of seconds, but for tests
to finish quickly, we set it to 1 second.
With 1 second expiration latency, test/alternator/test_ttl.py took 5
seconds to run.
In this patch, we change the parameter to allow a floating-point number
of seconds instead of just an integer. Then, this allows us to halve the
TTL period used by tests to 0.5 seconds, and as a result, the run time of
test_ttl.py halves to 2.5 seconds. I think this is fast enough for now.
I verified that even if I change the period to 0.1, there is no noticeable
slowdown to other Alternator tests, so 0.5 is definitely safe.
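The parameter change amounts to treating the period as a floating-point number of seconds and converting it to a finer-grained duration for the timer; a minimal sketch (helper name invented):

```cpp
#include <cassert>
#include <chrono>

// Convert a scan period given as a (possibly fractional) number of
// seconds into milliseconds, so tests can use sub-second periods
// like 0.5s while production keeps large integer values.
inline std::chrono::milliseconds ttl_period_ms(double seconds) {
    return std::chrono::milliseconds(
        static_cast<long long>(seconds * 1000.0));
}
```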
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Task manager for observing and managing long-running, asynchronous tasks in Scylla,
with an interface for the user. It will allow listing tasks, getting detailed
task status and progress, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.
At first it will support repair and compaction tasks, and possibly more in the future.
Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.
Task manager's tasks are implemented in two layers: the abstract one
and the implementation one. The latter is a pure virtual class which needs
to be overridden by each module. The abstract layer provides the methods that
are shared by all modules and access to module-specific methods.
Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.
Users can access the task manager REST API to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish
To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.
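The two-layer task design described above can be sketched like this (a simplified illustration with invented member names, not Scylla's actual task_manager classes): a shared outer `task` holds the status common to all modules and delegates module-specific behavior to a virtual `impl` that each module overrides.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Abstract layer: common to all modules.
struct task {
    // Implementation layer: pure virtual, overridden per module.
    struct impl {
        virtual ~impl() = default;
        virtual std::string module_name() const = 0;
    };

    explicit task(std::unique_ptr<impl> i) : _impl(std::move(i)) {}
    std::string module() const { return _impl->module_name(); }

private:
    std::unique_ptr<impl> _impl;
};

// Example module-specific implementation (hypothetical).
struct repair_task_impl : task::impl {
    std::string module_name() const override { return "repair"; }
};

// Helper to demonstrate dispatch through the abstract layer.
inline std::string make_repair_task_module() {
    task t(std::make_unique<repair_task_impl>());
    return t.module();
}
```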
Fixes: #9809
Closes #11216
* github.com:scylladb/scylladb:
test: task manager api test
task_manager: test api layer implementation
task_manager: add test specific classes
task_manager: test api layer
task_manager: api layer implementation
task_manager: api layer
task_manager: keep task_manager reference in http_context
start sharded task manager
task_manager: create task manager object
TimeWindowCompactionStrategy (TWCS) tables are known to be used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired as in a sliding window. Failure to do so may create unbounded windows, which, depending on the compaction window chosen, may introduce severe latency and operational problems due to unbounded window growth.
However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0.
The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn":
We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as, ideally, a default_time_to_live value should always be set, to prevent applications from ingesting data into the database while omitting the `USING TTL` keyword.
The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then, users of TWCS end up underestimating their window buckets. Unfortunately, scenarios where one employs a default_time_to_live of 1 year but a window size of 30 minutes are not rare enough.
Such a configuration is known to only harm a workload: as more and more windows are created, the number of SSTables grows at the same pace, and the situation only gets worse as the number of shards increases.
This commit introduces the twcs_max_window_count option, which defaults to 50, and forbids the creation or alteration of tables which exceed this threshold. A value of 0 explicitly skips this check.
Note: this option does not forbid the creation of tables with default_time_to_live=0 since, even though not recommended, it is perfectly possible for a TWCS table with default TTL=0 to have bounded windows, provided all ingestion statements use 'USING TTL'.
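The window-count arithmetic and the guard rail can be sketched as follows (helper names invented; the option name twcs_max_window_count is from the description above). A 1-year TTL with a 30-minute window yields 365 * 48 = 17520 windows, far past the default limit of 50.

```cpp
#include <cassert>

// Potential window count is default_time_to_live / compaction window size.
inline long twcs_window_count(long default_ttl_s, long window_size_s) {
    return window_size_s > 0 ? default_ttl_s / window_size_s : 0;
}

// Reject CREATE/ALTER when the count exceeds twcs_max_window_count;
// a max of 0 disables the check.
inline bool twcs_window_count_ok(long default_ttl_s, long window_size_s,
                                 long max_window_count) {
    return max_window_count == 0
        || twcs_window_count(default_ttl_s, window_size_s) <= max_window_count;
}
```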
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.
The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.
In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.
In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing
Closes #11164
* github.com:scylladb/scylladb:
raft: broadcast_tables: add broadcast_kv_store test
raft: broadcast_tables: add returning query result
raft: broadcast_tables: add execution of intermediate language
raft: broadcast_tables: add compilation of cql to intermediate language
raft: broadcast_tables: add definition of intermediate language
db: system_keyspace: add broadcast_kv_store table
db: config: add BROADCAST_TABLES feature flag
Implementation of a task manager that allows tracking
and managing asynchronous tasks.
The tasks are represented by task_manager::task class providing
members common to all types of tasks. The methods that differ
among tasks of different modules can be overridden in a class
inheriting from task_manager::task::impl class. Each task stores
its status containing parameters like id, sequence number, begin
and end time, state etc. After the task finishes, it is kept
in memory for configurable time or until it is unregistered.
Tasks need to be created with make_task method.
Each module is represented by the task_manager::module type and should
have access to the task manager through task_manager::module methods.
That makes it easy to separate and collectively manage data
belonging to each module.
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and use it to access the repair history maps.
In this introductory patch, we use a default-constructed
tombstone_gc_state to access the thread-local maps
temporarily and those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The first implementation of strongly consistent everywhere tables operates on a simple table
representing a string-to-string map.
Add hard-coded schema for broadcast_kv_store table (key text primary key,
value text). This table is under system keyspace and is created if and only if
BROADCAST_TABLES feature is enabled.
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
"
The topology object maintains all sort of node/DC/RACK mappings on
board. When new entries are added to it the DC and RACK are taken
from the global snitch instance which, in turn, checks gossiper,
system keyspace and its local caches.
This set make topology population API require DC and RACK via the
call argument. In most of the cases the populating code is the
storage service that knows exactly where to get those from.
After this set it will be possible to remove the dependency knot
consiting of snitch, gossiper, system keyspace and messaging.
"
* 'br-topology-dc-rack-info' of https://github.com/xemul/scylla:
topology: Use the provided dc/rack info
test: Provide testing dc/rack infos
storage_service: Provide dc/rack for snitch reconfiguration
storage_service: Provide dc/rack from system ks on start
storage_service: Provide dc/rack from gossiper for replacement
storage_service: Provide dc/rack from gossiper for remotes
storage_service,dht,repair: Provide local dc/rack from system ks
system_keyspace: Cache local dc-rack on .start()
topology: Some renames after previous patch
topology: Require entry in the map for update_normal_tokens()
topology: Make update_endpoint() accept dc-rack info
replication_strategy: Accept dc-rack as get_pending_address_ranges argument
dht: Carry dc-rack over boot_strapper and range_streamer
storage_service: Make replacement info a real struct
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:
scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.
The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.
If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.
The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.
Fixes#11396
There's a cache of endpoint:{dc,rack} on system keyspace cache, but the
local node is not there, because this data is populated from the peers
table, while local node's dc/rack is in snitch (or system.local table).
At the same time, storage_service::join_cluster() and whoever it calls
(e.g. -- the repair) will need this info on start and it's convenient
to have this data on sys-ks cache.
It's not on the peers part of the cache because next branch removes this
map and it's going to be very clumsy to have a whole container with just
one entry in it.
There's code in system_keyspace::setup() that gets the local node's
dc/rack and commits it into the system.local table. However, putting
the data into the cache is done in .start(). This is because cql-test-env
needs this data cached too, but it doesn't call sys_ks.setup(). This will be
cleaned up some other day.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, if a keyspace has an aggregate and the keyspace
is dropped, the keyspace becomes corrupted and another keyspace
with the same name cannot be created again.
This is caused by the fact that when removing an aggregate, we
call create_aggregate() to get values for its name and signature.
In the create_aggregate(), we check whether the row and final
functions for the aggregate exist.
Normally, that's not an issue, because when dropping an existing
aggregate alone, we know that its UDFs also exist. But when dropping
an entire keyspace, we first drop the UDFs, making us unable to drop
the aggregate afterwards.
This patch fixes this behavior by removing the create_aggregate()
from the aggregate dropping implementation and replacing it with
specific calls for getting the aggregate name and signature.
Additionally, a test that would previously fail is added to
cql-pytest/test_uda.py where we drop a keyspace with an aggregate.
Fixes #11327
Closes #11375
Define an enum class, `group0_upgrade_state`, describing the state of
the upgrade procedure (implemented in later commits).
Provide IDL definitions for (de)serialization.
The node will have its current upgrade state stored on disk in
`system.scylla_local` under the `group0_upgrade_state` key. If the key
is not present we assume `use_pre_raft_procedures` (meaning we haven't
started upgrading yet or we're at the beginning of upgrade).
Introduce `system_keyspace` accessor methods for storing and retrieving
the on-disk state.
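A minimal sketch of the on-disk state handling described above (the enum name and the `use_pre_raft_procedures` default are from the description; the other state names and the parsing helper are assumptions for illustration): an absent key maps to the pre-raft default.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class group0_upgrade_state {
    use_pre_raft_procedures, // default: not yet upgrading
    synchronize,             // hypothetical intermediate state
    use_post_raft_procedures // hypothetical final state
};

// Load the state from the system.scylla_local key; a missing key means
// the node hasn't started upgrading yet (or is at the very beginning).
inline group0_upgrade_state load_upgrade_state(
        const std::optional<std::string>& on_disk) {
    if (!on_disk) {
        return group0_upgrade_state::use_pre_raft_procedures;
    }
    if (*on_disk == "synchronize") {
        return group0_upgrade_state::synchronize;
    }
    if (*on_disk == "use_post_raft_procedures") {
        return group0_upgrade_state::use_post_raft_procedures;
    }
    return group0_upgrade_state::use_pre_raft_procedures;
}
```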
This pull request introduces global secondary-indexing for non-frozen collections.
The intent is to enable queries such as:
```
CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);
SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```
Here we use the all-familiar materialized views (MVs). Scylla treats all
collections the same way - as a list of (key, value) pairs. In case
of sets, the value type is a dummy one. In case of lists, the key type is
TIMEUUID. When describing the design, I will ignore that there is more
than one collection type. Suppose that the columns in the base table
were as follows:
```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```
The MV schema is as follows (the names of columns which are not the same
as in base might be different). All the columns here form the primary
key.
```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```
The reason for the last additional column is that the values from a collection might not be unique.
Fixes #2962
Fixes #8745
Fixes #10707
This patch does not implement **local** secondary indexes for collection columns: Refs #10713.
Closes #10841
* github.com:scylladb/scylladb:
test/cql-pytest: un-xfail yet another passing collection-indexing test
secondary index: fix paging in map value indexing
test/cql-pytest: test for paging with collection values index
cql, view: rename and explain bytes_with_action
cql, index: make collection indexing a cluster feature
test/cql-pytest: failing tests for oversized key values in MV and SI
cql: fix secondary index "target" when column name has special characters
cql, index: improve error messages
cql, index: fix default index name for collection index
test/cql-pytest: un-xfail several collecting indexing tests
test/cql-pytest/test_secondary_index: verify that local index on collection fails.
docs/design-notes/secondary_index: add `VALUES` to index target list
test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
test/boost/secondary_index_test: update for non-frozen indexes on collections
test/cql-pytest: Uncomment collection indexes tests that should be working now
cql, index: don't use IS NOT NULL on collection column
cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections
cql3, index: Use entries() indexes on collections for queries
cql3, index: Use keys() and values() indexes on collections for queries.
types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
db/view/view.cc: compute view_updates for views over collections
view info: has_computed_column_depending_on_base_non_primary_key
column_computation: depends_on_non_primary_key_column
schema, index/secondary_index_manager: make schema for index-induced mv
index/secondary_index_manager: extract keys, values, entries types from collection
cql3/statements/: validate CREATE INDEX for index over a collection
cql3/statements/create_index_statement,index_target: rewrite index target for collection
column_computation.hh, schema.cc: collection_column_computation
column_computation.hh, schema.cc: compute_value interface refactor
Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
It should have had one: derived instances are stored and destroyed via
the base class. The only reason this hasn't caused bugs yet is that
derived instances happen to not have any non-trivial members yet.
Closes #11293
Fixes #11184
Fixes #11237
In the previous (broken) fix for https://github.com/scylladb/scylladb/issues/11184 we added the footprint for left-over
files (replay candidates) to the disk footprint on commitlog init.
This effectively prevents us from creating segments if we have tight limits. Since we nowadays do quite a few inserts _before_ commitlog replay (system.local, but...) we can end up in a situation where startup deadlocks because we cannot get to the actual replay that would eventually free things.
Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances, even though only shard 0 will actually replay + delete (i.e. drop the footprint).
So shards 1-X would all be either locked out or performance degraded.
Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved.
To further emphasize this, don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially ask for the list twice.
More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
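The accounting change above can be sketched as follows (all names here are hypothetical stand-ins, not the real commitlog API): footprint for replay candidates is charged only while the delete call runs on the replaying shard, instead of being charged on every shard at init.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct segment_file { std::string path; uint64_t size; };

class commitlog_shard {
    uint64_t _footprint = 0;
    uint64_t _limit;
public:
    explicit commitlog_shard(uint64_t limit) : _limit(limit) {}
    // Before the fix, init() added the left-over files' sizes to _footprint on
    // every shard, which could block segment creation before replay freed anything.
    bool can_create_segment(uint64_t sz) const { return _footprint + sz <= _limit; }
    // After the fix: footprint is charged when files are handed to delete, so
    // only the replaying shard is affected, and only for the (fast) delete call.
    void delete_segments(const std::vector<segment_file>& files) {
        for (const auto& f : files) _footprint += f.size;  // candidates for recycling
        for (const auto& f : files) _footprint -= f.size;  // delete completes, charge drops
    }
    uint64_t footprint() const { return _footprint; }
};
```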
Closes#11251
* github.com:scylladb/scylladb:
commitlog: Make get_segments_to_replay on-demand
commitlog: Revert/modify fac2bc4 - do footprint add in delete
The function `check_exists` checks whether a given table exists, giving
an error otherwise. It previously used `on_internal_error`.
`check_exists` is used in some old functions that insert CDC metadata to
CDC tables. These tables are no longer used in newer Scylla versions
(they were replaced with other tables with different schema), and this
function is no longer called. The table definitions were removed and
these tables are no longer created. They will only exist in clusters
that were upgraded from old versions of Scylla (4.3) through a sequence
of upgrades.
If you tried to upgrade from a very old version of Scylla which had
neither the old nor the new tables to a modern version, say from 4.2 to
5.0, you would get `on_internal_error` from this `check_exists`
function. Fortunately:
1. we don't support such upgrade paths
2. `on_internal_error` in production clusters does not crash the system,
   only throws. The exception would be caught, printed, and the system
   would keep running (just without CDC - until you finished the upgrade
   and called the proper nodetool command to fix the CDC module).
Unfortunately, there is a dtest (`partitioner_tests.py`) which performs
an unsupported upgrade scenario - it starts Scylla from Cassandra (!)
work directories, which is like upgrading from a very old version of
Scylla.
This dtest was not failing due to another bug which masked the problem.
When we try to fix the bug - see #11225 - the dtest starts hitting the
assertion in `check_exists`. Because it's a test, we configure
`on_internal_error` to crash the system.
The point of this commit is to not crash the system in this rare
scenario which happens only in some weird tests. We now throw
`std::runtime_error` instead of calling `on_internal_error`. In the
dtest, we already ignore the resulting CDC error appearing in the logs
(see scylladb/scylla-dtest#2804). Together with this change, we'll be
able to fix the #11225 bug and pass this test.
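The behavioral change can be sketched like this (the table set, logger, and helper names are illustrative, not Scylla's actual declarations): a plain `std::runtime_error` replaces the internal-error path, so the caller's existing catch-and-log handling keeps the system running even in builds where internal errors abort.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

static const std::set<std::string> existing_tables = {"system.local"};

void check_exists(const std::string& table) {
    if (!existing_tables.count(table)) {
        // was: on_internal_error(...), which aborts when tests configure it to crash
        throw std::runtime_error("table does not exist: " + table);
    }
}

bool cdc_metadata_insert(const std::string& table) {
    try {
        check_exists(table);
        return true;   // table present, proceed with the insert
    } catch (const std::runtime_error&) {
        return false;  // error is caught and logged; the system keeps running
    }
}
```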
Closes #11287
The structure "bytes_with_action" was very hard to understand because of
its mysterious and general-sounding name, and no comments.
In this patch I add a large comment explaining its purpose, and rename
it to a more suitable name, view_key_and_action, which suggests that
each such object is about one view key (where to add a view row), and
an additional "action" that we need to take beyond adding the view row.
This is the best I can do to make this code easier to understand without
completely reorganizing it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For collection indexes, the logic of computing values for each of the
columns needed to change, since a single column might produce more
than one value as a result.
The liveness info from individual cells of the collection impacts the
liveness info of the resulting rows. Therefore the control flow needed
to be rewritten - instead of functions getting a row from get_view_row
and later computing row markers and applying them, they compute these
values themselves.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the case of secondary indexes, if an index does not contain any
column from the base table that makes up its primary key, then it is
assumed that during an update, a change to some cells of the base table
cannot mean that we're dealing with a different row in the view. This
however doesn't take into account the possibility of computed columns
which do in fact depend on some non-primary-key columns. Introduce an
additional property of an index,
has_computed_column_depending_on_base_non_primary_key.
This type of column computation will be used for creating updates to
materialized views that are indexes over collections.
This type features an additional function, compute_values_with_action,
which, depending on an (optional) old row and a new row (the update to
the base table), returns multiple bytes_with_action - a vector of pairs
(computed value, action), where the action signifies whether a row with
a specific key needs to be deleted or created.
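A hedged sketch of that interface's shape (all types and names here are illustrative simplifications, not Scylla's real declarations): diffing the old and new collection values yields multiple (value, action) pairs, a deletion for each key that disappeared and a creation for each key that appeared.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class action { row_deletion, row_creation };
using value_and_action = std::pair<std::string, action>;

// old_row / new_row stand in for the old and new collection value of a cell.
std::vector<value_and_action> compute_values_with_action(
        const std::map<std::string, std::string>& old_row,
        const std::map<std::string, std::string>& new_row) {
    std::vector<value_and_action> result;
    for (const auto& kv : old_row) {
        if (!new_row.count(kv.first)) {
            result.push_back({kv.first, action::row_deletion});  // key removed
        }
    }
    for (const auto& kv : new_row) {
        if (!old_row.count(kv.first)) {
            result.push_back({kv.first, action::row_creation});  // key added
        }
    }
    return result;
}
```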
The compute_value function of column_computation previously had the
following signature:
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
This is superfluous: never in the history of Scylla was the last
parameter (row) used in any implementation, and never did it return a
disengaged bytes_opt. The absurdity of this interface is especially
visible at call sites like the following, where a dummy empty row was
created:
```
token_column.get_computation().compute_value(
*_schema, pkv_linearized, clustering_row(clustering_key_prefix::make_empty()));
```