this mirrors what we have in `configure.py`: build the CqlParser with `-O1`
and disable `-fsanitize-address-use-after-scope` when compiling CqlParser.cc,
in order to prevent the compiler from emitting code which uses a large amount
of stack space at runtime.
Closes scylladb/scylladb#15819
* github.com:scylladb/scylladb:
build: cmake: avoid using a large amount of stack when compiling parser
build: cmake: s/COMPILE_FLAGS/COMPILE_OPTIONS/
There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in:
- `system_distributed_keyspace::start`,
- `redis::create_keyspace_if_not_exists_impl`,
- `table_helper::setup_keyspace` (for the `system_traces` keyspace).
All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this:
- calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum,
- to support concurrent bootstrap in Raft-based topology, we must implement a retrying mechanism for every automatic `announce` in case `group0_concurrent_modification` occurs. Doing it before fixing the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder.
This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME.
Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring.
Fixes #15437
Closes scylladb/scylladb#15594
* github.com:scylladb/scylladb:
table_helper: announce twice in setup_keyspace
table_helper: refactor setup_table
redis: create_keyspace_if_not_exists_impl: fix indentation
redis: announce once in create_keyspace_if_not_exists_impl
db: system_distributed_keyspace: fix indentation
db: system_distributed_keyspace: announce once in start
tablet_allocator: update on_before_create_column_family
migration_listener: add parameter to on_before_create_column_family
alternator: executor: use new prepare_new_column_family_announcement
alternator: executor: introduce create_keyspace_metadata
migration_manager: add new prepare_new_column_family_announcement
The topology coordinator should handle failures internally as long as it
remains the coordinator. The raft state monitor is not in a better
position to handle any errors thrown by it; all it can do is restart
the coordinator. This series makes topology_coordinator::run handle all
errors internally and marks the function noexcept so as not to leak
error-handling complexity into the raft state monitor.
* 'gleb/15728-fix' of github.com:scylladb/scylla-dev:
storage_service: raft topology: mark topology_coordinator::run function as noexcept
storage_service: raft topology: do not throw error from fence_previous_coordinator()
it is printed when pytest passes it down as a fixture, as part of
the logging message. this would help with debugging an object_store test.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15817
The function handled all exceptions internally. By making it noexcept we
make sure that the caller (raft_state_monitor_fiber) does not need to
handle any exceptions from the topology coordinator fiber.
Throwing an error kills the topology coordinator monitor fiber. Instead, we
retry the operation until it succeeds or the node loses its leadership.
This is fine because the operation needs a quorum to succeed, and if
the quorum is not available the node should relinquish its leadership.
Fixes #15728
this mirrors what we have in `configure.py`: build the CqlParser with -O1
and disable sanitize-address-use-after-scope when compiling CqlParser.cc,
in order to prevent the compiler from emitting code which uses a large
amount of stack at runtime.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we create a new UUID for a new sstable managed by the s3_storage, and we use the string representation of UUID defined by RFC 4122, like "0aa490de-7a85-46e2-8f90-38b8f496d53b", for naming the objects stored on s3_storage. but this representation is not what we are using for storing sstables on the local filesystem when the option "uuid_sstable_identifiers_enabled" is enabled. instead, we are using a base36-based representation, which is shorter.
to be consistent with the naming of the sstables created for the local filesystem, and more importantly, to simplify the interaction between the local copy of sstables and those stored on object storage, we should use the same string representation of the sstable identifier.
so, in this change:
1. instead of creating a new UUID, just reuse the generation of the sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined by RFC 4122 (implemented by fmt::formatter<utils::UUID>) to the base36-based one (implemented by fmt::formatter<sstables::generation_type>)
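For illustration, the shorter base36 form can be sketched in Python; this is a toy rendering of a UUID's 128 bits that only shows why base36 is shorter than the 36-character RFC 4122 form, not Scylla's actual `fmt::formatter<sstables::generation_type>` encoding:

```python
import uuid

# Toy base36 rendering of a UUID's 128 bits (illustrative only; NOT
# Scylla's actual generation_type formatter).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36(n: int) -> str:
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

u = uuid.UUID("0aa490de-7a85-46e2-8f90-38b8f496d53b")
short = base36(u.int)  # 128 bits fit in at most 25 base36 digits
```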
Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#14406
* github.com:scylladb/scylladb:
sstable: remove _remote_prefix from s3_storage
sstable: switch to uuid identifier for naming S3 sstable objects
this series extracts the env-variable-related functions and removes unused `import`s for better readability.
Closes scylladb/scylladb#15796
* github.com:scylladb/scylladb:
test/pylib: remove duplicated imports
test/pylib: extract the env variable printing into MinIoServer
test/pylib: extract _set_environ() out
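The extraction described above might look roughly like the following sketch; the real helper lives in `test/pylib`, and the name here only illustrates the pattern of replacing module-level statements with a function:

```python
import os

# Sketch of the pattern: gather scattered os.environ assignments into a
# single helper instead of running them as module-level statements.
def _set_environ(env: dict) -> None:
    for name, value in env.items():
        os.environ[name] = value

_set_environ({"DEMO_VAR": "1"})
```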
default_compaction_progress_monitor returns a reference to a static
object. So, it should be read-only, but its users need to modify it.
Delete default_compaction_progress_monitor and use one's own
compaction_progress_monitor instance where it's needed.
Closes scylladb/scylladb#15800
This commit enables publishing documentation
from branch-5.4. The docs will be published
as UNSTABLE (the warning about version 5.4
being unstable will be displayed).
Closes scylladb/scylladb#15762
The log-structured allocator maintains memory reserves so that
operations using log-structured allocator memory can have some
working memory and can allocate. The reserves start small and are
increased if allocation failures are encountered. Before starting
an operation, the allocator first frees memory to satisfy the reserves.
One problem is that if the reserves are set to a high value and
we encounter a stall, then, first, we have no idea what value
the reserves are set to, and second, we have no idea what operation
caused the reserves to be increased.
We fix this problem by promoting the log reports of reserve increases
from DEBUG level to INFO level and by attaching a stack trace to
those reports. This isn't optimal since the messages are used
for debugging, not for informing the user about anything important
for the operation of the node, but I see no other way to obtain
the information.
Ref #13930.
Closes scylladb/scylladb#15153
Said method is called in an allocating section, which will re-try the enclosed lambda on allocation failure. `read_context()` however moves the permit parameter so on the second and later calls, the permit will be in a moved-from state, triggering a `nullptr` dereference and therefore a segfault.
We already have a unit test (`test_exception_safety_of_reads` in `row_cache_test.cc`) which was supposed to cover this, but:
* It only tests range scans, not single partition reads, which is a separate path.
* Turns out allocation failure tests are again silently broken (no error is injected at all). This is because `test/lib/memtable_snapshot_source.hh` creates a critical alloc section which accidentally covers the entire duration of tests using it.
Fixes: #15578
Closes scylladb/scylladb#15614
* github.com:scylladb/scylladb:
test/boost/row_cache_test: test_exception_safety_of_reads: also cover single-partition reads
test/lib/memtable_snapshot_source: disable critical alloc section while waiting
row_cache: make_reader_opt(): make make_context() reentrant
The semantics of major compaction are that all data of a table will be
compacted together, so the user can expect e.g. a recently introduced
tombstone to be compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in the major compaction until it's promoted into the main set by
off-strategy compaction. So the user might be left wondering why major
compaction is not having the expected effect.
To fix this, let's perform off-strategy compaction first, so data in the
maintenance set will be made available to major compaction. A similar
approach is taken for data in memtables: flush is performed before major
compaction starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy compaction on behalf of
the table.
Fixes #11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15792
This patch adds a reproducer for a minor incompatibility between Scylla's
and Cassandra's handling of a prepared statement when a bind marker with
the same name is used more than once, e.g.,
```
SELECT * FROM tbl WHERE p=:x AND c=:x
```
It turns out that Scylla tells the driver that there is only one bind
marker, :x, whereas Cassandra tells the driver that there are two bind
markers, both named :x. This makes no difference if the user passes
a map `{'x': 3}`, but if the user passes a tuple, Scylla accepts only
`(3,)` (assigning both bind markers the same value) and Cassandra
accepts only `(3,3)`.
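A toy model of the difference, in illustrative Python (this is not the servers' or drivers' actual prepared-statement metadata code; it only mimics deduplicated vs. non-deduplicated marker reporting):

```python
import re

# Toy model: how many bind markers a server might report for a statement
# with a repeated named marker (illustrative only).
def reported_markers(cql: str, dedupe: bool) -> list:
    names = re.findall(r":(\w+)", cql)
    return list(dict.fromkeys(names)) if dedupe else names

q = "SELECT * FROM tbl WHERE p=:x AND c=:x"
scylla_like = reported_markers(q, dedupe=True)      # one marker: bind (3,)
cassandra_like = reported_markers(q, dedupe=False)  # two markers: bind (3, 3)
```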
The test added in this patch demonstrates this incompatibility.
It fails on Scylla, passes on Cassandra, and is marked "xfail".
Refs #15559
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#15564
The exception thrown from row_level_repair::run does not show the root
cause of a failure, making it harder to debug.
Add the internal exception contents to the runtime_error message.
After the change the log will mention the real cause (last line), e.g.:
repair - repair[92db0739-584b-4097-b6e2-e71a66e40325]: 33 out of 132 ranges failed,
keyspace=system_distributed, tables={cdc_streams_descriptions_v2, cdc_generation_timestamps,
view_build_status, service_levels}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false,
failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_distributed,
cf=cdc_streams_descriptions_v2, range=(8720988750842579417,+inf))
(while cleaning up after seastar::abort_requested_exception (abort requested))
Closes scylladb/scylladb#15770
This PR improves how handling failures is documented and made accessible to the user.
- The Handling Failures section is moved from Raft to Troubleshooting.
- Two new topics about failure are added to Troubleshooting with a link to the Handling Failures page (Failure to Add, Remove, or Replace a Node, Failure to Update the Schema).
- A note is added to the add/remove/replace node procedures to indicate that a quorum is required.
See individual commits for more details.
Fixes https://github.com/scylladb/scylladb/issues/13149
Closes scylladb/scylladb#15628
* github.com:scylladb/scylladb:
doc: add a note about Raft
doc: add the quorum requirement to procedures
doc: add more failure info to Troubleshooting
doc: move Handling Failures to Troubleshooting
instead of appending the options to the CMake variables, use the dedicated commands to do this. simpler this way, and as a bonus the options are de-duplicated.
Closes scylladb/scylladb#15797
* github.com:scylladb/scylladb:
build: cmake: use add_link_options() when appropriate
build: cmake: use add_compile_options() when appropriate
this series is one of the steps to remove global statements in `configure.py`.
not only is the script more structured this way, it also allows us to quickly identify the parts which should/can be reused when migrating to the CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15780
* github.com:scylladb/scylladb:
build: update modeval using a dict
build: pass args.test_repeat and args.test_timeout explicitly
build: pull in jsoncpp using "pkgs"
build: extract code fragments into functions
instead of appending to CMAKE_EXE_LINKER_FLAGS*, use
add_link_options() to add more options: CMAKE_EXE_LINKER_FLAGS*
is a string, and is typically set by the user, so let's use
add_link_options() instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of appending to CMAKE_CXX_FLAGS, use add_compile_options()
to add more options: CMAKE_CXX_FLAGS is a string, and is typically
set by the user, so let's use add_compile_options() instead. the
options added by this command are placed before CMAKE_CXX_FLAGS, and
will have lower priority.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we use the sstable.generation() for the remote prefix of
the key of the object for storing the sstable component, there is
no need to set remote_prefix beforehand.
since `s3_storage::ensure_remote_prefix()` and
`system_keyspace::sstables_registry_lookup_entry()` are not used
anymore, they are removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we create a new UUID for a new sstable managed
by the s3_storage, and we use the string representation of UUID
defined by RFC 4122, like "0aa490de-7a85-46e2-8f90-38b8f496d53b", for
naming the objects stored on s3_storage. but this representation is
not what we are using for storing sstables on the local filesystem when
the option "uuid_sstable_identifiers_enabled" is enabled. instead,
we are using a base36-based representation, which is shorter.
to be consistent with the naming of the sstables created for the local
filesystem, and more importantly, to simplify the interaction between
the local copy of sstables and those stored on object storage, we should
use the same string representation of the sstable identifier.
so, in this change:
1. instead of creating a new UUID, just reuse the generation of the
sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As
we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined
by RFC 4122 (implemented by fmt::formatter<utils::UUID>) to the
base36-based one (implemented by
fmt::formatter<sstables::generation_type>)
4. enable the `uuid_sstable_identifiers` cluster feature if it is
enabled in the `test_env_config`, so that the sstable manager
can create UUID-based identifiers for new sstables.
5. throw if the generation of an sstable is not UUID-based when
accessing / manipulating an sstable with the S3 storage backend, as
the S3 storage backend now relies on this option. otherwise
we'd have sstables with keys like s3://bucket/number/basename, which
cannot serve as a unique id for an sstable if the bucket is
shared across multiple tables.
Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The following new commands are implemented:
* stop
* compactionhistory
All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.
Refs: https://github.com/scylladb/scylladb/issues/15588
Closes scylladb/scylladb#15649
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement compactionhistory command
tools/scylla-nodetool: implement stop command
mutation/json: extract generic streaming writer into utils/rjson.hh
test/nodetool: rest_api_mock.py: add support for error responses
This is the continuation of 8c03eeb85d.
Registering API handlers for services needs to
* get the service to handle requests via an argument, not from the http context (the http context, in turn, is not going to depend on anything)
* unset the handlers on stop so that the service is not used after it's stopped (and before the API server is stopped)
This makes task manager handlers work this way
Closes scylladb/scylladb#15764
* github.com:scylladb/scylladb:
api: Unset task_manager test API handlers
api: Unset task_manager API handlers
api: Remove ctx->task_manager dependency
api: Use task_manager& argument in test API handlers
api: Push sharded<task_manager>& down the test API set calls
api: Use task_manager& argument in API handlers
api: Push sharded<task_manager>& down the API set calls
instead of updating `modes` with global statements, update it in
a function, for better readability and to reduce the statements in
global scope.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change adds the "jsoncpp" dependency using "pkgs". simpler this
way. it also helps to remove more global statements.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change extracts `get_warnings_options()`. it helps to
remove more global statements.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The options parameter is redundant. We always use
`_metadata->strategy_options()` and
`keyspace::create_replication_strategy` already assumes that
`_metadata` is set by using its other fields.
Closes scylladb/scylladb#15776
memtable_snapshot_source starts a background fiber in its constructor,
which compacts LSA memory in a loop. The loop's inside is covered with a
critical alloc section. It also contains a wait on a condition variable
and in its present form the critical section also covers the wait,
effectively turning off allocation failure injection for any test using
the memtable_snapshot_source.
This patch disables the critical alloc section while the loop waits on
the condition variable.
Said lambda currently moves the permit parameter, so on the second and
later calls it will possibly run into use-after-move. This can happen if
the allocating section below fails and is re-tried.
the java-related build dependencies are installed by
* tools/java/install-dependencies.sh
* tools/jmx/install-dependencies.sh
respectively. the parent `install-dependencies.sh` always
invokes these scripts, so there is no need to repeat them in the
parent `install-dependencies.sh` anymore.
in addition to deduplicating the build deps, this change also helps to
reduce the size of the build dependencies: by default, `dnf`
installs the weak deps unless `--setopt=install_weak_deps=False`
is passed to it, so this change also reduces the traffic
and footprint of the packages installed for building scylla.
see also 9dddad27bf
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15473
test.py calls `uninstall()` and `stop()` concurrently from exit
artifacts, and `uninstall()` internally calls `stop()`. This leads to
premature releasing of IP addresses from `uninstall()` (returning IPs to
the pool) while the servers using those IPs are still stopping. Then a
server might obtain that IP from the pool and fail to start due to
"Address already in use".
Put a lock around the body of `stop()` to prevent that.
Fixes: scylladb/scylladb#15755
Closes scylladb/scylladb#15763
before this change, we print
marshaling error: Value not compatible with type org.apache.cassandra.db.marshal.AsciiType: '...'
but the wording is not quite user friendly: it is a mapping of the
underlying implementation, and users would have difficulty understanding
"marshaling" and/or "org.apache.cassandra.db.marshal.AsciiType"
when reading this error message.
so, in this change
1. change the error message to:
Invalid ASCII character in string literal: '...'
which should be more straightforward, and easier to digest.
2. update the test accordingly
please note, the quoted non-ASCII string is preserved instead of
being printed in hex, as otherwise the user would not be able to map it
to their input.
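A toy sketch of producing the new message; Scylla's actual validation lives in C++, so this Python fragment only mirrors the wording of the change:

```python
# Toy sketch (illustrative only): reject non-ASCII input with the
# friendlier message introduced by this change, preserving the input.
def check_ascii(s: str) -> str:
    if not s.isascii():
        raise ValueError(f"Invalid ASCII character in string literal: '{s}'")
    return s

try:
    check_ascii("héllo")
except ValueError as e:
    msg = str(e)
```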
Refs #14320
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15678
* seastar bab1625c...17183ed4 (73):
> thread_pool: Reference reactor, not point to
> sstring: inherit publicly from string_view formatter
> circleci: use conditional steps
> weak_ptr: include used header
> build: disable the -Wunused-* warnings for checkheaders
> resource: move variable into smaller lexical scope
> resource: use structured binding when appropriate
> httpd: Added server and client addresses to request structure
> io_queue: do not dereference moved-away shared pointer
> treewide: explicitly define ctor and assignment operator
> memory: use `err` for the error string
> doc: Add document describing all the math behind IO scheduler
> io_queue: Add flow-rate based self slowdown backlink
> io_queue: Make main throttler uncapped
> io_queue: Add queue-wide metrics
> io_queue: Introduce "flow monitor"
> io_queue: Count total number of dispatched and completed requests so far
> io_queue: Introduce io_group::io_latency_goal()
> tests: test the vector overload for when_all_succeed
> core: add a vector overload to when_all_succeed
> loop: Fix iterator_range_estimate_vector_capacity for random iters
> loop: Add test for iterator_range_estimate_vector_capacity
> core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
> memory: s/throw()/noexcept/
> build: enable -Wdeprecated compiler option
> reactor: mark kernel_completion's dtor protected
> tests: always wait for promise
> http, json, net: define-generated copy ctor for polymorphic types
> treewide: do not define constexpr static out-of-line
> reactor: do not define dtor of kernel_completion
> http/exception: stop using dynamic exception specification
> metrics: replace vector with deque
> metrics: change metadata vector to deque
> utils/backtrace.hh: make simple_backtrace formattable
> reactor: Unfriend disk_config_params
> reactor: Move add_to_flush_poller() to internal namespace
> reactor: Unfriend a bunch of sched group template calls
> rpc_test: Test rpc send glitches
> net: Implement batch flush support for existing sockets
> iostream: Configure batch flushes if sink can do it
> net: Added remote address accessors
> circleci: update the image to CircleCI "standard" image
> build: do not add header check target if no headers to check
> build: pass target name to seastar_check_self_contained
> build: detect glibc features using CMake
> build: extract bits checking libc into CheckLibc.cmake
> http/exception: add formatter for httpd::base_exception
> http/client: Mark write_body() const
> http/client: Introduce request::_bytes_written
> http/client: Mark maybe_wait_for_continue() const
> http/client: Mark send_request_head() const
> http/client: Detach setup_request()
> http/api_docs: copy in api_docs's copy constructor
> script: do not inherit from object
> scripts: addr2line: change StdinBacktraceIterator to a function
> scripts: addr2line: use yield instead defining a class
> tests: skip tests that require backtrace if execinfo.h is not found
> backtrace: check for existence of execinfo.h
> core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
> core: add sleep_abortable instantiation for manual_clock
> tls: Return EPIPE exception when writing to shutdown socket
> http/client: Don't cache connection if server advertises it
> http/client: Mark connection as "keep in cache"
> core: fix strerror_r usage from glibc extension
> reactor: access sigevent.sigev_notify_thread_id with a macro
> posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
> reactor: replace __mode_t with mode_t
> reactor: change sys/poll.h to posix poll.h
> rpc: Add unit test for per-domain metrics
> rpc: Report client connections metrics
> rpc: Count dead client stats
> rpc: Add seastar::rpc::metrics
> rpc: Make public queues length getters
io-scheduler fixes
refs: #15312
refs: #11805
http client fixes
refs: #13736
refs: #15509
rpc fixes
refs: #15462
Closes scylladb/scylladb#15774
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc.) is used to estimate the amount of droppable
data (or to determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, even though the user asked for a new one.
Another example is when you play with the value of the grace period, to prevent
data resurrection if repair can't run in a timely manner.
The problem persists until all sstables using the old GC settings are recompacted
or the node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes #15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15746
messaging_service.cc depends on idl, but many source files in
scylla-main do not depend on idl, so let's
* move "message/*" into its own directory and add an inter-library
dependency between it and the "idl" library.
* rename the target of "message" under test/manual to "message_test"
to avoid the name collision
this should address the compilation failure of
```
FAILED: CMakeFiles/scylla-main.dir/message/messaging_service.cc.o
/usr/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -MF CMakeFiles/scylla-main.dir/message/messaging_service.cc.o.d -o CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -c /home/kefu/dev/scylladb/message/messaging_service.cc
/home/kefu/dev/scylladb/message/messaging_service.cc:81:10: fatal error: 'idl/join_node.dist.hh' file not found
^~~~~~~~~~~~~~~~~~~~~~~
```
where the compiler failed to find the included `idl/join_node.dist.hh`,
which is exposed by the idl library as part of its public interface.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15657
this series is one of the steps to remove global statements in `configure.py`.
not only is the script more structured this way, it also allows us to quickly identify the parts which should/can be reused when migrating to the CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15668
* github.com:scylladb/scylladb:
build: move check for NIX_CC into dynamic_linker_option()
build: extract dynamic_linker_option() out
build: move `headers` into write_build_file()
to match the behavior of `configure.py`.
Closes scylladb/scylladb#15667
* github.com:scylladb/scylladb:
build: cmake: pass -dynamic-linker to ld
build: cmake: set CMAKE_EXE_LINKER_FLAGS in mode.common.cmake
`expression` is a std::variant with 16 different variants
that represent different types of AST nodes.
Let's add documentation that explains what each of these
16 types represents. For people who are not familiar with expression
code it might not be clear what each of them does, so let's add
clear descriptions for all of them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#15767
It seems that Scylla's DeleteTable operation returns more values than DynamoDB's.
In this patch I added a table-status check when generating the output.
If we delete the table, the values KeySchema, AttributeDefinitions and CreationDateTime won't be returned.
The test has also been modified to check that these attributes are not returned.
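As an illustration, the status check could look like this toy Python sketch (the function and dict shapes are made up; only the field names follow DynamoDB's API):

```python
# Toy sketch (made-up function and shapes): trim the fields that DynamoDB
# does not return for a table that is being deleted.
def delete_table_description(desc: dict) -> dict:
    trimmed = dict(desc)
    if trimmed.get("TableStatus") == "DELETING":
        for key in ("KeySchema", "AttributeDefinitions", "CreationDateTime"):
            trimmed.pop(key, None)
    return trimmed

resp = delete_table_description({
    "TableName": "t", "TableStatus": "DELETING",
    "KeySchema": [{"AttributeName": "p", "KeyType": "HASH"}],
    "AttributeDefinitions": [], "CreationDateTime": 1.0,
})
```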
Fixes scylladb#14132
Closes scylladb/scylladb#15707
this series
1. lets sstable tests that use test_env use uuid-based sstable identifiers by default
2. lets the tests that require integer-based identifiers keep using them
this should enable us to perform the s3-related tests after enforcing the uuid-based identifier for the s3 backend; otherwise the s3-related tests would fail, as they also utilize `test_env`.
Closes scylladb/scylladb#14553
* github.com:scylladb/scylladb:
test: set use_uuid to true by default in sstables::test_env
test: enable test to set uuid_sstable_identifiers
Now the task manager's API (and test API) use the argument, and this
explicit dependency is no longer required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A memtable object contains two logalloc::allocating_section members
that track memory allocation requirements during reads and writes.
Because these are local to the memtable, each time we seal a memtable
and create a new one, these statistics are forgotten. As a result
we may have to re-learn the typical size of reads and writes, incurring
a small performance penalty.
The solution is to move the allocating_section object to the memtable_list
container. The workload is the same across all memtables of the same
table, so we don't lose discrimination here.
The performance penalty may be increased later if we log changes to
memory reserve thresholds, including a backtrace, so this reduces the
odds of incurring such a penalty.
Closes scylladb/scylladb#15737
`deletion_time` is a part of the `partition_header`, which is in turn
a part of `partition`, and `data_file` is a sequence of `partition`s.
`data_file` represents the *-Data.db component of an SSTable.
see docs/architecture/sstable3/sstables-3-data-file-format.rst.
we always parse the data component via `flat_mutation_reader_v2`, which is in turn
implemented with mx/reader.cc or kl/reader.cc depending on
the version of SSTable to be read.
in other words, we decode `deletion_time` in mx/reader.cc or
kl/reader.cc, not in sstable.cc. so let's drop the parse() overload
for deletion_time: it's not necessary and, more importantly, it's
confusing.
Refs #15116
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15756
Currently distributed_loader starts sharded<sstable_directory> with four sharded parameters. That's quite bulky and can be made much shorter.
Closes scylladb/scylladb#15653
* github.com:scylladb/scylladb:
distributed_loader: Remove explicit sharded<erms>
distributed_loader: Brush up start_subdir()
sstable_directory: Add enlightened construction
table: Add global_table_ptr::as_sharded_parameter()
This commit:
- Removes upgrade guides for versions older than 5.0.
The oldest one is from version 4.6 to 5.0.
- Adds the redirections for the removed pages.
Closes scylladb/scylladb#15709
pytest changes the test's sys.stdout and sys.stderr to the
captured fds when it captures the outputs of the test, so querying
`sys.stdout.fileno()` and `sys.stderr.fileno()` no longer gives us
STDOUT_FILENO and STDERR_FILENO:
their return values are not 1 and 2 anymore, unless pytest
is started with "-s".
so, to ensure that we always redirect the child process's
outputs to the log file, we need to use 1 and 2 for accessing
the well-known fds, which are the ones used by the child
process when it writes to stdout and stderr.
this change should address the problem that the log file is
always empty unless "-s" is specified.
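The fd redirection described above can be sketched as follows. This is a minimal, hypothetical stand-in for the test runner's logging (the name `run_child_logged` is illustrative, not the actual test.py code); the point is that we hand the child the well-known fds instead of asking the possibly replaced `sys.stdout` for them:

```python
import subprocess
import sys
import tempfile

def run_child_logged(cmd, log_path):
    """Run cmd with its stdout and stderr (fds 1 and 2) going to log_path.

    Under pytest's output capture sys.stdout.fileno() no longer returns 1,
    so instead of asking the (replaced) sys.stdout for an fd we pass the
    log file to subprocess, which dup2()s it onto the well-known fds 1
    and 2 in the child before exec.
    """
    with open(log_path, "ab") as log:
        return subprocess.run(cmd, stdout=log, stderr=log).returncode

# Hypothetical usage: the child writes to its own fd 1, which lands in the log.
log_path = tempfile.mktemp()
rc = run_child_logged([sys.executable, "-c", "print('hello from child')"],
                      log_path)
```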
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15560
compaction_read_monitor_generator is an existing mechanism
for monitoring the progress of sstable reads during compaction.
In this change information gathered by compaction_read_monitor_generator
is utilized by task manager compaction tasks of the lowest level,
i.e. compaction executors, to calculate task progress.
compaction_read_monitor_generator has a flag, which decides whether
monitored changes will be registered by compaction_backlog_tracker.
This allows us to pass the generator to all compaction readers without
impacting the backlog.
Task executors have access to compaction_read_monitor_generator_wrapper,
which protects the internals of compaction_read_monitor_generator
and provides only the necessary functionality.
Closes scylladb/scylladb#14878
* github.com:scylladb/scylladb:
compaction: add get_progress method to compaction_task_impl
compaction: find total compaction size
compaction: sstables: monitor validation scrub with compaction_read_generator
compaction: keep compaction_progress_monitor in compaction_task_executor
compaction: use read monitor generator for all compactions
compaction: add compaction_progress_monitor
compaction: add flag to compaction_read_monitor_generator
This is a follow-up for #15279 and it fixes two problems.
First, we restore flushes on writes for the tables that were switched to the schema commitlog if `SCHEMA_COMMITLOG` feature is not yet enabled. Otherwise durability is not guaranteed.
Second, we address the problem with truncation records, which could refer to the old commitlog if any of the switched tables were truncated in the past. If the node crashes later, and we replay schema commitlog, we may skip some mutations since their `replay_position`s will be smaller than the `replay_position`s stored for the old commitlog in the `truncated` table.
It turned out that this problem exists even if we don't switch commitlogs for tables. If the node was rebooted the segment ids will start from some small number - they use `steady_clock` which is usually bound to boot time. This means that if the node crashed we may skip the mutations because their RPs will be smaller than the last truncation record RP.
To address this problem we delete truncation records as soon as commitlog is replayed. We also include a test which demonstrates the problem.
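A toy model of the replay filter (all names here are illustrative, not Scylla's actual replay code) shows why a stale truncation record loses data after a reboot:

```python
def mutations_to_replay(mutations, truncated_rp):
    """Replay skips any mutation at or before the table's recorded
    truncation replay position. Illustrative sketch only: replay
    positions are modeled as plain integers."""
    if truncated_rp is None:
        return list(mutations)
    return [rp for rp in mutations if rp > truncated_rp]

# Before reboot: a truncation was recorded at replay position 1000.
# After reboot, segment ids restart near 0 (they are derived from
# steady_clock, which is bound to boot time), so fresh mutations get
# small replay positions and are all wrongly skipped:
fresh = [1, 2, 3]
lost = mutations_to_replay(fresh, truncated_rp=1000)   # everything skipped

# Deleting the truncation record once replay is done avoids this:
safe = mutations_to_replay(fresh, truncated_rp=None)
```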
Fixes #15354
Closes scylladb/scylladb#15532
* github.com:scylladb/scylladb:
add test_commitlog
system.truncated: Remove replay_position data from truncated on start
main.cc: flush only local memtables when replaying schema commitlog
main.cc: drop redundant supervisor::notify
system_keyspace: flush if schema commitlog is not available
The status is no longer used. The function that referenced it was
removed by 5a96751534, and it had already been unused
for a while back then.
Message-Id: <ZS92mcGE9Ke5DfXB@scylladb.com>
When populating the system keyspace, the sstable_directory forgets to create the upload/ subdir in the tables' datadir because of the way it's invoked from the distributed loader. For non-system keyspaces, directories are created in table::init_storage(), which is self-contained and just creates the whole layout regardless of what already exists.
This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables.
Test included.
fixes: #15708
closes: scylladb/scylla-manager#3603
Closes scylladb/scylladb#15723
* github.com:scylladb/scylladb:
test: Add test for datadir/ layout
sstable_directory: Indentation fix after previous patch
db,sstables: Move storage init for system keyspace to table creation
before this change, they read
> Was local deletion time capped at ...
and
> Was partition tombstone deletion time capped at ...
the "Was" part is confusing, and the first description is not
accurate enough, so let's improve them a little bit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15108
the `utils::UUID` class is not used by the implementation of
`canonical_mutation`, so let's remove the include from this source file.
the `#include` was originally added in
5a353486c6, but that commit did not
add any code using UUID to this file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15731
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.0-to-5.1 upgrade guide and replace it with general info.
- Remove the information from the 4.6-to-5.0 upgrade guide and replace it with general info.
- Remove the information from the 5.x.y-to-5.x.z upgrade guide and replace it with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
This PR is a continuation of https://github.com/scylladb/scylladb/pull/15739.
**This PR must be backported to branch-5.2 and branch-5.1.**
Closes scylladb/scylladb#15740
* github.com:scylladb/scylladb:
doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
doc: remove wrong image upgrade info (4.6-to-5.0)
doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.1-to.-5.2
upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
page.
- Remove outdated info (about previous releases)
from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
in the page tree.
Refs: https://github.com/scylladb/scylladb/issues/15733
Closes scylladb/scylladb#15739
Fixes #14870
(Originally suggested by @avikivity). Use commit log stored GC clock min positions to narrow compaction GC bounds.
(Still requires augmented manual flushes with extensive CL clearing to pass various dtests, but this does not affect "real" execution).
Records the lowest GC clock timestamp whenever a CF is added to a CL segment for the first time. Because GC clock is wall
clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment. This is then (in a rather ugly way) propagated to tombstone_gc_state to narrow the allowed GC bounds for
a CF, based on what is currently left in CL.
Note: this is a rather unoptimized version - no caching or anything. But even so, should not be excessively expensive,
esp. since various other code paths already cache the results.
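A minimal sketch of the per-segment bookkeeping described above, with illustrative names only (the real implementation lives in commitlog and tombstone_gc_state): remember the lowest GC clock value per (segment, CF), and expose the all-segment minimum per CF to narrow the allowed GC bound.

```python
from collections import defaultdict

class CommitlogGCBounds:
    """Illustrative bookkeeping, not Scylla's actual classes: track the
    lowest GC clock value seen per (segment, column family) and expose
    the all-segment minimum per CF as the narrowed GC bound."""

    def __init__(self):
        # segment id -> {cf name: lowest gc clock value seen}
        self._per_segment = defaultdict(dict)

    def add(self, segment, cf, gc_now):
        seg = self._per_segment[segment]
        # Only the lowest value per segment matters for the bound.
        if cf not in seg or gc_now < seg[cf]:
            seg[cf] = gc_now

    def segment_closed(self, segment):
        # Once a segment is flushed and recycled, its entries stop
        # constraining GC.
        self._per_segment.pop(segment, None)

    def min_gc_time(self, cf):
        times = [m[cf] for m in self._per_segment.values() if cf in m]
        return min(times, default=None)

bounds = CommitlogGCBounds()
bounds.add("seg1", "cf_a", 100)
bounds.add("seg1", "cf_a", 90)      # lower value wins
bounds.add("seg2", "cf_a", 200)
low = bounds.min_gc_time("cf_a")    # bound while seg1 is still live
bounds.segment_closed("seg1")
after = bounds.min_gc_time("cf_a")  # bound relaxes once seg1 is gone
```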
Closes scylladb/scylladb#15060
* github.com:scylladb/scylladb:
main/cql_test_env: Augment compaction mgr tombstone_gc_state with CL GC info
tombstone_gc_state: Add optional callback to augment GC bounds
commitlog: Add keeping track of approximate lowest GC clock for CF entries
database: Force new commitlog segment on user initiated flush
commitlog: Add helper to force new active segment
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.z upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
Check that commitlog provides durability in case
of a node reboot:
* truncate table T, truncation_record RP=1000;
* clean shutdown node/reboot machine/restart node, now RP=~0
since segment ids count from boot time;
* write some data to T; crash/restart
* check data is retained
Once we've started clean, and all replaying is done, the truncation
records' commit log replay positions are invalid. We should exorcise
them as soon as possible. Note that we cannot remove truncation data
completely though, since the time stamps stored are used by things like
batch log to determine if it should use or discard old batch data.
Later in the code we have 'replaying schema commit log',
which duplicates this one. Also,
maybe_init_schema_commitlog may skip schema commitlog
initialization if the SCHEMA_COMMITLOG feature is
not yet supported by the cluster, so this notification
can be misleading.
In PR #15279 we removed flushes when writing to a number
of tables from the system keyspace. This was made possible
by switching these tables to the schema commitlog.
Schema commitlog is enabled only when the SCHEMA_COMMITLOG
feature is supported by all nodes in the cluster. Before that
these tables will use the regular commitlog, which is not
durable because it uses db::commitlog::sync_mode::PERIODIC. This
means that we may lose data if a node crashes during upgrade
to the version with schema commitlog.
In this commit we fix this problem by restoring flushes
after writes to the tables if the schema commitlog
is not enabled yet.
The patch also contains a test that demonstrates the
problem. We need flush_schema_tables_after_modification
option since otherwise schema changes are not durable
and node fails after restart.
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
there is no need to explicitly cast an instance of
std::chrono::hours to std::chrono::milliseconds to feed it to a
function which expects std::chrono::milliseconds. the constructor
of std::chrono::milliseconds is able to perform this conversion and
create a new instance of std::chrono::milliseconds from another
std::chrono::duration<> instance.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15734
You can now pass `expected_error` to `ManagerClient.decommission_node`
and `ManagerClient.remove_node`. Useful in combination with error
injections, for example.
Closes scylladb/scylladb#15650
Fixes #14870 (yet another alternative solution)
(Originally suggested by @avikivity). Use stored GC clock min positions from the CL
to narrow compaction GC bounds.
Note: not optimized with caches or anything at this point. Can easily be added
though of course always somewhat risky.
Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment
first. Because GC clock is wall clock time and only connected to TTL (not
cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment.
Includes of course a function to get the all-segment lowest per CF.
On some environments, such as VMware instances, /dev/disk/by-uuid/<UUID> is
not available, and scylla_raid_setup will fail while mounting the volume.
To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fall back to other paths like by-partuuid,
by-id or by-path, or just use the real device path like /dev/md0.
To get the device path, and also to dump device status when the UUID is not
available, this introduces a UdevInfo class which communicates with udev
using pyudev.
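The fallback order can be sketched as follows. `pick_mount_path` and the candidate list are hypothetical names, and the real script additionally queries udev via pyudev; this only shows the "first available path wins" logic:

```python
import os

def pick_mount_path(candidates, exists=os.path.exists):
    """Return the first available path for mounting.

    On environments such as VMware guests /dev/disk/by-uuid/<UUID> may
    be missing, so fall back through the other stable link directories
    and finally the raw device node. Illustrative sketch, not the
    actual scylla_raid_setup code.
    """
    for path in candidates:
        if path and exists(path):
            return path
    raise FileNotFoundError("no usable device path among %r" % (candidates,))

# Hypothetical candidate list, most stable name first.
candidates = [
    "/dev/disk/by-uuid/1234-abcd",
    "/dev/disk/by-partuuid/5678-efgh",
    "/dev/disk/by-id/md-name-raid0",
    "/dev/disk/by-path/pci-0000:00:1f.2",
    "/dev/md0",
]
# Simulate a VMware-like guest where only the raw device node exists.
chosen = pick_mount_path(candidates, exists=lambda p: p == "/dev/md0")
```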
Related #11359
Closes scylladb/scylladb#13803
This PR contains several refactorings related to truncation record handling in the `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`; it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`.
Closes scylladb/scylladb#15583
* github.com:scylladb/scylladb:
system_keyspace: drop truncation_record
system_keyspace: remove get_truncated_at method
table: get_truncation_time: check _truncated_at is initialized
database: add_column_family: initialize truncation_time for new tables
database: add_column_family: rename readonly parameter to is_new
system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
commitlog_replayer: refactor commitlog_replayer::impl::init
system_keyspace: drop redundant typedef
system_keyspace: drop redundant save_truncation_record overload
table: rename cache_truncation_record -> set_truncation_time
system_keyspace: get_truncated_position -> get_truncated_positions
The estimation wrongly accounted for the size of all components
when estimating the number of partitions for each output sstable.
The sstables are split according to the data file size, therefore
the size of other files is irrelevant for the estimation.
With certain data models, like single-row partitions containing small
values, the index can be even larger than the data.
For example, if the index is as large as the data, the estimation
would say that 2x more sstables will be generated, and as a result
each sstable's key count is underestimated by 2x.
Fix it by accounting only for the size of the data file.
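The underestimation can be reproduced with a small worked example (illustrative arithmetic only, not the actual estimation code):

```python
def estimate_partitions_per_sstable(total_partitions, accounted_bytes,
                                    target_sstable_bytes):
    """Split the output so each sstable holds target_sstable_bytes of
    the accounted size; partitions per sstable follow proportionally.
    Illustrative arithmetic, not Scylla's actual code."""
    n_sstables = max(1, accounted_bytes // target_sstable_bytes)
    return total_partitions // n_sstables

# Single-row partitions with small values: index roughly as big as data.
data_bytes = 1_000_000
index_bytes = 1_000_000
partitions = 100_000
target = 500_000

# Buggy: counting the index alongside data doubles the sstable count
# and halves the per-sstable partition estimate.
buggy = estimate_partitions_per_sstable(partitions,
                                        data_bytes + index_bytes, target)
# Fixed: account for the data file only, since splitting follows data size.
fixed = estimate_partitions_per_sstable(partitions, data_bytes, target)
```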
Fixes #15726.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15727
If abort is requested during bootstrap, then a node should exit normally.
To achieve this, abort_requested_exception should be thrown, as main
handles it gracefully.
In data_sync_repair_task_impl::run, exceptions from all shards are
wrapped together into std::runtime_error and so they aren't
handled as they are supposed to be.
Throw abort_requested_exception when shutdown was requested.
Throw abort_requested_exception also if repair::task_manager_module::is_aborted,
so that force_terminate_all_repair_sessions acts the same regardless of
the state of the repair.
To maintain consistency, do the same for user_requested_repair_task_impl.
Fixes: #15710.
Closes scylladb/scylladb#15722
before this change, pytest does not populate its suite's
`scylla_env` down to the forked pytest child process. this works
if the test does not care about the env variables in `scylla_env`,
but object_store is an exception, as it launches scylla instances
by itself. so, without the help of `scylla_env`, `run.find_scylla()`
always finds the newest file globbed by `build/*/scylla`. this is not
always what we expect. on the contrary, if we launch object_store's
pytest using `test.py`, there are good chances that object_store
ends up testing a wrong scylla executable if we have multiple
builds under `build/*/scylla`.
so, in this change, we populate `self.suite.scylla_env` down to
the child process created by `PythonTest`, so that all pytest
based tests can have access to their suite's env variables.
in addition to the 'SCYLLA' env variable, they also include
the env variables required by LLVM code coverage instrumentation.
this is also nice to have.
Fixes #15679
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15682
The stall detector uses glibc backtrace function to
collect backtraces, this causes ASAN failures on ARM.
For now we just disable the stall detector in this
configuration, the ticket about migrating
to libunwind: scylladb/seastar#1878
We increase the value of blocked_reactor_notify_ms to
make sure the stall detector never fires.
Fixes #15389
Fixes #15090
Closes scylladb/scylladb#15720
The test checks that
- for non-system keyspace datadir and its staging/ and upload/ subdirs
are created when the table is created _and_ that the directory is
re-populated on boot in case it was explicitly removed
- for system non-virtual tables it checks that the same directory layout
is created on boot
- for system virtual tables it checks that the directory layout doesn't
exist
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
User and system keyspaces are created and populated slightly
differently.
System keyspace is created via system_keyspace::make() which eventually
calls add_column_family(). Then it's populated via
init_system_keyspace() which calls sstable_directory::prepare() which,
in turn, optionally creates directories in datadir/ or checks the
directory permissions if it exists.
User keyspaces are created with the help of
add_column_family_and_make_directory() call which calls the
add_column_family() mentioned above _and_ calls table::init_storage() to
create directories. When it's populated with init_non_system_keyspaces()
it also calls sstable_directory::prepare() which notices that the
directory exists and then checks the permissions.
As a result, sstable_directory::prepare() initializes storage for system
keyspace only and there's a BUG (#15708) that the upload/ subdir is not
created.
This patch makes directory creation go through table::init_storage()
for _all_ keyspaces. The change only touches the system keyspace, by moving
the creation of directories from sstable_directory::prepare() into
system_keyspace::make().
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We refactor table_helper::setup_keyspace so that it calls
migration_manager::announce at most twice. We achieve it by
announcing all tables at once.
The number of announcements should further be reduced to one, but
it requires a big refactor. The CQL code used in
parse_new_cf_statement assumes the keyspace has already been
created. We cannot have such an assumption if we want to announce
a keyspace and its tables together. However, we shouldn't touch
the CQL code as it would impact user requests, too.
One solution is using schema_builder instead of the CQL statements
to create tables in table_helper.
Another approach is removing table_helper completely. It is used
only for the system_traces keyspace, which Scylla creates
automatically. We could refactor the way Scylla handles this
keyspace and make table_helper unneeded.
In the following commit, we reduce migration_manager::announce
calls in table_helper::setup_keyspace by announcing all tables
together. To do it, we cannot use table_helper::setup_table
anymore, which announces a single table itself. However, the new
code still has to translate CQL statements, so we extract it to the
new parse_new_cf_statement function to avoid duplication.
We refactor system_distributed_keyspace::start so that it takes at
most one group 0 guard and calls migration_manager::announce at
most once.
We remove a catch expression together with the FIXME from
get_updated_service_levels (add_new_columns_if_missing before the
patch) because we cannot treat the service_levels update
differently anymore.
After adding the keyspace_metadata parameter to
migration_listener::on_before_create_column_family,
tablet_allocator doesn't need to load it from the database.
This change is necessary before merging migration_manager::announce
calls in the following commit.
After adding the new prepare_new_column_family_announcement that
doesn't assume the existence of a keyspace, we also need to get
rid of the same assumption in all on_before_create_column_family
calls. After all, they may be initiated before creating the
keyspace. However, some listeners require keyspace_metadata, so we
pass it as a new parameter.
We can use the new prepare_new_column_family_announcement function
that doesn't assume the existence of the keyspace instead of the
previous work-around.
We need to store a new keyspace's keyspace_metadata as a local
variable in create_table_on_shard0. In the following commit, we
use it to call the new prepare_new_column_family_announcement
function.
In the following commits, we reduce the number of the
migration_manager::announce calls by merging some of them in a way
that logically makes sense. Some of these merges are similar --
we announce a new keyspace and its tables together. However,
we cannot use the current prepare_new_column_family_announcement
there because it assumes that the keyspace has already been created
(when it loads the keyspace from the database). Luckily, this
assumption is not necessary as this function only needs
keyspace_metadata. Instead of loading it from the database, we can
pass it as a parameter.
Validation scrub bypasses the usual compaction machinery, though it
still needs to be tracked with compaction_progress_monitor so that
we can reach its progress from the compaction task executor.
Track sstable scrub in validate mode with read monitors.
Keep compaction_progress_monitor in compaction_task_executor and pass a reference
to it further, so that the compaction progress could be retrieved out of it.
Compaction read monitor generators are used in all compaction types.
Classes which did not use _monitor_generator so far create it with
_use_backlog_tracker set to no, so as not to impact the backlog tracker.
In the following patches, compaction_read_monitor_generator will be used
to find the progress of compaction_task_executors. To avoid unnecessary
lifetime prolongation and exposing internals of the class outside of
compaction.cc, compaction_progress_monitor is created.
Compaction class keeps a reference to the compaction_progress_monitor.
Inheriting classes which actually use compaction_read_monitor_generator,
need to set it with set_generator method.
Following patches will use compaction_read_monitor_generator
to track progress of all types of compaction. Some of them should
not be registered in compaction_backlog_tracker.
The _use_backlog_tracker flag, which is set to true by default, is
added to compaction_read_monitor_generator and passed to all
compaction_read_monitors created by this generator.
Currently, when the test is executed too quickly, the timestamp
inserted into the 'my_table' table might be the same as the
timestamp used in the SELECT statement for comparison. However,
the statement only selects rows where the inserted timestamp
is strictly lower than the current timestamp. As a result, when this
comparison fails, we may skip executing the following comparison,
which uses a user-defined function, due to which the statement
is supposed to fail with an error. Instead, the select statement
simply returns no rows and the test case fails.
To fix this, simply use the less-or-equal operator instead
of the strictly-less operator for comparing timestamps.
Fixes #15616
Closes scylladb/scylladb#15699
This is in order to aid investigation of the flakiness of the test, which
fails due to a timeout during a scan after cluster restart in debug mode.
See #14746.
I enable trace-level logging for some scylla-side loggers and
inject logging of sent and received messages on the driver side.
Closes scylladb/scylladb#15696
When a remote view update doesn't succeed there's a log message
saying "Error applying view update...".
This message had log level ERROR, but it's not really a hard error.
View updates can fail for a multitude of reasons, even during normal operation.
A failing view update isn't fatal; it will be saved as a view hint and retried later.
Let's change the log level to WARN. It's something that shouldn't happen too much,
but it's not a disaster either.
ERROR log level causes trouble in tests which assume that an ERROR level message
means that the test has failed.
Refs: https://github.com/scylladb/scylladb/issues/15046#issuecomment-1712748784
For local view updates the log level stays at "ERROR", local view updates shouldn't fail.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#15640
* tools/cqlsh 66ae7eac...426fa0ea (8):
> Updated Scylla Driver[Issue scylladb/scylla-cqlsh#55]
> copyutil: closing the local end of pipes after processes starts
> setup.py: specify Cython language_level explicitly
> setup.py: pass extensions as a list
> setup.py: reindent block in else branch
> setup.py: early return in get_extension()
> reloc: install build==0.10.0
> reloc: add --verbose option to build_reloc.sh
Fixes: https://github.com/scylladb/scylla-cqlsh/issues/37
Closes scylladb/scylladb#15685
test_storage_service_keyspace_cleanup_with_no_owned_ranges
from test_storage_service.py creates snapshots with tags based
on the current time. Thus, if a test runs on the same node twice
within a short enough time interval, there may be a name collision
between the snapshots from two runs. This will cause the second
run to fail on assertions.
Use new_test_snapshot fixture to drop snapshots after the test.
Delete my_snapshot_tags as it's no longer necessary.
Fixes: #15680.
Closes scylladb/scylladb#15683
In 20ff2ae5e1 mutating endpoints were
changed to use PUT. But some of them return a response, and I forgot to
provide `response_type` parameter to `put_json` (which causes
`RESTClient` to actually obtain the response). These endpoints now
return `None`.
Fix this.
Closes scylladb/scylladb#15674
`employ_ld_trickery` is only used by `dynamic_linker_option()`, so
move it into this function.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so that CMakeLists.txt is less cluttered, as we will append the
`--dynamic-linker` option to the LDFLAGS.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The logger is not thread safe, so a multithreaded test can concurrently
write into the log, yielding unreadable XMLs.
Example:
boost/sstable_directory_test: failed to parse XML output '/scylladir/testlog/x86_64/release/xml/boost.sstable_directory_test.sstable_directory_shared_sstables_reshard_correctly.3.xunit.xml': not well-formed (invalid token): line 1, column 1351
The critical (today's unprotected) section is in boost/test/utils/xml_printer.hpp:
```
inline std::ostream&
operator<<( custom_printer<cdata> const& p, const_string value )
{
*p << BOOST_TEST_L( "<![CDATA[" );
print_escaped_cdata( *p, value );
return *p << BOOST_TEST_L( "]]>" );
}
```
The problem is not restricted to xml, but the unreadable xml file caused
the test to fail when trying to parse it, to present a summary.
New thread-safe variants of BOOST_REQUIRE and BOOST_REQUIRE_EQUAL are
introduced to help multithreaded tests. We'll start patching tests of
sstable_directory_test that will call BOOST_REQUIRE* from multiple
threads. Later, we can expand its usage to other tests.
Fixes #15654.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15655
for better coverage of the uuid-based sstable identifier. since this
option is enabled by default, this also matches our tests with the
default behavior of scylladb.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
some of the tests are still relying on the integer-based sstable
identifier, so let's add a method to test_env, so that the tests
relying on this can opt out. we will change the default setting
of sstables::test_env to use the uuid-based sstable identifier in the
next commit. this change does not change the existing behavior;
it just adds a new knob to test_env_config and lets the tests
relying on this customize the test_env_config to disable
use_uuid.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This PR is another step in refactoring the Hinted Handoff module. It aims at modernizing the code by moving to coroutines, using `std::ranges` instead of Boost's ones where possible, and using other features coming with the new C++ standards.
It also tries to make the code clearer and get rid of confusing elements, e.g. using shared pointers where they shouldn't be used or marking methods as virtual even though nothing derives from the class. It also prevents `manager.hh` from giving direct access to internal structures (`hint_endpoint_manager` in this case).
Refs #15358
Closes scylladb/scylladb#15631
* github.com:scylladb/scylladb:
db/hints/manager: Reword comments about state
db/hints/manager: Unfriend space_watchdog
db/hints: Remove a redundant alias
db/hints: Remove an unused namespace
db/hints: Coroutinize change_host_filter()
db/hints: Coroutinize drain_for()
db/hints: Clean up can_hint_for()
db/hints: Clean up store_hint()
db/hints: Clean up too_many_in_flight_hints_for()
db/hints: Refactor get_ep_manager()
db/hints: Coroutinize wait_for_sync_point()
db/hints: Use std::span in calculate_current_sync_point
db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints()
db/hints: Clean up manager::forbid_hints()
db/hints: Clean up manager::allow_hints()
db/hints: Coroutinize compute_hints_dir_device_id()
db/hints: Clean up manager::stop()
db/hints: Clean up manager::start()
db/hints/manager: Clean up the constructor
db/hints: Remove boilerplate drain_lock()
db/hints: Let drain_for() return a future
db/hints: Remove ep_managers_end
db/hints: Remove find_ep_manager
db/hints: Use manager as API for hint_endpoint_manager
db/hints: Don't mark have_ep_manager()'s definition as inline
db/hints: Remove make_directory_initializer()
db/hints/manager: Order constructors
db/hints: Move ~manager() and mark it as noexcept
db/hints: Use reference for storage proxy
db/hints/manager: Explicitly delete copy constructor
db/hints: Capitalize constants
db/hints/manager: Hide declarations
db/hints/manager: Move the definitions of static members to the header
db/hints: Move make_dummy() to the header
db/hints: Don't explicitly define ~directory_initializer()
db/hints: Change the order of logging in ensure_created_and_verified()
db/hints: Coroutinize ensure_rebalanced()
db/hints: Coroutinize ensure_created_and_verified()
db/hints: Improve formatting of directory_initializer::impl
db/hints: Do not rely on the values of enums
db/hints: Move the implementation of directory_initializer
db/hints: Prefer nested namespaces
db/hints: Remove an unused alias from manager.hh
db/hints: Reorder includes in manager.hh and .cc
This commit adds a note to specify
that the information on the Handling
Failures page only refers to clusters
with Raft enabled.
Also, a comment is included to remove
the note in future versions.
The sharded replication map was needed to provide a sharded parameter for
the sstable directory. Now it is obtained via the table reference, and thus
the erms thing becomes unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Drop some local references to class members and line up the arguments for
starting the distributed sstable directory. Purely a cleanup patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing constructor is pretty heavyweight for the distributed
loader to use -- it needs to pass it 4 sharded parameters, which looks
pretty bulky in the text editor. However, 5 constructor arguments are
obtained directly from the table, so the dist. loader code with a global
table pointer at hand can pass _it_ as a sharded parameter and let the
sstable directory extract what it needs.
The sad news is that sstable_directory cannot be switched to just use a
table reference. Tools code doesn't have a table at hand, but needs the
facilities sstable_directory provides.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method returns a seastar::sharded_parameter<> for the global table
that evaluates into a local table reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refactor the code to be more consistent -- we often did the same thing in multiple ways depending on the endpoint, such as how we returned errors (some endpoints would return them through exceptions, others would wrap them into `aiohttp.web.Response`s). Choose the arguably least boilerplate'y way in each case.
Then reduce the boilerplate even further.
Thanks to these refactors, modifying the framework in the future will require less work and it will be more obvious which of the possible ways to modify it should be picked (i.e. consistent with the existing code.)
Closes scylladb/scylladb#15646
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: reduce `aiohttp` boilerplate
test/pylib: always return data as JSON from endpoints
test/pylib: scylla_cluster: catch `HTTPError` in topology change endpoints
test/pylib: scylla_cluster: do sanity/precondition checks through asserts
test/pylib: scylla_cluster: return errors through exceptions
test/pylib: use JSON data to pass `expected_error` in `server_start`
test/pylib: use PUT instead of GET for mutating endpoints
test/pylib: rest_client: make `data` optional in `put_json`
test/pylib: fix some type errors
The current comments should be clearer to someone
not familiar with the module. This commit also makes
them abide by the limit of 120 characters per line.
space_watchdog is a friend of shard hint manager just to
be able to execute one of its functions. This commit changes
that by unfriending the class and exposing the function.
This commit gets rid of boilerplate in the function,
leverages a range pipe and explicit types to make
the code more readable, and changes the logs to
make it clearer what happens.
fmt::to_string should be preferred to seastar::format.
It's clearer and simpler. Besides that, this commit makes
the code abide by the limit of 120 characters per line.
Currently, the function doesn't return anything.
However, if the future doesn't need to be awaited,
the caller can decide that. There is no reason
to make that decision in the function itself.
This commit makes with_file_update_mutex() a method of hint_endpoint_manager
and introduces db::hints::manager::with_file_update_mutex_for() for accessing
it from the outside. This way, hint_endpoint_manager is hidden and no one
needs to know about its existence.
This commit makes db::hints::manager store service::storage_proxy
as a reference instead of a seastar::shared_ptr. The manager is
owned by storage proxy, so it only lives as long as storage proxy
does. Hence, it makes little sense to store the latter as a shared
pointer; in fact, it's very confusing and may be error-prone.
The field never changes, so it's safe to keep it as a reference
(especially because copy and move constructors of db::hints::manager
are both deleted). What's more, we ensure that the hint manager
has access to storage proxy as soon as it's created.
The same changes were applied to db::hints::resource_manager.
The rationale is the same.
If the variables are accessible from the outside, it makes
sense to also expose their initial values to the user.
This commit moves them to the header and marks them as inline.
Make handlers return their results directly, without wrapping them into
`aiohttp.web.Response`s. Instead the wrapping is done in a generic way
when defining the routes.
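A minimal Python sketch of the idea (handler and route names are hypothetical, not the actual `scylla_cluster.py` code): the JSON envelope is applied once, at route-definition time, so handlers just return plain values.

```python
import json

def wrap_json(handler):
    """Hypothetical sketch: one generic wrapper builds the HTTP envelope,
    so handlers return plain values instead of Response objects."""
    def route_entry(request):
        result = handler(request)
        # The single place that decides the envelope: status 200, JSON body.
        return 200, json.dumps(result)
    return route_entry

# A handler no longer builds a Response itself.
def get_host_id(request):
    return {"host_id": "127.0.0.1"}

# The wrapping happens once, when the route is defined.
routes = {("GET", "/cluster/host-id"): wrap_json(get_host_id)}
```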
Some endpoint handlers return JSON, some return text, some return empty
responses.
Reduce the number of different handler types by making the text case a
subcase of the JSON case. This also simplifies some code on the
`ManagerClient` side, which would have to deserialize data from text
(because some endpoint handlers would serialize data into text for no
particular reason). And it will allow reducing boilerplate in later
commits even further.
The new logging order seems to make more sense, i.e.
we first log that we're creating and validating directories,
and only then do we start doing that.
With the previous order, in which those actions were reversed,
the log message didn't match reality: the action was already
done by the time we informed the user of it.
These changes move away from relying on specific
values of enum variants. The code based on arithmetic
on those values is trivial, and there is no reason not to use
operator== and operator!= instead. This should make the code
less error-prone and easier to understand.
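The same principle, sketched in Python with a hypothetical enum (the real code is C++ in db/hints and the state names may differ): compare variants directly instead of doing arithmetic on their values.

```python
from enum import Enum

class State(Enum):
    # Hypothetical states; the names, not the numbers, carry meaning.
    UNINITIALIZED = 0
    CREATED = 1
    VERIFIED = 2

def needs_creation(state: State) -> bool:
    # Plain equality instead of arithmetic like `state.value < 1`,
    # which would silently break if the numbering ever changed.
    return state == State.UNINITIALIZED

def is_ready(state: State) -> bool:
    return state != State.UNINITIALIZED and state != State.CREATED
```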
Exceptions flying from `RESTClient` (which used to communicate with
Scylla's REST API) are in fact not `RuntimeException`s, they are
`HTTPError`s (a type defined in the `rest_client` module). So they would
just fly through our catch branches, and the additional info (such as
log file path) would not be attached. Fix this.
Some of them are already done that way, so turn the rest to asserts as
well for consistency and code terseness, instead of using the `if` +
`raise` pattern.
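A sketch of the pattern, with hypothetical names (not the actual `ScyllaClusterManager` code): a violated assert means a bug in the framework or the test itself, not a user-facing error, so the `if` + `raise` ceremony adds nothing.

```python
def start_server(cluster, server_id):
    """Sketch: precondition checks expressed as asserts instead of the
    `if` + `raise` pattern (names are hypothetical)."""
    assert cluster is not None, "cluster not started"
    assert server_id in cluster, f"unknown server {server_id}"
    cluster[server_id] = "running"
    return cluster[server_id]
```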
Since 2f84e820fd it is possible to return errors from
`ScyllaClusterManager` handlers through exceptions without losing the
contents of these exceptions (the contents arrive at `ManagerClient` and
the test can inspect them, unlike in the past where the client would get
a generic `InternalServerError`).
Change all handlers to return errors through exceptions (like some
already do) and get rid of the `ActionReturn` boilerplate.
When checking for `self.cluster`, do it through assertions, like most of
the handlers already do, instead of using the `if` + `raise` pattern.
`ScyllaClusterManager` registers a bunch of HTTP endpoints which
`ManagerClient` uses to perform operations on a cluster during
a topology test.
The endpoints were inconsistently using verbs, like using GET for
endpoints that would have side effects. Use PUT for these.
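A sketch of the convention, with hypothetical endpoint names: per HTTP semantics, GET must be free of side effects, so route registration can guard against the old mistake.

```python
def add_route(routes, method, path, mutating):
    """Hypothetical helper: refuse to register a GET endpoint that
    has side effects."""
    assert not (method == "GET" and mutating), \
        f"{path} mutates state: use PUT, not GET"
    routes[(method, path)] = mutating

routes = {}
add_route(routes, "GET", "/cluster/servers", False)      # read-only
add_route(routes, "PUT", "/cluster/server/start", True)  # was GET before
add_route(routes, "PUT", "/cluster/server/stop", True)
```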
This is the continuation of 3e74432dbf.
Registering API handlers for services needs to
- happen next to the corresponding service's start
- use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job)
- get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything)
Hints API handlers want to use proxy, but also reference gossiper and capture proxy via the http context. This PR fixes both and removes the http_context -> proxy dependency as no longer needed
Closes scylladb/scylladb#15644
* github.com:scylladb/scylladb:
api: Remove proxy reference from http context
api,hints: Use proxy instead of ctx
api,hints: Pass sharded<proxy>& instead of gossiper&
api,hints: Fix indentation after previous patch
api,hints: Move gossiper access to proxy
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stem from the fact that the resulting reconcilable_result will be large:
1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.
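The paging behaviour can be sketched like this (a toy Python model, not the actual C++ memory accounter):

```python
def pages(rows, memory_limit):
    """Sketch of the fixed behaviour: cut a page whenever the accounted
    memory reaches the limit, even if the page holds only dead rows.
    Before the fix, the replica kept accumulating until it found a live
    row, producing huge reconcilable_results."""
    page, used = [], 0
    for size, live in rows:
        page.append((size, live))
        used += size
        if used >= memory_limit:
            yield page
            page, used = [], 0
    if page:
        yield page

# 25 dead rows followed by one live row, 100 "bytes" each, with a
# 1000-byte page budget: tombstones are paged instead of buffered.
result = list(pages([(100, False)] * 25 + [(100, True)], 1000))
```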
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical duration (a few ms of variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111
Closes scylladb/scylladb#15414
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
* tools/jmx d107758...8d15342 (2):
> Revert "install-dependencies.sh: do not install weak dependencies"
> install-dependencies.sh: do not install weak dependencies
  Especially for Java, we really do not need the tens of packages and MBs it adds, just because Java apps can be built and use sound and graphics and whatnot.
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.
Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.
Fixes: #7949
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closes scylladb/scylladb#15499
Consider the following code snippet:
```c++
future<> foo() {
    semaphore.consume(1024); // may throw std::bad_alloc on OOM kill
    return make_ready_future<>();
}
future<> bar() {
return _allocating_section([&] {
foo();
});
}
```
If the consumed memory triggers the OOM kill limit, the semaphore will throw `std::bad_alloc`. The allocating section will catch this, bump std reserves and retry the lambda. Bumping the reserves will not do anything to prevent the next call to `consume()` from triggering the kill limit. So this cycle will repeat until std reserves are so large that ensuring the reserve fails. At this point LSA gives up and re-throws the `std::bad_alloc`. Beyond the useless time spent on code that is doomed to fail, this also results in expensive LSA compaction and eviction of the cache (while trying to ensure reserves).
Prevent this situation by throwing a distinct exception type which is derived from `std::bad_alloc`. Allocating section will not retry on seeing this exception.
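A toy Python model of the retry loop and the new exception type (the real types are C++: `utils::memory_limit_reached` and the LSA allocating section; names below are illustrative):

```python
class MemoryLimitReached(MemoryError):
    """Analogue of utils::memory_limit_reached: still an OOM-type
    error, but retrying with bigger reserves cannot help."""

def allocating_section(fn, max_retries=3):
    # Sketch of the allocating-section retry loop (much simplified).
    for _ in range(max_retries):
        try:
            return fn()
        except MemoryLimitReached:
            raise        # semaphore kill limit hit: fail fast
        except MemoryError:
            continue     # bump reserves (elided here) and retry
    return fn()          # last attempt, let any error propagate
```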
A test reproducing the bug is also added.
Fixes: #15278
Closes scylladb/scylladb#15581
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add test_cache_reader_semaphore_oom_kill
utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled()
reader_concurrency_semaphore: use utils::memory_limit_reached exception
utils: add memory_limit_reached exception
Now hints endpoints use ctx.sp reference, but it has the direct proxy
reference at hand and should prefer it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
API handlers should try to avoid using any service other than the "main"
one. For hints API this service is going to be proxy, so no gossiper
access in the handler itself.
(indentation is left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The descriptor in question is used to parse an sstable's file path and return the result. The parser, among the "relevant" info, also parses the sstable directory and the keyspace+table names. However, there is (almost) no code that needs those strings, and the need to construct the descriptor with them makes some places obscurely use empty strings.
The PR removes the sstable's directory, keyspace and table names from the descriptor and, while at it, relaxes the sstable directory code that used to make a descriptor out of a real sstable object by (!) parsing its Data file path back.
Closes scylladb/scylladb#15617
* github.com:scylladb/scylladb:
sstables: Make descriptor from sstable without parsing
sstables: Do not keep directory, keyspace and table names on descriptor
sstables: Make tuple inside helper parser method
sstables: Do not use ks.cf pair from descriptor
sstables: Return tuple from parse_path() without ks.cf hints
sstables: Rename make_descriptor() to parse_path()
The only usage is in batchlog_manager, and it
can be replaced with cf.get_truncation_time().
std::optional<std::reference_wrapper<canonical_mutation>>
is replaced with canonical_mutation* since it is
semantically the same but with less type boilerplate.
We want to make table::_truncated_at optional, so that in
get_truncation_time we can assert that it is initialized.
For existing tables this initialisation will happen in
load_truncation_times function, and for new tables we
want to initialize it in add_column_family like we do
with mark_ready_for_writes.
Now add_column_family function has parameter 'readonly', which is
set by the callers to false if we are creating a fresh new table
and not loading it from sstables. In this commit we rename this
parameter to is_new and invert the passed values.
This will allow us in the next commit to initialize _truncated_at field
for new tables.
load_truncation_times() now works only for
schema tables since the rest is not loaded
until distributed_loader::init_non_system_keyspaces.
An attempt to call cf.set_truncation_time
for a non-system table just throws an exception,
which is caught and logged at debug level.
This means that the call cf.get_truncation_time in
paxos_state.cc has never worked as expected.
To fix that we move load_truncation_times()
closer to the point where the tables are loaded.
The function distributed_loader::populate_keyspace is
called for both system and non-system tables. Once
the tables are loaded, we use the 'truncated' table
to initialize _truncated_at field for them.
The truncation_time check for schema tables is also moved
into populate_keyspace since it seems like a more natural
place for it.
When loading an unshared remote sstable, sstable_directory needs to make
a descriptor out of a real sstable. For that it parses the sstable's
Data component path, which is pretty weird. It's simpler to make the
descriptor out of the sstable itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so just
hard-code the empty strings there (because they are really not used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves the std::make_tuple() call into internal static path
parsing helper to make the next patch smaller and nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's only one place that needs the ks.cf pair from the parsed
descriptor -- the sstables loader from tools/. This code already has
ks.cf from the tuple returned after parsing and can use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two path parsers. One of them accepts keyspace and table names
and the other one doesn't. The latter is then supposed to parse the
ks.cf pair from the path and put it on the descriptor. This patch makes
this method return ks.cf so that later it will be possible to remove
these strings from the descriptor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All `sstable_set_impl` subclasses/implementations already keep a `schema_ptr` so we can make `sstable_set_impl::make_incremental_selector` function return both the selector and the schema that's being used by it.
That way, we can use the returned schema in the `sstable_set::make_incremental_selector` function instead of the `sstable_set::_schema` field, which makes the field unused and allows us to remove it altogether and reduce the memory footprint of `sstable_set` objects.
Closes scylladb/scylladb#15570
* github.com:scylladb/scylladb:
sstable_set: Remove unused _schema field
sstable_set_impl: Return also schema from make_incremental_selector
When joining the cluster in raft topology mode, the new node asks some
existing node in the cluster to put its information to the
`system.topology` table. Later, the topology coordinator is supposed to
contact the joining node back, telling it that it was added to group 0
and accepted, or rejected. Due to the fact that the topology coordinator
might not manage to successfully contact the joining node, in order not
to get stuck it might decide to give up and move the node to left state
and forget about it (this does not always happen as of now, but will in the
future). Because of that, the joining node must use a timeout when
waiting for a response because it's not guaranteed that it will ever
receive it.
There is an additional complication: the topology coordinator might be
busy and not notice the request to join for a long time. For example, it
might be migrating tablets or joining other nodes which are in the queue
before it. Therefore, it's difficult to choose a timeout which is long
enough for every case and still not too long.
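The bounded wait can be sketched as follows (a Python/asyncio model; the real code is C++ with different primitives, and the queue name is hypothetical):

```python
import asyncio

async def wait_for_join_response(responses, timeout_s):
    """Sketch: the joining node bounds its wait, since the topology
    coordinator may have already moved it to the left state and will
    never reply. The PR raises the bound from 30 s to 3 min."""
    try:
        return await asyncio.wait_for(responses.get(), timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError("no response from topology coordinator")

async def demo():
    q = asyncio.Queue()
    await q.put("accepted")
    ok = await wait_for_join_response(q, 1.0)
    try:
        await wait_for_join_response(asyncio.Queue(), 0.01)
        timed_out = False
    except RuntimeError:
        timed_out = True
    return ok, timed_out

result = asyncio.run(demo())
```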
Such a failure was observed to happen in ARM tests in debug mode. In
order to unblock the CI the timeout is increased from 30 seconds to 3
minutes. As a proper solution, the procedure will most likely have to be
adjusted in a more significant way.
Fixes: #15600
Closes scylladb/scylladb#15618
The method really parses provided path, so the existing name is pretty
confusing. It's extra confusing in the table::get_snapshot_details()
where it's just called and the return value is simply ignored.
Naming it "parse_..." makes it clear what the method is for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Also rephrase the messages a bit so they are more uniform.
The goal of this change is to make semaphore mismatches easier to
diagnose, by including the table name and the permit name in the
printout.
While at it, add a test for semaphore mismatch, it didn't have one.
Refs: #15485
Closes scylladb/scylladb#15508
Tracing is one of the two global services left out there, with its starting and stopping being pretty hairy. In order to de-globalize it and keep its start-stop under control, the existing start-stop sequence is worth cleaning up. This PR
* removes create_ , start_ and stop_ wrappers to un-hide the global tracing_instance thing
* renames tracing::stop() to shutdown() as it's in fact shutdown
* coroutinizes start/shutdown/stop while at it
Squeezed parts from #14156 that don't reorder start-stop calls
Closes scylladb/scylladb#15611
* github.com:scylladb/scylladb:
main: Capture local tracing reference to stop tracing
tracing: Pack testing code
tracing: Remove stop_tracing() wrapper
tracing: Remove start_tracing() wrapper
tracing: Remove create_tracing() wrapper
tracing: Make shutdown() re-entrant
tracing: Coroutinize start/shutdown/stop
tracing: Rename helper's stop() to shutdown()
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.
Fix that.
Fixes #15632
Closes scylladb/scylladb#15633
The test does (among other things) the following:
1. Create a cache reader with buffer of size 1 and fill the buffer.
2. Update the cache.
3. Check that the reader produces the first mutation as seen before
the update (because the buffer fill should have snapshotted the first
mutation), and produces the other mutations as seen after the update.
However, the test is not guaranteed to stop after the update succeeds.
Even during a successful update, an allocation might have failed
(and been retried by an allocation_section), which will cause the
body of with_allocation_failures to run again. On subsequent runs
the last check (the "3." above) fails, because the first mutation
is snapshotted already with the new version.
Fix that.
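A toy model of why the body must be rerun-safe (a hypothetical driver loosely mimicking `with_allocation_failures`, not the actual test framework):

```python
def with_allocation_failures(body, allocations):
    """Sketch of the failure-injection driver: rerun `body`, failing
    the i-th allocation on the i-th run, until a run completes. The
    body therefore executes several times, and any check it makes on
    shared mutable state must hold on reruns too -- the bug fixed
    here was a check that only held on the first run."""
    for fail_at in range(allocations + 1):
        left = [fail_at]
        def alloc():
            if left[0] == 0:
                left[0] = -1
                raise MemoryError
            left[0] -= 1
        try:
            body(alloc)
        except MemoryError:
            continue
        return

runs = []
def body(alloc):
    runs.append(1)   # shared state observed by every (re)run
    alloc()
    alloc()

with_allocation_failures(body, allocations=2)
```

With two allocations, the body runs three times: once per injected failure plus the final clean run.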
Closes scylladb/scylladb#15634
When a tablet is migrated into a new home, we need to clean its storage (i.e. the compaction group) in the old home. This includes its presence in row cache, which can be shared by multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the storage of the tablet.
Then once the cache is invalidated successfully, we clear the sstable sets of the group in the "execution phase", to prevent any background op from incorrectly picking them and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order to guarantee that a failure midway won't cause data resurrection if the tablet happens to be migrated back into the old home.
Closes scylladb/scylladb#15524
* github.com:scylladb/scylladb:
replica: Clean up storage of tablet on migration
replica: Add async gate to compaction_group
replica: Coroutinize compaction_group::stop()
replica: Make compaction group flush noexcept
Define sstable_set_impl::selector_and_schema_t type as a tuple that
contains both a newly created selector and a schema that the selector
is using.
This will allow removal of _schema field from sstable_set class as
the only place it was used was make_incremental_selector.
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
When a tablet is migrated into a new home, we need to clean its
storage (i.e. the compaction group) in the old home.
This includes its presence in row cache, which can be shared by
multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare
phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the
storage of the tablet.
Then once the cache is invalidated successfully, we clear
the sstable sets of the group in the "execution phase",
to prevent any background op from incorrectly picking them
and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order
to guarantee that a failure midway won't cause data resurrection
if the tablet happens to be migrated back into the old home.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is redundant code that should have been gone a long time ago.
The snippet (which lies above the code being deleted):
```c++
db.invoke_on_all([] (replica::database& db) {
    db.get_tables_metadata().for_each_table([] (table_id, lw_shared_ptr<replica::table> table) {
        replica::table& t = *table;
        t.enable_auto_compaction();
    });
}).get();
```
provides the same thing as the code being deleted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15597
The following new commands are implemented:
* disablebackup
* disablebinary
* disablegossip
* enablebackup
* enablebinary
* enablegossip
* gettraceprobability
* help
* settraceprobability
* statusbackup
* statusbinary
* statusgossip
* version
All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.
Refs: https://github.com/scylladb/scylladb/issues/15588
Closes scylladb/scylladb#15593
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement help operation
tools/scylla-nodetool: implement the traceprobability commands
tools/scylla-nodetool: implement the gossip commands
tools/scylla-nodetool: implement the binary commands
tools/scylla-nodetool: implement backup related commands
tools/scylla-nodetool: implement version command
test/nodetool: introduce utils.check_nodetool_fails_with()
test/nodetool: return stdout of nodetool invocation
test/nodetool/rest_api_mock.py: fix request param matching
tools/scylla-nodetool: compact: remove --partition argument
tools/scylla-nodetool: scylla_rest_client: add support delete method
tools/scylla-nodetool: get rid of check_json_type()
tools/scylla-nodetool: log more details for failed requests
tools/scylla-*: use operation_option for positional options
tools/utils: add support for operation aliases
This commit adds new pages with reference to
Handling Node Failures to Troubleshooting.
The pages are:
- Failure to Add, Remove, or Replace a Node
(in the Cluster section)
- Failure to Update the Schema
(in the Data Modeling section)
This commit moves the content of the Handling
Failures section on the Raft page to the new
Handling Node Failures page in the Troubleshooting
section.
Background:
When Raft was experimental, the Handling Failures
section was only applicable to clusters
where Raft was explicitly enabled.
Now that Raft is the default, the information
about handling failures is relevant to
all users.
Checking that nodetool fails with a given message turned out to be a
common pattern, so extract the logic for checking this into a method of
its own. Refactor the existing tests to use it, instead of the
hand-coded equivalent.
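A simplified sketch of such a helper (the real `check_nodetool_fails_with()` invokes the nodetool binary; the names below are illustrative):

```python
def check_fails_with(invoke, args, expected_error):
    """Sketch of the extracted helper: run the invocation, assert it
    fails and that the error message contains the expected text, in
    one place instead of a hand-coded try/except in every test."""
    try:
        invoke(*args)
    except Exception as e:
        assert expected_error in str(e), f"unexpected error: {e}"
        return
    raise AssertionError(f"{args}: command unexpectedly succeeded")
```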
We don't need map_reduce here since get_truncated_positions returns
the same result on all shards.
We remove 'finally' semantics in this commit since it doesn't seem we
really need it. There is no code that relies on the state of this
data structure in case of exception. An exception will propagate
to scylla_main() and the program will just exit.
This is a refactoring commit without observable
changes in behaviour.
There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
This PR updates the information on the ScyllaDB vs. Cassandra compatibility. It covers the information from https://github.com/scylladb/scylladb/issues/15563, but there could be more to fix.
@tzach @scylladb/scylla-maint Please review this PR and the page covering our compatibility with Cassandra and let me know if you see anything else that needs to be fixed.
I've added the updates with separate commits in case you want to backport some info (e.g. about AzureSnitch).
Fixes https://github.com/scylladb/scylladb/issues/15563
Closes scylladb/scylladb#15582
* github.com:scylladb/scylladb:
doc: deprecate Thrift in Cassandra compatibility
doc: remove row/key cache from Cassandra compatibility
doc: add AzureSnitch to Cassandra compatibility
The sentence says that if table args are provided, compaction will run on
all tables. This is ambiguous, so the sentence is rephrased to specify
that compaction will only run on the provided tables.
Closes scylladb/scylladb#15394
There's a finally-chain stopping tracing out there, now it can just use
the deferred stop call and that's it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's confusing, as it doesn't stop tracing, but rather shuts it down
on all shards. The only caller of it can be more descriptive without the
wrapper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today's shutdown() and its stop() peer are very restrictive in the way
callers should use them. There's not much point in that; making shutdown()
re-entrant, as for other services, will let us relax the callers' code here
and in the next patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It may happen that wrapping up a multipart upload fails too. However,
before sending the request the driver clears the _upload_id field, thus
marking the whole process as "all is OK". So in case the finalization
method fails and throws, the upload context remains on the server side
forever.
Fix this by keeping the _upload_id set, so even if finalization throws,
closing the uploader notices this and calls abort.
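The invariant of the fix, as a toy Python model (the real driver is C++ and talks S3; all names here are hypothetical):

```python
class Uploader:
    """Sketch of the fix: _upload_id is only cleared after finalization
    actually succeeds, so close() can still abort a half-finished
    multipart upload and free the server-side context."""
    def __init__(self, send):
        self._send = send
        self._upload_id = "upload-1"
        self.aborted = False

    def finalize(self):
        self._send("complete", self._upload_id)
        self._upload_id = None       # clear only on success

    def close(self):
        if self._upload_id is not None:
            self.aborted = True      # would issue an abort request
            self._upload_id = None
```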
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15521
This check is redundant. Originally it was intended to work around
rapidjson using an assert by default to check that the fields have the
expected type. But it turns out we already configure rapidjson to use a
plain exception in utils/rjson.hh, so check_json_type() is not needed
for graceful error handling.
Use operation_option to describe positional options. The structure used
before -- app_template::positional_option -- was not a good fit for
this, as it was designed to store a description that is immediately
passed to the boost::program_options subsystem and then discarded.
As such, it had a raw pointer member, which was expected to be
immediately wrapped by boost::shared_ptr<> by boost::program_options.
This produced memory leaks for tools, for options that ended up not
being used. To avoid this altogether, use operation_option, converting
to the app_template::positional_option at the last moment.
The test needs to call flush-keyspace API endpoint and currently it does it by hand. Not very convenient.
Also, in the future there will be a need for _background_ API kicking; the currently used requests package cannot do that, while the pylib REST API can
Closes scylladb/scylladb#15565
* github.com:scylladb/scylladb:
test/object_store: Use REST client from pylib
test/pylib: Add flush_keyspace() method to rest client
test/object_store: Wrap yielded managed cluster
Registering API handlers for services needs to
- happen next to the corresponding service's start
- use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job)
- get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything)
The storage proxy handlers don't follow any of those rules; this PR fixes them
Closes scylladb/scylladb#15584
* github.com:scylladb/scylladb:
api: Make storage_proxy handlers use proxy argument
api: Change some static helpers to use proxy instead of ctx
api: Pass sharded<storage_proxy> reference to storage_proxy handlers
api: Start (and stop) storage_proxy API earlier
api: Remove storage_service argument from storage_proxy setup
api: Move storage_proxy/ endpoint using storage_service
api: Remove storage_proxy.hh from storage_service.cc
main: Initialize API server early
Use NullCompactionStrategy for the test_table fixture
rather than using the `no_autocompaction_context`.
Besides being simpler, regular compaction just gets in the way
of all tests that use `SELECT MUTATION_FRAGMENTS`.
The latter would be problematic when we start running cql-pytest
test cases in parallel rather than in serial, since it
would inadvertently affect other test cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15574
And stop using proxy reference from http context. After a while the
proxy dependency will be removed from http context
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some helpers in storage_proxy.cc that get proxy reference from
passed http context argument. Next patch will stop using ctx for that
purpose, so prepare in advance by making the helpers use proxy reference
argument directly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the handlers use the proxy argument instead of
keeping proxy as a dependency on the http context (other handlers are
mostly like that already)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The code setting up storage_proxy/ endpoints no longer needs
storage_service and related decoration
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_proxy/get_schema_version endpoint is served by storage_service, so it
should be in storage_service.cc instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Surprisingly, the dependency-less API server context is initialized
somewhere in the middle of main. By that time some "real" services have
already started and should have the ability to register their endpoints,
so the API context should be initialized way ahead. This patch places its
initialization next to the prometheus init.
One thing that's not nice here is that API port listening remains where
it was before the patch, so for the external ... observer, API
initialization doesn't change. Likely the API should start listening for
connections earlier as well, but that's left for future patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR is the second step in refactoring the Hinted Handoff module. It cleans up the contents of the file `hint_storage.cc`. The biggest change is the transition from continuations to coroutines.
Refs #15358
Closes scylladb/scylladb#15496
* github.com:scylladb/scylladb:
db/hints: Alias segment list in hint_storage.cc
db/hints: Rename rebalance to rebalance_hints
db/hints: Clean up rebalance() in hint_storage.cc
db/hints: Coroutinize hint_storage.cc
db/hints: Clean up remove_irrelevant_shards_directories() in hint_storage.cc
db/hints: Clean up rebalance_segments() in hint_storage.cc
db/hints: Clean up rebalance_segments_for() in hint_storage.cc
db/hints: Clean up get_current_hints_segments() in hint_storage.cc
db/hints: Rename scan_for_hints_dirs to scan_shard_hint_directories
db/hints: Clean up scan_for_hints_dirs() in hint_storage.cc
db/hints: Wrap hint_storage.cc in an anonymous namespace
Add a REST API to reload Raft topology state without having to restart a node and use it in `test_fence_hints`. Restarting the node has undesired side effects which cause test flakiness; more details provided in commit messages.
Refactor the test a bit while at it.
Fixes: #15285
Closes scylladb/scylladb#15523
* github.com:scylladb/scylladb:
test: test_fencing.py: enable hints_manager=trace logs in `test_fence_hints`
test: test_fencing.py: reload topology through REST API in `test_fence_hints`
test: refactor test_fencing.py
api: storage_service: add REST API to reload topology state
Allow disabling auto-compaction for given table(s)
using either the ks.table syntax or ks:table (as the
API suggests).
The first syntax would likely be more common since
the test tables we automatically create are named
test_keyspace.test_table, so we can pass that name
to `no_autocompaction_context` as is.
test_tools.system_scylla_local_sstable_prepared was
modified to disable auto-compaction only on
the `system.scylla_local` table rather than
the whole `system` keyspace, since it only relies
on this table. Plus, it helps test this change :)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15575
Enable TRACE level logging on the server that's supposed to send the
hints. Should make it easier to debug failures in the future, if any
happen again.
Restarting a node in order to reload topology may have side effects that
lead to test flakiness. While the node is shutting down, it gives up
leadership. Before it finishes shutting down, another node may become
Raft group 0 leader, then topology coordinator, then send a topology
command, triggering topology state reload on the shutting down node,
causing its topology version to get updated, allowing it to send a
successful hint before it shuts down and restarts. After it restarts, no
more hints will be sent, so the metrics condition we're waiting for (for
a hint to be sent) will never become true (metrics are not persisted
between restarts).
Instead of restarting, reload topology state through the new REST API.
This also makes the test a bit faster.
Fixes #15285
- use `manager.get_cql()` to silence mypy (`manager.cql` is `Optional`)
- extract `metrics.lines_by_prefix('scylla_hints_manager_')` to a helper
function
- when waiting for conditions on metrics, split the condition into
safety and liveness part, and fail early if the safety part does not
hold
- in `exactly_one_hint` send, don't check that `send_errors_metric` is
`0` (it won't be after the next commit)
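The safety/liveness split described above can be sketched as a small helper; the function name and signature below are illustrative, not the actual test code:

```python
import time

def wait_for_metric(get_value, target, safety_check, timeout=10.0, poll=0.1):
    """Wait until get_value() reaches target (liveness), failing early
    if safety_check(value) is ever violated (safety)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = get_value()
        # Safety part: fail immediately instead of timing out much later.
        assert safety_check(value), f"safety violated: {value!r}"
        # Liveness part: the condition we are actually waiting for.
        if value >= target:
            return value
        time.sleep(poll)
    raise TimeoutError("liveness condition not reached in time")
```

Failing fast on the safety part turns a slow, hard-to-diagnose timeout into an immediate assertion with the offending metric value.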
Some tests may want to modify system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.
The API can also be called outside tests; it should not have any
observable effects unless the user modifies the `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (https://github.com/scylladb/scylladb/pull/14722, https://github.com/scylladb/scylladb/pull/14232), and the last piece necessary to replace cluster feature checks in gossip.
The current implementation relies on the gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform a feature check which itself depends on information in group 0. We have a dependency cycle.
In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:
- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
- If the node is accepted, the coordinator transitions it to the `bootstrap`/`replace` state and transitions the topology to the `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that the node was accepted.
- Otherwise, the node is moved to `left` state, told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected and it shuts down.
The procedure is not retryable - if a node fails to do it from start to end and crashes in between, it will not be allowed to retry it with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add it again (so that a new host_id is generated).
More details about the procedure and the RPC are described in `topology-over-raft.md`.
Fixes: #15152
Closes scylladb/scylladb#15196
* github.com:scylladb/scylladb:
tests: mark test_blocked_bootstrap as skipped
storage_service: do not check features in shadow round
storage_service: remove raft_{boostrap,replace}
topology_coordinator: relax the check in enable_features
raft_group0: insert replaced node info before server setup
storage_service: use join node rpc to join the cluster
topology_coordinator: handle joining nodes
topology_state_machine: add join_group0 state
storage_service: add join node RPC handlers
raft: expose current_leader in raft::server
storage_service: extract wait_for_live_nodes_timeout constant
raft_group0: abstract out node joining handshake
storage_service: pass raft_topology_change_enabled on rpc init
rpc: add new join handshake verbs
docs: document the new join procedure
topology_state_machine: add supported_features to replica_state
storage_service: check destination host ID in raft verbs
group_state_machine: take reference to raft address map
raft_group0: expose joined_group0
This commit adds AzureSnitch (together with a link
to the AzureSnitch description) to the Cassandra
compatibility page.
In addition, the Snitches table is fixed.
Test cases kick Scylla by hand to force a keyspace flush (to get the
objects onto the object store). Equip the wrapped cluster object with the
REST API class instance for convenience.
The assertion for the 200 return status code is dropped; the REST client
does it behind the scenes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Test cases use a temporary cluster object which is, in fact, a cql cluster.
In the future there will be a need to perform more actions on it
than just querying it with the cql client, so wrap the cluster with
an extendable object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check that the cache reader reacts correctly to the semaphore's OOM kill
attempt, letting the read fail instead of going berserk, trying to
reserve more and more memory until the reservation cannot be satisfied.
replica::table has the same gate for gating async operations, and
even synchronizes stop of the table with in-flight writes that will
apply into memory.
The compaction group gains the same gate, which will be used when
operations are confined to a single group. The table's gate is kept
for table-wide operations like query, truncate, etc.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit makes the function less compact and abides by the limit
of 120 characters per line; that makes the code more readable.
We start using fmt::to_string instead of seastar::format("{:d}")
to convert integers to strings -- the new way is the preferred one.
The changes also name variables in a more descriptive way.
This commit makes the function less compact and abides by the limit
of 120 characters per line. That makes the code more readable.
It also stops unnecessarily calling c_str() on seastar::sstring.
There is no need to call c_str() on the name of the directory entry.
In fact, the overload of std::stoi() used takes an std::string as its
argument. Providing seastar::sstring instead of const char* is more
efficient because we can allocate just the right amount of memory
and std::memcpy into it, i.e. call std::string(const char*, std::size_t).
Using the overload std::string(const char*) would need to first
traverse the string to find the null byte.
This is a small change, all the more because paths don't tend to
be long, but it's some gain nonetheless.
The commit also inserts a few empty lines to make the code less
compact and improve readability as a result.
An anonymous namespace is a safer mechanism than the static
keyword. When adding a new piece of code, it's easy to
forget to add the static. In that case, that code
gets external linkage. However, when code that is put
in an anonymous namespace is referenced where it should
not be, the linker will immediately detect it (in most
cases), and the programmer will be able to spot and fix
their mistake right away.
Said method catches bad-allocs and retries the passed-in function after
raising the reserves. This does nothing to help the function succeed if
the bad alloc was thrown by the semaphore because the kill limit was
reached. In this case the read should be left to fail and terminate.
Now that the semaphore throws utils::memory_limit_reached in this
case, we can distinguish it and just re-throw the exception.
A distinct exception derived from std::bad_alloc, used in cases when
memory didn't really run out, but the process or task reached the memory
limit allotted to it. Using a distinct type for this case allows LSA
to react to it correctly.
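A Python analogue of the idea (the actual code is C++, with utils::memory_limit_reached deriving from std::bad_alloc; all names below are illustrative):

```python
# MemoryLimitReached derives from the generic MemoryError the same way
# utils::memory_limit_reached derives from std::bad_alloc: callers that
# retry on allocation failure can recognize it and fail fast instead.
class MemoryLimitReached(MemoryError):
    """The task hit its allotted memory limit; memory didn't really run out."""

def run_with_reserves(fn, retries=3):
    """Retry fn on an ordinary allocation failure, but not when the limit
    was deliberately enforced -- retrying cannot help in that case."""
    for _ in range(retries):
        try:
            return fn()
        except MemoryLimitReached:
            raise                 # kill limit reached: let the read fail
        except MemoryError:
            continue              # raise reserves / evict, then retry
    raise MemoryError("out of retries")
```

The derived type keeps existing `except MemoryError` handlers working while letting the retry loop single out the enforced-limit case.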
With the new procedure to join nodes, testing the scenario in
`test_blocked_bootstrap` becomes very tricky. To recap, the test does
the following:
- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is
triggered which causes node 4 to fail after it joins group 0. Note
that before the JOIN_NODE handshake, this would only result in node 4
being added to the group 0 config; no modification to the group 0 state
itself was done - the joining node is supposed to write a request
to join.
- Tries to replace node 1 again with node 5, which should succeed.
The bug that this regression test was supposed to check for was that
node 5 would try to resolve all IPs of nodes added to the group 0 config.
Because node 4 shuts down before advertising itself in gossip, node 5
would get stuck.
The new procedure to join group 0 complicates the situation because a
request to join is written first to group 0 and only then the topology
coordinator modifies the group 0 config. It is possible to add an error
injection to the topology coordinator code so that it doesn't change the
group 0 state and proceeds with bootstrapping the node, but it will only
get stuck trying to add the node. If node 5 tries to join in the
meantime, the topology coordinator may switch to it and try to bootstrap
it instead, but this is basically a 50% chance because it depends on the
order of node 4 and node 5's host IDs in the topology_state_machine
struct.
It should be possible to fix the test with error recovery, but until
then it is marked as skipped.
The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.
Currently, `enable_features` requires that there is no topology operation
in progress and there are no nodes waiting to be joined. Now that the new
handshake is implemented, we can drop the second condition because nodes
in the `none` state are not part of group 0 yet.
Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating features to enable.
Currently, information about replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about replaced node needs to be inserted
before the handshake happens.
This commit moves the insertion of the replaced node's data before the
call to `join_group0`.
Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_new_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.
This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.
This PR replaces a link to a section of the ScyllaDB website with little information about ScyllaDB vs. Cassandra with a link to
a documentation section where Cassandra compatibility is covered in detail.
In addition, it removes outdated or irrelevant information about versions from the Cassandra compatibility page.
Now that the documentation is versioned, we shouldn't add such information to the content.
Fixes https://github.com/scylladb/scylla-enterprise/issues/3454
Closes scylladb/scylladb#15562
* github.com:scylladb/scylladb:
doc: remove outdated/irrelevant version info
doc: replace the link to Cassandra compatibility
This commit removes outdated or irrelevant
information about versions from the Cassandra
compatibility page.
Now that the documentation is versioned, we
shouldn't add such information to the content.
`if (.. EQUAL ..)` is used to compare numbers, so if the LHS is not
a number the condition is evaluated as false. This prevents us from
setting -march when building for aarch64 targets. And because the
crc32 implementation in utils/ always uses the crypto extension
intrinsics, this also breaks the build like:
```
In file included from /home/fedora/scylla/utils/gz/crc_combine.cc:40:
/home/fedora/scylla/utils/clmul.hh:60:12: error: always_inline function 'vmull_p64' requires target feature 'aes', but would be inlined into functi
on 'clmul_u32' that is compiled without support for 'aes'
return vmull_p64(p1, p2);
^
```
So, in this change:
* compare the two strings using `STREQUAL`.
* document the reason why we need to set -march to the
specified argument.
see also http://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#g_t-march-and--mcpu-Feature-Modifiers
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15553
this series is one of the steps to remove global statements in `configure.py`.
Not only is the script more structured this way; this also allows us to quickly identify the parts which should/can be reused when migrating to the CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15552
* github.com:scylladb/scylladb:
build: pass `args` explicitly
build: remove `distro_extra_ldflags`
build: remove `distro_extra_cflags`
build: remove `distro_extra_cmake_args`
build: pass variables explicitly
build: do not mutate args.user_cflags
build: do not mutate args.user_ldflags
build: use os.makedirs(exist_ok=True)
Some "Additional Information" section headings
appear in the page tree in the left sidebar
because of their incorrect underlines.
This commit fixes the problem by replacing the title
underline with a section underline.
Closes scylladb/scylladb#15550
This commit replaces a link to a section of
the ScyllaDB website with little information
about ScyllaDB vs. Cassandra with a link to
a documentation section where Cassandra
compatibility is covered in detail.
The storage service API set/unset has two flaws.
First, unset doesn't happen, so after the storage service is stopped its handlers become "local is not initialized"-assertion and use-after-free landmines.
Second, setting up the storage service API carries gossiper and system keyspace references, thus duplicating the knowledge about storage service dependencies.
This PR fixes both by adding the storage service API unsetting and by making the handlers use _only_ the storage service instance, not any externally provided references.
Closes scylladb/scylladb#15547
* github.com:scylladb/scylladb:
main, api: Set/Unset storage_service API in proper place
api/storage_service: Remove gossiper arg from API
api/storage_service: Remove system keyspace arg from API
api/storage_service: Get gossiper from storage service
api/storage_service: Get token_metadata from storage service
Instead of relying on updating `globals()` with `args`, pass the
arguments explicitly; this helps us understand the data dependencies
in this script better.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
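A minimal sketch of the refactoring direction described above, with hypothetical function names and flags (not the actual configure.py code):

```python
import argparse

# Before (sketch): module-level state hid which functions read which args:
#   globals().update(vars(args))
# After: every function takes the parsed args (or the specific values) it
# needs, making the data dependencies explicit and greppable.
def compiler_flags(args: argparse.Namespace) -> list[str]:
    # The function's inputs are now visible in its signature.
    flags = ["-std=c++20"]
    flags.append("-O0" if args.debug else "-O2")
    return flags

parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")

assert compiler_flags(parser.parse_args(["--debug"])) == ["-std=c++20", "-O0"]
assert compiler_flags(parser.parse_args([])) == ["-std=c++20", "-O2"]
```

With explicit parameters, moving a helper into a CMake-driven script later only requires passing the same values, not recreating hidden globals.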
Instead of using `globals()`, pass the used variables explicitly; this
helps us understand the data dependencies in this script better.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Mutating the member variables in `args` after it is returned from
`arg_parser.parse_args()` is confusing. Let's use a new variable
for tracking the updated `user_cflags`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Mutating the member variables in `args` after it is returned from
`arg_parser.parse_args()` is confusing. Let's use a new variable
for tracking the updated `user_ldflags`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Instead of checking for the existence of the directory, use the
`exist_ok` parameter, which was introduced back in Python 3.2
and is already used elsewhere in this script.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
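For illustration, the two styles side by side (paths are hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "build", "release")

    # Before: check-then-create, which is more verbose and racy -- the
    # directory can appear between the check and the makedirs() call.
    if not os.path.exists(target):
        os.makedirs(target)

    # After: one idempotent call; no error if the directory already exists.
    os.makedirs(target, exist_ok=True)

    assert os.path.isdir(target)
```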
When presented with queries that use the same named bind variables twice, like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```
Scylla generated empty `partition_key_bind_indexes` (`pk_indexes`).
`pk_indexes` tell the driver which bind variables it should use to calculate the partition token for a query. Without it, the driver is unable to determine the token and it will send the query to a random node.
Scylla should generate pk_indexes which tell the driver that it can use bind variable with `bind_index = 0` to calculate the partition token for this query.
The problem was that `_target_columns` kept only a single target_column for each bind variable.
In the example above `:x` is compared with both `p` and `c`, but `_target_columns` would contain only one of them, and Scylla wasn't able to tell that this bind variable is compared with a partition key column.
To fix it, let's replace `_target_columns` with `_targets`. `_targets` keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.
A `cql-pytest` reproducer is added.
I also added some comments in `prepare_context.hh/cc` to make it easier to read.
Fixes: https://github.com/scylladb/scylladb/issues/15374
Closes scylladb/scylladb#15526
* github.com:scylladb/scylladb:
cql-pytest/test-prepare: remove xfail marker from *pk_indexes_duplicate_named_variables
cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables
cql3: improve readability of prepare_context
cql-pytest: test generation of pk indexes during PREPARE
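The gist of the fix can be modeled in a few lines of Python; the data structures below are deliberately simplified stand-ins for the real `_target_columns`/`_targets` members, not Scylla's actual code:

```python
# For: SELECT p FROM t WHERE p = :x AND c = :x
partition_key = ["p"]
comparisons = [("p", ":x"), ("c", ":x")]   # (column, bind variable) pairs
bind_order = [":x"]                        # bind variables in order

# Old behaviour: one target column per variable, so the later comparison
# with `c` overwrites the partition-key comparison with `p`.
single_target = {var: {col} for col, var in comparisons}   # {":x": {"c"}}

# New behaviour: remember every column a variable is compared with.
all_targets = {}
for col, var in comparisons:
    all_targets.setdefault(var, set()).add(col)            # {":x": {"p", "c"}}

def pk_indexes(targets):
    """Bind indexes the driver can use to compute the partition token."""
    return [i for i, var in enumerate(bind_order)
            if any(col in partition_key for col in targets.get(var, ()))]

assert pk_indexes(single_target) == []   # token cannot be computed
assert pk_indexes(all_targets) == [0]    # bind index 0 covers column `p`
```

Keeping every comparison means no partition-key binding can be forgotten, which is exactly why the driver can now route the query to the right node.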
There's a dedicated forward_service::shutdown() method that's defer-scheduled in main for very early invocation. That's not nice; the fwd service start-shutdown-stop sequence can be made "canonical" by moving the shutting-down code into an abort source subscription. A similar thing was done for the view updates generator in 3b95f4f107.
refs: #2737
refs: #4384
Closes scylladb/scylladb#15545
* github.com:scylladb/scylladb:
forward_service: Remove .shutdown() method
forward_service: Set _shutdown in abort-source subscription
forward_service: Add abort_source to constructor
We compare the symbol list of the stripped ELF file ($orig.stripped) with
that of the one including debugging symbols ($orig.debug) to get
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).
But we generate the symbol list of the stripped ELF file using the
sysv format, while generating the one from the unstripped file using the
posix format. The former always pads the symbol names with spaces
so that their length is at least that of the section name after
we split the fields with "|".
That's why the diff includes stuff we don't expect, and hence
we have tons of warnings like:
```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```
when using objcopy to filter the ELF file to keep only the
symbols we are interested in.
So, in this change:
* use the same format when dumping the symbols from the unstripped ELF
file.
* include the symbols in the text area -- the code -- by checking for
"T" and "t" in the dumped symbols. This was achieved by matching
the lines with "FUNC" before this change.
* include the symbols in the .init data section -- the global
variables which are initialized at compile time. They could
also be interesting when debugging an application.
Fixes #15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15514
When doing a SELECT CAST(b AS int), Cassandra returns a column named
cast(b as int). Currently, Scylla uses a different name -
system.castasint(b). For Cassandra compatibility, we should switch to
the same name.
Fixes #14508
Closes scylladb/scylladb#14800
Before integration with the task manager, the state of a one-shard repair
was kept in repair_info. The repair_info object was destroyed immediately
after the shard repair finished.
During the integration, repair_info's fields were moved to
shard_repair_task_impl as the two served similar purposes.
However, shard_repair_task_impl isn't destroyed immediately, but is
kept in the task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which delays the repair state change.
Release shard_repair_task_impl resources immediately after the shard
repair is finished.
Fixes: #15505.
Closes scylladb/scylladb#15506
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.
If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.
If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
Like in the non-raft topology path, during the new handshake, the
joining node will wait until all normal nodes are alive. The timeout
used during the wait is extracted to a constant so that it will be
reused in the handshake code, to be introduced in later commits.
Currently, the raft_group0 uses GROUP0_MODIFY_CONFIG RPC to ask an
existing group 0 member to add this node to the group, in case the
joining node was not a discovery leader. The new handshake verbs
(JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a
preparation, this commit abstracts away the handshake process.
We will want to conditionally register some verbs based on whether we
are using raft topology or not. This commit serves as a preparation,
passing the `raft_topology_change_enabled` to the function which
initializes the verbs (although there is _raft_topology_change_enabled
field already, it's only initialized on shard 0 later).
The `join_node_request` and `join_node_response` RPCs are added:
- `join_node_request` is sent from the joining node to any node in the
cluster. It contains some initial parameters that will be verified by
the receiving node, or the topology coordinator - notably, it contains
a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
joining node to tell it about the outcome of the verification.
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.
In order not to duplicate logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features and users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features is quite awkward now.
This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we now can
refer to node's supported features in a more natural way.
The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In the latter case, this triggers on_internal_error.
This commit adds protection to the existing verbs in storage_service:
they check whether the group 0 is running and whether the received
host_id matches the actual recipient's host_id.
None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
Before this commit the primary key was hashed for bloom filter check
for each sstable.
This commit makes the key be hashed once per sstable set and reused
for bloom filter lookups in all sstables in the set.
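An illustrative model of the optimization (the hash function and bloom filter below are stand-ins, not Scylla's actual implementation):

```python
import hashlib

# Each sstable's bloom filter is probed with a hash of the partition key.
# Hashing once per sstable set and reusing the digest avoids re-hashing
# the key for every sstable in the set.
def key_hash(key: bytes) -> int:
    # Placeholder hash; the real code uses its own key hashing scheme.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "little")

class BloomFilter:
    def __init__(self, bits=1024):
        self.bits = bits
        self.slots = set()
    def add_hash(self, h):
        self.slots.add(h % self.bits)
    def may_contain_hash(self, h):
        return (h % self.bits) in self.slots

sstable_filters = [BloomFilter() for _ in range(10)]
sstable_filters[3].add_hash(key_hash(b"pk-42"))

# Before: the key is hashed inside every lookup (10 hash computations).
hits_slow = [f.may_contain_hash(key_hash(b"pk-42")) for f in sstable_filters]

# After: hash computed once per sstable set, reused for every filter.
h = key_hash(b"pk-42")
hits_fast = [f.may_contain_hash(h) for f in sstable_filters]

assert hits_slow == hits_fast
assert hits_fast[3] is True
```

The results are identical; only the number of hash computations per read drops from one per sstable to one per sstable set.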
I tested this change using perf_simple_query with the following modifications:
1. Create more than one sstable to have an sstable set with more than one element
2. Try to prevent compactions (I wasn't 100% successful)
3. Use a key that's not present to avoid reading from disk
```
diff --git a/test/perf/perf_simple_query.cc b/test/perf/perf_simple_query.cc
index 26dbf1e99..6bd460df2 100644
--- a/test/perf/perf_simple_query.cc
+++ b/test/perf/perf_simple_query.cc
@@ -105,6 +105,8 @@ std::ostream& operator<<(std::ostream& os, const test_config& cfg) {
static void create_partitions(cql_test_env& env, test_config& cfg) {
std::cout << "Creating " << cfg.partitions << " partitions..." << std::endl;
+ // Create 10 sstables each with all the data
+ for (unsigned count = 0; count < 10; ++count) {
for (unsigned sequence = 0; sequence < cfg.partitions; ++sequence) {
if (cfg.counters) {
execute_counter_update_for_key(env, make_key(sequence));
@@ -117,6 +119,7 @@ static void create_partitions(cql_test_env& env, test_config& cfg) {
std::cout << "Flushing partitions..." << std::endl;
env.db().invoke_on_all(&replica::database::flush_all_memtables).get();
}
+ }
}
static int64_t make_random_seq(test_config& cfg) {
@@ -137,8 +140,18 @@ static std::vector<perf_result> test_read(cql_test_env& env, test_config& cfg) {
query += " using timeout " + cfg.timeout;
}
auto id = env.prepare(query).get0();
- return time_parallel([&env, &cfg, id] {
- bytes key = make_random_key(cfg);
+ // Always use the same key that is not present
+ // to make sure we don't read from disk and make
+ // the benchmark CPU bounded.
+ int64_t key_value = 6;
+ bytes key(bytes::initialized_later(), 5*sizeof(key_value));
+ auto i = key.begin();
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ return time_parallel([&env, id, key] {
return env.execute_prepared(id, {{cql3::raw_value::make_value(std::move(key))}}).discard_result();
}, cfg.concurrency, cfg.duration_in_seconds, cfg.operations_per_shard, cfg.stop_on_error);
}
@@ -423,6 +436,10 @@ static std::vector<perf_result> do_cql_test(cql_test_env& env, test_config& cfg)
.with_column("C2", bytes_type)
.with_column("C3", bytes_type)
.with_column("C4", bytes_type)
+ // Try to prevent compaction
+ // to keep the number of sstables high
+ .set_compaction_enabled(false)
+ .set_min_compaction_threshold(2000000000)
.build();
}).get();
@@ -539,6 +556,11 @@ int scylla_simple_query_main(int argc, char** argv) {
const auto enable_cache = app.configuration()["enable-cache"].as<bool>();
std::cout << "enable-cache=" << enable_cache << '\n';
db_cfg->enable_cache(enable_cache);
+ // Try to prevent compaction
+ // to keep the number of sstables high
+ db_cfg->concurrent_compactors(1);
+ db_cfg->compaction_enforce_min_threshold(true);
+ db_cfg->compaction_throughput_mb_per_sec(1);
cql_test_config cfg(db_cfg);
return do_with_cql_env_thread([&app] (auto&& env) {
```
The following command showed a 2-3% improvement on my machine, but this
depends on the length of the key and the number of sstables in the set.
```
./build/release/scylla perf-simple-query --bypass-cache --flush -c 1
--random-seed=2068087418 --enable-cache false
```
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
Closes scylladb/scylladb#15538
SSTable runs work hard to keep the disjointness invariant, and are
therefore expensive to build from scratch.
On every insertion, a run keeps its elements sorted by their first key in
order to reject insertions that would introduce overlapping.
Additionally, an sstable run can grow to dozens (or hundreds) of elements.
Therefore, we can also make interaction with compaction strategies more
efficient by not copying runs when building the list of candidates in the
compaction manager, and less fragile by filtering out any sstable runs
that are not completely eligible for compaction.
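The sorted-insertion-with-rejection behavior described above can be sketched as follows. This is a hypothetical, simplified run over integer key ranges, not Scylla's actual sstable_run; the [[nodiscard]] on insert mirrors the tagging of sstable_run::insert() done in this series.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// A fragment of a run, covering an inclusive key range.
struct fragment { int first_key; int last_key; };

class run {
    // sorted by first key; std::map gives O(log n) neighbor lookup
    std::map<int, int> _fragments;  // first_key -> last_key
public:
    // Rejects (returns false) any fragment that would overlap an existing
    // one, preserving the disjointness invariant.
    [[nodiscard]] bool insert(fragment f) {
        auto next = _fragments.lower_bound(f.first_key);
        // overlap with the successor fragment?
        if (next != _fragments.end() && next->first <= f.last_key) {
            return false;
        }
        // overlap with the predecessor fragment?
        if (next != _fragments.begin() && std::prev(next)->second >= f.first_key) {
            return false;
        }
        _fragments.emplace(f.first_key, f.last_key);
        return true;
    }
    std::size_t size() const { return _fragments.size(); }
};
```

Because the fragments are kept sorted, both the overlap check and the insertion are logarithmic, which is what makes rebuilding a run from scratch (a full re-sort) comparatively expensive.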
Previously, ICS had to give up on using runs managed by sstable set due to
fragility of the interface (meaning runs are being built from scratch
on every call to the strategy, which is very inefficient, but that had to
be done for correctness), but now we can restore that.
Closes scylladb/scylladb#15440
* github.com:scylladb/scylladb:
compaction: Switch to strategy_control::candidates() for regular compaction
tests: Prepare sstable_compaction_test for change in compaction_strategy interface
compaction: Allow strategy to retrieve candidates either as sstables or runs
compaction: Make get_candidates() work with frozen_sstable_run too
sstables: add sstable_run::run_identifier()
sstables: tag sstable_run::insert() with nodiscard
sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs
sstables: Simplify sstable_set interface to retrieve runs
Most of the time, only the roots of the task tree should be non-internal.
Change the default implementation of is_internal and delete overrides
consistent with it.
Closes scylladb/scylladb#15353
Distributed loader code still "knows" that table's datadir is a filesystem directory with some structure. For S3-backed sstables this still works because for S3 keyspaces scylla still creates and maintains empty directories in datadir. This set fixes the dist. loader assumptions about that and moves them into sstable directory's lister.
refs: #13020
Closes scylladb/scylladb#15542
* github.com:scylladb/scylladb:
sstable_directory: Indentation fix after previous patch
sstable_directory: Simplify filesystem prepare()
distributed_loader: Remove get_path() method
distributed_loader: Move directory touching to sstable_directory
distributed_loader: Move directory existance checks to sstable_directory
sstable_directory: Move prepare() core to lister
for better readability, and do not create a CMAKE_BUILD_TYPE CACHE
entry if it is already set using `-D`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15543
Currently the storage-service API handlers are set up in a "random" place.
It can happen earlier -- as soon as the storage service itself is ready.
Also, even though the storage service is stopped on shutdown, API handlers
continue to reference it, leading to potential use-after-frees or "local is
not initialized" assertions.
Fix both. Unsetting is pretty bulky; scylladb/seastar#1620 is to help.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some handlers in set_storage_service() have an implicit dependency on
the gossiper. It's not the API that should track it, but the storage
service itself, so get the gossiper from the service, not from the
external argument (it will be removed soon)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API handlers that live in set_storage_service() should be
self-contained and operate on the storage service only. That said, they
should get the token metadata, when needed, from the storage service,
not from somewhere else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There can be 2 waiters now (coordinator and CDC generation publisher),
so signal() is not enough.
The change made in c416c9ff33 missed
updating this site.
Closes scylladb/scylladb#15527
When preparing a `field_selection`, we need to prepare the UDT value,
and then verify that it has this field.
`field_selection_test_assignment` prepares the UDT value using the same
receiver as the whole `field_selection`. This is wrong, this receiver
has the type of the field, and not the UDT.
It's impossible to create a receiver for the UDT. Many different UDTs
can produce an `int` value when the field `a` is selected.
Therefore the receiver should be `nullptr`.
No unit test is added, as this bug doesn't currently cause any issues.
Preparing a column value doesn't do any type checks, so nothing fails.
Still it's good to fix it, just to be correct.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#14788
Currently the bit is set in the .shutdown() method which is called early on
stop. After the patch the bit is set in the abort-source subscription
callback which is also called early on stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now everything is prepared for the switch, let's do it.
Now let's wait for ICS to enjoy the set of changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for upcoming changes that will allow ICS to efficiently
retrieve sstable runs.
Next patch will remove candidates from compaction_strategy's interface
to retrieve candidates using this one instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstable_run may reject insertion of an sstable if it's going
to break the disjointness invariant of the run, but it's important
that the caller is aware of that, so it can act on it, e.g. by
generating a new run id for the sstable so it can be inserted
into another run. The nodiscard tag is important to avoid silent
problems in this area.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of all_sstable_runs() don't want to mutate the runs, but rather
work with their content. So let's avoid copy and make the intention
explicit with the new frozen_sstable_run used as return type
for the interface.
This will guarantee that ICS will be able to fetch uncompacting
runs efficiently.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface selects all runs that store at least one of the
sstables in the vector.
But that's very fragile, to the point that even ICS had to
stop using it. A better interface is to return all runs
managed by the set and allow compaction manager to do its
filtering.
We want to use it in ICS to avoid the overhead of rebuilding
sstable runs which may be expensive as sorting is performed
to guarantee the disjoint invariant.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to CMake based building system.
Refs https://github.com/scylladb/scylladb/issues/15379
Closes scylladb/scylladb#15515
* github.com:scylladb/scylladb:
build: extract get_os_ids() out
build: extract find_ninja() out
build: extract thrift_uses_boost_share_ptr() out
Currently, the tools loosely follow the following convention on
error-codes:
* return 1 if the error is with any of the command-line arguments
* return 2 on other errors
This patch changes the returned error-code on unknown operation/command
to 100 (instead of the previous 1). The intent is to allow any wrapper
script to determine that the tool failed because the operation is
unrecognized and not because of something else. In particular this
should enable us to write a wrapper script for scylla-nodetool, which
dispatches commands still un-implemented in scylla-nodetool, to the java
nodetool.
Note that the tool will still print an error message on an unknown
operation. So such wrapper script would have to make sure to not let
this bleed-through when it decides to forward the operation.
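The error-code convention above can be illustrated with a minimal sketch. The function and set of known operations below are hypothetical, not the actual tool code; the point is only the distinct 1 / 2 / 100 return values a wrapper script can branch on.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative sketch of the convention described above:
//   1   - error with the command-line arguments
//   2   - other errors (returned by the operation itself, not modeled here)
//   100 - unknown operation, so a wrapper can forward it elsewhere
//         (e.g. to the java nodetool)
int dispatch(const std::string& op, const std::set<std::string>& known_ops) {
    if (op.empty()) {
        return 1;   // argument error
    }
    if (known_ops.count(op) == 0) {
        return 100; // unrecognized operation: let the wrapper fall back
    }
    return 0;       // operation recognized
}
```

A wrapper would run the tool, and only on exit code 100 re-dispatch the same command line to the fallback implementation, suppressing the first tool's error message.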
Closes scylladb/scylladb#15517
When the FS lister gets prepared it
- checks if the directory exists
- creates it if it doesn't, or bails out if it's the quarantine one
- goes on to check the directory's owner and mode
The last step is excessive if the directory didn't exist on entry and
was created.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of the previous patch -- when populating a table,
creating directories should be (optionally) performed by the lister
backend, not by the generic loader.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The loader code still "knows" that tables' sstables live in directories
on datadir filesystem, but that's not always so. So whether or not the
directory with sstables exists should be checked by sstable directory's
component lister, not the loader.
After this change a potentially missing quarantine directory will be
processed by the sstable directory with an empty result, but that's OK:
empty directories should already be handled correctly, so whether the
directory lister produces no sstables because it found no files, or
because it just skipped scanning, makes no difference.
Current sstable_directory::prepare() code checks the sstable directory
existence, which only makes sense for filesystem-backed sstables.
S3-backed don't (well -- won't) have any directories in datadir, so the
check should be moved into component lister.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Issue #15374 has been fixed, so these tests can be enabled.
Duplicate bind variable names are now handled correctly.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When presented with queries that use the same named bind variables twice,
like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```
Scylla generated empty partition_key_bind_indexes (pk_indexes).
pk_indexes tell the driver which bind variables it should use to calculate the partition
token for a query. Without it, the driver is unable to determine the token and it will
send the query to a random node.
Scylla should generate pk_indexes which tell the driver that it can use bind variable
with bind_index = 0 to calculate the partition token for a query.
The problem was that _target_columns keeps only a single target column for each bind variable.
In the example above :x is compared with both p and c, but _target_columns would contain
only one of them, so Scylla wasn't able to tell that this bind variable is compared with
a partition key column.
To fix it, let's replace _target_columns with _targets. _targets keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.
Fixes: #15374
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
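The fix described above can be sketched as follows. The types and function are illustrative, not Scylla's actual code: a _targets-like list keeps every (bind index, column) comparison, so a variable compared with several columns is never forgotten, and pk_indexes is derived by walking the partition-key columns in schema order.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One comparison between a bind variable and a column.
using target = std::pair<int, std::string>;  // bind index, column name

// Returns, for each partition-key column (in schema order), the index of a
// bind variable compared with it; empty if some pk column has no variable.
std::vector<int> pk_indexes(const std::vector<target>& targets,
                            const std::vector<std::string>& pk_columns) {
    std::vector<int> result;
    for (const auto& pk : pk_columns) {
        for (const auto& [idx, col] : targets) {
            if (col == pk) {
                result.push_back(idx);
                break;
            }
        }
    }
    // only a complete index list lets the driver compute the token
    return result.size() == pk_columns.size() ? result : std::vector<int>{};
}
```

For `SELECT p FROM table WHERE p = :x AND c = :x`, keeping both comparisons for `:x` means the partition key `p` is found, yielding pk_indexes [0]; with a single-entry _target_columns, only one of the two comparisons would survive, and the result could come out empty.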
This commits adds a few comments and changes a few variable names
so that it's easier to figure out what the code does.
When I first started looking at this part of the code it wasn't
obvious what's going on - what are _specs, how are they different
from _target_columns? What happens when a variable doesn't have a name?
I hope that this change will make it easier to understand for future readers.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add some tests that test whether `pk indexes` are generated correctly.
When a driver asks to prepare a statement, Scylla's response includes
the metadata for this prepared statement.
In this metadata there's `pk indexes`, which tells the driver which
bind variable values it should use to calculate the partition token.
For a query like:
SELECT * FROM t WHERE p2 = ? AND p1 = ? AND p3 = ?
The correct pk_indexes would be [1, 0, 2], which means
"To calculate the token calculate Hash(bind_vars[1] | bind_vars[0] | bind_vars[2])".
More information is available in the specification:
1959502d8b/doc/native_protocol_v4.spec (L699-L707)
Two tests are marked as xfail because of #15374 - Scylla doesn't correctly handle using the same
named variable in multiple places. This will be fixed soon.
I couldn't find a good place for these tests, so I created a new file - test_prepare.py.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently the datadir is ignored.
Use it to construct the table's base path.
Fixes scylladb/scylladb#15418
Closes scylladb/scylladb#15480
* github.com:scylladb/scylladb:
distributed_loader: populate_keyspace: access cf by ref
distributed_loader: table_populator: use datadir for base_path
distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed
distributed_loader: populate_keyspace: fixup indentation
distributed_loader: populate_keyspace: iterate over datadirs in the inner loop
test: sstable_directory_test: add test_multiple_data_dirs
table: init_storage: create upload and staging subdirs on all datadirs
Issue #10357 is about a SELECT with a filter on a regular column which
incorrectly returns a static row without regular columns set (so the
filter would not have matched). We already have four tests reproducing
this issue, but each of them is a small part of a large tests translated
from Cassandra, making it hard to understand the scope of this bug.
So in this patch we add two new tests, one passing and one xfailing,
which clarify the scope of this bug. It turns out that the bug only
occurs when a partition has no clustering rows and only has a static
row. If the partition does have clustering rows - even if those don't
match the filter - the bug doesn't happen. The xfailing test is just
two statements long - a single INSERT and a single SELECT.
Refs #10357.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#15120
Scylla can crash due to a complicated interaction between service level drop,
evictable readers, and the inactive read registration path.
1) A service level drop invokes a stop of the reader concurrency semaphore,
which will wait for in-flight requests.
2) It turns out the semaphore first closes the gate used for closing readers
that become inactive.
3) It then proceeds to wait for in-flight reads by closing the reader permit gate.
4) One of the evictable reads takes the inactive read registration path, and
finds the gate for closing readers already closed.
5) The flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing the permit gate first, evictable readers becoming inactive will
be able to properly close the underlying reader, thereby avoiding the
crash.
Fixes #15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15535
in this series, we rename s3 credential-related variable and option names so they are more consistent with AWS's official documentation. this should help with maintainability.
Closes scylladb/scylladb#15529
* github.com:scylladb/scylladb:
main.cc: rename aws option
utils/s3/creds: rename aws_config member variables
So far generic describe (`DESC <name>`) followed the Cassandra implementation and only described a keyspace/table/view/index.
This commit adds UDT/UDF/UDA to generic describe.
Fixes: #14170
Closes scylladb/scylladb#14334
* github.com:scylladb/scylladb:
docs:cql: add information about generic describe
cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc
cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe
- s/aws_key/aws_access_key_id/
- s/aws_secret/aws_secret_access_key/
- s/aws_token/aws_session_token/
rename them to more popular names, these names are also used by
boto's API. this should improve the readability and consistency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is no need to hold on to the table's
shared ptr since it's held by the global table ptr
we got in the outer loop.
Simplify the code by just getting the local table reference
from `gtable`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the datadir is ignored.
Use it to construct the table's base path.
Fixes scylladb/scylladb#15418
Note that scylla still doesn't work correctly
with multiple data directories due to #15510.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, mark_ready_for_writes is called too early --
after the first data dir is processed -- so the next
datadir will hit an assert in `table::mark_ready_for_writes`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to iterate over multiple data directories
in the inner loop rather than the outer loop.
Following patch will make use of the datadir in
table_populator.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a basic regression test that starts the cql test env
with multiple data directories.
It fails without the previous patch:
table: init_storage: create upload and staging subdirs on all datadirs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Also, while at it, add copyright/license blurbs for tests that were missing it.
Closes scylladb/scylladb#15495
* github.com:scylladb/scylladb:
test/topology_custom: add copyright/license blurb to tests
test/topology_custom: test_select_from_mutation_fragments.py: use async query api
The current read-loop fails to detect end-of-page: if the query
result builder cuts the page, the loop will just proceed to the next
partition. This results in distorted query results, as the result
builder will request that consumption stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
A unit test reproducing the bug was also added.
Currently if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application just freezing.
Fix this by aborting the parsing thread explicitly in the
json_mutation_stream_parser destructor. If the parser thread exited
successfully, this is a no-op, but on the error path it ensures
that the parser thread doesn't hang.
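The abort-in-destructor pattern described above can be sketched with plain std::thread (the real code is seastar-based; all names here are illustrative): a producer thread blocks when its bounded output queue is full, and the owner's destructor sets an abort flag and wakes it so it can exit instead of hanging once the consumer goes away.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class parser {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::queue<int> _out;
    bool _abort = false;
    static constexpr std::size_t max_queued = 4;
    std::thread _thread;  // declared last so other members init first

    void run() {
        for (int i = 0; i < 1000; ++i) {
            std::unique_lock lk(_mtx);
            // block while the queue is full, unless asked to abort
            _cv.wait(lk, [&] { return _abort || _out.size() < max_queued; });
            if (_abort) {
                return;  // consumer is gone: bail out instead of hanging
            }
            _out.push(i);
            _cv.notify_all();
        }
    }
public:
    parser() : _thread([this] { run(); }) {}
    ~parser() {
        {
            std::lock_guard lk(_mtx);
            _abort = true;  // no-op if the thread already finished
        }
        _cv.notify_all();
        _thread.join();
    }
    std::optional<int> next() {
        std::unique_lock lk(_mtx);
        _cv.wait(lk, [&] { return !_out.empty(); });
        int v = _out.front();
        _out.pop();
        _cv.notify_all();
        return v;
    }
};
```

Without the abort flag, destroying a parser whose consumer stopped early would deadlock in join(), which is the freeze the commit describes.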
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
The partition key size was ignored by the accounter, as was the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory, causing the page
size limiter not to kick in.
Fix by accounting the real size of accumulated frozen_mutation.
Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.
Refs #7933
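The accounting fix above can be sketched as follows. This is a hypothetical, simplified accounter (not the actual memory accounter in the codebase): once the key and tombstone sizes are charged too, a long run of tombstone-only partitions still fills the page and triggers a page cut.

```cpp
#include <cassert>
#include <cstddef>

class memory_accounter {
    std::size_t _used = 0;
    std::size_t _limit;
public:
    explicit memory_accounter(std::size_t limit) : _limit(limit) {}
    void account_partition(std::size_t key_size,
                           std::size_t tombstone_size,
                           std::size_t live_rows_size) {
        // the fix: charge the key and the partition tombstone as well,
        // instead of only the live rows
        _used += key_size + tombstone_size + live_rows_size;
    }
    bool should_cut_page() const { return _used >= _limit; }
};
```

With the old behavior (only live rows counted), the same sequence of tombstone-only partitions would never reach the limit and the page would grow unboundedly.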
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
is a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:
* Large allocations. Serialization of reconcilable_result causes large
allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
side and on the coordinator side causes reactor stalls. This impacts
not only the query at hand. For 1M dead rows, freezing takes 130ms,
unfreezing takes 500ms. Coordinator does multiple freezes and
unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
may fail due to too large mutation size. 1M dead rows is already too
much: Refs #9111.
This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.
My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):
Node1: 1 live row, 1M dead rows
Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of
the query.
Before:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
After:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
Non-reconciling queries have almost identical duration (variations of a few ms
can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).
Refs #7929
Refs #3672
Refs #7933
Fixes #9111
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
Fixes: #14710.
Closes scylladb/scylladb#15091
* github.com:scylladb/scylladb:
cql3: statements: delete execute override
cql3: statements: call check_restricted_table_properties in prepare
cql3: statements: pass data_dictionary::database to check_restricted_table_properties
more structured this way. and the data dependency is more clear
with this change. this also allows us to quickly identify the parts
which should/can be reused when migrating to the CMake based building
system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to CMake based
building system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix fromJson(null) to return null, not a error as it did before this patch.
We use "null" as the default value when unwrapping optionals
to avoid bad optional access errors.
Fixes: scylladb#7912
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closes scylladb/scylladb#15481
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory) and file descriptor count,
which takes memory away from the OS.
It turns out that this approach also greatly simplifies the
off-strategy impl in the compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, or unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Closes scylladb/scylladb#15400
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
Make sure that all writes started by the old coordinator are completed or
will eventually fail before starting a new coordinator.
Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5MB. When the necessary amount of bytes is accumulated, a part-uploading fiber starts in the background. On flush the sink waits for all the fibers to complete and handles any failures.
Uploading parallelism is nowadays limited by means of the http client's max-connections parameter. However, while a part-uploading fiber waits for its connection it keeps the 5MB+ buffer in the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.
This PR adds a shard-wide limiter on the number of background buffers S3 clients (and their http clients) may use.
Closes scylladb/scylladb#15497
* github.com:scylladb/scylladb:
s3::client: Track memory in client uploads
code: Configure s3 clients' memory usage
s3::client: Construct client with shared semaphore
sstables::storage_manager: Introduce config
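The shard-wide limiter described above can be sketched as a semaphore over memory units (this is an illustrative stand-alone class, not seastar's actual semaphore API): each part upload must acquire units for its buffer before starting, so the total memory held by waiting parts is bounded too.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class memory_limiter {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::size_t _available;
public:
    explicit memory_limiter(std::size_t total_bytes) : _available(total_bytes) {}
    // blocks until enough units are free (what an uploading fiber would await)
    void acquire(std::size_t bytes) {
        std::unique_lock lk(_mtx);
        _cv.wait(lk, [&] { return _available >= bytes; });
        _available -= bytes;
    }
    // non-blocking variant, convenient for demonstration
    bool try_acquire(std::size_t bytes) {
        std::lock_guard lk(_mtx);
        if (_available < bytes) {
            return false;
        }
        _available -= bytes;
        return true;
    }
    void release(std::size_t bytes) {
        {
            std::lock_guard lk(_mtx);
            _available += bytes;
        }
        _cv.notify_all();
    }
};
```

Sizing the limiter per shard caps the number of 5MB+ buffers held simultaneously, whether their parts are uploading or merely waiting for an http connection.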
This new exception type inherits from std::bad_alloc and allows logalloc
code to add additional information about why the allocation failed. We
currently have 3 different throw sites for std::bad_alloc in logalloc.cc
and when investigating a coredump produced by --abort-on-lsa-bad-alloc,
it is impossible to determine which throw-site activated last,
triggering the abort.
This patch fixes that by disambiguating the throw-sites and including
the site in the error message printed right before the abort.
Refs: #15373
Closes scylladb/scylladb#15503
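The idea above can be sketched as follows. The class name and message format are illustrative, not logalloc's actual code: an exception type derived from std::bad_alloc that carries the throw-site's reason, so the abort path can print a disambiguating message while existing bad_alloc handlers keep working.

```cpp
#include <cassert>
#include <new>
#include <string>

// A bad_alloc that records why (and therefore where) the allocation failed.
class lsa_bad_alloc : public std::bad_alloc {
    std::string _what;
public:
    explicit lsa_bad_alloc(const std::string& reason)
        : _what("bad_alloc (LSA): " + reason) {}
    const char* what() const noexcept override { return _what.c_str(); }
};
```

Each of the distinct throw-sites would pass its own reason string, so a coredump's last printed message identifies which site fired.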
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It wasn't a problem until now,
when we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory) and file descriptor count,
which takes memory away from the OS.
It turns out that this approach also greatly simplifies the
off-strategy impl in the compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, or unlink intermediary sstables
used between reshape rounds.
Fixes #14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting from regular_compaction, which implements
incremental compaction. But reshape still implements its own
methods for creating the writer and reader. One reason is that
reshape is not driven by the controller, as its input sstables
live in the maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation to next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The v.u.g. start/stop is now spread heavily over the main() code.
1. sharded<v.u.g.>.start() happens early enough to allow depending services to register staging sstables on it
2. after the system is "more-or-less" alive, invoke_on_all(v.u.g.::start()) is called (conditionally) to activate the generator's background fiber. Not 100% sure why it happens _that_ late, but somehow it's required that generation doesn't happen while scylla is joining the cluster
3. early on stop the v.u.g. is fully stopped
The 3rd step is pretty nasty. It may happen that the v.u.g. is not stopped if scylla's start aborts before the last action is defer-scheduled. Also, when it happens, it leaves stopping dependencies with non-initialized v.u.g. local instances, which is not symmetrical to how they start.
That said, this PR fixes the stopping sequence to happen later, i.e. being defer-scheduled right after sharded<v.u.g.> is started. It also makes sure that terminating the background fiber happens as early as it does now. This is done compaction_manager-style -- the v.u.g. subscribes to the stop-signal abort source and kicks the fiber to stop when it fires.
Closes scylladb/scylladb#15466
* github.com:scylladb/scylladb:
view_update_generator: Stop for real later
view_update_generator: Add logging to do_abort()
view_update_generator: Move abort kicking to do_abort()
view_update_generator: Add early abort subscription
Currently repair shutdown only happens on stop, but it looks like
nodetool drain can call shutdown too, to abort no-longer-relevant repair
tasks if any. This also makes main()'s deferred shutdown/stop paths
a little bit cleaner
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15438
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.
Pass data_dictionary::database to check_restricted_table_properties
as an argument instead of query_processor, as the method will be called
from a context which does not have access to the query processor.
Now the v.u.g.::stop() code waits for the generator background fiber to
stop for real. This can happen much later; all the necessary precautions
not to produce more work for the generator have been taken in do_abort().
This keeps the v.u.g. start-stop in one place, except for the call to
invoke_on_all(v.u.g.::start()) which will be handled separately.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the v.u.g. stops, it first aborts the generation background fiber by
requesting abort on the internal abort source and signalling the fiber
in case it's waiting. Right now v.u.g.::stop() is defer-scheduled last
in main(), so this move doesn't change much -- when stop_signal fires,
it will kick v.u.g.::do_abort() just a bit earlier; nothing that depends
on it happens between that point and the real ::stop() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
when accessing AWS resources, users are allowed to use long-term security
credentials; they can also use temporary credentials. but if the latter
are used, we have to pass a session token along with the keys.
see also https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
so, if we want to get authenticated programmatically, we need to
set the "x-amz-security-token" header,
see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#UsingTemporarySecurityCredentials
so, in this change, we
1. add another member named `token` in `s3::endpoint_config::aws_config`
for storing "AWS_SESSION_TOKEN".
2. populate the setting from "object_storage.yaml" and
"$AWS_SESSION_TOKEN" environment variable.
3. set "x-amz-security-token" header if
`s3::endpoint_config::aws_config::token` is not empty.
this should allow us to test the s3 client and the s3 object store backend
with an S3 bucket, using temporary credentials.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
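The header logic described in points 1-3 above can be sketched as follows. The struct and function names are hypothetical, not the actual s3 client code: "x-amz-security-token" is set only when a session token is configured, as temporary credentials require.

```cpp
#include <cassert>
#include <map>
#include <string>

// Mirrors the renamed aws_config members from this series.
struct aws_config {
    std::string aws_access_key_id;
    std::string aws_secret_access_key;
    std::string aws_session_token;  // empty for long-term credentials
};

std::map<std::string, std::string> auth_headers(const aws_config& cfg) {
    std::map<std::string, std::string> headers;
    // (real requests also carry an Authorization header derived from the
    // access key and secret; elided in this sketch)
    if (!cfg.aws_session_token.empty()) {
        headers["x-amz-security-token"] = cfg.aws_session_token;
    }
    return headers;
}
```

With long-term credentials the token is empty and the header is simply omitted, so the same code path serves both credential kinds.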
Closes scylladb/scylladb#15486
in this series, we use the default values of options specifying the paths to tools for better readability. and also to ease the migration to CMake
Refs #15379
Closes scylladb/scylladb#15500
* github.com:scylladb/scylladb:
build: do not check for args.ragel_exec
build: set default value of --with-antlr3 option
When a piece is uploaded it's first flushed, then the upload-copy is issued.
Both happen in the background, and if the piece flush call resolves with an
exception, the exception remains unhandled. That's OK, since the upload
finalization code checks whether some pieces didn't complete (for whatever
reason) and fails the whole upload; however, the ignored exception is
reported in the logs. Not nice.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15491
more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to a CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15501
* github.com:scylladb/scylladb:
build: extract check_for_lz4() out
build: extract check_for_boost() out
build: extract check_for_minimal_compiler_version() out
build: extract write_build_file() out
Sstables in transitional states are marked with the respective 'status' in the registry. Currently there are two such states -- 'creating' and 'removing' -- plus the 'sealed' status for sstables in use.
On boot the distributed loader tries to garbage collect the dangling sstables. For filesystem storage it's done with the help of temporary sstables' dirs and pending deletion logs. For s3-backed sstables, the garbage collection means fetching all non-sealed entries and removing the corresponding objects from the storage.
Test included (last patch)
Fixes #13024
Closes scylladb/scylladb#15318
* github.com:scylladb/scylladb:
test: Extend object_store test to validate GC works
sstable_directory: Garbage collect S3 sstables on reboot
sstable_directory: Pass storage to garbage_collect()
sstable_directory: Create storage instance too
So it would be waited on in shutdown().
Although gossiper::run holds the `_callback_running` semaphore
which is acquired in `do_stop_gossiping`, the gossip messages
it initiates in the background are never waited on.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15493
as `rapidxml::xml_document` is quite large, let's
allocate it on the heap. otherwise GCC 13.2.1 warns us like:
```
utils/s3/client.cc: In function ‘seastar::sstring s3::parse_multipart_copy_upload_etag(seastar::sstring&)’:
utils/s3/client.cc:455:9: warning: stack usage is 66208 bytes [-Wstack-usage=]
455 | sstring parse_multipart_copy_upload_etag(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
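The fix can be sketched with a stand-in type (`big_document` is illustrative and plays the role of rapidxml::xml_document, which embeds a sizable memory pool by value): allocating it with make_unique moves those bytes off the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for rapidxml::xml_document with its large inline memory pool.
struct big_document {
    char pool[64 * 1024]; // ~64 KiB held by value, like rapidxml's pool
};

std::size_t parse_on_heap() {
    auto doc = std::make_unique<big_document>(); // heap, not stack
    return sizeof(*doc);
}
```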
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15472
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to a
CMake-based build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to a
CMake-based build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to a
CMake-based build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. also, this will allow us to switch over
to the CMake build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
args.ragel_exec defaults to "ragel" already, so unless the user
specifies an empty ragel via `--with-ragel=""`, we won't have an
`args.ragel_exec` which evaluates to `False`, so drop this check.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we don't need to check if this option is specified.
this option will also be used even after switching to CMake.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When uploading an object part, the client spawns a background fiber that
keeps the data buffers alive in the http request's write_body() lambda
capture. This generates unbounded memory usage for uploaded buffers,
which is not nice. Even though the s3 client is limited by the http
client's max-connections parallelism, waiting for an available connection
still happens with the buffers held in memory.
This patch makes the client claim the background memory from the
provided semaphore (which, in turn, sits on the shard-wide storage
manager instance). Once body writing is complete, the claimed units are
returned to the semaphore, allowing more background writes.
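A minimal sketch of the idea (this is not Seastar's semaphore API; the names are illustrative): claim units sized to the buffers before a background write starts, and return them once the body has been written, capping total in-flight memory.

```cpp
#include <cassert>
#include <cstddef>

// Toy counting semaphore tracking available "memory units".
class memory_semaphore {
    std::size_t _available;
public:
    explicit memory_semaphore(std::size_t limit) : _available(limit) {}
    // Claim units before starting a background write.
    bool try_claim(std::size_t units) {
        if (units > _available) {
            return false; // caller must wait for units to be returned
        }
        _available -= units;
        return true;
    }
    // Return units once the body has been written.
    void release(std::size_t units) { _available += units; }
    std::size_t available() const { return _available; }
};
```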
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This sets the real limits on the memory semaphore.
- scylla sets it to 1% of total memory, 10 MB min, 100 MB max
- tests set it to 16Mb
- perf test sets it to all available memory
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore will be used to cap memory consumption by the client. This
patch only makes sure a reference to the semaphore is passed as an
argument to the client's constructor, nothing more.
In the scylla binary, the semaphore sits on storage_manager. In tests the
semaphore is some local object. For now the semaphore is unused and is
initialized locked, as this patch just pushes the needed argument all the
way through; the next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just an empty config that's fed to storage_manager when constructed, as a
preparation for further, heavier patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
cql.execute_async() can now execute paged queries, use it instead of a
blocking API.
While at it, clean-up the test:
* remove unneeded wait on ring0 settle
* address flake8 concerns:
- unused imports
- unused variables
- style
Currently, the moved-from object's manager pointer is moved into the
constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.
Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.
The move-assignment operator has a similar problem,
therefore it is deleted in this series, and the function is
renamed to `transfer_backlog`, which just doesn't deal with the
moved-from registration. This is safe since it's only used internally
by the compaction manager.
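The approach can be sketched as follows (illustrative names, not the actual compaction_backlog_tracker definition): delete the move operations of a type whose registration stores a back-pointer to it, so a registered instance can never be relocated out from under its registry.

```cpp
#include <type_traits>

// A tracker whose registry holds a pointer to it; moving it would leave
// the registration dangling, so moves are simply forbidden.
struct backlog_tracker {
    backlog_tracker() = default;
    backlog_tracker(backlog_tracker&&) = delete;
    backlog_tracker& operator=(backlog_tracker&&) = delete;
};

static_assert(!std::is_move_constructible_v<backlog_tracker>);
static_assert(!std::is_move_assignable_v<backlog_tracker>);
```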
Fixes #15248
Closes scylladb/scylladb#15445
* github.com:scylladb/scylladb:
compaction_state: store backlog_track in std::optional
compaction_backlog_tracker: do not allow moving registered trackers
as yaml-cpp returns an invalid node when the node to be indexed
does not exist at all. but it allows us to provide a fallback value
which is returned when the node is not valid. so, let's just use
this helper for accessing a node which does not necessarily exist.
simpler this way
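The idiom being adopted can be modeled with a plain map (yaml-cpp itself is not pulled in here; `get_or` is an illustrative stand-in for indexing a node and asking for a value with a fallback): a missing key yields the fallback instead of an error.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for node[key].as<T>(fallback): return the stored value if the
// key exists, the fallback otherwise.
std::string get_or(const std::map<std::string, std::string>& cfg,
                   const std::string& key,
                   const std::string& fallback) {
    auto it = cfg.find(key);
    return it == cfg.end() ? fallback : it->second;
}
```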
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15488
This PR adds the information that counters do not support data expiration with TTL, plus the link to the TTL page.
Fixes https://github.com/scylladb/scylladb/issues/15479
Closes scylladb/scylladb#15489
* github.com:scylladb/scylladb:
doc: improve TTL limitation info on Counters page
doc: add a note that counters do not support TTL
Currently, the endpoint address is set as the new
endpoint_state RPC_ADDRESS. This is wrong since
it should be assigned the `broadcast_rpc_address`
rather than the `broadcast_address`.
This was introduced in b82c77ed9c
Instead just create an empty endpoint_state.
The RPC_ADDRESS (as well as HOST_ID) application states
are set later.
Fixes scylladb/scylladb#15458
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15475
* seastar 576ee47d...bab1625c (13):
> build: s/{dpdk_libs}/${dpdk_libs}/
> build: build with dpdk v23.07
> scripts: Fix escaping of regexes in addr2line
> linux-aio: print more specific error when setup_aio fails
> linux-aio: correct the error message raised when io_setup() fails
> build: reenable -Warray-bound compiling option
> build: error out if find_program() fails
> build: enable systemtap only if it is available
> build: check if libucontext is necessary for using ucontext functions
> smp: reference correct variable when fetch_or()
> build: use target_compile_definitions() for adding -D...
> http/client: pass tls_options to tls::connect()
> Merge 'build, process: avoid using stdout or stderr as C++ identifiers' from Kefu Chai
Frozen toolchain regenerated for new Seastar dependencies.
configure.py adjusted for new Seastar arch names.
Closes scylladb/scylladb#15476
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).
If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).
When performing a schema change in Raft RECOVERY mode we also extend schema mutations which forces nodes to revert to the old way of calculating schema versions when necessary.
We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.
Fixes: #7620
Fixes: #13957
Closes scylladb/scylladb#15331
* github.com:scylladb/scylladb:
test: add test for group 0 schema versioning
test/pylib: log_browsing: fix type hint
feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
schema_tables: use schema version from group 0 if present
migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
migration_manager: migration_request handler: assume `canonical_mutation` support
system_keyspace: make `get/set_scylla_local_param` public
feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
schema_tables: refactor `scylla_tables(schema_features)`
migration_manager: add `std::move` to avoid a copy
schema_tables: remove default value for `reload` in `merge_schema`
schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
system_keyspace: fix outdated comment
Compaction task executors serve two different purposes: as a compaction
manager entity they execute the compaction operation, and as a task
manager entity they track compaction status.
When one role depends on the other, as it currently is for
compaction_task_impl::done() and compaction_task_executor::compaction_done(),
requirements of both roles need to be satisfied at the same time in each
corner case. Such complexity leads to bugs.
To prevent it, compaction_task_impl::done() of executors no longer depends
on compaction_task_executor::compaction_done().
Fixes: #14912
Closes scylladb/scylladb#15140
* github.com:scylladb/scylladb:
compaction: warn about compaction_done()
compaction: do not run stopped compaction
compaction: modify lowest compaction tasks' run method
compaction: pass do_throw_if_stopping to compaction_task_executor
This series marks multiple high-cardinality counters with the skip_when_empty flag.
After this patch the following counters will not be reported if they were never used:
```
scylla_transport_cql_errors_total
scylla_storage_proxy_coordinator_reads_local_node
scylla_storage_proxy_coordinator_completed_reads_local_node
scylla_transport_cql_errors_total
```
Also marked are the CAS-related CQL operations.
Fixes #12751
Closes scylladb/scylladb#13558
* github.com:scylladb/scylladb:
service/storage_proxy.cc: mark counters with skip_when_empty
cql3/query_processor.cc: mark cas related metrics with skip_when_empty
transport/server.cc: mark metric counter with skip_when_empty
So that replacing it will destroy the previous tracker
and unregister it before assigning the new one and
then registering it.
This is safer than assigning it in place.
With that, the move assignment operator is no longer
used and can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the moved-from object's manager pointer is moved into the
constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.
Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.
With that in mind, when move-assigning a compaction_backlog_tracker
the existing tracker can remain registered.
Fixes scylladb/scylladb#15248
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
these options for disabling warnings are not necessary anymore, for
one of the following reasons:
* the code which caused the warning was either fixed or removed
* the toolchain was updated, so the false alarms do not exist
with the latest frozen toolchain.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15450
do not use a multi-line comment. this silences the warning from GCC:
```
In file included from ./cql3/prepare_context.hh:19,
from ./cql3/statements/raw/parsed_statement.hh:14,
from build/debug/gen/cql3/CqlParser.hpp:62,
from build/debug/gen/cql3/CqlParser.cpp:44:
./cql3/expr/expression.hh:490:1: error: multi-line comment [-Werror=comment]
490 | /// Custom formatter for an expression. Supports multiple modes:\
| ^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15471
this target mirrors the target named `{mode}e-test` in the
`build.ninja` build script created by `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15448
because we now have a frozen toolchain built with Fedora 38, and f38
provides cmake v3.27.4, we can assume the availability of cmake v3.27.4
when building scylla with the toolchain.
in this change, the minimum required CMake version is changed to
3.27.
this also allows us to simplify the implementation of
`add_whole_archive()`, and remove the buggy branch for supporting
CMake < 3.24, as we should have used `${name}` in place of `auth` there.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15446
This PR modularizes `manager.{hh, cc}` by dividing the files into separate smaller units. The changes improve overall readability of code and help reason about it. Each file has a specific purpose now.
This is the first step in refactoring the Hinted Handoff module.
Refs scylladb/scylla#15358
Closes scylladb/scylladb#15378
* github.com:scylladb/scylladb:
db/hints: Remove unused aliases from manager.hh
db/hints: Rename end_point_hints_manager
db/hints: Rename sender to hint_sender
db/hints: Move the rebalancing logic to hint_storage
db/hints: Move the implementation of sender
db/hints: Move the declaration of sender to hint_sender.hh
db/hints: Move sender::replay_allowed() to the source file
db/hints: Put end_point_hints_manager in internal namespace
db/hints: Move the implementation of end_point_hints_manager
db/hints: Move the declaration of end_point_hints_manager
db/hints: Move definitions of functions using shard hint manager
db/hints: Introduce hint_storage.hh
db/hints: Extract the logger from manager.cc
db/hints: Extract common types from manager.hh
That's needed for enabling incremental compaction to operate, and
for subsequent work that enables incremental compaction
for off-strategy, which in turn uses the reshape compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, when creating a table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
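The check can be sketched as follows (illustrative types, not the actual result_message hierarchy): permissions are granted only when the result message reports an actual schema change, filtering out both the IF NOT EXISTS no-op and the bounce_to_shard response.

```cpp
#include <cassert>

// Stand-in for the kinds of result_message a CREATE TABLE can produce.
enum class result_kind { schema_change, void_result, bounce_to_shard };

// Grant permissions to the creator only if the table was actually created.
bool should_grant_to_creator(result_kind r) {
    return r == result_kind::schema_change;
}
```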
CVE-2023-33972
Fixes #15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
this target mirrors the "dist-server-debuginfo-{mode}" target in the `build.ninja` created by `configure.py`.
Closes scylladb/scylladb#15441
* github.com:scylladb/scylladb:
build: cmake: add "dist-server-debuginfo" target
build: cmake: remove debian dep from relocatable pkg
Populating of non-system keyspaces is now done by listing datadirs and assuming that each subdir found is a keyspace. For S3-backed keyspaces this is also true, but it's a bug (#13020). The loop needs to walk the list of known keyspaces instead, and try to find the keyspace storage later, based on the storage option.
Closes scylladb/scylladb#15436
* github.com:scylladb/scylladb:
distributed_loader: Indentation fix after previous patch
distributed_loader: Generalize datadir parallelizm loop
distributed_loader: Provide keyspace ref to populate_keyspace
distributed_loader: Walk list of keyspaces instead of directories
always initialize returned values. the branches which
return these uninitialized values handle the
unmatched cases, so this change should not have any
impact on the behavior.
ANTLR3's C++ code generator does not assign any value
to the return value if it runs into a failure or encounters
an exception. for instance, the following rule assigns the
value of `isStatic` to `isStaticColumn` only if
nothing goes wrong.
```
cfisStatic returns [bool isStaticColumn]
@init{
bool isStatic = false;
}
: (K_STATIC { isStatic=true; })?
{
$isStaticColumn = isStatic;
}
;
```
as shown in the generated C++ code:
```c++
switch (alt118)
{
case 1:
// build/debug/gen/cql3/Cql.g:989:8: K_STATIC
{
this->matchToken(K_STATIC, &FOLLOW_K_STATIC_in_cfisStatic5870);
if (this->hasException())
{
goto rulecfisStaticEx;
}
if (this->hasFailed())
{
return isStaticColumn;
}
if ( this->get_backtracking()==0 )
{
isStaticColumn=isStatic;
}
}
break;
}
```
when `this->hasException()` or `this->hasFailed()` is true,
`isStaticColumn` is returned right away without being
initialized, because we don't assign any initial value
to it, nor do we customize the exception handling
for this rule.
and, the parser bails out when it smells something bad
after it tries to match the specified rule. also, the
parser is a stateful tokenizer; its failure state is
carried by the parser itself. also, matchToken()
*could* fail when trying to find the matching token --
this is the runtime behavior of the parser, and that's why the
compiler cannot be certain that the error path won't
be taken.
anyway, let's always initialize the return values
explicitly. the return values whose type
is a scoped enum are zero-initialized,
because their types don't provide an
"invalid" value.
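A minimal sketch of the initialization pattern (illustrative names; `oper_t` and `parse_oper` stand in for the generated parser code): the return value is value-initialized up front, so the early-return error paths hand back a well-defined value, and a scoped enum with no "invalid" enumerator is zero-initialized with empty braces.

```cpp
#include <cassert>

// Stand-in for a scoped enum returned by a parser rule.
enum class oper_t { add = 1, sub = 2 };

oper_t parse_oper(bool parse_failed) {
    oper_t op{}; // zero-initialized instead of left indeterminate
    if (parse_failed) {
        return op; // error path now returns a defined value
    }
    op = oper_t::add;
    return op;
}
```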
this change should silence warnings like:
```
clang++ -MD -MT build/debug/gen/cql3/CqlParser.o -MF build/debug/gen/cql3/CqlParser.o.d -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen --std=gnu++20 -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -DBOOST_TEST_DYN_LINK -DNOMINMAX -DNOMINMAX -fvisibility=hidden -Wall -Werror -Wextra -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-c++11-narrowing -Wno-ignored-qualifiers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-implicit-int-float-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DFMT_DEPRECATED_OSTREAM -Wno-parentheses-equality -O1 -fno-sanitize-address-use-after-scope -c -o build/debug/gen/cql3/CqlParser.o build/debug/gen/cql3/CqlParser.cpp
build/debug/gen/cql3/CqlParser.cpp:26645:28: error: variable 'perm' is uninitialized when used here [-Werror,-Wuninitialized]
return perm;
^~~~
build/debug/gen/cql3/CqlParser.cpp:26616:5: note: variable 'perm' is declared here
auth::permission perm;
^
build/debug/gen/cql3/CqlParser.cpp:52577:28: error: variable 'op' is uninitialized when used here [-Werror,-Wuninitialized]
return op;
^~
build/debug/gen/cql3/CqlParser.cpp:52518:5: note: variable 'op' is declared here
oper_t op;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15451
instead of checking the availability of a required program, let's
use the `REQUIRED` argument introduced by CMake 3.18, simpler this
way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15447
Handle abort_requested_exception exactly like
sleep_aborted, as an expected error when startup
is aborted mid-way.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15443
Some time ago populating of tables from sstables was reworked to use sstable states instead of full paths (#12707). Since then, a few places in the populator were left that still operate on the state-based subdirectory name. This PR collects most of those dangling ends.
refs: #13020
Closes scylladb/scylladb#15421
* github.com:scylladb/scylladb:
distributed_loader: Print sstable state explicitly
distributed_loader: Move check for the missing dir upper
distributed_loader: Use state as _sstable_directories key
when compiling the tests with -Wsign-compare, the compiler complains like:
```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/cmake/rust -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -MF test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o.d -o test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -c /home/kefu/dev/scylladb/test/boost/tablets_test.cc
/home/kefu/dev/scylladb/test/boost/tablets_test.cc:1335:53: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
for (int log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
```
in this case, it should be safe to use a signed int as the loop
variable to be compared with `tablet_count_bits`, but let's just
appease the compiler so we can enable the warning option project-wide
to prevent any potential issues caused by signed-unsigned comparison.
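The fix can be sketched as follows (the loop body is illustrative, not the test's actual logic): give the loop variable the same unsigned type as the bound so -Wsign-compare has nothing to complain about.

```cpp
#include <cassert>
#include <cstddef>

// Loop variable matches the unsigned type of the bound, avoiding the
// signed/unsigned comparison warning.
std::size_t count_tablet_doublings(std::size_t tablet_count_bits) {
    std::size_t n = 0;
    for (std::size_t log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
        ++n;
    }
    return n;
}
```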
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15449
We add garbage collection for the `CDC_GENERATIONS_V3` table to prevent
it from endlessly growing. This mechanism is especially needed because
we send the entire contents of `CDC_GENERATIONS_V3` as a part of the
group 0 snapshot.
The solution is to keep a clean-up candidate, which is one of the
already published CDC generations. The CDC generation publisher
introduced in #15281 continually uses this candidate to remove all
generations with timestamps not exceeding the candidate's and sets a new
candidate when needed.
We also add `test_cdc_generation_clearing.py` that verifies this new
mechanism.
Fixes #15323
Closes scylladb/scylladb#15413
* github.com:scylladb/scylladb:
test: add test_cdc_generation_clearing
raft topology: remove obsolete CDC generations
raft topology: set CDC generation clean-up candidate
topology_coordinator: refactor publish_oldest_cdc_generation
system_keyspace: introduce decode_cdc_generation_id
system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
in other words, do not create a bpo::value unless transferring it to an
option_description.
`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer and manages its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.
before this change, we initialized `operations_with_func` as a global
variable using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from it --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked; hence LeakSanitizer warns us.
after this change, we create the option map as a static
local variable in a function so it is created on demand as well.
as an alternative, we could initialize the options map as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.
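The pattern can be sketched as follows (entries are illustrative, and a simple string map stands in for the boost::program_options machinery): the per-subcommand option map lives in a function-local static, so it is constructed only when the subcommand machinery actually asks for it, and stays owned for the program's lifetime instead of leaking.

```cpp
#include <cassert>
#include <map>
#include <string>

// Built lazily on first call; subsequent calls return the same instance.
const std::map<std::string, std::string>& operations_with_func() {
    static const std::map<std::string, std::string> ops = {
        {"compact", "run a major compaction"}, // illustrative entries
        {"status", "print ring status"},
    };
    return ops;
}
```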
this change is quite similar to 374bed8c3d
Fixes https://github.com/scylladb/scylladb/issues/15429
Closes scylladb/scylladb#15430
* github.com:scylladb/scylladb:
tools/scylla-nodetools: reindent
tools/scylla-nodetools: do not create unowned bpo::value
column_stats::update_local_deletion_time() is not used anywhere,
what is being used is
`column_stats::update_local_deletion_time_and_tombstone_histogram(time_point)`.
while `update_local_deletion_time_and_tombstone_histogram(int32_t)`
is only used internally by a single caller.
neither is `column_stats::update(const deletion_time&)` used.
so let's drop them, and merge
`update_local_deletion_time_and_tombstone_histogram(int32_t)`
into its caller.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15189
in this series, we do not assume the existence of the "build" build directory, and prefer using the version files located under the directory specified with the `--build-dir` option.
Refs #15241
Closes scylladb/scylladb#15402
* github.com:scylladb/scylladb:
create-relocatable-package.py: prefer $build_dir/SCYLLA-RELEASE-FILE
create-relocatable-package.py: create SCYLLA-RELOCATABLE-FILE with tempfile
in other words, do not create a bpo::value unless transferring it to an
option_description.
`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer and manages its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.
before this change, we initialized `operations_with_func` as a global
variable using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from it --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked; hence LeakSanitizer warns us.
after this change, we create the option map as a static
local variable in a function so it is created on demand as well.
as an alternative, we could initialize the options map as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.
this change is quite similar to 374bed8c3d
Fixes #15429
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
```console
$ journalctl --user start scylla-server -xe
Failed to add match 'start': Invalid argument
```
`journalctl` expects a match filter as its positional arguments.
but apparently, `start` is not a filter. we could use `--unit`
to specify a unit though, like:
```console
$ journalctl --user --unit scylla-server.service -xe
```
but it would flood stdout with the logging messages printed
by scylla. this is not what a typical user expects. probably a better
user experience can be achieved using
```console
$ systemctl --user status scylla-server
```
which also prints the current status reported by the service, and
the command line arguments. these would be more informative in typical
use cases.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15390
The docker/podman tooling is destructive: it will happily
overwrite images locally and on the server. If a maintainer
forgets to update tools/toolchain/image, this can result
in losing an older toolchain container image.
To prevent that, check that the image name is new.
Closes scylladb/scylladb#15397
update commented out experimental_features to reflect the latest
experimental features:
- in 4f23eec4, "raft" was renamed to "consistent-topology-changes".
- in 2dedb5ea, "alternator-ttl" was moved out of experimental features.
- in 5b1421cc, "broadcast-tables" was added to experimental features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15407
this target mirrors the "dist-server-debuginfo-{mode}" target in
the `build.ninja` created by `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`create-relocatable-package.py` does not use or include
`${CMAKE_CURRENT_BINARY_DIR}/debian`. so there is no
need to include this directory as a dependency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Perform schema changes while mixing nodes in RECOVERY mode with nodes in
group 0 mode:
- schema changes originating from RECOVERY node use
digest-based schema versioning.
- schema changes originating from group 0
nodes use persisted versions committed through group 0.
Verify that schema versions are in sync after each schema change, and
that each schema change results in a different version.
Also add a simple upgrade test, performing a schema change before we
enable Raft (which also enables the new versioning feature) in the
entire cluster, then once upgrade is finished.
One important upgrade test is missing, which we should add to dtest:
create a cluster in Raft mode but in a Scylla version that doesn't
understand GROUP0_SCHEMA_VERSIONING. Then start upgrading to a version
that has this patchset. Perform schema changes while the cluster is
mixed, both on non-upgraded and on upgraded nodes. Such a test is
especially important because we're adding a new column to the
`system.scylla_local` table (which we then redact from the schema
definition when we see that the feature is disabled).
pull_github_pr.sh adds a "Closes #xyz" tag so github can close
the pull request after the next promotion. Convert it to an absolute
reference (scylladb/scylladb#xyz) so the commit can be cherry-picked
into another repository without the reference dangling.
Closes #15424
this change changes `const` to `constexpr`, because the string literal
defined here is not only immutable, but also initialized at
compile time, and can be used by constexpr expressions and functions.
this change is introduced to reduce the size of the change when moving
to compile-time format strings in the future. so far, seastar::format()
does not use compile-time format strings, but we have patches pending
review implementing this. and the author of this change has local
branches implementing the changes on the scylla side to support
compile-time format strings, which practically replace most of the
`format()` calls with `seastar::format()`.
without this change, if we use the compile-time format check, the
compiler fails like:
```
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:276:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
.description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
^
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:276:44: note: in call to 'basic_format_string(description_template)'
.description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
^
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:258:16: note: declared here
const auto description_template =
^
```
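The difference can be demonstrated without fmt at all: a compile-time check can only read variables usable in constant expressions. A minimal sketch (names chosen for illustration, mirroring the `description_template` above):

```cpp
#include <cstddef>

// constexpr: initialized at compile time and readable from constant
// expressions, so a consteval check (like fmt's format-string
// constructor) can inspect it. A plain `const auto` local could not
// be read there, producing the error quoted above.
constexpr const char* description_template = "{} <operation> [options]";

constexpr std::size_t const_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

// Compiles only because description_template is constexpr.
static_assert(const_length(description_template) == 24);
```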
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15432
change the loop variable to `int` to silence a warning like
```
/home/kefu/.local/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -I/home/kefu/dev/scylladb/build/cmake/gen -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o -MF tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o.d -o tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o -c /home/kefu/dev/scylladb/tools/scylla-nodetool.cc
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:215:28: error: comparison of integers of different signs: 'unsigned int' and 'int' [-Werror,-Wsign-compare]
for (unsigned i = 0; i < argc; ++i) {
~ ^ ~~~~
```
`i` is used as the index into a plain C-style array; it's perfectly
fine to use a signed integer as the index in this case. As per the C++
standard,
> The expression E1[E2] is identical (by definition) to *((E1)+(E2))
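A sketch of the pattern (function name hypothetical) showing that a signed loop variable indexes a C-style array cleanly when the bound is itself signed, like `argc`:

```cpp
// A signed index into a C array is well-defined: E1[E2] is *((E1)+(E2)).
// Using `int i` against a signed bound avoids -Wsign-compare entirely.
inline int count_nonempty(char** args, int argn) {
    int n = 0;
    for (int i = 0; i < argn; ++i) { // signed vs signed: no warning
        if (args[i] != nullptr && args[i][0] != '\0') {
            ++n;
        }
    }
    return n;
}
```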
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15431
As promised in earlier commits:
Fixes: #7620
Fixes: #13957
Also modify two test cases in `schema_change_test` which depend on
the digest calculation method in their checks. Details are explained in
the comments.
Population of keyspaces happens first for system keyspaces, then for
non-system ones. Both methods iterate over the configured datadirs to
populate from all configured directories. This patch generalizes this
loop into the populate_keyspace() method.
(indentation is deliberately left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question tries to find the keyspace reference on the
database by the given keyspace name. However, one of the callers
already has the keyspace reference at hand and can just pass it. The
other callers can find the keyspace on their own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When populating non-system keyspaces the dist. loader lists the
directories with keyspaces in datadirs, then tries to call
populate_keyspace() with the found name. If the keyspace in question is
not found on the database, a warning is printed and population
continues.
S3-backed keyspaces are nowadays populated by this process only
because of bug #13020 -- even such keyspaces still create empty
directories in the datadirs. When the bug gets fixed, population would
omit such keyspaces. This patch prepares for that by making population
walk the known keyspaces from the database. BTW, population of system
keyspaces already works by iterating over the list of known keyspaces,
not the datadir subdirectories.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in the hope of lowering the bar for testing the object store.
* add a language specifier for better readability of the document,
to highlight the config with YAML syntax
* add a more specific comment on the AWS-related settings
* explain that the endpoint in the CREATE KEYSPACE statement should
match the one defined by the YAML configuration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15433
As explained in the previous commit, we use the new
`committed_by_group0` flag attached to each row of a `scylla_tables`
mutation to decide whether the `version` cell needs to be deleted or
not.
The rest of #13957 is solved by pre-existing code -- if the `version`
column is present in the mutation, we don't calculate a hash for
`schema::version()`, but take the value from the column:
```
table_schema_version schema_mutations::digest(db::schema_features sf)
const {
if (_scylla_tables) {
auto rs = query::result_set(*_scylla_tables);
if (!rs.empty()) {
auto&& row = rs.row(0);
auto val = row.get<utils::UUID>("version");
if (val) {
return table_schema_version(*val);
}
}
}
...
```
The issue will therefore be fixed once we enable
`GROUP0_SCHEMA_VERSIONING`.
As described in #13957, when creating or altering a table in group 0
mode, we don't want each node to calculate `schema::version()`s
independently using a hash algorithm. Instead, we want all nodes to
use a single version for that table, committed by the group 0 command.
There's even a column ready for this in `system.scylla_tables` --
`version`. This column is currently being set for system tables, but
it's not being used for user tables.
Similarly to what we did with global schema version in earlier commits,
the obvious thing to do would be to include a live cell for the `version`
column in the `system.scylla_tables` mutation when we perform the schema
change in Raft mode, and to include a tombstone when performing it
outside of Raft mode, for the RECOVERY case.
But it's not that simple because, as it turns out, we're *already*
sending a `version` live cell (and also a tombstone, with the
timestamp decremented by 1) in all `system.scylla_tables` mutations.
But then we delete that cell when doing the schema merge (which begs
the question why we were sending it in the first place, but I digress):
```
// We must force recalculation of schema version after the merge, since the resulting
// schema may be a mix of the old and new schemas.
delete_schema_version(mutation);
```
The above function removes the `version` cell from the mutation.
So we need another way of distinguishing a schema change originating
from group 0 from one originating outside group 0 (e.g. RECOVERY).
The method I chose is to extend `system.scylla_tables` with a boolean
column, `committed_by_group0`, and extend schema mutations to set
this column.
In the next commit we'll decide whether or not the `version` cell should
be deleted based on the value of this new column.
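A minimal sketch of the intended decision (types and names hypothetical; the real code operates on mutations, not plain structs): keep the persisted `version` cell only when the row was committed through group 0, otherwise drop it so the digest-based path takes over.

```cpp
#include <optional>

// Hypothetical flattened view of a system.scylla_tables row.
struct scylla_tables_row {
    std::optional<bool> committed_by_group0;
    std::optional<long> version; // persisted schema version, if any
};

// If the change did not come through group 0 (e.g. RECOVERY mode),
// delete the version cell to force digest recalculation after merge.
inline void maybe_delete_schema_version(scylla_tables_row& row) {
    if (!row.committed_by_group0.value_or(false)) {
        row.version.reset();
    }
}
```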
As promised in the previous commit, if we persisted a schema version
through a group 0 command, use it after a schema merge instead of
calculating a digest.
Ref: #7620
The above issue will be fixed once we enable the
`GROUP0_SCHEMA_VERSIONING` feature.
We extend schema mutations with an additional mutation to the
`system.scylla_local` table which:
- in Raft mode, stores a UUID under the `group0_schema_version` key.
- outside Raft mode, stores a tombstone under that key.
As we will see in later commits, nodes will use this after applying
schema mutations. If the key is absent or has a tombstone, they'll
calculate the global schema digest on their own -- using the old way. If
the key is present, they'll take the schema version from there.
The Raft-mode schema version is equal to the group 0 state ID of this
schema command.
The tombstone is necessary for the case of performing a schema change in
RECOVERY mode. It will force a revert to the old digest-based way.
Note that extending schema mutations with a `system.scylla_local`
mutation is possible thanks to earlier commits which moved
`system.scylla_local` to schema commitlog, so all mutations in the
schema mutations vector still go to the same commitlog domain.
Support for `canonical_mutation`s was added way back in Scylla 3.2. The
migration request handler was checking whether the remote supports
`canonical_mutation`s to handle rolling upgrades, and if not, it would
use `frozen_mutation`s instead.
We no longer need that second branch, since we don't support skipping
versions during upgrades (certainly everything would burn if we tried a
3.2->5.4 upgrade).
Leave a sanity check but otherwise delete the other branch.
Move node_ops related classes to node_ops/ so that they
are consistently grouped and can be accessed from
many modules.
Closes #15351
* github.com:scylladb/scylladb:
node_ops: extract classes related to node operations
node_ops: repair: move node_ops_id to node_ops directory
This feature, when enabled, will modify how schema versions
are calculated and stored.
- In group 0 mode, schema versions are persisted by the group 0 command
that performs the schema change, then reused by each node instead of
being calculated as a digest (hash) by each node independently.
- In RECOVERY mode or before Raft upgrade procedure finishes, when we
perform a schema change, we revert to the old digest-based way, taking
into account the possibility of having performed group0-mode schema
changes (that used persistent versions). As we will see in future
commits, this will be done by storing additional flags and tombstones
in system tables.
By "schema versions" we mean both the UUIDs returned from
`schema::version()` and the "global" schema version (the one we gossip
as `application_state::SCHEMA`).
For now, in this commit, the feature is always disabled. Once all the
necessary code is set up in the following commits, we will enable it
together with Raft.
The `scylla_tables` function gives a different schema definition
for the `system_schema.scylla_tables` table, depending on whether
certain schema features are enabled or not.
The way it was implemented, we had to write `θ(2^n)` amount
of code and comments to handle `n` features.
Refactor it so that the amount of code we have to write to handle `n`
features is `θ(n)`.
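The idea can be sketched as follows (column names from `scylla_tables`, structure hypothetical): build the definition incrementally with one branch per feature, instead of enumerating every feature combination.

```cpp
#include <string>
#include <vector>

// Hypothetical feature flags relevant to the scylla_tables schema.
struct schema_features {
    bool per_table_partitioners = false;
    bool group0_schema_versioning = false;
};

// One `if` per feature: θ(n) code for n features, instead of one
// hand-written variant per subset of features (θ(2^n)).
inline std::vector<std::string> scylla_tables_columns(schema_features f) {
    std::vector<std::string> cols = {"keyspace_name", "table_name", "version"};
    if (f.per_table_partitioners) {
        cols.push_back("partitioner");
    }
    if (f.group0_schema_versioning) {
        cols.push_back("committed_by_group0");
    }
    return cols;
}
```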
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.
This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.
Closes #15415
We make the CDC generation publisher continually remove the
obsolete CDC generation data to prevent CDC_GENERATIONS_V3 from
endlessly growing. To achieve this, we use the clean-up candidate.
If it exists and can be safely removed, we remove it together with
all older CDC generations. We also mark the lack of a new
candidate. The next published CDC generation will become one.
Note this solution does not have any guarantee about "when"
it removes obsolete generations. Formally, it guarantees that
if there is a candidate that can be removed and the CDC generation
publisher attempts to remove it, all generations up to the
candidate are removed. In practice, when a new generation appears,
the publisher makes a new candidate or tries to remove an old
candidate, so obsolete generations can stay for a long time only
if no generation appears for a long time. But it is fine because
we only want to prevent CDC_GENERATIONS_V3 from growing too much.
Moreover, providing any guarantees would require a new wake-up
mechanism for the publisher, which would be hard to implement.
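The clean-up rule can be sketched like this (names hypothetical; the real code goes through group 0 mutations): if a candidate exists and can be safely removed, drop it together with all older generations and clear the candidate, so the next published generation becomes the new one.

```cpp
#include <optional>
#include <set>

using gen_ts = long; // stand-in for a CDC generation timestamp

// Remove the candidate and everything older, if it can be safely
// removed (here: strictly older than `safe_before`).
inline void maybe_clean_up(std::set<gen_ts>& generations,
                           std::optional<gen_ts>& candidate,
                           gen_ts safe_before) {
    if (candidate && *candidate < safe_before) {
        generations.erase(generations.begin(),
                          generations.upper_bound(*candidate));
        candidate.reset(); // next published generation becomes the candidate
    }
}
```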
We want to use the clean-up candidates to remove the obsolete CDC
generation data, but first, we need to set suitable generations as
a candidate when there is no candidate. Since CDC generations must
be published before we remove them, a generation that is being
published is a good candidate.
In the following commits, we add a new task for the CDC generation
publisher -- clearing obsolete CDC generation data. This task
can be done together with the publishing under one group 0 guard.
We refactor publish_oldest_cdc_generation to make it possible.
Now, this function is more like a command builder. It takes guard
by const reference and updates the vector of mutations and the
reason string. The CDC generation publisher uses them directly to
update the topology at the end after finishing building the
command. This logic will be more visible after adding the clearing
task.
This commit renames `end_point_hints_manager` to `hint_endpoint_manager`
to be consistent with other names used in the module (they all start
with `hint_`).
This commit continues modularizing manager.hh.
After moving the declaration of sender to a dedicated
header file, these changes move its implementation to
a separate source file.
This commit is yet another step in modularizing manager.hh.
We move the declaration of sender to a dedicated file.
Its implementation will follow in a future commit.
The premise of these changes is the fact that we cannot have
a cycle of #includes.
Because the declaration of `sender` is going to be moved to
a separate header file in a future commit, and because that
header file is going to be included in the file where
`end_point_hints_manager` is declared, we will need to rely
on `end_point_hints_manager` being an incomplete type there.
A consequence of that is that we cannot access any of
`end_point_hints_manager`'s methods.
This commit prepares the ground for it by moving
the definition of the function to the source file where
`end_point_hints_manager` will be a complete type.
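The constraint can be illustrated in a single translation unit (names mirror the commit, members hypothetical): where only the forward declaration is visible, members of the type cannot be called; the call must live where the type is complete.

```cpp
// Header view: only a forward declaration is available here, so
// sender can hold a reference but cannot call any methods.
struct end_point_hints_manager; // incomplete type

struct sender {
    end_point_hints_manager& mgr;
    bool can_flush() const; // body deferred to the "source file" below
};

// "Source file" view: the type is complete from here on.
struct end_point_hints_manager {
    bool stopping = false;
};

bool sender::can_flush() const {
    return !mgr.stopping; // legal: complete type in this context
}
```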
This commit continues moving end_point_hints_manager to its
dedicated files. After moving the declaration of the class,
these changes move the implementation.
This commit is yet another step in modularizing manager.hh.
We move the declaration of the class to a dedicated header file.
The implementation will follow in a future commit.
We move definitions of inline methods of end_point_hints_manager
and sender accessing shard hint manager to the source file,
effectively un-inlining them. We need to do that to prepare for
moving said structures out of manager.hh. This commit is yet
another step in modularizing manager.hh.
This commit moves types used by shard hint manager
and related to storing hints on disk to another file.
It is yet another step in modularizing manager.hh.
This commit extracts the logger used in manager.cc
to prepare the ground for modularization of manager.hh
into separate smaller files. We want to preserve
the logging behavior (at least for the time being),
which means new files should use the same logger.
These changes serve that purpose.
Currently, data structures used in manager.hh
use their own aliases for gms::inet_address.
It is clear they all should use the same type
and having different names for it only reduces
readability of the code. This commit introduces
a common alias -- endpoint_id -- and gets rid
of the other ones.
This commit is also the first step in modularizing
manager.hh by extracting common types to another
file.
Currently, we only log what was tried w.r.t. obtaining the schema if
it failed. Add a log message to the success path too, so in case the
wrong schema was successfully loaded, the user can find the problem.
The log message is printed at debug level, so it doesn't disturb the
output by default.
Fixes: #15384
Closes #15417
in this series,
- the build of unstripped package is fixed, and
- the targets for building deb and rpm packages are added. these targets build deb and rpm packages from the unstripped package.
Closes #15403
* github.com:scylladb/scylladb:
build: cmake: add targets for building deb and rpm packages
build: cmake: correct the paths used when building unstripped pkg
cassandra-stress connects to "localhost" by default. that's exactly the
use case when we install scylla using the unified installer. so do not
suggest the "-node xxx" option; the "xxx" part is only confusing.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15411
This series introduces a scylla-native nodetool. It is invokable via the main scylla executable, like the other native tools we have. It uses seastar's new `http::client` to connect to the specified node and execute the desired commands.
For now a single command is implemented: `nodetool compact`, invokable as `scylla nodetool compact`. Once all the boilerplate is added to create a new tool, implementing a single command is not too bad in terms of code bloat. Certainly not as clean as a python implementation would be, but good enough. The advantages of a C++ implementation are that all of us in the core team know C++ and that it ships right as part of the scylla executable.
Closes #14841
* github.com:scylladb/scylladb:
test: add nodetool tests
test.py: add ToolTestSuite and ToolTest
tools/scylla-nodetool: implement compact operation
tools/scylla-nodetool: implement basic scylla_rest_api_client
tools: introduce scylla-nodetool
utils: export dns_connection_factory from s3/client.cc to http.hh
utils/s3/client: pass logger to dns_connection_factory in constructor
tools/utils: tool_app_template::run_async(): also detect --help* as --help
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with the highest priority.
Topology changes now have an extra step called "tablet draining",
which calls the load balancer. The step will execute the tablet
migration track as long as there are nodes which require draining. It
will not do regular load balancing.
If the load balancer is unable to find new tablet replicas, because
the RF cannot be met or availability is at risk due to insufficient
node distribution across racks, it will throw an exception. Currently,
the topology change will retry in a loop. We should make this error
cause the topology change to be aborted. There is no infrastructure
for aborts yet, so this is not implemented.
Closes #15197
* github.com:scylladb/scylladb:
tablets, raft topology: Add support for decommission with tablets
tablet_allocator: Compute load sketch lazily
tablet_allocator: Set node id correctly
tablet_allocator: Make migration_plan a class
tablets: Implement cleanup step
storage_service, tablets: Prevent stale RPCs from running beyond their stage
locator: Introduce tablet_metadata_guard
locator, replica: Add a way to wait for table's effective_replication_map change
storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
raft topology: Add break in the final case clause
raft topology: Fix SIGSEGV when trace-level logging is enabled
raft topology: Set node state in topology
raft topology: Always set host id in topology
When a column family's schema is changed, a new compaction
strategy type may be applied.
To make sure that it behaves as expected, the compaction
strategy must contain only the allowed options and values.
Methods throwing exceptions on invalid options are added.
Fixes: #2336
Closes #13956
* github.com:scylladb/scylladb:
test: add test for compaction strategy validation
compaction: unify exception messages
compaction: cql3: validate options in check_restricted_table_properties
compaction: validate options used in different compaction strategies
compaction: validate common compaction strategy options
compaction: split compaction_strategy_impl constructor
compaction: validate size_tiered_compaction_strategy specific options
compaction: validate time_window_compaction_strategy specific options
compaction: add method to validate min and max threshold
compaction: split size_tiered_compaction_strategy_options constructor
compaction: make compaction strategy keys static constexpr
compaction: use helpers in validate_* functions
compaction: split time_window_compaction_strategy_options constructor
compaction: add validate method to compaction_strategy_options
time_window_compaction_strategy_options: make copy and move-able
size_tiered_compaction_strategy_options: make copy and move-able
When populating from a particular directory, the populator code
converts the state to a subdir name, then prints the path. The
conversion is pretty much artificial; it's better to provide a printer
for the state and print the state explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The quarantine directory can be missing from the datadir and that's OK.
In order to check that and skip population, the populator code uses
two-step logic -- first it checks if the directory exists and either
puts the sstable_directory object into the map or not. Later it checks
the map and decides whether to throw if the directory is missing.
Let's keep both the check and the throw in one place for brevity.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The populator maintains a map of path -> sstable_directory pairs, one
for each subdirectory for every sstable state. The "path" is in fact
not used by the logic, as it's just the subdirectory name for the
state, and the rest of the code operates on the state. So it's better
to make the map of directories indexed by the state as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, exceptions thrown from `sst->load` are unhandled,
resulting in, e.g.:
```
ERROR 2023-09-12 08:02:58,124 [shard 0:main] seastar - Exiting on unhandled exception: std::runtime_error (SSTable /home/bhalevy/.dtest/dtest-dxg4xdxg/test/node1/data/ks/cf-a3009f20512911ee8000d81cd2da3fd7/me-3g9b_0e0x_39vtt1y2rcqrffz55j-big-Data.db uses org.apache.cassandra.dht.Murmur3Partitioner partitioner which is different than com.scylladb.dht.CDCPartitioner partitioner used by the database)
```
Log the errors and exit the tool with non-zero status
in this case.
Fixes #15359
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15376
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with the highest priority.
Topology changes now have an extra step called "tablet draining",
which calls the load balancer. The step will execute the tablet
migration track as long as there are nodes which require draining. It
will not do regular load balancing.
If the load balancer is unable to find new tablet replicas, because
the RF cannot be met or availability is at risk due to insufficient
node distribution across racks, it will throw an exception. Currently,
the topology change will retry in a loop. We should make this error
cause the topology change to be paused so that the admin becomes aware
of the problem and issues an abort on the topology change. There is no
infrastructure for aborts yet, so this is not implemented.
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain), and it would fail on the missing
host id mapping (the node is no longer in topology), which is examined
by the tablet sharder when producing sstable sharding metadata,
leading to an abort due to the failed memtable flush.
Example scenario:
1. coordinator A sends RPC #1 to trigger streaming
2. coordinator fails over to B
3. coordinator B performs streaming successfully
4. RPC #1 arrives and starts streaming
5. coordinator B commits the transition to the post-streaming stage
6. coordinator B executes global token metadata barrier
We end up with streaming running despite the fact that the current
coordinator has moved on. Currently, this won't happen, because
streaming holds on to the erm. But we want to change that (see
#14995), so that it does not block barriers for migrations of other
tablets. The same problem applies to tablet cleanup.
The fix is to use tablet_metadata_guard around such long-running
operations, which will keep a hold on the erm so that in the above
scenario coordinator B will wait for it in step 6. The guard ensures
that the erm doesn't block other migrations, because it switches to
the latest erm if it's compatible. If it's not, it signals the guard's
abort_source so that such a stale operation aborts soon and the
barrier in step 6 doesn't wait for long.
Will be used to synchronize long-running tablet operations with
topology coordinator.
It blocks barriers like erm_ptr, but refreshes if the change is
irrelevant, so it behaves as if the erm_ptr's scope were narrowed down
to a single tablet.
The decode_cdc_generations_ids function allows us to decode
a vector of CDC generation IDs. After adding cleanup_candidate
to CDC_GENERATIONS_V3, we need a similar function that decodes
a single ID.
In the following commits, we implement a garbage collection for
CDC_GENERATIONS_V3. The first step is introducing the clean-up
candidate. It will be continually updated by the CDC generation
publisher and used to remove obsolete data.
Before, it was updated only for normal nodes. We need it for
bootstrapping nodes too. Otherwise, algorithms such as the load
balancer will be confused by observing nodes in topology without the
host id set.
This will become a problem when the load balancer is invoked
concurrently with bootstrap, which currently is not the case, but will
be after later patches.
We should maintain the invariant that all nodes in topology have a
host id.
Testing the new scylla nodetool tool.
The tests can be run against both implementations of nodetool: the
scylla-native one and the cassandra one. They all pass with both
implementations.
Equivalent of nodetool compact.
The following arguments are accepted:
* split-output,s (unused)
* user-defined (error is raised)
* start-token,st (unused)
* end-token,et (unused)
* partition (unused)
The partition argument is mentioned only in our docs; our nodetool
doesn't recognize it. I added it nevertheless (it is ignored).
Split-output doesn't work with our current nodetool; the option is
parsed, but an error is raised if it is used.
Add --host and --port parameters, parse and resolve these and
establish a connection to the provided host.
Add a simple get() and post() method, parsing the returned data as json.
Add the following compatibility arguments:
* password,pw
* password-file,pwf
* username,u
* print-port,pp
These are parsed and silently ignored, as they are specific to JMX and
aren't needed when connecting to the REST API.
Since boost program options supports neither multi-char short-form
switches nor the -short=value syntax, the argv has to be massaged into
a form which boost program options can digest. This is achieved by
substituting all incompatible option formats and syntax with the
equivalent boost program options compatible ones.
This mechanism is also used to make sure -h is translated to --host, not
--help. The help message is unfortunately still ambiguous, displaying
both with -h. This will be addressed in a follow-up.
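A minimal sketch of the argv massaging (alias table and function name hypothetical): rewrite multi-char short options and `-opt=value` forms into long options with separate value tokens that boost program options accepts.

```cpp
#include <map>
#include <string>
#include <vector>

// Rewrite JMX-nodetool style options (-st=100, -h, ...) into the long
// forms boost::program_options can parse; unknown tokens pass through.
inline std::vector<std::string>
massage_argv(const std::vector<std::string>& argv) {
    static const std::map<std::string, std::string> aliases = {
        {"-st", "--start-token"},
        {"-et", "--end-token"},
        {"-h", "--host"}, // ensure -h means --host, not --help
    };
    std::vector<std::string> out;
    for (const auto& arg : argv) {
        const auto eq = arg.find('=');
        const std::string head = arg.substr(0, eq);
        const auto it = aliases.find(head);
        out.push_back(it != aliases.end() ? it->second : head);
        if (eq != std::string::npos) {
            out.push_back(arg.substr(eq + 1)); // split -opt=value in two
        }
    }
    return out;
}
```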
We want to publish this class in a header so it can be used by others,
but it uses the s3 logger. We don't want future users to pollute the s3
logs, so allow users to pass their own loggers to the factory.
There are several system tables with strict durability requirements.
This means that if we have written to such a table, we want to be sure
that the write won't be lost in case of node failure. We currently
accomplish this by accompanying each write to these tables with
`db.flush()` on all shards. This is expensive, since it causes all the
memtables to be written to sstables, which causes a lot of disk writes.
These overheads can become painful during node startup, when we write the
current boot state to `system.local`/`system.scylla_local` or during
topology change, when `update_peer_info`/`update_tokens` write to
`system.peers`.
In this series we remove flushes on writes to the `system.local`,
`system.peers`, `system.scylla_local` and `system.cdc_local` tables and
start using schema commitlog for durability.
Fixes: #15133
Closes #15279
* github.com:scylladb/scylladb:
system_keyspace: switch CDC_LOCAL to schema commitlog
system_keyspace: scylla_local: use schema commitlog
database.cc: make _uses_schema_commitlog optional
system_keyspace: drop load phases
database.hh: add_column_family: add readonly parameter
schema_tables: merge_tables_and_views: delay events until tables/views are created on all shards
system_keyspace: switch system.peers to schema commitlog
system_keyspace: switch system.local to schema commitlog
main.cc: move schema commitlog replay earlier
sstables_format_selector: extract listener
sstables_format_selector: wrap when_enabled with seastar::async
main.cc: inline and split system_keyspace.setup
system_keyspace: refactor save_system_schema function
system_keyspace: move initialize_virtual_tables into virtual_tables.hh
system_keyspace: remove unused parameter
config.cc: drop db::config::host_id
main.cc: extract local_info initialization into function
schema.cc: check static_props for sanity
system_keyspace: set null sharder when configuring schema commitlog
system_keyspace: rename static variables
system_keyspace: remove redundant wait_for_sync_to_commitlog
we get the path of the object storage config like:
```c++
db::config::get_conf_sub("object_storage.yaml").native()
```
so, the default path should be $SCYLLA_CONF/object_storage.yaml.
this change corrects it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15406
The added metrics include:
- http client metrics, which include the number of connections, the number of active connections and the number of new connections made so far
- IO metrics that mimic those for traditional IO -- total number of object read/write ops, total number of get/put/uploaded bytes and individual IO request delay (round-trip, including body transfer time)
Fixes: #13369
Closes #14494
* github.com:scylladb/scylladb:
s3/client: Add IO stats metrics
s3/client: Add HTTP client metrics
s3/client: Split make_request()
s3/client: Wrap http client with struct group_client
s3/client: Move client::stats to namespace scope
s3/client: Keep part size local variable
in a0dcbb09c3, the newly introduced unstripped package did not build
at all, as it was using the wrong paths. so let's correct them.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
similar to d9dcda9dd5, we need to
use the version files located under $build_dir instead of "build".
so let's check the existence of $build_dir/SCYLLA-RELEASE-FILE,
and then fall back to the ones under "build".
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change serves two purposes:
1. so we don't assume the existence of the '$PWD/build' directory. we should
not assume this, as the build directory could be any directory; it
does not have to be "build".
2. we don't have to actually create a file under $build_dir. what we
need is just an empty file, so tempfile serves this purpose well.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
a new target "dist-unified" is added, so that CMake can build the unified
package, which is a bundle of all subcomponents, like cqlsh, python3,
jmx and tools.
Fixes #15241
Closes #15398
* github.com:scylladb/scylladb:
build: cmake: build unified package
build: cmake: put stripped_dist_pkg under $build/dist
We remove flush from set_scylla_local_param_as
since it's now redundant. We add it to
save_local_enabled_features as features need to
be available before schema commitlog replay.
We skip the flush if save_local_enabled_features
is called from topology_state_load when the features
are migrated to system.topology and we don't need
strict durability.
This field on the null shard is properly initialized
in the maybe_init_schema_commitlog function; until then
we can't make decisions based on its value. This problem
can happen e.g. if the add_column_family function is called
with readonly=false before maybe_init_schema_commitlog.
It will call commitlog_for to pass the commitlog to
mark_ready_for_writes and commitlog_for reads _uses_schema_commitlog.
In this commit we add protection against this case - we
trigger internal_error if _uses_schema_commitlog is read
before it is initialized.
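The guard described above can be sketched as follows. This is a simplified, hypothetical model (plain C++ standing in for Scylla's internals, with `std::logic_error` standing in for `on_internal_error`), not the actual implementation:

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical sketch: _uses_schema_commitlog becomes an optional, and
// reading it before maybe_init_schema_commitlog() has run triggers an
// internal error instead of silently acting on a default value.
struct database_flags {
    std::optional<bool> uses_schema_commitlog;

    bool get_uses_schema_commitlog() const {
        if (!uses_schema_commitlog.has_value()) {
            // stand-in for Scylla's on_internal_error()
            throw std::logic_error("_uses_schema_commitlog read before initialization");
        }
        return *uses_schema_commitlog;
    }

    void maybe_init_schema_commitlog(bool enabled) {
        uses_schema_commitlog = enabled;
    }
};
```

An early read trips the error; after initialization, reads return the configured value.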
maybe_init_schema_commitlog() was added to cql_test_env
to make boost tests work with the new invariant.
We want to switch system.scylla_local table to the
schema commitlog, but load phases hamper here - schema
commitlog is initialized after phase1,
so a table which is using it should be moved to phase2,
but system.scylla_local contains features, and we need
them before schema commitlog initialization for
SCHEMA_COMMITLOG feature.
In this commit we are taking a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commit log. To achieve this we've added a _readonly bool field
to the table class; it's initialized to true in the table's
constructor. In addition, we changed the table constructor
to always assign nullptr to commitlog, and we trigger
an internal error if table.commitlog() property is accessed
while the table is in readonly mode. Then, after
triggering on_system_tables_loaded notifications on
feature_service and sstable_format_selector, we call
system_keyspace::mark_writable and eventually
table::mark_ready_for_writes which selects the
proper commitlog and marks the table as writable.
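The readonly lifecycle described above can be sketched in miniature. All names here are simplified stand-ins for the real classes, and the exception stands in for the internal error:

```cpp
#include <stdexcept>

// Minimal sketch of the readonly loading mode: a table is constructed
// readonly with a null commitlog, accessing the commitlog while readonly
// is an error, and mark_ready_for_writes() assigns the proper commitlog
// and makes the table writable.
struct commitlog {};

struct table {
    bool readonly = true;      // initialized to true in the constructor
    commitlog* cl = nullptr;   // constructor always assigns nullptr

    commitlog* get_commitlog() const {
        if (readonly) {
            throw std::logic_error("commitlog accessed while table is readonly");
        }
        return cl;
    }

    void mark_ready_for_writes(commitlog* chosen) {
        cl = chosen;           // select the proper commitlog
        readonly = false;      // from now on the table accepts writes
    }
};
```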
In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant,
the table has already been made writable in
env.make_table_for_tests call.
The table::commitlog function either returns the current
commitlog or causes an error if the table is readonly. This
didn't work for virtual tables, since they never called
mark_ready_for_writes. In this commit we add this
call to initialize_virtual_tables.
Previously, creating a table or view in
schema_tables.cc/merge_tables_and_views was a two-step process:
first adding a column family (add_column_family function) and
then marking it as ready for writes (mark_table_as_writable).
There is a yield between these stages, which means
someone could see a table or view for which the
mark_table_as_writable method had not yet been called,
and start writing to it.
This problem was demonstrated by materialised view dtests.
A view is created on all nodes. On some nodes it will be created
earlier than on others and the view rebuild process will start
writing data to that view on other nodes, where mark_table_as_writable
has not yet been called.
In this patch we solve this problem by adding a readonly parameter
to the add_column_family method. When loading tables from disk,
this flag is set to true and the mark_table_as_writable
is called only after all sstables have been loaded.
When creating a new table, this flag is set to false,
mark_table_as_writable is called from inside add_column_family
and the new table becomes visible already as writable.
db.get_notifier().create_view triggers view rebuild; this
process writes to the table on all shards and thus can
access a partially created table, e.g. one where
mark_table_ready_for_writes was not yet called.
Schema commitlog lives only on the zero shard,
so we need to turn on use_null_sharder option.
Also, we remove flushes on writes as durability
is now guaranteed by the commitlog.
We want to switch the system.local table to
schema commitlog, but this table is used
in host_id initialization (initialize_local_info),
so we need to replay schema commitlog beforehand.
In this commit we gather all the actions
related to early system_keyspace initialization
in one place, before initialize_local_info_thread.
The calls to save_system_schema and recalculate_schema_version
are tied to legacy_schema_migrator::migrate and
initialize_virtual_tables calls, so they are done
separately after legacy_schema_migrator::migrate.
In the following commits we want to move schema
commitlog replay earlier, but the current sstable
format should be selected before the replay.
The current sstable format is stored in system.scylla_local,
so we can't read it until system tables are loaded.
This problem is similar to the enabled_features.
To solve this we split sstables_format_selector in two
parts. The lower level part, sstables_format_selector,
knows only about database and system_keyspace. It
will be moved before system_keyspace initialization,
and the on_system_tables_loaded method will
be called on it when the system_keyspace has loaded its tables.
The higher level part, sstables_format_listener, is responsible
for subscribing to feature_service and gossiper and is started
later, at the same place as sstables_format_selector before this commit.
The listener may fire immediately, so we must be in a thread
context for this to work.
In the next commits we are going to move
enable_features_on_startup above
sstables_format_selector::start in scylla_main, so we
need to fix this beforehand.
Our goal is to switch system.local table to schema
commitlog and stop doing flushes when we write to it.
This means it would be incorrect to read from this
table until schema commitlog is replayed.
On the other hand, we need truncation records
to be loaded before we start replaying schema
commitlog, since commitlog_replayer relies on them.
In this commit we inline the system_keyspace::setup
function and split its content into two parts. In
the first part, before schema commitlog replay,
we load truncation records. It's safe to load
them before schema commitlog replay since we intend
to keep the flushes on writes to the system.truncated
table. In the second part, after schema commitlog replay,
we do the rest of the job - build_bootstrap_info and
db::schema_tables::save_system_schema.
We decided to inline this function since there is
very low cohesion between the actions it's performing.
It's just simpler to reason about them individually.
This is a refactoring commit without observable changes
in behaviour.
Previously, there were two related functions in db::schema_tables:
save_system_keyspace_schema(qp) and save_system_schema(qp, ks).
The first called the second passing "system_schema" as
the second argument. Outside of schema_tables module we
don't need two functions, we just need a way to say
'persist system schema objects in the appropriate tables/keyspaces'.
In this commit we change the function save_system_schema
to have this meaning. Internally it calls save_system_schema_to_keyspace
twice with "system_schema" and "system", since that's what we need
in the single call site of this function in system_keyspace::setup.
In subsequent commits we are going to move this call out of the
system_keyspace::setup.
This is a readability refactoring commit without observable changes
in behaviour.
initialize_virtual_tables logically belongs to the virtual_tables module,
and moving it allows making other functions in virtual_tables.cc
(register_virtual_tables, install_virtual_readers)
local to the module, which simplifies matters a bit.
all_virtual_tables() is not needed anymore, all the references to
registered virtual tables are now local to virtual_tables module
and can just use virtual_tables variable directly.
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.
Some tests want a specific host_id, so we add it to cql_test_config
and use it in cql_test_env.
We can't pass host_id to sstables_manager by value since the manager is
initialized in the database constructor and host_id is not loaded yet.
We also prefer not to make a dependency on shared_token_metadata
since in this case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to
sstables_manager constructor.
This is a refactoring commit without observable changes
in behaviour.
The scylla main function is huge and incomprehensible.
There are a lot of hidden dependencies between actions
that it performs, and it's too difficult to reason about
them.
In this commit, we've extracted a small part of it into
its own function. We're hoping that, moving forward,
the rest of the code can be modified in a similar manner.
The schema commitlog lives only on the null shard, so it
makes no sense to set use_schema_commitlog
without use_null_sharder.
We also extract the function enable_schema_commitlog which
sets all the needed properties.
Tables with schema commitlog already sync every
write, so wait_for_sync_to_commitlog makes sense
only for the regular commitlog.
Technically there is nothing wrong with allowing
both options, but it's confusing. Being strict
and accurate about the meaning of the options
reduces the chance of errors due to misunderstanding.
This is preparation for the next commits, where
we will start generating an error if the combination
of options doesn't make sense.
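The option constraints discussed above could be validated along these lines. This is a hedged sketch with illustrative names, not the actual Scylla code:

```cpp
// Sketch of the stricter option checking: since the schema commitlog lives
// only on the null shard, use_schema_commitlog without use_null_sharder
// makes no sense, and wait_for_sync_to_commitlog is meaningful only for
// the regular commitlog.
struct table_options {
    bool use_schema_commitlog = false;
    bool use_null_sharder = false;
    bool wait_for_sync_to_commitlog = false;
};

bool options_make_sense(const table_options& o) {
    if (o.use_schema_commitlog && !o.use_null_sharder) {
        return false; // schema commitlog requires the null sharder
    }
    if (o.use_schema_commitlog && o.wait_for_sync_to_commitlog) {
        return false; // schema commitlog already syncs every write
    }
    return true;
}
```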
a new target "dist-unified" is added, so that CMake can build unified
package, which is a bundle of all subcomponents, like cqlsh, python3,
jmx and tools.
Fixes #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, the API call recalculates only the per-node schema version. To
work around issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has an impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema
Fixes #15380
Closes #15381
This code is now spread over main and differs in cql_test_env. The PR
unifies both places and makes the manager start-stop look standard.
refs: #2795
Closes #15375
* github.com:scylladb/scylladb:
batchlog_manager: Remove start() method
batchlog_manager: Start replay loop in constructor
main, cql_test_env: Start-stop batchlog manager in one "block"
batchlog_manager: Move shard-0 check into batchlog_replay_loop()
batchlog_manager: Fix drain() reentrancy
Split compaction_strategy_impl constructor into methods that will
be reused for validation.
Add additional checks ensuring that options' values are legal.
Add compaction_strategy_impl::validate_min_max_threshold method
that will be used to validate min and max threshold values
for different compaction methods.
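A helper like the one named above might look as follows. This is a hypothetical sketch; the exact bounds and signature in Scylla may differ:

```cpp
#include <stdexcept>

// Hypothetical sketch of a validate_min_max_threshold helper: reject
// thresholds that cannot drive compaction (fewer than 2 sstables merged)
// or an inverted min/max pair.
void validate_min_max_threshold(long min_threshold, long max_threshold) {
    if (min_threshold < 2) {
        throw std::invalid_argument("min_threshold must be at least 2");
    }
    if (max_threshold < min_threshold) {
        throw std::invalid_argument("max_threshold must not be smaller than min_threshold");
    }
}
```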
Split size_tiered_compaction_strategy_options constructor into
methods that will be reused for validation.
Add additional checks ensuring that options' values are legal.
To be consistent with other compaction_strategy_options,
time_window_compaction_strategy_options uses compaction_strategy_impl::get_value
and cql3::statements::property_definitions::to_long helpers for
parsing.
Add a temporarily empty validate method to compaction_strategy_options.
The method will validate the options and help determine whether
only the allowed options were set.
There's a dedicated call to register migration manager's verbs somewhere
in the middle of main. However, until messaging service listening starts,
it makes no difference when the verbs are registered.
This patch moves the verbs registration into the migration manager
constructor, thus making it happen with sharded<migration_manager>::start().
Unregistration happens in migration_manager::drain() and it's not
touched here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15367
It's only used to be carried along down to a handler and to get
sharded<database> from it. The storage service itself can provide it, and the
handler in question already uses it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15368
there are a couple of places that check that group is not nullptr,
so let's set it to nullptr in the ctor, so shards that don't
have it initialized will trip the assert, instead of failing
with a cryptic segfault.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15330
in this series, `unified/build_unified.sh` is improved in a couple of ways:
1. add a `--build-dir` option, so we don't hardwire the build directory to the naming convention of `build/$mode`.
2. respect `--pkgs`. this allows the caller to specify the paths to the dist tarballs instead of hardwiring the paths defined in this script.
these changes give us more flexibility when building the unified package, and enable us to switch over to the CMake-based build system.
Refs #15241
Closes #15377
* github.com:scylladb/scylladb:
unified: respect --pkgs option
unified: allow passing --pkgs with a semicolon-separated list
unified: prefer SCYLLA-PRODUCT-FILE in build_dir
unified: derive UNIFIED_PKG from --build-dir
unified: add --build-dir option to build_unified.sh
We make the `CDC_GENERATIONS_V3` table single-partition and change the
clustering key from `range_end` to `(id, range_end)`. We also change the
type of `id` to `timeuuid` and ensure that a new generation always has
the highest `id`. These changes allow efficient clearing of obsolete CDC
generation data, which we need to prevent Raft-topology snapshots from
endlessly growing as we introduce new generations over time.
All this code is protected by an experimental feature flag. It includes
the definition of `CDC_GENERATIONS_V3`. The table is not created unless
the feature flag is enabled.
Fixes #15163
Closes #15319
* github.com:scylladb/scylladb:
system_keyspace: rename cdc_generation_id_v2
system_keyspace: change id to timeuuid in CDC_GENERATIONS_V3
cdc: generation: remove topology_description_generator
cdc: do not create uuid in make_new_generation_data
system_keyspace: make CDC_GENERATIONS_V3 single-partition
cdc: generation: introduce get_common_cdc_generation_mutations
cdc: generation: rename get_cdc_generation_mutations
we should not pass "tar" to tar, otherwise we'd have the following error:
```
tar (child): tar: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
```
as "tar" is not the compressed tarball we want to untar.
Fixes #15328
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15383
Node operations will be integrated with the task manager, and so the node_ops
directory needs to be created. To have access to node-ops-related
classes from the task manager and preserve consistent naming, move
the classes to node_ops/node_ops_data.cc.
The `gossiper::reset_endpoint_state_map` function is supposed to acquire
a lock in order to serialize with `replicate_live_endpoints_on_change`.
The `lock_endpoint_update_semaphore` is called, but its result is a
future - and it is not co_awaited. Therefore, the lock has no effect.
This commit fixes the issue by adding the missing co_await.
Fixes: #15361
Closes #15362
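The failure mode can be illustrated with a toy future-like type. This is not the real Seastar API, just a model of why discarding the future means the lock is never taken:

```cpp
#include <functional>

// Illustrative model of the bug: lock_endpoint_update_semaphore() returns
// a future-like value, and discarding it without awaiting means the
// semaphore unit is never actually acquired.
struct lock_future {
    std::function<void()> acquire;
    void await() { acquire(); } // stand-in for co_await
};

struct gossiper_model {
    int held_units = 0;
    lock_future lock_endpoint_update_semaphore() {
        return lock_future{[this] { ++held_units; }};
    }
};
```

Calling the method without awaiting the result leaves the semaphore untouched, which is exactly why the lock "had no effect" in the buggy code.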
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.
The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.
Fixes: #15371
Many of the gossiper internal functions currently use seastar threads for historical reasons,
but since they are short-lived, the cost of spawning a seastar thread for them is excessive,
and they can be simplified and made more efficient using coroutines.
Closes #15364
* github.com:scylladb/scylladb:
gossiper: reindent do_stop_gossiping
gossiper: coroutinize do_stop_gossiping
gossiper: reindent assassinate_endpoint
gossiper: coroutinize assassinate_endpoint
gossiper: coroutinize handle_ack2_msg
gossiper: handle_ack_msg: always log warning on exception
gossiper: reindent handle_ack_msg
gossiper: coroutinize handle_ack_msg
gossiper: reindent handle_syn_msg
gossiper: coroutinize handle_syn_msg
gossiper: message handlers: no need to capture shared_from_this
gossiper: add_local_application_state: throw internal error if endpoint state is not found
gossiper: coroutinize add_local_application_state
Simplify the function. It does not need to spawn
a seastar thread.
While at it, declare it as private since it's called
only internally by the gossiper (and on shard 0).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike handle_syn_msg, the warning is currently printed only
`if (_ack_handlers.contains(from.addr))`.
Unclear why. It is interesting in any case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The handlers future is waited on under `background_msg`,
which is closed in gossiper::stop, so the instance is
already guaranteed to be kept valid.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the function is called too early, the first get_endpoint_state_ptr
would throw an exception that is later caught and degraded
into a warning.
But that endpoint_state should never disappear after yielding,
so call on_internal_error in that case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
let's provide the default value only if the user does not specify --pkgs;
otherwise the --pkgs option is always ignored.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
unlike `configure.py`, the build system created by CMake does not
share the `SCYLLA-PRODUCT-FILE` across different builds, so we cannot
assume that build/SCYLLA-PRODUCT-FILE exists.
so, in this change, we check $BUILD_DIR/SCYLLA-PRODUCT-FILE first,
and fall back to $BUILD_DIR/../SCYLLA-PRODUCT-FILE. this should work
for both the configure.py and CMake build systems.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should respect --build-dir if --unified-pkg is not specified,
and deduce the path to the unified pkg from BUILD_DIR.
so, in this change, we deduce the path to the unified pkg from BUILD_DIR
unless --unified-pkg is specified.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this allows build_unified.sh to generate the unified pkg in the specified
directory, instead of assuming the naming convention of build/$mode.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
... and sanitize the future used on stop.
The loop in question is now started in .start(), but all callers now
construct the manager late enough, so the loop spawning can be moved.
This also calls for renaming the future member of the class and allows
making it a regular, not shared, future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the starting and stopping of b.m. is spread over main(). Keep them
close to each other.
Another subtlety here is that calling b.m.::start() can only be done
after joining the cluster, because this start() spawns the replay loop
which, in turn, calls token_metadata::count_normal_token_owners() and if
the latter returns zero, the b.m. code uses it as a fraction denominator
and crashes.
With the above in mind, cql_test_env should start batchlog manager after
it "joins the ring" too. For now it doesn't make any difference, but
next patch will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the only caller of it is the batchlog manager itself. It
checks for the shard-id to be zero, calls the method, then the method
asserts that it's run on shard-0.
Moving the check into the method removes the need for assertion and
makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently drain() is called twice -- first from
storage_service::drain() (on shutdown), second via
batchlog_manager::stop(). The routine is unintentionally re-entrant,
because:
- there is an explicit check that avoids aborting the abort source twice
- breaking the semaphore can be done multiple times
- co_await-ing the _started future works because the future is shared
That's not extremely elegant; it's better to make drain() bail out early
if it was already called.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
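The "bail out early" idea reduces to a guard flag. A minimal sketch with simplified names (the real drain() is a coroutine touching the abort source, semaphore, and replay loop):

```cpp
// Sketch of the fix: drain() records that it has already run, so a second
// caller (e.g. batchlog_manager::stop() after storage_service::drain())
// returns immediately instead of re-running the shutdown steps.
struct batchlog_manager_model {
    bool drained = false;
    int shutdown_steps_run = 0;

    void drain() {
        if (drained) {
            return; // already drained, nothing to do
        }
        drained = true;
        ++shutdown_steps_run; // abort source, break semaphore, join loop...
    }
};
```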
in this series, the packaging of the tools modules is improved:
- package cqlsh also, as cqlsh should be redistributed as a part of the unified package
- use ${arch} in the postfix of the python3 package, as the python3 package is not architecture-independent
- set the version with a tilde for `Scylla_VERSION`, so it can be reused elsewhere
Refs #15241
Closes #15369
* github.com:scylladb/scylladb:
build: cmake: build cqlsh as a submodule
build: cmake: always use the version with tilde
build: cmake: build python3 dist tarball with arch postfix
build: cmake: use the default comment message
This function has practically returned true since inception.
In d38deef499
it started using messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.
So there's no point in calling it, since we can assume
that `uses_host_id` is true for all versions.
Closes #15343
* github.com:scylladb/scylladb:
storage_service: fixup indentation after last patch
gossiper: get rid of uses_host_id
since we always use a tilde ("~") in the version number,
let's just cache it as an internal variable in CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that `configure.py` always generates the python3 dist tarball with
the ${arch} postfix, let's mirror this behavior, as `build_unified.sh`
uses this naming convention.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it turns out "Generating submodule python3 in python3" is not
as informative as the default one:
"/home/kefu/dev/scylladb/tools/python3/build/scylla-python3-5.4.0~dev-0.20230908.1668d434e458.noarch.tar.gz"
so let's drop the "COMMENT" argument.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Changing the second value of cdc_generation_id_v2 from uuid_type
to timeuuid_type made the name of cdc_generation_id_v2 unsuitable
because it does not match cdc::generation_id_v2 anymore.
We change the type of IDs in CDC_GENERATIONS_V3 to timeuuid to
give them a time-based order. We also change how we initialize
them so that the new CDC generation always has the highest ID.
This is the last step to enabling the efficient clearing of
obsolete CDC generation data.
Additionally, we change the types of current_cdc_generation_uuid,
new_cdc_generation_data_uuid and the second values of the elements
in unpublished_cdc_generations to timeuuid, so that they match id
in CDC_GENERATIONS_V3.
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
In a future commit, we change how we initialize the uuid of the
new CDC generation in the Raft-based topology. This forces us to
move this initialization out of the make_new_generation_data
function shared between Raft-based and gossiper-based topologies.
We also rename make_new_generation_data to
make_new_generation_description since it only returns
cdc::topology_description now.
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.
After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
In the following commit, we implement the
get_cdc_generation_mutations_v3 function very similar to
get_cdc_generation_mutations_v2. The only differences in creating
mutations between CDC_GENERATIONS_V2 and CDC_GENERATIONS_V3 are:
- a need to set the num_ranges cell for CDC_GENERATIONS_V2,
- different partition keys,
- different clustering keys.
To avoid code duplication, we introduce
get_common_cdc_generation_mutations, which does most of the work
shared by both functions.
this change allows CMake to build the dist tarball for a certain build.
Refs https://github.com/scylladb/scylladb/issues/15241
Closes #15352
* github.com:scylladb/scylladb:
build: cmake: add packaging support
build: cmake: enable build of seastar/apps/iotune
The test-case creates an S3-backed ks, populates it with a table and data,
then forces a flush to make sstables appear on the backend. Then it
updates the registry by marking all the objects as 'removing' so that on
the next boot they will be garbage-collected.
After reboot, it checks that the table is "empty" and also validates that the
backend doesn't have the corresponding objects on board for real.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When booting, there can be dangling entries in the sstables registry as well
as objects on the storage itself. This patch makes the S3 lister list
those entries and then kick the s3_storage to remove the corresponding
objects. At the end, the dangling entries are removed from the registry.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lister method is going to list the dangling objects and then call
storage to actually wipe them (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the directory instance only creates the lister, but the lister is
unaware of exact object manipulations. The storage is, so create it
too; it's going to be used by the garbage collector in the next patches.
This change also needs fixing the way cql_test_env is configured for
schema_change_test. There are cases that try to pick up a keyspace with the S3
storage option from the pre-created sstables, and populating those
needs some (even empty) object storage endpoint to be provided.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently minio starts with a bucket that has public anonymous access. Respectively, all tests use unsigned S3 requests. That was done for simplicity, and it's better to apply some policy to the bucket and, consequently, make tests sign their requests.
Other than the obvious benefit that we test request signing in unit tests, another goal of this PR is to make it possible to simulate and test various error paths locally, e.g. #13745 and #13022.
Closes #14525
* github.com:scylladb/scylladb:
test/s3: Remove AWS_S3_EXTRA usage
test/s3: Run tests over non-anonymous bucket
test/minio: Create random temp user on start
code: Rename S3_PUBLIC_BUCKET_FOR_TEST
This is a workaround for the flakiness of the test where INSERT
statements following the rolling restart fail with "No host available"
exception. The hypothesis is that those INSERTS race with driver
reconnecting to the cluster and if INSERTs are attempted before
reconnection is finished, the driver will refuse to execute the
statements.
The real fix should be in the driver to join with reconnections but
before that is ready we want to fix CI flakiness.
Refs #14746Closes#15355
instead of fabricating an `/etc/passwd` manually, we can just
leave it to podman to add an entry to `/etc/passwd` in the container,
as podman allows us to map the user's account to the same UID in the
container. see
https://docs.podman.io/en/stable/markdown/options/userns.container.html.
this is not only a cosmetic change, it also avoids the permission-denied
failure when accessing `/etc/passwd` in the container when selinux is
enabled. without this change, we would otherwise need to add the
selinux label to the bind volume with the ':Z' option to address failures
like:
```
type=AVC msg=audit(1693449115.261:2599): avc: denied { open } for pid=2298247 comm="bash" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
type=AVC msg=audit(1693449115.263:2600): avc: denied { open } for pid=2298249 comm="id" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
```
found in `/var/log/audit/audit.log`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15230
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stem from the fact that the resulting reconcilable_result will be large:
1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque.
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. The coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s.
3. Too-large repair mutations. If reconciliation works on large pages, repair may fail due to too large a mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.
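The core idea of respecting the memory limit even for dead rows can be sketched as a simplified paging loop. This is an illustrative model, not Scylla's reader code:

```cpp
#include <cstddef>
#include <vector>

struct row { bool live; std::size_t bytes; };

// Simplified sketch of the fix: the page is closed once the accounted
// memory reaches the limit, even if no live row has been seen yet,
// instead of scanning through an unbounded run of dead rows.
std::vector<row> fill_page(const std::vector<row>& input, std::size_t limit) {
    std::vector<row> page;
    std::size_t used = 0;
    for (const auto& r : input) {
        if (used >= limit) {
            break; // memory accounter limit reached: end the page here
        }
        page.push_back(r);
        used += r.bytes;
    }
    return page;
}
```

With a long run of dead rows, each page stays small and bounded, so serialization costs per page are bounded too, which is why the reconciling read above produces many small pages instead of two huge ones.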
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical duration (a few-millisecond variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than a quarter of the before duration).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111
Closes#14923
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
scylla redistributes iotune, so let's enable the related build
options, so that we can build iotune on demand.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When the `nodetool disablebinary` command executes, its handler aborts listening sockets, shuts down all client connections _and_ (!) then waits for the connections to stop existing. Effectively, the command tries to make sure that no activity initiated by a CQL query continues, even though the client would never see its result (client sockets are closed).
This sometimes makes the disablebinary command hang for a long time, which is not really nice. The proposal is to wait for the connections to terminate in the background. So once the disablebinary command exits, what's guaranteed is that all client connections are aborted and new connections are not admitted, but some activity started by them may still be running (e.g. up until `nodetool drain` is issued). Driver-side sockets won't get the queries' results anyway.
The behavior of `disablebinary` is not documented wrt whether it should wait for CQL processing to stop or not, so technically we're not breaking anything. However, it may turn out to be a disruptive change, and some setups may behave differently after it.
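A minimal model of the proposed behavior, using made-up `server`/`connection` types rather than the real generic_server classes: stop() aborts connections synchronously, and waiting for their full termination is deferred to a background future that is picked up later:

```cpp
#include <future>
#include <vector>

// Toy stand-ins for the real generic_server classes (names are made up).
struct connection { bool aborted = false; bool terminated = false; };

struct server {
    std::vector<connection*> conns;
    std::future<void> background_wait; // picked up later, e.g. on drain/shutdown

    void stop() {
        // Synchronous part: abort every connection so no new work is admitted.
        for (auto* c : conns) {
            c->aborted = true;
        }
        // Asynchronous part: waiting for full termination happens in the
        // background, so stop() itself returns promptly.
        background_wait = std::async(std::launch::deferred, [this] {
            for (auto* c : conns) {
                c->terminated = true; // stand-in for awaiting real teardown
            }
        });
    }
};
```

After stop() returns, connections are guaranteed to be aborted, but their teardown may still be in flight until `background_wait` is awaited.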
refs: #14031
refs: #14711
Closes#14743
* github.com:scylladb/scylladb:
test/cql-pytest: Add enable|disable-binary test case
test.py: Add suite option to auto-dirty cluster after test
test/pylib: Add nodetool enable|disable-binary commands
transport: Shutdown server on disablebinary
generic_server: Introduce shutdown()
generic_server: Decouple server stopped from connection stopped
transport/controller: Coroutinize do_stop_server()
transport/controller: Coroutinize stop_server()
The test checks that `nodetool disablebinary` makes subsequent queries
fail and `nodetool enablebinary` lets clients establish new
connections.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
ScyllaCluster can be marked as 'dirty', which means that the cluster is
in an unusable state (after a test) and shouldn't be re-used by other tests
launched by test.py. For now this is only implemented via the cluster
manager class, which is only available for topology tests.
Add a less flexible shortcut for cql-pytest suites via suite.yaml marking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... and do the real "sharded::stop" in the background. On node shutdown
we need to pick up all dangling background stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method waits for listening sockets to stop listening and aborts the
connected sockets, but doesn't wait for the established connections to
finish processing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _stopped future resolves when all "sockets" stop -- listening and
connected ones. Future patches will need to wait for listening sockets
to stop separately from connected ones.
Rename `_stopped` to reflect what it is now, while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's better to pass a disengaged optional when
the caller doesn't have the information, rather than
passing the default dc_rack location, so that the latter
will never implicitly override a known endpoint dc/rack location.
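The pattern can be sketched as follows (names are illustrative, not the real locator code): a disengaged optional means "caller doesn't know", so a default location never silently wins over a known one:

```cpp
#include <optional>
#include <string>

// Illustrative sketch (names are hypothetical): a disengaged optional
// means "caller doesn't know", so a default location can never
// implicitly override a known endpoint dc/rack.
struct dc_rack { std::string dc, rack; };

inline dc_rack resolve(std::optional<dc_rack> hint, dc_rack known) {
    return hint ? *hint : known; // only an engaged optional overrides
}
```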
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15300
This function has practically returned true since inception.
In d38deef499 it started using
messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.
So there's no point calling it, since we can assume
that `uses_host_id` is true for all versions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scylladb uses seastar's coding-style.md, so let's adhere to it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15345
this series adds `--node-exporter-dir` and `--build-dir` options to `create-relocatable-package.py`. this enables us to create a relocatable package from arbitrary build directories.
Refs #15241
Closes#15299
* github.com:scylladb/scylladb:
create-relocatable-package.py: add --node-exporter-dir option
build: specify the build dir instead mode
so we can point `debian_files_gen.py` to a build directory other than
'build', and can optionally use another output directory. this would
help to reduce the number of "magic numbers" in our build system.
Refs https://github.com/scylladb/scylladb/issues/15241
Closes#15282
* github.com:scylladb/scylladb:
dist/debian: specify debian/* file encodings
dist/debian: wrap lines whose length exceeds 100 chars
dist/debian: add command line option for builddir
dist/debian: modularize debian_files_gen.py
The current read-loop fails to detect end-of-page: if the query
result builder cuts the page, it will just proceed to the next
partition. This results in distorted query results, as the result
builder will request the consumption to stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
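The shape of the fix can be modeled with a toy scan loop (all names are made up, not the real replica code): the builder may cut the page after any row, and the loop must check for that before advancing to the next partition:

```cpp
#include <cstddef>
#include <vector>

// Toy model of the fixed loop: the result builder may cut the page
// after any row, and the scan must notice that before moving on to
// the next partition.
struct page_builder {
    std::size_t rows_left;
    bool page_cut = false;
    void consume_row() {
        if (rows_left > 0 && --rows_left == 0) {
            page_cut = true; // page limit reached
        }
    }
};

// Returns how many partitions were visited before the page ended.
inline std::size_t scan(const std::vector<std::size_t>& partition_sizes, page_builder& b) {
    std::size_t visited = 0;
    for (auto rows : partition_sizes) {
        ++visited;
        for (std::size_t i = 0; i < rows && !b.page_cut; ++i) {
            b.consume_row();
        }
        if (b.page_cut) {
            break; // the fix: detect end-of-page instead of proceeding
        }
    }
    return visited;
}
```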
A unit test reproducing the bug was also added.
Currently, if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application just freezing.
Fix this by aborting the parsing thread explicitly in the
json_mutation_stream_parser destructor. If the parser thread exited
successfully, this will be a no-op, but on the error path it will
ensure that the parser thread doesn't hang.
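The mechanism can be sketched with std::thread (the real code uses a seastar thread and the json parser; every name here is invented): the worker loops until aborted, and the destructor raises the abort flag before joining, so a stalled consumer can no longer leave the thread hanging:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch only: the worker loops until aborted; the destructor raises
// the abort flag before joining, so an error path that stops consuming
// the parser's output can no longer leave the thread hanging forever.
class parser_thread {
    std::atomic<bool> _abort{false};
    std::atomic<bool>& _exited;
    std::thread _thread;
public:
    explicit parser_thread(std::atomic<bool>& exited)
        : _exited(exited)
        , _thread([this] {
            while (!_abort.load()) {
                // stand-in for "parse next chunk / wait for the consumer"
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            _exited.store(true);
        }) {}

    ~parser_thread() {
        _abort.store(true); // no-op if the thread already finished
        _thread.join();     // guaranteed to return now
    }
};
```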
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory, causing the page
size limiter not to kick in.
Fix by accounting the real size of accumulated frozen_mutation.
Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.
Refs #7933
Currently, a mutation query on the replica side will not respond with a
result that doesn't contain at least one live row. This causes problems
if there are many dead rows or partitions before we reach a live row,
which stems from the fact that the resulting reconcilable_result will be large:
* Large allocations. Serialization of reconcilable_result causes large
allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
side and on the coordinator side causes reactor stalls. This impacts
not only the query at hand. For 1M dead rows, freezing takes 130ms,
unfreezing takes 500ms. Coordinator does multiple freezes and
unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
may fail due to too large mutation size. 1M dead rows is already too
much: Refs #9111.
This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.
My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):
Node1: 1 live row, 1M dead rows
Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of
the query.
Before:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
After:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
Non-reconciling queries have almost identical duration (a few-millisecond
variation can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than a quarter of the before
duration).
Refs #7929
Refs #3672
Refs #7933
Fixes#9111
In the following commits, we modify the CDC_GENERATIONS_V3 schema
to enable efficient clearing of obsolete CDC generation data.
These modifications make the current get_cdc_generation_mutations
work only for the CDC_GENERATIONS_V2 schema, and we need a new
function for CDC_GENERATIONS_V3, so we add the "_v2" suffix.
actually, we never use its output in our workflow, and the
output is distracting when building the package. so, in this
change, let's print it only on demand. this feature is preserved
just in case some of us would want to use this script for getting
the version number string.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15327
if the user fails to set "CMAKE_BUILD_TYPE", it would be empty, and
CMake would fail with confusing error messages like
```
CMake Error at CMakeLists.txt:21 (list):
list sub-command FIND requires three arguments.
CMake Error at CMakeLists.txt:27 (include):
include could not find requested file:
mode.
```
so, in this change
* set the default CMAKE_BUILD_TYPE to "Release"
* quote ${CMAKE_BUILD_TYPE} when searching for it
  in the allowed build type list.
this should address the issues above.
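a hedged sketch of both fixes (the variable names `allowed_build_types` and `build_type_index` are illustrative, not necessarily the ones in the actual CMakeLists.txt):

```cmake
# 1. default CMAKE_BUILD_TYPE to "Release" when the user leaves it unset
if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE "Release" CACHE STRING "Build type" FORCE)
endif()

# 2. quote the value, so an empty string is still a valid list(FIND) argument
list(FIND allowed_build_types "${CMAKE_BUILD_TYPE}" build_type_index)
if(build_type_index EQUAL -1)
  message(FATAL_ERROR "unknown CMAKE_BUILD_TYPE: ${CMAKE_BUILD_TYPE}")
endif()
```

without the quotes, an empty CMAKE_BUILD_TYPE makes `list(FIND ...)` receive only two arguments, producing the confusing error above.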
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15326
The local node's dc:rack pair is cached in the system keyspace on start. However, most other code doesn't need it, as it gets dc:rack from topology or directly from the snitch. There are a few places left that still mess with the system keyspace cache, but they are easy to patch. So after this patch, all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.
Closes#15280
* github.com:scylladb/scylladb:
system_keyspace: Don't require snitch argument on start
system_keyspace: Don't cache local dc:rack pair
system_keyspace: Save local info with explicit location
storage_service: Get endpoint location from snitch, not system keyspace
snitch: Introduce and use get_location() method
repair: Local location variables instead of system keyspace's one
repair: Use full endpoint location instead of datacenter part
A reviewer noted that test_update_expression_list_append_non_list_arguments
has too much code duplication - the same long API call to run
"SET a = list_append(...)" was repeated many times.
So in this patch we add a short inner function "try_list_append" to
avoid this duplication.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes: #15298
- Adds type for each option.
- Filters out unused / invalid values, moves them to a separate section.
- Adds the term "liveness" to the glossary.
- Removes unused and invalid properties from the docs.
- Updates to the latest version of pyaml.
docs: rename config template directive
Closes#15164
in this series, we try to improve `unified-installer.rst`:
- encourage the user to install a smaller package
- run `./install.sh` directly instead of relying on `sh` pointing to `bash`
Closes#15325
* github.com:scylladb/scylladb:
doc: run install.sh directly
doc: install headless jdk in sample command line
Find progress of repair tasks based on the number of ranges
that have been repaired.
Fixes: [#1156](https://github.com/scylladb/scylla-enterprise/issues/1156).
Closes#14698
* github.com:scylladb/scylladb:
test: repair tasks test
repair: add methods making repair progress more precise
tasks: make progress related methods virtual
repair: add get_progress method to shard_repair_task_impl
repair: add const noexcept qualifiers to shard_repair_task_impl::ranges_size()
repair: log a name of a particular table repair is working on
tasks: delete move and copy constructors from task_manager::task::impl
compaction_done() returns a ready future before compaction_task_executor::run_compaction()
is called, even though the compaction did not start.
Make compaction_done() private and add a comment to warn against
incorrect usage.
Before compaction_task_executor::do_run is called, the executor may
already be aborted. Check if compaction was stopped and set
_compaction_done to an exceptional future.
For compaction_task_executors, unlike for all other task manager
tasks, the run method does not encompass the operations performed in the
scope of a task, but only waits until the shared_future connected with
the operations is resolved.
Apart from breaking task manager task conventions, such a run method
must consider all corner cases so as not to break task manager or
compaction manager functionality.
To fix existing bugs and prevent further ones related to task manager
and compaction manager coexistence, call perform_task inside the
run method and wait for it in the standard way.
Executors that are not going to be reflected in the task manager
call perform_task the old way.
SIGSEGV was caught during tablet streaming, and the reason was
that storage_service::_group0 (via set_group0()) is only set on
shard 0, therefore when streaming ran on any other shard,
it tried to dereference garbage, which resulted in the crash.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15307
before this change, filesystem_storage::open() reuses
`sstable::make_component_file_writer()` to create the
temporary toc, and renames the temporary toc to the
real TOC when sealing the sstable.
but this prevents us from reusing filesystem_storage in
yet another storage backend, as the
1. create temporary
2. rename temporary to toc
dance only applies to filesystem_storage. when
filesystem_storage calls into sstable, it calls `sst.make_component_file_writer()`,
which in turn calls `_storage->make_component_sink()`.
but at this moment, `_storage` is not necessarily `filesystem_storage`
anymore. it could be a wrapper around `filesystem_storage`
which is not aware of the create-rename dance, and could do
a lot more than create a temporary file when asked to
"make_component_sink()".
if we really want to go this way by reusing sstable's API
in `filesystem_storage` to create a temporary toc, we will
have to rename whatever temporary toc component is created
by the wrapper backend to the toc with the seal() func. but
again, this rename op is only implemented in the
filesystem_storage backend. mirroring this operation in
the wrapper backend does not make sense at all -- it
should not have to be aware of filesystem_storage's internals.
so in this change, instead of reusing
`sstable::make_component_file_writer()`, we just inline
its implementation in filesystem_storage to avoid this
problem. this is also an improvement from the design
perspective, as the storage should not call into the
higher abstraction -- sstable.
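the create/rename dance itself can be sketched like this (the component names follow the sstable convention, but the helper functions here are invented, not the real filesystem_storage code):

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Sketch of the dance that is private to filesystem_storage: the TOC is
// first written under a temporary name, and only renamed to the real
// one on seal, so a half-written sstable is never visible.
inline void write_temporary_toc(const fs::path& dir, const std::string& contents) {
    std::ofstream out(dir / "TemporaryTOC.txt");
    out << contents; // closed when 'out' goes out of scope
}

inline void seal(const fs::path& dir) {
    // An atomic rename makes the sstable visible only when complete.
    fs::rename(dir / "TemporaryTOC.txt", dir / "TOC.txt");
}
```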
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14443
seastar has deprecated the overload which accepts `server_name`,
let's use the one which accepts `tls::tls_options`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15324
Currently, the topology coordinator has the
`topology::transition_state::publish_cdc_generation` state responsible
for publishing the already created CDC generations to the user-facing
description tables. This process cannot fail as it would cause some CDC
updates to be missed. On the other hand, we would like to abort the
`publish_cdc_generation` state when bootstrap aborts. Of course, we
could also wait until handling this state finishes, even in the case of
the bootstrap abort, but that would be inefficient. We don't want to
unnecessarily block topology operations by publishing CDC generations.
The solution proposed by this PR is to remove the
`publish_cdc_generation` state completely and introduce a new background
fiber of the topology coordinator -- `cdc_generation_publisher` -- that
continually publishes committed CDC generations.
Apart from introducing the CDC generation publisher, we add
`test_cdc_generation_publishing.py` that verifies its correctness and we
adapt other CDC tests to the new changes.
Fixes#15194
Closes#15281
* github.com:scylladb/scylladb:
test: test_cdc: introduce wait_for_first_cdc_generation
test: move cdc_streams_check_and_repair check
test: add test_cdc_generation_publishing
docs: remove information about publish_cdc_generation
raft topology: introduce the CDC generation publisher
system_keyspace: load unpublished_cdc_generations to topology
raft topology: mark committed CDC generations as unpublished
raft topology: add unpublished_cdc_generations to system.topology
Add tests for gossiper/endpoint/live and gossiper/endpoint/down
which run only in release mode.
Enable test_remove_node_with_concurrent_ddl and fix the types and
variable names used by it, so that they can be reused in the gossiper
test.
Fixes: #15223.
Closes#15244
* github.com:scylladb/scylladb:
test: topology: add gossiper test
test: fix types and variable names in wait_for_host_down
ClangBuildAnalyzer reports cql3/cql_statement.hh as being one of the
most expensive header files in the project - being included (mostly
indirectly) in 129 source files, and costing a total of 844 CPU seconds
of compilation.
This patch is an attempt, only *partially* successful, to reduce the
number of times that cql_statement.hh is included. It succeeds in
lowering the number 129 to 99, but not less :-( One of the biggest
difficulties in reducing it further is that query_processor.hh includes
a lot of templated code, which needs stuff from cql_statement.hh.
The solution should be to un-template the functions in
query_processor.hh and move them from the header to a source file, but
this is beyond the scope of this patch and query_processor.hh appears
problematic in other respects as well.
Unfortunately the compilation speedup by this patch is negligible
(the `du -bc build/dev/**/*.o` metric shows less than 0.01% reduction).
Beyond the fact that this patch only removes 30% of the inclusions of
this header, it appears that most of the source files that no longer
include cql_statement.hh after this patch, included anyway many of the
other headers that cql_statement.hh included, so the saving is minimal.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#15212
strictly speaking, `sh` is not necessarily bash, while `install.sh`
is written in the bash dialect, and it errors out if it is not executed
with bash. also, we don't need to add "-x" when running the script; if
we have to, we should add it in `install.sh`, not ask the user to add
this option. finally, `install.sh` is executable with a shebang line
using bash, so we can just execute it.
so, in this change, we just launch this script in the command line
sample.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in comparison with java-11-openjdk, java-11-openjdk-headless does not
offer audio and video support, and has fewer dependencies. for instance,
java-11-openjdk depends on the X11 libraries, and it also provides
icons representing the JDK. but since scylla is a server-side
application, we don't expect the user to run a desktop on it, so there
is no need to support audio and video.
in this change, we just suggest a "smaller" package, which is
actually also a dependency of java-11-openjdk.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After introducing the CDC generation publisher,
test_cdc_log_entries_use_cdc_streams could (at least in theory)
fail by accessing system_distributed.cdc_streams_descriptions_v2
before the first CDC generation has been published.
To avoid flakiness, we simply wait until the first CDC generation
is published in a new function -- wait_for_first_cdc_generation.
The part of test_topology_ops that tests the
cdc_streams_check_and_repair request could (at least in theory)
fail on
`assert(len(gen_timestamps) + 1 == len(new_gen_timestamps))`
after introducing the CDC generation publisher because we can
no longer assume that all previously committed CDC generations
have been published before sending the request.
To prevent flakiness, we move this part of the test to
test_cdc_generations_are_published. This test allows for ensuring
that all previous CDC generations have been published.
Additionally, checking cdc_streams_check_and_repair there is
simpler and arguably fits the test better.
We add two test cases that test the new CDC generation publisher
to detect potential bugs like incorrect order of publications or
not publishing some generations at all.
The purpose of the second test case --
test_multiple_unpublished_cdc_generations -- is to enforce and test
a scenario when there are multiple unpublished CDC generations at
the same time. We expect that this is a rare case. The main fiber
of the topology coordinator would have to make much more progress
(like finishing two bootstraps) than the CDC generation publisher
fiber. Since multiple unpublished CDC generations might never
appear in other tests but could be handled incorrectly, having
such a test is valuable.
Currently, the topology coordinator has the
topology::transition_state::publish_cdc_generation state
responsible for publishing the already created CDC generations
to the user-facing description tables. This process cannot fail
as it would cause some CDC updates to be missed. On the other
hand, we would like to abort the publish_cdc_generation state when
bootstrap aborts. Of course, we could also wait until handling this
state finishes, even in the case of the bootstrap abort, but that
would be inefficient. We don't want to unnecessarily block topology
operations by publishing CDC generations.
The solution is to remove the publish_cdc_generation state
completely and introduce a new background fiber of the topology
coordinator -- cdc_generation_publisher -- that continually
publishes committed CDC generations.
The implementation of the CDC generation publisher is very similar
to the main fiber of the topology coordinator. One noticeable
difference is that we don't catch raft::commit_status_unknown,
which is handled by raft_group0_client::add_entry.
Note that this modification changes the Raft-based topology a bit.
Previously, the publish_cdc_generation state had to end before
entering the next state -- write_both_read_old. Now, committed
CDC generations can theoretically be published at any time.
Although it is correct because the following states don't depend on
publish_cdc_generation, it can cause problems in tests. For example,
we can't assume now that a CDC generation is published just because
the bootstrap operation has finished.
We extend service::topology with the list of unpublished CDC
generations and load its contents from system.topology. This step
is the last one in making unpublished CDC generations accessible
to the topology coordinator.
Note that when we load unpublished_cdc_generations, we don't
perform any sanity checks, unlike for current_cdc_generation_uuid.
Every unpublished CDC generation was the current generation once,
and we checked it at that moment.
We add committed CDC generations to unpublished_cdc_generations
so that we can load them to topology and properly handle them
in the following commits.
In the following commits, we replace the
topology::transition_state::publish_cdc_generation state with
a background fiber that continually publishes committed CDC
generations. To make these generations accessible to the
topology coordinator, we store them in the new column of
system.topology -- unpublished_cdc_generations.
If a test isn't going to use the task manager or isn't interested in
the statuses of finished tasks, then keeping them in memory
for some time (currently 10s by default) after they are finished
is a waste of memory.
Set default task_ttl value to zero. It can be changed by setting
--task-ttl-in-seconds or through rest api (/task_manager/ttl).
In conf/scylla.yaml set task-ttl-in-seconds to 10.
Closes#15239
Some tests use non-threaded do_with_cql_env() and wrap the inner lambda with seastar::async(). The cql env already provides a helper for that
Closes#15305
* github.com:scylladb/scylladb:
cql_query_test: Fix indentation after previous patch
cql_query_test: Use do_with_cql_env_thread() explicitly
Now that the keys and region can be configured with "standard"
environment variables, the old custom one can be removed. No automation
uses it; it was purely support for manual testing of a client against
AWS's S3 server.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently minio applies anonymous public policy for the test bucket and
all tests just use unsigned S3 requests. This patch generates a policy
for the temporary minio user and removes the anon public one. All tests
are updated respectively to use the provided key:secret pair.
The use-https bit is off by default as minio still starts with plain
http. That's OK for now, all tests are local and have no secret data
anyway
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code was supposed to be moved into
`mutate_live_and_unreachable_endpoints`
in 2c27297dbd
but it looks like the original statements were left
in place outside the mutate function.
This patch just removes the stale code since the required
logic is already done inside `mutate_live_and_unreachable_endpoints`.
Fixes scylladb/scylladb#15296
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15304
The user is going to have rights to access the test bucket. For now,
just create one and export it to the tests via the environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The bucket is going to stop being public; rename the env variable in
advance to make the essential patch smaller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These metrics mimic the existing IO ones -- total number of read
operations, total number of read bytes and total read delay. And the
same for writing.
This patch makes no distinction between writing an object with a plain
PUT vs putting it with multipart uploading. Instead, it "measures"
individual IO writes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the http client exports several "numbers" regarding the
number of transport connections the client uses. This patch exports
those via the S3 client's per-sched-group metrics and prepares the
ground for more metrics in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Another make_request() helper that does mostly the same will appear
later. This split will help avoid code duplication.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The http-client is per-sched-group. The next patch will need to keep
metrics per-sched-group too, and this sched-group -> http-client map is
a good place to put them. A wrapping struct will allow extending it
with metrics.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The stats are about the object, not about the client, so it's better if
they live in namespace scope. This also avoids conflicts with the
client stats that will be reported as metrics (later patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This serves two purposes. First, it fixes a potential use-after-move,
since the bufs are moved into the lambda and bufs.size() is called in
the same statement with no defined evaluation order.
Second, this makes the 'size' variable alive up to the time the request
is complete, thus making it possible to update stats with it (later patch).
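The hazard and the fix can be illustrated in isolation (function names invented; the real code is in the S3 client): C++ leaves the evaluation order of function arguments unspecified, so `f(std::move(bufs), bufs.size())` may read the size of an already-moved-from vector:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustration of the hazard: in a call like
//   do_upload(std::move(bufs), bufs.size())
// bufs.size() may be evaluated after the move.
inline std::size_t do_upload(std::vector<std::string> bufs) {
    return bufs.size(); // stand-in for actually sending the buffers
}

inline std::size_t safe_upload(std::vector<std::string> bufs) {
    auto size = bufs.size();             // read the size before any move
    auto sent = do_upload(std::move(bufs));
    // 'size' stays valid until the request completes, so it can also be
    // used to update stats afterwards.
    return sent == size ? size : 0;
}
```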
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The Alternator tests can run against HTTPS - namely when using
test/alternator/run with the "--https" option (local Alternator
configured with HTTPS) or "--aws" option (DynamoDB, using HTTPS).
In some cases we make these HTTPS requests with verify=False, to avoid
checking the SSL certificates. E.g., this is necessary for Alternator
with a self-signed certificate. Unfortunately, the urllib3 library adds
an ugly warning message when SSL certificate verification is disabled.
In the past we tried to disable these warnings, using the documented
urllib3.disable_warnings() function, but it didn't help. It turns out
that pytest has its own warning handling, so to disable warnings in
pytest we must say so in a special configuration parameter in pytest.ini.
So in this patch, we drop the disable_warnings call from conftest.py
(where it didn't help), and instead put a similar declaration in
pytest.ini. The disable_warnings call in the test/alternator/run
script needs to remain - it is run outside pytest, so pytest.ini
doesn't affect it.
After this patch, running test/alternator/run with --https or --aws
finishes without warnings, as desired.
Fixes#15287
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#15292
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.
Prepare for extending the token_metadata interface
to provide host_id in update_topology.
We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).
Closes#15297
* github.com:scylladb/scylladb:
locator: topology: is_configured_this_node: delete spurious semicolumn
locator: topology: is_configured_this_node: compare host_id first
Some tests use non-threaded do_with_cql_env() and wrap the inner lambda
with seastar::async(). The cql env already provides a helper for that
Indentation is deliberately left broken until next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we assume that node_exporter artifacts are
always located under `build/node_exporter`. but this might not
hold anymore if we want to have a self-contained build, in the sense
that different builds do not share the same set of node_exporter
artifacts. this could be a waste, as the node_exporter artifacts
are identical across different builds, but it makes things
a lot simpler -- different builds do not have to hardwire to
a certain directory.
so, a new option is added to `create-relocatable-package.py`, this
allows us to specify the directory where node_export artifacts
are located.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of specifying the build "mode", and assuming that
the build directory is always located at "build/${mode}", specify
the build directory explicitly. this allows us to use
`create-relocatable-package.py` to package artifacts built
in a build directory whose path does not comply with the
naming convention, for instance, we might want to build
scylla in `build/yet-another-super-feature/release`.
so, in this change, we trade `--mode` for an option named
`--build-dir` and update `configure.py` accordingly.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is thrown in place
of an `assert()`, which is supposed to verify programming
invariants, not to identify corrupted data files.
Fixes #15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15264
Passing the gate_closed_exception to the task promise
ends up with an abandoned exception since no one is waiting
for it.
Instead, enter the gate when the task is made,
so that make_task fails if the gate is already closed.
Fixes scylladb/scylladb#15211
In addition, this series adds a private abort_source for each task_manager module
(chained to the main task_manager::abort_source) and abort is requested on task_manager::module::stop().
Gate holding in compaction_manager is hardened,
and the series makes sure to stop compaction_manager and task_manager in sstable_compaction_test cases.
Closes #15213
* github.com:scylladb/scylladb:
compaction_manager: stop: close compaction_state:s gates
compaction_manager: gracefully handle gate close
task_manager: task: start: fixup indentation
task_manager: module: make_task: enter gate when the task is created
task_manager: module: stop: request abort
task_manager: task::impl: subscribe to module abort_source
test: compaction_manager_stop_and_drain_race_test: stop compaction and task managers
test: simple_backlog_controller_test: stop compaction and task managers
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.
Prepare for extending the token_metadata interface
to provide host_id in update_topology.
We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Improved the coverage of the tests for the list_append() function
in UpdateExpression - test that if one of its arguments is not a list,
including a missing attribute or item, it is reported as an error as
expected.
The new tests pass on both Alternator and DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15291
so we can point debian_files_gen.py to a build directory other than
'build', and can optionally use a different output directory. this would
help reduce the number of hard-coded paths in our build system.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
restructure the script into functions, prepare for the change which
allows us to specify the build directory when preparing the "debian"
packaging recipes.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of flattening the logic into the script, let's structure it into functions, so they can be reused and the script is more maintainable.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15242
* github.com:scylladb/scylladb:
build: early return when appropriate
build: extract generate_compdb() out
Right now, the function allows passing the path to a file as a seastar::sstring,
which is then converted to std::filesystem::path -- implicitly, from the caller's perspective.
However, the function performs I/O, and there is no reason to accept any type
other than std::filesystem::path, especially because the conversion is straightforward:
callers can perform it on their own.
This commit introduces the more constrained API.
Closes #15266
This series cleans up gossiper.
Methods that do not change the gossiper object are marked as const.
Dead code is removed.
Closes #15272
* github.com:scylladb/scylladb:
gossiper: get_current* methods: mark as const
gossiper: get_generation_for_nodes: mark as const
gossiper: examine_gossiper: mark as const
gossiper: request_all, send_all: mark as const
gossiper: do_on_*notifications: mark as const
utils: atomic_vector: mark for_each functions as const
gossiper: compare_endpoint_startup: mark as const
gossiper: get_state_for_version_bigger_than: mark as const
gossiper: make_random_gossip_digest: delete dead legacy code
gossiper: make_random_gossip_digest: mark as const
gossiper: do_sort: mark as const
gossiper: is* methods: mark as const
gossiper: wait_for_gossip and friends: mark as const
gossiper: drop unused dump_endpoint_state_map
gossiper: remove unused shadow version members
On boot, the system keyspace is kicked to insert local info into the system.local
table. Among other things, there's the dc:rack pair, which sys.ks. gets from
its cache, which in turn should have been previously initialized from the
snitch on sys.ks. start. This patch makes the local info updating method
get the dc:rack from the caller via an argument. Callers, in turn, call the snitch
directly, because these are the main and cql_test_env startup routines.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service needs to get local dc:rack pair in some places and it
calls system keyspace's local_dc_rack() method for it. However, the
method returns back the data from sys.ks. cache which, in turn, was
previously initialized from snitch's data. This patch makes storage
service get location from snitch directly, without messing with system
keyspace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some places out there that generate locator::endpoint_dc_rack
pair out of snitch's get_datacenter() and get_rack() calls. Generalize
those with snitch's new method. It will also be used by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous patch made the full endpoint location available as a local
variable near the places that get this location from the system
keyspace. This patch replaces the sys.ks. calls with the variables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places in repair code that get datacenter from the
topology. Nearby there are calls to update_topology() which, in turn,
needs full location ({dc, rack} pair). This patch makes the former
places obtain full location from topology and get the dc part from it.
This is needed as a preparation to let latter places use that location.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"experimental" option was marked "Unused" in 64bc8d2f7d. but we
chose to keep it in hope that the upgrade test does not fail.
despite that the upgrade tests per-se survived the "upgrade",
after the upgrade, the tests exercising the experimental features
are still failing hard. they have not been updated to set the
"experimental-features" option, and are still relying on
"experimental" to enable all the experimental features under
test.
so, in this change, let's just drop the option so that
scylla can fail early at seeing this "experimental" option.
this should help us to identify the tests relying on it
quicker. as the "experimental" features should only be used
in development environment, this change should have no impact
to production.
Refs #15214
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15233
Make sure the compaction_state:s are idle before
they are destroyed. Although all tasks are stopped
in stop_ongoing_compactions, make sure there is no
fiber holding the compaction_state gate.
compaction_manager::remove now needs to close the
compaction_state gate and to call stop_ongoing_compactions
only if the gate is not closed yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Check if the compaction_state gate is closed
along with _state != state::enabled and return early
in this case.
At this point entering the gate is guaranteed to succeed.
So enter the gate before calling `perform_compaction`
keeping the std::optional<gate_holder> throughout
the compaction task.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Passing the gate_closed_exception to the task promise in start()
ends up with an abandoned exception since no one is waiting
for it.
Instead, enter the gate when the task is made,
so that make_task fails if the gate is already closed.
Fixes scylladb/scylladb#15211
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Have a private abort_source for every module
and request abort on stop() to signal all outstanding
tasks to abort (especially when they are sleeping
for the task_ttl).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than to the top-level task_manager abort_source,
to provide separation between task_manager modules,
so each one can be aborted and stopped independently
of the others (in the next patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since d42685d0cb, the storage service has an on-board query processor ref^w
pointer and can use it to join the cluster.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15236
instead of flattening the logic into the script, let's structure
it into functions, so they can be reused and the script is more
maintainable.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replacing `minimum_keyspace_rf` config option with 4 config options:
`{minimum,maximum}_replication_factor_{warn,fail}_threshold`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on RF when creating/altering a keyspace.
The reason to replace rather than extend the `minimum_keyspace_rf` config
option is to be aligned with Cassandra, which did the same and has the
same parameter names.
Only min soft limit is enabled by default and it is set to 3, which means
that we'll generate a CQL warning whenever RF is set to either 1 or 2.
RF's value of 0 is always allowed and means that there will not be any
replicas in a given DC. This was agreed with PM.
Because we don't allow changing guardrails' values while scylla is
running (per PM), there are no tests provided with this PR; dtests will be
provided separately.
Exceeding guardrails' thresholds will be tracked by metrics.
Resolves#8619
Refs #8892 (the RF part, not the replication-strategy part)
Closes #14262
We need to const_cast `this` since the const
container() has no const invoke_on override.
Trying to fix this in seastar sharded.hh breaks
many other call sites in scylla.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They only need to access the _vec_lock rwlock,
so mark it as mutable; otherwise they provide a const
interface to the callers, as the called func receives
the entries by value and cannot modify them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we start group 0 leadership monitor fiber only during
a normal bootstrap. However, we should also do it when we restart
a node (either with or without upgrading it to Raft).
Fixes #15166
Closes #15204
in a4eb3c6e0f, we passed the path of the
"image" file to `dbuild`, but that was wrong: we should pass its content
to this script. so in this change, it is fixed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15247
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See https://github.com/scylladb/scylladb/issues/11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.
This series adds such a configurable limit, sets it to 0.2 of total cache memory by default,
and re-enables index caching by default.
Fixes #15118
Closes #14994
* github.com:scylladb/scylladb:
test: boost/cache_algorithm_test: add cache_algorithm_test
sstables: partition_index_cache: deglobalize stats
utils: cached_file: deglobalize cached_file metrics
db: config: enable index caching by default
config: add index_cache_fraction
utils: lru: add move semantics to list links
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.
This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.
But it also makes sense even without that motive.
Move cached_file metrics from a thread_local variable
to cache_tracker.
This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.
But it also makes sense even without that motive.
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See #11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.
This patch sets such a limit to 0.2 of total cache memory, and re-enables
index caching by default.
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.
This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggered. We will change the default in a future
patch.
Before the patch, fixing list links is done manually in the move constructor of
`evictable`. After the patch, it is done by the move constructors of the links
themselves.
This makes for slightly cleaner code, especially after we add more links in an
upcoming patch.
This series ensures that endpoint state changes (for each single endpoint) are applied to the gossiper endpoint_state_map as a whole and on all shards.
Any failure in the process will keep the existing endpoint state intact.
Note that verbs that modify the endpoint states of multiple endpoints may still succeed in modifying some of them before hitting an error, and those changes are committed to the endpoint_state_map, so we don't ensure atomicity when updating multiple endpoints' states.
Fixes scylladb/scylladb#14794
Fixes scylladb/scylladb#14799
Closes #15073
* github.com:scylladb/scylladb:
gossiper: move endpoint_state by value to apply it
gossiper: replicate: make exception safe
gms: pass endpoint_state_ptr to endpoint_state change subscribers
gossiper: modify endpoint state only via replicate
gossiper: keep and serve shared endpoint_state_ptr in map
gossiper: get_max_endpoint_state_version: get state by reference
api/failure_detector: get_all_endpoint_states: reduce allocations
cdc/generation: get_generation_id_for: get endpoint_state&
gossiper: add for_each_endpoint_state helpers
gossiper: add num_endpoints
gossiper: add my_endpoint_state
This PR collects followups described in #14972:
- The `system.topology` table is now flushed every time feature-related
columns are modified. This is done because of the feature check that
happens before the schema commitlog is replayed.
- The implementation now guarantees that, if all nodes support some
feature as described by the `supported_features` column, then support
for that feature will not be revoked by any node. Previously, in an
edge case where a node is the last one to add support for some feature
`X` in `supported_features` column, crashes before applying/persisting
it and then restarts without supporting `X`, it would be allowed to boot
anyway and would revoke support for the `X` in `system.topology`.
The existing behavior, although counterintuitive, was safe - the
topology coordinator is responsible for explicitly marking features as
enabled, and in order to enable a feature it needs to perform a special
kind of a global barrier (`barrier_after_feature_update`) which only
succeeds after the node has updated its features column - so there is no
risk of enabling an unsupported feature. In order to make the behavior
less confusing, the node now will perform a second check when it tries
to update its `supported_features` column in `system.topology`.
- The `barrier_after_feature_update` is removed and the regular global
`barrier` topology command is used instead. The `barrier` handler now
performs a feature check if the node did not have a chance to verify and
update its cluster features for the second time.
The JOIN_NODE rpc will be sent separately, as it is a big item on its own.
Fixes: #14972
Closes #15168
* github.com:scylladb/scylladb:
test: topology{_experimental_raft}: don't stop gracefully in feature tests
storage_service: remove _topology_updated_with_local_metadata
topology_coordinator: remove barrier_after_feature_update
topology_coordinator: perform feature check during barrier
storage_service: repeat the feature check after read barrier
feature_service: introduce unsupported_feature_exception
feature_service: move startup feature check to a separate function
topology_coordinator: account for features to enable in should_preempt_balancing
group0_state_machine: flush system.topology when updating features columns
Add a document describing in detail how to use clang's "-ftime-trace"
option, and the ClangBuildAnalyzer tool, to find the source files,
header files and templates which slow down Scylla's build the most.
I've used this tool in the past to reduce Scylla build time - see
commits:
fa7a302130 (reduced 6.5%)
f84094320d (reduced 0.1%)
6ebf32f4d7 (reduced 1%)
d01e1a774b (reduced 4%)
I'm hoping that documenting how to use this tool will allow other
developers to suggest similar commits.
Refs #1.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15209
to lower the programmer's cognitive load: a programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.
Refs #15187
Closes #15188
* github.com:scylladb/scylladb:
sstable: add samples of fname to be matched by regex
sstables: change make_descriptor() to accept fs::path
sstables: switch entry_descriptor(sstring..) to std::string_view
sstables: change make_descriptor() to accept fs::path
the dbuild script provided by the branch being debugged might not
include the recent fixes in the current branch, from which
`open-coredump.sh` is launched.
so, instead of using the dbuild script in the repo being debugged,
let's use the dbuild provided by the current branch. also, wrap the
dbuild command line for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15240
Scrub tests use a lot of temporary directories. This is suspected to
cause problems in some cases. To improve the situation, this patch:
* Creates a single root temporary directory for all scrub tests
* All further fixtures create their files/directories inside this root
dir.
* All scrub tests create their temporary directories within this root
dir.
* All temporary directories now use an appropriate "prefix", so we can
tell which temporary directory is part of the problem if a test fails.
Refs: #14309
Closes #15117
server_impl::apply_snapshot() assumes that it cannot receive a snapshot
from the same host until the previous one is handled, and usually this is
true, since a leader will not send another snapshot until it gets a
response to the previous one. But it may happen that the snapshot-sending
RPC fails after the snapshot was sent, but before the reply is received,
because of a connection disconnect. In this case the leader may send
another snapshot, and there is no guarantee that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
* seastar 0784da87...6e80e84a (29):
> Revert "shared_token_bucket: Make duration->tokens conversion more solid"
> Merge 'chunked_fifo: let incremetal operator return iterator not basic_iterator' from Kefu Chai
> memory: diable transparent hugepages if --overprovisioned is specified
Ref https://github.com/scylladb/scylladb/issues/15095
> http/exception: s/<TAB>/ /
> install-dependencies.sh: re-add protobuf
> Merge 'Keep capacity on fair_queue_entry' from Pavel Emelyanov
> Merge 'Fix server-side RPC stream shutdown' from Pavel Emelyanov
Fixes https://github.com/scylladb/scylladb/issues/13100
> smp: make service management semaphore thread local
> tls_test: abort_accept() after getting server socket
> Merge 'Print more IO info with ioinfo app' from Pavel Emelyanov
> rpc: Fix client-side stream registration race
Ref https://github.com/scylladb/scylladb/issues/13100
> tests: perf: shard_token_bucket: avoid capturing unused variables in lambdas
> build: pass -DBoost_NO_CXX98_FUNCTION_BASE to C++ compiler
> reactor: Drop some dangling friend declarations
> fair_queue: Do not re-evaluate request capacity twice
> build: do not use serial number file when signing a cert
> shared_token_bucket: Make duration->tokens conversion more solid
> tests: Add perf test for shard_token_bucket
> Merge 'Make make_file_impl() less yielding' from Pavel Emelyanov
> fair_queue: Remove individual requests counting
> reactor, linux-aio: print value of aio-max-nr on error
> Merge 'build, net: disable implicit fallthough' from Kefu Chai
> shared_token_bucket: Fix duration_for() underflow
> rpc: Generalize get_stats_internal() method
> doc/building-dpdk.md: fix invalid file path of README-DPDK.md
> install-dependencies: add centos9
> Merge 'log: report scheduling group along with shard id' from Kefu Chai
> dns: handle exception in do_sendv for udp
> Merge 'Add a stall detector histogram' from Amnon Heiman
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15218
change another overload of `make_descriptor()` to accept `fs::path`,
in the same spirit as a previous change in this area, so we have
a more consistent API for creating sstable descriptors, and this
new API is simpler to use.
Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so its callers don't need to construct a temporary `sstring` if
the parameter's type is not `sstring`. for instance, before
this change, `entry_descriptor::make_descriptor(const std::filesystem::path...)`
would have to construct two temporary instances of `sstring`
for calling this function.
after this change, it does not have to do so.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to lower the programmer's cognitive load. as programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.
Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The current cluster feature tests are stopping nodes in a graceful way.
Doing it gracefully isn't strictly necessary for the test scenarios and
we can switch `server_stop_gracefully` calls to `server_stop`. This only
became possible after a previous commit which causes the `system.topology`
table to be flushed when cluster feature columns are modified, and will
serve as a good test for it.
After removing barrier_after_feature_update, the flag is no longer
needed by anybody. The field in storage_service is removed and
do_update_topology_with_local_metadata is inlined.
The `barrier_after_feature_update` was introduced as a variant of the
`barrier` command, meant to be used by the topology coordinator when
enabling a feature. It was meant to give more guarantees to the topology
coordinator than the regular barrier, but the regular barrier has been
adjusted in the previous commits so that it can be used instead of the
special barrier.
This commit gets rid of `barrier_after_feature_update` and replaces its
uses with `barrier`.
Due to the possible situation where a node applies a command that
advertises support for a feature but crashes before applying it, there
is a period of time where a node might have its group 0 server running
but does not support all of the features. Currently, we solve the issue
by using a special `barrier_after_feature_update` which will not succeed
until the node makes sure to update its `supported_features` column (or,
since the previous commit, shuts down if it doesn't support all required
features).
However, we can make it work with regular barrier after adjusting it
slightly. In case the local metadata was not updated yet, it will
perform a feature check. This will make sure that the global barrier
issued by the topology coordinator before enabling features will not
succeed if the problematic situation occurs.
We would like to guarantee the following property: if all nodes have
some feature X in their `supported_features` column in
`system.topology`, then it's no longer possible for anybody to revoke
support for it. Currently, it is not guaranteed because the following
can happen:
1. A node commits a command that updates its `supported_features`,
marking feature X as supported. It is the last node to do so and now
all nodes support X.
2. Node crashes before applying the command locally.
3. Node is downgraded not to support X and restarted.
4. The feature check in `enable_features_on_startup` passes because it
happens before starting the group 0 server.
5. The `supported_features` column is updated in
`update_topology_with_local_metadata`, removing support for X.
Even though the guarantee does not hold, it's not a problem because
`barrier_after_feature_update` is required to succeed on all nodes
before the topology coordinator moves to enable a feature, and - as the name
suggests - it requires `update_topology_with_local_metadata` to finish.
However, choosing to give this guarantee makes it simpler to reason
about how cluster features on raft work and removes some pathological
cases (e.g. trying to downgrade some other node after step 1 will fail,
but will be again possible after step 5). Therefore, this commit adds a
second check to `update_topology_with_local_metadata` which disallows
removing support for a feature that is supported by everybody - and
stops the boot process if necessary.
The logic responsible for checking supported features against the
currently enabled features (and features that are unsafe to disable) is
moved to a separate function, `check_features`. Currently, it is only
used from `enable_features_on_startup`, but more checks against features
in raft will be added in the commits that follow.
First replicate the new endpoint_state on all shards
before applying the replicated endpoint_state objects
to _endpoint_state_map.
Fixes scylladb/scylladb#14794
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the endpoint_state isn't changed in place,
we do not need to copy it to each subscriber.
We can instead just pass the lw_shared_ptr holding
a snapshot of it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And restrict the accessor methods to return const pointers
or references.
With that, the endpoint_state_ptr:s held in the _endpoint_state_map
point to immutable endpoint_state objects - with one exception:
the endpoint_state update_timestamp may be updated in place,
but the endpoint_state_map is immutable.
replicate() replaces the endpoint_state_ptr in the map
with a new one to maintain immutability.
A later change will also make this exception safe:
replicate will guarantee strong exception safety, so that either all shards
are updated or none of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in place anyway.
While internally, gossiper still has the legacy helpers
to manage the endpoint_state.
Fixes scylladb/scylladb#14799
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reserve the result vector based on the known
number of endpoints and then move-construct each entry
rather than copying it.
Also, use references to traverse the application_state_map
rather than copying each of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to look up the application_state again using the
endpoint, as both callers already have a reference to
the endpoint_state handy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before changing _endpoint_state_map to hold a
lw_shared_ptr<endpoint_state>, provide synchronous helpers
for users to traverse all endpoint_states with no need
to copy them (as long as the called func does not yield).
With that, gossiper::get_endpoint_states() can be made private.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Return the number of endpoints tracked by gossiper.
This is useful when the caller doesn't need
access to the endpoint states map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Gets or creates the endpoint_state for this node,
instead of accessing _endpoint_state_map directly.
Do this before changing the map to hold a lw_shared_ptr<endpoint_state>
in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
open-coredump.sh allows us to specify --scylla-repo-path, but the
developer's remote repo name is not always "origin" -- "origin"
could be their own remote repo, not the one from which
we want to pull.
so, in this change, assuming that the remote repo to be pulled
from has been added to the local repo, we query the local repo
for its name and pull using that name instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15220
before this change, we don't configure the selinux label when
binding the shared volume, but this results in permission denied
when accessing `/opt/scylladb` in the container when selinux
is enabled. since we are not likely to share the volume with
other containers, we can use `Z` to indicate that the bind
mount is private and unshared. this allows the launched container
to access `/opt/scylladb` even if selinux is enabled.
since selinux is enabled by default on an installation of fedora 38,
this change should improve the user experience of open-coredump
when a developer uses fedora distributions.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15229
Override the methods returning the expected number of children and job size
in repair tasks. With them, the get_progress method will be able to
return a more precise progress value.
Change 6449c59 brought back abort on listener failure, but this causes crashes when listeners hit expected errors like gate_closed.
Detect shutdown using the gossiper _abort_source,
and in this case just log a warning about the errors but do not abort.
Fixes scylladb/scylladb#15031
Closes #15100
* github.com:scylladb/scylladb:
gossiper: apply_new_states: tolerate listener errors during shutdown
gossiper: do_on_change_notifications: check abort source
gossiper: lock_endpoint_update_semaphore: get_units with _abort_source
gossiper: lock_endpoint: get_units with _abort_source
gossiper: is_enabled: consider also _abort_source
We want to stop supporting IPs for `--ignore-dead-nodes` in
`raft_removenode` and `--ignore-dead-nodes-for-replace` for
`raft_replace`. However, we shouldn't remove these features without the
deprecation period because the original `removenode` and `replace`
operations still support them. So, we add them for now.
The `IP -> Raft ID` translation is done through the new
`raft_address_map::find_by_addr` member function.
We update the documentation to inform about the deprecation of the IP
support for `--ignore-dead-nodes`.
Fixes #15126
Closes #15156
* github.com:scylladb/scylladb:
docs: inform about deprecating IP support for --ignore-dead-nodes
raft topology: support IPs for --ignore-dead-nodes
raft_address_map: introduce find_by_addr
Use common code to notify subscribers on_dead
from remove_endpoint() and from mark_dead().
Modeled after do_on_change_notifications.
Refs https://github.com/scylladb/scylladb/pull/15179#discussion_r1306969125
Closes #15206
* github.com:scylladb/scylladb:
gossiper: remove_endpoint: get the endpoint_state before yielding
gossiper: add do_on_dead_notifications
The cql_test_env has a virtual require_column_has_value() helper that better fits the cql_assertions crowd. Also, the helper in question duplicates some existing code, so it can also be made shorter (and one class table helper gets removed afterwards).
Closes #15208
* github.com:scylladb/scylladb:
cql_assertions: Make permit from env
table: Remove find_partition_slow() helper
sstable_compaction_test: Do not re-decorate key
cql_test_env: Move .require_column_has_value
cql_test_env: Use table.find_row() shortcut
Motivation:
The user can bootstrap 3 different clusters and then connect them
(#14448). When these clusters start gossiping, their token rings will be
merged, but there will be 3 different group 0s in there. It results in a
corrupted cluster.
We need to prevent such situations from happening in clusters which
don't use Raft-based topology.
-------
Gossiper service sets its group0 id on startup if it is stored in
`scylla_local` or sets it during joining group0.
Send group0_id (if it is set) when the node tries to initiate the gossip
round. When a node gets gossip_digest_syn it checks if its group0 id
equals the local one and if not, the message is discarded.
Fixes #14448
Performed manual tests with the following scenario:
1. setup a cluster of two nodes (one compiled with and one without this patch)
2. setup a new node
3. create a basic keyspace and table
4. execute simple select and insert queries
Tested 4 scenarios: the seed node was with or without this patch, and the third node was with or without this patch.
These tests didn't detect any errors.
Closes #15004
* github.com:scylladb/scylladb:
tests: raft: cluster of nodes with different group0 ids
gossip: add group0_id attribute to gossip_digest_syn
This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb
(see also ae6fdf1599)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15203
To call table::find_row() one needs to provide a permit. Tests have a
short and neat helper to create one from cql_test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
these methods are only used by the public methods of this class and
its derived class "memtable_encoding_stats_collector".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15190
The is_partition_dead() local helper accepts a partition key argument and
decorates it. However, its caller gets the partition key from the decorated
key itself, and can just pass it along.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The require_column_has_value() finds the cell in three steps -- finds
partition, then row, then cell. The class table already has a method to
facilitate row finding by partition and clustering key
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main goal of this PR is to stop cdc_generation_service from calling
system_keyspace::bootstrap_complete(). The reason why it's there is that
the gen. service doesn't want to handle generations before the node joined
the ring or after it was decommissioned. The cleanup is done with the help
of the storage_service->cdc_generation_service explicit dependency brought
back, and this, in turn, freed the raft and API code from the
need to carry a cdc gen. service reference around.
Closes #15047
* github.com:scylladb/scylladb:
cdc: Remove bootstrap state assertion from after_join()
cdc: Rework gen. service check for bootstrap state
api: Don't carry cdc gen. service over
storage_service: Use local cdc gen. service in join_cluster()
storage_service: Remove cdc gen. service from raft_state_monitor_fiber()
raft: Do not carry cdc gen. service over
storage_service: Use local cdc gen. service in topo calls
storage_service: Bring cdc_generation_service dependency back
In #14722, a source of work was added to `handle_topology_transition`,
but the `should_preempt_balancing` function was not updated accordingly,
as is suggested by the comment in `handle_topology_transition`. This
omission happened due to a hasty rebase.
This commit fixes the issue, and now `should_preempt_balancing` will
return true if there are some features that should be enabled.
The `supported_features` and `enabled_features` columns from
`system.topology` are read during the feature check that happens early
on boot. The check enforces two properties:
- A node is not allowed to revoke support for a feature after it notices
in its local topology state that the feature is supported by all
nodes.
- Similarly, a node is not allowed to revoke support for a feature after
seeing that it was put to the `enabled_features` column by the
topology coordinator.
However, due to the fact that the check has to happen before (schema)
commitlog replay and the table is not explicitly flushed when
`supported_features` or `enabled_features` columns are modified, the
feature check on boot might operate on old data and not do its job
properly.
In order to fix this, this commit modifies the `group0_state_machine` so
that it flushes the `system.topology` table every time the
`supported_features` or `enabled_features` column is modified, and after
every snapshot transfer.
The reproducer for #14448.
The test starts two nodes with different group0_ids. The second node
is restarted and tries to join the cluster consisting of the first node.
gossip_digest_syn message should be rejected by the first node, so
the second node will not be able to join the cluster.
This test uses repair-based node operations to make verification easier.
If the second node successfully joins the cluster, their token metadata
will be merged and the repair service will allow decommissioning the second node.
If not, decommissioning the second node will fail with an exception
"zero replica after the removal" thrown by the repair service.
Gossiper service sets its group0 id on startup if it is stored in `scylla_local`
or sets it during joining group0.
Send group0_id (if it is set) when the node tries to initiate the gossip round.
When a node gets gossip_digest_syn it checks if its group0 id equals the local
one and if not, the message is discarded.
Fixes #14448.
We want to call the on_dead notifications if the
node was alive and it had endpoint_state.
Get the ep state before we may yield in
mutate_live_and_unreachable_endpoints, similarly
to mark_dead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes #12261
Fixes #15088
Also, add rest_api unit test for those apis
Closes #15093
* github.com:scylladb/scylladb:
test: rest_api: add test_gossiper
gossiper: add get_unreachable_members_synchronized
As was described in the previous patch, this method is explicitly called
by storage service after updating the bootstrap state, so it's unneeded
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The legacy_handle_cdc_generation() checks if the node had bootstrapped
with the help of system_keyspace method. The former is called in two
cases -- on boot via cdc_generation_service::after_join() and via
gossiper on_...() notifications. The notifications, in turn, are set up
in the very same after_join().
The after_join(), in turn, is called from storage_service explicitly
after the bootstrap state is updated to be "complete", so the check for
the state in legacy_handle_...() seems unnecessary. However, there's
still one case when it may be stepped on -- decommission. When performed,
it calls storage_service::leave_ring() which updates the bootstrap state
to be "needed", thus preventing the cdc gen. service from doing anything
inside gossiper's on_...() notifications.
It's more correct to stop the cdc gen. service from handling gossiper
notifications by unsubscribing it, rather than by adding fragile implicit
dependencies on the bootstrap state.
Checks for sys.dist.ks in the legacy_handle_...() are kept in the form
of an on-internal-error. The system distributed keyspace is activated by
storage service even before the bootstrap state is updated and is
never deactivated, but it's good to have this assertion anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a storage_service/cdc_streams_check_and_repair endpoint that
needs to provide cdc gen. service to call storage_service method on. Now
the latter has its own reference to the former and API can stop taking
care of that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question accepts a cdc_generation_service ref argument from
main and cql_test_env, but storage service now has a local cdc gen.
service reference, so this argument and its propagation down the stack
can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a cdc_generation_service ref sitting on group0_state_machine and
the only reason it's there is to call storage_service::topology_...()
mathods. Now when storage service can access cdc gen. service on its
own, raft code can forget about cdc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology_state_load() and topology_transition() both take the cdc gen.
service as an argument, but can work with the local reference. This
makes the next patch possible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It sort of reverts the 5a97ba7121 commit, because storage service now
uses the cdc generation service to serve raft topology updates; previously,
the raft code carried the cdc gen. service around _just_ to pass
it as an argument to storage service topo calls.
Also, the API carries the cdc gen. service for a single call, and there's
an implicit need to kick the cdc gen. service on decommission, which
also needs storage service to reference the cdc gen. service after boot is complete.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.
Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138
The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141
Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15142
This function is too flaky with the 30 seconds timeout.
For example, the following was seen locally with
`test_updated_shards_during_add_decommission_node` in dev mode:
alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node6.log:
```
INFO 2023-08-27 15:47:25,753 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:30,754 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:35,761 [shard 0] gossip - (rate limiting dropped 495 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:40,766 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:45,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:50,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
ERROR 2023-08-27 15:47:55,758 [shard 0] gossip - Timed out waiting for 2 live nodes to show up in gossip
INFO 2023-08-27 15:47:55,759 [shard 0] init - Shutting down group 0 service
```
alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node1.log:
```
INFO 2023-08-27 15:48:02,532 [shard 0] gossip - InetAddress 127.0.43.6 is now UP, status = UNKNOWN
...
WARN 2023-08-27 15:48:03,552 [shard 0] gossip - failure_detector_loop: Send echo to node 127.0.43.6, status = failed: seastar::rpc::closed_error (connection is closed)
```
Note that node1 saw node6 as UP after node6 already timed out
and was shutting down.
Increase the timeout to 3 minutes in all modes to reduce flakiness.
Fixes #15185
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15186
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
Fixes #15198
Closes #15199
It is too early to require that all nodes in normal state
have a non-null host_id.
The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.
In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.
Fixes scylladb/scylladb#15181
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15184
`create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Closes #15141
* github.com:scylladb/scylladb:
storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
if the endpoint specified when creating a KEYSPACE is not found,
then when flushing a memtable we would throw an `std::out_of_range`
exception when looking up the client in `storage_manager::_s3_endpoints`
by the name of the endpoint, and scylla would crash because of it. so
far, we don't have a good way to error out early. since the
storage option for keyspace is still experimental, we can live
with this, but it would be better if we could spot this error in logging
messages when testing this feature.
also, in this change, `std::invalid_argument` is thrown instead of
`std::out_of_range`. it's more appropriate in this circumstance.
Refs #15074
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15075
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixes scylladb/scylladb#15147
Closes #15149
* github.com:scylladb/scylladb:
storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler
storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
Since 75d1dd3a76
gossiper::convict will no longer call `mark_dead`
(e.g. when called from the failure detection loop
after a node is stopped following decommission)
and therefore the on_dead notification won't get called.
To make that explicit, if the node was alive before
remove_endpoint erased it from _live_endpoint,
and it has an endpoint_state, call the on_dead notifications.
These are important to clean up after the node is dead,
e.g. in storage_proxy::on_down which cancels all
respective write handlers.
This is preferred over going through `mark_dead`, as the latter
marks the endpoint as unreachable, which is wrong in this
case since the node left the cluster.
Fixes #15178
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15179
Disabling fstrim.timer was meant to avoid running fstrim on /var/lib/scylla from
both scylla-fstrim.timer and fstrim.timer, but fstrim.timer actually never does
that, since it only looks at fstab entries, not our systemd unit.
To run fstrim correctly on the rootfs and other filesystems not related to
scylla, we should stop disabling fstrim.timer.
Fixes #15176
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #15177
seastar::format() just forwards the parameters to be formatted to
`fmt::format_to()`, which is able to format `std::string`, so there is
no need to cast the `std::string` instance to `std::string_view` for
formatting it.
in this change, the cast is dropped. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15143
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" to the current position of the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.
Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.
Fixes #14782
Closes #14834
New file streaming for tablets will require integration with compaction
groups. So this patch introduces a way for streaming to take a storage
snapshot of a given tablet using its token range. Memtable is flushed
first, so all data of a tablet can be streamed through its sstables.
The interface is compaction group / tablet agnostic, but the user can
easily pick data from a single tablet by using the range in the tablet
metadata for a given tablet.
E.g.:
auto erm = table.get_effective_replication_map();
auto& tm = erm->get_token_metadata();
auto tablet_map = tm.tablets().get_tablet_map(table.schema()->id());
for (auto tid : tablet_map.tablet_ids()) {
auto tr = tablet_map.get_token_range(tid);
auto ssts = co_await table.take_storage_snapshot(tr);
...
}
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15128
We want to stop supporting IPs for --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace. However, we shouldn't remove these features without
the deprecation period because the original removenode and
replace operations still support them. So, we add them for now.
Additionally, we modify test_raft_ignore_nodes.py so that it
verifies the added IP support.
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` want to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after the application closes the corresponding
fd. to be more specific, almost none of the popular local filesystems
implement the file_operations.op, hence it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on a local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or storage connected via
a buffered network device, then it is crucial to flush in a timely
manner; otherwise we could risk data loss if the application / machine /
network breaks while the data is considered persisted but is
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped class.
Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15134
In the following commit, we add IP support for --ignore-dead-nodes
in raft_removenode and raft_replace. To implement it, we need
a way to translate IPs to Raft IDs. The solution is to add a new
member function -- find_by_addr -- to raft_address_map that
does the IP->ID translation.
The IP support for --ignore-dead-nodes will be deprecated and
find_by_addr shouldn't be called for other reasons, so it always
logs a warning.
We also add some unit tests for find_by_addr.
It's possible that the compaction task is preempted after completion and
before reevaluation, causing pending_tasks to be > 1.
Let's only exit the loop if there are no pending tasks, and also
reduce the 100ms sleep, which is an eternity for this test.
Fixes #14809.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15059
`create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Change 6449c59 brought back abort on listener failure,
but this causes crashes when listeners hit expected errors
like gate_closed.
Detect shutdown using the gossiper _abort_source
and in this case just log a warning about the errors
but do not abort.
Fixes scylladb/scylladb#15031
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As Tomasz Grabiec correctly noted:
> We should also ensure that once _abort_source is aborted, we don't attempt to process any further notifications, because that would violate monotonicity due to partially failed notification. Even if the next listener eventually fails too, if this invariant is violated, it can have undesired side effects.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Locking the _endpoint_update_semaphore should be abortable with the
gossiper _abort_source. No further processing should
be done once abort is requested.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Locking an endpoint should be abortable with the
gossiper _abort_source. No further processing should
be done once abort is requested.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Once abort is requested we should not process any more
gossip RPCs to prevent undesired side effects
of partially applied state changes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixes scylladb/scylladb#15147
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places
do that once and keep it in the context object.
It will be used in the next patch to get a respective
effective_replication_map_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sstabledump is deprecated in favor of the `scylla sstable` commands. so
let's reflect this in the document.
Fixes #15020
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15021
Scylla sstable promises to *never* mutate its input sstables. This
promise was broken by `scylla sstable scrub --scrub-mode=validate`,
because validate moves invalid input sstables into quarantine. This is
unexpected and caused occasional failures in the scrub tests in
test_tools.py. Fix by propagating a flag down to
`scrub_sstables_validate_mode()` in `compaction.cc`, specifying whether
validate should quarantine invalid sstables, then setting this flag to false
in `scylla-sstable.cc`. The existing test for validate-mode scrub is
amended to check that the sstable is not mutated. The test now fails
before the fix and passes afterwards.
Fixes: #14309
Closes #15139
Currently the gossiper marks endpoint_state objects as alive/dead.
In some cases the endpoint_state::is_alive function is checked, but in many other cases
gossiper::is_alive(endpoint) is used to determine if the endpoint is alive.
This series removes the endpoint_state::is_alive state and moves all the logic to gossiper::is_alive,
which bases its decision on the endpoint having an endpoint_state and being in the _live_endpoints set.
For that, _live_endpoints is made sure to be replicated to all shards when changed,
and the endpoint_state changes are serialized under lock_endpoint. Also, the
endpoint_state in the _endpoint_states_map is never updated in place; rather, a temporary copy is changed
and then safely replicated using gossiper::replicate.
Refs https://github.com/scylladb/scylladb/issues/14794
Closes #14801
* github.com:scylladb/scylladb:
gossiper: mark_alive: remove local_state param
endpoint_state: get rid of _is_alive member and methods
gossiper: is_alive: use _live_endpoints
gossiper: evict_from_membership: erase endpoint from _live_endpoints
gossiper: replicate_live_endpoints_on_change: use _live_endpoints_version to detect change
gossiper: run: no need to replicate live_endpoints
gossiper: fold update_live_endpoints_version into replicate_live_endpoints_on_change
gossiper: add mutate_live_and_unreachable_endpoints
gossiper: reset_endpoint_state_map: clear also shadow endpoint sets
gossiper: reset_endpoint_state_map: clear live/unreachable endpoints on all shards
gossiper: functions that change _live_endpoints must be called on shard 0
gossiper: add lock_endpoint_update_semaphore
gossiper: make _live_endpoints an unordered_set
endpoint_state: use gossiper::is_alive externally
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.
Fix by registering the observer before starting group0, and even before
starting the gossiper, to avoid a theoretical case where something may pull
the schema after the start of gossiping and before the observer is registered.
Fixes: #15078
Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
We make every alternative type in the request_param variant
a named struct to make the code more readable. Additionally, this
change will make extending request parameters easier if we decide
to do so in the future.
Closes #15132
Some runtime errors thrown in storage_service::raft_removenode
start with the "raft topology " prefix. Since "raft topology" is
an implementation detail, we don't want to expose this information
through the user API. Only logs should contain it.
Closes #15136
This patch marks per-scheduling-group counters with the skip_when_empty flag.
This reduces metrics reporting for scheduling groups that do not use
those counters.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch marks the conditional metrics counters with the skip_when_empty
flag, to reduce metrics reporting when CAS is not in use.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch marks scylla_transport_cql_errors_total with the skip_when_empty
flag.
It reduces the overhead of metrics for types that are never reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When casting a float or double column to a string with `CAST(f AS TEXT)`,
Scylla is expected to print the number with enough digits so that reading
that string back to a float or double restores the original number
exactly. This expectation isn't documented anywhere, but makes sense,
and is what Cassandra does.
Before commit 71bbd7475c, this wasn't the
case in Scylla: `CAST(f AS TEXT)` always printed 6 digits of precision,
which was slightly less than enough for a float (which can have 7 decimal digits
of precision), and very much not enough for a double (which can need 15
digits). The origin of this magic "6 digits" number was that Scylla uses
seastar::to_sstring() to print the float and double values, and before
the aforementioned commit those functions used sprintf with the "%g"
format - which always prints 6 decimal digits of precision! After that
commit, to_sstring() now uses a different approach (based on fmt) to
print the float and double values, that prints all significant digits.
This patch adds a regression test for this bug: We write float and double
values to the database, cast them to text, and then recover the float
or double number from that text - and check that we get back exactly the
same float or double object. The test *fails* before the aforementioned
commit, and passes after it. It also passes on Cassandra.
Refs #15127
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15131
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Closes #14844
* github.com:scylladb/scylladb:
api: storage_service: improve description of /storage_service/host_id
token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
in this series, we also print the sstable name and pk when writing a tombstone whose local_deletion_time (ldt for short) is greater than INT32_MAX and hence cannot be represented by an int32_t.
Fixes #15015
Closes #15107
* github.com:scylladb/scylladb:
sstable/writer: log sstable name and pk when capping ldt
test: sstable_compaction_test: add a test for capped tombstone ldt
scylla-gdb.py has two methods for iterating over all tables:
* all_tables()
* for_each_table()
Despite this, many places in the code iterate over the column family map
directly. This patch leaves just a single method (for_each_table()) and
migrates the whole codebase to use it, instead of iterating over the raw
map. While at it, the access to the map is made backward compatible with
pre-52afd9d42d code; said commit wrapped database::_column_families in a
tables_metadata object, which broke scylla-gdb.py for older versions.
Closes #15121
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.
The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.
Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.
Fixes #15025 Closes #15113
* github.com:scylladb/scylladb:
test: add test_raft_ignore_nodes
test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
raft topology: pass ignore_nodes to {replace, remove}_with_repair
raft topology: exec_global_command: add ignore_nodes to exclude_nodes
raft topology: exec_global_command: change type of exclude_nodes
topology_state_machine: extend request_param with a set of raft ids
raft topology: set ignore_nodes in raft_removenode and raft_replace
utils: introduce split_comma_separated_list
raft topology: add the ignore_nodes column to system.topology
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.
The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.
Closes#14901
* github.com:scylladb/scylladb:
test_fencing: add test_fence_hints
test.py: output the skipped tests
test.py: add skip_mode decorator and fixture
test.py: add mode fixture
hints: add debug log for dropped hints
hints: send_one_hint: extend the scope of file_send_gate holder
pylib: add ScyllaMetrics
hints manager: add send_errors counter
token_metadata: add debug logs
fencing: add simple data plane test
random_tables.py: add counter column type
raft topology: don't increment version when transitioning to node_state::normal
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.
This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.
Fix the issue by using an exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with a reconnection exception stack trace.
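The capped policy can be sketched as follows (jitter omitted for determinism; the function and its parameter names are illustrative, not the driver's actual API):

```python
def reconnect_intervals(base=1.0, max_interval=4.0, attempts=6):
    """Exponential backoff capped at max_interval seconds."""
    return [min(base * 2 ** i, max_interval) for i in range(attempts)]

# with a 4 s cap the schedule flattens quickly:
print(reconnect_intervals())
# with the default 600 s cap the interval keeps growing for many retries,
# easily exceeding a test's timeout:
print(reconnect_intervals(max_interval=600.0, attempts=10)[-1])
```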
Fixes #15104 Closes #15112
This short series deals with two stall sources in row-level repair `to_repair_rows_list`:
1. Freeing the input `repair_rows_on_wire` in one shot on return (as seen in https://github.com/scylladb/scylladb/issues/14537)
2. Freeing the result `row_list` in one shot on error. This hasn't been seen in testing, but I have no reason to believe it is not susceptible to exactly the same stalls as repair_rows_on_wire, given the same number of rows and mutations.
Fixes https://github.com/scylladb/scylladb/issues/14537 Closes #15102
* github.com:scylladb/scylladb:
repair: reindent to_repair_rows_list
repair: to_repair_rows_list: clear_gently on error
repair: to_repair_rows_list: consume frozen rows gently
We add two tests verifying that --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace are handled correctly.
We need a 7-node cluster to keep a Raft majority. Therefore, these
tests are quite slow, and we want to run them only in the dev mode.
ManagerClient.remove_node allows passing ignore_dead only as
List[IPAddress]. However, raft_removenode currently supports
only host ids. To write a test that passes ignore_dead to
ManagerClient.remove_node in the Raft topology mode, we allow
passing ignore_dead as List[HostId].
Note that we don't want to use List[IPAddress | HostId] because
mixing IP addresses and host ids fails anyway. See
ss::remove_node.set(...) in api::set_storage_service.
To properly stream ranges during the removenode or replace
operation in the Raft topology mode, we pass IPs of the ignored
nodes to replace_with_repair and remove_with_repair in
storage_service::raft_topology_cmd_handler.
We add ignore_nodes to exclude_nodes in exec_global_command
to ignore nodes marked as dead by --ignore-dead-nodes for
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace.
We extend exclude_nodes in exec_global_command with ignore_nodes
in the next commit. Since we already use std::unordered_set to
store ids of the ignored nodes and their number is unknown, we
change the type of exclude_nodes from utils::small_vector to
std::unordered_set.
We add two new alternative types to service::request_param:
removenode_param and replace_param. They allow storing the list
of ignored nodes loaded from the ignore_nodes column of
system.topology. We also remove the raft::server_id type because
it has been only used by the replace operation.
To handle --ignore-dead-nodes in raft_removenode and
--ignore-dead-nodes-for-replace in raft_replace, we set the
ignore_nodes value of the topology mutation in these functions. In
the following commits, we ensure that the topology coordinator
properly makes use of it.
The test makes a write through the first node with
the third node down; this causes a hint for the third
node to be stored on the first node. We increment the version
and fence_version on the third node, restart it,
and expect to see a hint delivery failure
because of the version mismatch. Then we update the versions
of the first node and expect the hint to be successfully
delivered.
The pytest option -rs forces it to print
all the skipped tests along with
the reasons. Without this option we
can't tell why certain tests were skipped;
maybe some of them shouldn't be skipped anymore.
Syntactic sugar for marking tests to be
skipped in a particular mode.
There is skip_in_debug/skip_in_release in suite.yaml,
but they can be applied only to an entire file,
which is unnatural and inconvenient. Also, they
don't allow specifying a reason why the test is skipped.
Separate dictionary skipped_funcs is needed since
we can't use pytest fixtures in decorators.
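A minimal sketch of the decorator described above (the real implementation in test.py may differ in details):

```python
# test function -> list of (mode, reason); a fixture consults this dict
# at collection time, since fixtures cannot be used inside decorators.
skipped_funcs = {}

def skip_mode(mode: str, reason: str):
    """Mark a test to be skipped when running in the given build mode."""
    def wrap(fn):
        skipped_funcs.setdefault(fn, []).append((mode, reason))
        return fn
    return wrap

@skip_mode('debug', 'times out in debug')
def test_something():
    pass
```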
The problem was that the holder in the with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (the segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending some hints led to errors.
Fixes #15110
This patch adds facilities to work
with Scylla metrics from test.py tests.
A new metrics property was added to
ManagerClient; its query method
sends a request to the Scylla metrics
endpoint and returns an object
for conveniently accessing the result.
ScyllaMetrics is copy-pasted from
test_shedding.py. It's difficult
to reuse code between the 'new' and 'old'
styles of tests: we can't just import
pylib in 'old' tests because of some
problems with Python search directories.
A past commit of mine that attempted
to solve this problem was rejected on review.
There was no indication of problems
in the hints manager metrics before.
We need this counter for fencing tests
in the later commit, but it seems to be
useful on its own.
We log the new version when the new token
metadata is set.
Also, the log for fence_version is moved
in shared_token_metadata from storage_service
for uniformity.
The test starts a three node cluster
and manually decrements the version on
the last node. It then tries to write
some data through the last node and
expects to get 'stale topology' exception.
Use the presence of the endpoint in _live_endpoints
as the authoritative source for is_alive,
rather than the endpoint_state::is_alive status.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than keeping expensive copies of `_live_endpoints`
and `_unreachable_endpoints` in shadow members (they aren't
currently used for their content anyhow) just to detect when
their corresponding members change.
With that, the function is renamed to replicate_live_and_unreachable_endpoints.
This still doesn't provide strong exception safety guarantees,
but at least we don't "cheat" about shard state
and we don't mark shard 0 state as "replicated" by
updating the shadow members.
Also, we save some unneeded allocations.
Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As asias@scylladb.com noticed, after the previous
patch that calls replicate_live_endpoints_on_change
in mutate_live_and_unreachable_endpoints, _live_endpoints
are always updated on all shards when they change,
so there's no need anymore to replicate them here.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to propagate any change to _live_endpoints
to all shards. Currently we just update the `_live_endpoints_version`
and `replicate_live_endpoints_on_change` propagates the
change some undetermined time in the future.
To rely on `_live_endpoints` for gossiper::is_alive,
that may be called on any shard, we want to propagate
the change to all shards as soon as it happens.
Use `mutate_live_and_unreachable_endpoints` to update
_live_endpoints and/or _unreachable_endpoints safely,
under `lock_endpoint_update_semaphore`. It is responsible
for incrementing _live_endpoints_version and
calling `replicate_live_endpoints_on_change` to
propagate the change to all shards.
Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for safely modifying _live_endpoints
and/or _unreachable_endpoints and replicating the
new version to all shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Three places handle comma-separated lists similarly:
- ss::remove_node.set(...) in api::set_storage_service,
- storage_service::parse_node_list,
- storage_service::is_repair_based_node_ops_enabled.
In the next commit, the fourth place that needs the same logic
appears -- storage_service::raft_replace. It needs to load
and parse the --ignore-dead-nodes-for-replace param from config.
Moreover, the code in is_repair_based_node_ops_enabled is
different and doesn't seem right. We swap '\"' and '\'' with ' '
but don't do anything with it afterward.
To avoid code duplication and fix is_repair_based_node_ops_enabled,
we introduce the new function utils::split_comma_separated_list.
This change has a small side effect on logging. For example,
ignore_nodes_strs in storage_service::parse_node_list might be
printed in a slightly different form.
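The intended behavior can be sketched in Python (the real helper is C++ in utils/; the exact quote and whitespace handling here is an assumption for illustration):

```python
def split_comma_separated_list(s: str):
    # split on commas, strip surrounding whitespace and quotes,
    # and drop empty entries
    return [tok
            for tok in (t.strip().strip("'\"") for t in s.split(','))
            if tok]
```

For example, a node list like `"n1 , 'n2',, \"n3\""` would come out as `['n1', 'n2', 'n3']`, which also explains the small logging difference mentioned above.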
In the following commits, we add support for --ignore-dead-nodes
in raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace. To make these request parameters accessible for the
topology coordinator, we store them in the new ignore_nodes
column of system.topology.
If we don't clear them, there is a slight chance
that the next update will make `_live_endpoints` or `_unreachable_endpoints`
equal to their shadow counterparts and prevent an update
in `replicate_live_endpoints_on_change`.
Fixes#15003
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Not only on the calling shard (shard 0).
Essentially this change folds `update_live_endpoints_version`
into `reset_endpoint_state_map`.
Acquire the _endpoint_update_semaphore to serialize
this with `replicate_live_endpoints_on_change`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`update_live_endpoints_version` and functions that call it
must be called on shard 0, since it updates the authoritative
`_live_endpoints` and `_live_endpoints_version`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a private helper to acquire the _endpoint_update_semaphore
before calling replicate_live_endpoints_on_change.
Must be called on shard 0.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to maintain as an unordered_set,
and it will be used in a following patch
to determine is_alive(endpoint) in O(1) on average.
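The resulting lookup, in a Python sketch (a hash set stands in for the C++ unordered_set; the endpoint values are illustrative):

```python
live_endpoints = {'127.0.0.1', '127.0.0.2'}  # hash set of live endpoints

def is_alive(endpoint: str) -> bool:
    # membership test in a hash set is O(1) on average,
    # vs O(n) for a linear scan of a vector
    return endpoint in live_endpoints
```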
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prevent destroying potentially large `rows` and `row_list`
in one shot on error, as it might cause a reactor stall.
Instead, use utils::clear_gently on the error return path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although to_repair_rows_list may yield if needed
between rows and mutation fragments, the input
`repair_rows_on_wire` is freed in one shot
and that may cause stalls as seen in qa:
```
| bytes_ostream::free_chain at ././bytes_ostream.hh:163
++ - addr=0x4103be0:
| bytes_ostream::~bytes_ostream at ././bytes_ostream.hh:199
| (inlined by) frozen_mutation_fragment::~frozen_mutation_fragment at ././mutation/frozen_mutation.hh:273
| (inlined by) std::destroy_at<frozen_mutation_fragment> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
| (inlined by) std::__cxx11::_List_base<frozen_mutation_fragment, std::allocator<frozen_mutation_fragment> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
| (inlined by) partition_key_and_mutation_fragments::~partition_key_and_mutation_fragments at ././repair/repair.hh:203
| (inlined by) std::destroy_at<partition_key_and_mutation_fragments> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
| (inlined by) std::__cxx11::_List_base<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
| (inlined by) to_repair_rows_list at ./repair/row_level.cc:597
```
This change consumes the rows and frozen mutation fragments
incrementally, freeing each after being processed.
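The pattern can be sketched in Python (Scylla's version works on the C++ lists above and yields to the reactor between items; `consume_gently` is an illustrative name, not the actual helper):

```python
def consume_gently(rows, process):
    # pop items one at a time so the container shrinks incrementally
    # instead of all of its elements being destroyed in one shot;
    # in Scylla each iteration is also a preemption point
    out = []
    while rows:
        out.append(process(rows.pop(0)))
    return out
```

When the loop finishes, the input container is already empty, so its destructor has nothing left to free at once.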
Fixes scylladb/scylladb#14537
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We already have a test for issue #13533, where an "IN" doesn't work with
a secondary index (the secondary index isn't used in that case, and
instead inefficient filtering is required). Recently a user noticed the
same problem also exists for local secondary indexes - and this patch
includes a reproducing test. The new test is marked xfail, as the issue is
still unfixed. The new test is Scylla-only because local secondary index
is a Scylla-only extension that doesn't exist in Cassandra.
Refs #13533.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15106
The system.tablets table stores replica sets as a CQL set type,
which is sorted. This means that if, in a tablet replica set
[n1, n2, n3] n2 is replaced with n4, then on reload we'll see
[n1, n3, n4], changing the relative position of n3 from the third
replica to the second.
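The reordering is easy to demonstrate; here Python's sorted output stands in for the CQL set type, and the list version mirrors the order-preserving behavior described below:

```python
def replace_in_set(replicas, old, new):
    # a CQL set is stored sorted, so positional information is lost
    return sorted(set(replicas) - {old} | {new})

def replace_in_list(replicas, old, new):
    # a CQL list preserves positions (as locator::replace_replica does)
    return [new if r == old else r for r in replicas]

print(replace_in_set(['n1', 'n2', 'n3'], 'n2', 'n4'))   # n3 slides to slot 2
print(replace_in_list(['n1', 'n2', 'n3'], 'n2', 'n4'))  # n3 stays in slot 3
```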
The relative position of replicas in a replica set is important
for materialized views, as they use it to pair base replicas with
view replicas. To prepare for materialized views using tablets,
change the persistent data type to list, which preserves order.
The code that generates new replica sets already preserves order:
see locator::replace_replica().
While this changes the system schema, tablets are an experimental
feature so we don't need to worry about upgrades.
Closes #15111
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectLimitTest.java into our cql-pytest framework.
The tests reproduce two already-known bugs:
Refs #9879: Using PER PARTITION LIMIT with aggregate functions should
fail as Invalid query
Refs #10357: Spurious static row returned from query with filtering,
despite not matching filter
And also helped discover two new issues:
Refs #15099: Incorrect sort order when combining IN and ORDER BY
Refs #15109: PER PARTITION LIMIT should be rejected if SELECT DISTINCT
is used
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15114
commit 7c8c020 introduced a new type of keyspace, an internal keyspace.
It defined the semantics for this internal keyspace, which is
somewhat a hybrid between a system and a user keyspace.
Here we extend the semantics to also include flushes, meaning that
flushes will be done using the system dirty_memory_manager. This is
in order to allow interdependencies between internal tables and user
tables and prevent deadlocks.
One example of such a deadlock is our `replicated_key_provider`
encryption in the enterprise version. The deadlock occurs because, in some
circumstances, an encrypted user table flush is dependent upon the
`encrypted_keys` table being flushed, but since the requests are
serialized, we get a deadlock.
Tests: unit tests dev + debug
The deadlock dtest reproducer:
encryption_at_rest_test.py::TestEncryptionAtRest::test_reboot
Fixes#14529
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #14547
"
The series fixes a bogus assertion during topology state load and adds a
test that runs rebuild to make sure the code will not regress again.
Fixes#14958
"
* 'gleb/rebuilding_fix_v1' of github.com:scylladb/scylla-dev:
test: add rebuild test
system_keyspace: fix assertion for missing transition_state
Clang and GCC's warning option of `-Wbraced-scalar-init` warns
at seeing superfluous use of braces, like:
```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2187:32: error: braces around scalar initializer [-Werror,-Wbraced-scalar-init]
.snapshot_threshold{1},
^~~
```
usually, this does not hurt. but by taking the braces out, we have
a more readable piece of code and fewer warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15086
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.
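The capping logic, sketched in Python (the function name and log format are illustrative, not the actual C++ code):

```python
INT32_MAX = 2**31 - 1

def cap_local_deletion_time(ldt, logger=None):
    """Cap an out-of-range tombstone deletion time, logging the context."""
    if ldt > INT32_MAX:
        if logger:
            # the real message also includes the sstable name and pk,
            # which is what this series adds
            logger(f"capping tombstone local_deletion_time {ldt} "
                   f"to {INT32_MAX - 1}")
        return INT32_MAX - 1
    return ldt
```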
Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
local_deletion_time (ldt for short) is a timestamp used for the
purpose of purging the tombstone after gc_grace_seconds. if its value
is greater than INT32_MAX, it is capped when being written to an sstable.
this is very likely a signal of bad configuration or even a bug in
scylla. so we keep track of it with a metric named
"scylla_sstables_capped_tombstone_deletion_time".
in this change, a test is added to verify that the metric is updated
upon seeing a tombstone with this abnormal ldt.
because we validate the consistency before and after compaction in
tests, this change adds a parameter to disable this check, otherwise,
because capping the ldt changes the mutation, the validation would
fail the test.
Refs #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow the exception in perform_task for reshape compaction.
Fixes: #15058.
Closes #15067
This argument had been dead since its introduction, and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.
Fixes #14963 Closes #14964
This commit updates the description of perftune.py.
It is based on the information in the reported issue (below),
the contents of help for perftune.py, and the input from
@vladzcloudius.
Fixes https://github.com/scylladb/scylladb/issues/14233 Closes #14879
Fixes https://github.com/scylladb/scylla-docs/issues/2467
This commit updates the Networking section. The scope is:
- Removing the outdated content, including the reference to
the super outdated posix_net_conf.sh script.
- Adding the guidelines provided by @vladzcloudius.
- Adding the reference to the documentation for
the perftune.py script.
Closes #14859
The test is flaky on CI in debug builds
on aarch64 (#14752); here we sprinkle more
logs for debug/aarch64, hoping it'll help to
debug it.
Refs #14752 Closes #14822
in load_sstables(), `sst_path` is already an instance of `std::filesystem::path`,
so there is no need to cast it to `std::filesystem::path`. also,
`path.remove_filename()` returns something like
"system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/", with the
trailing slash kept. when we get a component's path in `sstable::filename`,
we always add a "/" in between the `dir` and the filename, so this'd
end up with two slashes in the path, like:
"/var/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//mc-2-big-Data.db"
so, in order to remove the duplicated slash, let's just use
`path.parent_path()` here.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15035
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent a use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097,
where a temporary schema_ptr retrieved from a global_schema_ptr
got destroyed when the notification function yielded.
Capturing the schema_ptr on the coroutine frame
is inexpensive since it's a shared pointer, and it makes sure
that the schema remains valid throughout the coroutine's
lifetime.
Fixes scylladb/scylladb#15097
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15098
`raft_server` in test/raft/randomized_nemesis_test.cc manages
instances of direct_fd_pinger and direct_fd_clock with unique_ptr<>.
this unique_ptr<> deletes these managed instances using delete.
but since these two classes have virtual functions without a virtual
destructor, the compiler warns when deleting them: in theory, these
pointers could be pointing to derived classes, and deleting them
through a base-class pointer could lead to a leak.
so, to silence the warning and to prevent potential issues, let's
just mark these two classes final.
this should address the warning like:
```
In file included from /home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:9:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/reactor.hh:24:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/aligned_buffer.hh:24:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'direct_fd_pinger<int>' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<direct_fd_pinger<int>>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1400:5: note: in instantiation of member function 'std::unique_ptr<direct_fd_pinger<int>>::~unique_ptr' requested here
~raft_server() {
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: note: in instantiation of member function 'raft_server<ExReg>::~raft_server' requested here
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<raft_server<ExReg>>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1704:24: note: in instantiation of member function 'std::unique_ptr<raft_server<ExReg>>::~unique_ptr' requested here
._server = nullptr,
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1742:19: note: in instantiation of member function 'environment<ExReg>::new_node' requested here
auto id = new_node(first, std::move(cfg));
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2113:39: note: in instantiation of member function 'environment<ExReg>::new_server' requested here
auto leader_id = co_await env.new_server(true);
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15084
to mirror the build modes supported by `configure.py`.
Closes #15085
* github.com:scylladb/scylladb:
build: cmake: support Coverage and Sanitize build modes
build: cmake: error out if specified build type is unknown
On boot several manipulations with system.local are performed.
1. The host_id value is selected from it with key = local.
If not found, system_keyspace generates a new host_id, inserts the
new value into the table and returns it.
2. The cluster_name is selected from it with key = local.
Then it's system_keyspace that either checks that the name matches
the one from db::config, or inserts the db::config value into the
table.
3. The row with key = local is updated with various info like versions,
listen, rpc and bcast addresses, dc, rack, etc., unconditionally.
All three steps are scattered over main: p.1 is called directly, p.2 and
p.3 are executed via system_keyspace::setup(), which happens rather late.
Also there's some touch of this table from the cql_test_env startup code.
The proposal is to collect this setup into one place and execute it
early -- as soon as the system.local table is populated. This frees the
system_keyspace code from the logic of selecting host id and cluster
name leaving it to main and keeps it with only select/insert work.
refs: #2795
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15082
Tablet migration may execute a global token metadata barrier before
executing updates of system.tablets. If the table is dropped while the
barrier is happening, the updates will bring back rows for migrated
tablets in a table which is no longer there. This will cause tablet
metadata loading to fail with the error:
missing_column (missing column: tablet_count)
Like in this log line:
storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")
The fix is to read and execute the updates in a single group0 guard
scope, and move execution of the barrier later. We can no longer generate
updates in the same handle_tablet_migration() step if the barrier needs to
be executed, so we reuse the mechanism for two-step stage transition
which we already have for handling of streaming. The next pass will
notice that the barrier is not needed for a given tablet and will
generate the stage update.
Fixes #15061 Closes #15069
The DML page is quite long (21 screenfuls on my monitor); split
it into one page per statement to make it more digestible.
The sections that are common to multiple statements are kept
in the main DML page, and references to them are added.
Closes #15053
scylladb overrides some of seastar's logging related options with its
own options by applying them with `logging::apply_settings()`. but
we fail to inherit `with_color` from Seastar as we are using the
designated initializer, so the unspecified members are zero initialized.
that's why we always have logging messages in black and white even
if scylla is running in a tty and `--log-with-color 1` is specified.
so, to make the debugging life more colorful, let's inherit the option
from Seastar, and apply it when setting logging related options.
see also 29e09a3292
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15076
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
```
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```
Closes #14863
* github.com:scylladb/scylladb:
Kill scylla option to configure number of compaction groups
replica: Wire tablet into compaction group
token_metadata: Add this_host_id to topology config
replica: Switch to chunked_vector for storing compaction groups
replica: Generate group_id for compaction_group on demand
this should help the developer to understand what build types are
supported if the specified one is unknown.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this series, the "experimental" option is marked `Unused`, as it has been deprecated for almost 2 years, since scylla 4.6. instead, `experimental_features` is used to specify the desired experimental features explicitly.
Closes #14948
* github.com:scylladb/scylladb:
config: remove unused namespace alias
config: use std::ranges when appropriate
config: drop "experimental" option
test: disable 'enable_user_defined_functions' if experimental_features does not include udf
test: pylib: specify experimental_features explicitly
to note that sstablemetadata is being deprecated and encourage
users to switch over to the native tools.
Fixes#15020
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15040
in this series, the build of rpm and deb from submodules is fixed:
1. correct the path of reloc package
2. add the dependency of reloc package to deb/rpm build targets
Closes #15062
* github.com:scylladb/scylladb:
build: cmake: correct reloc_pkg's path
build: cmake: build rpm/deb from reloc_pkg
before this change, object_store/test_basic.py creates a config file
for specifying the object storage settings, and passes the path of this
file as the argument of the `--object-storage-config-file` option when
running scylla. we have the same requirement when testing scylla
with a minio server, where we launch a minio server and manually
create the config file and feed it to scylla.
to ease the preparation work, let's consolidate by creating the
config file in `minio_server.py`, so it always creates the config
file and puts it in its tempdir. since object_store/test_basic.py
can also run against an S3 bucket, the fixture implemented in
object_store/conftest.py is updated accordingly to reuse the
helper exposed by MinioServer to create the config file when it
is not available.
Closes#15064
* github.com:scylladb/scylladb:
s3/client: avoid hardwiring env variables names
s3/client: generate config file for tests
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls, which are made
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well, and also move the retry loop up. This
patch does it by introducing a new function, take_guard(), to the
cql_statement class. Schema-altering statements return a group0 guard
while others do not return any guard. The query processor takes this
guard at the beginning of statement execution and retries if
service::group0_concurrent_modification is thrown. The guard is passed
to execute() in the query_state structure.
Fixes: #13942
Message-ID: <ZNsynXayKim2XAFr@scylladb.com>
in 34c3688017, we added a virtual function
to `config_file`, and we allocate and delete `db::config` instances
via `unique_ptr<>`. this makes the compiler nervous, as deleting a
pointer to an instance of a non-final class with virtual functions
but a non-virtual destructor could lead to a leak, if the pointer
actually points to a derived class. so, in order to silence the
warning and to prevent potential problems in the future, let's
mark `db::config` final.
the warning from Clang 16 looks like:
```
In file included from /home/kefu/dev/scylladb/test/lib/test_services.cc:10:
In file included from /home/kefu/dev/scylladb/test/lib/test_services.hh:25:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'db::config' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<db::config>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/lib/test_services.cc:189:16: note: in instantiation of member function 'std::unique_ptr<db::config>::~unique_ptr' requested here
auto cfg = std::make_unique<db::config>();
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15071
instead of hardwiring the names in multiple places, let's just
keep them in a single place as variables, and reference them by
these variables instead of their values.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, object_store/test_basic.py created a config file
for specifying the object storage settings, and passed the path of this
file as the argument of the `--object-storage-config-file` option when
running scylla. we have the same requirement when testing scylla
with a minio server, where we launch a minio server and manually
create the config file and feed it to scylla.
to ease the preparation work, let's consolidate by creating the
config file in `minio_server.py`, so it always creates the config
file and puts it in its tempdir. since object_store/test_basic.py
can also run against an S3 bucket, the fixture implemented in
object_store/conftest.py is updated accordingly to reuse the
helper exposed by MinioServer to create the config file when it
is not available.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The option was introduced to bootstrap the project. It's still
useful for testing, but that translates into maintaining an
additional option and code that will not really be used
outside of testing. A possible option is to later map the
option in boost tests to initial_tablets, which may yield
the same effect for testing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
There's a need for compaction_group_manager, as the table will still
support "tabletless" mode, and we don't want to sprinkle ifs here and
there to support both modes. It's not really a manager (it's not even
supposed to store state), but I couldn't find a better name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as raft topology is pulled later
than new tables are registered and created, and this node is added
to topology even later.
To allow creation of compaction groups to retrieve "my id" from
token metadata early, initialization will now feed local id
into topology config which is immutable for each node anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If `live_updatable_config_params_changeable_via_cql` is set to true, configuration parameters defined with the `liveness::LiveUpdate` option can be updated at runtime with CQL, i.e. by updating the `system.config` virtual table.
If we don't want any configuration parameter to be changed at
runtime by updating the `system.config` virtual table, this option
should be set to false. This option should be set to false e.g. for
cloud users, who can only perform CQL queries and should not be able
to change scylla's configuration on the fly.
The current implementation is generic, but has a small drawback - messages
returned to the user may not be fully accurate; consider:
```
cqlsh> UPDATE system.config SET value='2' WHERE name='task_ttl_in_seconds';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="option is not live-updateable" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
```
where `task_ttl_in_seconds` has been defined with
`liveness::LiveUpdate`, but because `live_updatable_config_params_changeable_via_cql` is set to
`false` in `scylla.yaml`, `task_ttl_in_seconds` cannot be modified at
runtime by updating the `system.config` virtual table.
Fixes#14355
Closes#14382
Before compaction task executors started inheriting from
compaction_task_impl, they were destructed immediately after
compaction finished. Destructors of executors and their
fields performed actions that affected global structures and
statistics and had impact on compaction process.
Currently, task executors are kept in memory much longer, as they
are tracked by the task manager. Thus, destructors are not called just
after the compaction, which results in compaction stats not being
updated, causing e.g. an infinite cleanup loop.
Add release_resources() method which is called at the end
of compaction process and does what destructors used to.
Fixes: #14966.
Fixes: #15030.
Closes#15005
should have used `ignore_errors=True` to ignore
the error. this issue has not popped up, because
we haven't run into the case where the log file
does not exist.
this was a regression introduced by
d4ee84ee1e
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15063
before this change, the filename in path of reloc package looks like:
tools-scylla-5.4.0~dev-0.20230816.2eb6dc57297e.noarch.tar.gz
but it should have been:
scylla-tools-5.4.0~dev-0.20230816.2eb6dc57297e.noarch.tar.gz
so, when repackaging the reloc tarball to rpm / deb, the scripts
just fail to find the reloc tarball.
after this change, the filename is corrected to match with the one
generated using `build_reloc.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, dist-${name}-rpm and dist-${name}-deb targets
do not depend on the corresponding reloc pkg from which these
prebuilt packages are created. so these scripts fail if the reloc
package does not exist.
to address this problem, the reloc package is added as their
dependencies.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
An sstable can be in one of several states -- normal, quarantined, staging, uploading. Right now this "state" is hard-wired into the sstable's path, e.g. a quarantined sstable would sit in a directory like /var/lib/data/ks-cf-012345/quarantine/. Respectively, there's a bunch of directory-name constexprs in sstables.hh defining each "state". Other than being confusing, this approach doesn't work well with the S3 backend. Additionally, there's the snapshot subdir that adds to the confusion, because a snapshot is not quite a state.
This PR converts the "state" from constexpr char* directory names into an enum class and patches the sstable creation, opening and state-changing API to use that enum instead of parsing the path.
refs: #13017
refs: #12707
Closes#14152
* github.com:scylladb/scylladb:
sstable/storage: Make filesystem storage with initial state
sstable: Maintain state
sstable: Make .change_state() accept state, not directory string
sstable: Construct it with state
sstables_manager: Remove state-less make_sstable()
table: Make sstables with required state
test: Make sstables with upload state in some cases
tools: Make sstables with normal state
table: Open-code sstables making streaming helpers
tests: Make sstables with normal state by default
sstable_directory: Make sstable with required state
sstable_directory: Construct with state
distributed_loader: Make sstable with desired state when populating
distributed_loader: Make sstable with upload state when uploading
sstable: Introduce state enum
sstable_directory: Merge verify and g.c. calls
distributed_loader: Merge verify and gc invocations
sstable/filesystem: Put underscores to dir members
sstable/s3: Mark make_s3_object_name() const
sstable: Remove filename(dir, ...) method
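The shape of the state enum can be sketched like this (illustrative Python for brevity -- the actual change is in C++, and the enum members and helper names here are only indicative, not the real API):

```python
from enum import Enum

class SstableState(Enum):
    """Sketch of the sstable state enum replacing hard-wired subdirectory names."""
    NORMAL = ""            # sstable lives directly in the table directory
    STAGING = "staging"
    QUARANTINE = "quarantine"
    UPLOAD = "upload"

def sstable_dir(table_dir: str, state: SstableState) -> str:
    """The filesystem storage driver derives the directory from the state
    on demand, instead of parsing the state back out of the path."""
    if state == SstableState.NORMAL:
        return table_dir
    return f"{table_dir}/{state.value}"
```

With this scheme, an S3-backed storage driver can ignore the directory mapping entirely and track the state as plain metadata.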
and use that in compaction_group, rather than
respective accumulators of its own.
This is part of a larger series to make cache updates exception safe.
Refs #14043
Closes#15052
* github.com:scylladb/scylladb:
sstable_set: maintain total bytes_on_disk
sstable_set: insert, erase: return status
there is a small time window after we find a free port and before
the minio server listens on that port; if another server sneaks
into that window and listens on the port, the minio server can
still fail to start even though there might be a free port for it.
so, in this change, we just retry with a random port a fixed
number of times until the minio server is able to serve.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15042
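The retry scheme can be sketched roughly as follows (a minimal illustration, not the actual `minio_server.py` code; the port range, retry count and function name are made up):

```python
import random
import socket

def start_on_random_port(tries: int = 5) -> socket.socket:
    """Try to bind a listening socket on a random port, retrying a fixed
    number of times if another process sneaks in and takes the port."""
    for _ in range(tries):
        port = random.randint(10000, 20000)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen(1)
            return sock  # bound successfully -- no check-then-use race
        except OSError:
            sock.close()  # port taken; pick another one and retry
    raise RuntimeError(f"could not find a free port after {tries} attempts")
```

Binding directly and retrying on failure avoids the check-then-use window entirely, since the bind itself is the atomic reservation of the port.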
We aim for a large number of tablets, therefore let's switch
to chunked_vector to avoid large contiguous allocs.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There are a few good reasons for this change.
1) compaction_group doesn't have to be aware of # of groups
2) thinking forward to dynamic tablets, # of groups cannot be
statically embedded in group id, otherwise it gets stale.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The use-after-move is not very harmful as it's only used when
handling an exception, so the user would be left with a bogus message.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15054
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:
```
"regular_columns": [
"name",
"clustering_order",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"column_name_bytes",
"type_name",
"org.apache.cassandra.db.marshal.BytesType",
"name",
"kind",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"position",
"type_name",
"org.apache.cassandra.db.marshal.Int32Type",
"name",
"type",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type"
]
```
but according to
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,
> $SERIALIZATION_HEADER_METADATA := {
> "min_timestamp_base": Uint64,
> "min_local_deletion_time_base": Uint64,
> "min_ttl_base": Uint64",
> "pk_type_name": String,
> "clustering_key_types_names": [String, ...],
> "static_columns": [$COLUMN_DESC, ...],
> "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
> "name": String,
> "type_name": String
> }
"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columns". this schema makes sense,
as each column should be considered a single object composed
of two properties, but we dump them as a flat list.
so, in this change, we guard each visit() call of `json_dumper()`
with `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.
after the change, "regular_columns" are printed like:
```
"regular_columns": [
{
"name": "clustering_order",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "column_name_bytes",
"type_name": "org.apache.cassandra.db.marshal.BytesType"
},
{
"name": "kind",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "position",
"type_name": "org.apache.cassandra.db.marshal.Int32Type"
},
{
"name": "type",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
}
]
```
Fixes#15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15037
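The shape of the fix, mirrored in Python (the actual change is in the C++ `json_dumper`; this only reproduces the output structure, not the real code):

```python
import json

def dump_columns(columns):
    """Emit each column as its own object with "name"/"type_name" keys,
    instead of flattening all key/value pairs into one long list."""
    return json.dumps(
        [{"name": name, "type_name": type_name} for name, type_name in columns],
        indent=2,
    )
```

In the C++ dumper the equivalent is wrapping each per-column visit() in a `StartObject()`/`EndObject()` pair, so the column boundaries survive in the JSON output.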
This reverts commit 70b5360a73. It generates
a failure in group0_test.test_concurrent_group0_modifications in debug
mode with about 4% probability.
Fixes#15050
and use that in compaction_group, rather than
respective accumulators of its own.
bytes_on_disk is implemented by each sstable_set_impl
and is updated on insert and erase (whether directly
into the sstable_set_impl or via the sstable_set).
Although compound_sstable_set doesn't implement
insert and erase, it overrides `bytes_on_disk()` to return
the sum of all the underlying `sstable_set::bytes_on_disk()`.
Also, added respective unit tests for `partitioned_sstable_set`
and `time_series_sstable_set`, that test each type's
bytes_on_disk, including cloning of the set, and the
`compound_sstable_set` bytes_on_disk semantics.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
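The accounting scheme can be sketched like this (illustrative Python; the real implementation lives in the C++ sstable_set_impl classes, and these class and method names only echo that design):

```python
class SstableSet:
    """Maintains total bytes_on_disk incrementally on insert/erase, so
    compaction_group doesn't need accumulators of its own."""
    def __init__(self):
        self._sstables = {}       # name -> size on disk
        self._bytes_on_disk = 0

    def insert(self, name, size):
        if name not in self._sstables:   # ignore duplicate inserts
            self._sstables[name] = size
            self._bytes_on_disk += size

    def erase(self, name):
        self._bytes_on_disk -= self._sstables.pop(name, 0)

    def bytes_on_disk(self):
        return self._bytes_on_disk

class CompoundSstableSet:
    """Holds no sstables itself; reports the sum of the underlying sets."""
    def __init__(self, sets):
        self._sets = sets

    def bytes_on_disk(self):
        return sum(s.bytes_on_disk() for s in self._sets)
```

Keeping the counter inside the set means every code path that mutates the set updates the total automatically, which is what makes the accounting exception-safe to maintain.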
perform_offstrategy is called from try_perform_cleanup
when there are sstables in the maintenance set that require
cleanup.
The input sstables are inserted into the compaction_state
`sstables_requiring_cleanup` and `try_perform_cleanup`
expects offstrategy compaction to clean them up along
with reshape compaction.
Otherwise, the maintenance sstables that require cleanup
are not cleaned up by cleanup compaction, since
the reshape output sstable(s) are not analyzed again
after reshape compaction, where that would insert
the output sstable(s) into `sstables_requiring_cleanup`
and trigger their cleanup in the subsequent cleanup compaction.
The latter method is viable too, but it is less efficient
since we can do reshape+cleanup in one pass, vs.
reshape first and cleanup later.
Fixes scylladb/scylladb#15041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15043
Document how SELECT clauses are considered. For example, given the query
SELECT * FROM tab WHERE a = 3 LIMIT 1
We'll get different results if we first apply the WHERE clause and then LIMIT
the result set, or if we first LIMIT the result set and then apply the
WHERE clause.
Closes#14990
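The difference between the two orderings can be shown with plain Python list operations (an illustration of the semantics, not ScyllaDB code):

```python
# Rows of a hypothetical table `tab` with a single column `a`.
rows = [{"a": 1}, {"a": 3}, {"a": 3}]

# WHERE a = 3 first, then LIMIT 1: the limit caps the *matching* rows,
# so one matching row comes back.
where_then_limit = [r for r in rows if r["a"] == 3][:1]

# LIMIT 1 first, then WHERE a = 3: the single kept row happens not to
# match, so nothing comes back.
limit_then_where = [r for r in rows[:1] if r["a"] == 3]
```

CQL follows the first ordering: the WHERE clause selects the rows, and LIMIT then caps the filtered result set.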
discard_result ignores only successful futures. Thus, if a
perform_compaction<regular_compaction_task_executor> call fails,
the failure is considered abandoned, causing tests to fail.
Explicitly ignore the failed future.
Fixes: #14971.
Closes#15000
The filesystem storage driver uses different paths depending on sstable
state. It's possible to keep only the table directory _and_ the state,
and construct this path on demand when needed, but it's faster to keep
the full path onboard. All the more so as it's only exported via the
.prefix() call, which is for logs only, but still.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This means -- keep state on sstable, change it with change_state() call
and (!) fix the is_<state>() helpers not to check storage->prefix()
nit: mark requires_view_building() noexcept while at it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pretty cosmetic change, but it will allow S3 to finally support moving
sstables between states (after this patch it still doesn't)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves make_path() call from outside of sstable::sstable()
inside it. Later it will be moved even further. Also, now sstable can
know its state and keep it (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all callers specify the state they want their sstables in explicitly
and the old API can be removed
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By default it's created with normal state, but there are some places
that need to put it into staging. Do it with new state enum
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As was mentioned in the previous patch, there are a few places in tests
that put sstables in the upload/ subdir and they really mean it. Those
need to use the sstables manager/directory API directly (already done)
and specify the state explicitly (this patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just like tests, the tool opens sstables by full path and doesn't make
any assumptions about sstable state
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of those that call each other to end up calling plain
make_sstable() one. It's simpler to patch both if they just call the
latter directly.
While at it -- drop the unused default argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's assumed that tests are not very specific about which
subdirectory an sstable is in, so they can use the normal state. Places
that need to move sstables between states will use the sstable manager
API explicitly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The state is on sstable_directory, can switch to using the new manager
API. The full path is still there, will be dropped later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to replace full path sitting on this object eventually. For now
they have to co-exist, but state will be used to make_sstable()-s from
manager with its new API
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This still needs to convert state to directory name internally, as
sstable_directory instances are hashed on the populator by subdir string.
Also, the full string path is printed in logs. All this is now internal
to the populate method and will be fixed later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several states between which an sstable can migrate. Nowadays
the state is encoded into the sstable directory, which is not nice. Also,
S3-backed sstables don't support states, keeping sstables only in
"normal". This patch adds the state enum in order to replace the
path-encoded one eventually. A new sstables_manager::make_sstable()
method is added that accepts the table directory (without the
quarantine/ or staging/ component) and the desired initial state
(optional). Next patches will make use of this maker and the existing
one will be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the fs storage driver, which can do dir/file
concatenation on its own. Moreover, this method shouldn't be used even
internally
Both init_system_keyspace() and init_non_system_keyspaces() populate
the keyspaces with the help of distributed_loader::populate_keyspace().
That method, in turn, walks the list of the keyspaces' tables to load
sstables from disk and attach them.
After that, both init_...-s take a second pass over the keyspaces'
tables to call table::mark_ready_for_writes() on each. This marking can
be moved into populate_keyspace(); that's much easier and shorter
because that method already has the shard-wide table pointer and can
just call whatever it needs on the table.
This changes the initialization sequence: before the patch, all tables
were populated before any of them was marked as ready for write. This
looks safe, however, as marking a table for write means resetting its
generation generator, and different tables' generators are independent
from each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#15026
To add a sharded service to the cql_test_env one needs to patch it in 5 or 6 places
- add cql_test_env reference
- add cql_test_env constructor argument
- initialize the reference in initializer list
- add service variable to do_with method
- pass the variable to cql_test_env constructor
- (optionally) export it via cql_test_env public method
Steps 1 through 5 are annoying; things get much simpler if it looks like
- add cql_test_env variable
- (optionally) export it via cql_test_env public method
This is what this PR does
refs: #2795
Closes#15028
* github.com:scylladb/scylladb:
cql_test_env: Drop local *this reference
cql_test_env: Drop local references
cql_test_env: Move most of the stuff in run_in_thread()
cql_test_env: Open-code env start/stop and remove both
cql_test_env: Keep other services as class variables
cql_test_env: Keep services as class variables
cql_test_env: Construct env early
cql_test_env: De-static fdpinger variable
cql_test_env: Define all services' variables early
cql_test_env: Keep group0_client pointer
We see the abort_requested_exception error from time
to time, instead of sleep_aborted that was expected
and quietly ignored (in debug log level).
Treat abort_requested_exception the same way since
the error is expected on shutdown and to reduce
test flakiness, as seen for example, in
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3033/artifact/logs-full.release.010/1691896356104_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_schema/node2.log
```
INFO 2023-08-13 03:12:29,151 [shard 0] compaction_manager - Asked to stop
WARN 2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted)
INFO 2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Finished main loop
WARN 2023-08-13 03:12:29,152 [shard 0] cdc - Aborted update CDC description table with generation (2023/08/13 03:12:17, d74aad4b-6d30-4f22-947b-282a6e7c9892)
INFO 2023-08-13 03:12:29,152 [shard 1] compaction_manager - Asked to stop
INFO 2023-08-13 03:12:29,152 [shard 1] compaction_manager - Stopped
INFO 2023-08-13 03:12:29,153 [shard 0] init - Signal received; shutting down
INFO 2023-08-13 03:12:29,153 [shard 0] init - Shutting down view builder ops
INFO 2023-08-13 03:12:29,153 [shard 0] view - Draining view builder
INFO 2023-08-13 03:12:29,153 [shard 1] view - Draining view builder
INFO 2023-08-13 03:12:29,153 [shard 0] compaction_manager - Stopped
ERROR 2023-08-13 03:12:29,153 [shard 0] view - start failed: seastar::abort_requested_exception (abort requested)
ERROR 2023-08-13 03:12:29,153 [shard 1] view - start failed: seastar::abort_requested_exception (abort requested)
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15029
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls, which are made
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well, and also move the retry loop up. This
patch does it by introducing a new function, take_guard(), to the
cql_statement class. Schema-altering statements return a group0 guard
while others do not return any guard. The query processor takes this
guard at the beginning of statement execution and retries if
service::group0_concurrent_modification is thrown. The guard is passed
to execute() in the query_state structure.
Fixes: #13942
Message-ID: <ZNSWF/cHuvcd+g1t@scylladb.com>
There are some asserting checks for keyspace and table existence on cql_test_env that perform some one-line work in a complex manner; tests can do better on their own. Removing them makes cql_test_env simpler
refs: #2795
Closes#15027
* github.com:scylladb/scylladb:
test: Remove require_..._exists from cql_test_env
test: Open-code ks.cf name parse into cdc_test
test: Don't use require_table_exists() in test/lib/random_schema
test: Use BOOST_REQUIRE(!db.has_schema())
test: Use BOOST_REQUIRE(db.has_schema())
test: Use BOOST_REQUIRE(db.has_keyspace())
test: Threadify cql_query_test::test_compact_storage case
test: Threadify some cql_query_test cases
The local auto& foo = env._foo references in run_in_thread() are no
longer needed; the code that uses foo can be switched to use _foo
(this->_foo) instead
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_with() method is static and cannot just access the cql_test_env
variable's fields, using local references instead. To simplify this,
most of the method's content is moved to the non-static run_in_thread()
method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are more services on do_with() stack that are not referenced from
the cql_test_env. Move them to be class variables too
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they are duplicated -- variables exist on the do_with() stack and
the class references some of them. This patch makes it vice-versa -- all
the variables are on the cql_test_env and do_with() references them. The
latter will change soon
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays they are all scattered along the .do_with() function. Keeping
them in one early place makes it possible to relocate them onto the
cql_test_env later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now a reference, but some time later it won't be possible to
initialize it at construction time, so turn it into a pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test uses qualified ks.cf name to find the schema, but it's the only
test case that does it. There's no point in maintaining a dedicated
helper on the cql_test_env just for that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This check is pointless. The subsequent call to find_column_family()
would call on_internal_error() in case schema is not found, and since
cql_test_env sets abort-on-internal-error to true, this would fail just
like that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Surprisingly, there's a dedicated helper for the check opposite to the
one fixed in the previous patch. Fix that one too
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as in the previous patch: the cql_test_env::require_table_exists()
helper does exactly the same, but returns a future and asserts on
failures for no gain
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cql_test_env::require_keyspace_exists() performs exactly the same
check, but is a future-returning function for no reason, and it
assert()s on failure, which is less informative (not that it ever
failed) than BOOST_REQUIRE
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's like the previous patch, and for the same reason, but the change is
a bit more complicated because it uses resolved futures' results in a
few places, so it likely deserves a separate commit
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those two use straight .then-s sequences, no point in keeping them that
long. Being threads makes next patches shorter and nicer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This version increment is not accompanied by a
global_token_metadata_barrier, which means the new
version won't be reflected in fence_version
and basically will have no effect in terms of fencing.
If make_task call in compaction_manager::perform_compaction yields,
compaction_task_executor::_compaction_state may be gone and gate
won't be held.
Hold gate immediately after compaction_task_executor is created.
Add comment not to call prepare_task without preparation.
Refs: #14971.
Fixes: #14977.
Closes#14999
before this change, we build multiple relocatable packages for
different builds in parallel using ninja. all these relocatable
packages are built using the same script,
`create-relocatable-package.py`, but this script always used the
same directory and file for the `.relocatable_package_version`
file.
so there are chances that these jobs building the relocatable
packages can race, writing / accessing the same file at the same
time. so, in this change, instead of using a fixed file path
for this temporary file, we use a NamedTemporaryFile for this purpose.
this should help us avoid build failures like
```
[2023-08-10T09:38:00.019Z] FAILED: build/debug/dist/tar/scylla-unstripped-5.4.0~dev-0.20230810.116c10a2b0c6.x86_64.tar.gz
[2023-08-10T09:38:00.019Z] scripts/create-relocatable-package.py --mode debug 'build/debug/dist/tar/scylla-unstripped-5.4.0~dev-0.20230810.116c10a2b0c6.x86_64.tar.gz'
[2023-08-10T09:38:00.019Z] Traceback (most recent call last):
[2023-08-10T09:38:00.019Z] File "/jenkins/workspace/scylla-master/scylla-ci/scylla/scripts/create-relocatable-package.py", line 130, in <module>
[2023-08-10T09:38:00.019Z] os.makedirs(f'build/{SCYLLA_DIR}')
[2023-08-10T09:38:00.019Z] File "<frozen os>", line 225, in makedirs
[2023-08-10T09:38:00.019Z] FileExistsError: [Errno 17] File exists:
'build/scylla-package'
```
Fixes#15018
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15007
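The fix boils down to the standard `tempfile` pattern; a minimal sketch, assuming a made-up helper name (this is not the actual `create-relocatable-package.py` code):

```python
import tempfile

def write_version_file(version: str) -> str:
    """Write the package version to a unique temporary file instead of a
    fixed path, so parallel ninja jobs cannot race on the same file."""
    with tempfile.NamedTemporaryFile(
            mode="w",
            suffix=".relocatable_package_version",
            delete=False) as f:
        f.write(version + "\n")
        return f.name  # each caller gets its own unique path
```

Since `NamedTemporaryFile` generates a fresh name per call, two concurrent builds can never collide on the version file the way they could with a fixed `build/...` path.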
Before returning the task status, wait_task waits for the task to
finish with the done() method and calls get() on the resulting future.
If the requested task fails, an exception is thrown and the user gets
an internal server error instead of a failed task status.
Now the result of the done() method is ignored.
Fixes: #14914.
Closes#14915
The topology coordinator only marks a replaced node as LEFT
during the replace operation and actually removes it from the
group 0 config in cleanup_group0_config_if_needed. If this
function is called before raft has committed a replacing node as
a voter, it does not remove the replaced node from the group 0
config. Then, the coordinator can decide that it has no work to
do and starts sleeping, leaving us with an outdated config.
This behavior reduces group 0 availability and causes problems in
tests (see #14885). Also, it makes the coordinator's logic
confusing - it claims that it has no work to do when it has some
work to do. Therefore, we modify the coordinator so that it
removes the replaced node earlier in handle_topology_transition.
Fixes#14885
Fixes#14975
Closes#15009
This series fixes a couple of review comments on #14845
Closes#14976
* github.com:scylladb/scylladb:
gossiper: lock_endpoint: fix comment regarding permit_id mismatch
gossiper: lock_endpoint: change assert to on_internal_error
The code assumes that if there is no transition_state, there should be no
nodes currently in transition in a state other than left_token_ring,
but the rebuild operation also creates such nodes, so add a check
for it as well.
there is a chance that the default port of 9000 is already in use on the host
running the test; in that case, we should try to use another available
port.
so, in this change, we try ports in the range [9000, 9000+1000), and
use the first one which is not connectable.
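The probing described above can be sketched in Python (hypothetical helper; the real logic lives in the Scylla test harness):

```python
import socket

def find_free_port(start: int = 9000, count: int = 1000,
                   host: str = "127.0.0.1") -> int:
    """Return the first port in [start, start+count) that nothing is
    listening on, i.e. the first one which is not connectable."""
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.1)
            # connect_ex returns a nonzero errno (e.g. ECONNREFUSED)
            # when nothing is listening, meaning the port is free
            if s.connect_ex((host, port)) != 0:
                return port
    raise RuntimeError("no free port in range")

# occupy a port, then check that the probe skips it
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
busy = srv.getsockname()[1]
assert find_free_port(start=busy, count=10) != busy
srv.close()
```

Note the inherent race: a port that is free at probe time can still be taken before the server binds it, so this is best-effort, like the change it illustrates.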
Fixes #14985
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14997
* github.com:scylladb/scylladb:
test: stop using HostRegistry in MinioServer
s3/client: check for available port before starting minio server
test.py schedules calls to the cluster's .uninstall() and .stop(), making
double calls to it run at the same time. Mark the cluster as not
running early on.
While there, do the same for .stop_gracefully() for consistency.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14987
Currently, `mark_dead` is called with null_permit_id
from `convict`, and in this case, by contract,
it should lock the endpoint, same as `mark_as_shutdown`.
This change somehow escaped #14845 so it amends it.
Fixes #14838
Closes #15001
* github.com:scylladb/scylladb:
gossiper: verify permit_id in all private functions
gossiper: convict: lock_endpoint
Instead of acquiring the permit if the permit_id arg is null,
like in mark_as_shutdown, just assert that the permit_id is
non-null. The functions are both private and we control all callers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, `mark_dead` is called with null_permit_id
from `convict`, and in this case, by contract,
it should lock the endpoint, same as `mark_as_shutdown`.
This change somehow escaped #14845 so it amends it.
Fixes #14838
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
since MinioServer finds a free port by itself, there is no need to
provide it an IP address anymore -- we can always use
127.0.0.1.
so, in this change, we just drop the HostRegistry parameter passed
to the constructor of MinioServer, and pass the host address in place
of it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `system.group0_history` table provides useful descriptions for each
command committed to Raft group 0. One way of applying a command to
group 0 is by calling `migration_manager::announce`. This function has
the `description` parameter set to empty string by default. Some calls
to `announce` use this default value which causes `null` values in
`system.group0_history`. We want `system.group0_history` to have an
actual description for every command, so we change all default
descriptions to reasonable ones.
Going further, we remove the default value for the `description`
parameter of `migration_manager::announce` to avoid using it in the
future. Thanks to this, all commands in `system.group0_history` will
have a non-null description.
Fixes #13370
Closes #14979
* github.com:scylladb/scylladb:
migration_manager: announce: remove the default value of description
test: always pass empty description to migration_manager::announce
migration_manager: announce: provide descriptions for all calls
there is a chance that the default port of 9000 is already in use on the
host running the test; in that case, we should try to use another
available port.
so, in this change, we try ports in the range [9000, 9000+1000),
and use the first one which is not connectable.
Fixes #14985
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently `sstable_requiring_cleanup` is updated using `compacting_sstable_registration`, but that mechanism is not used by offstrategy compaction, leading to #14304.
This series introduces `compaction_manager::on_compaction_completion`, which intercepts the call
to table::on_compaction_completion. This allows us to update `sstables_requiring_cleanup` right before the compacted sstables are deleted, making sure they are not leaked into `sstables_requiring_cleanup`, which would hold a reference to them until cleanup attempts to clean them up.
`cleanup_incremental_compaction_test` was adjusted to observe the sstables' `on_delete` (by adding a new observer event) to detect the case where cleanup attempts to delete the leaked sstables and fails since they were already deleted from the file system by offstrategy compaction. The test fails without the fix and passes with it.
Fixes #14304
Closes #14858
* github.com:scylladb/scylladb:
compaction_manager: on_compaction_completion: erase sstables from sstables_requiring_cleanup
compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
sstable: add on_delete observer
compaction_manager: add on_compaction_completion
sstable_compaction_test: cleanup_incremental_compaction_test: verify sstables_requiring_cleanup is empty
before this change, we use a generator expression to initialize CMAKE_CXX_FLAGS_RELEASE; this has three problems:
1. the generator expression is not expanded when setting a regular variable.
2. the ending ">" is missing in one of the generator expressions.
3. the parameters are not separated with ";"
so, to address them, let's just
* use `add_compile_options()` directly, as the corresponding `mode.${build_mode}.cmake` is only included when the "${build_mode}" is used.
* add ";" in between the command line options.
* add the missing closing ">"
Closes #14996
* github.com:scylladb/scylladb:
build: cmake: pass --gc-sections to ld not ar
build: cmake: use add_compile_options() in release build
when it comes to `regular_compaction_task_executor`, we repeat the
compaction until it cannot proceed, so after an iteration
of compaction completes successfully, the task can still continue with
yet another round of compaction as it sees fit. so let's
update the comment to reflect this fact.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14891
quite a few member variables serve as the configuration for
a given compaction; they are immutable over its life cycle,
so for better readability, let's mark them `const`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14981
The only place that still calls it is the static force_blocking_flush() method. It can be made non-static already. Also, while at it, coroutinize some system_keyspace methods and fix a FIXME regarding replica::database access in it.
Closes #14984
* github.com:scylladb/scylladb:
code: Remove query-context.hh
code: Remove qctx
system_keyspace: Use system_keyspace's container() to flush
system_keyspace: Make force_blocking_flush() non-static
system_keyspace: Coroutinize update_tokens()
system_keyspace: Coroutinize save_truncation_record()
ar is not able to tell which sections are to be GC'ed, hence it does
not care about --gc-sections, but ld does. let's add this option
to CMAKE_EXE_LINKER_FLAGS.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we use a generator expression to initialize
CMAKE_CXX_FLAGS_RELEASE; this has three problems:
1. the generator expression is not expanded when setting
a regular variable.
2. the ending ">" is missing in one of the generator
expressions.
3. the parameters are not separated with ";"
so, to address them, let's just
* use `add_compile_options()` directly, as the corresponding
`mode.${build_mode}.cmake` is only included when the
"${build_mode}" is used.
* add ";" in between the command line options.
* add the missing closing ">"
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"experimental" was marked deprecated in 8b917f7c. this change
was included since Scylla 4.6. now that 5.3 has been branched,
this change will be included 5.4. this should be long enough
for the user's turn around if this option is ever used.
the dtests using this option has been audited and updated
accordingly. and the unit testing this option is removed as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"enable_user_defined_functions" is enabled by default by
`make_scylla_conf()` in pylib/scylla_cluster.py, and we've being
using `experimental` = True in this very function. this combination
works fine, as "udf" is enabled by `experimental`. but
since `experimental` is deprecated, if we drop this option and stop
handling it. this peace is broken. "enable_user_defined_function"
requires "udf" experimental feature. but test_boost_after_ip_change
feed the scylla with an empty `experimental_features` list, so
the test fails. to pave for the road of dropping `experimental`
option, let's disable `enable_user_defined_function` as well
in test_boost_after_ip_change.
the same applies to other tests changed in this commit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"experimental" was marked deprecated in 8b917f7c. so let's
specify the experimental features explicitly using
`experimental_feature` option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In force_blocking_flush() there's an invoke-on-all invocation of
replica::database::flush() and a FIXME to get the replica database from
somewhere else rather than via query-processor -> data_dictionary.
Now that force_blocking_flush() is non-static, the invoke-on-all can
happen via system_keyspace's container and the database can be obtained
directly from the local sys.ks instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Erase retired sstables from compaction_state::sstables_requiring_cleanup
also in on_compaction_completion (in addition to
compacting_sstable_registration::release_compacting),
for offstrategy compaction with piggybacked cleanup
or any other compaction type that doesn't use
compacting_sstable_registration.
Add cleanup_during_offstrategy_incremental_compaction_test,
modeled after cleanup_incremental_compaction_test, to check
that cleanup doesn't attempt to clean up already-deleted
sstables that were left over by offstrategy compaction
in sstables_requiring_cleanup.
Fixes #14304
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prevent div-by-zero by returning a constant level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test before it is
extended to also cover offstrategy compaction.
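A sketch of the special case (a Python stand-in for the C++ `ideal_level_for_input`; the fan-out constant and the exact formula are illustrative assumptions, only the zero guard mirrors the fix):

```python
import math

def ideal_level_for_input(input_bytes: int, max_sstable_size: int,
                          fan_out: int = 10) -> int:
    """Pick the LCS level whose capacity fits the input. With
    max_sstable_size == 0 the division below would divide by zero,
    so return a constant level 1 instead, as the fix does."""
    if max_sstable_size == 0:
        return 1
    # level L holds roughly fan_out**L sstables of max_sstable_size each
    sstables = max(1, input_bytes // max_sstable_size)
    return max(1, math.ceil(math.log(sstables, fan_out)))

assert ideal_level_for_input(10**9, 0) == 1   # guarded special case
assert ideal_level_for_input(10**9, 160 * 2**20) >= 1
```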
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass the call through to table::on_compaction_completion
so we can manage the sstables-requiring-cleanup state
along the way.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We remove the default value for the description parameter of
migration_manager::announce to avoid using it in the future.
Thanks to this, all commands in system.group0_history will
have a non-null description.
In the next commit, we remove the default value for the
description parameter of migration_manager::announce to avoid
using it in the future. However, many calls to announce in tests
use the default value. We have to change it, but we don't really
care about descriptions in the tests, so we pass the empty string
everywhere.
The system.group0_history table provides useful descriptions
for each command committed to Raft group 0. One way of applying
a command to group 0 is by calling migration_manager::announce.
This function has the description parameter set to empty string
by default. Some calls to announce use this default value which
causes null values in system.group0_history. We want
system.group0_history to have an actual description for every
command, so we change all default descriptions to reasonable ones.
We can't provide a reasonable description to announce in
query_processor::execute_thrift_schema_command because this
function is called in multiple situations. To solve this issue,
we add the description parameter to this function and to
handler::execute_schema_command that calls it.
While in SQL DISTINCT applies to the result set, in CQL it applies
to the table being selected, and doesn't allow GROUP BY with clustering
keys. So reject the combination like Cassandra does.
While this is not an important issue to fix, it blocks un-xfailing
other issues, so I'm clearing it ahead of fixing those issues.
An issue is unmarked as xfail, and other xfails lose this issue
as a blocker.
Fixes #12479
Closes #14970
Knowing that a server gained or lost leadership in group 0 is
sometimes useful for the purpose of debugging, so we log
information about these events on the INFO level.
Gaining and losing leadership are relatively rare events, so
this change shouldn't flood the logs.
Closes #14877
For `removenode`, we make a removed node a non-voter early. There is no
downside to it because the node is already dead. Moreover, it improves
availability in some situations.
For `decommission`, if we decommission a node when the number of nodes
is even, we make it a non-voter early to improve availability. All
majorities containing this node will remain majorities when we make this
node a non-voter and remove it from the set because the required size of
a majority decreases.
We don't change `decommission` when the number of nodes is odd since
this may reduce availability.
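The availability argument can be checked with simple arithmetic (a strict majority of n voters is n // 2 + 1):

```python
def majority(n: int) -> int:
    """Size of a strict majority of n voters."""
    return n // 2 + 1

# even cluster: demoting one voter before decommission shrinks the
# required majority, so every old majority remains a majority
for n in range(2, 12, 2):
    assert majority(n - 1) < majority(n)   # e.g. 4 voters need 3, 3 need 2

# odd cluster: demoting early keeps the required majority size the same
# while shrinking the voter set, which can only reduce availability
for n in range(3, 13, 2):
    assert majority(n - 1) == majority(n)  # e.g. 5 voters need 3, 4 need 3
```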
Fixes #13959
Closes #14911
* github.com:scylladb/scylladb:
raft: make a decommissioning node a non-voter early
raft: topology_coordinator: implement step_down_as_nonvoter
raft: make a removed node a non-voter early
Rewrite the test that checks whether task_manager/wait_task works properly.
The old version didn't work. Delete the functions used by the old version.
Closes #14959
* github.com:scylladb/scylladb:
test: rewrite wait_task test
test: move ThreadWrapper to rest_util.py
When deleting multiple sstables with the same prefix,
the deletion atomicity is ensured by the pending_delete_log file,
so if scylla crashes in the middle, deletions will be replayed on
restart.
Therefore, we don't have to ensure atomicity of each individual
`unlink`. We just need to sync the directory once, before
removing the pending_delete_log file.
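The protocol can be sketched as follows (a Python stand-in for the C++ sstable deletion code; file names and log format are illustrative assumptions):

```python
import os

def delete_sstable_files(dir_path: str, names: list[str]) -> None:
    """Crash-safe multi-file deletion: the pending-delete log makes the
    whole batch replayable, so individual unlinks need no per-file sync;
    one directory sync before dropping the log is enough."""
    log = os.path.join(dir_path, "pending_delete.log")
    with open(log, "w") as f:                # 1. record the batch durably
        f.write("\n".join(names))
        f.flush()
        os.fsync(f.fileno())
    for name in names:                       # 2. unlink, no fsync per file
        try:
            os.unlink(os.path.join(dir_path, name))
        except FileNotFoundError:
            pass  # already gone: a replayed deletion is a no-op
    dirfd = os.open(dir_path, os.O_RDONLY)   # 3. sync the directory once
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
    os.unlink(log)                           # 4. batch durable, drop the log
```

If a crash happens before step 4, the surviving log names the files to delete again on restart; since unlinking an absent file is a no-op, replay is idempotent.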
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14967
This makes it possible to remove the remaining users of the global qctx.
The thing is that the db::schema_tables code needs to get wasm's engine, alien runner and instance cache to build a wasm context for the merged function, or to drop it from the cache in the opposite case. To get the wasm stuff, this code uses the global qctx -> query_processor -> wasm chain. However, the function (un)merging code already has the database reference at hand, and it's natural to get the wasm stuff from it, not from the q.p., which is not available.
So this PR packs the wasm engine, runner and cache into a sharded<wasm::manager> instance, makes the manager be referenced by both the q.p. and the database, and removes the qctx from the schema tables code.
Closes #14933
* github.com:scylladb/scylladb:
schema_tables: Stop using qctx
database: Add wasm::manager& dependency
main, cql_test_env, wasm: Start wasm::manager earlier
wasm: Shuffle context::context()
wasm: Add manager::remove()
wasm: Add manager::precompile()
wasm: Move stop() out of query_processor
wasm: Make wasm sharded<manager>
query_processor: Wrap wasm stuff in a struct
The metrics are registered on demand when the load balancer is invoked, so that only the leader exports the metrics. When the leader changes, the old leader stops exporting.
The metrics are divided into two levels: per-dc and per-node. In Prometheus, they will have appropriate labels for the dc and host_id values.
Closes #14962
* github.com:scylladb/scylladb:
tablet_allocator: unregister metrics when leadership is lost
tablets: load_balancer: Export metrics
service, raft: Move balance_tablets() to tablet_allocator
tablet_allocator: Start even if tablets feature is not enabled
main, storage_service: Pass tablet allocator to storage_service
Before the patch, tablet metadata update was processed on local schema merge
before table changes.
When a table is dropped, this means that for a while the table will exist
without a corresponding tablet map. This can cause memtable flush for
this table to fail, resulting in intentional abort(). That's because
sstable writing attempts to access tablet map to generate sharding
metadata.
If auto_snapshot is enabled, this is much more likely to happen,
because we flush memtables on table drop.
To fix the problem, process tablet metadata after dropping tables, but
before creating tables.
Fixes #14943
Closes #14954
There are two places in there that need qctx to get the query_processor
from and, in turn, get the wasm::manager. Fortunately, both places have the
database reference at hand and can get the wasm::manager from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The dependency is needed by db::schema_tables to get the wasm manager for
its needs. This patch prepares the ground. Now the wasm::manager is
shared between replica::database and cql3::query_processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed by replica::database and should be available that
early. It doesn't depend on anything and can be moved in the starting
order safely.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a constructor that builds a context out of a const manager reference.
The existing one needs to get the engine and instance cache and does it via
query_processor. This change lets us remove those exports and, finally,
drop the wasm::manager -> cql3::query_processor friendship.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is one of the users of query_processor's export of wasm::manager's
instance cache. Remove it in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the q.p. stops, it also "stops" the wasm manager. Move this call
into main. The cql test env doesn't need this change; it stops the whole
sharded service, which stops instances on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wasm::manager is just cql3::wasm_context renamed. It now sits in
lang/wasm* and is started as a sharded service in main (and cql test
env). This move also needs some headers shuffling, but it's not severe.
This change is required to make it possible for the wasm::manager to be
shared (by reference) between the q.p. and replica::database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three wasm-only fields on the q.p. -- engine, cache and runner.
This patch groups them in a single wasm_context structure to make it
easier to manipulate them in the next patches.
The 'friend' declaration is temporary and will go away soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.
However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from this table didn't contain the `num_ranges`
parameter.
In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.
Fixes: #14944
Closes #14955
* github.com:scylladb/scylladb:
feature_service: don't load whole topology state to check features
system_keyspace: separate loading topology_features from topology
topology_state_machine: extract features-related fields to a struct
untyped_result_set: add missing_column_exception
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.
Fixes #14907
Also, use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Closes #14952
* github.com:scylladb/scylladb:
raft: group0_state_machine: transfer_snapshot: make abortable
raft: group0_state_machine: transfer_snapshot: hold gate
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.
However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from this table didn't contain the `num_ranges`
parameter.
In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.
Fixes: #14944
Now, it is possible to load topology_features separately from the
topology struct. It will be used in the code that checks enabled
features on startup.
`enabled_features` and `supported_features` are now moved to a new
`topology::features` struct. This will allow loading this
information independently of the `topology` struct, which will be
needed for feature checking during startup.
Getting a reason argument in task_manager_module::get_progress is deceiving,
as the method works properly only for streaming::stream_reason::repair
(repair::shard_repair_task_impl::nr_ranges_finished isn't updated for
any other reason).
If some state update in database::add_column_family throws,
the info about the column family would be inconsistent.
Undo the already-performed operations in database::add_column_family
when one of them throws.
Fixes: #14666.
Closes #14672
* github.com:scylladb/scylladb:
replica: undo the changes if something fails
replica: start table earlier in database::add_column_family
All compaction task executors, except for the regular compaction one,
become task manager compaction tasks.
The creation and starting of major_compaction_task_executor are modified
to be consistent with the other compaction task executors.
Closes #14505
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to cover compaction group tasks
compaction: turn custom_task_executor into compaction_task_impl
compaction: turn sstables_task_executor into sstables_compaction_task_impl
compaction: change sstables compaction tasks type
compaction: move table_upgrade_sstables_compaction_task_impl
compaction: pass task_info through sstables compaction
compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
comapction: use optional task info in major compaction
compaction: use perform_compaction in compaction_manager::perform_major_compaction
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.
Fixes: #14924
Closes #14928
we wait for the same condition a couple of lines earlier, so there is
no need to check it again using `BOOST_CHECK_EQUAL()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14921
Currently, when one tries to access a column that an untyped_result_set
does not contain, a `std::bad_variant_access` exception is thrown. This
exception's message provides very little context and it can be difficult
to even figure out where this message is coming from.
In order to improve the situation, a new exception `missing_column` is
introduced which includes the missing column's name in its error
message. The exception derives from `std::bad_variant_access` for
compatibility with existing code that may want to catch it.
`boost::program_options::value()` creates a new typed_value<T> object
without holding it with a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer, managing its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.
before this change, we initialized positional options as global
variables using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from them --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked, hence LeakSanitizer warns us.
after this change, we create the option vector as a static
local variable in a function, so it is created on demand as well.
as an alternative, we could initialize the options vector as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.
Fixes #14929
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14939
Commit bc5f6cf45d
added a reserve call to the `ranges` vector before
inserting all the returned token ranges into it.
However, that reservation is too small as we need
to express size+1 ranges for size tokens with
<unbound, token[0]> and <token[size-1], unbound>
ranges at the front and back, respectively.
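The size+1 requirement is easy to see in a sketch (hypothetical helper; `None` stands for an unbound endpoint):

```python
def token_ranges(tokens: list[int]):
    """Split the keyspace into ranges bounded by the sorted tokens,
    including the open-ended <unbound, token[0]> and <token[-1], unbound>
    ranges at the front and back, respectively."""
    ranges = []  # correct reservation would be len(tokens) + 1 entries
    prev = None
    for t in sorted(tokens):
        ranges.append((prev, t))
        prev = t
    ranges.append((prev, None))  # the extra range the reserve missed
    return ranges

r = token_ranges([10, 20, 30])
assert len(r) == 4                       # size + 1 ranges for size tokens
assert r[0] == (None, 10) and r[-1] == (30, None)
```

Reserving only `size` entries means the final append can reallocate; the fix is purely a capacity correction, not a behavior change.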
Fixes #14849
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14938
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.
Fixes #14907
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds the ranges_parallelism option to the repair RESTful API.
Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, smaller than the default max_repair_ranges_in_parallel calculated by the Scylla core.
Scylla manager can also use this option to provide more ranges (>N) in a single repair job while repairing only ranges_parallelism ranges in parallel, instead of providing N ranges per repair job.
To make it safer, unlike PR #4848, this patch does not allow users to exceed max_repair_ranges_in_parallel.
Fixes #4847
Closes #14886
* github.com:scylladb/scylladb:
repair: Add ranges_parallelism option
repair: Change to use coroutine in do_repair_ranges
before this change, if the object_store test fails, the tempdir
will be preserved. and if our CI test pipeline is used to perform
the test, the test job would scan for the artifacts; if the
test in question fails, it would take over 1 hour to scan the tempdir.
to alleviate the pain, let's keep just the scylla log file
whether the test fails or succeeds, so that jenkins can scan the
artifacts faster if the test fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14880
Per-table metrics are very valuable for users, but they come with a high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-table metrics that will be reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes
The names of the read/write metrics are based on the histogram convention, so when latency histograms are added, the names will not change.
The metrics are labeled with a specific label __per_table="node" so it will be possible to easily manipulate them.
The metrics will be available even when enable_metrics_reporting (the per-table full metrics flag) is off.
Fixes #2198
Closes #13293
* github.com:scylladb/scylladb:
replica/table.cc: Add node-per-table metrics
config: add enable_node_table_metrics flag
If we decommission a node when the number of nodes is even, we
make it a non-voter early to improve availability. All majorities
containing this node will remain majorities when we make this node
a non-voter and remove it from the set because the required size
of a majority decreases.
We move the logic that makes topology_coordinator a non-voter to
a separate function called step_down_as_nonvoter to avoid code
duplication. We use this function in the next commit.
For removenode, we make a removed node a non-voter early. There is
no downside to it because the node is already dead. Moreover, it
improves availability in some situations. Consider a 4-node
cluster with one dead node. If we make the dead node a non-voter
at the beginning of removenode, group 0 will survive the death
of another node in the middle of removenode.
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.
We make sure that all notifications (except for `before_change`, which
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.
An endpoint lock gets a unique permit_id that is passed to the
notifications and passed back by them if the notification functions call
the gossiper back for the same endpoint on paths that modify the
endpoint_state and may acquire the same endpoint lock - to prevent a
deadlock.
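The permit scheme can be sketched like this (a much-simplified, single-threaded Python stand-in; the real gossiper lock is an async per-endpoint lock, and class/method names here are illustrative):

```python
import uuid

class EndpointLock:
    """Sketch of the gossiper's permit scheme: lock() hands out a unique
    permit_id; a re-entrant call that presents the current permit is not
    blocked, so notification callbacks cannot self-deadlock."""
    def __init__(self):
        self._holder = None  # permit_id of the current owner, or None

    def lock(self, permit_id=None):
        if permit_id is not None and permit_id == self._holder:
            return permit_id  # same call chain already holds the lock
        # in the real code this would wait; the sketch just forbids it
        assert self._holder is None, "lock already held by another chain"
        self._holder = uuid.uuid4()
        return self._holder

    def unlock(self, permit_id):
        assert permit_id == self._holder
        self._holder = None

lk = EndpointLock()
permit = lk.lock()
# a notification calls back into the gossiper with the same permit:
assert lk.lock(permit) == permit   # no re-acquisition, no deadlock
lk.unlock(permit)
```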
Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471
Closes #14845
* github.com:scylladb/scylladb:
gossiper: replicate: ensure non-null permit
gossiper: add_saved_endpoint: lock_endpoint
gossiper: mark_as_shutdown: lock_endpoint
gossiper: real_mark_alive: lock_endpoint
gossiper: advertise_token_removed: lock_endpoint
gossiper: do_status_check: lock_endpoint
gossiper: remove_endpoint: lock_endpoint if needed
gossiper: force_remove_endpoint: lock_endpoint if needed
storage_service: lock_endpoint when removing node
gossiper: use permit_id to serialize state changes while preventing deadlocks
gossiper: lock_endpoint: add debug messages
utils: UUID: make default tagged_uuid ctor constexpr
gossiper: lock_endpoint must be called on shard 0
gossiper: replicate: simplify interface
gossiper: mark_as_shutdown: make private
gossiper: convict: make private
gossiper: mark_as_shutdown: do not call convict
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.
Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:
1. First, a global barrier is performed to make sure that all nodes saw
and persisted the same state of the `system.topology` table as the
coordinator and see the same supported features of all nodes. When
booting, nodes are now forbidden to revoke support for a feature if all
nodes declare support for it, so a successful barrier makes sure that
no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
`system.topology` table.
The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).
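The decision rule can be modeled in miniature (all names below are illustrative; the real implementation operates on `system.topology` under a group 0 guard, and phase 1 is a global barrier rather than a local check): a feature is enabled only if every node supports it, and the whole step is a no-op if the state version changed underneath the guard.

```cpp
#include <set>
#include <string>
#include <vector>

struct node { std::set<std::string> supported; };

struct topology {
    std::vector<node> nodes;
    long version = 0;                // bumped on every topology change
    std::set<std::string> enabled;
};

// Returns false (no-op) if the guard's version is stale, mimicking a
// group0_concurrent_modification, or if some node lacks support.
inline bool try_enable(topology& t, long guard_version, const std::string& f) {
    if (guard_version != t.version) {
        return false;                // table modified in the meantime
    }
    for (const auto& n : t.nodes) {
        if (!n.supported.count(f)) {
            return false;            // not yet supported by all nodes
        }
    }
    // phase 1 (global barrier) elided; phase 2: mark enabled
    t.enabled.insert(f);
    ++t.version;
    return true;
}
```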
For now, the implementation relies on a gossip shadow round check to
protect against nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify joining node's cluster features
before the new node is added to group 0 and to the cluster.
A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.
Closes #14722
* github.com:scylladb/scylladb:
test: topology_experimental_raft: cluster feature tests
test: topology: fix a skipped test
storage_service: add injection to prevent enabling features
storage_service: initialize enabled features from first node
topology_state_machine: add size(), is_empty()
group0_state_machine: enable features when applying cmds/snapshots
persistent_feature_enabler: attach to gossip only if not using raft
feature_service: enable and check raft cluster features on startup
storage_service: provide raft_topology_change_enabled flag from outside
storage_service: enable features in topology coordinator
storage_service: add barrier_after_feature_update
topology_coordinator: exec_global_command: make it optional to retake the guard
topology_state_machine: add calculate_not_yet_enabled_features
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes #12010
Closes #14892
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed, except for "test/lib/scylla_test_case.hh"
as it brings some command line options used by scylla tests.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14922
get_compacted_fragments_writer() returns an instance of
`compacted_fragments_writer`, so there is no need to cast it again.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14919
Per-table metrics are very valuable for users, but they come with a
high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-table metrics that will be
reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by
write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes
The names of the read/write metrics are based on the histogram convention,
so when latency histograms are added, the names will not change.
The metrics are labeled with a specific label __per_table="node" so it
will be possible to easily manipulate them.
The metrics will be available when enable_metrics_reporting (the
per-table full metrics flag) is off and enable_node_table_metrics is
true.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.
There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.
This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.
For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0', per-table metrics will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
add fmt formatter for `compaction_task_executor::state` and
`compaction_task_executor` and its derived classes.
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `compaction_task_executor`, its derived classes and
`compaction_task_executor::state` without the help of `operator<<`.
since all of the callers of 'operator<<' of these types now use
formatters, the operator<< overloads are removed in this change. the helpers
like `to_string()` and `describe()` are removed as well, as it'd
be more consistent if we always use fmtlib for formatting instead
of inventing APIs with different names.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14906
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.
This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.
Fixes scylladb/scylladb#14896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14897
Although the implementation of cluster features on raft is not complete
yet, it makes sense to add some tests for the existing implementation.
The `test_raft_cluster_features.py` file includes the same set of tests
as the file with non-raft-based cluster feature tests, plus one
additional test which checks that a node will not allow disabling a
feature if it sees that other nodes support it (even though the feature
is not enabled yet).
The `test_partial_upgrade_can_be_finished_with_removenode` test does not
work because the `cql` variable is used before it is declared. It was
not noticed because the test is marked as skipped, and does not work for
the non-raft cluster feature implementation. The variable declaration is
moved higher and the test now works; it will be used to test the raft
cluster feature implementation.
The first node in the cluster defines it and it does not need to consult
with anybody whether its features should be enabled or not. We can
immediately mark those features as enabled in raft when the first node
inserts its join request to the topology table.
The enable_features_on_join function is now only called if the node does
not use topology over raft, and so the node will not react to changes in
gossip features.
In the future, support for switching to topology coordinator in runtime
will be added and the persistent feature enabler should disconnect
itself during the upgrade procedure. We don't have such procedure yet,
so a bunch of TODOs is added instead.
The enable_features_on_startup method is adjusted for the raft-based
cluster features. In topology coordinator mode:
- Information about enabled features is taken from system.topology
instead of the usual system.scylla_local (`enabled_features` key).
- Features which, according to the local state, are supported by all
nodes but not enabled yet are also checked. Support for such features
cannot be revoked safely because the topology coordinator might have
performed a successful global barrier and might have proceeded with
marking the feature as enabled.
Information about whether we are using topology changes on raft or not
will soon be necessary for the persistent feature enabler, so that it
can do some additional checks based on the local raft topology state.
If the topology coordinator notices that there are no nodes requesting
to be joined, no topology operations in progress and there are some
features that are declared to be supported by all normal nodes but not
enabled yet, the topology coordinator will attempt to enable those
features. This is done in the following way, under a group 0 guard:
- A global `barrier_after_feature_update` is performed to make sure
that:
- All nodes have already updated their supported_features column after
boot and won't attempt to revoke any during the current runtime,
- All nodes saw and persisted the latest topology state so that, after restart,
the feature check won't allow them to revoke support for features
that the topology coordinator is going to enable.
- After the barrier succeeds, the coordinator tries to add the features
to the `enabled_features` column.
Ensure that replicate is called under lock_endpoint
to serialize endpoint state changes on all shards.
Otherwise, we may end up with inconsistent state
across shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state
and calls replicate and mark_dead, therefore it
must ensure this is done under lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates internal state on shard 0
and calls subscribers async callbacks so we should
lock the endpoint to serialize state changes on it.
With that, call get_endpoint_state_for_endpoint_ptr after
locking the endpoint in real_mark_alive, not
before calling it, in the background continuation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state
and calls replicate, therefore it
must ensure this is done under lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state by
calling remove_endpoint and evict_from_membership
(and possibly yielding in-between), so it should
serialize the state change with lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass permit_id to subscribers when we acquire one
via lock_endpoint. The subscribers then pass it back to
gossiper for paths that acquire lock_endpoint for
the same endpoint, to detect nested locks when the endpoint
is locked with the same permit_id.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adds a variant of the existing `barrier` topology command which requires
all participating nodes to confirm that they have updated their supported
features after boot and won't remove any features until restart. A
successful global barrier of this type gives the topology coordinator a
guarantee that it can safely enable features that were supported by all
nodes at the moment of the barrier.
Currently, exec_global_command takes a group 0 guard, drops it and
retakes it after the command is finished. For current uses this is fine
from the correctness point of view and, given that an operation can
take a long time, shorter duration of the guard improves the odds of the
operation succeeding.
However, this is not sufficient for cluster features because they will
need to execute a global barrier under the group 0 guard.
This commit modifies the interface of `exec_global_command` so that
dropping and retaking the guard is optional (the default is to retake
it).
Adds a function which calculates a set of features that are supported by
all normal nodes but are not enabled yet - according to the state of the
topology state machine.
In this PR we add proper fencing handling to the `counter_mutation` verb.
As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The latter is important in case the fence was moved while we were handling the request - some post-fence actions might have already happened at this time, so we can't treat the request as successful. For example, if the topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.
In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.
We don't pass fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.
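The double check around applying a mutation can be sketched like this (a toy model; the type and member names are illustrative, not Scylla's actual signatures): the token carries the topology version the coordinator saw, the replica rejects stale tokens before applying, and re-checks after, since the fence may have moved mid-request.

```cpp
#include <stdexcept>

struct fencing_token { long topology_version; };

struct counter_replica {
    long current_version = 1;   // version the fence currently stands at
    bool applied = false;

    void check_fence(fencing_token t) const {
        if (t.topology_version < current_version) {
            throw std::runtime_error("stale topology: request fenced");
        }
    }
    void apply_counter_mutation(fencing_token t) {
        check_fence(t);   // reject requests sent under an old topology
        applied = true;   // ... apply the mutation ...
        check_fence(t);   // post-check: can't report success if the
                          // fence moved while we were applying
    }
    bool try_apply(fencing_token t) {
        try { apply_counter_mutation(t); return true; }
        catch (const std::runtime_error&) { return false; }
    }
};
```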
Closes #14564
* github.com:scylladb/scylladb:
counter_mutation: add fencing
encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
counter_mutation: add replica::exception_variant to signature
We add the CDC generation optimality check in
`storage_service::raft_check_and_repair_cdc_streams` so that it doesn't
create new generations when unnecessary. Since
`generation_service::check_and_repair_cdc_streams` already has this
check, we extract it to the new `is_cdc_generation_optimal` function to
not duplicate the code.
After this change, multiple tasks could wait for a single generation
change. Calling `signal` on `topology_state_machine.event` wouldn't wake
them all. Moreover, we must ensure the topology coordinator wakes when
its logic expects it. Therefore, we change all `signal` calls on
`topology_state_machine.event` to `broadcast`.
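The difference matters exactly when several tasks wait on the same event. A single-threaded toy model of the condition-variable semantics (simplified, not Scylla's event type):

```cpp
#include <functional>
#include <utility>
#include <vector>

struct cond_event {
    std::vector<std::function<void()>> waiters;

    void signal() {                    // wakes at most one waiter
        if (!waiters.empty()) {
            auto w = std::move(waiters.back());
            waiters.pop_back();
            w();
        }
    }
    void broadcast() {                 // wakes every waiter
        auto ws = std::move(waiters);
        waiters.clear();
        for (auto& w : ws) {
            w();
        }
    }
};
```

With three tasks waiting on one generation change, `signal` leaves two of them stuck; `broadcast` wakes all of them.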
We delay the deletion of the `new_cdc_generation` request to the moment
when the topology transition reaches the `publish_cdc_generation` state.
We need this change to ensure the added CDC generation optimality check
in the next commit has an intended effect. If we didn't make it, it
would be possible that a task makes the `new_cdc_generation` request,
and then, after this request was removed but before committing the new
generation, another task also makes the `new_cdc_generation` request. In
such a scenario, two generations are created, but only one should. After
delaying the deletion of `new_cdc_generation` requests, the second
request would have no effect.
Additionally, we modify the `test_topology_ops.py` test in a way that
verifies the new changes. We call
`storage_service::raft_check_and_repair_cdc_streams` multiple times
concurrently and verify that exactly one generation has been created.
Fixes #14055
Closes #14789
* github.com:scylladb/scylladb:
storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
storage_service: delay deletion of the new_cdc_generation request
raft topology: broadcast on topology_state_machine.event instead of signal
cdc: implement the is_cdc_generation_optimal function
The `migration_manager` service is responsible for schema convergence in
the cluster - pushing schema changes to other nodes and pulling schema
when a version mismatch is observed. However, there is also a part of
`migration_manager` that doesn't really belong there - creating
mutations for schema updates. These are the functions with `prepare_`
prefix. They don't modify any state and don't exchange any messages.
They only need to read the local database.
We take these functions out of `migration_manager` and make them
separate functions to reduce the dependency of other modules (especially
`query_processor` and CQL statements) on `migration_manager`. Since all
of these functions only need access to `storage_proxy` (or even only
`replica::database`), doing such a refactor is not complicated. We just
have to add one parameter, either `storage_proxy` or `database` and both
of them are easily accessible in the places where these functions are
called.
This refactor makes `migration_manager` unneeded in a few functions:
- `alternator::executor::create_keyspace`,
- `cql3::statements::alter_type_statement::prepare_announcement_mutations`,
- `cql3::statements::schema_altering_statement::prepare_schema_mutations`,
- `cql3::query_processor::execute_thrift_schema_command`,
- `thrift::handler::execute_schema_command`.
We remove the `migration_manager&` parameter from all these functions.
Fixes #14339
Closes #14875
* github.com:scylladb/scylladb:
cql3: query_processor::execute_thrift_schema_command: remove an unused parameter
cql3: schema_altering_statement::prepare_schema_mutations: remove an unused parameter
cql3: alter_type_statement::prepare_announcement_mutations: change parameters
alternator: executor::create_keyspace: remove an unused parameter
service: migration_manager: change the prepare_ methods to functions
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
query_processor::execute_thrift_schema_command and
thrift::handler::execute_schema_command (that calls
query_processor::execute_thrift_schema_command) has been unused.
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
schema_altering_statement::prepare_schema_mutations has been
unused by all classes inheriting from schema_altering_statement.
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
alter_type_statement::prepare_announcement_mutations has become
unneeded. However, the function needs access to
service::storage_proxy and data_dictionary::database. Passing
storage_proxy& to it is enough.
This patch adds the ranges_parallelism option to the repair RESTful API.
Users can use this option to optionally specify the number of ranges
to repair in parallel per repair job to a smaller number than the Scylla
core calculated default max_repair_ranges_in_parallel.
Scylla manager can also use this option to provide more ranges (>N) in
a single repair job but only repairing N ranges_parallelism in parallel,
instead of providing N ranges in a repair job.
To make it safer, unlike the PR #4848, this patch does not allow users to
exceed the max_repair_ranges_in_parallel.
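The clamping rule described above can be sketched as follows (the function name is illustrative; the behavior matches the text: the user value is honored only up to the core-calculated maximum):

```cpp
#include <algorithm>
#include <optional>

// Effective per-job parallelism: the user-supplied ranges_parallelism
// if given, but never above max_repair_ranges_in_parallel.
inline unsigned effective_ranges_parallelism(std::optional<unsigned> user_opt,
                                             unsigned max_in_parallel) {
    if (!user_opt) {
        return max_in_parallel;          // default: the computed maximum
    }
    return std::min(*user_opt, max_in_parallel);
}
```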
Fixes #4847
Those without `_as` suffix are just marked non-static
The `..._as` ones are made class methods (now they are local to system_keyspace.cc)
After that the `..._as` ones are patched to use `this->` instead of `qctx`
Closes #14890
* github.com:scylladb/scylladb:
system_keyspace: Stop using qctx in [gs]et_scylla_local_param_as()
system_keyspace: Reuse container() and _db member for flushing
system_keyspace: Make [gs]et_scylla_local_param_as() class methods
system_keyspace: De-static [gs]et_scylla_local_param()
Currently, operation-options are declared in a single global list, then
operations refer to the options they support via name. This system was
born at a time when scylla-sstable had a lot of shared options between
its operations, so it was desirable to declare them centrally and only
add references to individual operations, to reduce duplication.
However, as the dust settled, only 2 options are shared by 2 operations
each. This is a very low benefit. Up to now the cost was also very low
-- shared options meant the same in all operations that used them.
However this is about to change, and this system becomes very awkward to
use as soon as multiple operations want to have an option with the same
name but slightly (or very) different meaning/semantics.
So this patch moves the options to the operations themselves.
Each will declare the list of options it supports, without having to
reference some common list.
This also removes an entire (although very uncommon) class of bugs:
an option name referring to a nonexistent option.
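The new layout can be sketched like this (type and field names are illustrative, not scylla-sstable's actual declarations): each operation owns its option list, so two operations may define same-named options with different descriptions and semantics, and there is no shared registry to mis-reference.

```cpp
#include <string>
#include <vector>

struct option {
    std::string name;
    std::string description;
};

struct operation {
    std::string name;
    std::vector<option> options; // declared locally, no global list
};

// Look up an option declared by this operation only.
inline const option* find_option(const operation& op, const std::string& n) {
    for (const auto& o : op.options) {
        if (o.name == n) {
            return &o;
        }
    }
    return nullptr;
}
```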
Closes #14898
Keep the endpoint address and the caller function name
around and print them in the different lock life cycle
state changes.
While at it, coroutinize gossiper::lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We can't lock an endpoint on arbitrary shards
since collisions will not be detected this way.
Assert that, and while at it, make the method private
as it is only used internally by the class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before making further changes to the endpoint_state_map
implementation, simplify `replicate` by providing only
one variant, replicating complete endpoint_state across
shards, instead of applying finer resolution changes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a `topology_change` command is applied, the topology state is
reloaded and `cdc::generation_service::handle_cdc_generation` is called.
This creates a dependency of group0 on `cdc::generation_service`.
Currently, the group0 server is stopped during `raft_group_registry`
shutdown. However, it is called after `cdc::generation_service`
shutdown, which can result in a segfault.
To prevent this issue, this commit stops the group0 server and removes it
from `raft_group_registry` during `group0_service` shutdown.
Fixes#14397.
Closes #14779
Reproducer:
97d6946e31
It creates two nodes. The second one is forced to stop after joining
group0. It sleeps before calling handle_cdc_generation and sleeps just
before raft_group_registry is stopped. It ensures that
handle_cdc_generation wakes up after starting the second sleep. If the
cdc_generation_service shutdown waits for raft_group_registry to stop,
handle_cdc_generation will be called without any issue. Otherwise, it
will crash since cdc_generation_service won't exist. The test passes
always. If the crash happens it can be seen in the log file of the
second node.
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.
The migration track is interrupted by pending topology change operations.
The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
Overload of shards is limited by the fact that the load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
Closes #14851
* github.com:scylladb/scylladb:
tablets: load_balancer: Remove double logging
tests: tablets: Check that load balancing is interrupted by topology change
tests: tablets: Add test for load balancing with active migrations
tablets: Balance tablets concurrently with active migrations
storage_service, tablets: Extract generate_migration_updates()
storage_service, tablets: Move get_leaving_replica() to tablets.cc
locator: tablets: Move std::hash definition earlier
storage_service: Advance tablets independently
topology_coordinator: Fix missed notification on abort
tablets: Add formatter for tablet_migration_info
Now those methods are non-static and can start using the object's
reference to the query processor instead of the global qctx thing
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_scylla_local_param_as() wants to flush replica::database on all
shards. For that it uses smp::invoke_on_all() and qctx, but since the
method is now a non-static one of system_keyspace, it can enjoy using
container().invoke_on_all() and this->_db (on the target shard)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These are now two .cc-local templatized helpers, but they are only
called by system_keyspace:: non-static methods, so they can be such as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All same-class callers are now non-static methods of system_keyspace,
all external callers do it via an object at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes#12010
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.
Fixes: #13290
Closes #13349
* github.com:scylladb/scylladb:
replica: make tables_metadata's attributes private
replica: add methods to get a filtered copy of tables map
replica: add methods to check if given table exists
replica: add methods to get table or table id
replica: api: return table_id instead of const table_id&
replica: iterate safely over tables related maps
replica: pass tables_metadata to phased_barrier_top_10_counts
replica: add methods to safely add and remove table
replica: wrap column families related maps into tables_metadata
replica: futurize database::add_column_family and database::remove
There are three methods in the system_keyspace namespace that run queries over the `system.scylla_table_schema_history` table. For that they use qctx, which is not nice.
Fortunately, all the callers already have a system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to the system keyspace, the latter declares the querying methods as "friends" to let them access the private `query_processor& _qp` member
Closes #14876
* github.com:scylladb/scylladb:
schema_tables: Extract query_processor from system_keyspace for querying
schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
migration_manager: Add system_keyspace argument to get_schema_mapping()
It is possible that the topology will contain nodes that are no longer normal token owners, so they don't need to be sync'ed with.
Fixes scylladb/scylladb#14793
Closes #14798
* github.com:scylladb/scylladb:
storage_service: refresh_sync_nodes: restrict to reachable token owners
storage_service: refresh_sync_nodes: fix log message
locator: topology: node::state: make fine grained
It is possible that the topology would contain nodes that no longer
own tokens or that are unreachable, so they can't be sync'ed with.
Restrict the list to nodes in a normal or being_decommissioned state.
Fixes scylladb/scylladb#14793
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, after triggering the compaction,
compaction_manager_basic_test waits until the triggered compaction
completes. but the regular compaction is run in a loop which
does not stop until the daemon is stopping, there are no
more sstables to be compacted, or the compaction is disabled.
also, we only get the input sstables for compaction after switching
to the "pending" state and acquiring the read lock of the
compaction_state. since acquiring the read lock is implemented as
a coroutine, there is a chance that the coroutine is suspended
and the execution switches to the test. in this case, the test
will find that even after the triggered compaction completes,
there are still one or more pending compactions, hence the test
fails.
to address this problem, instead of just waiting for the compaction
to complete, we also wait until the number of pending compaction tasks
is 0, so that even if the test manages to sneak into the time window,
it won't proceed to check the compaction manager's stats.
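the test-side fix boils down to a bounded polling loop (a sketch; the function name is made up, and the real test would yield to the reactor between polls rather than spin):

```cpp
#include <functional>

// Poll the pending-compaction counter until it reaches zero, giving up
// after max_polls attempts (standing in for the test's timeout).
inline bool wait_for_no_pending(std::function<int()> pending, int max_polls) {
    for (int i = 0; i < max_polls; ++i) {
        if (pending() == 0) {
            return true;
        }
        // in the real test we would sleep/yield here before re-checking
    }
    return false;
}
```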
Fixes#14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14889
Add calls to `maybe_yield` in the per-range loops to prevent stalls
if the loop never yields.
Note: originally the stalls were detected in nested calls
to `query_partition_key_range_concurrent` (see #14008).
This series turned the tail-recursion into iteration,
but still the inner loop(s) never yield and do quite
a lot of computations - so they might stall when called
with a large number of ranges.
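The shape of the fix can be modeled with a yield counter (a toy stand-in for Seastar's `maybe_yield`; the types here are illustrative):

```cpp
#include <vector>

struct scheduler {
    int yields = 0;
    void maybe_yield() { ++yields; } // stand-in for a preemption point
};

// Per-range loop with a yield point on every iteration, so a large
// range count can no longer stall the shard.
inline long process_ranges(scheduler& s, const std::vector<int>& ranges) {
    long total = 0;
    for (int r : ranges) {
        total += r;      // stand-in for the per-range work
        s.maybe_yield(); // give the reactor a chance to preempt us
    }
    return total;
}
```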
Fixes#14008
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
instead of using a loop of std::swap(), let's use std::shift_left()
when appropriate. simpler and more readable this way.
moreover, the pattern of looking for a command and consume it from
the command line resembles what we have in main(), so let's use
similar logic to handle both of them. probably we can consolidate
them in the future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14888
this change changes `const` to `constexpr`, because the string literal
defined here is not only immutable but also initialized at
compile time, and can be used by constexpr expressions and functions.
this change is introduced to reduce the size of the change when moving
to compile-time format string in future. so far, seastar::format() does
not use the compile-time format string, but we have patches pending on
review implementing this. and the author of this change has local
branches implementing the changes on scylla side to support compile-time
format string, which practically replaces most of the `format()` calls
with `seastar::format()`.
to reduce the size of the change and the pain of rebasing, some of the
less controversial changes are extracted and upstreamed. this is one
of them.
this change also addresses the following compilation failure:
```
/home/kefu/dev/scylladb/tools/scylla-sstable.cc:2836:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
2836 | .description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
| ^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
3148 | FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
| ^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14887
DynamoDB limits the length of all expressions (ConditionExpression, UpdateExpression,
ProjectionExpression, FilterExpression, KeyConditionExpression) to just
4096 bytes. Until now, Alternator did not enforce this limit, and we had
an xfailing test showing this.
But it turns out that not enforcing this limit can be dangerous: The user
can pass arbitrarily-long and arbitrarily nested expressions, such as:
a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...))))))
or
(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
and those can cause recursive algorithms in Alternator's parser and
later when applying expressions to recurse very deeply, overflow the
stack, and crash.
This patch includes new tests that demonstrate how Scylla crashes during
parsing before enforcing the 4096-byte length limit on expressions.
The patch then enforces this length limit, and these tests stop crashing.
We also verify that deeply-nested expressions shorter than the 4096-byte
limit are apparently short enough for our recursion ability, and work
as expected.
Unfortunately, running these tests many times showed that the 4096-byte
limit is not low enough to avoid all crashes so this patch needs to do
more:
The parsers created by ANTLR are recursive, and there is no way to limit
the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH).
Very deep recursion can overflow the stack and crash Scylla. After we
limited the length of expression strings to 4096 bytes this was *almost*
enough to prevent stack overflows. But unfortunately the tests revealed
that even limited to 4096 bytes, the expression can sometimes recurse
too deeply: Consider the expression "((((((....((((" with 4000 parentheses.
To realize this is a syntax error, the parser needs to do a recursive
call 4000 times. Or worse - because of other Antlr limitations (see rants
in comments in expressions.g) it's actually 12000 recursive calls, and
each of these calls has a pretty large frame. In some cases, this
overflows the stack.
The solution used in this patch is not pretty, but works. We add to rules
in alternator/expressions.g that recurse (there are two of those - "value"
and "boolean_expression") an integer "depth" parameter, which we increase
when the rule recurses. Moreover, we add a so-called predicate
"{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached.
When the parsing is stopped, the user will see a special kind of parse
error, saying "expression nested too deeply".
With this last modification to expressions.g, the tests for deeply-nested but
still-below-4096-bytes expressions
(test_limits.py::test_deeply_nested_expression_*) would not fail sporadically
as they did without it.
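The combination of the 4096-byte length limit and the depth predicate can be illustrated with a minimal hand-rolled sketch (Python, not the actual ANTLR-generated code; `MAX_DEPTH`'s concrete value here is an assumption, the real one lives in alternator/expressions.g):

```python
MAX_EXPRESSION_LENGTH = 4096  # DynamoDB's documented limit
MAX_DEPTH = 400               # hypothetical cap, like the grammar's MAX_DEPTH

class ExpressionError(Exception):
    pass

def parse_value(expr: str, pos: int = 0, depth: int = 0) -> int:
    """Parse 'x' or '(<value>)'; return the position after the value."""
    if len(expr) > MAX_EXPRESSION_LENGTH:
        raise ExpressionError('expression exceeds 4096 characters')
    if depth >= MAX_DEPTH:
        # the "{depth < MAX_DEPTH}?" predicate: stop recursing and report
        # a special parse error instead of overflowing the stack
        raise ExpressionError('expression nested too deeply')
    if pos < len(expr) and expr[pos] == '(':
        end = parse_value(expr, pos + 1, depth + 1)  # recursion bumps depth
        if end >= len(expr) or expr[end] != ')':
            raise ExpressionError(f'syntax error at position {end}')
        return end + 1
    if pos < len(expr) and expr[pos] == 'x':
        return pos + 1
    raise ExpressionError(f'syntax error at position {pos}')
```

An input like `"((((((....(((("` now fails with a clean "expression nested too deeply" error (and its position) once the depth counter trips, rather than recursing until the stack is exhausted.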
While adding the "expression nested too deeply" case, I also made the
general syntax-error reporting in Alternator nicer: It no longer prints
the internal "expression_syntax_error" type name (an exception type will
only be printed if some sort of unexpected exception happens), and it
prints the character position where the syntax error (or too-deeply-nested
expression) was recognized.
Fixes#14473
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14477
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742).
As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.
Fixes#14501
Fixes#14742
Closes#14745
* github.com:scylladb/scylladb:
test/cql-pytest: test confirming that casting to counter doesn't work
cql: support casting of counter to other types
cql: implement missing counterasblob() and blobascounter() functions
cql: implement missing type functions for "counters" type
We add a special mode of load balancing, enabled through error
injection, which causes it to continuously generate plans. This
should keep the topology coordinator continuously in the tablet
migration track.
We enable this mode in test_tablets.py:test_bootstrap before
bootstrapping nodes to see that bootstrap request interrupts
tablet migration track. If that were not the case, the
test would hang.
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decisions which, when executed alongside the active migrations, produce
the desired result.
Shard overload is limited by the fact that the algorithm tracks
streaming concurrency on both the source and target shards of active
migrations and takes the concurrency limit into account when producing
new migrations.
The coordinator executes the load balancer on the edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
This change makes the topology state machine advance each tablet
independently which allows them to finish migrations at different
speeds, not at the speed of the slowest tablet.
It will also open the possibility of starting new transitions concurrently
with already active ones.
This is implemented by having a single transition state "tablet
migration", and handling it by scanning all the transitions and
advancing tablet state machines. Updates and barriers are batched for
all tablets in each cycle.
One complication is the tracking of streaming sessions. The operations
are no longer nested in the scope of a single handle method, and
cannot be waited on explicitly, as that would inhibit progress of the
coordinator, which starts later migrations. They live as independent
fibers, which are associated with tablets in a transient data structure
living within the coordinator instance. This data structure is
consulted for a given tablet in each cycle of the
handle_tablet_migration() pump to check if streaming has finished and
we can move the tablet to the next stage. Only if the pump has no work
does it wait for any streaming to finish, by blocking on the
_topo_sm.event.
If _as is aborted while the coordinator is in the middle of handling,
and decides to go to sleep, it may go to sleep without noticing that
it was aborted. Fix by checking before blocking on the condition
variable.
In general, every condition which can cause signal() should be checked
before when(). This patch doesn't fix all the cases. For example,
signal() can be called when a new topology request arrives. This
can happen after the coordinator checked, because it releases the guard
before calling when().
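The rule "every condition which can cause signal() should be checked before when()" is the classic condition-variable pattern. A minimal sketch (Python with hypothetical names; the real code uses Seastar's condition variable):

```python
import threading

class Coordinator:
    def __init__(self):
        self._cond = threading.Condition()
        self._aborted = False
        self._pending_requests = 0

    def post_request(self):
        with self._cond:
            self._pending_requests += 1
            self._cond.notify_all()  # signal()

    def abort(self):
        with self._cond:
            self._aborted = True
            self._cond.notify_all()

    def wait_for_event(self):
        with self._cond:
            # Re-check every condition that could already have been
            # signalled *before* blocking; otherwise an abort (or request)
            # delivered while we were not waiting would be missed and we
            # would go to sleep anyway.
            self._cond.wait_for(
                lambda: self._aborted or self._pending_requests > 0)
            return self._aborted
```

With the predicate checked under the lock before sleeping, an abort that happened while the coordinator was busy handling work is noticed immediately instead of being lost.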
In the previous patch we implemented CAST operations from the COUNTER
type to various other types. We did not implement the reverse cast,
from different types to the counter type. Should we? In this patch
we add a test that shows we don't need to bother - Cassandra does not
support such casts, so it's fine that we don't either - and indeed the
test shows we don't support them.
It's not a useful operation anyway.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We were missing support in the "CAST(x AS type)" function for the counter
type. This patch adds this support, as well as extensive testing that it
works the same in Scylla as in Cassandra.
We also un-xfail an existing test translated from Cassandra's unit
test. But note that this old test did not cover all the edge-cases that
the new test checks - some missing cases in the implementation were
not caught by the old test.
Fixes#14501
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Code in functions.cc creates the different TYPEasblob() and blobasTYPE()
functions for all type names TYPE. The functions for the "counter" type
were skipped, supposedly because "counters are not supported yet". But
counters are supported, so let's add the missing functions.
The code fix is trivial; the tests that verify that the result behaves
like Cassandra took more work.
After this patch, unimplemented::cause::COUNTERS is no longer used
anywhere in the code. I wanted to remove it, but noticed that
unimplemented::cause is a graveyard of unused causes, so decided not
to remove this one either. We should clean it up in a separate patch.
Fixes#14742
Also includes tests for tangentially related issues:
Refs #12607
Refs #14319
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
types.cc had eight of its functions unimplemented for the counter
type, throwing an "unimplemented::cause::COUNTERS" when used.
A ninth function (validate) was unimplemented for counters but did not
even throw.
Many code paths did not use any of these functions so didn't care, but
some do - e.g., the silly do-nothing "SELECT CAST(c AS counter)" when
c is already a counter column, which causes this operation to fail.
When the types.cc code encounters a counter value, it is (if I understand
it correctly) already a single uint64_t ("long_type") value, so we fall
back to the long_type implementation of all the functions. To avoid mistakes,
I simply copied the reversed_type implementation for all these functions -
whereas the reversed_type implementation falls back to using the underlying
type, the counter_type implementation always falls back to long_type.
After this patch, "SELECT CAST(c AS counter)" for a counter column works.
We'll introduce a test that verifies this (and other things) in a later
patch in this series.
The following patches will also need more of these functions to be
implemented correctly (e.g., blobascounter() fails to validate the size
of the input blob if the validate function isn't implemented for the
counter type).
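Since a counter value is, at this layer, a single signed 64-bit integer, the counter functions can fall back to the long type's behavior. An illustrative sketch (Python; the real code is in types.cc and functions.cc, and the function names here just mirror the CQL-level `counterasblob()`/`blobascounter()`):

```python
import struct

COUNTER_SIZE = 8  # a counter value is one signed 64-bit integer, like 'long'

def counter_as_blob(value: int) -> bytes:
    # counterasblob(): same serialization as the bigint/long type
    return struct.pack('>q', value)

def blob_as_counter(blob: bytes) -> int:
    # blobascounter(): validate() must reject blobs of the wrong size,
    # exactly as the long type does
    if len(blob) != COUNTER_SIZE:
        raise ValueError(
            f'expected {COUNTER_SIZE} bytes for a counter, got {len(blob)}')
    return struct.unpack('>q', blob)[0]
```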
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The scaffolding required to have a working scylla tool app is considerable, leading to a large amount of boilerplate code in each such app. This logic is also very similar across the two tool apps we have and would presumably be very similar in any future app. This PR extracts this logic into `tools/utils.hh` and introduces `tool_app_template`, which is similar to `seastar::app_template` in that it centralizes all the option handling and more in a single class, which each tool has to just instantiate and then call `run()` on to run the app.
This cuts down on the repetition and boilerplate in our current tool apps and makes prototyping new tool apps much easier.
Closes#14855
* github.com:scylladb/scylladb:
tools/utils.hh: remove unused headers
tools/utils: make get_selected_operation() and configure_tool_mode() private
tools/utils.hh: de-template get_selected_operation()
tools/scylla-types: migrate to tools_app_template
tools/scylla-types: prepare for migration to tool_app_template
tools/scylla-sstable.cc: fix indentation
tools/scylla-sstables: migrate to tool_app_template
tools/scylla-sstables: prepare for migration to tool_app_template
tools: extract tool app skeleton to utils.hh
The selector keeps the selected format in system.local and uses the static
db::system_keyspace::(get|set)_scylla_local_param() helpers to access
it. The helpers are being made non-static, so the selector should call
them on a system_keyspace object, not on the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14871
convict doesn't do anything useful in this case
since we're already in mark_as_shutdown and
convict is called after mark_dead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This refreshes clang to 16.0.6 and libstdc++ to 13.1.1.
compiler-rt, libasan, and libubsan are added to install-dependencies.sh
since they are no longer pulled in as dependencies.
Closes#13730
This reverts commit fb05fddd7d. After
1554b5cb61 ("Update seastar submodule"),
which fixed a coroutine bug in Seastar, it is no longer necessary.
Also revert the related "build: drop the warning on -O0 might fail tests"
(894039d444).
* seastar c0e618bbb...0784da876 (11):
> Revert "metrics: Remove registered_metric::operator()"
> build: use new behavior defined by CMP0127
> build: pass -DBOOST_NO_CXX98_FUNCTION_BASE to C++ compiler
> coroutine: fix a use-after-free in maybe_yield
Ref #13730.
> Merge 'sstring: add more accessors' from Kefu Chai
> Merge 'semaphore: semaphore_units: return units when reassigned' from Benny Halevy
> metrics: do not define defaulted copy assignment operator
> HTTP headers in http_response are now case insensitive
> rpc: Make server._proto a reference
> Merge 'Cleanup class metrics::registered_metrics' from Pavel Emelyanov
> core: undefine fallthrough to fix compilation error
Closes#14862
Currently, streaming and repair process and send data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair could additionally be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference is in how the users choose the compaction time:
* Load'n'stream and streaming use the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to all repair followers. This ensures all repair participants work with the exact same state of compactness.
Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).
Fixes: https://github.com/scylladb/scylladb/issues/3561
Closes#14756
* github.com:scylladb/scylladb:
replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
repair/row_level: opt in to compacting the stream
streaming: opt-in to compacting the stream
sstables_loader: opt-in for compacting the stream
replica/table: add optional compacting to make_multishard_streaming_reader()
replica/table: add optional compacting to make_streaming_reader()
db/config: add config item for enabling compaction for streaming and repair
repair: log the error which caused the repair to fail
readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
mutation/mutation_compactor: allow user to abandon current partition
... instead of the global qctx. The now-used qctx->execute_cql() just calls
query_processor::execute_internal with cache_internal::yes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14874
The schema_tables() column-mapping code runs queries over a system table,
but it needs LOCAL_ONE CL and fine-grained control over caching, so the
regular system_keyspace::execute_cql() won't work here.
However, since schema_tables is somewhat part of system_keyspace, it's
natural to let the former fetch private query_processor& from the latter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The callers all have local sys_ks argument:
- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()
And a test that can get it from cql_test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will need one to pass to db::schema_tables code. The caller is paxos
code with sys_ks local variable at hand
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction_manager_basic_test checks the stats of compaction_manager to
verify that there are no ongoing or pending compactions after triggering
the compaction and waiting for its completion. But in #14865, there are
still active compaction(s) after the compaction_manager's stats show there
is at least one task completed.
To understand this issue better, let's use `BOOST_CHECK_EQUAL()` instead
of `BOOST_REQUIRE()`, so that the test does not error out when the check
fails, and we can have a better understanding of the status when the test
fails.
Refs #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14872
It now has a single user, so it doesn't have to be a template.
For now, make the method inline, so it can stay in the header. It will
be moved to utils.cc in the next patch.
The skeleton of the two existing scylla-native tools (scylla-types and
scylla-sstable) is very similar. By skeleton, I mean all the boilerplate
around creating and configuring a seastar::app_template, representing
operations/command and their options, and presenting and selecting
these.
To facilitate code-sharing and quick development of any new tools,
extract this skeleton from scylla-sstable.cc into tools/utils.hh,
in the form of a new tool_app_template, which wraps a
seastar::app_template and centralizes all the boilerplate logic in a
single place. The extracted code is not a simple copy-paste, although
many elements are simply copied. The original code is not removed yet.
The migration_manager service is responsible for schema convergence
in the cluster - pushing schema changes to other nodes and pulling
schema when a version mismatch is observed. However, there is also
a part of migration_manager that doesn't really belong there -
creating mutations for schema updates. These are the functions with
prepare_ prefix. They don't modify any state and don't exchange any
messages. They only need to read the local database.
We take these functions out of migration_manager and make them
separate functions to reduce the dependency of other modules
(especially query_processor and CQL statements) on
migration_manager. Since all of these functions only need access
to storage_proxy (or even only replica::database), doing such a
refactor is not complicated. We just have to add one parameter,
either storage_proxy or database and both of them are easily
accessible in the places where these functions are called.
These are users of global `qctx` variable or call `(get|set)_scylla_local_param(_as)?` which, in turn, also reference the `qctx`. Unfortunately, the latter(s) are still in use by other code and cannot be marked non-static in this PR
Closes#14869
* github.com:scylladb/scylladb:
system_keyspace: De-static set_raft_group0_id()
system_keyspace: De-static get_raft_group0_id()
system_keyspace: De-static get_last_group0_state_id()
system_keyspace: De-static group0_history_contains()
raft: Add system_keyspace argument to raft_group0::join_group0()
The TLS certificate authenticator registers itself using a
`class_registrator`; that's why CMake is able to build without
compiling this source file. But for the sake of completeness, and
to be in sync with configure.py, let's add it to CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14866
The callers are raft_group0_client, with a sys.ks. dependency reference, and
group0_state_machine, with raft_group0_client exposing its sys.ks.
This makes it possible to instantly drop one more qctx reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is raft_group0_client with a sys.ks. dependency reference.
This allows dropping one qctx reference right away
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method will need one to access db::system_keyspace methods. The
sys.ks. is at hand and in use in both callers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We add the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams so that it
doesn't create new generations when unnecessary.
Additionally, we modify the test_topology_ops.py test in a way
that verifies the new changes. We call
storage_service::raft_check_and_repair_cdc_streams multiple
times concurrently and verify that exactly one generation has been
created.
We delay the deletion of the new_cdc_generation request to the
moment when the topology transition reaches the
publish_cdc_generation state. We need this change to ensure
adding the CDC generation optimality check in the next commit
has the intended effect. Without it, it would be possible
that a task makes the new_cdc_generation request, and then, after
this request was removed but before committing the new generation,
another task also makes the new_cdc_generation request. In such
a scenario, two generations are created, but only one should be.
After the change introduced by this commit, the second request
would have no effect.
After adding the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams in the
following commits, multiple tasks will be waiting for a single
generation change. Calling signal on topology_state_machine.event
won't wake them all. Moreover, we must ensure the topology
coordinator wakes when its logic expects it. Therefore, we change
all signal calls on topology_state_machine.event to broadcast.
In the following commits, we add the CDC generation optimality
check to storage_service::raft_check_and_repair_cdc_streams so
that it doesn't create new CDC generations when unnecessary. Since
generation_service::check_and_repair_cdc_streams already has
this check, we extract it to the new is_cdc_generation_optimal
function to not duplicate the code.
cleanup_compaction_task_executor inherits both from compaction_task_executor
and cleanup_compaction_task_impl.
Add a new version of compaction_manager::perform_task_on_all_files
which accepts only the tasks that are derived from compaction_task_impl.
After all task executors' conversions are done, the new version replaces
the original one.
Debug mode is so slow that the work:poll ratio decreases, leading
to even more slowness as more polling is done for the same amount
of work.
Increase the task quota to recover some performance.
Ref #14752.
Closes#14820
To make it consistent with the upcoming methods, methods triggering
major compaction get std::optional<tasks::task_info> as an argument.
Thanks to that, we can distinguish between a task that has no parent
and a task which won't be registered in the task manager.
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key into the digest. If the same mutations
were compacted prior to the query, because the tombstones expired, we
won't get any mutation at all and won't feed the partition key. So the
schema digest will change once an empty partition of some schema table
is compacted away.
Tombstones expire 7 days after the schema change which introduced them.
If one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs a
schema_ptr of the exact schema version requested by the coordinator. If
it doesn't know that version, it will request it from the coordinator
and perform a full schema merge. This adds latency to every such
request. Schema versions which are not referenced are currently kept in
cache for only 1 second, so if the request flow has a low enough rate,
this situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, changes
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
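The idea behind the expiry-insensitive digest can be sketched as follows (illustrative Python; the hash and data layout are heavily simplified): a partition whose mutation compacts away to nothing contributes nothing to the digest, so the digest no longer changes when that empty partition is later physically compacted away.

```python
import hashlib

def table_schema_digest(partitions, insensitive_to_expiry: bool) -> str:
    """partitions: iterable of (partition_key_bytes, rows_after_compaction)."""
    h = hashlib.md5()
    for key, rows in partitions:
        if insensitive_to_expiry and not rows:
            # an empty partition would simply not appear in the query once
            # its expired tombstones are compacted away, so skip it now too
            continue
        h.update(key)
        for row in rows:
            h.update(row)
    return h.hexdigest()
```

With the feature on, the digest computed before and after the empty partition disappears is identical; with it off, the two digests differ, which is exactly the divergence described above.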
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
This commit adds a requirement to upgrade ScyllaDB
drivers before upgrading ScyllaDB.
The requirement to upgrade the Monitoring Stack
has been moved to the new section so that
both prerequisites are documented together.
NOTE: The information is added to the 5.2-to-5.3
upgrade guide because all future upgrade guides
will be based on this one (as it's the latest one).
If 5.3 is released, this commit should be backported
to branch-5.3.
Refs https://github.com/scylladb/scylladb/issues/13958
Closes#14771
If the cluster isn't empty and all servers are stopped, calling
ScyllaCluster.add_server can start a new cluster. That's because
ScyllaCluster._seeds uses the running servers to calculate the
seed node list, so if all nodes are down, the new node would
select only itself as a seed, starting a new cluster.
As a single ScyllaCluster should describe a single cluster, we
make ScyllaCluster.add_server fail when called on a non-empty
cluster with all its nodes stopped.
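The guard can be sketched like this (illustrative Python; ScyllaCluster internals are reduced to per-server running flags):

```python
class ClusterError(Exception):
    pass

def check_can_add_server(servers_running):
    """servers_running: one bool per existing server in the cluster."""
    if servers_running and not any(servers_running):
        # with every existing node down, the seed list would be empty, so
        # the new node would pick itself as the only seed and bootstrap a
        # brand new cluster instead of joining this one
        raise ClusterError(
            'refusing to add a server: cluster is non-empty but all its '
            'nodes are stopped')
```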
Closes#14804
When we convert a timestamp into a string, it must look like '2017-12-27T11:57:42.500Z'.
This concerns any conversion except the JSON timestamp format:
the JSON string has a space as the time separator and must look like '2017-12-27 11:57:42.500Z'.
Both formats always contain milliseconds and the timezone specification.
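A sketch of the two output shapes (illustrative Python; the real conversion lives in Scylla's C++ type code): both variants carry milliseconds and the 'Z' timezone suffix, differing only in the date/time separator.

```python
from datetime import datetime, timezone

def format_timestamp(ts_ms: int, json_style: bool = False) -> str:
    """ts_ms: non-negative milliseconds since the Unix epoch, UTC."""
    dt = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    sep = ' ' if json_style else 'T'   # JSON uses a space separator
    # both formats always contain milliseconds and the timezone
    return dt.strftime(f'%Y-%m-%d{sep}%H:%M:%S') + f'.{ts_ms % 1000:03d}Z'
```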
Fixes#14518
Fixes#7997
Closes#14726
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure there are no
accidental opt-outs.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be force turned-off.
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being on some undefined state of compaction.
Use locally generated compaction time on each node. This could lead to
different nodes making different decisions on what is expired or not.
But this is already the case for streaming, as what exactly is expired
depends on when compaction last ran.
Do to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.
Call sites are updated, but none opt in just yet.
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!
All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in separate commit per user.
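The opt-in shape can be sketched like this (illustrative Python; the fragment model and parameter names are assumptions, loosely following the commit text): a disengaged `compaction_time` means no compaction, an engaged one drives expiry decisions, and the global config override can force-disable the whole thing.

```python
def make_streaming_reader(fragments, compaction_time=None,
                          compaction_enabled=True):
    """fragments: iterable of (data, expiry) pairs; expiry may be None."""
    if compaction_time is None or not compaction_enabled:
        # disengaged parameter (or global override): stream data as-is
        yield from (data for data, _ in fragments)
        return
    for data, expiry in fragments:
        # drop expired data; tombstones are *not* garbage-collected here
        if expiry is None or expiry > compaction_time:
            yield data
```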
Compacting can greatly reduce the amount of data to be processed by
streaming and repair, but with certain data shapes, its effectiveness
can be reduced and its CPU overhead might outweigh the benefits. This
should very rarely be the case, but leave an off switch in case
this becomes a problem in a deployment.
Not wired yet.
Instead of just a boolean _failed flag, persist the error message of the
exception which caused the repair to fail, and include it in the log
message announcing the failure.
Do this when next_partition() or fast_forward_to() is called, instead
of trying to simulate a properly closed partition by injecting synthetic
mutation fragments.
Currently, the compactor requires a valid stream and thus abandoning a
partition in the middle was not possible. This causes some complications
for the compacting reader, which implements methods such as
`next_partition()` which is possibly called in the middle of a
partition. In this case the compacting reader attempts to close the
partition properly by inserting a synthetic partition-end fragment into
the stream. This is not enough however as it doesn't close any range
tombstone changes that might be active. Instead of piling on more
complexity, add an API to the compactor which allows abandoning the
current partition.
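A minimal sketch of the new API (illustrative Python; the real compactor lives in mutation/mutation_compactor.hh and this state model is simplified): abandoning resets all per-partition state, including still-open range tombstone changes, which a synthetic partition-end fragment alone could not clean up.

```python
class Compactor:
    def __init__(self):
        self._in_partition = False
        self._open_range_tombstones = []

    def consume_partition_start(self, key):
        self._in_partition = True

    def consume_range_tombstone_change(self, rtc):
        self._open_range_tombstones.append(rtc)

    def abandon_current_partition(self):
        # Forget the partition being consumed mid-way: clear any still-open
        # range tombstone changes and partition state, so the caller (e.g.
        # the compacting reader's next_partition()) can jump straight to
        # the next partition without emitting synthetic fragments.
        self._open_range_tombstones.clear()
        self._in_partition = False
```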
Simpler than the "begin, end" iterator pair. Also tighten the type
constraints: the value type is now required to be
sstables::shared_sstable, which matches what we expect in the
implementation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14678
The db.get_version() called that early returns the value the database got
at construction time, i.e. the empty_version. It makes little sense to
commit it into the system keyspace, all the more so because the "real"
version is calculated and updated a few steps after .setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14833
Fortunately, this is pretty simple -- the only caller is storage_service,
which has a sharded<system_keyspace> dependency reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14824
It was found that cached_file dtor can hit the following assert
after OOM
cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.
cached_file's dtor iterates through all entries and evicts those
that are linked to the LRU, under the assumption that all unused
entries were linked to the LRU.
That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.
share() is responsible for automatically placing the page into
LRU once refcount drops to zero.
If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.
Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.
A page brought in by read ahead will now be linked into the LRU so it
doesn't sit in memory indefinitely. This also allows the cached_file
dtor to clear the whole cache even if some of the pages brought in
advance were never fetched later.
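The fix can be sketched with a toy refcounted page cache (illustrative Python; method names like `populate`/`share` are assumptions mirroring the description, not the cached_file API): read-ahead pages start at refcount 0 and are linked into the LRU immediately, so they stay evictable even if never shared.

```python
class Page:
    def __init__(self):
        self.refs = 0

class CachedFile:
    def __init__(self):
        self.pages = {}  # index -> Page
        self.lru = []    # unused (refcount == 0) pages, safe to evict

    def populate(self, idx):
        # a read may bring in extra pages via read ahead; link them into
        # the LRU right away so they don't sit in memory unlinked if the
        # read is aborted before they are shared
        page = self.pages.setdefault(idx, Page())
        if page.refs == 0 and page not in self.lru:
            self.lru.append(page)
        return page

    def share(self, idx):
        # taking a reference removes the page from the LRU
        page = self.pages[idx]
        if page in self.lru:
            self.lru.remove(page)
        page.refs += 1
        return page
```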
A reproducer was added.
Fixes#14814.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14818
Fixes https://github.com/scylladb/scylla-docs/issues/4028
The goal of this update is to discourage the use of
the default cassandra superuser in favor of a custom
super user - and explain why it's a good practice.
The scope of this commit:
- Adding a new page on creating a custom superuser.
The page collects and clarifies the information
about the cassandra superuser from other pages.
- Remove the (incomplete) information about
superuser from the Authorization and Authentication
pages, and add the link to the new page instead.
Additionally, this update will result in better
searchability and ensures language clarity.
Closes#14829
This is to make m.s. initialization more solid and simplify sys.ks.::setup()
Closes#14832
* github.com:scylladb/scylladb:
system_keyspace: Remove unused snitch arg from setup()
messaging_service: Setup preferred IPs from config
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.
We missed a few such cases in test_gsi.py, and this patch corrects them.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.
There are no functional changes in this patch - most of the lines
changed are just reindents.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14835
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.
Fixes#14839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14840
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size, containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
Fixes: #14819
Closes #14821
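A minimal Python sketch of the accounting fix described above, with hypothetical names mirroring the C++ consumer; the flush threshold value is illustrative:

```python
FLUSH_THRESHOLD = 1024  # bytes; illustrative, not Scylla's actual threshold

class ViewUpdatingConsumer:
    """Sketch of a consumer that flushes buffered mutations by size."""
    def __init__(self, flush):
        self._flush = flush
        self._buffer = []
        self._buffer_size = 0

    def consume_new_partition(self, base_mutation_size):
        # The fix: account for the freshly created mutation object itself,
        # so a stream of empty partitions still grows _buffer_size.
        self._buffer.append(base_mutation_size)
        self._buffer_size += base_mutation_size
        self._maybe_flush()

    def _maybe_flush(self):
        if self._buffer_size >= FLUSH_THRESHOLD:
            self._flush(self._buffer)
            self._buffer = []
            self._buffer_size = 0
```

Without the bump in `consume_new_partition`, empty partitions would accumulate without ever crossing the threshold.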
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
Population of the messaging service preferred-IPs cache happens inside
the system keyspace setup() call, and it needs m.s. per se and additionally
the snitch. Moving preferred-IP cache population to initial configuration keeps
m.s. start more self-contained and keeps system_keyspace::setup() simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today, test/*/run always kills Scylla at the end of the test with
SIGKILL (kill -9), so the Scylla shutdown code doesn't run. It was
believed that a clean shutdown would take a long time, but in fact,
it turns out that 99% of the shutdown time was a silly sleep in the
gossip code, which this patch disables with the "--shutdown-announce-in-ms"
option.
After enabling this option, clean shutdown takes (in a dev build on
my laptop) just 0.02 seconds. It's worth noting that this shutdown
has no real work to do - no tables to flush, and so on, because the
pytest framework removes all the tables in its own fixture cleanup
phase.
So in this patch, to kill Scylla we use SIGTERM (15) instead of SIGKILL.
We then wait until a timeout of 10 seconds (much much more than 0.02
seconds!) for Scylla to exit. If for some reason it didn't exit (e.g.,
it hung during the shutdown), it is killed again with SIGKILL, which
is guaranteed to succeed.
This change gives us two advantages:
1. Every test run with test/*/run exercises the shutdown path. It is perhaps
excessive, but since the shutdown is so quick, there is no big downside.
2. In a test-coverage run, a clean shutdown allows flushing the counter
files, which wasn't possible when Scylla was killed with KILL -9.
Fixes #8543
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14825
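The SIGTERM-then-SIGKILL sequence described above can be sketched in Python (`stop_scylla` is a hypothetical helper, not the actual test/*/run code):

```python
import signal
import subprocess

def stop_scylla(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask the process to shut down cleanly; escalate to SIGKILL on timeout."""
    proc.send_signal(signal.SIGTERM)   # request a clean shutdown
    try:
        # Wait much longer than a clean shutdown normally takes.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                    # SIGKILL is guaranteed to succeed
        return proc.wait()
```

The timeout only matters when the process hangs during shutdown; in the common case the wait returns almost immediately.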
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.
The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.
The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.
Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.
Closes #14601
* github.com:scylladb/scylladb:
tests: test_tablets: Add test for bootstrapping a node
storage_service: topology_coordinator: Implement tablet migration state machine
tablets: Introduce tablet_mutation_builder
service: tablet_allocator: Introduce tablet load balancer
tablets: Introduce tablet_map::for_each_tablet()
topology: Introduce get_node()
token_metadata: Add non-const getter of tablet_metadata
storage_service: Notify topology state machine after applying schema change
storage_service: Implement stream_tablet RPC
tablets: Introduce global_tablet_id
stream_transfer_task, multishard_writer: Work with table sharder
tablets: Turn tablet_id into a struct
db: Do not create per-keyspace erm for tablet-based tables
tablets: effective_replication_map: Take transition stage into account when computing replicas
tablets: Store "stage" in transition info
doc: Document tablet migration state machine and load balancer
locator: erm: Make get_endpoints_for_reading() always return read replicas
storage_service: topology_coordinator: Sleep on failure between retries
storage_service: topology_coordinator: Simplify coordinator loop
main: Require experimental raft to enable tablets
A test reproducing #14819, that is, the view update builder not flushing
the buffer when only empty partitions are consumed (with only a
tombstone in them).
All partitions will have a corresponding mutation object in the buffer.
These objects have non-negligible sizes, yet the consumer did not bump
the _buffer_size when a new partition was consumed. This resulted in
empty partitions not moving the _buffer_size at all, and thus they could
accumulate without bounds in the buffer, never triggering a flush just
by themselves. We have recently seen this causing OOM.
This patch fixes that by bumping the _buffer_size with the size of the
freshly created mutation object.
The method is called by db::truncate_table_on_all_shards(); its call-chain, in turn, starts from:
- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests
All of the above can easily provide a system_keyspace reference. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of the global qctx.
Closes #14778
* github.com:scylladb/scylladb:
system_keyspace: Make save_truncation_record() non-static
code: Pass sharded<db::system_keyspace>& to database::truncate()
db: Add sharded<system_keyspace>& to legacy_schema_migrator
The function dispatches a background operation that must be
waited on in stop().
Fixes scylladb/scylladb#14791
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14797
Identifies a tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share the same global_tablet_id.
Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.
It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid: the first node inserts
its tokens as normal without going through the bootstrap procedure.
Make _column_families and _ks_cf_to_uuid private to prevent unsafe
access. The maps can be accessed only through methods which use locks
if preemption is possible.
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
When messaging_service shuts down it first sets _shutting_down to true
and proceeds with stopping clients and servers. Stopping clients, in
turn, is calling client.stop() on each.
Setting _shutting_down is used in two places.
First, when a client is stopped, it may be in the middle of
some operation, which may result in a call to remove_error_rpc_client();
to avoid calling .stop() for the second time, that path just does
nothing if the shutdown flag is set (see 357c91a076).
Second, get_rpc_client() asserts that this flag is not set, so once
shutdown started it can make sure that it will call .stop() on _all_
clients and no new ones would appear in parallel.
However, after shutdown() is complete the _clients vector of maps
remains intact even though all clients from it are stopped. This is not
very debugging-friendly; the clients had better be removed on shutdown.
fixes: #14624
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14632
It is quite common to stop a tested scylla process with ^C, which will
raise KeyboardInterrupt from subprocess.run(). Catch and swallow this
exception, allowing the post-processing to continue.
The interrupted process has to handle the interrupt correctly too --
flush the coverage data even on premature exit -- but this is for
another patch.
Closes #14815
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The last
is important in case fence was moved while
we were handling the request - some post-fence
actions might have already happened at this
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.
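The double check can be sketched as follows (Python; `apply_with_fence` and the exception name are hypothetical stand-ins for the C++ code):

```python
class StaleTopologyException(Exception):
    """Raised when the request's fencing token is behind the current version."""

def apply_with_fence(current_version, fence_token, apply_fn):
    # Check once before applying, for cheap early rejection...
    if fence_token < current_version():
        raise StaleTopologyException()
    apply_fn()
    # ...and once after: the fence may have moved while we were applying,
    # and post-fence actions (e.g. streaming) may have missed this update,
    # so the request cannot be reported as successful.
    if fence_token < current_version():
        raise StaleTopologyException()
```

Note that the mutation is still applied in the raced case; only the success report is withheld.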
In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.
We don't pass the fencing token for replication explicitly in
replicate_counter_from_leader, since
mutate_counter_on_leader_and_replicate doesn't capture the erm;
if the drain on the coordinator timed out, the erm for
replication might be different, and we should use the
corresponding (maybe new) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
We will need it in later commit to return
exceptions from handle_counter_mutation.
We also add utils::Tuple concept restriction
for add_replica_exception_to_query_result
since its type parameters are always tuples.
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshal exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.
The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was fed into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
The fix is to use the sstable_predicate type, so there is no
need to construct a temporary object on the stack.
Fixes #14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14813
in this series, we use different table names in simple_backlog_controller_test. this test exercises sstables compaction strategies, and it creates and keeps multiple tables in a single test session. but we are going to add metrics on a per-table basis, and will use the table's ks and cf as the counter's labels. as the metrics subsystem does not allow multiple counters to share the same labels, the test would fail when the metrics are added.
to address this problem, in this change
1. a new ctor is added for `simple_schema`, so we can create `simple_schema` with different names
2. use the new ctor in simple_backlog_controller_test
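The idea behind the new constructor can be sketched in Python (names are hypothetical stand-ins for the C++ `simple_schema`):

```python
import uuid

class SimpleSchema:
    """Sketch: a test schema whose keyspace/table names default to ks.cf
    but can be customized so per-table metric labels stay unique."""
    def __init__(self, ks_name: str = "ks", cf_name: str = "cf"):
        self.ks_name = ks_name
        self.cf_name = cf_name

def unique_schema() -> SimpleSchema:
    # One distinct table name per test table, so metric labels never collide.
    return SimpleSchema(cf_name=f"cf_{uuid.uuid4().hex[:8]}")
```

Existing tests keep working with the defaults; only tests that hold several tables at once need distinct names.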
Fixes #14767
Closes #14783
* github.com:scylladb/scylladb:
test: use different table names in simple_backlog_controller_test
test/lib/simple_schema: add ctor for customizing ks.cf
test/lib/simple_schema: do not hardwire ks.cf
This commit adds the information on how to install ScyllaDB
without root privileges (with "unified installer", but we've
decided to drop that name - see the page title).
The content taken from the website
https://www.scylladb.com/download/?platform=tar&version=scylla-5.2#open-source
is divided into two sections: "Download and Install" and
"Configure and Run ScyllaDB".
In addition, the "Next Steps" section is also copied from
the website, and adjusted to be in sync with other installation
pages in the docs.
Refs https://github.com/scylladb/scylla-docs/issues/4091
Closes #14781
In this commit we just pass a fencing_token
through hint_mutation RPC verb.
The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server's stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.
Closes #14580
We don't load gossiper endpoint states in `storage_service::join_cluster` if `_raft_topology_change_enabled`, but gossiper is still needed even in case of `_raft_topology_change_enabled` mode, since it still contains part of the cluster state. To work correctly, the gossiper needs to know the current endpoints. We cannot rely on seeds alone, since it is not guaranteed that seeds will be up to date and reachable at the time of restart.
The problem was demonstrated by the test `test_joining_old_node_fails`, it fails occasionally with `experimental_features: [consistent-topology-changes]` on the line where it waits for `TEST_ONLY_FEATURE` to become enabled on all nodes. This doesn't happen since `SUPPORTED_FEATURES` gossiper state is not disseminated, and feature_service still relies on gossiper to disseminate information around the cluster.
The series also contains a fix for a problem in `gossiper::do_send_ack2_msg`, see commit message for details.
Fixes #14675
Closes #14775
* github.com:scylladb/scylladb:
storage_service: restore gossiper endpoints on topology_state_load fix
gossiper: do_send_ack2_msg fix
If a semaphore mismatch occurs, check whether both semaphores belong
to a user. If so, log a warning, bump the `querier_cache_scheduling_group_mismatches` stat and drop the cached reader instead of throwing an error.
Until now, the semaphore mismatch was only checked in multi-partition queries. The PR pushes the check down to `querier_cache` and performs it in all `lookup_*_querier` methods.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
This patch doesn't solve the problem of mismatched semaphores caused by changes in service levels/scheduling groups, but only mitigates it.
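A sketch of the lookup-time check (Python, with hypothetical names; the real logic lives in `querier_cache` and `replica::database::is_user_semaphore()`):

```python
import logging

class QuerierCache:
    """Sketch: on lookup, a mismatch between two *user* semaphores drops
    the cached reader instead of raising an internal error."""
    def __init__(self, is_user_semaphore):
        self._cache = {}
        self._is_user = is_user_semaphore
        self.scheduling_group_mismatches = 0

    def insert(self, key, semaphore, reader):
        self._cache[key] = (semaphore, reader)

    def lookup(self, key, semaphore):
        entry = self._cache.get(key)
        if entry is None:
            return None
        cached_semaphore, reader = entry
        if cached_semaphore is not semaphore:
            if self._is_user(cached_semaphore) and self._is_user(semaphore):
                # Scheduling group changed mid-query: warn, count, drop.
                logging.warning("semaphore mismatch; dropping cached reader")
                self.scheduling_group_mismatches += 1
                del self._cache[key]
                return None
            raise RuntimeError("semaphore mismatch on non-user semaphore")
        return reader
```

Dropping the reader costs one cold restart of the query; raising would fail it outright.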
Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770
Closes #14736
* github.com:scylladb/scylladb:
querier_cache: add stats of scheduling group mismatches
querier_cache: check semaphore mismatch during querier lookup
querier_cache: add reference to `replica::database::is_user_semaphore()`
replica:database: add method to determine if semaphore is user one
We don't load gossiper endpoint states in
storage_service::join_cluster if
_raft_topology_change_enabled, but gossiper
is still needed even in case of
_raft_topology_change_enabled mode, since it
still contains part of the cluster state.
To work correctly, the gossiper needs to know
the current endpoints. We cannot rely on seeds alone,
since it is not guaranteed that seeds will be
up to date and reachable at the time of restart.
The specific scenario of the problem: cluster with
three nodes, the second has the first in seeds,
the third has the first and second. We restart all
the nodes simultaneously, the third node uses its
seeds as _endpoints_to_talk_with in the first gossiper round
and sends SYN to the first and second. The first node
hasn't started its gossiper yet, so handle_syn_msg
returns immediately after if (!this->is_enabled());
The third node receives ack from the second node and
no communication from the first node, so it fills
its _live_endpoints collection with the second node
and will never communicate with the first node again.
The problem was demonstrated by the test
test_joining_old_node_fails, it fails occasionally with
experimental_features: [consistent-topology-changes]
on the line where it waits for TEST_ONLY_FEATURE
to become enabled on all nodes. This doesn't happen
since SUPPORTED_FEATURES gossiper state is not
disseminated because of the problem described above.
The first commit is needed since add_saved_endpoint
adds the endpoint with some default app states with locally
incrementing versions, and without that fix the gossiper
refuses to fill in the real app states for this endpoint later.
Fixes: #14675
Fixes #14668
In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`.
Additionally, we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668.
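For reference, a sketch of the resulting `scylla.yaml` entry; the value 128 comes from the discussion above, while the comment text is only a summary:

```yaml
# Maximum size of a schema commitlog segment, in MB. This also caps the
# largest schema mutation that can be written at once; 128 MB accommodates
# schema mutations that are much larger than average (see #13864).
schema_commitlog_segment_size_in_mb: 128
```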
Closes #14704
* github.com:scylladb/scylladb:
replica: do not derive the commitlog sync period for schema commitlog
config: set schema_commitlog_segment_size_in_mb to 128
config: add schema_commitlog_segment_size_in_mb variable
This commit is a first part of the fix for #14675.
The issue is about the test test_joining_old_node_fails
failing occasionally with
experimental_features: [consistent-topology-changes].
The next commit contains a fix for it, here we
solve the pre-existing gossiper problem
which we stumbled upon after the fix.
Local generation for addr may have been
increased since the current node sent
an initial SYN. Comparing versions across different
generations in get_state_for_version_bigger_than
could result in losing some app states with
smaller versions.
More specifically, consider a cluster with nodes
.1, .2, .3, .3 has .1 and .2 as seeds, .2 has .1
as a seed. Suppose .2 receives a SYN from .3 before
its gossiper starts, and it has a
version 0.24 for .1 in endpoint_states.
The digest from .3 contains 0.25 as a version for .1,
so examine_gossiper produces .1->0.24 as a digest
and this digest is sent to .3 as part of the ack.
Before processing this ack, .3 processed an ack from
.1 (scylla sends SYN to many nodes) and updates
its endpoint_states according to it, so now it
has .1->100500.32 for .1. Then
we get to do_send_ack2_msg and call
get_state_for_version_bigger_than(.1, 24).
This returns properties which have version > 24,
ignoring a lot of them with smaller versions
which have been received from .1. Also,
get_state_for_version_bigger_than updates
generation (it copies get_heart_beat_state from
.3), so when we apply the ack in handle_ack2_msg
at .2 we update the generation and now the
skipped app states will only be updated on .2
if somebody changes them and increments their version.
Cassandra behaviour is the same in this case
(see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/GossipDigestAckVerbHandler.java#L86). This is probably less
of a problem for them since most of the time they
send only one SYN in one gossiper round
(save for unreachable nodes), so there is less
room for conflicts.
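The version-only filtering described above can be demonstrated with a small sketch (Python; the app-state names, generations and versions are illustrative, not the real gossiper values):

```python
def states_newer_than(app_states, version):
    # Sketch of get_state_for_version_bigger_than: versions are compared
    # without regard to the generation the requester's digest refers to.
    return {name: (gen, ver)
            for name, (gen, ver) in app_states.items()
            if ver > version}

# .3's view of node .1 after applying .1's own ACK: a newer generation,
# whose versions restarted from small numbers.
node1_states = {
    "STATUS": (100500, 5),
    "SUPPORTED_FEATURES": (100500, 12),
    "HEARTBEAT": (100500, 32),
}
# .2's digest still refers to version 24 of .1's *old* generation, so the
# reply silently omits states whose new-generation versions are <= 24.
reply = states_newer_than(node1_states, 24)
```

Here `SUPPORTED_FEATURES` is exactly the kind of state that gets skipped, matching the symptom seen in the test.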
before this change, add_version_library() is a single function
which accomplishes two tasks:
1. build the scylla-version target
2. add an object library
but this has two problems:
1. we should run `SCYLLA-VERSION-GEN` at configure time, instead
of at build time. otherwise the targets which read from the
SCYLLA-{VERSION, RELEASE, PRODUCT}-FILE cannot access them,
unless they are able to read them in their build rules. but
they always use `file(STRINGS ..)` to read them, and the
`file()` command is executed at configure time. so, this
is a dead end.
2. we repeat the `file(STRINGS ..)` call in multiple places. this is
not ideal if we want to minimize repetition.
so, to address this problem, in this change:
1. use `execute_process()` instead of `add_custom_command()`
for generating these *-FILE files. so they are always ready
at build time. this partially reverts bb7d99ad37.
2. extract `generate_scylla_version()` out of `add_version_library()`.
so we can call the former much earlier than the latter.
this would allow us to reference the variables defined by
the `generate_scylla_version()` much earlier.
3. define cached strings in the extracted function, so that
they can be consumed by other places.
4. reference the cached variables in `build_submodule.cmake`.
also, take this opportunity to fix the version string
used in build_submodule.cmake: we should have used
`scylla_version_tilde`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14769
Previously semaphore mismatch was checked only in multi-partition
queries, and if it happened, an internal error was thrown.
This commit pushes the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.
What's more, if a semaphore mismatch occurs, we check whether both semaphores belong
to a user. If so, we log a warning and drop the cached reader instead of
throwing an error.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
Before choosing a function, we prepare the arguments that can be
prepared without a receiver. Preparing an argument makes
its type known, which allows choosing the best overload
among many possible functions.
The function that prepared the argument passes the unprepared
argument by mistake. Let's fix it so that it actually uses
the prepared argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #14786
We allow inserting column values using a JSON value, e.g.:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```
When no JSON value is specified, the query should be rejected.
Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.
A unit test is added to prevent regressions.
Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #14785
in `simple_backlog_controller_test`, we need to have multiple tables
at the same time. but the default constructor of `simple_schema` always
creates a schema with the table name of "ks.cf". we are going to have
per-table metrics, and the new metric group will use the table name
as its counter labels, so we need to either disable the per-table
metrics or use a different table name for each table.
as in the real world we don't have multiple tables with the same name
at the same time, it would be better to stop reusing the same table
name in a single test session. so, in this change, we use a random
cf_name for each of the created tables.
Fixes #14767
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
some low level tests, like the ones exercising sstables, create
multiple tables. and we are going to add per-table metrics, and
the new metrics use the ks.cf as part of their unique id. so,
once the per-table metrics is enabled, the sstable tests would fail.
as the metrics subsystem does not allow registering multiple
metric groups with the same name.
so, in this change, we add a new constructor for `simple_schema`,
so that we can customize the schema's ks and cf when creating
the `simple_schema`. in the next commit, we will use this new
constructor in an sstable test which creates multiple tables.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead, query the names of ks and cf from the schema. this change
prepares us for a simple_schema whose ks and cf can be customized
by its constructor.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The argument goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from
- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One of the class' methods calls db::drop_table_on_all_shards() that will
need sys.ks. in the next patch.
The reference in question is provided from the only caller -- main.cc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in 46616712, we tried to keep the tmpdir only if the test failed,
and keep up to 1 of them using the recently introduced
option of `tmp_path_retention_count`. but it turns out this option
is not supported by the pytest used by our jenkins nodes, where we
have pytest 6.2.5. this is the one shipped along with fedora 36.
so, in this change, the tempdir is removed if the test completes
without failures. as the tempdir contains a huge number of files,
jenkins is quite slow scanning them. after nuking the tempdir,
jenkins will be much faster when scanning for the artifacts.
Fixes #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14772
The mutation compactor has a validator which it uses to validate the stream of mutation fragments that passes through it. This validator is supposed to validate the stream as it enters the compactor, as opposed to its compacted form (output). This was true for most fragment kinds except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by https://github.com/scylladb/scylladb/commit/e2c9cdb576, which itself was a flawed attempt at fixing an error seen because purged tombstones were not terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly: on page-cut, if the validator has an active tombstone, a closing tombstone is generated for it, to avoid the false-positive error. With this, range tombstones can be validated again as they come in.
The existing unit test checking the validation in the compactor is greatly expanded to check all (I hope) different validation scenarios.
Closes #13817
* github.com:scylladb/scylladb:
test/mutation_test: test_compactor_validator_sanity_test
mutation/mutation_compactor: fix indentation
mutation/mutation_compactor: validate the input stream
mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator
readers: reader-from-fragment: don't modify stream when created without range
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).
Easily fixed by removing the empty alternative. A unit test is
added.
Fixes #14705.
Closes #14707
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.
Closes #14765
* github.com:scylladb/scylladb:
s3/test: use seastar::deferred() to perform cleanup
s3/test: close using deferred_close()
The set in question is read-and-delete-only and thus always empty. Originally it was removed by commit c9993f020d (storage_service: get rid of handle_state_replacing), but some dangling ends were left. Consequently, the on_alive() callback can get rid of a few dead if-else branches.
Closes #14762
* github.com:scylladb/scylladb:
storage_service: Relax on_alive()
storage_service: Remove _replacing_nodes_pending_ranges_updater
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.
Closes #14758
* github.com:scylladb/scylladb:
db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
db/system_keyspace: Make paxos methods non-static
service/paxos: Add db::system_keyspace& argument to some methods
test: Optionally initialize proxy remote for cql_test_env
proxy/remote: Keep sharded<db::system_keyspace>& dependency
Task manager tasks covering reshard compaction.
Reattempt of https://github.com/scylladb/scylladb/pull/14044. The bugfix for https://github.com/scylladb/scylladb/issues/14618 is squashed with 95191f4.
Regression test added.
Closes #14739
* github.com:scylladb/scylladb:
test: add test for resharding with non-empty owned_ranges_ptr
test: extend test_compaction_task.py to test resharding compaction
compaction: add shard_reshard_sstables_compaction_task_impl
compaction: invoke resharding on sharded database
compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
compaction: add reshard_sstables_compaction_task_impl
compaction: create resharding_compaction_task_impl
Greatly expand this test to check that the compactor validates the input
stream properly.
The test is renamed (the _sanity_test suffix is removed) to reflect the
expanded scope.
The mutation compactor has a validator which it uses to validate the
stream of mutation fragments that passes through it. This validator is
supposed to validate the stream as it enters the compactor, as opposed
to its compacted form (output). This was true for most fragment kinds
except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by e2c9cdb576, which itself was a flawed
attempt at fixing an error seen because purged tombstones were not
terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly:
on page-cut, if the validator has an active tombstone, a closing
tombstone is generated for it, to avoid the false-positive error. With
this, range tombstones can be validated again as they come in.
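The page-cut handling can be sketched as follows (Python, with hypothetical names standing in for the C++ validator):

```python
class StreamValidator:
    """Sketch: track the active range tombstone; on a page cut, synthesize a
    closing change so a stream cut mid-tombstone isn't flagged as invalid."""
    def __init__(self):
        self.active_tombstone = None
        self.errors = []

    def on_range_tombstone_change(self, bound, tombstone):
        # A change carrying tombstone=None closes the active tombstone.
        self.active_tombstone = tombstone

    def on_page_cut(self):
        if self.active_tombstone is not None:
            # Generate a closing change instead of leaving the tombstone
            # dangling, avoiding a false-positive validation error.
            self.on_range_tombstone_change("page-end", None)

    def on_end_of_stream(self):
        if self.active_tombstone is not None:
            self.errors.append("stream ended with an active range tombstone")
```

With this, the validator can again see every incoming range tombstone, purged or not, without misreporting page cuts.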
The fragment reader currently unconditionally forwards its buffer to the
passed-in partition range. Even if this range is
`query::full_partition_range`, this will involve dropping any fragments
up to the first partition start. This causes problems for test users who
intentionally create invalid fragment streams, that don't start with a
partition-start.
Refactor the reader to not do any modifications on the stream, when
neither slice, nor partition-range was passed by the user.
before this change, there are chances that the temporary sstables
created for collecting the GC-able data created by a certain
compaction can be picked up by another compaction job. this
wastes CPU cycles, adds write amplification, and causes
inefficiency.
in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have a good chance to
overlap with the latter, these GC-only SSTables are assigned
different run ids. but we fail to register them with the
`compaction_manager` when replacing the main sstable set.
that's why future compactions can pick them up while the
compaction which created them is not yet completed.
so, in this change,
* to prevent sstables in the transient stage from being picked
up by regular compactions, a new interface class is introduced
so that an sstable is always registered before it is added to
the sstable set, and unregistered only after it is removed from
the sstable set. the struct helps to consolidate the
registration-related logic in a single place, and makes it more
obvious that the registration lifetime of an sstable should
cover its lifetime in the sstable set.
* use a different run_id for the gc sstable run, as it can
overlap with the output sstable run. the run_id for the
gc sstable run is created only when the gc sstable writer
is created, because the gc sstables are not always created
for all compactions.
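The registration invariant described above can be sketched as an RAII guard. This is an illustrative example with invented names (`compacting_registry`, `registered_sstable`), not the actual Scylla interface: registration happens strictly before the sstable enters the set, and unregistration strictly after it leaves.

```cpp
#include <cassert>
#include <set>

// Hypothetical sketch: an RAII guard that registers an sstable with a
// registry before it is added to the sstable set, and unregisters it only
// after it has been removed from the set. This makes the invariant
// structural: the registration lifetime always covers set membership.
struct compacting_registry {
    std::set<int> registered;  // sstable generation ids, for the sketch
    bool is_registered(int gen) const { return registered.count(gen) > 0; }
};

class registered_sstable {
    compacting_registry& _registry;
    std::set<int>& _sstable_set;
    int _gen;
public:
    registered_sstable(compacting_registry& r, std::set<int>& set, int gen)
            : _registry(r), _sstable_set(set), _gen(gen) {
        _registry.registered.insert(_gen);  // register first...
        _sstable_set.insert(_gen);          // ...then expose via the set
    }
    ~registered_sstable() {
        _sstable_set.erase(_gen);           // remove from the set first...
        _registry.registered.erase(_gen);   // ...then unregister
    }
    registered_sstable(const registered_sstable&) = delete;
};
```

A compaction job scanning the set can then skip any generation the registry marks as part of an in-flight transaction.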
please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` pass a non-empty
`std::function` to this function, so there is no need to check for
emptiness before calling it. so in this change, the check is dropped.
Fixes#14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14725
Since repair is performed on all nodes, each node can just repair the
primary ranges instead of all owned ranges. This avoids repairing
ranges more than once.
Closes#14766
Commit f5e3b8df6d introduced an optimization for
as_mutation_source_excluding_staging() and added a test that
verifies correctness of single key and range reads based
on supplied predicates. This new test aims to improve the
coverage by testing directly both table::as_mutation_source()
and as_mutation_source_excluding_staging(), therefore
guaranteeing that both supply the correct predicate to
sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14763
let's use RAII to remove the object used as a fixture, so we don't
leave objects behind in the bucket used for testing. leftovers could
interfere with other tests which share the same minio server, if a
test fails to do its cleanup because an exception is thrown.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
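The RAII-based teardown in the two commits above can be sketched with a minimal scope guard. This is an assumed helper for illustration (not the actual test code): the cleanup action runs when the scope exits, even if the test body throws.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Minimal scope-guard sketch: run a cleanup action on scope exit, even if
// the guarded body throws, so a failing test cannot leave objects behind
// in shared state (e.g. a bucket on a shared minio server).
class scope_cleanup {
    std::function<void()> _cleanup;
public:
    explicit scope_cleanup(std::function<void()> f) : _cleanup(std::move(f)) {}
    ~scope_cleanup() { if (_cleanup) { _cleanup(); } }
    scope_cleanup(const scope_cleanup&) = delete;
};

// Returns true iff the cleanup ran despite the exception.
inline bool demo_cleanup_on_throw() {
    bool cleaned = false;
    try {
        scope_cleanup guard([&] { cleaned = true; });
        throw std::runtime_error("simulated test failure");
    } catch (const std::runtime_error&) {
        // the guard's destructor already ran during stack unwinding
    }
    return cleaned;
}
```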
Now that there's an always-false local variable, it can also be dropped
and all the associated if-else branches can be simplified
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set in question is always empty, so it can be removed and the only
check for its contents can be constified
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`expression`'s default constructor is dangerous, as it can leak
into computations and generate surprising results. Fix that by
removing the default constructor.
This is made somewhat difficult by the parser generator's reliance
on default construction, and we need to expand our workaround
(`uninitialized<>`) capabilities to do so.
We also remove some incidental uses of default-constructed expressions.
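The idea behind the series can be sketched in a few lines. This is a simplified illustration, not the actual cql3 code: the type's default constructor is deleted so no empty value can leak into computations, while a default-constructible wrapper (standing in for the `uninitialized<>` workaround the parser generator needs) holds the slot until it is assigned.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Illustrative stand-in for cql3's expression: no accidental empty values.
struct expression {
    std::string repr;
    expression() = delete;
    explicit expression(std::string r) : repr(std::move(r)) {}
};

// Default-constructible wrapper a generated parser can use for its value
// slots (simplified sketch of the uninitialized<> idea).
template <typename T>
class uninitialized {
    std::optional<T> _value;
public:
    uninitialized() = default;                   // starts empty, deliberately
    uninitialized& operator=(T v) { _value = std::move(v); return *this; }
    bool has_value() const { return _value.has_value(); }
    T& get() { return *_value; }                 // caller must have set it
};
```

The wrapper makes "not yet parsed" an explicit, queryable state instead of a silently default-constructed expression.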
Closes#14706
* github.com:scylladb/scylladb:
cql3: expr: make expression non-default-constructible
cql3: grammar: don't default-construct expressions
cql3: grammar: improve uninitialized<> flexibility
cql3: grammar: adjust uninitialized<> wrapper
test: expr_test: don't invoke expression's default constructor
cql3: statement_restrictions: explicitly initialize expressions in index match code
cql3: statement_restrictions: explicitly initialize some expression fields
cql3: statement_restrictions: avoid expression's default constructor when classifying restrictions
cql3: expr: prepare_expression: avoid default-constructed expression
cql3: broadcast_tables: prepare new_value without relying on expression default constructor
This template call is only used by system keyspace paxos methods. All
those methods are no longer static and can use the system_keyspace::_qp
reference to the real query processor instead of the global qctx. The
execute_cql_with_timeout() wrapper is moved to system_keyspace to make
it work.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service::paxos_state methods that call those already have system
keyspace reference at hand and can call method on an object
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The paxos_state's .prepare(), .accept(), .learn() and .prune() methods
access system keyspace via its static methods. The only caller of those
(storage_proxy::remote) already has the sharded system k.s. reference
and can pass its .local() one as argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some test cases that use cql_test_env involve paxos state updates. Since
this update now goes via proxy->remote->system_keyspace, those test
cases need cql_test_env to initialize the remote part of the proxy too
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In the lambda passed
to that method, we use both the sharded sstable_directory service
and its local instance.
To make it straightforward that the sharded and local instances are
dependent, we call sharded<replica::database>::invoke_on_all
instead and access the local directory through the sharded one.
Add task manager's task covering resharding compaction.
A struct and some functions are moved from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
This dependency will be needed to make service::paxos_state:: calls, and
all of them are done in storage_proxy::remote() methods only
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
* seastar bac344d58...c0e618bbb (7):
> resource: take kernel min_free_kbytes into account when allocating memory
Fixes#14721
> build: append JOB_POOLS definition instead of setting it
> net: use designated initialization when appropriate
> websocket: print logging message before handle_ping()
> circular_buffer_fixed_capacity_test: enable randomize test to reverse
> prometheus: do not qualify return type with const.
> alien: do not define customized copy ctor
Closes#14755
We don't want to apply the value of the commitlog_sync_period_in_ms
variable to schema commitlog. Schema commitlog runs in batch mode,
so it doesn't need this parameter.
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
this change is a follow-up of 3129ae3c8c.
since in both cases in this change `num_ranges` should always
be greater than zero, there is no need to use `int` for its type,
and the "num_ranges" returned by the CQL query should always be greater
than or equal to zero, so there is no need to check if it is positive.
in this change, we
* change the type of `num_ranges` to `size_t`
* change std::cmp_not_equal() to !=
to avoid the verbose `std::cmp_not_equal()` helper, for better
readability, now that both operands are unsigned.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14754
Regression test for #14257.
It starts two nodes. It introduces a sleep in raft_group_registry::on_alive
(in raft_group_registry.cc) when receiving a gossip notification about HOST_ID
update from the second node. Then it restarts the second node with a different IP.
Due to the sleep, the old notification from the old IP arrives after the second
node has restarted. If the bug is present, this notification overrides the address
map entry and the second read barrier times out, since the first node cannot reach
the second node with the old IP.
Closes#14609.
Closes#14728
this series enables CMake to build submodules. it helps developers to build, for instance, the java tools on demand.
Closes#14751
* github.com:scylladb/scylladb:
build: cmake: build submodules
build: cmake: generate version files with add_custom_command()
the "type_json_test" test was added locally and has not landed
on master, but it somehow spilled into 87170bf07a by accident.
so, let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14749
Now it sits in replica/database.cc, but the latter is already overloaded
with code and worth slimming down, all the more so since the ..._prefix
itself lives in the keys.hh header.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14748
The helper was left from the storage-service shutdown-vs-drain rework
(long ago), now it just occupies space in code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14747
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underlying mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Consequently, the output of this statement has a schema that is different from that of the underlying table. The output schema is derived from the table's schema, as follows:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
- mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
- partition_region (int): represents the enum with the same name.
- the copy of the table's clustering columns
- position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
- metadata (text): the JSON representation of the mutation-fragment's metadata.
- value (text): the JSON representation of the mutation-fragment's value.
Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is forbidden.
More details in the documentation commit (last commit).
Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));
cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;
pk | ck | v
----+----+---
1 | 0 | 0
0 | 0 | 0
0 | 1 | 0
0 | 2 | 0
(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);
pk | mutation_source | partition_region | ck | position_weight | metadata | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
1 | memtable:0 | 0 | | | {"tombstone":{}} | partition start | null
1 | memtable:0 | 2 | 0 | 0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} | clustering row | {"v":"0"}
1 | memtable:0 | 3 | | | null | partition end | null
0 | memtable:0 | 0 | | | {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} | partition start | null
0 | memtable:0 | 2 | 0 | 0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} | clustering row | {"v":"0"}
0 | memtable:0 | 2 | 0 | 1 | {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change | null
0 | memtable:0 | 2 | 1 | 0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} | clustering row | {"v":"0"}
0 | memtable:0 | 2 | 2 | -1 | {"tombstone":{}} | range tombstone change | null
0 | memtable:0 | 2 | 2 | 0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} | clustering row | {"v":"0"}
0 | memtable:0 | 3 | | | null | partition end | null
(10 rows)
```
Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```
Before:
```
median 141596.39 tps ( 62.1 allocs/op, 13.1 tasks/op, 43688 insns/op, 0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op, 13.1 tasks/op, 43692 insns/op, 0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```
Fixes: https://github.com/scylladb/scylladb/issues/11130
Closes#14347
* github.com:scylladb/scylladb:
docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
test/topology_custom: add test_select_from_mutation_fragments.py
test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
test/cql-pytest: add test_select_mutation_fragments.py
test/cql-pytest: move scylla_data_dir fixture to conftest.py
cql3/statements: wire-in mutation_fragments_select_statement
cql3/restrictions/statement_restrictions: fix indentation
cql3/restrictions/statement_restrictions: add check_indexes flag
cql3/statments/select_statement: add mutation_fragments_select_statement
cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
service/pager: allow passing a query functor override
service/storage_proxy: un-embed coordinator_query_options
replica: add mutation_dump
replica: extract query_state into own header
replica/table: add make_nonpopulating_cache_reader()
replica/table: add select_memtables_as_mutation_sources()
tools,mutation: extract the low-level json utilities into mutation/json.hh
tools/json_writer: fold SstableKey() overloads into callers
tools/json_writer: allow writing metadata and value separately
tools/json_writer: split mutation_fragment_json_writer in two classes
tools/json_writer: allow passing custom std::ostream to json_writer
metrics.json was added in d694a42745, and
`configure.py` was updated accordingly. this change mirrors that
update in the CMake build system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14753
Since ec77172b4b ("Merge 'cql3: convert
the SELECT clause evaluation phase to expressions' from Avi Kivity"),
we rewrite non-aggregating selectors to include an aggregation, in order
to have the rest of the code either deal with no aggregation, or
all selectors aggregating, with nothing in between. This is done
by wrapping column selectors with "first" function calls: col ->
first(col).
This broke non-aggregating selectors that included the ttl() or
writetime() pseudo functions. This is because we rewrote them as
writetime(first(col)), and writetime() isn't a function that operates
on any values; it operates on mutations and so must have access to
a column, not an expression.
Fix by detecting this scenario and rewriting the expression as
first(writetime(col)).
Unit and integration tests are added.
Fixes#14715.
Closes#14716
Allows the caller to turn off checking for indexes. Useful if the
restrictions are applied on a pseudo-table, which has no corresponding
table object, and therefore no index manager (or indexes for that
matter).
Not wired in yet. SELECT * FROM MUTATION_FRAGMENTS($table) is a new
select statement sub-type, which allows dumping the underlying mutations
making up the data of a given table. The output of this statement is
mutation-fragments presented as CQL rows. Each row corresponds to a
mutation-fragment. Consequently, the output of this statement has a
schema that is different from that of the underlying table.
Data is always read from the local replica, on which the query is
executed. Migrating queries between coordinators is not allowed.
SELECT * FROM MUTATION_FRAGMENTS($table) is a new select statement
sub-type. More information will be provided in the patch which introduces
it. This patch adds only the Cql.g changes and what is further strictly
necessary.
To allow paging for requests that don't go through storage-proxy
directly. By default, there is no override and the code falls back to
directly invoking storage_proxy::query() as before.
This file contains facilities to dump the underlying mutations contained
in various mutation sources -- like memtable, cache and sstables -- and
return them as query results. This can be used with any table on the
system. The output represents the mutation fragments which make up said
mutations, and it will be generated according to a schema, which is a
transformation of the table's schema.
This file provides a method, which can be used to implement the backend
of a select-statement: it has a similar signature to regular query
methods.
Allowing reading from each individual memtable which contains the given
token, without exposing the memtables themselves to the caller. Exposing
the memtables directly to any code outside of table is undesired because
they are mutable objects.
Soon, we will want to convert mutation fragments into json inside the
scylla codebase, not just in tools. To avoid scylla-core code having to
include tools/ (and link against it), move the low-level json utilities
into mutation/.
The values of cells are potentially very large and thus, when presenting
row content as json in SELECT * FROM MUTATION_FRAGMENTS($table) queries,
we want to separate metadata and cell values into separate columns, so
users can opt out from the potentially big values being included too.
To support this use-case, write(row) and its downstream write methods
get a new `include_value` flag, which defaults to true. When set to
false, cell values will not be included in the json output. At the same
time, new methods are added to convert only cell values of a row to
json.
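The `include_value` split described above can be sketched with a toy row writer. The names and the JSON shape here are invented for illustration (the real writer has its own structure): the flag defaults to true, and turning it off keeps per-cell metadata while omitting the potentially large values.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Toy sketch of a row writer with an include_value flag (invented API):
// metadata is always written; cell values only when include_value is true.
inline std::string write_row(const std::map<std::string, std::string>& cells,
                             bool include_value = true) {
    std::ostringstream out;
    out << "{";
    bool first = true;
    for (const auto& [name, value] : cells) {
        if (!first) { out << ","; }
        first = false;
        out << "\"" << name << "\":{\"is_live\":true";
        if (include_value) {
            out << ",\"value\":\"" << value << "\"";
        }
        out << "}";
    }
    out << "}";
    return out.str();
}
```

In the actual statement, metadata and values land in separate output columns, so a client can simply not select the value column.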
1) mutation_partition_json_writer - containing all the low-level
utilities for converting sub-fragment level mutation components (such
as rows, tombstones, etc.) and their components into json;
2) mutation_fragment_stream_json_writer - containing all the high-level
logic for converting mutation fragment streams to json.
The latter uses the former behind the scenes. The goal is to enable
reuse of the mutation-fragment-to-json conversion, without being forced
to work around differences in how the mutation fragments are represented
in json at the higher level.
this mirrors what we have in the `build.ninja` generated by
`configure.py`. with this change, we can build, for instance,
`dist-tool-tar` from the `build.ninja` generated by CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using execute_process(), let's use add_custom_command()
to generate the SCYLLA-{VERSION,RELEASE,PRODUCT}-FILE, so that we
can let other targets depend on these generated files, and generate
them on demand.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`gossiper::remove_endpoint` performs `on_remove` callbacks for all
endpoint change subscribers. This was done in the background (with a
discarded future) for the following reason:
```
// We can not run on_remove callbacks here because on_remove in
// storage_service might take the gossiper::timer_callback_lock
```
however, `gossiper::timer_callback_lock` no longer exists, it was
removed in 19e8c14.
Today it is safe to perform the `storage_service::on_remove` callback in
the foreground -- it's only taking the token metadata lock, which is
also taken and then released earlier by the same fiber that calls
`remove_endpoint` (i.e. `storage_service::handle_state_normal`).
Furthermore, we want to perform it in the foreground. First, there
already was a comment saying:
```
// do subscribers first so anything in the subscriber that depends on gossiper state won't get confused
```
it's not too precise, but it argues that subscriber callbacks should be
serialized with the rest of `remove_endpoint`, not done concurrently
with it.
Second, we now have a concrete reason to do them in the foreground. In
issue #14646 we observed that the subscriber callbacks are racing with
the bootstrap procedure. Depending on scheduling order, if
`storage_service::on_remove` is called too late, a bootstrapping node
may try to wait for a node that was earlier replaced to become UP, which
is incorrect. By putting the `on_remove` call into the foreground of
`remove_endpoint`, we ensure that a node that was replaced earlier will
not be included in the set of nodes that the bootstrapping node waits
for (because `storage_service::on_remove` will clear it from
`token_metadata` which we use to calculate this set of nodes).
We also get rid of an unnecessary `seastar::async` call.
Fixes#14646
Closes#14741
The feature check in `enable_features_on_startup` loads the list
of features that were enabled previously, goes over every one of them
and checks whether each feature is considered supported and whether
there is a corresponding `gms::feature` object for it (i.e. the feature
is "registered"). The second part of the check is unnecessary
and wrong. A feature can be marked as supported but its `gms::feature`
object not be present anymore: after a feature is supported for long
enough (i.e. we only support upgrades from versions that support the
feature), we can consider such a feature to be deprecated.
When a feature is deprecated, its `gms::feature` object is removed and
the feature is always considered enabled, which allows removing some
legacy code. We still consider this feature to be supported and
advertise it in gossip, for the sake of the old nodes which, even
though they always support the feature, still check whether other
nodes support it.
The problem with the check as it is now is that it disallows moving
features to the disabled list. If one tries to do it, they will find
out that upgrading the node to the new version does not work:
`enable_features_on_startup` will load the feature, notice that it is
not "registered" (there is no `gms::feature` object for it) and fail
to boot.
This commit fixes the problem by modifying `enable_features_on_startup`
not to look at the registered features list at all. In addition to
this, some other small cleanups are performed:
- "LARGE_COLLECTION_DETECTION" is removed from the deprecated features
list. For some reason, it was put there when the feature was being
introduced. It does not break anything because there is
a `gms::feature` object for it, but it's slightly confusing
and therefore is removed.
- The comment in `supported_feature_set` that invites developers to add
features there as they are introduced is removed. It is no longer
necessary to do so because registered features are put there
automatically. Deprecated features should still be put there,
as indicated by another comment.
Fortunately, this issue does not break any upgrades as of now - since
we added enabled cluster feature persisting, no features were
deprecated, and we only add registered features to the persisted feature
list.
An error injection and a regression test are added.
Closes#14701
* github.com:scylladb/scylladb:
topology_custom: add deprecated features test
feature_service: add error injection for deprecated cluster feature
feature_service: move error injection check to helper function
feature_service: handle deprecated features correctly in feature check
Move `merger` to its own header file. Leave the logic of applying
commands to `group0_state_machine`. Remove `group0_state_machine`
dependencies from `merger` to make it an independent module.
Add a test that checks if `group0_state_machine_merger` preserves
timeuuid monotonicity. `last_id()` should be equal to the largest
timeuuid, based on its timestamps.
This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid
order and uuid order. Consequently, the resulting command should have a
more recent timeuuid.
Fixes#14568
Closes#14682
* github.com:scylladb/scylladb:
raft: group0_state_machine_merger: add test for timeuuid ordering
raft: group0_state_machine: extract merger to its own header
Fixes#10447
This issue is expected behavior. However, `abort_requested_exception` is not handled properly.
-- Why this issue appeared
1. The node is drained.
2. `migration_manager::drain` is called and executes `_as.request_abort();`.
3. The coordinator sends read RPCs to the drained replica. On the replica side, `storage_proxy::handle_read` calls `migration_manager::get_schema_for_read`, which is defined like this:
```cpp
future<schema_ptr> migration_manager::get_schema_for_write(/* ... */) {
if (_as.abort_requested()) {
co_return coroutine::exception(std::make_exception_ptr(abort_requested_exception()));
}
/* ... */
```
So, `abort_requested_exception` is thrown.
4. RPC doesn't preserve information about its type, and it is converted to a string containing its error message.
5. It is rethrown as `std::runtime_error` on the coordinator side, and `abstract_resolve_reader::error()` logs information about it. However, we don't want to report `abort_requested_exception` there. This exception should be caught and ignored:
```cpp
void error(/* ... */) {
/* ... */
else if (try_catch<abort_requested_exception>(eptr)) {
// do not report aborts, they are trigerred by shutdown or timeouts
}
/* ... */
```
-- Proposed solution
To fix this issue, we can add `abort_requested_exception` to `replica::exception_variant` and make sure that if it is thrown by `migration_manager::get_schema_for_write`, `storage_proxy::handle_read` correctly encodes it. Thanks to this change, `abstract_read_resolver::error` can correctly handle `abort_requested_exception` thrown on the replica side by not reporting it.
-- Side effect of the proposed solution
If the replica supports this change but the coordinator doesn't, and all nodes support `feature_service::typed_errors_in_read_rpc`, the coordinator will fail to decode `abort_requested_exception`, and it will be decoded as `unknown_exception`. It will still be rethrown as `std::runtime_error`, however the message will change from *abort requested* to *unknown exception*.
-- Another issue
Moreover, `handle_write` reports abort requests too, which floods the logs for the same reason (this time on the replica side). I don't think that is intended, so I've changed it too. This change is in the last commit.
Closes#14681
* github.com:scylladb/scylladb:
service: storage_proxy: do not report abort requests in handle_write
service: storage_proxy: encode abort_requested_exception in handle_read
service: storage_proxy: refactor encode_replica_exception_for_rpc
replica: add abort_requested_exception to exception_variant
This is a translation of Cassandra's CQL unit test source file
BatchTest.java into our cql-pytest framework.
This test file is an old (2014) and small one, with only a few minimal
tests of mostly error paths in batch statements. All tests pass in
both Cassandra and Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14733
test.py with --x-log2-compaction-groups option rotted a little bit.
Some boost tests added later didn't use the correct header which
parses the option or they didn't adjust suite.yaml.
Perhaps it's time to set up a weekly (or bi-weekly) job to verify
there are no regressions with it. It's important as it stresses
the data plane for tablets reusing the existing tests available.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14732
by default, up to 3 temporary directories are kept by pytest.
but we only run a single session for each $TMPDIR. per
our recent observation, it takes a lot more time for jenkins
to scan the tempdir if we use it for scylla's rundir.
so, to alleviate this symptom, we just keep up to one failed
session in the tempdir. if the test passes, the tempdir
created by pytest will be nuked. normally it is located at
scylladb/testlog/${mode}/pytest-of-$(whoami).
see also
https://docs.pytest.org/en/7.3.x/reference/reference.html#confval-tmp_path_retention_policy
Refs #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14735
[xemul: Withdrawing from PR's comments:
object_store is the only test which uses the tmpdir fixture,
starts / stops scylla by itself,
and puts the rundir of scylla in its own tmpdir.
we don't register the step of cleaning up [the temp dir] using the utilities provided by
cql-pytest. we rely on pytest to perform the cleanup, while cql-pytest performs the
cleanup using a global registry.
]
Some minor fixes reported by `mypy`.
Closes#14693
* github.com:scylladb/scylladb:
test/pylib: fix function attribute
test/pylib: check cmd is defined before using it
test/pylib: fix return type hint
test/pylib: remove redundant method
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under the "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files relied on the incorrectly
included header files, those are updated to #include the header
files they directly use.
if a forward declaration suffices, the declaration is added instead.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Closes#14740
* github.com:scylladb/scylladb:
treewide: remove #includes not used directly
size_tiered_backlog_tracker: do not include removed header
Fixes#9405
The `sync_point` API, when provided with an incorrect sync point id, might
allocate a crazy amount of memory and fail with `std::bad_alloc`.
To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing it with
a checksum calculated in `db::hints::decode` before decoding.
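The encode/verify-before-decode scheme can be sketched as follows. This is an illustrative toy only (the real code uses its own serialization format and hash, and the function names here are invented): a checksum of the payload is appended at encode time and verified cheaply before any decoding work begins.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <utility>

// Toy checksum: XOR of all payload bytes (the real code would use a
// proper hash; this only illustrates the append-and-verify shape).
inline uint8_t checksum(const std::string& payload) {
    return std::accumulate(payload.begin(), payload.end(), uint8_t{0},
                           [](uint8_t acc, char c) {
                               return uint8_t(acc ^ uint8_t(c));
                           });
}

inline std::string encode(const std::string& payload) {
    return payload + char(checksum(payload));    // payload + trailing checksum byte
}

// Verifies the checksum before decoding; returns false on mismatch so a
// corrupted sync point id is rejected instead of driving a huge allocation.
inline bool decode(const std::string& encoded, std::string& payload_out) {
    if (encoded.empty()) {
        return false;
    }
    std::string payload = encoded.substr(0, encoded.size() - 1);
    if (uint8_t(encoded.back()) != checksum(payload)) {
        return false;                            // corrupted: refuse to decode
    }
    payload_out = std::move(payload);
    return true;
}
```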
Closes#14534
* github.com:scylladb/scylladb:
db: hints: add checksum to sync point encoding
db: hints: add the version_size constant
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under the "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files relied on the incorrectly
included header files, those are updated to #include the header
files they directly use.
if a forward declaration suffices, the declaration is added instead.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
By making it independent of the number of units the view update
generator's registration semaphore is created with. We want to increase
this number significantly, and that would destabilize this test.
To prevent this, detach the test from the number of units
completely, while still preserving the original intent behind it, as best
as it could be determined.
Closes#14727
in order to identify the problems caused by integer type promotion when comparing unsigned and signed integers, in this series, we
- address the warnings raised by `-Wsign-compare` compiler option
- add `-Wsign-compare` compiler option to the building systems
Closes#14652
* github.com:scylladb/scylladb:
treewide: use unsigned variable to compare with unsigned
treewide: compare signed and unsigned using std::cmp_*()
This series is based on top of the seastar relabel config API.
The series adds a REST API for the configuration, which allows getting and setting it.
The API is registered under the V2 prefix and uses the swagger 2.0 definition.
After this series to get the current relabel-config configuration:
```
curl -X GET --header 'Accept: application/json' 'http://localhost:10000/v2/metrics-config/'
```
A set config example:
```
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '[ \
{ \
"source_labels": [ \
"__name__" \
], \
"action": "replace", \
"target_label": "level", \
"replacement": "1", \
"regex": "io_que.*" \
} \
]' 'http://localhost:10000/v2/metrics-config/'
```
This is how it looks in the UI

Closes#12670
* github.com:scylladb/scylladb:
api: Add the metrics API
api/config: make it optional if the config API is the first to register
api: Add the metrics.json Swagger file
Preparing for V2 API from files
this series addresses the FTBFS of tests with CMake, and also checks for unknown parameters in `add_scylla_test()`
Closes#14650
* github.com:scylladb/scylladb:
build: cmake: build SEASTAR tests as SEASTAR tests
build: cmake: error out if found unknown keywords
build: cmake: link tests against necessary libraries
before this change, we sorted the sstables under the lock, when we
were about to perform the compaction. but the idea of guarding the
getting and registering as a transaction is to prevent other compactions
from mutating the sstables' state and causing inconsistency.
since the state is tracked on a per-sstable basis, and is not related
to the order in which the sstables are processed by a certain compaction
task, we don't need to guard the "sort()" with this mutually exclusive lock.
for better readability, and probably better performance, let's move the
sort out of the lock, and take this opportunity to use
`std::ranges::sort()` for more concise code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14699
sometimes we initialize a loop variable like
auto i = 0;
or
int i = 0;
but since the type of `0` is `int`, what we get is a variable of
`int` type. later we compare it with an unsigned number, and if we
compile the source code with the `-Werror=sign-compare` option, the
compiler warns on seeing this. in general, this is a false
alarm, as we are not likely to get a wrong comparison result
here. but in order to prevent issues due to integer promotion
for comparison in other places, and to prepare for enabling
`-Werror=sign-compare`, let's use unsigned to silence this warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
when comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned type,
so they can be compared. but sometimes this matters: after the
promotion, the comparison yields the wrong result. this can be
demonstrated with a short sample like:
```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    fmt::print("{}\n", x < y);  // prints "false": x is promoted to unsigned
    return 0;
}
```
this error can be identified by `-Werror=sign-compare`, but before
enabling this compiler option, let's use `std::cmp_*()` to compare
them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds a metrics API implementation.
The API supports getting and setting the metrics relabel config.
Seastar supports metrics relabeling at runtime, following the Prometheus
relabel_config.
Based on metric and label names, a user can add or remove labels,
disable a metric and set the skip_when_empty flag.
The metrics-config API allows such configuration to be done using the
RESTful API.
As it's a new API, it is placed under the V2 path.
After this patch the following API will be available
'http://localhost:10000/v2/metrics-config/' GET/POST.
For example:
To get the current config:
```
curl -X GET --header 'Accept: application/json' 'http://localhost:10000/v2/metrics-config/'
```
To set a config:
```
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '[ \
{ \
"source_labels": [ \
"__name__" \
], \
"action": "replace", \
"target_label": "level", \
"replacement": "1", \
"regex": "io_que.*" \
} \
]' 'http://localhost:10000/v2/metrics-config/'
```
Until now, only the configuration API was part of the V2 API.
Now, when other APIs are added, it is possible that another API would be
the first to register. The first API to register is different in the
sense that it does not have a leading ',' prepended to it.
This patch adds an option to mark the config API as the first to register.
This patch changes the base path of the V2 of the API to be '/'. That
means that the v2 prefix will be part of the path definition.
Currently, it only affects the config API that is created from code.
The motivation for the change is for Swagger definitions that are read
from a file. Currently, when using the swagger-ui with a doc path set
to http://localhost:10000/v2 and reading the Swagger from a file,
swagger-ui will concatenate the path and look for
http://localhost:10000/v2/v2/{path}
Instead, the base path is now '/' and the /v2 prefix will be added by
each endpoint definition.
From the user perspective, there is no change in current functionality.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The sync point API, when provided with an incorrect sync point id, might
allocate a crazy amount of memory and fail with std::bad_alloc.
To fix this, we can check whether the encoded sync point has been modified
before decoding. We achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing
it with a checksum calculated in db::hints::decode before decoding.
The next commit changes the format of encoding sync points to V2. The
new format appends the checksum to the encoded sync points and its
implementation uses the checksum_size constant - the number of bytes
required to store the checksum. To increase consistency and readability,
we can additionally add and use the version_size constant.
Definitions of sync_point::decode and sync_point::encode are slightly
changed so that they don't depend on the version_size value and make
implementation of the V2 format easier.
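The append-then-verify idea can be sketched as follows (a hypothetical additive checksum stands in for the real checksum function; only the layout mirrors the described V2 format):

```cpp
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <string>

constexpr size_t checksum_size = 8;  // bytes reserved for the checksum

// Hypothetical stand-in for the real checksum function.
uint64_t simple_checksum(const std::string& data) {
    return std::accumulate(data.begin(), data.end(), uint64_t{0},
        [](uint64_t acc, char c) { return acc * 31 + static_cast<unsigned char>(c); });
}

// Append the checksum of the payload after the encoded bytes.
std::string encode_with_checksum(const std::string& payload) {
    uint64_t sum = simple_checksum(payload);
    std::string out = payload;
    for (size_t i = 0; i < checksum_size; ++i) {
        out.push_back(static_cast<char>((sum >> (8 * i)) & 0xff));
    }
    return out;
}

// Verify the stored checksum before decoding, so a corrupt or truncated
// sync point is rejected instead of driving a huge allocation.
std::string decode_with_checksum(const std::string& encoded) {
    if (encoded.size() < checksum_size) {
        throw std::runtime_error("corrupt sync point: too short");
    }
    std::string payload = encoded.substr(0, encoded.size() - checksum_size);
    uint64_t stored = 0;
    for (size_t i = 0; i < checksum_size; ++i) {
        stored |= uint64_t(static_cast<unsigned char>(encoded[payload.size() + i])) << (8 * i);
    }
    if (stored != simple_checksum(payload)) {
        throw std::runtime_error("corrupt sync point: checksum mismatch");
    }
    return payload;
}
```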
Modify sstable_compaction_test.cc so that it does not depend on
how quickly compaction manager stats are updated after compaction
is triggered.
This is required since in the following changes the context may
switch before the stats are updated.
Compaction task executors which inherit from compaction_task_impl
may stay in memory after the compaction is finished.
Thus, the state switch cannot happen in the destructor.
Switch the state to none in a deferred action in perform_task.
This test checks if `group0_state_machine_merger` preserves timeuuid monotonicity.
`last_id()` should be equal to the largest timeuuid, based on its timestamps.
This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid order and
uuid order. Consequently, the resulting command should have a more recent timeuuid.
Closes#14568
Move `merger` to its own header file. Leave the logic of applying commands to
`group0_state_machine`. Remove `group0_state_machine` dependencies from `merger`
to make it an independent module. Add `static` and `const` keywords to its
method signatures. Change it to a `class`. Add documentation.
With this patch, it is easier to write unit tests for the merger.
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.
Closes#14638
Closes#14723
We don't want to report aborts in storage_proxy::handle_write, because they
can only be triggered by shutdowns and timeouts.
Before this change, such reports flooded logs when a drained node still
received the write RPCs.
storage_proxy::handle_read now makes sure that abort_requested_exception
is encoded in a way that preserves its type information. This allows
the coordinator to properly deserialize and handle it.
Before this change, if a drained replica was still receiving the read
RPCs, it would flood the coordinator's logs with std::runtime_error
reports.
To properly handle abort_requested_exception thrown from
migration_manager::get_schema_for_read in storage_proxy::handle_read (which
we do in the next commit), we have to somehow encode and return it. The
encode_replica_exception_for_rpc function is not suitable for that because
it requires the SourceTuple type (of a value returned by do_query()) which
we don't know when calling get_schema_for_read.
We move the part of encode_replica_exception_for_rpc responsible for
handling exceptions to a new function and rewrite it in a way that doesn't
require the SourceTuple type. As this function fits the name
encode_replica_exception_for_rpc better, we name it this way and rename
the previous encode_replica_exception_for_rpc.
With Raft-topology enabled, test_remove_garbage_group0_members has been
flaky when it should always fail. This has been discussed in #14614.
Disabling Raft-topology in the topology suite is problematic because
the initial cluster size is non-zero, so we have nodes that already use
Raft-topology at the beginning of the test. Therefore, we move
test_topology_remove_garbage_group0.py to the topology_custom suite.
Apart from disabling Raft-topology, we have to start 4 servers instead
of 1 because of the different initial cluster sizes.
Closes#14692
Fixes https://github.com/scylladb/scylladb/issues/13783
This commit documents the nodetool checkAndRepairCdcStreams
operation, which was missing from the docs.
The description is added in a new file and referenced from
the nodetool operations index.
Closes#14700
This is the first phase of providing strong exception safety guarantees by the generic `compaction_backlog_tracker::replace_sstables`.
The goal is for all compaction strategies' backlog trackers' replace_sstables implementations to provide strong exception safety guarantees (i.e. they may throw an exception but must revert on error any intermediate changes they made, restoring the tracker to the pre-update state).
Once this series is merged and the ICS replace_sstables is also made strongly exception safe (using the infrastructure from size_tiered_backlog_tracker introduced here), `compaction_backlog_tracker::replace_sstables` may allow exceptions to propagate back to the caller rather than disabling the backlog tracker on errors.
Closes#14104
* github.com:scylladb/scylladb:
leveled_compaction_backlog_tracker: replace_sstables: provide strong exception safety guarantees
time_window_backlog_tracker: replace_sstables: provide strong exception safety guarantees
size_tiered_backlog_tracker: replace_sstables: provide strong exception safety guarantees
size_tiered_backlog_tracker: provide static calculate_sstables_backlog_contribution
size_tiered_backlog_tracker: make log4 helper static
size_tiered_backlog_tracker: define struct sstables_backlog_contribution
size_tiered_backlog_tracker: update_sstables: update total_bytes only if set changed
compaction_backlog_tracker: replace_sstables: pass old and new sstables vectors by ref
compaction_backlog_tracker: replace_sstables: add FIXME comments about strong exception safety
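The usual way to get the strong guarantee is to do all potentially throwing work on side copies and commit the result with non-throwing operations only; a sketch with hypothetical names (for brevity, an sstable is represented by its size in bytes):

```cpp
#include <set>
#include <utility>

// Stand-in for a backlog tracker: replace_sstables computes the new state
// on local copies (which may throw), then publishes it with noexcept moves,
// so a failure leaves the tracker in its pre-update state.
struct backlog_tracker {
    std::set<long> sstables;  // for brevity, an sstable is just its size
    long total_bytes = 0;

    void replace_sstables(const std::set<long>& old_ssts,
                          const std::set<long>& new_ssts) {
        std::set<long> updated = sstables;  // may throw; tracker untouched
        long updated_bytes = total_bytes;
        for (long s : old_ssts) { updated.erase(s); updated_bytes -= s; }
        for (long s : new_ssts) { updated.insert(s); updated_bytes += s; }
        sstables = std::move(updated);      // noexcept commit
        total_bytes = updated_bytes;
    }
};
```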
add fmt formatter for `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput`.
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput` without the help of `operator<<`.
please note, despite it being more popular to use the IEC prefixes
when presenting the size of storage, i.e., MiB for 1024**2 bytes instead
of MB for 1000**2 bytes, we are still using the SI prefixes as
the default, in order to preserve the existing behavior.
the operator<< for these types are removed.
the tests are updated accordingly.
Refs #13245
Closes#14719
* github.com:scylladb/scylladb:
utils: drop operator<< for pretty printers
utils: add fmt formatter for pretty printers
these comments or docstrings are not in sync with the code they
are supposed to explain, so let's update them accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14545
in this header, none of the exceptions defined by
`exceptions/exceptions.hh` is used. so let's drop the `#include`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14718
`db/query_context.hh` contains the declaration of class
`db::query_context`. but `replica/table.cc` does not use or need
`db::query_context`.
so, in this change, the `#include` is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14717
When running the compaction task test on the same Scylla instance
where other tests are run, some compaction tasks from other test cases may
be left in the task manager. If they stay in memory long enough, they may
get unregistered during the compaction task test and cause a bad_request
status.
Drain old compaction tasks before and after each test.
Fixes: #14584.
Closes#14585
since all callers of these operators have switched to fmt formatters.
let's drop them. the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
add fmt formatter for `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput`.
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput` without the help of `operator<<`.
please note, despite it being more popular to use the IEC prefixes
when presenting the size of storage, i.e., MiB for 1024**2 bytes instead
of MB for 1000**2 bytes, we are still using the SI prefixes as
the default, in order to preserve the existing behavior.
also, we use the singular form of "byte" when formatting "1". this is
more correct.
the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Split long-running database mutation tests.
At a trade-off with verbosity, split these sub-tests for the long running tests `database_with_data_in_sstables_is_a_mutation_source_`*.
Refs #13905
Closes#14455
* github.com:scylladb/scylladb:
test/lib/mutation_source_test: bump ttl
test/boost/memtable_test: split memtable sub-tests
test/boost/database_test: split mutation sub-tests
it turns out we are creating two tables with the same name in
sstable_expired_data_ratio, and when creating the second table,
we don't destroy the first one.
this does not happen in the real world, so we could tolerate it
in a test. but it matters if we're going to have a system-wide per-table
registry which uses the name of a table as the table's identifier in the
registry; for instance, the metrics names for the tables would conflict.
so, in this series, we use different names for the tables under
test. they can share the same set of sstables though. this fulfills
the needs of the test in question. also, we rename some variables
for better readability in this series.
Fixes https://github.com/scylladb/scylladb/issues/14657
Closes#14665
* github.com:scylladb/scylladb:
test: rename variables with better names
test: use different table names in sstable_expired_data_ratio
test: explicitly capture variables
before this change, if the size being formatted was greater than a petabyte,
`exp` would be 6. we would still use it as the index to find the suffix
in `suffixes`, but the array's size is 6, so we would be referencing
random bits after "PB" for the suffix of the formatted size.
in this change:
* loop over the suffixes for better readability, and to avoid
the off-by-one error.
* add tests for both pretty printers
Branches: 5.1,5.2,5.3
Fixes#14702
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14713
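The loop-based approach can be sketched like this (a hypothetical helper, not the actual Scylla code): iterating over the suffixes clamps at the largest unit, so there is no index to run off the end of the array.

```cpp
#include <array>
#include <cstdio>
#include <string>

// Loop over the suffixes instead of computing an index from a logarithm:
// once "PB" is reached, the loop stops, so sizes beyond a petabyte are
// printed as e.g. "1000PB" rather than reading past the array.
std::string pretty_size(double size) {
    static constexpr std::array<const char*, 6> suffixes = {
        "B", "kB", "MB", "GB", "TB", "PB"};
    size_t i = 0;
    while (size >= 1000.0 && i + 1 < suffixes.size()) {
        size /= 1000.0;
        ++i;
    }
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%.0f%s", size, suffixes[i]);
    return buf;
}
```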
chained comparison is not supported by C++, and does not yield the expected
result: "0 <= d" evaluates to a bool (0 or 1), which is always less than
"magic", so the whole expression is always true. let's avoid using it.
```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2908:23: error: result of comparison of constant 54313 with expression of type 'bool' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
2908 | assert(0 <= d < magic);
| ~~~~~~ ^ ~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14695
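The fix is to spell out both comparisons explicitly (hypothetical helper for illustration):

```cpp
// `0 <= d < magic` parses as `(0 <= d) < magic`: the bool (0 or 1) is then
// compared with magic, so the assertion can never fire. Write both
// comparisons explicitly instead.
bool in_range(int d, int magic) {
    return 0 <= d && d < magic;
}
```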
prepare_expression() already validates the types and computes
the index of the field; no need to redo that work when
evaluating the expression.
The tests are adjusted to also prepare the expression.
Closes#14562
Use a large ttl (2h+) to avoid deletions in database_test.
An actual fix would be to make database_test not ignore query_time,
but this is much harder.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Split long-running memtable tests.
At a trade-off with verbosity, split these sub-tests for the long
running tests
test_memtable_with_many_versions_conforms_to_mutation_source*.
Refs #13905
Split long-running database mutation tests.
At a trade-off with verbosity, split these sub-tests for the long
running tests database_with_data_in_sstables_is_a_mutation_source_*.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
we first use `cf` and then `lcs_table` later on in
`sstable_expired_data_ratio` to represent "tables_for_tests"
with schemas of different compaction strategies.
to improve readability, we rename the variables which are
related to STCS (Size-Tiered Compaction Strategy) to "stcs_*",
to better reflect their roles.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it turns out we are creating two tables with the same name in
sstable_expired_data_ratio, and when creating the second table,
we don't destroy the first one.
this does not happen in the real world, so we could tolerate it
in a test. but it matters if we're going to have a system-wide per-table
registry which uses the name of a table as the table's identifier in the
registry; for instance, the metrics names for the tables would conflict.
to avoid creating multiple tables with the same ${ks}.${cf},
after this change, we use different names for the tables under
test; they can share the same set of sstables though. this
fulfills the needs of the test in question, and the needs of
having per-table metrics with table id as their identifiers.
Fixes#14657
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is no obvious default expression, so better not to allow
default construction of expressions to prevent unintended values
from leaking in. Resolves a FIXME.
Use uninitialized<expression> for that. Since it's heavily used,
alias it as "uexpression".
To prevent uninitialized<> from leaking into the rest of the
system, change do_with_parser() to unwrap it. We add an
unwrap_uninitialized_t template type alias for that.
Lots of std::move()s are sprinkled around to make things compile,
as uninitialized<T> refuses to convert to T without them.
Fixes https://github.com/scylladb/scylladb/issues/14490
This commit fixes multiple links that were broken
after the documentation was published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directive for pages and sections, respectively.
Closes#14664
uninitialized<> is used to work around the parser generator's propensity
to default-construct return values by supplying a default constructor
to otherwise non-default-constructible types. Make it easier to initialize
it not only from the wrapped type, but also from types convertible to
the wrapped type.
This is useful to initialize an uninitialized<expression> from an
expression element (say a binary_operator), without an explicit
conversion.
The grammar generator relies on everything having a default
constructor, and to accommodate it we have an uninitialized<>
template that fakes a default constructor where one doesn't
exist. For convenience we have implicit conversion operators
from uninitialized<T> to T. Currently, we have them for both
rvalue-reference and normal reference wrappers.
It turns out that C++ isn't clever enough to deal with both
of them when templates are involved. When it needs a T but
has an uninitialized_wrapper<T>&&, it sees both conversion
operators and can't pick one.
Aid it by removing the non-rvalue conversion operator. The
rvalue conversion operator is more efficient, and is all that
is needed, since we don't use values more than once in the grammar.
Sprinkle std::move()s on the rest of the grammar to keep it
compiling. In a few places the odd "$production" syntax
is changed to the more common "var=production ... { var }".
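The mechanism can be sketched like this (a simplified stand-in for the real wrapper, not Scylla's actual implementation):

```cpp
#include <optional>
#include <string>
#include <utility>

// Simplified sketch of the idea: fake a default constructor for a
// non-default-constructible T, and unwrap only via a single
// rvalue-qualified conversion operator, so overload resolution in
// templated contexts never sees two candidates.
template <typename T>
class uninitialized {
    std::optional<T> _value;
public:
    uninitialized() = default;                        // what the grammar needs
    uninitialized(T v) : _value(std::move(v)) {}
    operator T&&() && { return std::move(*_value); }  // rvalue-only unwrap
};

// Stand-in for a parser rule's return type without a default constructor.
struct expression {
    std::string text;
    explicit expression(std::string t) : text(std::move(t)) {}
};
```

Because the conversion operator is `&&`-qualified, unwrapping requires a `std::move()` at the call site, which matches the std::move()s sprinkled over the grammar.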
Provide a way to fetch all pages for `run_async`.
While there, move the code to a common helper module.
Fixes https://github.com/scylladb/scylladb/issues/14451
Closes#14688
* github.com:scylladb/scylladb:
test/pylib: handle paged results for async queries
test/pylib: move async query wrapper to common module
This is the last step of deprecation dance of DTCS.
In Scylla 5.1, users were warned that DTCS was deprecated.
In 5.2, altering or creation of tables with DTCS was forbidden.
5.3 branch was already created, so this is targeting 5.4.
Users that refused to move away from DTCS will have Scylla
falling back to the default strategy, either STCS or ICS.
See:
WARN 2023-07-14 09:49:11,857 [shard 0] schema_tables - Falling back to size-tiered compaction strategy after the problem: Unable to find compaction strategy class 'DateTieredCompactionStrategy'
Then user can later switch to a supported strategy with
alter table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14559
The index match code has some default-initialized expressions. These won't
compile when we remove expression's default constructor, so replace them
with the current default value, an empty conjunction.
An empty conjunction doesn't make any special sense here; the code
should be refactored not to rely on this random initial value. But this
is delicate code and the refactoring shouldn't be done in the middle of
an unrelated series.
_partition_key_restrictions, _clustering_columns_restrictions, and
_nonprimary_key_restrictions are currently default-initialized. As
we're about to remove expression's default constructor, we need
to initialize them with something.
Use conjunction({}). Not only is this what the default constructor does,
that's what those fields' manipulators assume - they adjust field x
using make_conjunction(y, x). This dates to expression's roots as
a replacement for restrictions.
We have some gnarly code that classifies restrictions by the column
they restrict. This uses std::unordered_map::operator[], which uses
the value's default constructor. This happens to be "expression", and
as we're about to remove the default constructor, this won't do.
Fix by using try_emplace(), which makes the code nicer and more
efficient. It could be further improved, but it's better to demolish it
instead.
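The difference can be sketched as follows (hypothetical names; the real mapped type is cql3's expression):

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for cql3's expression once its default constructor is removed.
struct expr {
    std::string text;
    explicit expr(std::string t) : text(std::move(t)) {}  // no default ctor
};

std::unordered_map<std::string, expr> by_column;

// operator[] would not compile here: it default-constructs the mapped value.
// try_emplace constructs it in place only when the key is absent.
void classify(const std::string& column, const std::string& restriction) {
    auto [it, inserted] = by_column.try_emplace(column, restriction);
    if (!inserted) {
        it->second.text += " AND " + restriction;
    }
}
```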
We're about to remove expression's default constructor, so adjust
the usertype_constructor code that checks whether a field has an
initializer or whether we must supply a NULL to not rely on it.
we intend to print the error message, but failed to pass it to the
formatter. if we actually ran into this case, fmtlib would throw.
so in this change, we also print the error when
announcing a schema change fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14623
A broadcast_table modification query consists of the key, the new value,
and the condition. When preparing it, we construct the query with
a default new_value expression, and pass it to
operation::prepare_for_broadcast_tables() to fill .new_value.
Since we're removing expression's default constructor, this won't work.
So instead, the (renamed)
operation::prepare_new_value_for_broadcast_tables() returns the new value,
and we use the return value to fill the query.
Adds an error injection which allows enabling the TEST_ONLY_FEATURE as
a deprecated feature, i.e. it is assumed to be always enabled, but still
considered to be supported by the node and advertised in gossip.
Also, extract the "features_enable_test_feature" literal into a string
constant. This should slightly improve readability and make it more
consistent with the next commit.
The feature check in `enable_features_on_startup` loads the list
of features that were enabled previously, goes over every one of them
and checks whether each feature is considered supported and whether
there is a corresponding `gms::feature` object for it (i.e. the feature
is "registered"). The second part of the check is unnecessary
and wrong. A feature can be marked as supported but its `gms::feature`
object not be present anymore: after a feature is supported for long
enough (i.e. we only support upgrades from versions that support the
feature), we can consider such a feature to be deprecated.
When a feature is deprecated, its `gms::feature` object is removed and
the feature is always considered enabled, which allows removing some
legacy code. We still consider this feature to be supported and
advertise it in gossip, for the sake of the old nodes which, even
though they always support the feature, still check whether other
nodes support it.
The problem with the check as it is now is that it disallows moving
features to the disabled list. If one tries to do it, they will find
out that upgrading the node to the new version does not work:
`enable_features_on_startup` will load the feature, notice that it is
not "registered" (there is no `gms::feature` object for it) and fail
to boot.
This commit fixes the problem by modifying `enable_features_on_startup`
not to look at the registered features list at all. In addition to
this, some other small cleanups are performed:
- "LARGE_COLLECTION_DETECTION" is removed from the deprecated features
list. For some reason, it was put there when the feature was being
introduced. It does not break anything because there is
a `gms::feature` object for it, but it's slightly confusing
and therefore is removed.
- The comment in `supported_feature_set` that invites developers to add
features there as they are introduced is removed. It is no longer
necessary to do so because registered features are put there
automatically. Deprecated features should still be put there,
as indicated by another comment.
Fortunately, this issue does not break any upgrades as of now - since
we added enabled cluster feature persisting, no features were
deprecated, and we only add registered features to the persisted feature
list.
This option allows the user to change the number of ranges to stream in
a batch per stream plan.
Currently, each stream plan streams 10% of the total ranges.
With more ranges per stream plan, the waiting time between
two stream plans is reduced. For example:
stream_plan1: shard0 (t0), shard1 (t1)
stream_plan2: shard0 (t2), shard1 (t3)
We start stream_plan2 after all shards finish streaming in stream_plan1.
If shard0 and shard1 in stream_plan1 finish at different times, one of
the shards will be idle.
If we stream more ranges in a single stream plan, the waiting time will
be reduced.
Previously, we retried the stream plan if one of the stream plans
failed. That's one of the reasons we wanted more stream plans. With RBNO
and 1f8b529e08 (range_streamer: Disable restream logic), the
restream factor is not important anymore.
Also, more ranges in a single stream plan will create bigger but fewer
sstables on the receiver side.
The default value is the same as before: 10% of the total ranges.
Fixes#14191
Closes#14402
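The arithmetic can be sketched with a hypothetical helper (the real option name and plumbing differ):

```cpp
#include <cstddef>

// With a batch fraction f (the previous hard-coded behavior is f = 0.1,
// i.e. 10% of the total ranges per stream plan), a larger fraction means
// fewer, bigger stream plans and less idle time between them.
size_t ranges_per_stream_plan(size_t total_ranges, double batch_fraction) {
    auto n = static_cast<size_t>(static_cast<double>(total_ranges) * batch_fraction);
    return n > 0 ? n : 1;  // always make progress
}
```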
This is a followup to 1545ae2d3b.
A new reader is introduced that automatically closes the underlying sstable reader once it's exhausted after a fast forward call,
allowing us to revert 1fefe597e6, which was fragile.
Closes#14669
* github.com:scylladb/scylladb:
Revert "sstables: Close SSTable reader if index exhaustion is detected in fast forward call"
sstables: Automatically close exhausted SSTable readers in cleanup
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).
To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
Provide a flag to fetch all pages for run_async().
Add a simple test to random tables. Runs within 6 seconds in debug mode.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This reverts commit d3034e0fab.
The test modified by this commit
(view_build_test.test_view_update_generator_register_semaphore_unit_leak)
often fails, breaking build jobs.
sharded<service>::invoke_on_all() has the ability to call a method by
pointer with automagical unwrapping of sharded references. This makes
the code shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14684
Fixes https://github.com/scylladb/scylladb/issues/14598
This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)
This commit must be backported to branch-5.3, because
the feature is already on that branch.
Closes#14686
also, remove unnecessary forward declarations.
* compaction_manager_test_task_executor is only referenced
in the friend declaration, but this declaration does not need
a forward declaration of the friend class
* compaction_manager_test_task_executor is not used anywhere.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14680
Add `parameters` map to `injection_shared_data`. Now tests can attach
string data to injections that can be read in injected code via
`injection_handler`.
Closes#14521
Closes#14608
* github.com:scylladb/scylladb:
tests: add a `parameters` argument to code that enables injections
api/error_injection: add passing injection's parameters to enable endpoint
tests: utils: error injection: add test for injection's parameters
utils: error injection: add a string-to-string map of injection's parameters
utils: error injection: rename received_messages_counter to injection_shared_data
before this change, we would have reports in Jenkins like:
```
[Info] - 1 out of 3 times failed: failed.
== [File] - test/boost/commitlog_test.cc
== [Line] - 298
[Info] - passed: release=1, dev=1
== [File] - test/boost/commitlog_test.cc
== [Line] - 298
[Info] - failed: debug=1
== [File] - test/boost/commitlog_test.cc
== [Line] - 298
```
the first section is rendered from an `Info` tag,
created by `test.py`. but the trailing "failed" does not
help in this context, as we already understand it's failing.
so, in this change, it is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14546
Add missing validation of the AttributeDefinitions parameter of the
CreateTable operation in Alternator. This validation isn't needed
for correctness or safety - the invalid entries would have been
ignored anyway. But this patch is useful for user-experience - the
user should be notified when the request is malformed instead of
ignoring the error.
The fix itself is simple (a new validate_attribute_definitions()
function, calling it in the right place), but much of the contents
of this patch is a fairly large set of tests covering all the
interesting cases of how AttributeDefinitions can be broken.
Particularly interesting is the case where the same AttributeName
appears more than once, e.g., attempting to give two different types
to the same key attribute - which is not allowed.
One of the new tests remains xfail even after this patch - it checks
the case that a user attempts to add a GSI to an existing table where
another GSI defined the key's type differently. This test can't
succeed until we allow adding GSIs to existing tables (Refs #11567).
Fixes#13870.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14556
The refactoring in c48dcf607a dropped
the noexcept around listener notification. This is probably
unintentional, as the comment which explains why we need to abort was
preserved.
Closes #14573
we print the stream id in the logging messages, but in this case
we forgot to pass `stream` to `log::debug()` even though the
placeholder for `stream` was added. if the underlying fmtlib actually
formatted the arguments with this format string, it would throw.
fortunately, we don't enable debug level logging often, which is
probably why we haven't spotted this issue yet.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14620
the callers of the constructor do not move a variable into this
parameter, and the constructor itself cannot consume it,
as the parameter is a vector while `compaction_sstable_registration`
uses an `unordered_set` for tracking the sstables being compacted.
so, to avoid creating a temporary copy of the vector, let's just
pass it by reference.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14661
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decommission command
failed with the error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got empty nodes list for single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.
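The bug class can be reduced to a few lines (an illustrative toy, not the actual topology code): in C++ a reference cannot be re-seated, so reusing a reference like `eps` as if it were a fresh binding assigns *through* it and clobbers the `_dc_endpoints` entry it was originally bound to:

```cpp
#include <map>
#include <set>
#include <string>

// Toy reduction of the bug: `eps` was bound to a _dc_endpoints entry, then
// reused to manipulate the racks dictionary. The assignment goes through
// the reference and empties dc_endpoints["dc1"].
std::set<int> buggy_unindex() {
    std::map<std::string, std::set<int>> dc_endpoints{{"dc1", {1, 2, 3, 4}}};
    std::map<std::string, std::set<int>> racks{{"rack2", {}}};
    auto& eps = dc_endpoints["dc1"];
    eps = racks["rack2"];            // BUG: clobbers dc_endpoints["dc1"]
    return dc_endpoints["dc1"];
}
```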
With this fix the test passes both with
--consistent-cluster-management option and
without it.
A specific unit test for this problem was added.
Fixes: #14184
Closes #14673
For now, `received_messages_counter` has only data for messaging the injection.
In the future, there will be more data to keep, for example, a string-to-string
map of the injection's parameters.
Rename this class and its attributes.
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes #14676
Closes #14677
This patch adds to docs/alternator/compatibility.md mentions of three
recently-added DynamoDB features (ReturnValuesOnConditionCheckFailure,
DeletionProtectionEnabled and TableClass) which Alternator does not yet
support.
Each of these mentions also links to the github issue we have on each
feature - issues #14481, #14482 and #10431 respectively.
During a review of this patch, the reviewers didn't like that I used
words like "recent" and "new" to describe recently-added DynamoDB
features, and asked that I use specific dates instead. So this is what
I do in this patch for the new features - and I also went back and
fixed a few pre-existing references to "recent" and "new" features,
and added the dates.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14483
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.
Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.
Fix it by using microseconds everywhere.
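The off-by-one can be reproduced with a toy count (illustrative numbers and a hypothetical helper, not the test's actual timestamps): an entry that is older than the cutoff at microsecond resolution may no longer compare as older after both sides are truncated to milliseconds:

```cpp
#include <cstdint>
#include <vector>

// Count entries strictly older than a cutoff, comparing either at full
// microsecond resolution or after truncating both sides to milliseconds.
int count_older(const std::vector<int64_t>& entries_us, int64_t cutoff_us,
                bool truncate_to_ms) {
    int n = 0;
    for (int64_t e : entries_us) {
        int64_t lhs = truncate_to_ms ? e / 1000 : e;
        int64_t rhs = truncate_to_ms ? cutoff_us / 1000 : cutoff_us;
        if (lhs < rhs) {
            ++n;
        }
    }
    return n;
}
```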
Fixes #14653
Closes #14670
instead of accessing the `feature_service`'s member variable, use
the accessor provided by sstable_manager, so we always access
this setting via a single channel. this should help with
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14658
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.
The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.
At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
Closes #14663
* github.com:scylladb/scylladb:
test: test_node_isolation: use `ManagerClient.con_gen` to create CQL connection
test: manager_client: make `con_gen` for `ManagerClient.__init__` nonoptional
Add a reader that will automatically close the underlying sstable
reader if fast forward is called with a range past the range
spanned by the SSTable. This is only to be used in the context
of fast forward calls in cleanup, as combined reader in full
scans can proactively close the readers that returned EOS.
Regular reads that go through the cache enable fast forwarding to
position range, therefore they won't enable the auto-closing reader.
Compactions don't enable any kind of forward, and they won't
have it enabled either.
The overhead is minimal, with cleanup being able to reach the
same 38MB/s as before this patch.
Refs #12998.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This PR fixes or removes broken links reported by an online link checker.
Fixes https://github.com/scylladb/scylladb/issues/14488
Closes #14462
* github.com:scylladb/scylladb:
doc: update the link to ABRT
doc: fix broken links on the Scylla SStable page
When repair writes a sstable to disk, we check if the sstable needs view
update processing. If yes, the sstable will be placed into the staging
dir for processing, with the _registration_sem semaphore to prevent too
many pending unprocessed sstables.
We have seen multiple cases in the field where view update processing is
inefficient and way too slow, which blocks the base table repair from
finishing on time.
This patch increases the registration_queue_size to a bigger number to
mitigate the problem that slow view update processing blocks repair.
It is better to have a consistent base table + inconsistent view table
than inconsistent base table + inconsistent view table.
Currently, sstables in the staging dir are not compacted, so we cannot
increase _registration_sem to too big a number, to avoid accumulating
too many sstables.
The view_build_test.cc is updated to make the test pass.
Closes #14241
In d2a4079bbe, `merger` was modified so that when we merge a command, `last_group0_state_id` is taken to be the maximum of the merged command's state_id and the current `last_group0_state_id`. This is necessary for achieving the same behavior as if the commands were applied individually instead of being merged -- where we take the maximum state ID from `group0_history` table which was applied until now (because the table is sorted using the state IDs and we take the greatest row).
However, a subtle bug was introduced -- the `std::max` function uses the `utils::UUID` standard comparison operator which is unfortunately not the same as timeuuid comparison that Scylla performs when sorting the `group0_history` table. So in rare cases it could return the *smaller* of the two timeuuids w.r.t. the correct timeuuid ordering. This would then lead to commands being applied which should have been turned to no-ops due to the `prev_state_id` check -- and then, for example, permanent schema desync or worse.
Fix it by using the correct comparison method.
Fixes: #14600
Closes #14616
* github.com:scylladb/scylladb:
utils/UUID: reference `timeuuid_tri_compare` in `UUID::operator<=>` comment
group0_state_machine: use correct comparison for timeuuids in `merger`
utils/UUID: introduce `timeuuid_tri_compare` for `const UUID&`
utils/UUID: introduce `timeuuid_tri_compare` for `const int8_t*`
The definitions of virtual tables make up approximately a quarter of the
huge system_keyspace.cc file (almost 4K lines), pulling in a lot of
headers only used by them.
Move them to a separate source file to make system_keyspace.cc easier
for humans and compilers to digest.
This patch also moves the `register_virtual_tables()`,
`install_virtual_readers()` as well as the `virtual_tables` global.
Closes #14308
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.
The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.
At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
because `lw_shared_ptr::operator=(T&&)` was deprecated, we started to
get the following warning:
```
/home/kefu/dev/scylladb/test/boost/statement_restrictions_test.cc:394:41: warning: 'operator=' is deprecated: call make_lw_shared<> and assign the result instead [-Wdeprecated-declarations]
394 | definition.column_specification = std::move(specification);
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:346:7: note: 'operator=' has been explicitly marked deprecated here
346 | [[deprecated("call make_lw_shared<> and assign the result instead")]]
| ^
1 warning generated.
```
so, in this change, we use the recommended way to update a lw_shared_ptr.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14648
`ManagerClient` is given a function that is used to create CQL
connections to the Scylla cluster. For some reason it was typed as
`Optional` even though it was never passed `None`. Fix it.
before this change, the format string contains two placeholders,
but only one extra argument is passed in. if we actually format
this logging message, fmtlib would throw.
after this change, we pass the exception's error message as yet
another argument.
this logging message is printed at the "trace" level, which is probably
why we haven't seen fmtlib throw.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14628
The DynamoDB documentation for the size() function claims that it only
works on paths (attribute names or references), but it actually works on
constants from the query (e.g., ":val") as well.
It turns out that Alternator supports this undocumented case already, but
gets the error path wrong: usually, when size() is calculated on the data,
if the data has the wrong type for size() (e.g., an integer), the condition
simply doesn't match. But if the value comes from the query - it should
generate an error that the query is wrong - ValidationException.
This patch fixes this case, and also adds tests for it that pass on both
DynamoDB and Alternator (after this patch).
Fixes #14592
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14593
The -k|--keyspace argument to the tables command is supposed to filter
tables belonging to a specific keyspace, but it doesn't work. Fix it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14634
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14528
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look for the item after it has already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes #14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14496
Due to the wrong stopping order of compaction services, shutdown needs
to wait until all compactions are complete, which may take really long.
Moreover, the test version of the compaction manager does not abort the
task manager, which is strictly bound to it, but stops its compaction
module. This results in tests waiting for the compaction task manager's
tasks to be unregistered, which never happens.
Stopping and aborting of the compaction manager and the task manager's
compaction module are now performed in the proper order.
Closes #14461
* github.com:scylladb/scylladb:
tasks: test: abort task manager when wrapped_compaction_manager is destructed
compaction: swap compaction manager stopping order
compaction: modify compaction_manager::stop()
The scylla netw command prints only the clients at index [0], but there
are more of them in the messaging service. Print them all
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14633
The command is to print interesting and/or hard-to-get-by-hand info about individual tables
Closes #14635
* github.com:scylladb/scylladb:
test: Add 'scylla table' cmd test
scylla-gdb: Print table phased barriers
scylla-gdb: Add 'table' command
both tagged_integer_test and tablets_test are driven by
"scylla_test_case", and they use seastar threads. let's make them
"SEASTAR" tests, so they can link against the libraries they use.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Performance tests such as `perf-fast-forward` are executed in our CI
environments in two steps (two invocations of the `scylla` process):
first by populating data directories (with `--populate` option), then by
running the actual test.
These tests are using `cql_test_env`, which did not load the previously
saved (in the populate step) Host ID of this node, but generated a new
one randomly instead.
In b39ca97919 we enabled
`consistent_cluster_management` by default. This caused the perf tests
to hang in `setup_group0` at `read_barrier` step. That's because Raft
group 0 was initialized with old configuration -- the one created during
the populate step -- but the Raft server was started with a newly
generated Host ID (which is used as the server's Raft ID), so the server
considered itself as being outside the configuration.
Fix this by reloading the Host ID from disk, simulating more closely the
behavior of main.cc initialization.
Fixes #14599
Closes #14640
Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in its owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317.
To solve both problems, cleanup will now use a multi range reader to guarantee that it only processes the owned data and, as a result, skips unowned data. This results in cleanup scanning an owned range and then fast forwarding to the next one, until it's done with them all. This significantly reduces the amount of data in the index cache, as the index will only be invoked at each range boundary.
Without further ado,
before:
`INFO 2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.`
after:
`INFO 2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.`
Fixes #12998.
Fixes #14317.
Closes #14469
* github.com:scylladb/scylladb:
test: Extend cleanup correctness test to cover more cases
compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
sstables: Close SSTable reader if index exhaustion is detected in fast forward call
sstables: Simplify sstable reader initialization
compaction: Extend make_sstable_reader() interface to work with mutation_source
test: Extend sstable partition skipping test to cover fast forward using token
Today, SSTable cleanup skips to the next partition, one at a time, when it finds
that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing
nodes lose multiple sequential tokens in its owned ranges. Another inefficiency
comes from fetching index pages spanning all unowned tokens, which was described
in #14317.
To solve both problems, cleanup will now use a multi range reader to guarantee
that it only processes the owned data and, as a result, skips unowned data.
This results in cleanup scanning an owned range and then fast forwarding to the
next one, until it's done with them all. This significantly reduces the amount
of data in the index cache, as the index will only be invoked at each range
boundary.
Without further ado,
before:
... 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.
after:
... 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.
Fixes #12998.
Fixes #14317.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
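The difference can be modeled with a toy (hypothetical helpers; tokens as ints, "index seeks" as a rough stand-in for index cache work, not the real reader code): per-partition skipping inspects every partition, while range fast-forwarding consults the index once per owned range:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy model: tokens are ints, owned ranges are inclusive [start, end] pairs,
// both sorted. Returns {owned tokens found, number of index seeks}.
std::pair<int, int> cleanup_with_ranges(
        const std::vector<int>& tokens,
        const std::vector<std::pair<int, int>>& owned) {
    int found = 0, seeks = 0;
    std::size_t pos = 0;
    for (auto [start, end] : owned) {
        ++seeks;  // one index seek to fast forward to the range start
        while (pos < tokens.size() && tokens[pos] < start) ++pos;
        while (pos < tokens.size() && tokens[pos] <= end) { ++found; ++pos; }
    }
    return {found, seeks};
}

std::pair<int, int> cleanup_per_partition(
        const std::vector<int>& tokens,
        const std::vector<std::pair<int, int>>& owned) {
    int found = 0, seeks = 0;
    for (int t : tokens) {
        ++seeks;  // every partition inspected; unowned ones skipped one by one
        for (auto [start, end] : owned) {
            if (t >= start && t <= end) { ++found; break; }
        }
    }
    return {found, seeks};
}
```

Both variants select the same owned tokens, but the number of seeks grows with the number of owned ranges rather than with the number of partitions.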
When wiring multi range reader with cleanup, I found that cleanup
wouldn't be able to release disk space of input SSTables earlier.
The reason is that the multi range reader fast forwards to the next range,
therefore it enables mutation_reader::forwarding, and as a result,
combined reader cannot release readers proactively as it cannot tell
for sure that the underlying reader is exhausted. It may have reached
EOS for the current range, but it may have data for the next one.
The concept of EOS actually only applies to the current range being
read. A reader that returned EOS will actually get out of this
state once the combined reader fast forwards to the next range.
Therefore, only the underlying reader, i.e. the sstable reader,
can for certain know that the data source is completely exhausted,
given that tokens are read in monotonically increasing order.
For reversed reads, that's not true but fast forward to range
is not actually supported yet for it.
Today, the SSTable reader already knows that the underlying SSTable
was exhausted in fast_forward_to(), after it calls index_reader's
advance_to(partition_range), therefore it disables subsequent
reads. We can take a step further and also check that the index
was exhausted, i.e. reached EOF.
So if the index is exhausted, and there's no partition to read
after the fast_forward_to() call, we know that there's nothing
left to do in this reader, and therefore the reader can be
closed proactively, allowing the disk space of SSTable to be
reclaimed if it was already deleted.
We can see that the combined reader, under the multi range reader,
will incrementally find disjoint SSTables exhausted
as it fast forwards to owned ranges:
1:
INFO 2023-07-05 10:51:09,570 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-4525396453480898112, start},{-4525396453480898112, end}]
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db, start == *end, eof ? true
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - closing reader 0x60100029d800 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-3-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false
2:
INFO 2023-07-05 10:51:09,572 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-2253424581619911583, start},{-2253424581619911583, end}]
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db, start == *end, eof ? true
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - closing reader 0x60100029d400 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false
And so on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It's odd that we see things like:
```c++
if (!is_initialized()) {
    return initialize().then([this] {
        if (!is_initialized()) {
```
and
```c++
return ensure_initialized().then([this, &pr] {
    if (!is_initialized()) {
```
One might think initialize() will actually initialize the reader by
setting up the context, and that ensure_initialized() has even stronger
guarantees, meaning that the reader must be initialized by it.
But neither is true.
In the context of a single-partition read, it can happen that
initialize() does not set up the context, meaning is_initialized()
returns false, which is why initialization must be checked even after
we call ensure_initialized().
Let's merge ensure_initialized() and initialize() into a
maybe_initialize() which returns a boolean saying if the reader
is initialized.
It makes the code initializing the reader easier to understand.
This reverts commit 2a58b4a39a, reversing
changes made to dd63169077.
After patch 87c8d63b7a,
table_resharding_compaction_task_impl::run() performs the forbidden
action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard,
which is a data race that can cause a use-after-free, typically manifesting
as allocator corruption.
Note: before the bad patch, this was avoided by copying the _contents_ of the
lw_shared_ptr into a new, local lw_shared_ptr.
Fixes #14475
Fixes #14618
Closes #14641
Fixes #14299
failure_detector can try sending messages to TLS endpoints before start_listen
has been called (why?). We need TLS initialized before this, so do it on
service creation.
Closes #14493
to inspect the sstable generation after uuid-based generation
change. in this change:
* a pretty printer for sstable::generation_type is added
* now that the pretty printer for the generation_type is registered,
  we can just leverage it when printing the sstable name. so,
  instead of checking if the `_generation` member variable contains
  `_value`, we delegate it to `str()`, which is used by
  `str.format()`, as the behavior of `str()` is similar to that of
  the gdb `print` command: it calls `value.format_string()`, which
  in turn calls into `to_string()` if the "value" in question has
  a pretty printer.
after this change, the printer is able to print both the generations
before the uuid change and the ones after the change.
a typical gdb session looks like:
```
(gdb) p generation._value
$5 = f0770b40-1c7c-11ee-b136-bf28f8d18b88
(gdb) p generation
$10 = 3g7g_0bu7_0jpvk2p0mmtlsb8lu0
(gdb) p/x generation._value.least_sig_bits
$7 = 0xb136bf28f8d18b88
(gdb) p/x generation._value.most_sig_bits
$8 = 0xf0770b401c7c11ee
```
if we use `scripts/base36-uuid.py` to encode
the msb and lsb, we'd need to:
```console
scripts/base36-uuid.py -e 0xf0770b401c7c11ee 0xb136bf28f8d18b88
3g7g_0bu7_0jpvk2p0mmtlsb8lu0
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14561
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expires 99% of the data; today
the throughput would be calculated on the 1% written, which
misleads the reader into thinking that the compaction was terribly
slow.
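The arithmetic behind this (illustrative numbers, not from a real run): if a compaction reads 2 GB and expires 99% of it in 10 seconds, input-based throughput is 200 MB/s while output-based throughput reports only 2 MB/s:

```cpp
#include <cstdint>

// Throughput in MB/s over the given duration; `bytes` is either the input
// read by the compaction (the fixed behavior) or the output written (the
// old behavior).
double throughput_mb_per_s(uint64_t bytes, double seconds) {
    return bytes / 1e6 / seconds;
}
```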
Fixes #14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14615
instead of concatenating strings, let's format using the built-in
support of `log::debug()`, for two reasons:
1. better performance. after this change, we don't need to
materialize the concatenated string if the "debug" level logging
is not enabled. seastar::log only formats when a certain log
level is enabled.
2. better readability. with the format string, it is clear what
is the fixed part, and which arguments are to be formatted.
this also helps us to move to compile-time formatting check,
as fmtlib requires the caller to be explicit when it wants
to use runtime format string.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14627
These barriers show if there's any operation in progress (read, write,
flush or stream). These are crucial to know if stopping fails, e.g. see
issue #13100
These barriers are summarized in the 'scylla memory' command, but they are
also good to know on a per-table basis
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a 'scylla tables' command that lists tables on the given/current
shard, but the list cannot show much information. It prints the
table address so it can be explored by hand, but some data is handier
to parse and print with the script
The syntax is
$ scylla table ks.cf
For now just print the schema version. To be extended in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refs: https://github.com/scylladb/scylla-docs/issues/4091
Fixes https://github.com/scylladb/scylla-docs/issues/3419
This PR moves the installation instructions from the [website](https://www.scylladb.com/download/) to the documentation. Key changes:
- The instructions are mostly identical, so they were squeezed into one page with different tabs.
- I've merged the info for Ubuntu and Debian, as well as CentOS and RHEL.
- The page uses variables that should be updated each release (at least for now).
- The Java requirement was updated from Java 8 to Java 11 following [this issue](https://github.com/scylladb/scylla-docs/issues/3419).
- In addition, the title of the Unified Installer page has been updated to communicate better about its contents.
Closes #14504
* github.com:scylladb/scylladb:
doc: update the prerequisites section
doc: improve the title of Unified Installer page
doc: move package install instructions to the docs
* seastar 2b7a341210...bac344d584 (3):
> tls: Export error_category instance used by tls + some common error codes
> reactor: cast enum to int when formatting it
> cooking: bump up zlib to 1.2.13
In d2a4079bbe, `merger` was modified so
that when we merge a command, `last_group0_state_id` is taken to be the
maximum of the merged command's state_id and the current
`last_group0_state_id`. This is necessary for achieving the same
behavior as if the commands were applied individually instead of being
merged -- where we take the maximum state ID from `group0_history` table
which was applied until now (because the table is sorted using the state
IDs and we take the greatest row).
However, a subtle bug was introduced -- the `std::max` function uses the
`utils::UUID` standard comparison operator which is unfortunately not
the same as timeuuid comparison that Scylla performs when sorting the
`group0_history` table. So in rare cases it could return the *smaller*
of the two timeuuids w.r.t. the correct timeuuid ordering. This would
then lead to commands being applied which should have been turned to
no-ops due to the `prev_state_id` check -- and then, for example,
permanent schema desync or worse.
Fix it by using the correct comparison method.
Fixes: #14600
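A reduced illustration of why this matters (hypothetical helpers operating only on the most significant 64 bits; Scylla's real `timeuuid_tri_compare` also handles signedness and the full UUID): in the v1 timeuuid layout, time_low is stored before time_mid and time_hi, so numeric comparison of the raw bits can disagree with timestamp order:

```cpp
#include <cstdint>

// Build the msb of a v1 timeuuid from a 60-bit timestamp:
// msb = time_low(32) | time_mid(16) | version(4, = 1) | time_hi(12).
uint64_t make_v1_msb(uint64_t ts) {
    uint64_t time_low = ts & 0xffffffff;
    uint64_t time_mid = (ts >> 32) & 0xffff;
    uint64_t time_hi  = (ts >> 48) & 0x0fff;
    return (time_low << 32) | (time_mid << 16) | 0x1000 | time_hi;
}

// Recover the timestamp -- the key that the correct (timeuuid) ordering uses.
uint64_t v1_timestamp(uint64_t msb) {
    return ((msb & 0x0fff) << 48) | (((msb >> 16) & 0xffff) << 32) | (msb >> 32);
}
```

With a small timestamp of 2 and a large one of 2^32 + 1, the large timestamp yields the numerically *smaller* msb, which is exactly the kind of disagreement `std::max` over plain UUID comparison runs into.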
The existing `timeuuid_tri_compare` operates on UUIDs serialized in byte
buffers. Introduce a version which operates directly on the
`utils::UUID` type.
To reuse existing comparison code, we serialize to a buffer before
comparing. But we avoid allocations by using `std::array`. Since the
serialized size needs to be known at compile time for `std::array`, mark
`UUID::serialized_size()` as `constexpr`.
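The idea can be sketched as follows (a toy type, not Scylla's actual `utils::UUID`): with a `constexpr` size, serialization can target a stack `std::array` instead of a heap-allocated buffer:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy UUID: two 64-bit halves and a compile-time serialized size.
struct uuid_sketch {
    uint64_t msb, lsb;
    static constexpr std::size_t serialized_size() { return 16; }
};

// Big-endian serialization into a stack array -- no allocation needed,
// because the array size is known at compile time.
std::array<int8_t, uuid_sketch::serialized_size()>
serialize(const uuid_sketch& u) {
    std::array<int8_t, uuid_sketch::serialized_size()> out{};
    for (int i = 0; i < 8; ++i) {
        out[i]     = int8_t(u.msb >> (56 - 8 * i));
        out[8 + i] = int8_t(u.lsb >> (56 - 8 * i));
    }
    return out;
}
```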
`timeuuid_tri_compare` takes `bytes_view` parameters and converts them
to `const int8_t*` before comparing.
Extract the part that operates on `const int8_t*` to separate function
which we will reuse in a later commit.
with tagging ops, we will be able to attach kv pairs to an object.
this will allow us to mark sstable components with taggings, and
filter them based on them.
* test/pylib/minio_server.py: enable anonymous user to perform
more actions. because the tagging related ops are not enabled by
"mc anonymous set public", we have to enable them using "set-json"
subcommand.
* utils/s3/client: add methods to manipulate taggings.
* test/boost/s3_test: add a simple test accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14486
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as a shared_ptr should behave
like a pointer, not like the value it points to. and this could be
error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, and expect that the value pointed to by `p` is cleared
and that all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string at a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so, before deprecating and removing
this operator, let's stop using it.
in this change, we update two call sites of
`lw_shared_ptr::operator=(T&&)`. instead of creating a new
pointee in-place, a new instance of lw_shared_ptr is
created and assigned to the existing shared_ptr.
Closes#14470
* github.com:scylladb/scylladb:
sstables: use try_emplace() when appropriate
replica,sstable: do not assign a value to a shared_ptr
we added pretty_printers.cc back in
83c70ac04f, in which configure.py is
updated. so let's sync the CMake building system accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14442
As the goal is to make compaction filter to the next owned range,
make_sstable_reader() should be extended to create a reader with
parameters forwarded from mutation_source interface, which will
be used when wiring cleanup with multi range reader.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Our usage of inodes is dual:
- the Index.db and Data.db components are pinned in memory as
the files are open
- all other components are read once and never looked at again
As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.
Ref https://github.com/scylladb/scylladb/issues/14506
Closes#14509
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.
Existing intended cases were annotated.
Closes#14607
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.
* seastar 99d28ff057...2b7a341210 (23):
> Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes#8828.
> reactor: use structured binding when appropriate
> Simplify payload length and mask parsing.
> memcached: do not used deprecated API
> build: serialize calls to openssl certificate generation
> reactor: epoll backend: initialize _highres_timer_pending
> shared_ptr: deprecate lw_shared_ptr operator=(T&&)
> tests: fail spawn_test if output is empty
> Support specifying the "build root" in configure
> Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
> build: correct the syntax error in comment
> util: print_safe: fix hex print functions
> Add code examples for handling exceptions
> smp: warn if --memory parameter is not supported
> Merge 'gate: track holders' from Benny Halevy
> file: call lambda with std::invoke()
> deleter: Delete move and copy constructors
> file: fix the indent
> file: call close() without the syscall thread
> reactor: use s/::free()/::io_uring_free_probe()/
> Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
> reactor: Don't re-evaliate local reactor for thread_pool
> Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
Closes#14602
The script gets the build id on its own by eu-unstrip-ing the core file
and searching for the necessary value in the output. This can be a
somewhat lengthy operation, especially on huge core files. Sometimes
(e.g. in tests) the build id is known and can be just provided as an
argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14574
Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.
Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).
Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
replaced or changed their IPs. These entries are still using status
`NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
`handle_state_normal` is also called for it). The
`_gossiper.wait_alive` call would wait for those entries too and
eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
may not be populated yet, which is required to do the IP<->Host ID
translations -- and populating `token_metadata` happens inside
`handle_state_normal`, so we have a chicken-and-egg problem here.
It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs. `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as parameter and is modified appropriately, it's now calculating
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).
Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.
This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.
We need a better way to calculate `sync_nodes`, one which detects and
ignores obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.
In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.
We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.
We also introduce regression tests with necessary extensions
to the test framework.
Fixes#14468
Fixes#14487
Closes#14507
* github.com:scylladb/scylladb:
test: rename `test_topology_ip.py` to `test_replace.py`
test: test bootstrap after IP change
test: scylla_cluster: return the new IP from `change_ip` API
test: node replace with `ignore_dead_nodes` test
test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
storage_service: remove `get_nodes_to_sync_with`
storage_service: use `token_metadata` to calculate nodes waited for to be UP
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.
so, let's make the percentage a float before formatting it.
Fixes#14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14588
Michał Chojnowski noted that this is not true. -O0 almost doubles
the run time of `./test.py --mode=debug`. but it does not fail
any of the tests.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14456
Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair, we iterate over all (shard, table, range) tuples, read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for a small num_tokens (16 in the tests), the number of affected ranges (those we need to consider) amounts to the total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row-level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard; there are `parallel_for_each` calls in the code, but they are throttled by the chosen memory constraints and in fact execute sequentially.
When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If the sharding parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard; this is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contains data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly because of `smp::submit_to`.
In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. We do this if the estimated number of partitions is small (<=100).
This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`:
```
with this patch
20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2...
20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859
without this patch
21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2...
21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140
```
Fixes: #14093
Closes#14178
* github.com:scylladb/scylladb:
repair_test: add test_reader_with_different_strategies
repair: extract repair_reader declaration into reader.hh
repair_meta: get_estimated_partitions fix
repair_meta: use multishard_filter reader if the number of partitions is small
repair_meta: delay _repair_reader creation
database.hh: make_multishard_streaming_reader with range parameter
database.cc: extract streaming_reader_lifecycle_policy
In get_sstables_for_key in api/column_family.cc a set of lw_shared_ptrs
to sstables is passed to the reducer of map_reduce0. The reducer then
accesses these shared pointers. As the reducer is invoked on the same
shard map_reduce0 is called on, we have an illegal access to a shared
pointer on a non-owner cpu.
The set of shared pointers to sstables is now transformed in the map
function, which is guaranteed to be invoked on a shard associated with
the service.
Fixes: #14515.
Closes#14532
fmtlib uses `{}` as the placeholder for the formatted argument, not
`{}}`.
so let's correct it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14586
when formatting the error message for `api_error::validation`, we
always include the caller in the error message, but in this case,
forgot to pass the `caller` to `seastar::format()`. if fmtlib
actually formats them, it would throw.
so let's pass `caller` to `seastar::format()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14589
When the task manager is not aborted, the tasks are stored in memory,
not allowing the tasks' gate to be closed.
When wrapped_compaction_manager is destructed, the task manager gets
aborted, so that the system can shut down.
task_manager::module::stop() waits till all compactions are complete.
Thus, ongoing compactions should be aborted before stop() is called
not to prolong shutdown process.
Task manager's compaction module is stopped after
compaction_manager::do_stop(), which aborts ongoing compactions,
is called.
The test has about 1/2500000 chance to fail due to a conflict of random
values. And it recently did, just to spite us.
Fight back.
Fixes#14563
Closes#14576
before this change, we formatted an sstring with `{:d}`; fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.
so, in this change, we just print the int8_t using `seastar::format()`
in a single pass, with the format specifier `#02x` instead of
adding the "0x" prefix manually.
Fixes#14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14578
fmtlib allows us to specify the field width dynamically, so specifying
the field width in the same statement that formats the argument improves
readability. and using a constexpr format string allows us to switch
to the compile-time formatting supported by fmtlib v8.
this change also uses `fmt::print()` to format the argument right into
the output ostream, instead of creating a temporary sstring and
copying it to the output ostream.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14579
Fixes https://github.com/scylladb/scylladb/issues/14565
This commit improves the description of ScyllaDB configuration
via User Data on AWS.
- The info about experimental features and developer mode is removed.
- The description of User Data is fixed.
- The example in User Data is updated.
- The broken link is fixed.
Closes#14569
The CDC generation data can be large and not fit in a single command.
This PR splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.
Commands are sent sequentially to avoid concurrency problems.
Topology snapshots contain only the mutation of the current CDC
generation data, but don't contain any previous or future generations.
If a new generation of data is being broadcast but hasn't been entirely
applied yet, the applied part won't be sent in a snapshot. New or
delayed nodes can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.
A mechanism to remove old CDC generations will be introduced as a
follow-up.
Closes#13962
* github.com:scylladb/scylladb:
test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
raft: add release_guard
raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
raft topology: make `mutation_size_threshold` depends on `max_command_size`
raft: reduce max batch size of raft commands and raft entries
raft: add description argument to add_entry_unguarded
raft: introduce `write_mutations` command
raft: refactor `topology_change` applying
Avoid pinging self in direct failure detector; this adds confusing noise and constant overhead.
Fixes#14388
Closes#14558
* github.com:scylladb/scylladb:
direct_fd: do not ping self
raft: initialize raft_group_registry with host id early
raft: code cleanup
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size`
is limited to less than 1 MB. It adds an injection which copies the mutations
generated by `get_cdc_generation_mutations` n times, where n is picked such
that the memory size of all mutations exceeds `max_command_size`.
This test passes if the cdc generation data is committed by raft in multiple
commands. If all the data is committed in a single command, the leader node
will loop trying to send the raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
When the topology_coordinator fiber gets `raft::commit_status_unknown`, it
prints an error. This exception is not an error in this case, and it can be
thrown when the leader has changed. It can happen in `add_entry_unguarded`
while sending a part of the CDC generation data in the `write_mutations` command.
Catch this exception in `topology_coordinator::run` and print a warning instead.
Broadcasts all mutations returned from `prepare_new_cdc_generation_data`
except the last one. Each mutation is sent in a separate raft command. It takes
`group0_guard`, and if the number of mutations is greater than one, the guard
is dropped, and a new one is created and returned, otherwise the old one will
be returned. Commands are sent in parallel and unguarded (the guard used for
sending the last mutation will guarantee that the term hasn't been changed).
Returns the generation's UUID, guard and last mutation, which will be sent
with additional topology data by the caller.
If we sent the last mutation in a `write_mutations` command too, we would use a
total of `n + 1` commands instead of `(n - 1) + 1` (where `n` is the number of
mutations), so it's better to send it in `topology_change` (we need to send
it after all `write_mutations`) together with some small metadata.
With the default commitlog segment size, `mutation_size_threshold` will be 4 MB.
In large clusters (e.g. 100 nodes, 64 shards per node, 256 vnodes), CDC generation
data can reach a size of 30 MB, thus there will be no more than 8 commands.
In a multi-DC cluster with 100ms latencies between DCs, this operation should
take about 200ms since we send the commands concurrently, but even if the commands
were replicated sequentially by Raft, it should take no more than 1.6s, which is
incomparably smaller than the bootstrapping operation (bootstrapping is quick if
there is no data in the cluster, but usually if one has 100 nodes they have tons
of data, so streaming/repair will indeed take much longer (hours/days)).
Fixes FIXME in pr #13683.
If `group0_state_machine` applies all commands individually (without batching),
the resulting current `state_id` -- which will be compared with the
`prev_state_id` of the next command if it is a guarded command -- equals the
maximum of the `next_state_id` of all commands applied up to this point.
That's because the current `state_id` is obtained from the history table by
taking the row with the largest clustering key.
When `group0_state_machine::apply` is called with a batch of commands, the
current `state_id` is loaded from `system.group0_history` to `merger::last_group0_state_id`
only once. When a command is merged, its `next_state_id` overwrites
`last_group0_state_id`, regardless of their order.
Let's consider the following situation:
The leader sends two unguarded `write_mutations` commands concurrently, with
timeuuids T1 and T2, where T1 < T2. The leader waits for them to be applied and
sends a guarded `topology_change` with `prev_state_id` equal to T2.
Suppose that the command with timeuuid T2 is committed first, and these commands
are small enough that all of the `write_mutations` could be merged into one command.
A follower can get all three of these commands before its `fsm` polls them.
In this situation, `group0_state_machine::apply` is called with all three of
them and `merger` will merge both `write_mutations` into one command. After that,
`merger::last_group0_state_id` will be equal to T1 (this command was committed
as the second one). When it processes the `topology_change` command, it will
compare its `prev_state_id` and `merger::last_group0_state_id`, resulting in
making this command a no-op (which wouldn't happen if the commands were applied
individually).
Such a scenario leads to inconsistency: one replica applies `topology_change`,
while another makes it a no-op.
Topology snapshots contain only the mutation of the current CDC generation data
but don't contain any previous or future generations. If a new generation of data
is being broadcast but hasn't been entirely applied yet, the applied part won't be
sent in a snapshot. In this scenario, new or delayed nodes can never get the applied part.
Send entire cdc_generations_v3 table in the snapshot to resolve this problem.
As a follow-up, a mechanism to remove old CDC generations will be introduced.
`get_cdc_generation_mutations` splits data into mutations of maximal size
`mutation_size_threshold`. Before this commit it was hardcoded to 2 MB.
Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information, it
can exceed this threshold, resulting in an exception being thrown when
writing the mutation to the commitlog.
A batch of raft commands has a size of at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It doesn't have additional metadata, but
it may have a size of exactly `max_mutation_size`. It shouldn't cause any
trouble, but it is preferable to be careful.
Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.
Fixed typo "1204" => "1024".
Provide useful description for `write_mutations` and
`broadcast_tables_query` that is stored in `system.group0_history`.
Reduces scope of issue #13370.
Fixes https://github.com/scylladb/scylladb/issues/13877
This commit adds the information about Rust CDC Connector
to the documentation. All relevant pages are updated:
the ScyllaDB Rust Driver page, and other places in
the docs where Java and Go CDC connectors are mentioned.
In addition, the drivers table is updated to indicate
Rust driver support for CDC.
Closes#14530
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory, so it won't be empty and
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:
```
WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```
It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.
And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.
To confirm that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.
Fixes#13538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14557
DEFAULT_MIN_SSTABLE_SIZE is defined as `50L * 1024L * 1024L`
which is 50 MB, not 50 bytes.
Fixes#14413
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14414
We have plenty of code marked with #if 0. Once it was an indication
of missing functionality, but the code has evolved so much it's
useless as an indication and only a distraction.
Delete it.
Closes#14511
Fixes https://github.com/scylladb/scylladb/issues/14459
This PR removes the (dead) link to the unirestore tool in a private repository. In addition, it adds minor language improvements.
Closes#14519
* github.com:scylladb/scylladb:
doc: minor language improvements on the Migration Tools page
doc: remove the link to the private repository
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific endpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.
This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:
1. Empty string (the default) leaves the current behavior -
DescribeEndpoints echoes the request's "Host" header.
2. The string "disabled" disables the DescribeEndpoints (it will return
an UnknownOperationException). This is how DynamoDB Local behaves,
and the AWS C++ SDK and the Dynamodb Shell work well in this mode.
3. Any other string is a fixed string to be returned by DescribeEndpoints.
It can be useful in setups that should return a known address.
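Assuming standard scylla.yaml syntax, the three modes could be selected like this (the option name comes from the patch; the fixed address is a hypothetical value):

```yaml
# scylla.yaml -- pick one of the three modes described above:
# alternator_describe_endpoints: ""          # default: echo the request's "Host" header
# alternator_describe_endpoints: disabled    # return UnknownOperationException, like DynamoDB Local
alternator_describe_endpoints: "alternator.example.com:8000"  # fixed address (hypothetical value)
```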
Note that this patch does not, by default, change the current behavior
of DescribeEndpoints. But it lets us override its behavior in the future,
if a user experiences problems in the field - without code changes.
Fixes#14410.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14432
Earlier, when local query processor wasn't available at
the beginning of system start, we couldn't query our own
host id when initializing the raft group registry. The local
host id is needed by the registry since it is responsible
for routing RPC messages to specific raft groups, and needs
to reject messages destined for a different host.
Now that the host id is known early at boot, remove the optional
and pass host id in the constructor. Resolves an earlier fixme.
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.
This PR adds the `inject_with_handler` method that executes the injected function and passes an `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.
Closes#14357.
Closes#14460
* github.com:scylladb/scylladb:
tests: introduce InjectionHandler class for communicating with injected code
api/error_injection: add message_injection endpoint
tests: utils: error injections: add test for inject_with_handler
utils: error injection: add inject_with_handler for interactions with injected code
utils: error injection: create structure for error injections data
Currently, it is hard for injected code to wait for some events, for example,
requests on some REST endpoint.
This commit adds the `inject_with_handler` method that executes the injected
function and passes an `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers
associated with the injection on all shards about the received message.
There is a counter of received messages, `received_messages_counter`; it is shared
between the injection_data, which is created once when enabling an injection on
a given shard, and all `injection_handler`s, which are created separately for each
firing of this injection. The counter is incremented when a message is received from
the REST endpoint, and the condition variable is signaled.
Each `injection_handler` (separate for each firing) stores its own private counter,
`_read_messages_counter`. That private counter is incremented whenever we wait for a
message, and compared to the received counter. We sleep on the condition variable
if not enough messages were received.
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
Regression test for #14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.
In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.
In #14468 and #14487 we observed that this set may contain entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.
We need a better way to calculate `sync_nodes`, one which detects and
ignores obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.
In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.
We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.
Fixes#14468
Fixes#14487
Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.
Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).
Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
replaced or changed their IPs. These entries are still using status
`NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
`handle_state_normal` is also called for it). The
`_gossiper.wait_alive` call would wait for those entries too and
eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
may not be populated yet, which is required to do the IP<->Host ID
translations -- and populating `token_metadata` happens inside
`handle_state_normal`, so we have a chicken-and-egg problem here.
The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as a parameter and is modified appropriately; it now recalculates
the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).
Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.
This addresses the immediate failure described in #14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
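The recompute-on-each-retry wait loop described above can be sketched as follows. This is an illustrative Python model, not the real C++ code; the names `normal_or_shutdown_endpoints` and `is_normal_state_handled` are hypothetical stand-ins for the gossiper's state:

```python
import time

def wait_for_normal_state_handled(gossiper, deadline, poll=0.1):
    """Wait until handle_state_normal has finished for every NORMAL/shutdown
    entry currently in gossiper state.  The set is recomputed on every
    iteration, so entries garbage-collected while we wait stop blocking us."""
    while True:
        # Recompute the set each time: it may shrink while we wait.
        pending = {ep for ep in gossiper.normal_or_shutdown_endpoints()
                   if not gossiper.is_normal_state_handled(ep)}
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"normal state not handled for: {pending}")
        time.sleep(poll)
```

The key property is that no fixed `sync_nodes` set is computed up front, so obsolete entries that disappear from gossiper state mid-wait cannot cause a timeout.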
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
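The close/reopen behaviour can be modelled with a small sketch. This is a simplified Python illustration (fragments are plain tuples, positions are integers), not the actual mutation_fragment_v2 machinery:

```python
def split_with_rt_fixup(fragments, max_per_chunk):
    """Split a flat fragment stream into chunks of roughly max_per_chunk
    fragments.  If a flush happens while a range tombstone is open, close
    it at the last seen position in the current chunk and reopen it at the
    same position in the next chunk, so every chunk is self-contained."""
    chunks, cur, open_rt = [], [], None

    def flush():
        nonlocal cur
        if open_rt is not None:
            pos, ts = open_rt
            cur.append(("rt_close", pos, ts))   # close at last seen position
        chunks.append(cur)
        cur = []
        if open_rt is not None:
            cur.append(("rt_open", pos, ts))    # reopen in the next chunk

    for frag in fragments:
        kind = frag[0]
        if kind == "rt_open":
            open_rt = (frag[1], frag[2])
        elif kind == "rt_close":
            open_rt = None
        elif kind == "row" and open_rt is not None:
            open_rt = (frag[1], open_rt[1])     # advance last seen position
        cur.append(frag)
        if len(cur) >= max_per_chunk:
            flush()
    if cur:
        chunks.append(cur)
    return chunks
```

Each emitted chunk then has balanced tombstone bounds, which is the invariant the bug was violating.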
Fixes https://github.com/scylladb/scylladb/issues/14503
Closes #14502
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
this change has no impact on `build.ninja` generated by `configure.py`,
as we are using a `set` for tracking the tests to be built. but it's
still an improvement, as we should not add duplicate entries to a set
when initializing it.
there were two occurrences of `test/boost/double_decker_test`; the one
grouped with the local cluster of collection tests - bptree,
btree, radix_tree and double_decker - is preserved.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14478
in this series, test/object_storage is restructured into a pytest based test. this paves the road to a test suite covering more use cases, so we can add more lower-level tests for the tiered/caching-store.
Closes #14165
* github.com:scylladb/scylladb:
s3/test: do not return ip in managed_cluster()
s3/test: verify the behavior with asserts
s3/test: restructure object_store/run into a pytest
s3/test: extract get_scylla_with_s3_cmd() out
s3/test: s/restart_with_dir/kill_with_dir/
s3/test: vendor run_with_dir() and friends
s3/test: remove get_tempdir()
s3/test: extract managed_cluster() out
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls which are made
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well and also move the retry loop up. This patch
does it by introducing a new function, take_guard(), to the cql_statement class.
Schema-altering statements return a group0 guard while others do not
return any guard. The query processor takes this guard at the beginning of
statement execution and retries if service::group0_concurrent_modification
is thrown. The guard is passed to execute() in the query_state structure.
Fixes: #13942
Message-Id: <ZJ2aeNIBQCtnTaE2@scylladb.com>
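The control flow above can be sketched as a small model. This is an illustrative Python sketch of the retry loop, not the actual query processor; `take_guard`, `check_access`, `validate` and `execute` mirror the names in the commit message, and the retry bound is an assumption:

```python
class Group0ConcurrentModification(Exception):
    pass

def process_ddl(statement, group0_client, max_retries=10):
    """Retry loop covering check_access()/validate()/execute() under one
    guard, so the schema state the statement saw cannot change mid-flight.
    On a concurrent modification the guard is retaken and everything reruns."""
    for _ in range(max_retries):
        guard = statement.take_guard(group0_client)   # None for non-DDL statements
        try:
            statement.check_access()
            statement.validate()
            return statement.execute(guard)           # guard travels in the query state
        except Group0ConcurrentModification:
            continue  # someone else committed first: retake guard and retry
    raise Group0ConcurrentModification("too many retries")
```

The point is that the whole check/validate/execute sequence sits inside the retry loop, not just execute().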
there is a chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.
in this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out. it also allows us to
ignore the failure or specify the timeout. this should ensure the
minio server is ready before tests start to connect to it.
also, in this change, instead of hardwiring the alias of "local" in the code,
define a variable for it. less repeating this way.
Fixes https://github.com/scylladb/scylladb/issues/1719
Closes #14517
* github.com:scylladb/scylladb:
test/pylib: do not hardwire alias to "local"
test/pylib: retry if minio_server is not ready
let's just use cluster.contact_points for retrieving the IP address
of the scylla node in this single-node cluster. so the name of
managed_cluster() is less weird.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using a single run to perform the test, restructure
it into a pytest based test suite with a single test case.
this should allow us to add more tests exercising the object-storage
and cached/tiered storage in the future.
* add fixtures so they can be reused by tests
* use tmpdir fixture for managing the tmpdir, see
https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture
* perform part of the teardown in the "test_tempdir()" fixture
* change the type of test from "Run" to "Python"
* rename "run" to "test_basic.py"
* optionally start the minio server if the settings are not
found in command line or env variables, so that the tests are
self-contained without the fixture setup by test.py.
* instead of sys.exit(), use assert statement, as this is
what pytest uses.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* define a dedicated S3_server class which duck types MinioServer.
it will be used to represent S3 server in place of MinioServer if
S3 is used for testing
* prepare object_storage.yaml in get_scylla_with_s3(), so it is more
clear that we are using the same set of settings for launching
scylla
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
replace the restart_with_dir() with kill_with_dir(), so
that we can simplify the usage of managed_cluster() by enabling it
to start and stop the single-node cluster. with this change, the caller
does not need to run the scylla and pass its pid to this function
any more.
since the restart_with_dir() call is superseded by managed_cluster(),
which tears down the cluster, teardown() is now only responsible for
printing out the log file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to match with another call of managed_cluster(), so it's clear that
we are just reusing test_tempdir.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
for setting up the cluster and tearing it down.
this helps to indent the code so that the lifecycle of the cluster
is visually explicit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
there is a chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.
in this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out. it also allows us to
ignore the failure or specify the timeout. this should ensure the
minio server is ready before tests start to connect to it.
Fixes #1719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
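A minimal sketch of such a retry wrapper, assuming a simple retry-until-deadline policy (the `binary` parameter is added here purely for illustration; the real helper invokes `mc` directly):

```python
import subprocess
import time

def mc(args, timeout=30, ignore_failure=False, binary="mc"):
    """Run a minio client command, retrying until it succeeds or the
    timeout expires.  Used to wait for the server to become ready
    before tests connect to it."""
    deadline = time.monotonic() + timeout
    while True:
        res = subprocess.run([binary, *args], capture_output=True)
        if res.returncode == 0:
            return res
        if time.monotonic() > deadline:
            if ignore_failure:
                return res          # caller opted to tolerate the failure
            raise TimeoutError(f"mc {args} failed: {res.stderr!r}")
        time.sleep(0.1)             # server may still be starting up
```

Calling this once with a cheap command (e.g. listing the alias) before the tests run is enough to block until the server answers.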
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.
A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.
Fixes#14257
Refs #14357
Closes #14329
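The restart-counter check can be modelled with a small sketch. This is an illustrative Python model of the rule, not the real raft address map; the map stores `(ip, generation)` per host ID, where the generation plays the role of the restart counter:

```python
def update_address_map(amap, host_id, ip, generation):
    """Only accept an IP for a host if it comes from a generation (restart
    counter) at least as new as the one already recorded; a delayed gossip
    message from a previous incarnation of the node is dropped."""
    cur = amap.get(host_id)
    if cur is not None and generation < cur[1]:
        return False                 # stale message from an old incarnation
    amap[host_id] = (ip, generation)
    return True
```

With this guard, an out-of-order message carrying the old IP can no longer overwrite the new one.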
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in an improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes #14503
This command is used to send mutations over raft.
In later commits if `topology_change` doesn't fit the max command size,
it will be split into smaller mutations and sent over multiple raft
commands.
Split up the `topology_change` command's logic to apply mutations and reload
the topology state in separate functions.
This aims to extract the logic of applying mutations to use it in future raft commands.
This commit moves the installation instructions with Linux
packages from the website to the docs.
The scope:
- Added the install-on-linux.rst file that has information
about all supported Linux platforms. The replace variables
in the file must be updated per release.
- Updated the index page to include the new file.
Refs: scylladb/scylla-docs#4091
this would allow developers to run a minio server for testing, for instance, s3_test.
Closes #14485
* github.com:scylladb/scylladb:
test/pylib: chmod +x minio_server.py
test/pylib: allow run minio_server.py as a stand-alone tool
We replace is_local_reader bool_class with the
read_strategy enum, since now we have three options.
We choose our new multishard_streaming_reader if
the number of partitions is less than the
number of master subranges.
In later commits we will need the estimated number
of partitions in _repair_reader creation,
so in this commit we delay it until the
reader is first used in read_rows_from_disk
function. read_rows_from_disk is used
in get_sync_boundary, which is called
by master after set_estimated_partitions.
We add an overload of make_multishard_streaming_reader
which reads all the data in the given range. We will use it later
in row level repair if --smp is different on the
nodes and the number of partitions is small.
We are going to use it later in a new
make_multishard_streaming_reader overload.
In this commit we just move it outside
into the anonymous namespace, no other code changes
were made.
so we don't have to search in the unordered_map twice. and it's
more readable, as we don't need to compare an iterator with the
sentinel.
also, take the opportunity to simplify the code by using the
temporary `s3_cfg` when possible instead of `it->second.cfg`
which is less readable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as the behavior of a shared_ptr
should look like that of a pointer instead of the value pointed to by
it. and this could be error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, and expect that the value pointed to by `p` is cleared
and all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so, before deprecating and removing
this operator, let's stop using it.
in this change, we update two call sites of
`lw_shared_ptr::operator=(T&&)`. instead of creating a new
pointee of the pointer in-place, a new instance of lw_shared_ptr is
created, and is assigned to the existing shared_ptr.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
add a shebang line. so we can just launch
a minio_server using
```console
test/pylib/minio_server.py --host 127.0.0.1
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this would allow developers to run a minio server for testing, for
instance, s3_test, using something like:
```console
$ python3 test/pylib/minio_server.py --host 127.0.0.1
tempdir='/tmp/tmpfoobar-minio'
export S3_SERVER_ADDRESS_FOR_TEST=127.0.0.1
export S3_SERVER_PORT_FOR_TEST=900
export S3_PUBLIC_BUCKET_FOR_TEST=testbucket
```
and the developer is supposed to copy-and-paste the `export` commands
to prepare the environment variables for the test using the
minio server. the tempdir is used for the rundir of minio, and it
is also used for holding the log file of this tool. one might want
to check it when necessary.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).
We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
bootstrap procedure, before it sets up group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.
Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
feature enabled (which means we should start upgrading to use Raft),
or restarting in the middle of Raft upgrade procedure, schema pulls must
remain enabled until the Raft upgrade procedure finishes.
This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
perform a schema pull.
The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`, we just need
to set the bool. Also print a log message when schema pulls are
disabled.
Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.
Correct some typos in comments and update the comments.
Fixes #12870
Closes #14428
* github.com:scylladb/scylladb:
raft_group_registry: remove `has_group0()`
raft_group0_client: remove `using_raft()`
migration_manager: disable schema pulls when schema is Raft-managed
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key into the digest. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d, it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes #4485.
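The expiry-insensitive digest rule described above can be sketched as follows. This is an illustrative Python model (partitions are `(key, rows_after_compaction)` pairs, hashing via sha256), not the actual schema digest code:

```python
import hashlib

def schema_digest(partitions, skip_empty=True):
    """partitions: list of (partition_key, rows_after_compaction).
    With skip_empty (the behavior gated by the new cluster feature), a
    partition whose content compacted away contributes nothing, so the
    digest no longer changes when expired tombstones are finally purged."""
    h = hashlib.sha256()
    for key, rows in partitions:
        if skip_empty and not rows:
            continue                 # ignore empty partitions entirely
        h.update(key.encode())       # old behavior fed this even when rows == []
        for r in rows:
            h.update(r.encode())
    return h.hexdigest()
```

With the old behavior, the digest depends on whether an all-tombstone partition has already been compacted away; with the new rule it does not.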
SELECT clause components (selectors) are currently evaluated during query execution
using a stateful class hierarchy. This state is needed to hold intermediate state while
aggregating over multiple rows. Because the selectors are stateful, we must re-create
them each query using a selector_factory hierarchy.
We'd like to convert all of this to the unified expression evaluation machinery, so we can
have just one grammar for expressions, and just one way to evaluate expressions, but
the statefulness makes this complex.
In commit 59ab9aac44 "(Merge 'functions: reframe aggregate functions in terms
of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving
their state to aggregate_function_selector::_accumulator, and therefore into the
class hierarchy we're addressing now. Another reason for keeping state is that selectors
that aren't aggregated capture the first value they see in a GROUP BY group.
Since expressions can't contain state directly, we break apart expressions that contain
aggregate functions into two: an inner expression that processes incoming rows within
a group, and an outer expression that generates the group's output. The two expressions
communicate via a newly introduced expression element: a temporary.
The problem of non-aggregated columns requiring state is solved by encapsulating
those columns in an internal aggregate function, called the "first" function.
In terms of performance, this series has little effect, since the common case of selectors
that only contain direct column references without transformations is evaluated via a fast
path (`simple_selection`). This fast-path is preserved with almost no changes.
While the series makes it possible to start to extend the grammar and unify expression
syntaxes, it does not do so. The grammar is unchanged. There is just one breaking change:
the `SELECT JSON` statement generates json object field names based on the input selectors.
In one case the name of the field has changed, but it is an esoteric case (where a function call
is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra.
Closes#14467
* github.com:scylladb/scylladb:
cql3: selection: drop selector_factories, selectables, and selectors
cql3: select_statement: stop using selector_factories in SELECT JSON
cql3: selection: don't create selector_factories any more
cql3: selection: collect column_definitions using expressions
cql3: selection: reimplement selection::is_aggregate()
cql3: selection: evaluate aggregation queries via expr::evaluate()
cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
cql3: selection: store primary key in result_set_builder
cql3: expression: fix field_selection::type interpretation by evaluate()
cql3: selection: make result_set_builder::current non-optional<>
cql3: selection: simplify row/group processing
cql3: selection: convert requires_thread to expressions
cql: selection: convert used_functions() to expressions
cql3: selection: convert is_reducible/get_reductions to expressions
cql3: selection: convert is_count() to expressions
cql3: selection convert contains_ttl/contains_writetime to work on expressions
cql3: selection: make simple_selectors stateless
cql3: expression: add helper to split expressions with aggregate functions
cql3: selection: short-circuit non-aggregations
cql3: selection: drop validate_selectors
cql3: select_statement: force aggregation if GROUP BY is used
cql3: select_statement: levellize aggregation depth
cql3: selection: skip first_function when collecting metadata
cql3: select_statement: explicitly disable automatic parallelization with no aggregates
cql3: expression: introduce temporaries
cql3: select_statement: use prepared selectors
cql3: selection: avoid selector_factories in collect_metadata()
cql3: expressions: add "metadata mode" formatter for expressions
cql3: selection: convert collect_metadata() to the prepared expression domain
cql3: selection: convert processes_selection to work on prepared expressions
cql3: selection: prepare selectors earlier
cql3: raw_selector: deinline
cql3: expression: reimplement verify_no_aggregate_functions()
cql3: expression: add helpers to manage an expression's aggregation depth
cql3: expression: improve printing of prepared function calls
cql3: functions: add "first" aggregate function
Will recreate schema_ptr's from schema tables like during table
alter. Will be needed when digest calculation changes in reaction to
cluster feature at run time.
SELECT JSON uses selector_factories to obtain the names of the
fields to insert into the json object, and we want to drop
selector_factories entirely. Switch instead to the ":metadata" mode
of printing expressions, which does what we want.
Unfortunately, the switch changes how system functions are converted
into field names. A function such as unixtimestampof() is now rendered
as "system.unixtimestampof()"; before it did not have the keyspace
prefix.
This is a compatibility problem, albeit an obscure one. Since the new
behavior matches Cassandra, and the odds of hitting this are very low,
I think we can allow the change.
The replica needs to know which columns we're interested in. Iterate
and recurse into all selector expressions to collect all mentioned columns.
We use the same algorithm that create_factories_and_collect_column_definitions()
uses, even though it is quadratic, to avoid causing surprises.
When constructing a selection_with_processing, split the
selectors into an inner loop and an outer loop with split_aggregation().
We can then reimplement add_input_row() and get_output_row() as follows:
- add_input_row(): evaluate the inner loop expressions and store
the results in temporaries
- get_output_row(): evaluate the outer loop expressions, pulling in
values from those temporaries.
reset(), which is called between groups, simply copies the initial
values gathered by split_aggregation() into the temporaries.
The only complexity comes from add_column_for_post_query_processing(),
which essentially re-does the work of split_aggregation(). It would
be much better if we added the column before split_aggregation() was
called, but some refactoring has to take place before that happens.
In three cases we need to consult a column that's possibly not explicitly
selected:
- for the WHERE clause
- for GROUP BY
- for ORDER BY
The return value of the function is the index where the newly-added
column can be found. Currently, the index is correct for both
the internal column vector and the result set, but soon it won't
be.
In the first two cases (WHERE clause and GROUP BY), we're interested
in the column before grouping; in the last case (ORDER BY) we're interested
in the column after grouping, so we need to distinguish between the two.
Since we already have selection::index_of() that returns the pre-grouping
index, choose the post-grouping index for the return value of
selection::add_column_for_post_processing(), and change the GROUP BY
code to use index_of(). Comments are added.
Now that everything is in place, implement the fast-path
transform_input_row() for selection_with_processing. It's a
straightforward call to evaluate() in a loop.
We adjust add_column_for_post_processing() to also update _selectors,
otherwise ORDER BY clauses that require an additional column will not
see that column.
Since every sub-class implements transform_input_row(), mark
the base class declaration as pure virtual.
expr::evaluate() expects an exploded primary key in its
evaluation_inputs structure (this dates back from the conversion
of filtering to expressions). But right now, the exploded primary
key is only available in the filter.
That's easy to fix however: move the primary key containers
to result_set_builder and just keep references in the filter.
After this, we can evaluate column_value expressions that
reference the primary key.
field_selection::type refers to the type of the selection operation,
not the type of the structure being selected. This is what
prepare_expression() generates and how all other expression elements
work, but evaluate() for field_selection thinks it's the type
of the structure, and so fails when it gets an expression
from prepare_expression().
Fix that, and adjust the tests.
Previously, we used the engagedness of the optional
result_set_builder::current as a flag, but the previous patch eliminated
that and it's always engaged. Remove the optional wrapper to reduce noise.
Processing a result set relies on calling result_set_builder::new_row().
This function is quite complex as it has several roles:
- complete processing of the previously computed row, if any
- determine if GROUP BY grouping has changed, and flush the previous group
if so
- flush the last group if that's the case
This works now, but won't work with expr::evaluate. The reason is that
new_row() is called after the partition key and clustering key of the
new row have been evaluated, so processing of the previous row will see
incorrect data. It works today because we copy the partition key and
clustering key into result_set_builder::current, but expr::evaluate
uses the exploded partition key and clustering key, which have been
clobbered.
The solution is to separate the roles. Instead of new_row() that's
responsible for completing the previous row and starting a new one,
we have start_new_row() that's responsible for what its name says,
and complete_row() that's responsible for completing the row and
checking for group change. The responsibility for flushing the final
group is moved to result_set_builder::build(). This removes the
awkward "more_rows_coming" parameter that makes everything more
complicated.
result_set_builder::current is still optional, but it's always
engaged. The next patch will clean that up.
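The split of roles described above can be modelled with a small sketch. This is a simplified Python illustration (rows as dicts, grouping by a key function), not the real result_set_builder:

```python
class ResultSetBuilder:
    """Model of the split: start_new_row() only starts a row,
    complete_row() finishes it and checks for a group change, and build()
    flushes the final group -- removing the old 'more_rows_coming' flag."""
    def __init__(self, group_key):
        self._group_key = group_key
        self._current = None
        self._last_group = None
        self._group_rows = []
        self._out = []

    def start_new_row(self, row):
        self._current = row

    def complete_row(self):
        key = self._group_key(self._current)
        if self._last_group is not None and key != self._last_group:
            self._flush_group()          # grouping changed: flush previous group
        self._last_group = key
        self._group_rows.append(self._current)
        self._current = None

    def _flush_group(self):
        self._out.append(list(self._group_rows))
        self._group_rows = []

    def build(self):
        if self._group_rows:
            self._flush_group()          # the final group is flushed here
        return self._out
```

Because completing the previous row happens before the next row's keys are evaluated, nothing gets clobbered between rows.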
used_functions() is used to check whether prepared statements need
to be invalidated when user-defined functions change.
We need to skip over empty scalar components of aggregates, since
these can be defined by users (with the same meaning as if the
identity function was used).
The current version of automatic query parallelization works when all
selectors are reducible (e.g. have a state_reduction_function member),
and all the inputs to the aggregates are direct column selectors without
further transformation. The actual column names and reductions need to
be packed up for forward_service to be used.
Convert is_reducible()/get_reductions() to the expression world. The
conversion is fairly straightforward.
contains_ttl/contains_writetime are two attributes of a selection. If a selection
contains them, we must ask the replica to send them over; otherwise we don't
have data to process. Not sending ttl/writetime saves some effort.
The implementation is a straightforward recursive descent using expr::find_in_expression.
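A recursive descent of this kind can be sketched over a toy expression tree. This is a hypothetical Python model (nodes are tuples: `("column", name)` or `("fn", name, [args])`), not the actual expr::find_in_expression API:

```python
def find_in_expression(expr, pred):
    """Return True if any node in the expression tree satisfies pred.
    A node is either a leaf ('column', name) or a call ('fn', name, [args])."""
    if pred(expr):
        return True
    if expr[0] == "fn":
        return any(find_in_expression(a, pred) for a in expr[2])
    return False

# Does the selection mention writetime() anywhere, however deeply nested?
def contains_writetime(expr):
    return find_in_expression(
        expr, lambda n: n[0] == "fn" and n[1] == "writetime")
```

The same shape of traversal answers contains_ttl, used_functions and similar selection attributes.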
Now that we push all GROUP BY queries to selection_with_processing,
we always process rows via transform_input_row() and there's no
reason to keep any state in simple_selectors.
Drop the state and raise an internal error if we're ever
called for aggregation.
Aggregate functions cannot be evaluated directly, since they implicitly
refer to state (the accumulator). To allow for evaluation, we
split the expression into two: an inner expression that is evaluated
over the input vector (once per element). The inner expression calls
the aggregation function, with an extra input parameter (the accumulator).
The outer expression is evaluated once per input vector; it calls
the final function, and its input is just the accumulator. The outer
expression also contains any expressions that operate on the result
of the aggregate function.
The accumulator is stored in a temporary.
Simple example:
sum(x)
is transformed into an inner expression:
t1 = (t1 + x) // really sum.aggregation_function
and an outer expression:
result = t1 // really sum.state_to_result_function
Complicated example:
scalar_func(agg1(x, f1(y)), agg2(x, f2(y)))
is transformed into two inner expressions:
t1 = agg1.aggregation_function(t1, x, f1(y))
t2 = agg2.aggregation_function(t2, x, f2(y))
and an outer expression
output = scalar_func(agg1.state_to_result_function(t1),
agg2.state_to_result_function(t2))
There's a small wart: automatically parallelized queries can generate
"reducible" aggregates that have no state_to_result function, since we
want to pass the state back to the coordinator. Detect that and short
circuit evaluation to pass the accumulator directly.
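The inner/outer split above can be modelled with a tiny sketch. This is an illustrative Python model of the evaluation scheme (temporaries as a dict, inner and outer expressions as plain functions), not the real expression machinery:

```python
def evaluate_aggregation(rows, inner, outer, initial):
    """Model of the split: 'inner' runs once per input row and updates the
    temporaries (add_input_row), 'outer' runs once per group and turns the
    temporaries into the output (get_output_row).  'initial' holds the
    starting values that reset() copies in between groups."""
    temps = dict(initial)            # reset(): copy initial values
    for row in rows:
        temps = inner(temps, row)    # inner expression, once per row
    return outer(temps)              # outer expression, once per group

# sum(x): the inner expression accumulates into t1,
# the outer expression just reads the accumulator out.
inner_sum = lambda t, row: {"t1": t["t1"] + row["x"]}
outer_sum = lambda t: t["t1"]
```

The "complicated example" in the text works the same way, just with one temporary per aggregate and an outer expression that combines them.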
Currently, selector evaluation assumes the most complex case
where we aggregate, so multiple input rows combine into one output row.
In effect the query either specifies an outer loop (for the group)
and an inner loop (for input rows), or it only specifies the inner loop;
but we always perform the outer and inner loop.
Prepare to have a separate path for the non-aggregation case by
introducing transform_input_row().
GROUP BY is typically used with aggregation. In one case the aggregation
is implicit:
SELECT a, b, c
FROM tab
GROUP BY x, y, z
One row will appear from each group, even though no aggregation
was specified. To avoid this irregularity, rewrite this query as
SELECT first(a), first(b), first(c)
FROM tab
GROUP BY x, y, z
This allows us to have different paths for aggregations and
non-aggregations, without worrying about this special case.
Avoid mixed aggregate/non-aggregate queries by inserting
calls to the first() function. This allows us to avoid internal
state (simple_selector::_current) and make selector evaluation
stateless apart from explicit temporaries.
We plan to rewrite aggregation queries that have a non-aggregating
selector using the first function, so that all selectors are
aggregates (or none are). Prevent the first function from affecting
metadata (the auto-generated column names), by skipping over the
first function if detected. The input and output types are unchanged,
so this only affects the name.
A query of the form `SELECT foo, count(foo) FROM tab` returns the first
value of the foo column along with the count. This can't be parallelized
today since the first selector isn't an aggregate.
We plan to rewrite the query internally as `SELECT first(foo), count(foo)
FROM tab`, in order to make the query more regular (no mixing of aggregates
and non-aggregates). However, this will defeat the current check since
after the rewrite, all selectors are aggregates.
Prepare for this by performing the check on a pre-rewrite variable, so
it won't be affected by the query rewrite in the next patch.
Note that even though we could add support for running
first() in parallel, it's not possible to get the correct results,
since first() is not commutative and we don't reduce in order. It's
also not a particularly interesting query.
Temporaries are similar to bind variables - they are values provided from
outside the expression. While bind variables are provided by the user, temporaries
are generated internally.
The intended use is for aggregate accumulator storage. Currently aggregates
store the accumulator in aggregate_function_selector::_accumulator, which
means the entire selector hierarchy must be cloned for every query. With
expressions, we can have a single expression object reused for many computations,
but we need a way to inject the accumulator into an aggregation, which this
new expression element provides.
Change one more layer of processing to work on prepared
rather than raw selectors. This moves the call to prepare
the selectors early in select_statement processing. In turn
this changes maybe_jsonize_select_clause() and forward_service's
mock_selection() to work in the prepared realm as well.
This moves us one step closer to using evaluate() to process
the select clause, as the prepared selectors are now available
in select_statement. We can't use them yet since we can't evaluate
aggregations.
When returning a result set (and when preparing a statement), we
return metadata about the result set columns. Part of that is the
column names, which are derived from the expressions used as selectors.
Currently, they are computed via selector::column_name(), but as
we're dismantling that hierarchy we need a different way to obtain
those names.
It turns out that the expression formatter is close enough to what
we need. To avoid disturbing the current :user mode, add a new
:metadata mode and apply the adjustments needed to bring it in line
with what column metadata looks like today.
Note that column metadata is visible to applications and they can
depend on it; e.g. the Python driver allows choosing columns based on
their names rather than ordinal position.
processes_selection() checks whether a selector passes through a column
or applies some form of processing (like a case or function application).
It's more sensible to do this in the prepared domain as we have more
information about the expression. It doesn't really help here, but
it does help the refactoring later in the series.
Currently, each selector expression is individually prepared, then converted
into a selector object that is later executed. This is done (on a vector
of raw selectors) by cql3::selection::raw_selector::to_selectables().
Split that into two phases. The first phase converts raw_selector into
a new struct prepared_selector (a better name would be plain 'selector',
but it's taken for now). The second phase continues the process and
converts prepared_selector into selectables.
This gives us a full view of the prepared expressions while we're
preparing the select clause of the select statement.
Most clauses in a CQL statement don't tolerate aggregate functions,
and so they call verify_no_aggregate_functions(). It can now be
reimplemented in terms of aggregation_depth(), removing some code.
We define the "aggregation depth" of an expression by how many
nested aggregation functions are applied. In CQL/SQL, legal
values are 0 and 1, but for generality we deal with any aggregation depth.
The first helper measures the maximum aggregation depth along any path
in the expression graph. If it's 2 or greater, we have something like
max(max(x)) and we should reject it (though these helpers don't). If
we get 1 it's a simple aggregation. If it's zero then we're not aggregating
(though CQL may decide to aggregate anyway if GROUP BY is used).
The second helper edits an expression to make sure the aggregation depth
along any path that reaches a column is the same. Logically,
`SELECT x, max(y)` does not make sense, as one is a vector of values
and the other is a scalar. CQL resolves the problem by defining x as
"the first value seen". We apply this resolution by converting the
query to `SELECT first(x), max(y)` (where `first()` is an internal
aggregate function), so both selectors refer to scalars that consume
vectors.
When a scalar is consumed by an aggregate function (for example,
`SELECT max(x), min(17)`), we don't have to bother, since a scalar
is implicitly promoted to a vector by evaluating it for every row. There
is some ambiguity if the scalar is a non-pure function (e.g.
`SELECT max(x), min(random())`), but it's not worth pursuing.
A small unit test is added.
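The first helper can be sketched as a recursive walk over the expression tree (the `expr` type below is invented for illustration; the real helper works on Scylla's expression variant):

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical sketch of the first helper: the aggregation depth of an
// expression is the maximum number of nested aggregate function calls
// along any path from the root to a leaf.
struct expr {
    bool is_aggregate = false;
    std::vector<std::shared_ptr<expr>> children;
};

unsigned aggregation_depth(const expr& e) {
    unsigned child_max = 0;
    for (const auto& c : e.children) {
        child_max = std::max(child_max, aggregation_depth(*c));
    }
    // an aggregate node adds one level on top of its deepest child
    return child_max + (e.is_aggregate ? 1 : 0);
}
```

A caller would then reject depth >= 2 (e.g. `max(max(x))`), treat 1 as a simple aggregation, and 0 as no aggregation.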
Currently, a prepared function_call expression is printed as an
"anonymous function", but it's not really anonymous - the name is
available. Print it out.
This helps in a unit test later on (and is worthwhile by itself).
Split the long-running test
test_memtable_with_many_versions_conforms_to_mutation_source into two
tests, for _plain and _reverse.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14447
The formatter for sstables::generation_type does not support the "d"
specifier, so we should not use "{:d}" for printing it. This worked
before d7c90b5239, but after that change, generation_type is no longer
an alias of int64_t, and its formatter does not support "d". So we
should either specialize fmt::formatter<generation_type> to support it
or just drop the specifier.
Since seastar::format() uses
```c++
fmt::format_to(fmt::appender(out), fmt::runtime(fmt), std::forward<A>(a)...);
```
to print the arguments with the given format string, we cannot catch
this kind of error at compile time.
At runtime, if we hit an issue like this, {fmt} throws an exception
like:
```
terminate called after throwing an instance of 'fmt::v9::format_error'
what(): invalid format specifier
```
when constructing the `std::runtime_error` instance.
So, in this change, the "d" specifier is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14427
Also, take this opportunity to make `handle_mutation_fragment()` return void, for better readability.
Closes #14258
* github.com:scylladb/scylladb:
repair: do not check retval of handle_mutation_fragment()
repair: coroutinize move_row_buf_to_working_row_buf()
repair: coroutinize read_rows_from_disk()
repair: coroutinize get_sync_boundary()
first(x) returns the first x it sees in the group. This is useful
for SELECT clauses that return a mix of aggregates and non-aggregates,
for example
SELECT max(x), x
with inputs of x = { 1, 2, 3 } is expected to return (3, 1).
Currently, this behavior is handled by individual selectors,
which means they need to contain extra state for this, which
cannot be easily translated to expressions. The new first function
allows translating the SELECT clause above to
SELECT max(x), first(x)
so all selectors are aggregations and can be handled in the same
way.
The first() function is not exposed to users.
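A minimal sketch of first()'s semantics as an aggregate (names invented here): the accumulator remembers only the first value seen in the group, and the final function is the identity:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch of first(): an aggregate whose state is the first
// value seen in the group; every later value is ignored.
std::optional<int64_t> first_aggregate(const std::vector<int64_t>& group) {
    std::optional<int64_t> state;      // accumulator: empty until first row
    for (int64_t x : group) {
        if (!state) {
            state = x;                 // keep the first value only
        }
    }
    return state;                      // state_to_result is the identity
}
```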
Boost's zip_iterator implements operator==() to return true only if
all the iterators in the enclosed tuple are equal, and the zip_iterator
always advances all the iterators in the tuple. But in our case, some
components might be missing; in other words, the size of the
`components` range might be smaller than that of the `types` range. So,
when the zip_iterator advances past the end of the components, Scylla
starts reading out of bounds.
Because zip_iterator does not allow us to customize how it implements
the equality operator, and we cannot deduce the size of the components
without reading all of them, this change partially reverts
3738fcbe05: instead of using fmt::join(), we just iterate through the
components manually. This avoids the out-of-bounds reads and also
preserves the original behavior.
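The shape of the fix can be sketched like this (function and formatting are invented for illustration): the loop is bounded by the shorter `components` range, so it never advances past its end even though `types` may be longer:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: iterate the components manually instead of zipping
// them with types; the loop bound is the size of the (possibly shorter)
// components range, so no out-of-bounds read can happen.
std::string format_components(const std::vector<std::string>& types,
                              const std::vector<std::string>& components) {
    std::string out;
    for (size_t i = 0; i < components.size() && i < types.size(); ++i) {
        if (i > 0) {
            out += ":";
        }
        out += types[i] + "=" + components[i];
    }
    return out;
}
```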
Branches: 5.3
Fixes #14435
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14457
db::config is pretty large (~32k) and there are four of them, blowing the stack. Fix by
allocating them on the heap.
It's not clear why this shows up on my system (clang 16) and not in the frozen toolchain.
Perhaps clang 16 is less able to reuse stack space.
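The fix pattern is simply to move the large objects behind a pointer (the type below is invented to mirror the ~32k size, not the real db::config):

```cpp
#include <cassert>
#include <memory>

// Hypothetical sketch: a large object (~32k, like db::config) is allocated
// on the heap instead of the stack, so several live instances no longer
// blow the stack.
struct big_config {
    char blob[32 * 1024] = {};
};

std::unique_ptr<big_config> make_config() {
    return std::make_unique<big_config>();  // heap allocation, tiny stack cost
}
```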
Closes #14464
In our CQL documentation, there was no information that the type can
be omitted in a DESCRIBE statement.
Added this information along with the order in which the element is
looked up.
So far generic describe (`DESC <name>`) followed Cassandra
implementation and it only described keyspace/table/view/index.
This commit adds UDT/UDF/UDA to generic describe.
Fixes: #14170
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
To prevent such a situation, the code of both reader merger classes
was modified not to perform recursion at all. Most of the code of
operator()() was moved to maybe_produce_batch(), which does not recur:
if it cannot produce a fragment, it returns std::nullopt, and
operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
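The restructuring can be sketched as follows (names simplified and invented; the real code returns futures and respects preemption): the helper signals "no progress" with std::nullopt and the caller loops, keeping stack depth constant no matter how many readers must be advanced:

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch: instead of operator()() recursing until a fragment
// is produced, maybe_produce_batch() returns std::nullopt when it cannot
// make progress, and the caller retries in a loop.
std::optional<int> maybe_produce_batch(int& readers_left) {
    if (readers_left > 0) {
        --readers_left;        // e.g. open or advance another reader
        return std::nullopt;   // no fragment yet; caller will retry
    }
    return 42;                 // finally produced a "fragment"
}

int produce_fragment(int readers_left) {
    // stands in for seastar::repeat_until_value
    while (true) {
        if (auto v = maybe_produce_batch(readers_left)) {
            return *v;
        }
    }
}
```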
Fixes: scylladb/scylladb#14415
Closes #14452
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).
We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
bootstrap procedure, before it sets up group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.
Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
feature enabled (which means we should start upgrading to use Raft),
or restarting in the middle of Raft upgrade procedure, schema pulls must
remain enabled until the Raft upgrade procedure finishes.
This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
perform a schema pull.
The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`, we just need
to set the bool. Also print a log message when schema pulls are
disabled.
Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.
Correct some typos in comments and update the comments.
Fixes #12870
Modify task_manager::task::impl::get_progress method so that,
whenever relevant, progress is calculated based on children's
progress. Otherwise progress indicates only whether the task
is finished or not.
The method may be overridden in inheriting classes.
Closes #14381
* github.com:scylladb/scylladb:
tasks: delete task_manager::task::impl::_progress as it's unused
tasks: modify task_manager::task::impl::get_progress method
tasks: add is_complete method
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes #14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
Fixes https://github.com/scylladb/scylladb/issues/14033
This PR:
- replaces the OUTDATED list of platforms supported by Unified Installer with a link to the "OS Support" page. In this way, the list of supported OSes will be documented in one place, preventing outdated documentation.
- improves the language and syntax, including:
- Improving the wording.
- Replacing "Scylla" with "ScyllaDB"
- Fixing language mistakes
- Fixing heading underline so that the headings render correctly.
Closes #14445
* github.com:scylladb/scylladb:
doc: update the language - Unified Installer page
doc: update Unified Installer support
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.
In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.
To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
It's unnecessary to do this for a cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.
Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).
Fixes: #14066
Closes #14336
This series aims at hardening schema merges and preventing inconsistencies across shards by
updating the database shards before calling the notification callback.
As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux.
In addition, any error while updating the keyspace will cause an abort, so as not to leave the database shards in an inconsistent state.
Other changes optimize this path by:
- updating shard 0 first, to seed the effective_replication_map.
- executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change.
Fixes #13137
Closes #14158
* github.com:scylladb/scylladb:
migration_manager: propagate listener notification exceptions
storage_service: keyspace_changed: execute only on shard 0
database: modify_keyspace_on_all_shards: execute func first on shard 0
database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
database: add modify_keyspace_on_all_shards
schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
database: create_keyspace_on_all_shards
database: update_keyspace_on_all_shards
database: drop_keyspace_on_all_shards
This commit improves the language and syntax on
the Unified Installer page. The changes cover:
- Improving the wording.
- Replacing "Scylla" with "ScyllaDB"
- Fixing language mistakes
- Fixing heading underline so that the headings
render correctly.
This commit replaces the OUTDATED list of platforms supported
by Unified Installer with a link to the "OS Support" page.
In this way, the list of supported OSes will be documented
in one place, preventing outdated documentation.
Modify task_manager::task::impl::get_progress method so that,
whenever relevant, progress is calculated based on children's
progress. Otherwise progress indicates only whether the task
is finished or not.
Reduce test string value size, parallelize inserts, and use a prepared
statement. The debug running time for this test is reduced from 13:18
to 7:52.
Refs #13905
Closes #14380
* github.com:scylladb/scylladb:
test/boost/index_with_paging_test: parallel insert
test/boost/index_with_paging_test: prepared statement
test/boost/index_with_paging_test: reduce running time
handle_mutation_fragment() does not return `stop_iteration::yes`
anymore after fbbc86e18c, so let's stop checking its return value and
make it return void.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.
As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.
This should improve CI stability.
Fixes: #12972
Refs: #14042
Closes #14354
* github.com:scylladb/scylladb:
messaging_service: print which connections are dropped due to missing topology info
storage_service: wait for nodes to be UP on bootstrap
storage_service: wait for NORMAL state handler before `setup_group0()`
storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
A GROUP BY combined with aggregation should produce a single
row per group, except for empty groups. This is in contrast
to an aggregation without GROUP BY, which produces a single
row no matter what.
The existing code only considered the case of no grouping
and forced a row into the result, but this caused an unwanted
row if grouping was used.
Fix by refining the check to also consider GROUP BY.
XFAIL tests are relaxed.
Fixes #12477.
Note, forward_service requires that aggregation produce
exactly one row, but since it can't work with grouping,
it isn't affected.
Closes #14399
Since most group 0 commands are just mutations, it is easy and more
efficient to combine them before passing them to the subsystem they are
destined for. The logic that handles those mutations in a subsystem
will then run once per batch of commands instead of once per individual
command. This is especially useful when a node catches up to a leader
and receives a lot of commands at once.
The patch here does exactly that. It combines commands into a single
command when possible, but it preserves the order between commands: each
time it encounters a command for a different subsystem, it flushes the
already combined batch and starts a new one. This extra safety assumes
that there are dependencies between the subsystems managed by group 0,
so the order matters. That may not be the case now, but we prefer to be
on the safe side.
Broadcast table commands are not mutations, so they are never combined.
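The merging rule described above can be sketched like this (the `command` type and string-keyed subsystems are invented for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: consecutive commands destined for the same subsystem
// are combined into one batch; a command for a different subsystem flushes
// the running batch first, preserving the order between subsystems.
struct command {
    std::string subsystem;
    int payload;
};

std::vector<std::vector<command>> merge_commands(const std::vector<command>& cmds) {
    std::vector<std::vector<command>> batches;
    for (const auto& c : cmds) {
        if (batches.empty() || batches.back().back().subsystem != c.subsystem) {
            batches.emplace_back();    // flush: start a batch for the new subsystem
        }
        batches.back().push_back(c);   // combine with the running batch
    }
    return batches;
}
```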
* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
test: add test for group0 raft command merging
service: raft: respect max mutation size limit when persisting raft entries
group0_state_machine: merge commands before applying them whenever possible
This connection dropping caused us to spend a lot of time debugging.
Those debugging sessions would be shorter if Scylla logs indicated that
connections are being dropped and why.
Connection drops for a given node are a one-time event - we only do it
if we establish a connection to a node without topology info, which
should only happen before we handle the node's NORMAL status for the
first time. So it's a rare thing and we can log it on INFO level without
worrying about log spam.
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.
Refs: #14042
This doesn't completely fix #14042 yet because it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
We do it only in non-raft-topology mode, because with Raft-based
topology, node state changes are propagated to the cluster through
explicit global barriers and we plan to remove node statuses from
gossiper altogether.
Fixes: #12972
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.
This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
Closes #14373
Parallelize inserts for long-running test_index_with_paging.
Run time in debug mode reduced by 1 minute 48 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Reduce the test string value size for test_index_with_paging from 4096
to 100. With 100 bytes, the base row should still be significantly
larger than the key, so the test will exercise both types of paging in
the scanning code.
The debug running time for this test is reduced from 9 minutes to 6
minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.
This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```
Cassandra rejects such queries at the preparation stage: it doesn't allow two `EQ` restrictions on the same clustering column when an IF is involved.
We reject them at runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?` and then run it, but it will unexpectedly throw an `invalid_request_exception` when the two bound variables are different.
We could ban such queries as well; we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.
A better solution would be to allow empty ranges in `LWT` statements: when an empty range is detected, we just wouldn't apply the change. That would be a larger change; for now, let's just fix the crash.
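The shape of the runtime guard can be sketched like this (types and the message are simplified/invented; the real check lives in the modification statement's condition path):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: validate that the clustering ranges are non-empty
// before touching ranges.front(), mirroring the existing partition-key
// check, instead of segfaulting.
struct clustering_range {
    int lo, hi;
};

const clustering_range& checked_front(const std::vector<clustering_range>& ranges) {
    if (ranges.empty()) {
        // reject at runtime instead of crashing on ranges.front()
        throw std::invalid_argument("No clustering ranges found for the conditional statement");
    }
    return ranges.front();
}
```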
Fixes: https://github.com/scylladb/scylladb/issues/13129
Closes #14429
* github.com:scylladb/scylladb:
modification_statement: reject conditional statements with empty clustering key
statements/cas_request: fix crash on empty clustering range in LWT
This piece of `storage_service::wait_for_ring_to_settle()` will be
performed earlier in the boot procedure in follow-up commits.
Make it more generic, to be able to wait for `n` nodes to show up. Here
we wait for `2` nodes - ourselves and at least one other.
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In the lambda
passed to that method, we use both the sharded sstable_directory
service and its local instance.
To make it clear that the sharded and local instances are dependent,
we call sharded<replica::database>::invoke_on_all instead and access
the local directory through the sharded one.
As a preparation for integrating resharding compaction with task manager
a struct and some functions are copied from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
`modification_statement::execute_with_condition` validates that a query with
an IF condition can be executed correctly.
There's already a check for empty partition key ranges, but there was no check
for empty clustering ranges.
Let's add a check for the clustering ranges as well, they're not allowed to be empty.
After this change Scylla outputs the same type of message for empty partition and
clustering ranges, which improves UX.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.
This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```
To fix it, let's throw an exception when the clustering range
is empty. Cassandra also rejects queries with `c = 1 AND c = 2`.
There's also a check for an empty partition range; it used to crash
in the past, so it can't hurt to add it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes #13491
Closes #14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
This script provides a tool to decode a base36-encoded timeuuid
to the underlying msb and lsb bits, and to encode msb and lsb
to a base36 string.
Both Scylla and Cassandra 4.x support this new SSTable identifier used
in SSTable names, like "nb-3fw2_0tj4_46w3k2cpidnirvjy7k-big-Data.db".
Since this is a new way to print a timeuuid, and, unlike the
representation defined by RFC 4122, it is not straightforward to
connect the in-memory representation (0x6636ac00da8411ec9abaf56e1443def0)
to its string representation in SSTable identifiers, like
"3fw2_0tj4_46w3k2cpidnirvjy7k", it is handy to have this tool to
encode/decode the number/string for debugging purposes.
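As a rough sketch of the encoding direction (the digit alphabet and ordering here are a common base36 convention, assumed for illustration rather than taken from the script):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of base36 encoding: repeatedly divide by 36 and map
// the remainders to the digits [0-9a-z], most significant digit first.
std::string to_base36(uint64_t v) {
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    std::string out;
    do {
        out.push_back(digits[v % 36]);
        v /= 36;
    } while (v != 0);
    std::reverse(out.begin(), out.end());  // digits were produced in reverse
    return out;
}
```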
For more context on the new SSTable identifier, please
see
https://cassandra.apache.org/_/blog/Apache-Cassandra-4.1-New-SSTable-Identifiers.html
and https://issues.apache.org/jira/browse/CASSANDRA-17048
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14374
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.
There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.
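A toy model of why the direction of the comparison matters (hypothetical names, not the actual reader code):

```python
# Hedged sketch of the forward-progress check: keep filling until the
# buffer's last fragment position exceeds the position reached by the
# previous fill. With the comparison reversed, the "made progress"
# condition never holds and the whole partition is consumed.
def fill_buffer(fragments, start, prev_last_pos, buf_size, buggy=False):
    buf, i = [], start
    while i < len(fragments):
        buf.append(fragments[i])
        i += 1
        made_progress = (prev_last_pos > fragments[i - 1]) if buggy \
                        else (fragments[i - 1] > prev_last_pos)
        if len(buf) >= buf_size and made_progress:
            break
    return buf

frags = list(range(10))
# Correct direction: the fill stops as soon as the buffer is full and
# has moved past the previous fill's last position.
assert fill_buffer(frags, 0, prev_last_pos=-1, buf_size=3) == [0, 1, 2]
# Reversed direction: the fill never detects progress and consumes
# every fragment up to partition end (the OOM scenario).
assert fill_buffer(frags, 0, prev_last_pos=-1, buf_size=3, buggy=True) == frags
```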
Fixes #13491
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).
Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.
To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.
Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
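A simplified sketch of the squashing the assertions now perform (positions and tombstones here are stand-ins for the real fragment types):

```python
# Hedged sketch: two range_tombstone_change fragments at the same
# position are redundant; squash them, letting the later one win.
def squash(changes):
    out = []
    for pos, tomb in changes:
        if out and out[-1][0] == pos:
            out[-1] = (pos, tomb)  # later change at the same position wins
        else:
            out.append((pos, tomb))
    return out

# A redundant end+start pair at position 5 collapses into one change.
assert squash([(1, "t1"), (5, None), (5, "t1"), (9, None)]) == \
       [(1, "t1"), (5, "t1"), (9, None)]
```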
Add a test that submits 3 large commands, each a little larger
than 1/3 of the maximum mutation size. Check that in the end 2 commands
were executed (the first 2 were merged and the third was executed separately).
The code that preserves raft entries builds one batch statement to store
all of them, but the batch statement's execute() merges all of the
statements into one mutation and passes it to the database. The mutation
can be larger than the max mutation size limit, and the write will fail.
Fix it by splitting the write into multiple batch statements if needed.
Since most group0 commands are just mutations, it is easy to combine them
before passing them to the subsystem they are destined to, which is more
efficient: the logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.
The patch here does exactly that. It combines commands into a single
command if possible, but it preserves the order between commands, so each
time it encounters a command for a different subsystem it flushes the
already combined batch and starts a new one. This extra safety assumes
that there are dependencies between subsystems managed by group0, so the
order matters. This may not be the case now, but we prefer to be on the
safe side.
Broadcast table commands are not mutations, so they are never combined.
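The combining rule described above can be sketched as follows (the subsystem tags and sizes are illustrative, not the real command format):

```python
# Hedged sketch: merge consecutive commands for the same subsystem,
# flushing a batch when the subsystem changes or when adding a command
# would exceed the maximum mutation size.
def combine(commands, max_size):
    batches, cur, cur_size, cur_sub = [], [], 0, None
    for sub, size in commands:
        if cur and (sub != cur_sub or cur_size + size > max_size):
            batches.append(cur)
            cur, cur_size = [], 0
        cur.append((sub, size))
        cur_size += size
        cur_sub = sub
    if cur:
        batches.append(cur)
    return batches

# Three commands, each a bit over 1/3 of the limit: the first two merge,
# the third goes into its own batch -- two executions in total.
batches = combine([("raft", 35), ("raft", 35), ("raft", 35)], max_size=100)
assert len(batches) == 2
```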
Fixes: #12581
… -> ScyllaDB University
Closes #14385
* github.com:scylladb/scylladb:
Update docs/operating-scylla/procedures/backup-restore/index.rst
Fixing broken links to ScyllaDB University lessons, Scylla University -> ScyllaDB University
Fixes#10099
Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex.
Example:
scylla.yaml:
```
authenticator: com.scylladb.auth.CertificateAuthenticator
auth_superuser_name: <name>
auth_certificate_role_query: CN=([^,\s]+)
client_encryption_options:
  enabled: True
  certificate: <server cert>
  keyfile: <server key>
  truststore: <shared trust>
  require_client_auth: True
```
In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. I.e. for cqlsh set "usercert" and "userkey" to these certificate files.
No user/password needs to be sent, but the role will be picked up from the auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables), the authenticator mechanism will reject it.
Otherwise, the connection assumes the role described.
To facilitate this, this also contains the addition of allowing setting super user name + salted passwd via command line/conf + some tweaks to SASL part of connection setup.
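The role-extraction step can be illustrated with the configured regex. This is a sketch only; the real authenticator operates on the parsed certificate subject inside the C++ TLS layer:

```python
import re

# Hedged sketch: apply the configured auth_certificate_role_query regex
# to the client certificate's subject and use the first capture group
# as the role name; no match means the connection is rejected.
ROLE_QUERY = re.compile(r"CN=([^,\s]+)")

def role_from_subject(subject: str):
    m = ROLE_QUERY.search(subject)
    return m.group(1) if m else None

assert role_from_subject("O=ScyllaDB, CN=cassandra") == "cassandra"
assert role_from_subject("O=ScyllaDB") is None
```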
Closes #12214
* github.com:scylladb/scylladb:
docs: Add documentation of certificate auth + auth_superuser_name
auth: Add TLS certificate authenticator
transport: Try to do early, transport based auth if possible
auth: Allow for early (certificate/transport) authentication
auth: Allow specifying initial superuser name + passwd (salted) in config
roles-metadata: Coroutinize some helpers
Otherwise regular compaction can sneak in and
see !cs.sstables_requiring_cleanup.empty() with
cs.owned_ranges_ptr == nullptr and trigger
the internal error in `compaction_task_executor::compact_sstables`.
Fixes scylladb/scylladb#14296
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14297
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
```
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
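The idea can be sketched as follows (hypothetical names; in the real code the predicate is passed down to the sstable set's reader factories):

```python
# Hedged sketch: instead of rebuilding the sstable set without staging
# sstables for every partition, create the reader from the full set
# with a predicate that filters staging sstables out at read time.
class SSTable:
    def __init__(self, name, staging):
        self.name, self.staging = name, staging

def make_reader(sstables, predicate):
    # stand-in for creating a single-key or range-scan reader over
    # only the sstables that pass the predicate
    return [s.name for s in sstables if predicate(s)]

ssts = [SSTable("a", False), SSTable("b", True), SSTable("c", False)]
assert make_reader(ssts, lambda s: not s.staging) == ["a", "c"]
```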
With this improvement, view building was measured to be 3x faster.
from
`INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`
to
`INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`
Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.
Closes #14364
* github.com:scylladb/scylladb:
table: Optimize creation of reader excluding staging for view building
view_update_generator: Dump throughput and duration for view update from staging
utils: Extract pretty printers into a header
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Very helpful for users to understand how fast view update generation
is processing the staging sstables. Today, the logs are completely
silent on that. It's not uncommon for operators to peek into the
staging dir and deduce the throughput based on the removal of files,
which is terrible.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It seems like the current 1-second TTL is too
small for debug build on aarch64 as seen in
https://jenkins.scylladb.com/job/scylla-master/job/build/1513/artifact/testlog/aarch64/debug/cql-pytest.test_using_timestamp.1.log
```
k = unique_key_int()
cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v1}) USING TIMESTAMP {ts} and TTL 1")
cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v2}) USING TIMESTAMP {ts}")
> assert_value(k, v1)
test/cql-pytest/test_using_timestamp.py:140:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
k = 10, expected = 2
def assert_value(k, expected):
select = f"SELECT k, v FROM {table} WHERE k = {k}"
res = list(cql.execute(select))
> assert len(res) == 1
E assert 0 == 1
E + where 0 = len([])
```
Increase the TTL used to write data to de-flake the test
on slow machines running debug build.
Ref #14182
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14396
1e29b07e40 claimed
to make event notification exception safe,
but swallowing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards. Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.
We keep notifying other listeners even if any of them failed.
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```
The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/event_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).
Refs #3389
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.
Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When creating or altering a keyspace, we create a new
effective_replication_map instance.
It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate the new e_r_m (and reach
the same result). When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.
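A minimal model of the seeding pattern (names are illustrative, not the actual e_r_m API):

```python
# Hedged sketch: compute the expensive value once on shard 0, then
# have every other shard clone the seed instead of recomputing it.
def update_all_shards(shards, compute, clone):
    seed = compute()              # expensive calculation, done once
    shards[0] = seed
    for i in range(1, len(shards)):
        shards[i] = clone(seed)   # cheap per-shard copy of the seed
    return shards

calls = []
def compute():
    calls.append(1)
    return {"rf": 3}

shards = update_all_shards([None] * 4, compute, dict)
# the calculation ran exactly once, yet every shard got the result
assert len(calls) == 1
assert all(s == {"rf": 3} for s in shards)
```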
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.
Fixes #13137
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#10099
Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator,
will extract roles from TLS authentication certificate (not wire cert - those are
server side) subject, based on configurable regex.
Example:
scylla.yaml:
authenticator: com.scylladb.auth.CertificateAuthenticator
auth_superuser_name: <name>
auth_certificate_role_queries:
  - source: SUBJECT
    query: CN=([^,\s]+)
client_encryption_options:
  enabled: True
  certificate: <server cert>
  keyfile: <server key>
  truststore: <shared trust>
  require_client_auth: True
In a client, then use a certificate signed with the <shared trust>
store as auth cert, with the common name <name>. I.e. for cqlsh
set "usercert" and "userkey" to these certificate files.
No user/password needs to be sent, but the role will be picked up
from the auth certificate. If none is present, the transport will
reject the connection. If the certificate subject does not
contain a recognized role name (from config or set in tables),
the authenticator mechanism will reject it.
Otherwise, connection becomes the role described.
Instead of locking this to "cassandra:cassandra", allow setting in scylla.yaml
or commandline. Note that config values become redundant as soon as auth tables
are initialized.
Currently, when creating a table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
1. The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
2. The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
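A sketch of the resulting rule (illustrative types, not the actual result_message hierarchy):

```python
# Hedged sketch: permissions are granted only when the CREATE actually
# changed the schema, i.e. the result is a schema_change message --
# not an "already exists" no-op under IF NOT EXISTS, and not a
# bounce_to_shard redirect to shard 0.
class Msg: ...
class SchemaChange(Msg): ...
class BounceToShard(Msg): ...

def maybe_grant(result, grant):
    if isinstance(result, SchemaChange):
        grant()
        return True
    return False

granted = []
assert maybe_grant(SchemaChange(), lambda: granted.append(1)) is True
assert maybe_grant(BounceToShard(), lambda: granted.append(1)) is False
assert granted == [1]   # granted exactly once, for the real creation
```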
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch
Refs #2252, #6033
Closes #12917
By making all changes on temporary variables
and eventually moving them back into the tracker members
in a noexcept block the function can safely throw
until the changes are committed.
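The pattern can be sketched as follows (illustrative, not the actual tracker code):

```python
# Hedged sketch of the strong-exception-safety pattern: all throwing
# work happens on a temporary; the commit is a plain assignment that
# cannot fail, so the tracker is never left half-updated.
def refresh(tracker, compute_contribution):
    new_value = compute_contribution(tracker)  # may throw; tracker untouched
    tracker["contribution"] = new_value        # commit step: cannot throw
    return tracker

def failing(_):
    raise RuntimeError("simulated allocation failure")

t = {"contribution": 1}
try:
    refresh(t, failing)
except RuntimeError:
    pass
assert t == {"contribution": 1}                 # unchanged after a throw
assert refresh(t, lambda _: 2) == {"contribution": 2}
```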
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of providing refresh_sstables_backlog_contribution
that updates the tracker in place, provide a static function
calculate_sstables_backlog_contribution that doesn't change
the tracker state to facilitate exception safety in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Encapsulate the contribution-related members in
struct contribution, to be used for strong exception safety.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although replace_sstables is supposed to be called
only once per {old_ssts, new_ssts}, it is safer
to update `_total_bytes` with `sst->data_size()`
only if the sst was inserted/erased successfully.
Otherwise _total_bytes may go out of sync with the
contents of _all.
That said, the next step should be to refer to the
compaction_group's main sstable set directly rather
than maintaining a "shadow" set in the tracker.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In the next patch, we will want to observe when the result
message is a schema change and handle it differently than
when it is not. This patch adds a helper method for that,
which should be more readable than a dynamic_pointer_cast
and a comparison with nullptr.
Task manager's tasks covering resharding compaction
on top and shard level.
Closes #14112
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test reshaping compaction
compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run()
compaction: add shard_reshaping_compaction_task_impl
replica: delete unused function
compaction: add table_reshaping_compaction_task_impl
compaction: copy reshape to task_manager_module.cc
compaction: add reshaping_compaction_task_impl
Split long running test_aggregate_functions to one case per type.
This allows test.py to run them in parallel.
Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14368
Split long running test test_schema_changes in 3 parts, one for each
writable_sstable_versions so it can be run in parallel by test.py.
Add static checks to alert if the array of types changed.
Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14367
Remove previous configuration blocking parallel run.
Test cases run fine in local debug.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14369
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.
They run with x_log2_compaction_groups of 0 and 1, each one taking
10 to 15 minutes in debug mode, for a total of 28 and 22 minutes.
Split the test cases to run with 0 and 1, so test.py can run them in
parallel.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14356
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after a reboot.
This is incorrect behavior; we should start it during setup.
Also, unmask is unnecessary for enabling the timer.
Fixes #14249
Closes #14252
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.
The problem was introduced in this commit
2d791a5ed4. The
function `encode_replica_exception_for_rpc` used
`default_tuple_maker` callback to create tuples
containing exceptions. Callers returned pointers
to default-constructed values in this callback,
e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
The commit changed this to just `SourceTuple{}`,
which means nullptr for pointer types.
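The idea behind the fix, modeled as a sketch (Python stand-ins for the C++ smart-pointer types):

```python
# Hedged sketch of the make_default_rpc_tuple idea: for pointer-like
# tuple fields, substitute a default-constructed value instead of a
# null, so the RPC layer never has to serialize a null.
def make_default_tuple(field_types):
    # the real function distinguishes smart-pointer types; here every
    # field is simply default-constructed
    return tuple(t() for t in field_types)

class ReconcilableResult:
    """Stand-in for foreign_ptr(make_lw_shared<reconcilable_result>())."""
    def __init__(self):
        self.rows = []

vals = make_default_tuple([ReconcilableResult, int])
assert isinstance(vals[0], ReconcilableResult) and vals[0].rows == []
assert vals[1] == 0   # never a null/None field
```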
Fixes: #14282
Closes #14352
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036
This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.
This commit fixes a bug and must be backported to
branch-5.3 and branch-5.2.
Closes #14372
Compaction tasks covering table major, cleanup, offstrategy,
and upgrade sstables compaction inherit sequence number from their
parents. Thus they do not need to have a new sequence number
generated as it will be overwritten anyway.
Closes #14379
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The series contains mostly cleanups for query processor and no functional
change. The last patch is a small cleanup for the storage_proxy.
* 'qp-cleanup' of https://github.com/gleb-cloudius/scylla:
storage_proxy: remove unused variable
client_state: co-routinise has_column_family_access function
query_processor: get rid of internal_state and create individual query_state for each request
cql3: move validation::validate_column_family from client_state::has_column_family_access
client_state: drop unneeded argument from has.*access functions
cql3: move check for dropping cdc tables from auth to the drop statement code itself
query_processor: co-routinise execute_prepared_without_checking_exception_message function
query_processor: co-routinize execute_direct_without_checking_exception_message function
cql3: remove empty statement::validate functions
cql3: remove empty function validate_cluster_support
cql3/statements: fix indentation and spurious white spaces
query_processor: move statement::validate call into execute_with_params function
query_processor: co-routinise execute_with_params function
query_processor: execute statement::validate before each execution of internal query instead of only during prepare
query_processor: get rid of shared internal_query_state
query_processor: co-routinize execute_paged_internal function
query_processor: co_routinize execute_batch_without_checking_exception_message function
query_processor: co-routinize process_authorized_statement function
It's very annoying to add a declaration to expression.hh and watch
the whole world get recompiled. Improve that by moving less-common
functions to a new header expr-utils.hh. Move the evaluation machinery
to a new header evaluate.hh. The remaining definitions in expression.hh
should not change as often, and thus cause less frequent recompiles.
Closes #14346
* github.com:scylladb/scylladb:
cql3: expr: break up expression.hh header
cql3: expr: restrictions.hh: protect against double inclusions
cql3: constants: deinline
cql3: statement_restrictions: deinline
cql3: deinline operation::fill_prepare_context()
There was a bug in describe_statement. When executing `DESC FUNCTION <uda name>` or `DESC AGGREGATE <udf name>`, Scylla was crashing because the function was found (`functions::find()` searches both UDFs and UDAs) but it was the wrong kind of function and the pointer wasn't checked after the cast.
Added a test for this.
Fixes: #14360
Closes #14332
* github.com:scylladb/scylladb:
cql-pytest:test_describe: add test for filtering UDF and UDA
cql3:statements:describe_statement: check pointer to UDF/UDA
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:
- moving some restrictions-related definitions to
the existing expr/restrictions.hh
- moving evaluation related names to a new header
expr/evaluate.hh
- moving utilities to a new header
expr/expr-utilities.hh
expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
To reduce future header fan-in, deinline all non-trivial functions.
While these are on the hot path, they can't be inlined as they're
virtual, and they're quite heavy anyway.
Checking keyspace/table presence should not be part of authorization code
and it is not done consistently today. For instance keyspace presence
is not checked in "alter keyspace" during authorization, but during
statement execution. Make it consistent.
Checking if a table is CDC log and cannot be dropped should not be done
as part of authentication (this has nothing to do with auth), but in the
drop statement itself. Throwing unauthorized_exception is wrong as well,
but unfortunately it is enshrined with a test. Not sure if it is a good
idea to change it now.
There is a discrepancy on how statement::validate is used. On a regular
path it is called before each execution, but on internal execution
path it is called only once during prepare. Such a discrepancy makes it
hard to reason about what can and cannot be done during the call. Call it
uniformly before each execution. This allows validate to check state that
can change after prepare.
internal_query_state was passed in a shared_ptr since the Java
translation times. It can be a regular C++ type with a lifetime
bound to the execution of the function it was created in.
Make evaluate()'s body more regular, then exploit it by
replacing the long list of branches with a lambda template.
Closes #14306
* github.com:scylladb/scylladb:
cql3: expr: simplify evaluate()
cql3: expr: standardize evaluate() branches to call do_evaluate()
cql3: expr: rename evaluate(ExpressionElement) to do_evaluate()
This is V2 of https://github.com/scylladb/scylladb/pull/14108
This commit moves the installation instructions for the cloud from the [website](https://www.scylladb.com/download/) to the docs.
The scope:
* Added new files with instructions for AWS, GCP, and Azure.
* Added the new files to the index.
* Updating the "Install ScyllaDB" page to create the "Cloud Deployment" section.
* Adding new bookmarks in other files to create stable links, for example, ".. _networking-ports:"
* Moving common files to the new "installation-common" directory. This step is required to exclude the open source-only files in the Enterprise repository.
In addition:
- The Configuration Reference file was moved out of the installation section (it's not about installation at all)
- The links to creating a cluster were removed from the installation page (as not related).
Related: https://github.com/scylladb/scylla-docs/issues/4091
Closes #14153
* github.com:scylladb/scylladb:
doc: remove the rpm-info file (What is in each RPM) from the installation section
doc: move cloud deployment instruction to docs -v2
Spans are slightly cleaner, slightly faster (as they avoid an indirection),
and allow for replacing some of the arguments with small_vector:s.
Closes #14313
There was a bug that caused aggregates to fail when used on case-sensitive column names.
For example:
```cql
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non-case-sensitive column names we convert them to lowercase, but for case-sensitive names we have to preserve the name as originally written.
The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
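The distinction can be sketched with a toy identifier type (hypothetical names and shape, not Scylla's actual `column_identifier` class): a case-sensitive (quoted) name keeps its exact spelling, while a non-case-sensitive one is folded to lowercase.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Sketch only: a CQL-style identifier keeps its exact spelling when marked
// case-sensitive (as for a quoted name like "SomeColumn"), and is folded
// to lowercase otherwise.
struct column_identifier {
    std::string text;
    column_identifier(std::string raw, bool keep_case) {
        if (!keep_case) {
            std::transform(raw.begin(), raw.end(), raw.begin(),
                           [](unsigned char c) { return std::tolower(c); });
        }
        text = std::move(raw);
    }
};
```

The bug amounted to constructing the identifier with `keep_case == false` for a name that was originally quoted, so the lowercase form no longer matched the stored schema.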
Fixes: https://github.com/scylladb/scylladb/issues/14307
Closes #14340
* github.com:scylladb/scylladb:
service/forward_service.cc: make case-sensitivity explicit
cql-pytest/test_aggregate: test case-sensitive column name in aggregate
forward_service: fix forgetting case-sensitivity in aggregates
Task manager task covering compaction group major
compaction.
Uses multiple inheritance on the already existing
major_compaction_task_executor to keep track of
the operation with the task manager.
Closes #14271
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py
test: use named variable for task tree depth
compaction: turn major_compaction_task_executor into major_compaction_task_impl
compaction: take gate holder out of task executor
compaction: extend signature of some methods
tasks: keep shared_ptr to impl in task
compaction: rename compaction_task_executor methods
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.
This works not only for nodes that restart, but also for nodes that were running behind a network partition and were removed. Even when the partition resolves, the existing nodes will effectively firewall that node off.
Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.
Closes #13850
* github.com:scylladb/scylladb:
test: pylib: increase checking period for `get_alive_endpoints`
test: add node banning test
test: pylib: manager_client: `get_cql()` helper
test: pylib: ScyllaCluster: server pause/unpause API
raft topology: ban left nodes
raft topology: skip `left_token_ring` state during `removenode`
raft topology: prepare decommission path for node banning
raft topology: introduce `left_token_ring` state
raft topology: `raft_topology_cmd` implicit constructor
messaging_service: implement host banning
messaging_service: exchange host IDs and map them to connections
messaging_service: store the node's host ID
messaging_service: don't use parameter defaults in constructor
main: move messaging_service init after system_keyspace init
Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.
Closes #14343
Make it explicit that the boolean argument determines case-sensitivity. It emphasizes its importance.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug which made aggregates fail when used with case-sensitive
column names.
Add a test to make sure that this doesn't happen in the future.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug that caused aggregates to fail when
used on case-sensitive column names.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non-case-sensitive column names we convert them to lowercase,
but for case-sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The chunk size used in sstable compression can be set when creating a
table, using the "chunk_length_in_kb" parameter. It can be any power-of-two
multiple of 1KB. Very large compression chunks are not useful - they
offer diminishing returns on compression ratio, and require very large
memory buffers and reading a very large amount of disk data just to
read a small row. In fact, small chunks are recommended - Scylla
defaults to 4 KB chunks, and Cassandra lowered their default from 64 KB
(in Cassandra 3) to 16 KB (in Cassandra 4).
Therefore, allowing arbitrarily large chunk sizes is just asking for
trouble. Today, a user can ask for a 1 GB chunk size, and crash or hang
Scylla when it runs out of memory. So in this patch we add a hard limit
of 128 KB for the chunk size - anything larger is refused.
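The validation described above can be sketched as follows (a sketch under assumed names, not the actual Scylla code): reject anything that is not a power-of-two number of kilobytes, and cap the result at 128 KB.

```cpp
#include <stdexcept>
#include <string>

// Assumed constant matching the limit described in the commit message.
constexpr unsigned max_chunk_length_kb = 128;

// Validates "chunk_length_in_kb" and returns the chunk size in bytes.
unsigned validate_chunk_length_kb(unsigned kb) {
    // A power of two has exactly one bit set: kb & (kb - 1) == 0.
    if (kb == 0 || (kb & (kb - 1)) != 0) {
        throw std::invalid_argument("chunk_length_in_kb must be a power of 2");
    }
    if (kb > max_chunk_length_kb) {
        throw std::invalid_argument("chunk_length_in_kb must not exceed " +
                                    std::to_string(max_chunk_length_kb));
    }
    return kb * 1024;
}
```

With this check in place, a request for a 1 GB chunk (`chunk_length_in_kb = 1048576`) is refused at table-creation time instead of exhausting memory at read time.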
Fixes #9933
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14267
This reverts commit 562087beff.
The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes #10459
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
4. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
5. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
6. Sstable cleanup is not wired properly yet
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make life easier, support was added to obtain the table pointer from the schema pointer:
```
schema_ptr s;
s->table().shard_of(...)
```
Closes #13939
* github.com:scylladb/scylladb:
locator: network_topology_strategy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes #11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics is executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to the creating scheduling group and using it, not the
active one, when/if creating new metrics.
Closes #14294
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
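The core of the idea can be sketched like this (a simplified model under assumed names, not the real load_sketch class): keep a per-shard tablet count, always pick the shard with the lowest count, and charge it for the new tablet.

```cpp
#include <cstddef>
#include <vector>

// Toy model of least-loaded-shard allocation: every tablet, whether new
// or already allocated for another table, has equal weight (a count of 1).
struct load_sketch_sim {
    std::vector<size_t> per_shard_tablets;
    explicit load_sketch_sim(size_t shards) : per_shard_tablets(shards, 0) {}

    // Returns the least-loaded shard and records the new tablet on it.
    size_t next_shard() {
        size_t best = 0;
        for (size_t s = 1; s < per_shard_tablets.size(); ++s) {
            if (per_shard_tablets[s] < per_shard_tablets[best]) {
                best = s;
            }
        }
        ++per_shard_tablets[best];
        return best;
    }
};
```

Starting from an even load, this degenerates to round-robin; starting from an uneven load (pre-existing tablets), it first fills the emptier shards.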
We still use it in many places in unit tests, which is ok because
those tables are vnode-based.
We want to check incorrect uses in production as they may lead to hard
to debug consistency problems.
This is in order to prevent new incorrect uses of dht::shard_of() from
being accidentally added. It also makes sure that all current uses are
caught by the compiler and require an explicit rename.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
This is not strictly necessary, as the multishard reader will later be
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use erm->get_sharder() instead of
schema::get_sharder().
schema::get_sharder() does not use the correct sharder for
tablet-based tables. Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.
Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:
```
start_token = sharder.token_for_next_shard(end_token, shard);
```
If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
Soon, compute_shards_for_this_sstable() will need to take a sharder object.
open_data() is called indirectly from sstable::load() and directly
after writing an sstable from various paths. The latter don't really
need to compute shards, since the field is already set by the writer. In
order to reduce code churn, move compute_shards_for_this_sstable() to
the load() path only so that only load() needs to take the sharder.
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather than from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.
Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
For tablets, sharding depends on replication map, so the scope of the
sharder should be effective_replication_map rather than the schema
object.
Existing users will be transitioned incrementally in later patches.
The logic was extracted from ring_position_range_sharder::next(), and
the latter was changed to rely on sharder::next_shard().
The tablet sharder will have a different implementation for
next_shard(). This way, ring_position_range_sharder can work with both
current sharder and the tablet sharder.
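The shape of the abstraction can be sketched as follows (a hypothetical interface, not Scylla's actual `dht::sharder`): once "which shard owns this token" and "where does the next range start" are virtual, a range splitter written against the base class works with both a static round-robin sharder and a tablet sharder.

```cpp
#include <cstdint>

// Minimal sharder interface sketch: next_shard()-style iteration is
// expressed as a range-boundary query so different sharding schemes
// can plug in.
struct sharder {
    virtual ~sharder() = default;
    virtual unsigned shard_of(uint64_t token) const = 0;
    virtual uint64_t next_shard_boundary(uint64_t token) const = 0;
};

// The "current" scheme: fixed-size ranges assigned round-robin to shards.
struct round_robin_sharder : sharder {
    unsigned shards;
    uint64_t range_size;
    round_robin_sharder(unsigned s, uint64_t r) : shards(s), range_size(r) {}
    unsigned shard_of(uint64_t token) const override {
        return static_cast<unsigned>((token / range_size) % shards);
    }
    uint64_t next_shard_boundary(uint64_t token) const override {
        return (token / range_size + 1) * range_size;
    }
};
```

A tablet sharder would implement the same two methods by looking up tablet metadata instead of computing a modulus; notably, its adjacent ranges may map to the same shard, which the round-robin scheme never produces.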
Querying the virtual table system.token_ring fails if there is a
tablet-based table, due to an attempt to obtain a per-keyspace erm.
Fix by not showing such keyspaces.
This will make it easier to access table properties in places which
only have a schema_ptr. This is particularly useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.
Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.
It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
The entry may exist, but its schema may not yet be loaded. learn()
didn't take that into account. This problem is not reachable in
production code, which currently always calls get_or_load() before
learn(), except for boot, but there's no concurrency at that point.
Exposed by unit test added later.
System tables have static schemas and code uses those static schemas
instead of looking them up in the database. We want those schemas to
have a valid table() once the table is created, so we need to attach
registry entry to the target schema rather than to a schema duplicate.
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.
Two new columns are added to `system.topology`:
- `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation, the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.
These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.
During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.
Closes #14232
* github.com:scylladb/scylladb:
storage_service: update supported cluster features in group0 on start
storage_service: add methods for features to topology mutation builder
storage_service: use explicit ::set overload instead of a template
storage_service: reimplement mutation builder setters
storage_service: introduce topology_mutation_builder_base
topology_state_machine: include information about features
system_keyspace: introduce deserialize_set_column
db/system_keyspace: add storage for cluster features managed in group 0
Now, when a node starts, it will update its `supported_features` row in
`system.topology` via `update_topology_with_local_metadata`.
At this point, the functionality behind cluster features on raft is
mostly incomplete and the state of the `supported_features` column does
not influence anything so it's safe to update this column
unconditionally. In the future, the node will only join / start group0
server if it is sure that it supports all enabled features and it can
safely update the `supported_features` parameter.
The newly added `supported_features` and `enabled_features` columns can
now be modified via topology mutation builders:
- `supported_features` can now be overwritten via a new overload of
`topology_node_mutation_builder::set`.
- `enabled_features` can now be extended (i.e. more elements can be
added to it) via `topology_mutation_builder::add_enabled_features`. As
the set of enabled features only grows, this should be sufficient.
The `topology_node_mutation_builder::set` function has an overload which
accepts any type which can be converted to string via `::format`. Its
presence can lead to easy mistakes which can only be detected at runtime
rather than at compile time. A concrete example: I wrote a function that
accepts an std::set<S> where S is convertible to sstring; it turns out
that std::string_view is not std::convertible_to sstring and overload
resolution fell back to the catch-all overload.
This commit gets rid of the catch-all overload and replaces it with
explicit ones. Fortunately, it was used for only two enums, so it wasn't
much work.
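The pitfall described above can be reproduced with a tiny example (hypothetical names, using std::string in place of sstring): the specific overload only wins for arguments that convert implicitly, and anything else silently falls into the catch-all template.

```cpp
#include <string>
#include <string_view>

// Specific overload: intended for string-like arguments.
std::string set_value(const std::string& v) { return "string:" + v; }

// Catch-all template overload, analogous to the one removed by the commit.
// It wins whenever the argument is not implicitly convertible to
// std::string - including std::string_view, whose conversion to
// std::string is explicit.
template <typename T>
std::string set_value(const T&) { return "formatted"; }
```

Calling `set_value(std::string_view("x"))` compiles without complaint but picks the catch-all, which is exactly the kind of silent runtime surprise that motivated replacing the template with explicit overloads.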
As promised in the previous commit which introduced
topology_mutation_builder_base, this commit adjusts existing setters of
topology mutation builder and topology node mutation builder to use
helper methods defined in the base class.
Note that the `::set` method for the unordered set of tokens no longer
deletes the column when an empty value is set; instead it just
writes an empty set. This semantic is arguably clearer given that we
have an explicit `::del` method, and it shouldn't affect the existing
implementation - we never intentionally insert an empty set of tokens.
Introduces `topology_mutation_builder_base` which will be a base class
for both topology mutation builder and topology node mutation builder.
Its purpose is to abstract away some details about setting/deleting/etc.
columns in the mutation; the actual topology (node) mutation builders will
only have to care about converting types and/or allowing only particular
columns to be set. The class uses CRTP: derived classes provide
access to the row being modified, the schema, and the timestamp.
For the sake of commit diff readability, this commit only introduces this
class and changes the builders to derive from it but no setter
implementations are modified - this will be done in the next commit.
There are three places in system_keyspace.cc which deserialize a column
holding a set of tokens and convert it to an unordered set of
dht::token. The deserialization process involves a small number of steps
that are the same in all of those places, therefore they can be
abstracted away.
This commit adds `deserialize_set_column` function which takes care of
deserializing the column to `set_type_impl::native_type` which can be
then passed to `decode_tokens`. The new function will also be useful for
decoding set columns with cluster features, which will be handled in the
next commit.
The lambda is defined to return a coordinator_result<stop_iteration>,
but in fact only returns successful outcomes, never failures.
Change it to return a plain stop_iteration, so its callers don't have
to check for failure.
`server_sees_others` and similar functions periodically call
`get_alive_endpoints`. The period was `.1` seconds, increase it to `.5`
to reduce the log spam (I checked empirically that `.5` is usually how
long it takes in dev mode on my laptop.)
The "tell the node to shut down" RPC would fail every time in the
removenode path (since the node is dead), which is kind of awkward.
Besides, for removenode we don't really need the `left_token_ring`
state, we don't need to coordinate with the node - writes destined for
it are failing anyway (since it's dead) and we can ban the node
immediately.
Remove the node from group 0 while in `write_both_read_new` transition
state (even when we implement abort, in this state it's too late to
abort, we're committed to removing the node - so it's fine to remove it
from group 0 at this point).
Currently the decommissioned node waits until it observes that it was
moved to the `left` state, then proceeds to leave group 0 and shut down.
Unfortunately, this strategy won't work once we introduce banning nodes
that are in `left` state - there is no guarantee that the
decommissioning node will observe that it entered `left` state. The
replication of Raft commands races with the ban propagating through the
cluster.
We also can't make the node leave as soon as it observes the
`left_token_ring` state, which would defeat the purpose of
`left_token_ring` - allowing all nodes to observe that the node has left
the token ring before it shuts down.
We could introduce yet another state between `left_token_ring` and
`left`, which the node waits for before shutting down; the coordinator
would request a barrier from the node before moving to `left` state.
The alternative - which we chose here - is to have the coordinator
explicitly tell the node to shutdown while we're in `left_token_ring`
through a direct RPC. We introduce
`raft_topology_cmd::command::shutdown` and send it to the node while in
`left_token_ring` state, after we requested a cluster barrier.
We don't require the RPC to succeed; we need to allow it to fail to
preserve availability. This is because an earlier incarnation of the
coordinator may have requested the node to shut down already, so the
new coordinator will fail the RPC as the node is already dead. This also
improves availability in general - if the node dies while we're in
`left_token_ring`, we can proceed.
We don't lose safety from that, since we'll ban the node (later commit).
We only lose a bit of user experience if there's a failure at this
decommission step - the decommissioning node may hang, never receiving
the RPC (it will be necessary to shut it down manually).
Another complication arising from banning the node is that it won't be
able to leave group 0 on its own; by the time it tries that, it may have
already been banned by the cluster (the coordinator moves the node to
`left` state after telling it to shut down). So we get rid of the
`leave_group0` step from `raft_decommission()` (which simplifies the
function too), putting a `remove_from_raft_config` inside the
coordinator code instead - after we told the node to shut down.
(Removing the node from configuration is also another reason why we need
to allow the above RPC to fail; the node won't be able to handle the
request once it's outside the configuration, because it handles all
coordinator requests by starting a read barrier.)
Finally, a complication arises when the coordinator is the
decommissioning node. The node would shut down in the middle of handling
the `left_token_ring` state, leading to harmless but awkward errors even
though there were no node/network failures (the original coordinator
would fail the `left_token_ring` state logic; a new coordinator would take
over and do it again, this time succeeding). We fix that by checking if
we're the decommissioning node at the beginning of `left_token_ring`
state handler, and if so, stepping down from leadership by becoming a
nonvoter first.
We want the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that node after it has already shut down,
leading to unnecessary failures on the data path (e.g. for CL=ALL writes).
Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.
After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds we
allow the node to shut down.
Saves some redundant typing when passing `raft_topology_cmd` parameters,
so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
be immediately dropped) when receiving the `CLIENT_ID` verb.
When a node first establishes a connection to another node, it always
sends a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.
Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.
This mapping will be used later when we ban nodes that we remove from
the cluster.
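The bookkeeping described in these two commits can be sketched as follows (a hypothetical data structure, not messaging_service's real interface): map each host ID to its live connections on `CLIENT_ID`, and on ban, drop them all and remember the host so later connection attempts are rejected.

```cpp
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model: connections are represented by integer handles.
struct connection_registry {
    std::unordered_map<std::string, std::vector<int>> by_host; // host_id -> conns
    std::set<std::string> banned;
    std::vector<int> dropped; // stands in for actually closing connections

    // Called when a CLIENT_ID verb arrives carrying the sender's host_id.
    // Returns false if the connection must be rejected (host is banned).
    bool on_client_id(const std::string& host_id, int conn) {
        if (banned.count(host_id)) {
            return false;
        }
        by_host[host_id].push_back(conn);
        return true;
    }

    // Ban a host that left the cluster: drop its existing connections and
    // refuse any future ones.
    void ban_host(const std::string& host_id) {
        banned.insert(host_id);
        auto it = by_host.find(host_id);
        if (it != by_host.end()) {
            dropped.insert(dropped.end(), it->second.begin(), it->second.end());
            by_host.erase(it);
        }
    }
};
```

Because the ban set is keyed by host ID rather than address, it keeps working even if the removed node comes back from behind a partition with the same identity.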
This PR fixes some problems found after the PR was merged:
* missed `node_to_work_on` assignment in `handle_topology_transition`;
* change error reporting in `update_fence_version` from `on_internal_error` to regular exceptions, since these exceptions can happen during normal operation.
* `update_fence_version` has been moved after `group0_service.setup_group0_if_exist` in `main.cc`; otherwise we use an uninitialized `token_metadata::version` and get an error.
Fixes: #14303
Closes #14292
* github.com:scylladb/scylladb:
main.cc: move update_fence_version after group0_service.setup_group0_if_exist
shared_token_metadata: update_fence_version: on_internal_error -> throw
storage_service: handle_topology_transition: fix missed node assignment
major_compaction_task_executor inherits both from compaction_task_executor
and major_compaction_task_impl.
Thanks to that an executed operation is represented in task manager.
In the following commits, classes deriving from compaction_task_executor
will be alive longer than they are kept in compaction_manager::_tasks.
Thus, the compaction_task_executor::_gate_holder would be held,
blocking other compactions.
compaction_task_executor::_gate_holder is moved outside of
compaction_task_executor object.
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their values first,
before checking if either of them is expiring.
Only if both are expiring do we compare their expiration times
and ttl values to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is threefold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before the expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one of the upserts, as all cells in the respective upsert carry the same expiration time.
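The new merge order can be modeled with a small sketch (simplified types and names; the real logic, which also folds in ttl, lives in compare_atomic_cell_for_merge): for equal timestamps, the expiration is considered before the value, so all cells of one upsert win or lose together.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Toy cell: absent expiry means the cell never expires.
struct cell {
    int64_t timestamp;
    std::optional<int64_t> expiry;
    std::string value;
};

// Returns true if b wins the merge over a, following the ordering
// described above: timestamp first, then expiration, then value.
bool b_wins(const cell& a, const cell& b) {
    if (a.timestamp != b.timestamp) {
        return b.timestamp > a.timestamp;
    }
    // Later (or absent) expiry wins before the value is even looked at.
    int64_t ea = a.expiry.value_or(INT64_MAX);
    int64_t eb = b.expiry.value_or(INT64_MAX);
    if (ea != eb) {
        return eb > ea;
    }
    return b.value > a.value; // value is only the last tie-breaker
}
```

Under the old order, a cell with value "zzz" and an earlier expiry could beat a cell with value "aaa" and a later expiry, yielding a null after the winner expired; with expiration compared first, the later-expiring cell wins.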
Fixes #14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
Closes #14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
Otherwise, the validation
```
new_fence_version <= token_metadata::version
```
inside update_fence_version will use an uninitialized
token_metadata::version == 0
and we will get an error.
The test_topology_ops was improved to
catch this problem.
Fixes: #14303
on_internal_error is wrong for a fence_version
condition violation: when the topology change
coordinator migrates to another node, we can have
a raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
It is currently located in query_class_config.hh, which is named after a
now defunct struct. This arrangement is unintuitive and there is no
upside to it. The main user of max_result_size is query_command, so
colocate it next to the latter.
Closes #14268
and add docs/dev/timestamp-conflict-resolution.md
to document the details of the conflict resolution algorithm.
Refs scylladb/scylladb#14063
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extend the signatures of table::compact_all_sstables and
compaction_manager::perform_major_compaction so that they get
the info of a covering task.
This allows us to easily create child tasks that cover compaction group
compaction.
Keep seastar::shared_ptr to task::impl instead of std::unique_ptr
in task. Some classes deriving from task::impl may be used outside
task manager context.
Add reproducers for #14182:
test_rewrite_different_values_using_same_timestamp verifies
expiration-based cell reconciliation.
test_rewrite_different_values_using_same_timestamp_and_expiration
is a scylla_only test, verifying that when
two cells with same timestamp and same expiration
are compared, the one with the lesser ttl prevails.
test_rewrite_using_same_timestamp_select_after_expiration
reproduces the specific issue hit in #14182
where a cell is selected after it expires since
it has a lexicographically larger value than
the other cell with later expiration.
test_rewrite_multiple_cells_using_same_timestamp verifies
atomicity of inserts of multiple columns, with a TTL.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As in compare_atomic_cell_for_merge, we want to consider
the row marker ttl for ordering, in case both are expiring
and have the same expiration time.
This was missed in a57c087c89
and a085ef74ff.
With that in mind, add documentation to compare_row_marker_for_merge
and a mutual note to both functions about their
equivalence.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and, if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was changed in CASSANDRA-14592
for consistency with the preference for dead cells over live cells,
as expiring cells will become tombstones at a future time
and then they'd win over live cells with the same timestamp,
hence they should win also before expiration.
In addition, comparing the cell value before expiration
can lead to unintuitive corner cases where rewriting
a cell using the same timestamp but different TTL
may cause scylla to return the cell with null value
if it expired in the meanwhile.
Also, when multiple columns are written using two upserts
using the same write timestamp but with different expiration,
selecting cells by their value may return a mixed result
where each cell is selected individually from either upsert,
by picking the cells with the largest values for each column,
while using the expiration time to break the tie will lead
to a more consistent result where a set of cells from
only one of the upserts will be selected.
Fixes scylladb/scylladb#14182
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, it is hard to tell which of the many sub-cases
fails in this unit test, in case any of them does.
This change uses logging at the debug and trace levels
to help with that: the error can be reproduced
with --logger-log-level testlog=trace.
(The cases are deterministic, so reproducing should not
be a problem.)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Split off do_execute() into a fast path and slow(ish) path, and
coroutinize the latter.
perf-simple-query shows no change in performance (which is
unsurprising since it picks the fast path which is essentially unchanged).
Closes #14246
* github.com:scylladb/scylladb:
cql3: select_statement: reindent execute_without_checking_exception_message_aggregate_or_paged()
cql3: select_statement: coroutinize execute_without_checking_exception_message_aggregate_or_paged()
cql3: select_statement: split do_execute into fast-path and slow/slower paths
cql3: select_statement: disambiguate execute() overloads
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code takes
the smallest heat metric of the intersected replica set and compares it
to the smallest heat metrics of the replica sets calculated separately for
each vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was caught by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
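The fixed check can be sketched like this (a simplified Python model; the function name, the `tolerance` knob and the exact acceptance criterion are mine, not the real heat-metric logic):

```python
def can_merge_vnode_queries(replica_sets, heat, cl, tolerance=0.0):
    """replica_sets: per-vnode replica lists; heat: replica -> heat metric;
    cl: number of replicas required by the consistency level."""
    merged = set(replica_sets[0])
    for rs in replica_sets[1:]:
        merged &= set(rs)
    # The fix: bail out early on an empty (or too small) intersection
    # instead of taking min() over an empty range.
    if len(merged) < cl:
        return False
    merged_min = min(heat[r] for r in merged)
    # Accept the merged query only if its coldest replica is about as
    # good as each vnode's own coldest replica (criterion simplified).
    return all(merged_min <= min(heat[r] for r in rs) + tolerance
               for rs in replica_sets)
```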
Fixes #14284
Closes #14300
In its current state s3 client uses a single default-configured http client thus making different sched classes' workload compete with each other for sockets to make requests on. There's an attempt to handle that in upload-sink implementation that limits itself with some small number of concurrent PUT requests, but that doesn't help much as many sinks don't share this limit.
This PR makes S3 client maintain a set of http clients, one per sched-group, configures maximum number of TCP connections proportional to group's shares and removes the artificial limit from sinks thus making them share the group's http concurrency limit.
As a side effect, this fixes the upload-sink's no-writes-after-flush protection -- if it's violated, a write now results in an exception, while previously it just hung on a semaphore forever.
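The shares-proportional split can be sketched as (a Python model; the function name and the integer-division rounding are my assumptions, not the real allocation code):

```python
def connections_per_group(shares, total_connections):
    """Split a global TCP-connection budget between scheduling groups in
    proportion to their shares; every group gets at least one connection."""
    total_shares = sum(shares.values())
    return {group: max(1, total_connections * s // total_shares)
            for group, s in shares.items()}
```

With such a split, all sinks running in one scheduling group draw from that group's connection budget instead of each imposing its own artificial limit.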
fixes: #13458
fixes: #13320
fixes: #13021
Closes #14187
* github.com:scylladb/scylladb:
s3/client: Replace sink flush semaphore with gate
s3/client: Configure different max-connections on http clients
s3/client: Maintain several http clients on-board
s3/client: Remove now unused http reference from sink and file
s3/client: Add make_request() method
This patch adds some minimal tests for the "with compression = {..}" table
configuration. These tests reproduce three known bugs:
Refs #6442: Always print all schema parameters (including default values)
Scylla doesn't return the default chunk_length_in_kb, but Cassandra
does.
Refs #8948: Cassandra 3.11.10 uses "class" instead of "sstable_compression"
for compression settings by default
Cassandra switched, long ago, the "sstable_compression" attribute's
name to "class". This can break Cassandra applications that create
tables (where we won't understand the "class" parameter) and applications
that inquire about the configuration of existing tables. This patch adds
tests for both problems.
Refs #9933: ALTER TABLE with "chunk_length_kb" (compression) of 1MB caused a
core dump on all nodes
Our test for this issue hangs Scylla (or crashes, depending on the test
environment configuration), when a huge allocation is attempted during
memtable flush. So this test is marked "skip" instead of xfail.
The tests included here also uncovered a new minor/insignificant bug,
where Scylla allows floating point numbers as chunk_length_in_kb - this
number is truncated to an integer, and allowed, unlike Cassandra or
common sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14261
Now that all branches in the visitor are uniform and consist
of a single call to do_evaluate() overloads, we can simplify
by calling a lambda template that does just that.
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.
However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.
This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.
This series fixes the bug and adds a unit test which reproduces the issue.
Fixes #14110
Closes #14305
* github.com:scylladb/scylladb:
test: boost/row_cache_test: add a reproducer for #14110
cache_flat_mutation_reader: use the correct schema in prepare_hash
mutation: mutation_cleaner: add pause()
evaluate(expression) calls the various evaluate(ExpressionElement)
overloads to perform its work. However, if we add an ExpressionElement
and forget to implement its evaluate() overload, we'll end up in
with infinite recursion. It will be caught immediately, but better to
avoid it.
Also sprinkle static on the do_evaluate() overloads where it was missing.
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.
However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.
This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.
Fixes #14110
In unit tests, we would want to delay the merging of some MVCC
versions to test the transient scenarios with multiple versions present.
In many cases this can be done by holding snapshots to all versions.
But sometimes (i.e. during schema upgrades) versions are added and
scheduled for merge immediately, without a window for the test to
grab a snapshot to the new version.
This patch adds a pause() method to mutation_cleaner, which ensures
that no asynchronous/implicit MVCC version merges happen within
the scope of the call.
This functionality will be used by a test added in an upcoming patch.
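The intended semantics of pause() can be modeled like this (a toy Python model of the described behavior, not the real mutation_cleaner API):

```python
from contextlib import contextmanager

class CleanerModel:
    """Toy model of a background MVCC-version merger with pause()."""

    def __init__(self):
        self._paused = 0
        self._pending = []
        self.merged = []

    def schedule_merge(self, version):
        # versions scheduled for merge are normally merged right away
        self._pending.append(version)
        self._drain()

    def _drain(self):
        if self._paused:
            return  # no implicit merges happen within a pause() scope
        self.merged.extend(self._pending)
        self._pending.clear()

    @contextmanager
    def pause(self):
        self._paused += 1
        try:
            yield
        finally:
            self._paused -= 1
            self._drain()  # deferred merges resume once the pause ends
```

Inside the `pause()` scope a test can grab a snapshot of a freshly added version before any merge happens.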
Exposing scrub compaction to the command-line. Allows for offline scrub of sstables, in cases where online scrubbing (via scylla itself) is not possible or not desired. One such case recently was an sstable from a backup which turned out to be corrupt, `nodetool refresh --load-and-stream` refusing to load it.
Fixes: #14203
Closes #14260
* github.com:scylladb/scylladb:
docs/operating-scylla/admin-tools: scylla-sstable: document scrub operation
test/cql-pytest: test_tools.py: add test for scylla sstable scrub
tools/scylla-sstable: add scrub operation
tools/scylla-sstable: write operation: add none to valid validation levels
tools/scylla-sstable: handle errors thrown by the operation
test/cql-pytest: add option to omit scylla's output from the test output
tools/scylla-sstable: s/option/operation_option/
tool/scylla-sstable: add missing comments
This mini-series updates the expected errors in `test/cql-pytest/test-timestamp.py`
to the ones changed in b7bbcdd178.
Then, it renames the test to `test_using_timestamp.py` so it
runs automatically with `test.py`.
Closes #14293
* github.com:scylladb/scylladb:
cql-pytest: rename test-timestamp.py to test_using_timestamp.py
cql-pytest: test-timestamp: test_key_writetime: update expected errors
This reverts commit 8a54e478ba. As
commit 7dadd38161 ("Revert "configure: Switch debug build from
-O0 to -Og") was reverted (by b7627085cb, "Revert "Revert
"configure: Switch debug build from -O0 to -Og"""), we do the
same to cmake to keep the two build systems in sync.
Closes #14286
There are tons of wrappers that help test cases make sstables for their needs, and lots of code duplication in test cases that do parts of those helpers' work on their own. This set cleans up some of that.
Closes #14280
* github.com:scylladb/scylladb:
test/utils: Generalize making memtable from vector<mutation>
test/util: Generalize make_sstable_easy()-s
test/sstable_mutation: Remove useless helper
test/sstable_mutation: Make writer config in make_sstable_mutation_source()
test/utils: De-duplicate make_sstable_containing-s
test/sstable_compaction: Remove useless one-line local lambda
test/sstable_compaction: Simplify sstable making
test/sstables*: Make sstable from vector of mutations
test/mutation_reader: Remove create_sstable() helper from test
It's excessive; a test case that needs it can get the storage prefix
without this fancy wrapper-helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14273
Commit 4e205650 (test: Verify correctness of sstable::bytes_on_disk())
added a test to verify that sstable::bytes_on_disk() is equal to the
real size of real files. The same test case makes sense for S3-backed
sstables as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14272
Followup to 9bfa63fe37. Like in
`test_downgrade_after_successful_upgrade_fails`, the test
`test_joining_old_node_fails` also restarts all nodes at once and is prone
to a bug in the Python driver which can prevent the session from
reconnecting to any of the nodes. This commit applies the same
workaround to the other test (manual reconnect by recreating the Python
driver session).
Closes #14291
Another node can stop after it has joined group 0 but before it has
advertised itself in gossip. `get_inet_addrs` will try to resolve all
IPs and `wait_for_peers_to_enter_synchronize_state` will loop
indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one
of the nodes confirms that the upgrade procedure has finished. For that,
it doesn't need the IPs of all group 0 members - only the IP of some
nodes which can do the confirmation.
This PR restructures the code so that IPs of nodes are resolved inside
the `max_concurrent_for_each` that
`wait_for_peers_to_enter_synchronize_state` performs. Then, even if some
IPs won't be resolved, but one of the nodes confirms a successful
upgrade, we can continue.
Fixes #13543
Closes #14046
* github.com:scylladb/scylladb:
raft topology: test: check if aborted node replacing blocks bootstrap
raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs
1. Otherwise test.py doesn't recognize it.
2. As it represents what the test does in a better way.
3. Following `test_using_timeout.py` naming convention.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The error messages were changed in
b7bbcdd178.
Extend the `match` regular expression param
of pytest.raises to include both the old and new messages
to remain backward compatible also with Cassandra,
as this test is run against both Cassandra and Scylla.
Note that the test didn't run automatically
since it's named `test-timestamp.py` and test.py
looks up only test scripts beginning with `test_`.
The test will be renamed in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is V2 of https://github.com/scylladb/scylladb/pull/14108
This commit moves the installation instruction for the cloud from the [website](https://www.scylladb.com/download/) to the docs.
The scope:
* Added new files with instructions for AWS, GCP, and Azure.
* Added the new files to the index.
* Updated the "Install ScyllaDB" page to create the "Cloud Deployment" section.
* Added new bookmarks in other files to create stable links, for example, ".. _networking-ports:".
* Moved common files to the new "installation-common" directory. This step is required to exclude the open source-only files
in the Enterprise repository.
In addition:
- The Configuration Reference file was moved out of the installation
section (it's not about installation at all)
- The links to creating a cluster were removed from the installation
page (as not related).
Related: https://github.com/scylladb/scylla-docs/issues/4091
In preparation for converting selectors to evaluate expressions,
add support for evaluating column_mutation_attribute (representing
the WRITETIME/TTL pseudo-functions).
A unit test is added.
Fixes #12906
Closes #14287
* github.com:scylladb/scylladb:
test: expr: test evaluation of column_mutation_attribute
test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps
cql3: expr: evaluate() column_mutation_attribute
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.
Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.
Refs: #14283
Initialization of `system_keyspace` is now done in a single place instead of
being spread out through the entire procedure. `system_keyspace` is also
available for queries much earlier, which allows, for example, loading our Host
ID before we initialize any of the distributed services (like gossiper,
messaging_service, etc.). This is doable because `query_processor` is now
available early. A couple of FIXMEs have been resolved.
Refs: #14202
Closes #14285
* github.com:scylladb/scylladb:
main, cql_test_env: simplify `system_keyspace` initialization
db: system_keyspace: take simpler service references in `make`
db: system_keyspace: call `initialize_virtual_tables` from `main`
db: system_keyspace: refactor virtual tables creation
db: system_keyspace: remove `system_keyspace_make`
db: system_keyspace: refactor local system table creation code
replica: database: remove `is_bootstrap` argument from create_keyspace
replica: database: write a comment for `parse_system_tables`
replica: database: remove redundant `keyspace::get_erm_factory()` getter
db: system_keyspace: don't take `sharded<>` references
There's no way to evaluate a column_mutation_attribute via CQL
yet (the only user uses old-style cql3::selection::selector), so
we only supply a unit test.
While remaining backwards compatible, allow supplying custom timestamp/ttl
with each fake column value.
Note: I tried to use a formatter<> for the new data structure, but
got entangled in a template loop.
Enhance evaluation_inputs with timestamps and ttls, and use
them to evaluate writetime/ttl.
The data structure is compatible with the current way of doing
things (see result_set_builder::_timestamps, result_set_builder::_ttls).
We use std::span<> instead of std::vector<> as it is more general
and a tiny bit faster.
The algorithm is taken from writetime_or_ttl_selector::add_input().
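The lookup can be sketched as (a simplified Python model; the null conventions here are my simplification of the real add_input() logic, not its exact semantics):

```python
def evaluate_mutation_attribute(kind, column_index, timestamps, ttls):
    """Evaluate WRITETIME/TTL for a column from per-column input rows.
    A missing timestamp/ttl is represented by None and yields null."""
    if kind == "writetime":
        return timestamps[column_index]
    if kind == "ttl":
        ttl = ttls[column_index]
        # a missing or non-positive ttl means the cell is not expiring
        return ttl if ttl is not None and ttl > 0 else None
    raise ValueError(f"unknown attribute kind: {kind}")
```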
Low CPU utilization is a major contributor to high test time.
Low CPU utilization can happen due to tests sleeping, or lack
of concurrency due to Amdahl's law.
Utilization is computed by dividing the utilized CPU by the available
CPU (CPU count times wall time).
Example output:
Found 134 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[134/134] boost dev [ PASS ] boost.json_cql_query_test.test_unpack_decimal.1
------------------------------------------------------------------------------
CPU utilization: 4.8%
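The reported figure is a direct transcription of the formula above:

```python
def cpu_utilization(utilized_cpu_seconds, cpu_count, wall_seconds):
    """Utilized CPU time divided by the available CPU time
    (CPU count times wall-clock time)."""
    return utilized_cpu_seconds / (cpu_count * wall_seconds)
```

For example, 19.2 CPU-seconds used on an 8-CPU machine over 50 seconds of wall time gives the 4.8% shown in the example output.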
Closes #14251
Initialization of `system_keyspace` is now all done at once instead of
being spread out through the entire procedure. This is doable because
`query_processor` is now available early. A couple of FIXMEs have been
resolved.
Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.
This will allow us to move the `make` step right after starting
`system_keyspace`.
`initialize_virtual_tables` was called from `system_keyspace::make`,
which caused this `make` function to take a bunch of references to
late-initialized services (`gossiper`, `storage_service`).
Call it from `main`/`cql_test_env` instead.
Note: `system_keyspace::make` is called from
`distributed_loader::init_system_keyspace`. The latter function contains
additional steps: populate the system keyspaces (with data from
sstables) and mark their tables ready for writes.
None of these steps apply to virtual tables.
There exists at least one writable virtual table, but writes into
virtual tables are special and the implementation of writes is
virtual-table specific. The existing writable virtual table
(`db_config_table`) only updates in-memory state when written to. If a
virtual table would like to create sstables, or populate itself with
sstable data on startup, it will have to handle this in its own
initialization function.
Separating `initialize_virtual_tables` like this will allow us to
simplify `system_keyspace` initialization, making it independent of
services used for distributed communication.
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.
This will allow, in later commit, to make `system_keyspace`
initialization independent of services used for distributed
communication such as `gossiper`. See further commits for details.
`system_keyspace_make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directory for `system` and
`system_schema`).
Extract this part into `database::create_local_system_table`.
Make `database::add_column_family` private.
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on were also already started.
Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependent services being initialized are gone.
Implement `expr::evaluate()` for `expr::field_selection`.
`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as a `field_selection`, which selects the field `a`.
Evaluating such an expression consists of finding the right element's value in a serialized UDT value and returning it.
Note that it's still not possible to use `field_selection` inside the `WHERE` clause. Enabling it would require changes to the grammar as well as query planning. Currently, `statement_restrictions` just reacts with `on_internal_error` when it encounters a `field_selection`.
Nonetheless, it's a step towards relaxing the grammar, and now it's finally possible to evaluate all kinds of prepared expressions (#12906).
Fixes: https://github.com/scylladb/scylladb/issues/12906
Closes #14235
* github.com:scylladb/scylladb:
boost/expr_test: test evaluate(field_selection)
cql3/expr: fix printing of field_selection
cql3/expression: implement evaluate(field_selection)
types/user: modify idx_of_field to use bytes_view
column_identifier: add column_identifier_raw::text()
types: add read_nth_user_type_field()
types: add read_nth_tuple_element()
This reverts commit 7dadd38161.
The latest revert cited debuggability trumping performance, but the
performance loss is so huge here that debug builds are unusable and
next promotions time out.
In the interest of progress, pick the lesser of two evils.
Both, make_sstable_easy() and make_sstable_containing() prepare memtable
by allocating it and applying mutations from vector. Make a local
helper. Many test cases could probably benefit from it too, but they
often do more stuff before applying mutations to the memtable, so this
is left for future patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them, one making sstable from memtable and the other
one doing the same from a custom reader. The former can just call the
latter with the memtable's flat reader.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two make_sstable_mutation_source() helpers that call one
another and test cases only need one of them, so leave just one that's
in use.
Also don't pass env's tempdir to make_sstable() util call, it can get
env's tempdir on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These local helpers accept a writer config which is made the same way by
all callers, so the helpers can do it on their own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The function that prepares a memtable from a mutations vector can call its
overload that writes this memtable into an sstable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_usable_sst() wrapper lambda is not needed; calling
make_sstable_containing() directly is shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a temporary memtable and an on-stack lambda that makes the
mutation. Both are overkill: make_sstable_containing() can work on just
a plain on-stack-constructed mutation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are many cases that want to call make_sstable_containing() with
a vector of mutations at hand. For that, they apply it to a temporary
memtable, but the sstable utils can work with the mutations vector as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test `test_downgrade_after_successful_upgrade_fails` shuts down the whole cluster, reconfigures the nodes and then restarts. Apparently, the python driver sometimes does not handle this correctly; in one test run we observed that the driver did not manage to reconnect to any of the nodes, even though the nodes managed to start successfully.
More context can be found on the python driver issue.
This PR works around this issue by using the existing `reconnect_driver` function (which is a workaround for a _different_ python driver issue already) to help the driver reconnect after the full cluster restart.
Refs: scylladb/python-driver#230
Closes #14276
* github.com:scylladb/scylladb:
tests/topology: work around python driver issue in cluster feature tests
test/topology{_raft_disabled}: move reconnect_driver to topology utils
In https://github.com/scylladb/scylladb/pull/14231 we split `storage_proxy` initialization into two phases: for local and remote parts. Here we do the same with `query_processor`. This allows performing queries for local tables early in the Scylla startup procedure, before we initialize services used for cluster communication such as `messaging_service` or `gossiper`.
Fixes: #14202
As a follow-up we will simplify `system_keyspace` initialization, making it available earlier as well.
Closes #14256
* github.com:scylladb/scylladb:
main, cql_test_env: start `query_processor` early
cql3: query_processor: split `remote` initialization step
cql3: query_processor: move `migration_manager&`, `forwarder&`, `group0_client&` to a `remote` object
cql3: query_processor: make `forwarder()` private
cql3: query_processor: make `get_group0_client()` private
cql3: strongly_consistent_modification_statement: fix indentation
cql3: query_processor: make `get_migration_manager` private
tracing: remove `qp.get_migration_manager()` calls
table_helper: remove `qp.get_migration_manager()` calls
thrift: handler: move implementation of `execute_schema_command` to `query_processor`
data_dictionary: add `get_version`
cql3: statements: schema_altering_statement: move `execute0` to `query_processor`
cql3: statements: pass `migration_manager&` explicitly to `prepare_schema_mutations`
main: add missing `supervisor::notify` message
The `topology_coordinator::cleanup_group0_config_if_needed` function first checks whether the number of group 0 members is larger than the number of non-left entries in the topology table, then attempts to remove nodes in left state from group 0 and prints a warning if no such nodes are found. There are some problems with this check:
- Currently, a node is added to group 0 before it inserts its entry to the topology table. Such a node may cause the check to succeed but no nodes will be removed, which will cause the warning to be printed needlessly.
- Cluster features on raft will reverse the situation and it will be possible for an entry in system.topology to exist without the corresponding node being a part of group 0. This, in turn, may cause the check not to pass when it should and nodes could be removed later than necessary.
This commit gets rid of the optimization and the warning, and the topology coordinator will always compute the set of nodes that should be removed. Additionally, the set of nodes to remove is now computed differently: instead of iterating over left nodes and including only those that are in group 0, we now iterate over group 0 members and include those that are in `left` state. As the number of left nodes can potentially grow unbounded and the number of group 0 members is more likely to be bounded, this should give better performance in long-running clusters.
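The new set computation can be sketched as (a Python model; the names are mine, not the real data structures):

```python
def group0_members_to_remove(group0_members, topology_state):
    """Iterate over the (bounded) set of group 0 members instead of the
    potentially unbounded set of left nodes, and pick the members whose
    topology entry is in `left` state."""
    return {m for m in group0_members if topology_state.get(m) == "left"}
```

A member with no topology entry yet (it joined group 0 before inserting its row) simply isn't selected, which matches the first problem described above.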
Closes #14238
* github.com:scylladb/scylladb:
storage_service: fix indentation after previous commit
storage_service: remove optimization in cleanup_group0_config_if_needed
The test `test_downgrade_after_successful_upgrade_fails` stops all
nodes, reconfigures them to support the test-only feature and restarts
them. Unfortunately, it looks like the python driver sometimes does not
handle this properly and might not reconnect after all nodes are shut
down.
This commit adds a workaround for scylladb/python-driver#230 - the test
re-creates python driver session right after nodes are restarted.
The `reconnect_driver` function will be useful outside the
`topology_raft_disabled` test suite - namely, for cluster feature tests
in `topology`. The best course of action for this function would be to
put it into pylib utils; however, the function depends on ManagerClient
which is defined in `test.pylib.manager_client` that depends on
`test.pylib.utils` - therefore we cannot put it there as it would cause
an import cycle. The `topology.utils` module sounds like the next best
thing.
In addition, the docstring comment is updated to reflect that this
function will now be used to work around another issue as well.
Pass `migration_manager&`, `forward_service&` and `raft_group0_client&`
in the remote init step which happens after the constructor.
Add a corresponding uninit remote step.
Make sure that any use of the `remote` services is finished before we
destroy the `remote` object by using a gate.
Thanks to this in a later commit we'll be able to move the construction
of `query_processor` earlier in the Scylla initialization procedure.
These services are used for performing distributed queries, which
require remote calls. As a preparation for 2-phase initialization of
`query_processor` (for local queries vs for distributed queries), move
them to a separate `remote` object which will be constructed in the
second phase.
Replace the getters for the different services with a single `remote()`
getter. Once we split the initialization into two phases, `remote()`
will include a safety protection.
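The two-phase shape can be modeled like this (a toy Python stand-in for the gate-based protection; names are mine, not the real API):

```python
class QueryProcessorModel:
    """Toy model of two-phase initialization: the remote part is attached
    after construction, and any access outside that window fails fast."""

    def __init__(self):
        self._remote = None  # local-only phase: no distributed services yet

    def start_remote(self, services):
        self._remote = services

    def stop_remote(self):
        self._remote = None

    def remote(self):
        if self._remote is None:
            raise RuntimeError("remote part of query_processor not initialized")
        return self._remote
```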
After previous commits it's no longer used outside `query_processor`.
Also remove the `const` version - not needed for anything.
Use the getter instead of directly accessing `_mm` in `query_processor`
methods. Later we will put `_mm` in a separate object.
The `system.topology` table is extended with two new columns that will
be used to manage cluster features:
- `supported_features set<text>` is a new clustering column which holds
the features that given node advertises as supported. It will be first
initialized when the node joins the cluster, and then updated every
time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the
features that are considered enabled by the cluster. Unlike in the
current gossip-based implementation, the features will not be enabled
implicitly when all nodes support a feature, but rather via an
explicit action of the topology coordinator.
Expose scrub compaction on the command line. Scrubbed sstables are
written into a directory specified by the `--output-directory` command
line parameter. This directory is expected to be empty, to avoid
clashes with any pre-existing sstables. This can be overridden by the
user if they wish.
Instead of letting the runtime catch them. Also, make sure all exceptions
thrown due to bad arguments are instances of `std::invalid_argument`;
these are now reported differently from other, runtime errors.
Remove the now-extraneous `error:` prefix from all exception messages.
Scylla's output is often unnecessary to debug a failed test, or even
detrimental because one has to scroll back in the terminal after each
test run to see the actual test's output. Add an option,
--omit-scylla-output; when it is present on the command line of `run`,
the output of scylla is omitted from the test output.
Also, to help discover this option (and others), don't run the tests
when either -h or --help is present on the command line. Just invoke
pytest (with said option) and exit.
This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit).
* the topology version (int64) was introduced, it's stored in topology table and updated through RAFT at the relevant stages of the topology change algorithm;
* when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster, if some node is unavailable we fail and retry indefinitely;
* the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; if this round of RPCs is finished the topology change coordinator can be sure that there are no requests inflight using previous versions and such requests can't appear in the future.
* after `barrier_and_drain` the topology change coordinator issues the `fence` command, it stores the current version in local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with older version was started before the fence, its reply will also be fenced.
* the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from the nodes which were unavailable during topology change and which were not reached by the `barrier_and_drain()` command;
* currently, fencing is implemented for `mutation` and `read` RPCs, other RPCs will be handled in the follow-ups; since currently all nodes are supposed to be alive, the missing parts of the fencing don't break correctness;
* along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling, these will be also added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done;
* we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present.
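The fencing check described above can be sketched as follows. This is a minimal Python model with hypothetical names, not Scylla's actual types: each request carries the topology version its coordinator observed, and a node rejects it once the local `fence_version` has moved past it, skipping the check when the optional token is absent.

```python
# Minimal model of the fencing check (names are illustrative, not Scylla's).
class StaleTopologyException(Exception):
    pass

class FencingState:
    def __init__(self):
        self.fence_version = 0        # 0: topology versions not yet supported

    def advance_fence(self, version):
        # applied by the 'fence' command, after barrier_and_drain completed
        self.fence_version = max(self.fence_version, version)

    def check(self, fencing_token):
        # An absent token (rpc::optional) skips the check for compatibility.
        if fencing_token is None:
            return
        if fencing_token < self.fence_version:
            raise StaleTopologyException(
                f"token {fencing_token} < fence {self.fence_version}")

node = FencingState()
node.advance_fence(3)
node.check(3)                         # current version: accepted
node.check(None)                      # no token: accepted for compatibility
try:
    node.check(2)                     # stale version: rejected
    raise AssertionError("expected rejection")
except StaleTopologyException:
    pass
```

The same check runs on replies, so a request started before the fence is also fenced on its way back.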
Closes #13884
* github.com:scylladb/scylladb:
storage_service: warn if can't find ip for server
storage_proxy.cc: add and use global_token_metadata_barrier
storage_service: exec_global_command: bool result -> exceptions
raft_topology: add cmd_index to raft commands
storage_proxy.cc: add fencing to read RPCs
storage_proxy.cc: extract handle_read
storage_proxy.cc: refactor encode_replica_exception_for_rpc
storage_proxy: fix indentation
storage_proxy: add fencing for mutation
storage_servie: fix indentation
storage_proxy: add fencing_token and related infrastructure
raft topology: add fence_version
raft_topology: add barrier_and_drain cmd
token_metadata: add topology version
Scenario:
1. Start a cluster with nodes node1, node2, node3
2. Start node4, replacing node2
3. Stop node4 after it joined group 0 but before it advertised itself in gossip
4. Start node5, replacing node2
The test simulates the behavior described in #13543.
The test passes only if `wait_for_peers_to_enter_synchronize_state` doesn't need to
resolve all IPs to return early. If it does, node5 will hang trying to resolve the
IP of node4:
```
raft_group0_upgrade - : failed to resolve IP addresses of some of the cluster members ([node4's host ID])
```
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what sched group is can be relaxed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14266
memory_footprint_test fails with:
`sstable - writing sstables with too old format`
because it attempts to write the obsolete sstables formats,
for which the writer code has been long removed.
Fix that.
Closes #14265
repair_node_state::state is only for debugging purposes; see
ab57cea783, which introduced it.
So this change does not impact the behavior of Scylla, but it can
improve the debugging experience by reflecting a more accurate state
of repair when we are actually inspecting it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14255
Add a unit test which tests evaluating field selections.
Alas, at the moment it's impossible to add a cql-pytest,
as the grammar and query planning don't handle field
selections inside the WHERE clause.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
expression printing has two modes: debug and user.
The user mode should output standard CQL that can be
parsed back to an expression.
In debug mode there can be some additional information
that helps with debugging stuff.
The code for printing `field_selection` didn't distinguish
between user mode and debug mode. It just always printed
in debug mode, with extra parentheses around the field selection.
Let's change it so that it emits valid CQL in user mode.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Implement expr::evaluate() for expr::field_selection.
`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as
a field_selection, which selects the field 'a'.
Evaluating such an expression consists of finding the
right element's value in a serialized UDT value
and returning it.
Fixes: https://github.com/scylladb/scylladb/issues/12906
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Let's change the argument type from `bytes`
to `bytes_view`. Sometimes it's possible to get
an instance of `bytes_view`, but getting `bytes`
would require a copy, which is wasteful.
`bytes_view` allows us to avoid copies.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
I would like to be able to get a reference to the
string inside `column_identifier_raw`, but there was
no such function. There was only `to_string()`, which
copies the entire string, which is wasteful.
Let's add the method `text()`, which returns a reference
instead of a copy. `column_identifier` already has
such a method, so `column_identifier_raw` can have one
as well.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function which can be used to read the nth
field of a serialized UDT value.
We could deserialize the whole value and then choose
one of the deserialized fields, but that would be wasteful.
Sometimes we only need the value of one field, not all of them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
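The single-field read described above can be sketched like this, assuming the usual tuple/UDT wire format (each field is a 4-byte signed big-endian length followed by that many bytes, -1 means null, and trailing fields may be absent). `read_nth_field` is an illustrative stand-in, not Scylla's actual helper.

```python
import struct

def read_nth_field(serialized: bytes, n: int):
    """Return the raw bytes of field n, or None if it is null/absent."""
    pos = 0
    for i in range(n + 1):
        if pos >= len(serialized):
            return None               # trailing field absent
        (length,) = struct.unpack_from(">i", serialized, pos)
        pos += 4
        if i == n:
            return None if length < 0 else serialized[pos:pos + length]
        if length > 0:
            pos += length             # skip earlier fields without copying

# my_type (a int, b int) with a=1, b=2:
blob = (struct.pack(">i", 4) + struct.pack(">i", 1)
        + struct.pack(">i", 4) + struct.pack(">i", 2))
assert read_nth_field(blob, 0) == struct.pack(">i", 1)
assert read_nth_field(blob, 1) == struct.pack(">i", 2)
```

This is why deserializing the whole value is unnecessary: earlier fields are skipped by length, never copied.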
Another node can stop after it joined group 0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.
This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs can't be resolved, as long as one of the nodes confirms a
successful upgrade, we can continue.
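The restructuring described above can be sketched with asyncio (all names hypothetical): IP resolution happens per peer, inside the concurrent loop, so a peer that never becomes resolvable cannot block the early return taken as soon as any other peer confirms the upgrade finished.

```python
import asyncio

async def wait_for_peers(peers, resolve, confirm_finished, poll_delay=0.001):
    done = asyncio.Event()

    async def probe(peer):
        while not done.is_set():
            ip = await resolve(peer)       # may keep failing for this peer
            if ip is not None and await confirm_finished(ip):
                done.set()                 # one confirmation is enough
                return
            await asyncio.sleep(poll_delay)

    tasks = [asyncio.create_task(probe(p)) for p in peers]
    await done.wait()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

async def demo():
    async def resolve(peer):
        # node4 stopped before advertising itself: resolution never succeeds
        return None if peer == "node4" else f"ip-of-{peer}"

    async def confirm_finished(ip):
        return True                        # any resolvable peer confirms

    await wait_for_peers(["node1", "node4"], resolve, confirm_finished)
    return True

assert asyncio.run(demo()) is True
```

With resolution outside the loop, the unresolvable node4 would block forever; inside the loop, node1's confirmation ends the wait.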
Fixes #13543
CQL statements carry expressions in many contexts: the SELECT, WHERE, SET, and IF clauses, plus various attributes. Previously, each of these contexts had its own representation for an expression, and another one for the same expression but before preparation. We have been gradually moving towards a uniform representation of expressions.
This series tackles SELECT clause elements (selectors), in their unprepared phase. It's relatively simple since there are only five types of expression components (column references, writetime/ttl modifiers, function calls, casts, and field selections). Nevertheless, there isn't much commonality with previously converted expression elements so quite a lot of code is involved.
After the series, we are still left with a custom post-prepare representation of expressions. It's quite complicated since it deals with two passes, for aggregation, so it will be left for another series.
Closes #14219
* github.com:scylladb/scylladb:
cql3: seletor: drop inheritance from assignment_testable
cql3: selection: rely on prepared expressions
cql3: selection: prepare selector expressions
cql3: expr: match counter arguments to function parameters expecting bigint
cql3: expr: avoid function constant-folding if a thread is needed
cql3: add optional type annotation to assignment_testable
cql3: expr: wire unresolved_identifier to test_assignment()
cql3: expr: support preparing column_mutation_attribute
cql3: expr: support preparing SQL-style casts
cql3: expr: support preparing field_selection expressions
cql3: expr: make the two styles of cast expressions explicit
cql3: error injection functions: mark enabled_injections() as impure
cql3: eliminate dynamic_cast<selector> from functions::get()
cql3: test_assignment: pass optional schema everywhere
cql3: expr: prepare_expr(): allow aggregate functions
cql3: add checks for aggregation functions after prepare
cql3: expr: add verify_no_aggregate_functions() helper
test: add regression test for rejection of aggregates in the WHERE clause
cql3: expr: extract column_mutation_attribute_type
cql3: expr: add fmt formatter for column_mutation_attribute_kind
cql3: statements: select_statement: reuse to_selectable() computation in SELECT JSON
Currently, even though we are moving from Java 8 to
Java 11, we still support both Java versions, and the
docker image used for testing the DataStax driver has
not been updated to install java-11.
The "java" executable provided by openjdk-java-8 does
not support the "--version" command line argument, while java-11
accepts both "-version" and "--version". So, to cater to
the needs of the outdated docker image, we pass
"-version" to the selected java; the test then passes
even if only java-8 is found. A better fix would be to update the
docker image to install java-11, though.
The output of "java -version" and "java --version" is
attached here as a reference:
```console
$ /usr/lib/jvm/java-1.8.0/bin/java --version
Unrecognized option: --version
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```
```console
$ /usr/lib/jvm/java-1.8.0/bin/java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
```
```console
$ /usr/lib/jvm/jre-11/bin/java --version
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```
```console
$ /usr/lib/jvm/jre-11/bin/java -version
openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```
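A small sketch of why "-version" is the portable spelling: both JDKs print an `openjdk version "..."` line on `-version`, so one parser covers java-8 and java-11. The helper name is hypothetical; the sample strings come from the transcripts above.

```python
import re

def java_major_version(version_output: str) -> int:
    # Parse the `openjdk version "..."` line that `java -version` prints.
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not m:
        raise ValueError("unrecognized `java -version` output")
    major = int(m.group(1))
    if major == 1 and m.group(2):
        return int(m.group(2))        # pre-9 JDKs report "1.x"
    return major

assert java_major_version('openjdk version "1.8.0_362"') == 8
assert java_major_version('openjdk version "11.0.19" 2023-04-18') == 11
```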
Fixes #14253
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14254
This series adds an option named "uuid_sstable_identifier_enabled", and a related cluster feature bit, which is set once all nodes in the cluster set this option to "true". The sstable subsystem will then start using a timeuuid instead of a plain integer as the sstable identifier. A timeuuid is a better choice for identifiers, as we don't need to worry about id conflicts anymore. But we still have quite a few tests using static sstables with integers in their names; those tests are not changed in this series. We will create some tests to exercise the sstable subsystem with this option set.
A very simple interop test with Cassandra 4.1.1 was also performed to verify that the generated sstables can be read by Cassandra:
1. Start Scylla, connect to it with cqlsh, run the following commands, and stop it:
```
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
cqlsh> SELECT * FROM ks.cf;
```
2. Enable Cassandra's `uuid_sstable_identifiers_enabled`, start Cassandra 4.1.1, connect to it with cqlsh, run the following commands, and stop it:
```
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
cqlsh> SELECT * FROM ks.cf;
```
3. Move away the sstables generated by Cassandra, and replace them with the sstables generated by Scylla:
```console
$ mv ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49 /tmp
$ mv ~/scylla/ks/cf-db47a12009d611eea6b8b179df3a2d5d ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49
```
4. Start Cassandra 4.1.1 again, connect to it with cqlsh, and run the following commands:
```
cqlsh> SELECT * FROM ks.cf;
name | value
------+-------
1 | one
```
Fixes https://github.com/scylladb/scylladb/issues/10459
Closes #13932
* github.com:scylladb/scylladb:
replica,sstable: introduce invalid generation id
sstables, replica: pass uuid_sstable_identifiers to generation generator
gms/feature_service: introduce UUID_SSTABLE_IDENTIFIERS cluster feature
db: config: add uuid_sstable_identifiers_enabled option
sstables, replica: support UUID in generation_type
This allows us to reflect cause-and-effect
relationships in the logs: if some command
failed, we write to the log at the top level
of the topology state machine. The log message
includes the current state of the state
machine and a description of what
exactly went wrong.
Note that in the exec_global_command overload
returning node_to_work_on we don't call retake_node()
if the nested exec_global_command failed.
This is fine, since all the callers
just log/break in this case.
In this commit we add logic to protect against
raft commands reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.
Topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against topology change coordinator migrating
to other node by the already existing
terms check: if the term from the command
doesn't match the current term we return an error.
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.
Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when query coordinator was isolated from
topology change coordinator and didn't receive
barrier_and_drain.
We are going to add fencing to read RPCs; it would be easier
to do it once for all three of them. This refactoring
enables that, since it allows using
encode_replica_exception_for_rpc for handle_read_digest.
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking: apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.
Fencing was also added to mutate_locally calls
on the request coordinator, for the case
where this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
A new stale_topology_exception was introduced;
it's raised in apply_fence when an RPC comes
with a stale fencing_token.
An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata,
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
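The phased-barrier drain described above can be modeled roughly like this (a hypothetical Python stand-in for utils::phased_barrier): every request registers an operation under the current phase; bumping the version starts a new phase, and the barrier completes only when all operations from older phases have finished.

```python
import asyncio

class PhasedBarrier:
    def __init__(self):
        self._phase = 0
        self._pending = {}            # phase -> live operation count
        self._waiters = []            # (phase, event) pairs

    def start_operation(self):
        self._pending[self._phase] = self._pending.get(self._phase, 0) + 1
        return self._phase

    def end_operation(self, phase):
        self._pending[phase] -= 1
        if self._pending[phase] == 0:
            del self._pending[phase]
        self._wake()

    def _wake(self):
        for phase, ev in self._waiters:
            # drained once nothing from this phase or older is still running
            if all(p > phase for p in self._pending):
                ev.set()
        self._waiters = [(p, e) for p, e in self._waiters if not e.is_set()]

    async def advance_and_await(self):
        barrier_phase = self._phase
        self._phase += 1              # operations from now on: new phase
        ev = asyncio.Event()
        self._waiters.append((barrier_phase, ev))
        self._wake()                  # maybe nothing was pending at all
        await ev.wait()

async def demo():
    pb = PhasedBarrier()
    op = pb.start_operation()         # an in-flight request on phase 0

    async def finish_later():
        await asyncio.sleep(0.001)
        pb.end_operation(op)

    asyncio.get_running_loop().create_task(finish_later())
    await pb.advance_and_await()      # returns once the old request is done
    return True

assert asyncio.run(demo()) is True
```

Requests started after the advance belong to the new phase and do not delay the drain, which is exactly what lets barrier_and_drain conclude that no request with an older version is still in flight.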
It's stored as a static column in the topology table and
will be updated at various steps of the topology
change state machine.
The initial value is 1; zero means that topology
versions are not yet supported, which will be
used in RPC handling.
The invalid sstable id is the NULL of sstable identifiers. With
this concept, it is a lot simpler to find/track the greatest
generation. The complexity is hidden in generation_type, which
compares a) integer-based identifiers, b) uuid-based identifiers, and
c) the invalid identifier in different ways.
So, in this change:
* the default constructor of generation_type is
now public.
* we don't check for empty generations anymore when loading
SSTables or enumerating them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
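The comparison scheme can be sketched with a hypothetical Python stand-in for sstables::generation_type: the invalid generation (the NULL) sorts below everything, integer generations compare numerically, and uuid-based generations compare by their timeuuid timestamp. The relative order of the int and uuid tiers here is an assumption for illustration only.

```python
import functools
import uuid

@functools.total_ordering
class Generation:
    def __init__(self, value=None):
        self.value = value            # None (invalid), int, or uuid.UUID

    def _key(self):
        if self.value is None:
            return (0, 0)             # invalid: the NULL, sorts first
        if isinstance(self.value, int):
            return (1, self.value)
        return (2, self.value.time)   # timeuuid: order by creation time

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

# tracking the greatest generation no longer needs empty-checks:
assert max([Generation(), Generation(3), Generation(17)]).value == 17
assert Generation() < Generation(0) < Generation(uuid.uuid1())
```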
Before this change, we assumed that the generation is always integer-based.
In order to enable the UUID-based generation identifier when the related
option is set, we need to plumb this option down to the generation generator.
Because we don't have access to the cluster features in some places where
a new generation is created, a new accessor exposing feature_service from
the sstable manager is added.
Fixes #10459
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
UUID_SSTABLE_IDENTIFIERS is a new cluster-wide feature. If it is
enabled, all nodes will generate new sstables with a UUID as their
generation identifier. This feature is configured using the
"uuid_sstable_identifiers_enabled" config option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Unlike in Cassandra 4.1, this option is true by default; it will be used
for enabling the "UUID_SSTABLE_IDENTIFIERS" cluster feature. Not wired yet.
Please note: because we are still using sstableloader and
sstabledump based on the 3.x branch, while Cassandra upstream
introduced the uuid sstable identifier in its 4.x branch, these tools
fail to work with sstables with uuid identifiers, so this option
is disabled when performing these tests. We will enable it once
these tools are updated to support uuid-based sstable identifiers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This change generalizes the value of generation_type so it also
supports UUID-based identifiers.
* sstables/generation_type.h:
- add a formatter and a parser for UUID. Please note, Cassandra uses
a different format for stringifying the SSTable identifier, and
that formatter suits our needs: it uses underscore "_" as the
delimiter, while the file names of components use dash "-" as the
delimiter. So instead of reinventing the formatting or just using
another delimiter in the stringified UUID, we choose to use
Cassandra's formatting.
- add accessors for the type and value of generation_type
- add constructors for creating a generation_type from a UUID and
from a string.
- use a hash for placing sstables with uuid identifiers into shards,
for a more uniform distribution of sstables across shards.
* replica/table.cc:
- only update the generator if the given generation contains an
integer
* test/boost:
- add a simple test to verify that generation_type is able to
parse and format
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
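The shard-placement rule mentioned above can be sketched like this (the function name is hypothetical, and it assumes the pre-existing rule for integer generations is a plain modulo): a uuid generation is hashed first so sstables still spread uniformly across shards.

```python
import uuid
import zlib

def shard_for_generation(gen, shard_count):
    if isinstance(gen, int):
        return gen % shard_count       # assumed legacy rule for integers
    # any stable hash of the raw uuid bytes will do; crc32 is used here
    # purely for illustration
    return zlib.crc32(gen.bytes) % shard_count

assert shard_for_generation(10, 8) == 2
shards = {shard_for_generation(uuid.uuid1(), 8) for _ in range(64)}
assert shards <= set(range(8))
```

Taking the uuid's raw value modulo the shard count would bias placement, since timeuuids generated close together share most of their bits; hashing removes that bias.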
Push those calls up the call stack, to `trace_keyspace_helper` module.
Pass `migration_manager` reference around together with
`query_processor` reference.
It's now named `execute_thrift_schema_command` in `query_processor`.
This allows us to remove yet another
`query_processor::get_migration_manager()` call.
Now that `execute_thrift_schema_command` sits near
`execute_schema_statement` (the latter used for CQL), we can see a
certain similarity. The Thrift version should also in theory get a retry
loop like the one CQL has, so the similarity would become even stronger.
Perhaps the two functions could be refactored to deduplicate some logic
later.
Rename it to `execute_schema_statement`.
This allows us to remove a call to
`query_processor::get_migration_manager`, the goal being to make it a
private member function.
We want to stop relying on `qp.get_migration_manager()`, so we can make
the function private in the future. This in turn is a prerequisite for
splitting `query_processor` initialization into two phases, where the
first phase will only allow local queries (and won't require
`migration_manager`).
This patch adds an xfailing test reproducing the bug in issue #12762:
When a SELECT uses a secondary index to list rows, if there is also a
PER PARTITION LIMIT given, Scylla forgets to apply it.
The test shows that the PER PARTITION LIMIT is correctly applied when
the index doesn't exist, but forgotten when the index is added.
In contrast, both cases work correctly in Cassandra.
This patch also adds a second variant of this test, which adds filtering
to the mix, and ensures that PER PARTITION LIMIT 1 doesn't give up on the
first row of each partition - but rather looks for the first row that
passes the filter, and only then moves on to the next partition.
Refs #12762.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14248
The `topology_coordinator::cleanup_group0_config_if_needed` function
first checks whether the number of group 0 members is larger than the
number of non-left entries in the topology table, then attempts to
remove nodes in left state from group 0 and prints a warning if no such
nodes are found. There are some problems with this check:
- Currently, a node is added to group 0 before it inserts its entry to
the topology table. Such a node may cause the check to succeed but no
nodes will be removed, which will cause the warning to be printed
needlessly.
- Cluster features on raft will reverse the situation and it will be
possible for an entry in system.topology to exist without the
corresponding node being a part of group 0. This, in turn, may cause
the check not to pass when it should and nodes could be removed later
than necessary.
This commit gets rid of the optimization and the warning, and the
topology coordinator will always compute the set of nodes that should be
removed. Additionally, the set of nodes to remove is now computed
differently: instead of iterating over left nodes and including only
those that are in group 0, we now iterate over group 0 members and
include those that are in `left` state. As the number of left nodes can
potentially grow unbounded and the number of group 0 members is more
likely to be bounded, this should give better performance in
long-running clusters.
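The new computation can be sketched as follows (the data shapes are hypothetical: group 0 membership as a set of host ids, topology state as a dict): iterate over the bounded group 0 membership and pick members whose topology entry is in `left` state, instead of scanning the potentially unbounded set of left nodes.

```python
def nodes_to_remove(group0_members, topology_state):
    # topology_state maps host id -> state ("normal", "left", ...);
    # members without an entry yet (freshly joined) are simply skipped
    return {n for n in group0_members if topology_state.get(n) == "left"}

topology = {"n1": "normal", "n2": "left", "n3": "left"}
# n3 was already removed from group 0; n4 joined group 0 but has no
# topology entry yet and causes no spurious work or warning
assert nodes_to_remove({"n1", "n2", "n4"}, topology) == {"n2"}
```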
It is simpler this way, and it also matches the "add" call at the
beginning of this function.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14239
Move the initialization of `storage_proxy` early in the startup procedure, before starting
`system_keyspace`, `messaging_service`, `gossiper`, `storage_service` and more.
As a follow-up, we'll be able to move initialization of `query_processor` right
after `storage_proxy` (but this requires a bit of refactoring in
`query_processor` too).
Local queries through `storage_proxy` can be done after the early initialization step.
In a follow-up, when we do a similar thing for `query_processor`, we'll be able
to perform local CQL queries early as well. (Before starting `gossiper` etc.)
Closes #14231
* github.com:scylladb/scylladb:
main, cql_test_env: initialize `storage_proxy` early
main, cql_test_env: initialize `database` early
storage_proxy: rename `init_messaging_service` to `start_remote`
storage_proxy: don't pass `gossiper&` and `messaging_service&` during initialization
storage_proxy: prepare for missing `remote`
storage_proxy: don't access `remote` during local queries in `query_partition_key_range_concurrent`
db: consistency_level: remove overload of `filter_for_query`
storage_proxy: don't access `remote` when calculating target replicas for local queries
storage_proxy: introduce const version of `remote()`
replica: table: introduce `get_my_hit_rate`
storage_proxy: `endpoint_filter`: remove gossiper dependency
try_emplace() is
- simpler than the lookup-and-insert dance,
- presumably more efficient, and
- most importantly, simpler to read.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14237
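The same idea in Python terms, purely as an analogy: C++ `map.try_emplace(k, v)` is to the find-then-insert dance roughly what `dict.setdefault` is to an explicit `if key not in d` check: a single lookup that inserts on a miss. (Unlike try_emplace, setdefault evaluates its default argument eagerly.)

```python
counters = {}

# the lookup-and-insert dance: two hash lookups on the miss path
if "reads" not in counters:
    counters["reads"] = 0
counters["reads"] += 1

# the try_emplace-style equivalent: one lookup, insert-if-absent
counters["writes"] = counters.setdefault("writes", 0) + 1

assert counters == {"reads": 1, "writes": 1}
```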
This function will always have a continuation, so converting it to a
coroutine will not cause extra allocations.
wrap_result_to_error_message(), which is used to convert a
coordinator_result-unaware continuation to a coordinator_result-aware
continuation, is converted to a traditional check-error-and-return.
select_statement::do_execute() has a fast path where it forwards
to execute_without_checking_exception_message_non_aggregate_unpaged().
In this fast path, we aren't paging (a good reason for that is reading
a partition without clustering keys) and in the slow/slower paths
we page and/or perform complex processing like aggregation.
The fast path doesn't need any continuations, but the slow/slower paths
do. Split them off so that the slow/slower paths can be coroutinized
without impacting the fast path.
This reverts commit 9526258b89.
Because the issue (#14213) it was supposed to fix only exists in the
enterprise branch, and that issue has been fixed in a different
way in a different place.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14234
To reduce the indentation level, and to improve readability. Also, take this opportunity to give some variables descriptive names.
Closes #14236
* github.com:scylladb/scylladb:
migration_manager: coroutineize migration_manager::do_merge_schema_from()
migration_manager: coroutineize migration_manager::sync_schema()
Currently, the mapping is initialized from the gossiper state when
group0 server is started and updated from a gossiper change
listener. Gossiper state is restored from system.peers in
storage_service::join_cluster(), which is later than
setup_group0_if_exists() is called.
The restarted server will hang in
group0_service.setup_group0_if_exist(), which waits for snapshot
loading, which waits for storage_service::topology_state_load(), which
waits for IP mapping for servers mentioned in the topology, and
produces logs like this:
WARN 2023-06-12 15:45:21,369 [shard 0] storage_service - (rate limiting dropped 196 similar messages) raft topology: cannot map c94ae68f-869d-4727-8b2f-d40814e395f0 to ip, retrying.
This is a regression after f26179c, where group0 server is initialized
before the gossiper is started.
The fix is to load the mapping from system.peers before group0 is
started. Gossiper state is not available at this point, so we read the
mapping directly from system keyspace. This change will also be needed
to implement messaging by host id, even if raft is disabled, where we
will need to restore the mapping early.
Fixes #14217
Closes #14220
This is another part of splitting Scylla initialization into two phases:
local and remote parts. Performing queries is done with `storage_proxy`,
so for local queries we want to initialize it before we initialize
services specific to cluster communication such as `gossiper`,
`messaging_service`, `storage_service`.
`system_keyspace` should also be initialized after `storage_proxy` (and
is after this patch) so in the future we'll be able to merge the
multiple initialization steps of `system_keyspace` into one (it only
needs the local part to work).
We want to separate two phases of Scylla service initialization: first
we initialize the local part, which allows performing local queries,
then a remote part, which requires contacting other nodes in a cluster
and allows performing distributed queries.
The `database` object is crucial for both remote and local queries, but it
was created pretty late, after services such as `gossiper` or
`storage_service` which are used for distributed operations.
Fortunately we can easily move `database` initialization and all of its
prerequisites early in the init procedure.
These services are now passed during `init_messaging_service`, and
that's when the `remote` object is constructed.
The `remote` object is then destroyed in `uninit_messaging_service`.
Also, `migration_manager*` became `migration_manager&` in
`init_messaging_service`.
Prepare the users of `remote` for the possibility that it's gone.
The `remote()` accessor throws an error if it's gone. Observe that
`remote()` is only used in places where it's verified that we really
want to send a message to a remote node, with a small exception:
`truncate_blocking`, which truncates locally by sending an RPC to
ourselves (and truncate always sends RPC to the whole cluster; we might
want to change this behavior in the future, see #11087). Other places
are easy to check (it's either implementations of `apply_remotely`
which is only called for remote nodes, or there's an `if` that checks
we don't apply the operation to ourselves).
There is one direct access to `_remote` which checks first if `_remote`
is available: `storage_proxy::is_alive`. If `_remote` is unavailable, we
consider nodes other than us dead. Indeed, if `gossiper` is unavailable,
we didn't have a chance to gossip with other nodes and mark them alive.
In `query_partition_key_range_concurrent` there's a calculation of cache
hit rates which requires accessing `gossiper` through `remote`. We want
to support local queries when `remote` is unavailable.
Check if it's a local query and only if not, fetch `gossiper` from `remote`.
We only want to access `remote` when it's necessary - when we're
performing a query that involves remote nodes. We want to support local
queries when `remote` (in particular, `gossiper&`) is unavailable.
Add a helper, `storage_proxy::filter_replicas_for_read`, which will
check if it's a local query and return early in that case without
accessing `remote`.
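The helper described above can be sketched with hypothetical names: when every target replica is the local node, the gossiper-backed liveness filter in `remote` is never consulted, so purely local queries keep working before the remote services are initialized.

```python
LOCAL = "127.0.0.1"

def filter_replicas_for_read(replicas, remote=None):
    # local query: return early without touching `remote` at all
    if all(r == LOCAL for r in replicas):
        return list(replicas)
    if remote is None:
        raise RuntimeError("remote services not initialized yet")
    return [r for r in replicas if remote.is_alive(r)]

assert filter_replicas_for_read([LOCAL]) == [LOCAL]
try:
    filter_replicas_for_read([LOCAL, "10.0.0.2"])
    raise AssertionError("expected failure without remote")
except RuntimeError:
    pass
```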
The method sits on sstable, but it is called only from filesystem
storage, which is the only place that really needs it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14230
After recent changes, all wasm related logic has been moved from
the database class to the query_processor. As a result, the wasm
headers no longer need to be included there, and in particular,
files that include replica/database.hh no longer need to wait
on the generated header rust/wasmtime_bindings.hh to compile.
Fixes #14224
Closes #14223
This is a translation of Cassandra's CQL unit test source file
validation/operations/UpdateTest.java into our cql-pytest framework.
There are 18 tests, and they did not reproduce any previously-unknown
bug, but did provide additional reproducers for two known issues:
Refs #12243: Setting USING TTL of "null" should be allowed
Refs #12474: DELETE/UPDATE print misleading error message suggesting
ALLOW FILTERING would work
Note that we knew about this issue for the DELETE operation, and
the new test shows the same issue exists for UPDATE.
I had to modify some of the tests to allow for different error messages
in ScyllaDB (in cases where the different message makes sense), as well
as cases where we decided to allow in Scylla some behaviors that are
forbidden in Cassandra - namely Refs #12472.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14222
When scylla starts it collects dangling sstables from the datadir. This includes temporary sstable directories and the pending-deletion log. S3-backed sstables cannot be garbage-collected like that; instead, "garbage" entries from the ownership table should be processed. Currently the g.c. code is unaware of storage and scans the datadir for whatever sstable it's called for.
This PR prepares the garbage_collect() call to become virtual, but no-op for ownership-table lister. Proper S3 garbage-collecting is not yet here, it needs an extra patch to seastar http client.
refs: #13024
Closes #14023
* github.com:scylladb/scylladb:
sstable_directory: Do not collect filesystem garbage for S3-backed sstables
sstable_directory: Deduplicate .process() location argument
sstable_directory: Keep directory lister on stack
sstable_directory: Use directory_lister API directly
Fixes https://github.com/scylladb/scylladb/issues/14084
This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.
Closes #14228
* github.com:scylladb/scylladb:
doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
doc: add OS support for ScyllaDB 5.3
to reduce the indentation level, and to improve the readability.
also, take this opportunity to name some variables for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a function which retrieves the value of the nth
field from a serialized tuple value.
I tried to make it as efficient as possible.
Other functions, like evaluate(subscript) tend to
deserialize the whole structure and put all of its
elements in a vector. Then they select a single element
from this vector.
This is wasteful, as we only need a single element's value.
This function goes over the serialized fields
and directly returns the one that is needed.
No allocations are needed.
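The idea can be sketched as below. This is an illustrative model (not Scylla's actual API or names) assuming the CQL tuple wire format: each field is a 32-bit big-endian length followed by the field bytes, with a length of -1 encoding NULL. Fields are skipped until the nth one, so no vector of all elements is built and no allocation happens.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

static int32_t read_be32(const unsigned char* p) {
    return int32_t((uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                   (uint32_t(p[2]) << 8) | uint32_t(p[3]));
}

// Returns a view into the serialized buffer; nullopt on NULL or out of range.
std::optional<std::string_view> nth_tuple_field(std::string_view buf, size_t n) {
    auto* p = reinterpret_cast<const unsigned char*>(buf.data());
    auto* end = p + buf.size();
    for (size_t i = 0; ; ++i) {
        if (end - p < 4) {
            return std::nullopt;                    // fewer than n+1 fields
        }
        int32_t len = read_be32(p);
        p += 4;
        if (i == n) {
            if (len < 0 || end - p < len) {
                return std::nullopt;                // NULL (or malformed) field
            }
            return std::string_view(reinterpret_cast<const char*>(p), size_t(len));
        }
        if (len > 0) {
            if (end - p < len) {
                return std::nullopt;                // malformed buffer
            }
            p += len;                               // skip this field's bytes
        }
    }
}

// helper for the usage example: serialize the tuple ("ab", NULL, "xyz")
std::string make_sample_tuple() {
    std::string s;
    auto put32 = [&](int32_t v) {
        for (int i = 3; i >= 0; --i) {
            s.push_back(char((uint32_t(v) >> (8 * i)) & 0xff));
        }
    };
    put32(2); s += "ab";
    put32(-1);                                      // NULL middle field
    put32(3); s += "xyz";
    return s;
}
```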
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Since all function overload selection is done by prepare_expression(),
we no longer need to implement the assignment_testable interface, so
drop it.
Since there's now just one implementation of assignment_testable,
we can drop it and replace it by the implementation (expressions),
but that is left for later.
Now that selector expressions are prepared, we can avoid doing
the work ourselves:
- function_name:s are resolved into functions, so we can error
out if we see a function_name (and drop the with_function class)
- casts are converted to anonymous functions, so we can error
out if we see them (and drop with with_cast class)
- field_selection:s can rely on the prepared field_idx
Call prepare_expression() on selector expressions to resolve types. This
leaves us with just one way to move from the unprepared domain to the
prepared domain.
The change is somewhat awkward since do_prepare_selectable() is re-doing
work that is done by prepare_expression(), but somehow it all works. The
next patch will tear down the unnecessary double-preparation.
assignment_testable is used to convey type information to function overload
selection. The implementation for `selector` recognizes that counters are
really bigints and special cases them. The equivalent implementation for
expressions doesn't, so bring over that nuance here too.
With this, things like sum(counter_column) match the overload for
sum(bigint) rather than failing.
Our prepare phase performs constant-folding: if an expression
is composed of constants, and is pure, it is evaluated during
the preparation phase rather than during query execution.
This however can't work for user-defined functions as these require
running in a thread, and we aren't running in a thread during
preparation time. Skip the optimization in this case.
Before this series, function overload resolution peeked
at function arguments to see if they happened to be selectors,
and if so grabbed their type. If they did not happen to be
selectors, we wouldn't know their type, but as it happened
all generic functions are aggregates, and aggregates are only
legal in the SELECT clause, so that didn't matter.
In a previous patch, we changed assignment_testable to carry
an optional type and wired it to selector, so we wouldn't
need to dynamic_cast<selector>.
Now, we wire the optional type to assignment_testable_expression,
so overload resolution of generic functions can happen during
expression preparation.
The code that bridges the function argument expressions to
assignment_testable is extracted into a function, since it's
too complicated to be written as a transform.
The field_selection structure is augmented with the field
index so that does not need to be done at evaluation time,
similar to the current with_field_selection selectable.
CQL supports two cast styles:
- C-style: (type) expr, used for casts between binary-compatible types
and for type hinting of bind variables
- SQL-style: (expr AS type), used for real type conversions
Currently, the expression system differentiates them by the cast::type
field, which is a data_type for SQL-style casts and a cql3_type::raw
for C-style casts, but that won't work after the prepare phase is applied
to SQL-style casts when the type field will be prepared into a data_type.
Prepare for this by adding a separate enum to distinguish between the
two styles.
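The separation can be sketched with an explicit enum; the names below are hypothetical (not the actual field layout), but they show how the style is distinguished independently of whether the target type is still a raw cql3 type or an already-prepared data_type.

```cpp
#include <cassert>

// Hypothetical sketch, not Scylla's actual representation.
enum class cast_style {
    c,    // (type) expr -- binary-compatible reinterpretation or bind-variable type hint
    sql,  // (expr AS type) -- a real type conversion
};

bool performs_conversion(cast_style s) {
    return s == cast_style::sql;    // only SQL-style casts convert values
}
```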
A pure function should return the same value on every invocation,
but enabled_injections() returns true or false depending on global
state.
Mark it impure to reflect that.
Currently, the bug has no effect, but once we prepare selectors,
the prepare_function_call() will constant-fold calls to pure
functions, so we'll capture global state at prepare time rather
than evaluate it each time anew.
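A toy model of the folding hazard (hypothetical names, not Scylla code) makes this concrete: prepare() folds pure calls once at prepare time, while impure calls are kept as-is and re-evaluated on every execution.

```cpp
#include <cassert>
#include <functional>
#include <utility>

static bool g_injection_enabled = false;

struct func {
    bool pure;
    std::function<bool()> eval;
};

std::function<bool()> prepare(const func& f) {
    if (f.pure) {
        bool folded = f.eval();              // global state captured at prepare time
        return [folded] { return folded; };
    }
    return f.eval;                           // evaluated anew on each call
}

// an enabled_injections() analogue: it reads global state, so marking it
// pure would freeze the prepare-time answer forever.
func enabled_injections_fn(bool marked_pure) {
    return func{marked_pure, [] { return g_injection_enabled; }};
}

// returns {what a wrongly-pure function reports, what an impure one reports}
// after the global state flips between prepare time and execution time.
std::pair<bool, bool> demo() {
    g_injection_enabled = false;
    auto stale = prepare(enabled_injections_fn(true));   // the bug
    auto live = prepare(enabled_injections_fn(false));   // the fix
    g_injection_enabled = true;
    return {stale(), live()};
}
```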
Type inference for function calls is a bit complicated:
- a function argument can be inferred from the signature: a call to
my_func(:arg) will infer :arg's type from the function signature
- a function signature can be inferred from its argument types:
a call to max(my_column) will select the correct max() signature
(as max is generic) from my_column's type
Currently, functions::get() implements this by invoking
dynamic_cast<selector*> on the argument. If the caller of
functions::get() is the SELECT clause preparation, then the
cast will succeed and we'll be able to find the type. If not,
we fail (and fall back to inferring the argument types from a
non-generic function signature).
Since we're about to move selectors to expressions, the dynamic_cast
will fail, so we must replace it with a less fragile approach.
The fix is to augment assignment_testable (the interface representing
a function argument) with an intentionally-awkwardly-named
assignment_testable_type_opt(), that sees whether we happen to know
the type for the argument in order to implement signature-from-argument
inference.
A note about assignment_testable: this is a bridge interface
that is the least common denominator of anything that calls functions.
Since we're moving towards expressions, there are fewer implementations of
the interface as the code evolves.
test_assignment() and related functions check for type compatibility between
a right-hand-side and a left-hand-side.
It started its life with a limited functionality for INSERT and UPDATE,
but now it's about to be used for cast expression in selectors, which
can cast a column_value. A column_value is still an unresolved_identifier
during the prepare phase, and cannot be resolved without a schema.
To prepare for this, pass an optional schema everywhere.
Ultimately, test_assignment likely needs to be folded into prepare_expr(),
but before that prepare_expr() has to be used everywhere.
prepare_expr() began its life as a replacement for the WHERE clause,
so it shares its restrictions, one of which is not supporting aggregate
functions.
In previous patches, we added an explicit check to all users, so we can
now remove the check here, so that we can later prepare selectors.
In addition to dropping the check, we drop the dynamic_cast<scalar_function>,
as it can now fail. It turns out it's unnecessary since everything is available
from the base class.
Note we don't allow constant folding involving aggregate functions: first,
our evaluator doesn't support it, and second, we don't have the iteration count
at prepare time.
Since we don't yet prepare selectors, all calls to prepare_expr()
are adjusted.
Note that missing a check isn't fatal - it will be trapped at runtime
because evaluate(aggregate) will throw.
Aggregate functions are only allowed in certain contexts (the
SELECT clause and the HAVING clause, which we don't yet have).
prepare_expr() currently rejects aggregate functions, but that means
we cannot use it to prepare selectors.
To prepare for the use of prepare_expr() in selectors, we'll have to
move the check out of prepare_expr(). This helper is the beginning of
that change.
I considered adding a parameter to prepare_expr(), but that is even
more noisy than adding a call to the helper.
column_mutation_attribute_type() returns int32_type or long_type
depending on whether TTL or WRITETIME is requested.
Will be used later when we prepare column_mutation_attribute
expressions.
There are two execute() overloads, but they don't do the same thing - one is a
partial implementation of the other.
The same is true of two execute_without_checking_exception_message() overloads.
Change the name of the subordinate overload to indicate its role. Overloads should
be used when the only difference between overloads is the argument type, not when
one does a subset of the other.
per clang's documentation, -Og is like -O1, which is in turn an optimization
level between -O0 and -O2. -O0 "generates the most debuggable code".
for instance, with -O0, presumably, the variables are not allocated in
registers where they could later get overwritten; they are always allocated on
the stack. this helps with the debugging.
in this change, -O0 is used for better debugging experience. the
downside is that the emitted code size will be greater than the one
emitted from -Og, and the executable will be slower.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14210
This reverts commit 7e68ed6a5d. With -Og
simple 'gdb build/debug/scylla -ex start' shows the function parameters
as "optimized out" while with -O0 they display fine. This applies to
all variables, not just main's parameters.
Bisect revealed that this behavior started with the reverted commit; it's
not due to a later toolchain update.
Fixes #14196.
Closes #14197
This PR adds a set of tests for cluster features. The set covers those tests from the test plan for the upcoming "cluster features on raft" that do not depend on implementation-specific details. Those tests are applicable to the existing gossip-based implementation, so they can be useful for us right now.
The tests simulate cluster upgrades by conditionally marking support for the TEST_ONLY_FEATURE on the nodes via error injection. Therefore, the tests work only in non-release mode.
The `test_partial_upgrade_can_be_finished_with_removenode` test is marked as skipped because of a bug in gossip that prevents features from being enabled if a node was removed within the last 3 days (#14194).
Closes #14211
* github.com:scylladb/scylladb:
test/topology: add cluster feature tests
test: introduce get_supported_features/get_enabled_features
test: move wait_for_feature to pylib utils
feature_service: add a test-only cluster feature
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.
The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.
As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.
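The intended semantics can be sketched generically (this is an illustration of the behavior, not Scylla's aggregate machinery): count(col) only checks whether each input is NULL, so one generic implementation covers scalars, collections, tuples and UDTs alike.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// NULL is modeled as an empty optional; the element type T is irrelevant.
template <typename T>
size_t count_column(const std::vector<std::optional<T>>& rows) {
    size_t n = 0;
    for (const auto& v : rows) {
        if (v.has_value()) {
            ++n;                 // NULLs are skipped, regardless of T
        }
    }
    return n;
}
```

Note that an empty collection still counts as a value; only NULL is skipped.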
A unit test is added. It passes with Cassandra as well.
Fixes #14198.
Closes #14199
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.
The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.
This fix should resolve the issue where forward_service retries could
block the shutdown.
Fixes #12604
Closes #13922
The `topology_node_mutation_builder::set` function, when passed a
non-empty set of tokens, will construct a mutation that adds given
tokens to the column instead of overwriting them. This is not a problem
today because we are always calling `set` on an empty column, but given
the fact that the function is called `set` not `add` and other overloads
of `set` do overwrite, the function might be misused in the future.
This commit fixes the problem by initializing the tombstone in
`collection_mutation_description` properly, causing the previous state
to be dropped before applying new tokens. The tombstone has a timestamp
which is one less than the timestamp of the added cells, mimicking the
CQL behavior which happens when a non-frozen collection is overwritten.
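A toy model of the fix (illustrative, not the real mutation machinery): the collection mutation carries a tombstone whose timestamp is one less than the new cells' write timestamp, so every older cell is dropped while the freshly written cells survive.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

struct collection_state {
    long tomb = -1;                       // tombstone timestamp, -1 = none
    std::map<std::string, long> cells;    // token -> write timestamp
};

void apply_set(collection_state& s, const std::map<std::string, long>& new_cells,
               long write_ts) {
    long tomb = write_ts - 1;             // one less than the added cells
    if (tomb > s.tomb) {
        s.tomb = tomb;
    }
    // cells at or below the tombstone timestamp are shadowed and dropped
    for (auto it = s.cells.begin(); it != s.cells.end();) {
        it = (it->second <= s.tomb) ? s.cells.erase(it) : std::next(it);
    }
    for (const auto& [k, ts] : new_cells) {
        s.cells[k] = ts;
    }
}

// demo: the second set() fully replaces the first instead of adding to it
size_t tokens_after_second_set() {
    collection_state s;
    apply_set(s, {{"t1", 5}, {"t2", 5}}, 5);
    apply_set(s, {{"t3", 10}}, 10);
    return s.cells.size();
}
```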
Closes #14216
The function used `gossiper&` to check whether an endpoint is considered
alive. Abstract this out through `noncopyable_function`.
This will allow us to use `endpoint_filter` during local queries when
`remote` (which contains the `gossiper` reference) is unavailable.
when compiling the generated source files, sometimes, we can run
into the FTBFS like:
02:18:54 FAILED: build/release/gen/cql3/CqlParser.o
02:18:54 clang++ ... -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
...
02:18:54 In file included from build/release/gen/cql3/CqlParser.cpp:44:
02:18:54 In file included from build/release/gen/cql3/CqlParser.hpp:75:
02:18:54 In file included from ./cql3/statements/create_function_statement.hh:12:
02:18:54 In file included from ./cql3/functions/user_function.hh:16:
02:18:54 ./lang/wasm.hh:15:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
02:18:54 #include "rust/wasmtime_bindings.hh"
02:18:54 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
CqlParser.cc is a source file generated from cql3/Cql.g; this source in
turn includes another source file generated from
wasmtime_bindings/src/lib.rs. but we failed to set up this dependency in
the build.ninja rules -- we only taught ninja that "to compile the
grammar source files, please prepare the `serializers` source files
first". but this is not enough.
so, in this change, we just replace `serializers` with `gen_headers`,
as the latter is a superset of the former, and should fulfill the needs
of CqlParser.cc.
Fixes#14213
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14214
This commit adds a set of tests for cluster features. The set covers
those tests from the test plan for the upcoming "cluster features on
raft" that do not depend on implementation-specific details. Those tests
are applicable to the existing gossip-based implementation, so they can
be useful for us right now.
The tests simulate cluster upgrades by conditionally marking support for
the TEST_ONLY_FEATURE on the nodes via error injection. Therefore, the
tests work only in non-release mode.
The `test_partial_upgrade_can_be_finished_with_removenode` test is
marked as skipped because of a bug in gossip that prevents features from
being enabled if a node was removed within the last 3 days (#14194).
Introduces two helper functions that allow getting information about
supported/enabled features on a node, according to its system tables.
As a bonus, the `wait_for_feature` function is refactored to use
`get_enabled_features`.
Fixes https://github.com/scylladb/scylladb/issues/14097
This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.
The update is in sync with the change made for
ScyllaDB 5.2.
This commit must be backported to branch-5.2 and
branch-5.3.
Closes #14118
Adds a cluster feature called TEST_ONLY_FEATURE. It can only be marked
as supported via error injection "features_enable_test_feature", which
can be enabled on node startup via CLI option or YAML configuration.
The purpose of this cluster feature is to simulate upgrading a node to a
version that supports a new feature. This allows us to write tests which
verify that the cluster feature mechanism works.
The fact that TEST_ONLY_FEATURE can only be enabled via error injection
should make it impossible to accidentally enable it in release mode and,
consequently, in production.
Prior to this, `table_name` was validated for every request in `find_table_name`, leading to unnecessary overhead (small, but unnecessary). Now, the `table_name` is only validated during the creation request, and in other requests only if the table does not exist (to keep compatibility with DynamoDB's exceptions).
Fixes: #12538
Closes #13966
because we build stripped package and non-stripped package in parallel
using ninja. there are chances that the non-stripped build job could
be adding build/node_exporter directory to the tarball while the job
building stripped package is using objcopy to extract the symbols from
the build/node_exporter/node_exporter executable. but objcopy creates
temporary files when processing the executables. and the temporary
files can be spotted by the non-stripped build job. there are two
consequences:
1. the non-stripped build job includes the temporary files in its tarball,
even though they are not supposed to be distributed
2. the non-stripped build job fails to include the temporary file(s), as
they are removed after objcopy finishes its job. but the job did spot
them when preparing the tarball. so when the tarfile python module
tries to include the previously found temporary file(s), it throws.
neither of these consequences is expected. but fortunately, this only
happens when packaging the non-stripped package. when packaging the
stripped package, the build/node_exporter directory is not in flux
anymore. as ninja ensures the dependencies between the jobs.
so, in this change, we do not add the whole directory when packaging
the non-stripped version, as all its ingredients have been added
separately as regular files. and when packaging the stripped version,
we still use the existing step, as we don't have to list all the
files created by strip.sh:
node_exporter{,.debug,.dynsyms,.funcsyms,.keep_symbols,.minidebug.xz}
we could do so in this script, but the repetition is unnecessary and
error-prone. so, let's keep including the whole directory recursively,
so all the debug symbols are included.
Fixes https://github.com/scylladb/scylladb/issues/14079
Closes #14081
* github.com:scylladb/scylladb:
create-relocatable-package.py: package build/node_export only for stripped version
create-relocatable-package.py: use positive condition when possible
this change is a follow-up of 4f5fcb02fd,
the goal is to avoid the programming oversights like
```c++
trace(trace_ptr, "foo {} with {} but {} is {}");
```
as `trace(const trace_state_ptr& p, const std::string& msg)` is
a better match than the templated one, i.e.,
`trace(const trace_state_ptr& p, fmt::format_string<T...> fmt, T&&...
args)`. so we cannot detect this with the compile-time format checking.
so let's just drop this overload, and update its callers to use
the other overload.
The change was suggested by Avi. the example also came from him.
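A minimal sketch of the compile-time idea behind fmt::format_string (a hypothetical helper, not {fmt}'s implementation): the placeholder count of a format string can be computed in a constexpr context, which is what allows a call passing the wrong number of arguments to be rejected before the program ever runs.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// counts "{}" placeholders at compile time (or at runtime for dynamic strings)
constexpr size_t count_placeholders(std::string_view fmt) {
    size_t n = 0;
    for (size_t i = 0; i + 1 < fmt.size(); ++i) {
        if (fmt[i] == '{' && fmt[i + 1] == '}') {
            ++n;
            ++i;    // skip the '}' we just matched
        }
    }
    return n;
}
```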
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14188
sel is a local variable, and it is not shared with anybody else.
so make it a unique_ptr<> for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14189
Uploading sinks have an internal semaphore limiting the maximum number of
uploading parts and pieces to two. This approach has
several drawbacks.
1. The number is arbitrary. It could as well be three, four, or any other number
2. Jumbo upload in fact violates this parallelism, because the limit applies to
the maximum number of pieces _and_ the maximum number of parts in each piece
that can be uploaded in parallel. Thus jumbo upload results in four
parts in parallel.
3. Multiple uploads don't sync with each other, so uploading N objects
would result in N * 2 (or even N * 4 with jumbo) uploads in parallel.
4. Single upload could benefit from using more sockets if no other
uploads happen in parallel. IOW -- limit should be shard-wide, not
single-upload-wide
Previous patches already put the per-shard parallelism under (some)
control, so this semaphore is in fact used as a way to collect
background uploading fibers on final flush and thus can be replaced with
a gate.
As a side effect, this fixes an issue that writes-after-flush shouldn't
happen (see #13320) -- when flushed the upload gate is closed and
subsequent writes would hit gate-closed error.
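A single-threaded toy of the seastar::gate semantics relied on here (illustrative, not seastar code): background upload fibers enter() the gate, the final flush close()s it, and any write attempted afterwards fails with a gate-closed error instead of racing with the flush.

```cpp
#include <cassert>
#include <stdexcept>

class toy_gate {
    int _count = 0;
    bool _closed = false;
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_count;
    }
    void leave() { --_count; }
    void close() { _closed = true; /* the real gate also waits for _count == 0 */ }
};

bool write_after_flush_fails() {
    toy_gate g;
    g.enter();                 // a background upload fiber starts...
    g.leave();                 // ...and completes
    g.close();                 // final flush closes the gate
    try {
        g.enter();             // a write after flush
    } catch (const std::runtime_error&) {
        return true;           // hits the gate-closed error, as intended
    }
    return false;
}
```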
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patch different sched groups got different http clients.
By default each client is started with 100 allowed connections. This can
be too much -- 100 * nr-sched-groups * smp::count can be a huge
number. Also, different groups should have different parallelism, e.g.
flush/compaction doesn't care that much about latency and can use fewer
sockets while query class is more welcome to have larger concurrency.
As a starter -- configure http clients with maximum shares/100 sockets.
Thus query class would have 10 and flush/compaction -- 1.
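The sizing rule can be sketched as below; the exact shares values are assumptions (e.g. 1000 shares for the query class, around 100 for flush/compaction), and the floor of one connection is added so every group can make progress.

```cpp
#include <algorithm>
#include <cassert>

// maximum connections for an http client, derived from the group's shares
unsigned max_connections_for(unsigned shares) {
    return std::max(1u, shares / 100);
}
```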
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The intent is to isolate workloads from different sched groups from each
other and not let one sched group consume all sockets from the http
client thus affecting requests made by other sched groups.
The contention happens on the maximum number of sockets an http client may
have (see scylladb/seastar#1652). If requests take time and the client is
asked to make more and more, it will eventually stop spawning new
connections and will get blocked internally, waiting for running
requests to complete and put a socket back into the pool. If one sched
group's workload (e.g. memtable flush) consumes all the available sockets,
then the workload from another group (e.g. query) will be blocked, thus
spoiling its latency (which is poor on its own, but still)
After this change S3 client maintains a sched_group:http_client map
thus making sure different sched groups don't clash with each other so
that e.g. query requests don't wait for flush/compaction to release a
socket.
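An illustrative sketch of the map (names hypothetical): one http client per scheduling group, created lazily, so one group exhausting its connection pool cannot starve requests issued from another group.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct toy_http_client {
    unsigned max_connections;
};

class toy_s3_client {
    std::map<std::string, std::unique_ptr<toy_http_client>> _clients;
public:
    toy_http_client& client_for(const std::string& sched_group, unsigned max_conn) {
        auto& slot = _clients[sched_group];
        if (!slot) {
            slot = std::make_unique<toy_http_client>(toy_http_client{max_conn});
        }
        return *slot;
    }
    size_t nr_clients() const { return _clients.size(); }
};

bool groups_get_distinct_clients() {
    toy_s3_client c;
    auto& q1 = c.client_for("query", 10);
    auto& q2 = c.client_for("query", 10);   // same group -> same client
    auto& fl = c.client_for("flush", 1);    // different group -> its own client
    return &q1 == &q2 && &q1 != &fl && c.nr_clients() == 2;
}
```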
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper call will serve several purposes.
First, it makes the necessary preparations to the request before sending it,
in particular calling authorize().
Second, there's the need to re-make requests that failed with
"connection closed" error (see #13736)
Third, one S3 client is shared between different scheduling groups. In
order to isolate the groups' workloads from each other, different http clients
should be used, and this helper will be in charge of selecting one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The estimated_partitions value is only known after the repair_meta is created.
Currently, the default estimated_partitions was used to create the
writer, which is not correct.
To fix this, use the updated estimated_partitions.
Reported by Petr Gusev
Closes #14179
Return 400 Bad Request instead of 500 Internal Server Error
when user requests task or module which does not exist through
task manager and task manager test apis.
Closes #14166
* github.com:scylladb/scylladb:
test: add test checking response status when requested module does not exist
api: fix indentation
api: throw bad_param_exception when requested task/module does not exists
in this series, we use {fmt}'s compile-time formatting check, and avoid deep copy when creating sstring from std::string.
Closes #14169
* github.com:scylladb/scylladb:
tracing: use std::string instead of sstring for event_record::message
tracing: use compile-time formatting check
Compaction task test should only check the intended group of task.
Thus, the tasks are filtered in each test.
In order to be able to run the tests in parallel, checks for the tasks
of the same type are grouped together.
Fixes: #14131.
Closes #14161
* github.com:scylladb/scylladb:
test: put compaction task checks of the same type together
test: filter tasks of given compaction type
there are chances that a Boost::test test fails to generate a
valid XML file after the test finishes, and
xml.etree.ElementTree.parse() throws when parsing it.
see https://github.com/scylladb/scylla-pkg/issues/3196
before this change, the exception is not handled, and test.py
aborts in this case. this does not help and could be misleading.
after this change, the exception is handled and printed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14180
Contains two tables describing all the possible operations seen in user, system and streaming semaphore diagnostics dumps.
Closes #14171
* github.com:scylladb/scylladb:
docs/dev/reader-concurrency-semaphore.md: add section about operations
docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
The helper is huge, in the form of a then-chain. Also it's generic enough not to limit itself to returning sstables' Data file names only.
refs: #14122 (detached from the one that needs more thinking about)
Closes #14174
* github.com:scylladb/scylladb:
table: Return shared sstable from get_sstables_by_partition_key()
table: Coroutinize get_sstables_by_partition_key()
* seastar afe39231...99d28ff0 (16):
> file/util: Include seastar.hh
> http/exception: Use http::reply explicitly
> http/client: Include lost condition-variable.hh
> util: file: drop unnecessary include of reactor.hh
> tests: perf: add a markdown printer
> http/client: Introduce unexpected_status_error for client requests
> sharded: avoid #include <seastar/core/reactor.hh> for run_in_background()
> code: Use std::is_invocable_r_v instead of InvokeReturns
> http/client: Add ability to change pool size on the fly
> http/client: Add getters for active/idle connections counts
> http/client: Count and limit the number of connections
> http/client: Add connection->client RAII backref
> build: use the user-specified compiler when building DPDK
> build: use proper toolchain based on specified compiler
> build: only pass CMAKE_C_COMPILER when building ingredients
> build: use specified compiler when building liburing
Two changes are folded into the commit:
1. a missing seastar/core/coroutine.hh include in one .cc file that
got it indirectly included before the seastar reactor.hh drop from
file.hh
2. http client now returns unexpected_status_error instead of
std::runtime_error, so s3 test is updated respectively
Closes #14168
There are some headers that include tracing/*.hh ones despite all they
need is forward-declared trace_state_ptr
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14155
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.
Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.
Fixes: #14029
Fixes: #14072
Closes #14167
* github.com:scylladb/scylladb:
raft: server: throw fewer `commit_status_unknown`s from `wait_for_entry`
raft: replication test: don't hang if `_seen` overshots `_apply_entries`
raft: replication test: print a warning when handling `commit_status_unknown`
In 297c75c6d8 I set the timeout to
5 minutes mainly due to debug mode which is often quite slow on Jenkins.
But 5 minutes is a bit of an overkill. It wouldn't be a problem but
there is a dtest that waits for a node to fail bootstrap; it's wasteful
for the test to sleep for an entire 5 minutes.
Set it to:
- 3 minutes in debug mode,
- 30 seconds in dev/release modes.
Ref: scylladb/scylla-dtest#3203
Closes #14140
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.
Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.
Fixes: #14029
As in the previous commit, if a command gets doubly applied due to
`commit_status_unknown`, this could lead to hard-to-debug failures;
one of them was the test hanging because we would never call
`_done.set_value()` in `state_machine::apply` due to `_seen`
overshooting `_apply_entries`.
Fix the problem and print a warning if we apply too many commands.
Fixes: #14072
`commit_status_unknown` may lead to double application and then a
hard-to-debug failure. But some tests actually rely on retrying it, so
print a warning and leave a FIXME for maybe a better future solution.
Ref: #14029
This commit adds a table (with 1 row) explaining Scylla-specific
materialized view options - which now consists of just
synchronous_updates.
Tested manually by running `make preview` from docs/ directory.
Closes #11150
The call is generic enough not to drop the sstable itself on return so
that callers can do whatever they need with it. The only today's caller
is API which will convert sstables to filenames on its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
when creating an event_record, the typical use case is to use a
string created using fmt::format(), which returns a std::string.
before this change, we always converted the std::string to a sstring,
and moved this shiny new sstring into a new event_record. but
when creating an sstring, we always perform a deep copy, which is not
necessary, as we own the std::string already.
so, in this change, instead of performing a deep copy, we just keep
the std::string and pass it all the way to where event_record is
created. please note, the std::string will be implicitly converted
to data_value, and it will be dropped on the floor after being
serialized in abstract_type::decompose(). so this deep copy is
inevitable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this change we pass the fmt string using fmt::format_string<T...>
in order to use {fmt}'s compile-time format checking, so we can identify
bad format specifiers or bad format placeholders at compile time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In the task manager and task manager test REST APIs, when a task or module
which does not exist is requested, we get Internal Server Error.
In such cases, wrap the thrown exceptions in bad_param_exception
to respond with a Bad Request code.
Modify the tests accordingly.
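The rewrapping pattern can be sketched as follows. All names here (`bad_param`, `find_task`) are hypothetical stand-ins; the real code uses seastar httpd's bad_param_exception, which the HTTP layer maps to a 400 response instead of a 500:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for seastar httpd's bad_param_exception (maps to 400 Bad Request).
struct bad_param : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A lookup failure (which would otherwise surface as an unhandled exception
// and a 500 Internal Server Error) is rewrapped as a client error.
std::string find_task(const std::map<std::string, std::string>& tasks,
                      const std::string& id) {
    try {
        return tasks.at(id);  // std::map::at throws std::out_of_range when absent
    } catch (const std::out_of_range&) {
        throw bad_param("task with id " + id + " not found");
    }
}
```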
statement_restrictions: forbid IS NOT NULL on columns outside the primary key
IS NOT NULL is currently allowed only when creating materialized views.
It's used to convey that the view will not include any rows that would make the view's primary key columns NULL.
Generally materialized views allow placing restrictions on the primary key columns, but restrictions on the regular columns are forbidden. The exception was IS NOT NULL - it was allowed to write regular_col IS NOT NULL. The problem is that this restriction isn't respected, it's just silently ignored (see #10365).
Supporting IS NOT NULL on regular columns seems to be as hard as supporting any other restrictions on regular columns.
It would be a big effort, and there are some reasons why we don't support them.
For now let's forbid such restrictions, it's better to fail than be wrong silently.
Throwing a hard error would be a breaking change.
To avoid breaking existing code, the reaction to an invalid IS NOT NULL restriction is controlled by the `strict_is_not_null_in_views` flag.
This flag can have the following values:
* `true` - strict checking. Having an `IS NOT NULL` restriction on a column that doesn't belong to the view's primary key causes an error to be thrown.
* `warn` - allow invalid `IS NOT NULL` restrictions, but throw a warning. The invalid restrictions are silently ignored.
* `false` - allow invalid `IS NOT NULL` restrictions, without any warnings or errors. The invalid restrictions are silently ignored.
The default values for this flag are `warn` in `db::config` and `true` in scylla.yaml.
This way the existing clusters will have `warn` by default, so they'll get a warning if they try to create such an invalid view.
New clusters with fresh scylla.yaml will have the flag set to `true`, as scylla.yaml overwrites the default value in `db::config`.
New clusters will throw a hard error for invalid views, but in older existing clusters it will just be a warning.
This way we can maintain backwards compatibility, but still move forward by rejecting invalid queries on new clusters.
Fixes: #10365
Closes #13013
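The default interplay described above boils down to one line in the shipped configuration file. The flag name and values are from the commit; the comment wording is illustrative:

```yaml
# Shipped in the default scylla.yaml of new installations: strict checking,
# i.e. IS NOT NULL on a column outside the view's primary key is an error.
# Existing clusters whose scylla.yaml lacks this line fall back to the
# db::config default of `warn`.
strict_is_not_null_in_views: true
```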
* github.com:scylladb/scylladb:
boost/restriction_test: test the strict_is_not_null_in_views flag
docs/cql/mv: columns outside of view's primary key can't be restricted
cql-pytest: enable test_is_not_null_forbidden_in_filter
statement_restrictions: forbid IS NOT NULL on columns outside the primary key
schema_altering_statement: return warnings from prepare_schema_mutations()
db/config: add strict_is_not_null_in_views config option
statement_restrictions: add get_not_null_columns()
test: remove invalid IS NOT NULL restrictions from tests
This is a regression introduced in f26179cd27.
Fixes: #14136
* 'gleb/set_group0' of github.com:scylladb/scylla-dev:
test: restart first node to see if it can boot after restart
service: move setting of group0 point in storage_service earlier
In test_compaction_task.py, tests concerning the same type of compaction
are squashed together so that they are run synchronously and there is
no data race when the tests are run in parallel.
Add unit tests for the strict_is_not_null_in_views flag.
This flag controls the behavior in case of invalid
IS NOT NULL restrictions on a materialized view column.
Materialized views allow only restricting columns
that belong to the view's primary key, all other
restrictions should be rejected.
There was a bug where IS NOT NULL restrictions
weren't rejected, but simply ignored instead.
This flag controls what should happen when the user
runs a query with such an invalid IS NOT NULL restriction.
strict_is_not_null_in_views can have the following values:
* `true` - strict checking, invalid queries are rejected
* `warn` - the query is allowed, but a warning is printed
* `false` - the query is allowed, the invalid restrictions
are silently ignored.
The tests are based on the ones for strict_allow_filtering,
which reside in the lines preceding the newly added tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We used to allow IS NOT NULL restrictions on columns
that were not part of the materialized view's primary key.
It turns out that such restrictions are silently ignored (see #10365),
so we no longer allow such restrictions.
Update the documentation to reflect that change.
Also there was a mistake in the documentation.
It said that restrictions are allowed on all columns
of the base table's primary key, but they are actually
allowed on all columns of the view table's primary key,
not the base tables.
This change also fixes that mistake.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
IS NOT NULL is now allowed only on the view's primary key columns,
so the xfail marker can be removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
IS NOT NULL is currently allowed only
when creating materialized views.
It's used to convey that the view will
not include any rows that would make the
view's primary key columns NULL.
Generally materialized views allow
to place restrictions on the primary key
columns, but restrictions on the regular
columns are forbidden. The exception was
IS NOT NULL - it was allowed to write
regular_col IS NOT NULL. The problem is
that this restriction isn't respected,
it's just silently ignored.
Supporting IS NOT NULL on regular columns
seems to be as hard as supporting
any other restrictions on regular columns.
It would be a big effort, and there are some
reasons why we don't support them.
For now let's forbid such restrictions,
it's better to fail than be wrong silently.
Throwing a hard error would be a breaking change.
To avoid breaking existing code the reaction to
invalid IS NOT NULL restrictions is controlled
by the `strict_is_not_null_in_views` flag.
The default values for this flag are `warn` in db::config
and `true` in scylla.yaml.
This way the existing clusters will have `warn` by default,
so they'll get a warning if they try to create such an
invalid view.
New clusters with fresh scylla.yaml will have the flag set
to `true`, as scylla.yaml overwrites the default value
in db::config.
New clusters will throw a hard error for invalid views,
but in older existing clusters it will just be a warning.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Validation of a CREATE MATERIALIZED VIEW statement takes place inside
the prepare_schema_mutations() method.
I would like to generate warnings during this validation, but there's
currently no way to pass them.
Let's add one more return value - a vector of CQL warnings generated
during the execution of this statement.
A new alias is added to make it clear what the function is returning:
```c++
// A vector of CQL warnings generated during execution of a statement.
using cql_warnings_vec = std::vector<sstring>;
```
Later the warnings will be sent to the user by the function
schema_altering_statement::execute(), which is the only caller
of prepare_schema_mutations().
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
IS NOT NULL shouldn't be allowed on columns
which are outside of the materialized view's primary key.
It's currently allowed to create views with such restrictions,
but they're silently ignored, which is a bug.
In the following commits restricting regular columns
with IS NOT NULL will be forbidden.
This is a breaking change.
Some users might have existing code that creates
views with such restrictions, we don't want to break it.
To deal with this a new feature flag is introduced:
strict_is_not_null_in_views.
By default it's set to `warn`. If a user tries to create
a view with such invalid restrictions they will get a warning
saying that this is invalid, but the query will still go through,
it's just a warning.
The default value in scylla.yaml will be `true`. This way new clusters
will have strict enforcement enabled and they'll throw errors when the
user tries to create such an invalid view.
Old clusters without the flag present in scylla.yaml will
have the flag set to warn, so they won't break on an update.
There's also the option to set the flag to `false`. It's dangerous,
as it silences information about a bug, but someone might want it
to silence the warnings for a moment.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
we have a dedicated facility for loading sstables since
68dfcf5256, and column_family (i.e. table)
is not responsible for loading new sstables. so update the comment
to reflect this change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14154
At that level no io_priority_class-es exist. Instead, all the IO happens
in the context of the current sched-group. The file API no longer accepts a prio
class argument (and makes the io_intent arg mandatory for impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-automatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass the intent to the lower file (if applicable)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairy -- it needs to use the task queue
list for IO class names and shares, and to detect whether it should, it
checks whether the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13963
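The mechanical file_impl patch can be illustrated with stand-in types. These are sketches only, not the real Seastar classes or signatures (the real API returns futures and lives in seastar's file layer):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for seastar's io_intent; the real one carries cancellation state.
struct io_intent {};

struct file_impl {
    // old signature (removed):
    //   read_dma(pos, buf, len, const io_priority_class& pc, io_intent* intent)
    // new signature: the IO is accounted to the current scheduling group, so
    // only the (nullable) io_intent* is still passed down.
    virtual std::size_t read_dma(std::size_t pos, void* buf, std::size_t len,
                                 io_intent* intent) = 0;
    virtual ~file_impl() = default;
};

// A trivial concrete impl that pretends the whole buffer was read.
struct null_file : file_impl {
    std::size_t read_dma(std::size_t, void*, std::size_t len, io_intent*) override {
        return len;
    }
};

// A wrapping file_impl (like the patched inheritants) simply forwards the
// intent to the lower file.
struct checked_file : file_impl {
    file_impl& lower;
    explicit checked_file(file_impl& f) : lower(f) {}
    std::size_t read_dma(std::size_t pos, void* buf, std::size_t len,
                         io_intent* intent) override {
        return lower.read_dma(pos, buf, len, intent);
    }
};
```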
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)
Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.
The fix is about not inserting shared sstables into the cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor should they be, because
resharding itself cleaned up its input files).
Dtest: https://github.com/scylladb/scylla-dtest/pull/3206
Fixes #14001.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14147
group0 pointer in storage_service should be set when group0 starts.
After f26179cd27 we start group0 earlier,
so we need to move setting of the group0 pointer as well.
Due to a simple programming oversight, one of the keyspace_metadata
constructors was using an empty user_types_metadata instead of the
passed one. Fix that.
Fixes #14139
Closes #14143
as an alternative to passing the link-args using the environment variable,
we can also use build script to pass the "-C link-args=<FLAG>" to the compiler.
see https://doc.rust-lang.org/nightly/cargo/reference/build-scripts.html#cargorustc-link-argflag
to ensure that cargo is called again by ninja after build.rs is
updated, build.rs is added as a dependency of the {wasm} files along with
Cargo.lock.
this change is verified using the following command
```
RUSTFLAGS='--print link-args' cargo build \
--target=wasm32-wasi \
--example=return_input \
--locked \
--manifest-path=Cargo.toml \
--target-dir=build/cmake/test/resource/wasm/rust
```
the output includes "-zstack-size=131072" in the arguments passed to lld:
```
Compiling examples v0.0.0 (/home/kefu/dev/scylladb/test/resource/wasm/rust)
LC_ALL="C"
PATH="/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin:/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin/self-contained:/home/kefu/.local/bin:/home/kefu/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin"
VSLANG="1033"
"lld"
"-flavor" "wasm" "--rsp-quoting=posix" "--export"
"_scylla_abi" "--export" "_scylla_free" "--export" "_scylla_malloc"
"--export" "return_input" "-z" "stack-size=1048576" "--stack-first"
"--allow-undefined" "--fatal-warnings" "--no-demangle"
...
"-L" "/usr/lib/rustlib/wasm32-wasi/lib"
"-L" "/usr/lib/rustlib/wasm32-wasi/lib/self-contained"
"-o"
"/home/kefu/dev/scylladb/build/cmake/test/resource/wasm/rust/wasm32-wasi/debug/examples/return_input-ef03083560989040.wasm"
"--gc-sections"
"--no-entry"
"-O0"
"-zstack-size=131072"
```
with this change, it'd be easier to build .wat files in CMake, so
we don't need to repeat the settings in both configure.py and
CMakeLists.txt
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14123
A recent Seastar update added the rpc::server::shutdown() method, which only isolates the server from the network but lets all internal handler callbacks continue running until stop() is called. This patch makes use of it in the messaging service, by calling this new shiny shutdown() in its shutdown() and calling the good old stop() in its stop().
This intentionally prevents scylla from freezing on drain in case some RPC handler gets stuck. It may later freeze on stop(), but that's less horrible. Also, chances are that by stop time some other handler's dependencies will have been drained/shut down, so the handler can wake up and stop normally.
Fixes: #14031
Closes #14115
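The two-phase teardown above can be sketched without any real networking. The class and member names here are hypothetical, chosen only to mirror the shutdown()/stop() split described in the commit:

```cpp
#include <cassert>

// Illustrative two-phase teardown: shutdown() only isolates the server from
// the network (no new work is accepted), while stop() can complete only once
// the in-flight handlers have drained.
class rpc_server_like {
    bool _accepting = true;
    int _in_flight = 0;
public:
    // A new connection arrives; rejected after shutdown().
    bool accept() {
        if (_accepting) {
            ++_in_flight;
        }
        return _accepting;
    }
    void handler_done() { --_in_flight; }  // an in-flight handler finishes
    void shutdown() { _accepting = false; } // phase 1: isolate from network
    bool stop() const { return _in_flight == 0; } // phase 2: done when drained
};
```

The drain path calls shutdown() early, so a stuck handler no longer blocks it; only the final stop() still waits for the handler.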
* github.com:scylladb/scylladb:
messaging_service: Shutdown rpc server on shutdown
messaging_service: Generalize stop_servers()
messaging_service: Restore indentation after previous patch
messaging_service: Coroutinize stop()
messaging_service: Coroutinize stop_servers()
this series replaces hard-coded values with variables. we will need to expand this test to cover more test cases when working on tiered-storage.
Closes #14137
* github.com:scylladb/scylladb:
s3/test: use variable for inserted data
s3/test: replace test_ks and test_cf with variables
s3/test: introduce format_tuples() for formatting CQL queries
instead of hardwiring the dataset in test, let's define them with
variables and use the variables instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in order to make the data set for testing more visible, format_tuples() is
introduced for formatting a dict into a set of structured values
consumable by CQL.
this function is added to test/cql-pytest/util.py in the hope that it
can be reused by other tests using CQL.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In main.cc, we have early commands which want to run prior to initializing
Seastar.
Currently, perf_fast_forward breaks this, since it defines
"app_template app" as a global variable.
To avoid that, we should defer running app_template's constructor to
scylla_fast_forward_main().
Fixes #13945
Closes #14026
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes #13545.
Closes #14134
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they were not
increasing, while before they kept building up due to the lack of preemption.
Refs #7926
Fixes #13689
Closes #12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
Summary of the patch set:
- eliminates unneeded calls to rjson::find (~1% tps improvement in `perf-simple-query --write`)
- adds some very specific tests in this area (more general cases were covered already)
- fixes some minor validation bug
Fixes https://github.com/scylladb/scylladb/issues/13251
Closes #12675
* github.com:scylladb/scylladb:
alternator: fix unused ExpressionAttributeNames validation when used as a part of BatchGetItem
alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues
The DeleteTable operation in Alternator should return a TableDescription
object describing the table which has just been deleted, similar to what
DescribeTable returns.
Fixes scylladb#11472
Closes #11628
* tools/cqlsh 8769c4c2...6e1000f1 (5):
> build: erase uid/gid information from tar archives
> Add github action to update the dockerhub description
> cqlsh: Add extension handler for "scylla_encryption_options"
> requirements.txt: update python-driver==3.26.0
> Add support for arm64 docker image
Closes #13878
we compile .wat files from .rs and .c source files since
6d89d718d9.
these .wat are used by test/cql-pytest/test_wasm.py . let's update
the CMake build system accordingly so these .wat files can also
be generated using the "wasm" target. since the ctest system is
not used, this change should allow us to perform this test manually.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14126
before this change, when consolidating the boost XML logger files,
we just practically concatenated all the tests' logger files into a single
one. sometimes, we run the tests multiple times, and these runs share
the same TestSuite and TestCase tags. this has two consequences:
1. there is a chance that a test has both successful and failed
runs, but jenkins' "Test Results" page cannot identify the failed
run; it just picks a random run when one clicks for the details of
the run, as it takes the TestCase's name as part of its identifier,
and we have multiple of them if the argument passed to the --repeat
option is greater than 1 -- this is the case when we promote the
"next" branch.
2. the testReport page of Jenkins' xUnit plugin created for the "next"
job is 3 times as large as the one for the regular "scylla-ci" run,
as all tests are repeated 3 times. but what we really care about is
the history of a certain test, not a certain run of it.
in this change, we just pick a representative run of a test if it is
repeated multiple times and add a "Message" tag including the
summary of the runs. this should address the problems above:
1. the failed tests always stand out, so we can always pinpoint them with
Jenkins's "Test Results" page.
2. the tests are deduped by their names.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14069
As per our roll out plan, make consistent_cluster_management (aka Raft
for schema changes) the default going forward. It means all
clusters which upgrade from the previous version and don't have
`consistent_cluster_management` explicitly set in scylla.yaml will begin
upgrading to Raft once all nodes in the cluster have moved to the new
version.
Fixes #13980
Closes #13984
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop, call the regular stop() to
finalize the stopping process.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename it to do_with_servers() and make it accept a method to call and a message
to print. This gives the ability to reuse this helper in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't block the thread, as that prevents concurrent tests from running
during this time. Use the dedicated `run_async` instead.
Also to silence `mypy` which complains that `manager.cql` is `Optional`
(so in theory might be `None`, e.g. after `driver_close`), use
`manager.get_cql()`.
Closes #14109
The series includes mostly cleanups and one bug fix.
The fix is for the race where messages that need to access the group0 server arrive
before the server is initialized.
* 'gleb/group0-sp-mm-race-v2' of github.com:scylladb/scylla-dev:
service: raft: fix typo
service: raft: split off setup_group0_if_exist from setup_group0
storage_service: do not allow override_decommission flag if consistent cluster management is enabled
storage_service: fix indentation after the previous patch
storage_service: co-routinize storage_service::join_cluster() function
storage_service: do not reload topology from peers table if topology over raft is enabled
storage_service: optimize debug logging code in case debug log is not enabled
In production environments the Scylla boot procedure includes various
sleeps such as 'ring delay' and 'waiting for gossip to settle'. We
disable those sleeps in test.py tests and we'd also like to disable
them, if possible, in dtests.
Unfortunately, disabling the sleeps causes problems with schema: a
bootstrapping node creates its own versions of distributed keyspaces and
tables (such as `system_distributed`) because it doesn't first wait for
gossip to settle, during which it would usually pull existing schemas of
those keyspaces/tables from existing nodes. This may cause schema
disagreement for the whole duration of the bootstrap procedure (the
other nodes don't pull schema from a bootstrapping node; pulls are only
allowed once it becomes NORMAL), which causes the bootstrapping node to
constantly pull schema in attempts to synchronize, which doesn't work
because it's the other nodes which don't have schema mutations, not this
node. Even when the bootstrapping node finishes, the existing nodes
won't automatically pull schema from that node - only once we perform
another schema change a pull will be triggered.
The continuous pulls and the lack of schema synchronization until manual
schema change cause problems in tests. For example we observed the test
timing out in debug mode because bootstrap took too long due to the node
having to perform ~700 schema pulls (it attempts to synchronize schema
on each range repair). There's also potential for permanent schema
divergence, although I haven't seen this yet - in my experiments, once
the existing nodes pull from the new node, schema would always converge.
In any case, the safe and robust solution is to ensure that the
bootstrapping node pulls schema from existing nodes early in the boot
procedure. Then it won't try to create its own versions of the
distributed keyspaces/tables because it'll see they are already present
in the cluster.
In fact there already is `storage_service::wait_for_ring_to_settle`
which is supposed to wait until schema is in agreement before
proceeding.
However, this schema agreement wait relied on an earlier wait at the
beginning of the function - for a node to show up in gossiper
(otherwise, if we're the only node in gossiper, the schema agreement
wait trivially finishes immediately).
Unfortunately, this wait would time out after `ring_delay` and proceed,
even if no other node was observed, instead of throwing an error...
To make it safe, modify the logic so if we timeout, we refuse to
bootstrap. To make it work in tests which set `ring_delay` to 0, make it
independent of `ring_delay` - just set the timeout to 5 minutes.
Fixes #14065
Fixes #14073
Closes #14105
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.
The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.
The patch just moves the index creation statements
from the second part to the first.
Fixes: #14076
Closes #14090
The function storage_service::wait_for_ring_to_settle() is called when
bootstrapping a new node in an existing cluster, and it's supposed to
wait until the caller has the right schema - to allow the bootstrap
to start (the bootstrap needs to copy all existing tables from other
nodes).
The code of this function mostly checks in-memory structures in the
gossiper and migration manager, and if they aren't ready, sleeps and
tries again (until a timeout of "ring_delay_ms"). Today we sleep a
whole second between each try, but that's excessive - the checks are
very cheap, and we can do them much more often, so we can stop the
loop much closer to when the schema becomes available.
This patch changes the sleep from 1 second to 10 milliseconds.
The benefit of this patch is not huge - on average I measured about
0.25 seconds saving on adding a node to a cluster. But I don't see
any downside either.
Noticed while looking into Refs #14073
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14101
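The polling loop described above can be sketched as a small helper. This is a hypothetical, single-threaded stand-in for the storage_service logic, not the actual code (which runs on Seastar futures rather than std::thread sleeps):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll a cheap readiness check every 10ms (instead of every second), up to
// the given deadline. Returns true as soon as the condition holds.
bool wait_for(std::function<bool()> ready, std::chrono::milliseconds timeout) {
    using clock = std::chrono::steady_clock;
    auto deadline = clock::now() + timeout;
    while (clock::now() < deadline) {
        if (ready()) {
            return true;
        }
        // The checks are very cheap, so a short sleep lets us stop the loop
        // much closer to the moment the schema becomes available.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return ready();  // one last check at the deadline
}
```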
read_mutation_from_flat_mutation_reader might throw
so we need to close the reader returned from
ms.make_fragment_v1_stream also on the error
path to avoid the internal error abort when the
reader is destroyed while still open.
Fixes #14098
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14099
In shard compaction tasks, per-table tasks will be created all at once
and then they will wait for their turn to run.
A function that allows waking up tasks one after another and a function
that makes a task wait for its turn are added.
Currently setup_group0 is responsible for starting the existing group0 on restart,
or creating a new one and joining the cluster with it during bootstrap. We
want to create the server for the existing group0 earlier, before we start
to accept messages, because some messages may assume that the server
already exists. For that we split the creation of the existing group0 server into
a separate function and call it on restart before the messaging service
starts accepting messages.
Fixes: #13887
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes #14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14038
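The shape of the fix can be sketched with a mutable input set. The names below (`cleanup_job`, `on_sstable_cleaned`) are hypothetical, and sstables are reduced to strings for illustration:

```cpp
#include <cassert>
#include <set>
#include <string>

// Sketch: the retryable cleanup job keeps a mutable input set and erases each
// sstable once incremental compaction has exhausted it, so the reference is
// dropped (no more 100% space overhead) and a retry on failure only
// re-processes the sstables that were not yet cleaned up.
struct cleanup_job {
    std::set<std::string> input;  // sstables still to be cleaned

    void on_sstable_cleaned(const std::string& sst) {
        input.erase(sst);  // released: no longer pins disk space
    }

    // On failure, the retry operates on whatever is left in `input`.
    const std::set<std::string>& remaining() const { return input; }
};
```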
The command prints the segment_manager address, because it's the manager
that is of interest, not the db::commitlog itself. Also, it prints out all
found segments, just for convenience -- segments are in a vector of
shared pointers and it's handy to have the object addresses instantly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14088
By default `idl-compiler.py` emits code to pass parameters by value. There was an attribute `[[ref]]`, which makes it use `const&`, but it was not used systematically and in many cases parameters were redundantly copied. In this PR, all `verb` directives have been reviewed and the `[[ref]]` attribute has been added where it makes sense.
The parameters [are serialised synchronously](https://github.com/scylladb/seastar/blob/master/include/seastar/rpc/rpc_impl.hh#L471) so there should be no lifetime issues. This was not the case before, but the behaviour changed in [this commit](3942546d41). Now it's not a problem to get an object by reference when using `send_` methods.
Fixes: #12504
Closes #14003
* github.com:scylladb/scylladb:
tracing::trace_info: pass by ref
storage_proxy: pass inet_address_vector_replica_set by ref
raft: add [[ref]] attribute
repair: add [[ref]] attribute
forward_request: add [[ref]] attribute
storage_proxy: paxos:: add [[ref]] attribute
storage_proxy: read_XXX:: make read_command [[ref]]
storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
storage_proxy: mutation:: make frozen_mutation [[ref]]
occasionally, we are observing build failures like:
```
17:20:54 FAILED: build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz
17:20:54 dist/debuginfo/scripts/create-relocatable-package.py --mode release 'build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz'
17:20:54 Traceback (most recent call last):
17:20:54 File "/jenkins/workspace/scylla-master/scylla-ci/scylla/dist/debuginfo/scripts/create-relocatable-package.py", line 60, in <module>
17:20:54 os.makedirs(f'build/{SCYLLA_DIR}')
17:20:54 File "<frozen os>", line 225, in makedirs
17:20:54 FileExistsError: [Errno 17] File exists: 'build/scylla-debuginfo-package'
```
to understand the root cause better, instead of swallowing the error,
let's raise the exception if it is not caused by a non-existing directory.
a similar change was applied to scripts/create-relocatable-package.py
in a0b8aa9b13, which was correct per se.
but the original intention was to understand the root cause of the
failure when packaging scylla-debuginfo-*.tar.gz, which is created
by the dist/debuginfo/scripts/create-relocatable-package.py.
so, in this change, the change is ported to this script.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14082
In compaction logs, a table is identified by {keyspace}.{table_id}.
Instead, the table name should be used in run_on_existing_tables
logs. To do so, the task manager's compaction tasks use table_info
instead of table_id.
The keyspace argument is copied into run_on_existing_tables
to ensure it stays alive.
Closes #13816
* github.com:scylladb/scylladb:
compaction: use table_info in compaction tasks
api: move table_info to schema/schema_fwd.hh
CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports
an issue on how the default argument is evaluated. this problem is
more obvious when it comes to how `std::source_location::current()`
is evaluated as a default argument. but not all compilers have the
same behavior, see https://godbolt.org/z/PK865KdG4.
notably, clang-15 evaluates the default argument at the callee
site. so we need to check the capability of compiler and fall back
to the one defined by util/source_location-compat.hh if the compiler
suffers from CWG 2631. and clang-16 implemented CWG2631 in
https://reviews.llvm.org/D136554. But unfortunately, this change
was not backported to clang-15.
before switching over to clang-16, to use std::source_location::current()
as the default parameter and get the behavior defined by CWG 2631,
we have to use the compatibility layer provided by Seastar. otherwise
we always end up having the source_location at the callee side, which
is not interesting under most circumstances.
so in this change, all places using the idiom of passing
std::source_location::current() as the default parameter are changed
to use seastar::compat::source_location::current(). although
we have `#include "seastarx.h"` for opening the seastar namespace,
to disambiguate from the "namespace compat" defined somewhere in scylladb,
the fully qualified name of
`seastar::compat::source_location::current()` is used.
see also 09a3c63345, where we used
std::source_location as an alias of std::experimental::source_location
if it was available. but this does not apply to the settings of our
current toolchain, where we have GCC-12 and Clang-15.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14086
When run like 'open-coredump.sh --help' the options parsing loop doesn't
run because $# == 1 and [ $# -gt 1 ] evaluates to false.
The simplest fix is to parse -h|--help on its own, as the options parsing
loop assumes that there's a core-file argument present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14075
read_command, partition_key and paxos::proposal
are marked with [[ref]]. partition_key contains
dynamic allocations and can be big. proposal
contains frozen_mutation, so it also
contains dynamic allocations.
The call sites are fine, they already pass
by reference.
We had redundant copies at the call sites of
these methods. Class read_command does not
contain dynamic allocations, but it's quite
big by itself (368 bytes).
We had a redundant copy in the receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrarily large.
Fixes: #12504
Task manager compaction tasks need table names for logs.
Thus, compaction tasks store table infos instead of table ids.
get_table_ids function is deleted as it isn't used anywhere.
because we build the stripped and non-stripped packages in parallel
using ninja, there are chances that the non-stripped build job could
be adding the build/node_exporter directory to the tarball while the job
building the stripped package is using objcopy to extract the symbols from
the build/node_exporter/node_exporter executable. objcopy creates
temporary files when processing the executables, and these temporary
files can be spotted by the non-stripped build job. there are two
consequences:
1. the non-stripped build job includes the temporary files in its tarball,
even though they are not supposed to be distributed
2. the non-stripped build job fails to include the temporary file(s), as
they are removed after objcopy finishes its job. but the job did spot
them when preparing the tarball, so when the tarfile python module
tries to include the previously found temporary file(s), it throws.
neither of these consequences is expected. fortunately, this only
happens when packaging the non-stripped package: when packaging the
stripped package, the build/node_exporter directory is not in flux
anymore, as ninja ensures the dependencies between the jobs.
so, in this change, we do not add the whole directory when packaging
the non-stripped version, as all its ingredients have been added
separately as regular files. when packaging the stripped version,
we still use the existing step, as we don't have to list all the
files created by strip.sh:
node_exporter{,.debug,.dynsyms,.funcsyms,.keep_symbols,.minidebug.xz}
we could do so in this script, but the repetition is unnecessary and
error-prone. so, let's keep including the whole directory recursively,
so all the debug symbols are included.
Fixes#14079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose:
to make it possible to cancel a handler for a view update write,
which means we stop waiting for a response to the write, timing out
the handler immediately. This was done to solve an issue with node
shutdown hanging because it was waiting for a view update to finish;
view updates were configured with a 5 minute timeout. See #3966, #4028.
Now we're having a similar problem with hint updates causing shutdown
to hang in tests (#8079).
`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy`
that a node is down.
To make it possible to reuse this algorithm for other handlers, move
the functionality into `abstract_write_response_handler`. We inherit
from `bi::list_base_hook`, which introduces a small memory overhead
(2 pointers) to each write handler; it was previously only paid by
view update handlers. But those handlers are already quite large, so
the overhead is small compared to their size.
Use this new functionality to also cancel hint write handlers when we
shutdown. This fixes#8079.
Closes#14047
* github.com:scylladb/scylladb:
test: reproducer for hints manager shutdown hang
test: pylib: ScyllaCluster: generalize config type for `server_add`
test: pylib: scylla_cluster: add explicit timeout for graceful server stop
service: storage_proxy: make hint write handlers cancellable
service: storage_proxy: rename `view_update_handlers_list`
service: storage_proxy: make it possible to cancel all write handler types
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probability, which blocks other work.
Fixes#13919Reopens#13747
When scylla starts it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in logs and exiting with a non-zero code. However, that's not an
error; the start was just interrupted earlier than expected by
the stop_signal thing.
fixes: #12898
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14034
Currently, partitioned_sstable_set::insert may erase a sstable
from the set inadvertently, if an exception is thrown while
(re-)inserting it.
To prevent that, simply return early after detecting that
the insertion didn't take place, based on the unordered_set::insert
result.
This issue is theoretical, as there are no known cases
of re-inserting sstables into the partitioned sstable set.
Fixes#14060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14061
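The shape of the fix can be sketched with plain standard containers (hypothetical simplified types, not the actual `partitioned_sstable_set` code): consult the `bool` in `unordered_set::insert`'s result and return early, so any cleanup path only ever erases an element that this very call inserted:

```cpp
#include <cassert>
#include <unordered_set>

// returns true only if this call actually inserted the element; when it
// returns false the set was left untouched, so no error-handling path
// may erase the pre-existing element
bool insert_once(std::unordered_set<int>& s, int v) {
    auto [it, inserted] = s.insert(v);
    if (!inserted) {
        return false;  // already present: do nothing, and never erase it
    }
    // ... (re-)insertion work into secondary structures would go here;
    // if it throws, erasing *it is safe, because we know this call is
    // the one that created the entry
    return true;
}
```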
Task manager's tasks that have a parent task inherit the sequence number
from their parents. Thus they do not need to have a new sequence number
generated, as it will be overwritten anyway.
Closes#14045
libxcrypt is used by auth subsystem, for instance, `crypt_r()` provided
by this library is used by passwords.cc. so let's link against it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14030
The _update_timer callback calls adjust(), which
depends on _current_backlog, and currently _current_backlog is
destroyed before _update_timer.
This is benign since there are no preemption points in
the destructor, but it's more correct and elegant
to destroy the timer first, before other members it depends on.
Fixes#14056
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14057
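The rule being relied on can be demonstrated in isolation (a sketch with tracing stand-ins, not the actual backlog/timer types): C++ destroys non-static data members in reverse declaration order, so declaring the timer last guarantees it is destroyed first:

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> destroyed;  // records destruction order

struct tracer {
    std::string name;
    ~tracer() { destroyed.push_back(name); }
};

// members are destroyed in reverse declaration order, so the member
// whose callback depends on the others (the timer) is declared last
struct controller {
    tracer backlog{"backlog"};
    tracer timer{"timer"};  // declared last => destroyed first
};
```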
occasionally, we are observing build failures like:
```
17:20:54 FAILED: build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz
17:20:54 dist/debuginfo/scripts/create-relocatable-package.py --mode release 'build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz'
17:20:54 Traceback (most recent call last):
17:20:54 File "/jenkins/workspace/scylla-master/scylla-ci/scylla/dist/debuginfo/scripts/create-relocatable-package.py", line 60, in <module>
17:20:54 os.makedirs(f'build/{SCYLLA_DIR}')
17:20:54 File "<frozen os>", line 225, in makedirs
17:20:54 FileExistsError: [Errno 17] File exists: 'build/scylla-debuginfo-package'
```
to understand the root cause better, instead of swallowing the error,
let's re-raise the exception if it is not caused by a non-existing directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13978
This series encapsulates ks/cf directory creation and deletion into keyspace and table class methods. This is needed to facilitate making the storage initialization storage-type aware in the future. It also makes the replica/ code less involved in formatting sstables' directory paths by hand.
refs: #13020
refs: #12707Closes#14048
* github.com:scylladb/scylladb:
keyspace: Introduce init_storage()
keyspace: Remove column_family_directory()
table: Introduce destroy_storage()
table: Simplify init_storage()
table: Coroutinize init_storage()
table: Relocate ks.make_directory_for_column_family()
distributed_loader: Use cf.dir() instead of ks.column_family_directory()
test: Don't create directory for system tables in cql_test_env
if we just want to build a single test and scylla executables, we
might want to use `configure.py` like:
./configure.py --mode debug --compiler clang++ --with scylla --with test/boost/database_test
which generates `build.ninja` for us, with the following rules:
build $builddir/debug/test/boost/database_test_g: link.debug ... | $builddir/debug/seastar/libseastar.so
$builddir/debug/seastar/libseastar_testing.so
libs = $seastar_libs_debug $libs -lthrift -lboost_system $seastar_testing_libs_debug
libs = $seastar_libs_debug
but the last line prevents database_test_g from linking against
the third-party libraries like libabsl, which could have been
pulled in by $libs. the second assignment expression just
makes the value of `libs` identical to that of `seastar_libs_debug`,
but that library does not include the libraries which are only
used by scylla. so we could run into a link failure with the
`build.ninja` generated with this command line, like:
```
FAILED: build/debug/test/boost/database_test_g
...
ld.lld: error: undefined symbol: seastar::testing::entry_point(int, char**)
>>> referenced by scylla_test_case.hh:22 (./test/lib/scylla_test_case.hh:22)
>>> build/debug/test/boost/database_test.o:(main)
...
ld.lld: error: undefined symbol: boost::unit_test::unit_test_log_t::set_checkpoint(boost::unit_test::basic_cstring<char const>, unsigned long, boost::unit_tes
t::basic_cstring<char const>)
>>> referenced by database_test.cc:298 (test/boost/database_test.cc:298)
>>> build/debug/test/boost/database_test.o:(require_exist(seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool))
...
```
with this change, the extra assignment expression is dropped. this
should not cause any regression, as f'$seastar_libs_{mode}' has
been included as a part of `local_libs` before the grand if-then-else
block in the for loop before this `f.write()` statement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14041
This reverts commit 3c54d5ec5e.
The reverted change fixed the FTBFS of the test in question with Clang 16,
which rightly stopped converting the LHS of `"hello" == sstring{"hello"}` to
a type acceptable to the member operator even though we have a
constructor for this conversion, like
class sstring {
public:
sstring(const char*);
bool operator==(const sstring&) const;
bool operator!=(const sstring&) const;
};
because we have an operator!=, as per the draft of C++ standard
https://eel.is/c++draft/over.match.oper#4 :
> A non-template function or function template F named operator==
> is a rewrite target with first operand o unless a search for the
> name operator!= in the scope S from the instantiation context of
> the operator expression finds a function or function template
> that would correspond ([basic.scope.scope]) to F if its name were
> operator==, where S is the scope of the class type of o if F is a
> class member, and the namespace scope of which F is a member
> otherwise.
in 397f4b51c3, the seastar submodule was
updated; in it, we now have a dedicated overload for the `const char*`
case, so the compiler is now able to compile expressions like
`"hello" == sstring{"hello"}` in C++20.
so, in this change, the workaround is reverted.
Closes#14040
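The effect can be illustrated with a self-contained stand-in (a hypothetical `tiny_string`, not seastar's actual `sstring`): the converting constructor lets a `const char*` operand be compared via the member `operator==`, and in C++20 the reversed form `"hello" == tiny_string{"hello"}` would also compile through rewritten candidates, unless a user-declared `operator!=` suppresses the rewrite as the quoted [over.match.oper] wording describes:

```cpp
#include <cassert>
#include <string>

// stand-in for the sstring case discussed above (hypothetical type)
struct tiny_string {
    std::string value;
    tiny_string(const char* s) : value(s) {}  // converting constructor
    bool operator==(const tiny_string& other) const {
        return value == other.value;
    }
    // note: declaring a matching operator!= here would, per
    // [over.match.oper]/4, stop this operator== from being used as a
    // rewrite target, so `"hello" == tiny_string{"hello"}` (with the
    // const char* on the left) would fail to compile in C++20
};
```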
When erasing an sstable, first check if its run_id
exists in _all_runs, otherwise do nothing in
that respect; then, if the run becomes empty
when erasing its last sstable (and it could have been
a single-sstable run from the get-go), erase the run
from `_all_runs`.
Fixes#14052
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14054
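The control flow reduces to this sketch (plain standard containers standing in for `_all_runs` and the sstable run machinery; all names are hypothetical):

```cpp
#include <cassert>
#include <map>
#include <set>

using run_map = std::map<int, std::set<int>>;  // run_id -> sstables in the run

// erase `sst` from its run; if the run_id is unknown, do nothing, and
// if the run becomes empty (it may have held a single sstable from the
// get-go), drop the run entry itself
void erase_sstable(run_map& all_runs, int run_id, int sst) {
    auto it = all_runs.find(run_id);
    if (it == all_runs.end()) {
        return;  // nothing to do for an unknown run
    }
    it->second.erase(sst);
    if (it->second.empty()) {
        all_runs.erase(it);  // last sstable gone: remove the run too
    }
}
```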
If the binary of a matching unit or boost test is not executable,
warn to the console and skip it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13982
If topology over raft is enabled, the source of truth for the
topology is in the group0 state machine and no other code should
create topology metadata.
If server shutdown hangs, the `manager.server_stop_gracefully` call
would eventually (after 5 minutes) time out with a cryptic
`TimeoutError`; it's a generic timeout for performing requests by the
tests to `ScyllaClusterManager`. It was non-obvious how to find what
actually caused the timeout - you'd have to browse multiple logs.
Introduce an explicit timeout in `ScyllaServer.stop_gracefully`. Set it
to 1 minute. Whether this is a good value may be arguable, but shutdown
taking longer than that probably indicates problems. The important thing
is that this timeout is shorter than the generic request timeout.
If this times out we get a nice error in the test:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/server/1/stop_gracefully, params: None, json: None, body:
E Stopping server ScyllaServer(1, 127.162.40.1, 826d5884-4696-4a22-80a7-cc872aa43102) gracefully took longer than 60s
```
Whether a write handler should be cancellable is now controlled by a
parameter passed to `create_write_response_handler`. We plumb it down
from `send_to_endpoint` which is called by hints manager.
This will cause hint write handlers to time out immediately when we
shut down or when a destination node is marked as dead.
Fixes#8079
The list will be used for non-view-update write handlers as well, so
generalize the name. Also generalize some variable names used in the
implementation.
This commit only renames things + some comments were added,
there are no logical changes.
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose: to
make it possible to cancel a handler for a view update write, which
means we stop waiting for a response to the write, timing out the
handler immediately. This was done to solve an issue with node shutdown
hanging because it was waiting for a view update to finish; view updates
were configured with a 5 minute timeout. See #3966, #4028.
Now we're having a similar problem with hint updates causing shutdown to
hang in tests (#8079).
`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy` that
a node is down.
To make it possible to reuse this algorithm for other handlers, move the
functionality into `abstract_write_response_handler`. We inherit from
`bi::list_base_hook`, which introduces a small memory overhead (2
pointers) to each write handler; it was previously only paid by view
update handlers. But those handlers are already quite large, so the
overhead is small compared to their size.
Not all handlers are added to the cancelling list, this is controlled by
the `cancellable` parameter passed to the constructor. For now we're
only cancelling view handlers as before. In following commits we'll also
cancel hint handlers.
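The registration-and-cancel pattern looks roughly like this sketch (a `std::list` of pointers standing in for `boost::intrusive::list` and for the real handler types; `cancellable` mirrors the constructor parameter described above):

```cpp
#include <cassert>
#include <list>

// handlers that opt in register themselves on a shared list; on
// shutdown (or when a node is marked down) we walk the list and time
// every registered handler out immediately
struct write_handler {
    bool timed_out = false;
    std::list<write_handler*>* registry = nullptr;
    std::list<write_handler*>::iterator pos;

    write_handler(std::list<write_handler*>& cancellable_list, bool cancellable) {
        if (cancellable) {
            registry = &cancellable_list;
            pos = registry->insert(registry->end(), this);
        }
    }
    ~write_handler() {
        if (registry) {
            registry->erase(pos);  // unregister on completion
        }
    }
};

void cancel_all(std::list<write_handler*>& cancellable_list) {
    for (auto* h : cancellable_list) {
        h->timed_out = true;  // the real code triggers the timeout path
    }
}
```

An intrusive list avoids the per-node allocation of `std::list`, but the registration/unregistration discipline is the same.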
instead of using BOOST_REQUIRE() use, for instance
BOOST_REQUIRE_NE() and BOOST_REQUIRE_EQUAL() for better
error message when the test fails, as Boost::test would
print out the LHS and RHS of the comparison expression
if it fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14050
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/
Reviewed-by: Konstantin Osipov <kostja@scylladb.com>
* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
storage_service: avoid unneeded copies in on_change
storage_service: remove check that is always true
storage_service: rename handle_state_removing to handle_state_removed
storage_service: avoid string copy
storage_service: delete code that handled REMOVING_TOKENS state
gossiper: remove code related to advertising REMOVING_TOKEN state
migration_manager: add wait_for_schema_agreement() function
There's a helper verification_error() that prints a warning and returns
an exceptional future. It is converted into a void one that throws.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to the table class, the keyspace class also needs to create
a directory for itself for some reason. It looks excessive, as table
creation would call recursive_touch_directory() and would create the ks
directory too, but this call is there
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's no longer used outside of make_column_family_config(). So as not to
encourage people to use it, drop it and open-code it into that single
caller
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a table is DROP-ed, the directory with all its sstables is removed
(unless it contains snapshots). Wrap this into a table.destroy_storage()
method; later it will need to become sstable::storage-specific
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's no need to copy the datadirs vector to call parallel_for_each
on it. datadirs[0] is in fact the datadir field.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method, which initializes storage for a table, naturally belongs to
that class, so rename it while moving it. Also, there's no longer a need
to carry the table name and uuid as arguments; being a table method, it
can just get the paths to work on from the config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A BatchGetItem request is a map of table names and 'sub-requests'. ExpressionAttributeNames is defined at the
'sub-request' level, but the code was instead checking the top level, obtaining nullptr every time, which
effectively disabled the unused-names check.
Fixes#13251
rjson::find is not a very cheap function, it involves a bunch of function calls and loop iterations.
Overall it costs 120-170 instructions even for small requests. An example profile of alternator::executor::query
execution shows ~18 rjson::find calls, taking in total around 7% of query internal processing time (note that
JSON parse/print and http handling are not part of this function).
This patch eliminates 2 rjson::find calls for most request types. I saw a 1-2% tps improvement
in `perf-simple-query --write`; although it does improve tps, I suspect the real percentage is smaller and
I don't have much confidence in this particular number, as the observed benchmark variance is too high to measure it reliably.
We've observed sporadic failures of this test in CI related to driver reconnection after server restart.
Fixes#14032Closes#14027
* github.com:scylladb/scylladb:
test: test_tablets.py: Wait for driver to see the hosts after restart
test: test_tablets.py: Pass server id to server_restart()
test: test_tablets.py: Add missing await on server_restart()
Apparently, the driver may still be establishing connections in the
background after connecting to the cluster, and queries may fail with:
cassandra.cluster.NoHostAvailable
Replace reconnection with wait_for_cql_and_get_hosts(), which ensures
that the driver sees the host.
The current S3 uploading sink has an implicit limit on the final file size that comes from two places. First, the S3 protocol declares that upload part numbers range from 1 to 10000 (inclusive). Second, the uploading sink sends out parts once they grow above the S3 minimal part size, which is 5MB. Since sstables put data in 128kB (or smaller) portions, parts are almost exactly 5MB in size, so the total upload size cannot grow above ~50GB. That's too low.
To break the limit, the new sink (called the jumbo sink) uses the UploadPartCopy S3 call, which helps splice several objects into one right on the server. The jumbo sink starts uploading parts into an intermediate temporary object called a piece, named ${original_object}_${piece_number}. When the number of parts in the current piece grows above the configured limit, the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background; meanwhile a new piece is created and subsequent data is put into it. When the sink is flushed, the current piece is flushed as is and also squashed into the object.
The new jumbo sink is capable of uploading ~500TB of data, which looks like enough.
fixes: #13019Closes#13577
* github.com:scylladb/scylladb:
sstables: Switch data and index sink to use jumbo uploader
s3/test: Tune-up multipart upload test alignment
s3/test: Add jumbo upload test
s3/client: Wait for background upload fiber on close-abort
c3/client: Implement jumbo upload sink
s3/client: Move memory buffers to upload_sink from base
s3/client: Move last part upload out of finalize_upload()
s3/client: Merge do_flush() with upload_part()
s3/client: Rename upload_sink -> upload_sink_base
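The limits quoted above can be checked with quick arithmetic, under the simplifying assumption that a piece may also be assembled from the maximum number of parts (the configured per-piece limit is tunable in the real code, and other S3 per-part caps are ignored here):

```cpp
#include <cassert>

// S3 constants quoted in the text
constexpr long long part_size = 5LL * 1000 * 1000;  // minimal part size, ~5MB
constexpr long long max_parts = 10000;              // part numbers 1..10000

// plain multipart upload: every part is ~5MB, so the object tops out
// around 50GB
constexpr long long plain_limit = part_size * max_parts;

// jumbo sink: each of the object's parts is itself a piece spliced via
// UploadPartCopy, which (at the assumed maximum) multiplies the
// capacity to ~500TB
constexpr long long jumbo_limit = plain_limit * max_parts;
```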
This is a small follow-up for [this PR](https://github.com/scylladb/scylladb/pull/13715), it resolves some comments in the initial PR that didn't make their way into it.
* remove `noexcept` from `clear_gently`, since exceptions can be raised from move constructor;
* an optimisation for `vnode_effective_replication_map::get_range_addresses`, avoid redundant binary search.
Closes#14015
* github.com:scylladb/scylladb:
vnode_erm: optimize get_range_addresses
clear_gently: remove noexcept for rvalue references overload
* indent the nested paragraphs of list items
* use a table to format the time sequence for better
readability
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14016
"Enable_repair_based_node_ops" is the name of an option, and the leading
character should be a lowercase "e". so fix it.
Fixes#14017
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14018
After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded in one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.
This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade.
The series:
1. Does some code cleanup in the mutation_partition area.
2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
3. Adds upgrading variants of constructors and apply() for `row` and its wrappers.
4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.
Fixes#2577Closes#13761
* github.com:scylladb/scylladb:
test: mvcc_test: add a test for gentle schema upgrades
partition_version: make partition_entry::upgrade() gentle
partition_version: handle multi-schema snapshots in merge_partition_versions
mutation_partition_v2: handle schema upgrades in apply_monotonically()
partition_version: remove the unused "from" argument in partition_entry::upgrade()
row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades
partition_version: handle multi-schema entries in partition_entry::squashed
partition_snapshot_row_cursor: handle multi-schema snapshots
partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots
partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots
partition_version: add a logalloc::region argument to partition_entry::upgrade()
memtable: propagate the region to memtable_entry::upgrade_schema()
mutation_partition: add an upgrading variant of lazy_row::apply()
mutation_partition: add an upgrading variant of rows_entry::rows_entry
mutation_partition: switch an apply() call to apply_monotonically()
mutation_partition: add an upgrading variant of rows_entry::apply_monotonically()
mutation_fragment: add an upgrading variant of clustering_row::apply()
mutation_partition: add an upgrading variant of row::row
partition_version: remove _schema from partition_entry::operator<<
partition_version: remove the schema argument from partition_entry::read()
memtable: remove _schema from memtable_entry
row_cache: remove _schema from cache_entry
partition_version: remove the _schema field from partition_snapshot
partition_version: add a _schema field to partition_version
mutation_partition: change schema_ptr to schema& in mutation_partition::difference
mutation_partition: change schema_ptr to schema& in mutation_partition constructor
mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor
mutation_partition: add upgrading variants of row::apply()
partition_version: update the comment to apply_to_incomplete()
mutation_partition_v2: clean up variants of apply()
mutation_partition: remove apply_weak()
mutation_partition_v2: remove a misleading comment in apply_monotonically()
row_cache_test: add schema changes to test_concurrent_reads_and_eviction
mutation_partition: fix mixed-schema apply()
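The gentle-upgrade idea above reduces to this sketch (a list of per-version schema ids standing in for the real `partition_version` chain; all names are hypothetical): an upgrade no longer squashes the chain, it just prepends an empty version that carries the new schema:

```cpp
#include <cassert>
#include <forward_list>

// each element stands for a partition_version and holds its schema id
using version_chain = std::forward_list<int>;

// gentle upgrade: O(1) and preemption-friendly, because it only adds an
// empty head version with the new schema; older versions keep theirs
// and are upgraded incrementally later (by cursor reads and by the
// background MVCC version merge)
void upgrade(version_chain& chain, int new_schema_id) {
    chain.push_front(new_schema_id);
}
```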
A CREATE KEYSPACE query which specifies an empty string ('') as the replication factor value is currently allowed:
```cql
CREATE KEYSPACE bad_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': ''};
```
This is wrong, it's invalid to have an empty replication factor string.
It creates a keyspace without any replication, so the tables inside of it aren't writable.
Trying to create a `SimpleStrategy` keyspace with such a replication factor throws an error; `NetworkTopologyStrategy` should do the same.
The problem was in `prepare_options`, which treated an empty replication factor string as no replication factor.
Changing it to `std::optional` fixes the problem:
now `std::nullopt` means no replication factor, and `make_optional("")` means that there is a replication factor, but it's described by an empty string.
Fixes: https://github.com/scylladb/scylladb/issues/13986Closes#13988
* github.com:scylladb/scylladb:
test/network_topology_strategy_test: Test NTS with replication_factor option in test_invalid_dcs
ks_prop_defs: disallow empty replication factor string in NTS
The exception unrecognized_entity_exception used to have two fields:
* entity - the name that wasn't recognized
* relation_str - part of the WHERE clause that contained this entity
In 4e0a089f3e the places that throw
this exception were modified: the thrower started passing the unrecognized
column name to both fields - entity and relation_str. It was easier to
do things this way, as accessing the whole WHERE clause can be problematic.
The problem is that this caused error messages to get weird, e.g.:
"Undefined name x in where clause ('x')".
x is not the WHERE clause, it's the unrecognized name.
Let's remove the `relation_str` field as it isn't used anymore,
it only causes confusion. After this change the message would be:
"Unrecognized name x"
Which makes much more sense.
Refs #10632
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13944
Found and fixed yet another place where an error message prints a column
name as "bytes" type which causes it to be printed as hexadecimal codes
instead of the actual characters of the name.
The specific error message fixed here is "Cannot use selection function
writeTime on PRIMARY KEY part k" which happens when you try to use
writetime() or ttl() on a key column (which isn't allowed today - see
issue #14019). Before this patch we got "6b" in the error message instead
of "k".
The patch also includes a regression test that verifies that this
error condition is recognized and the real name of the column is
printed. This test fails before this patch, and passes after it.
As usual, the test also passes on Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14021
The sstable_directory::garbage_collect() scans /var/lib/scylla for
whatever sstable it's called for. S3-backed ones don't have anything
there, so the g.c. run is a no-op. Make this call a lister virtual
method, so that only the filesystem lister does this scan and the
ownership table lister becomes a real no-op. Later it will be filled
with code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable directory calls the lister it passes the _sstable_dir as an
argument. However, the very same _sstable_dir was used to construct the
lister, and by now all the lister implementations keep this value
on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The directory_lister _lister exists as a class member, but is only used
once -- when .process() is called -- and then is closed forever.
It's simpler to keep the lister on the .process() stack.
This change also makes filesystem lister keep the copy of directory as
class member, it will be useful for the next patch as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The filesystem components lister has private wrappers on top of the
directory lister it uses internally. These are leftovers from making the
sstable directory storage-aware; now they can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this string is used as the option description in the command line
help message, so it is a part of the user-facing interface.
in this change, the typo is fixed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14013
After consistent schema changes, remove schema pulls from gossiper
events if Raft is enabled, taking the Raft upgrade state into account.
Only disable the pull if Raft is fully enabled.
Fixes#12870
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13695
In get_range_addresses we are iterating
over vnode tokens; we don't need to do a
binary search for them in tmptr->first_token,
as they can be used directly as keys
for _replication_map.
Improve test/alternator/README.md by adding a better and more
beginner-friendly introduction to how to run the Alternator tests, as well
as a section about the philosophy of the Alternator test suite, and
some guidelines on how to write good tests in that framework.
Much of this text was copied from test/cql-pytest/README.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13999
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `api::table_info` without the help of `operator<<`.
but the corresponding `operator<<()` is preserved in this change, as we
still have lots of callers relying on this << operator in storage_service.cc,
where std::vector<table_info> is formatted using operator<<(ostream&, const Range&)
defined in to_string.hh. we could have used fmt/ranges.h to print the
std::vector<table_info>. but the combination of operator<<(ostream&, const Range&)
and FMT_DEPRECATED_OSTREAM renders this impossible. because
unlike the builtin range formatter specializations, the fallback formatter
synthesized from the operator<< does not have brackets defined for
the range printer. the brackets are used as the left and right marks
of the range, for instance, the array-alike containers are printed
like [1,2,3], while the tuple-alike containers are printed like
(1,2,3). once we are allowed to remove FMT_DEPRECATED_OSTREAM, we
should be able to use the builtin range formatter, and remove the
operator<< for api::table_info by then.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13975
to reduce the size of the header file, in hope of speeding up compilation,
let's move the implementation of the format() function into the .cc file.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14010
Here's a simple test that can be used to check S3 object read latencies.
To run it, one must export the same variables as for any other S3 unit test:
- S3_SERVER_ADDRESS_FOR_TEST
- S3_SERVER_PORT_FOR_TEST
- S3_PUBLIC_BUCKET_FOR_TEST
and the AWS creds must be provided via the AWS_S3_EXTRA='$key:$secret:$region' env
variable.
Accepted options are:
--duration SEC -- test duration in seconds
--parallel NR -- number of fibers to run in parallel
--object-size BYTES -- object size to use (1MB by default)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13895
The manager is in charge of updating IO bandwidth on the respective prio class. Nowadays it uses the global priority-manager, but the effort to unify sched classes will require it to use a non-global streaming sched group. After this patch the sched class field is unused, but it's a preparation towards the huge (really huge) "switch to seastar API level 7" patch.
ref: #13963
Closes #13997
* github.com:scylladb/scylladb:
stream_manager: Add streaming sched group copy
cql_test_env: Move sched groups initialization up
In commit 0a71151bc4 I wanted to avoid
an incorrect deprecation warning from the Python driver but fixed it
in an incorrect way. I never noticed the fix was incorrect because
the test was already xfailing, and the incorrect fix just made it
fail differently... In this patch I revert that commit.
With this revert, I am *not* bringing back the spurious warning -
the Python driver bug was already fixed in
https://github.com/datastax/python-driver/pull/1103 - so developers
with a fairly recent version will no longer see the spurious warning.
Both old and new drivers will at least do the correct thing, as
it was before that unfortunate commit.
Fixes #8752.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14002
There are a few places in sstables/ code that require the caller to specify a priority class to pass along to file stream options. All these callers use the default class, so it makes little sense to keep it. This change makes the sched classes unification mega patch a bit smaller.
ref: #13963
Closes #13996
* github.com:scylladb/scylladb:
sstables: Remove default prio class from rewrite_statistics()
sstables: Remove prio class from validate_checksums subs
sstables: Remove always default io-prio from validate_checksums()
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print
- raft::server_address
- raft::config_member
- raft::configuration
without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13976
not all tables in the system keyspace are volatile. among other things, system.sstables and system.tablets are persisted using sstables like regular user tables. so add a dedicated section for them. also, in this change, the raft table is added to the new section.
Closes #13981
* github.com:scylladb/scylladb:
docs/dev/system_keyspace: add raft table
docs/dev/system_keyspace: move sstables and tablets into another section
Several test cases use common operations on files like existence checking, content comparing, etc. with the help of home-brewed local helpers. This set makes use of some existing seastar:: ones and generalizes others into test/lib/. The primary intent here is `57 insertions(+), 135 deletions(-)`.
Closes #13936
* github.com:scylladb/scylladb:
test: Generalize touch_file() into test_utils.*
test/database: Generalize file/dir touch and exists checks
test/sstables: Use seastar::file_exists() to check
test/sstables: Remove sstdesc
test/sstables: Use compare_files from utils/ in sstable_test
test/sstables: Use compare_files() from utils/ in sstable_3_x_test
test/util: Add compare_file() helpers
not all tables in the system keyspace are volatile. among other things,
system.sstables and system.tablets are persisted using sstables like
regular user tables. so move them into the section where we have the
other regular tables.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There are several places in tests that either use default_priority_class() explicitly, or use some specific prio class obtained from the priority manager. There's currently ongoing work to remove all priority classes; this set makes the final patch a bit smaller and easier to review. In particular, in many cases default_priority_class() is implicit and can be omitted by callers. Also, using any specific prio class in a test is excessive; the (implicit) default_priority_class suffices.
ref: #13963
Closes #13991
* github.com:scylladb/scylladb:
test, memtable: Use default prio class
test, memtable: Add default value for make_flush_reader() last arg
test, view_build: Use default prio class
test, sstables: Use implicit default prio class in dma_write()
test, sstables: Use default sstable::get_writer()'s prio class arg
The way index_reader maintains io_priority_class can be relaxed a bit. The main intent is to shorten the #13963 final patch a bit, as a side effect index_reader gets its portion of API polishing.
ref: #13963
Closes #13992
* github.com:scylladb/scylladb:
index_reader: Introduce and use default arguments to constructor
index_reader: Use _pc field in get_file_input_stream_options() directly
index_reader: Move index_reader::get_file_input_stream_options to private: block
The metrics that are being deregistered (in this PR) caused Scylla to crash when a
table was dropped, the corresponding table object in memory was not
yet deallocated, and a new table with the same name was created. This
caused a double-metrics-registration exception to be thrown. In order to
avoid it, we now deregister the table's metrics as soon as the table is
marked to be disposed from the database. The table's representation in memory can
still live on, but it shouldn't forbid another table with the same name from being
created.
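The shape of the bug and the fix can be sketched with a toy registry (the names here are illustrative, not the actual seastar::metrics API):

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>

// Toy stand-in for a metrics registry: registering an already-registered
// name throws, which is what crashed Scylla when a dropped-but-still-alive
// table collided with a newly created table of the same name.
class metric_registry {
    std::unordered_set<std::string> _names;
public:
    void register_metric(const std::string& name) {
        if (!_names.insert(name).second) {
            throw std::runtime_error("double metrics registration: " + name);
        }
    }
    // The fix, in essence: call this as soon as the table is marked for
    // disposal, even though the table object itself may stay in memory
    // for a while longer.
    void deregister_metric(const std::string& name) {
        _names.erase(name);
    }
};
```

With eager deregistration, dropping a table and recreating one under the same name registers cleanly even while the old in-memory object lingers.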
Fixes #13548
Closes #13971
* Adds the new feature tutorial site to the docs
* fixes the unnecessary redirection (iot.scylladb.com)
Closes #13998
* github.com:scylladb/scylladb:
Skip unnecessary redirection
Add links to feature store tutorial
To speed up total test suite run, change configuration to schedule slow
topology tests first.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13948
we changed the type of generation column in system.sstables
from bigint to timeuuid in 74e9e6dd1a
but that change failed to update the document accordingly. so let's
update the document to reflect the change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13994
The manager in question is responsible for keeping the streaming
class IO bandwidth updated. Nowadays it does that via the priority manager's
global streaming IO priority class field, but it will need to switch to
the streaming sched group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The streaming manager will need to keep its copy of
streaming/maintenance group, so groups should be created early.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Close the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle an exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but it is executed once per page, so its cost should be insignificant when spread over an
entire page's worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784
Closes #13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
The method is called with an explicitly default priority class and puts it
into the fstream options. This whole chain can be avoided.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable.read_checksum() and .read_digest() accept a prio class
argument from validate_checksums(), but it's always the "default" one.
Remove the arg and remove the stream options initializations, as they'll pick
up the default prio class on default construction.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All calls to sstables::validate_checksums() happen with explicitly
default priority class. Just hard-code it as such in the method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most creators of index_reader construct it with the default prio class,
a null trace pointer and use_caching::yes. Assigning implicit defaults to
the constructor arguments keeps the code shorter and easier to read.
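The pattern can be sketched in isolation (names and argument types here are illustrative, not Scylla's actual index_reader signature):

```cpp
#include <string>

// Illustrative sketch: trailing constructor arguments carry the common
// defaults (default prio class, no tracing, caching enabled), so the
// typical call site only passes the file name.
enum class use_caching { no, yes };

struct index_reader {
    std::string file;
    int priority;             // stands in for the default prio class
    const void* trace_state;  // null trace pointer by default
    use_caching caching;

    explicit index_reader(std::string f,
                          int prio = 0,
                          const void* trace = nullptr,
                          use_caching uc = use_caching::yes)
        : file(std::move(f)), priority(prio), trace_state(trace), caching(uc) {}
};
```

A caller that needs tracing or a specific prio class can still pass the extra arguments explicitly; everyone else just writes `index_reader r{"Index.db"};`.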
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
* seastar f94b1bb9cb...aff87d5bb9 (9):
> prometheus.cc: change function in foreach_metrics to use const ref
Fixes #13929
> sstring: add == comparison operators
> Add a simple code example for promise in tutorial.md
> Add example for cpuset docs
> Merge 'httpd: modernize' from Avi Kivity
> Merge 'TLS: support for extracting certificate subject alt names from client certs' from Calle Wilund
> Merge 'Update IO stats label set' from Pavel Emelyanov
> file: s/(void)/()/ in function's parameter list
> scripts: addr2line: allow specifying kallsyms path
Closes #13985
A "while at it" cleanup. When patching the method (next patch) it turned
out that there are no callers other than the local class, so it _is_
private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helps users figure out whether the repair failed because a peer node
was down during repair.
For example:
```
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: Repair
1026 out of 1026 ranges, keyspace=ks2a, table={test_table, tb},
range=(9203128250168517738,+inf), peers={127.0.0.2}, live_peers={},
status=skipped_no_live_peers
INFO [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: stats:
repair_reason=repair, keyspace=ks2a, tables={test_table, tb}, ranges_nr=513,
round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=0,
tx_hashes_nr=0, rx_hashes_nr=0, duration=0 seconds, tx_row_nr=0, rx_row_nr=0,
tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={}, row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s, row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out
of 1026 ranges failed, keyspace=ks2a, tables={test_table, tb},
repair_reason=repair, nodes_down_during_repair={127.0.0.2}
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]:
repair_tracker run failed: std::runtime_error ({shard 0: std::runtime_error
(repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out of 1026 ranges failed,
keyspace=ks2a, tables={test_table, tb}, repair_reason=repair,
nodes_down_during_repair={127.0.0.2})})
```
In addition, change the `status=skipped` to `status=skipped_no_live_peers`
to make it clearer.
Closes #13928
There's a rather boring test_sstable_exists() helper in the test that
can be replaced with a more standard seastar API call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper class is used to transfer the directory name and generation int
value into the compare_sstables() helper. Remove both; the utils/ stuff
is useful enough not to need wrappers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's yet another implementation of read-the-whole-file and
check-file-contents-matches helpers in the test. Replace it with the
utils/ facility. Next patch will be able to wash more stuff out of
this test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a static helper under the same name that can be replaced with
utils/ one. The code here runs in async context to .get0() the result.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Many places call memtable::make_flush_reader() with default priority
class. Make it a default-arg for the method, other reader making methods
of memtable already have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test case tries to be "correct" and calls sst->write_components()
with the streaming priority class. It's a test anyway, no need to be too
diligent here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::get_writer()'s prio class argument has its default value.
No need to pass it explicitly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current documentation of real_dirty_memory_accounter is not
quite compelling enough.
One can find some proper documentation by digging through the
git log and reading the description of the commit which added it,
but it shouldn't be that way.
This patch replaces the current documentation of the class with something
more explanatory.
Closes #13927
As described in https://github.com/scylladb/scylladb/issues/8638,
we're moving away from `SimpleStrategy`, in the future
it will become deprecated.
We should remove all uses of it and replace them
with `NetworkTopologyStrategy`.
This change replaces `SimpleStrategy` with
`NetworkTopologyStrategy` in all unit tests,
or at least in the ones where it was reasonable to do so.
Some of the tests were written explicitly to test the
`SimpleStrategy` strategy, or changing the keyspace from
`SimpleStrategy` to `NetworkTopologyStrategy`.
These tests were left intact.
It's still a feature that is supported,
even if it's slowly getting deprecated.
The typical way to use `NetworkTopologyStrategy` is
to specify a replication factor for each datacenter.
This could be a bit cumbersome: we would have to fetch
the list of datacenters, set the repfactors, etc.
Luckily there is another way - we can just specify
a replication factor to use for each existing
datacenter, like this:
```cql
CREATE KEYSPACE {} WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1};
```
This makes the change rather straightforward - just replace all
instances of `'SimpleStrategy'` with `'NetworkTopologyStrategy'`.
Refs: https://github.com/scylladb/scylladb/issues/8638
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13990
Fixes https://github.com/scylladb/scylladb/issues/13969
This commit enables publishing the docs for branch-5.3
The documentation for version 5.3 will be marked as
unstable until it is released, and an appropriate
warning is in place.
Closes #13977
To speed up the total test suite run, split the raft upgrade tests and schedule the slow tests to run first.
Closes #13951
* github.com:scylladb/scylladb:
test/topology: run first slow raft upgrade tests
test/topology: split raft upgrade tests
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is being upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
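The fallback logic can be sketched roughly like this (names and the plain-string representation are illustrative, not the actual messaging-service code):

```cpp
#include <string>
#include <vector>

// Illustrative sketch: match an incoming isolation cookie against the known
// statement tenants, then the system connection names. An unknown cookie is
// assumed to be a statement tenant this node doesn't know yet (e.g. an
// enterprise service level during a mixed-version upgrade), so it falls back
// to the default statement tenant instead of the default/system group.
std::string scheduling_group_for(const std::string& cookie,
                                 const std::vector<std::string>& known_tenants,
                                 const std::vector<std::string>& system_groups) {
    for (const auto& t : known_tenants) {
        if (t == cookie) {
            return t;
        }
    }
    for (const auto& s : system_groups) {
        if (s == cookie) {
            return s;
        }
    }
    // Unknown cookie: treat it as an unknown statement tenant.
    return "statement";  // stands in for the default statement tenant
}
```

The key change relative to the old behaviour is only the final return: previously an unknown cookie mapped to the default scheduling group rather than the statement tenant.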
Fixes: #13841
Fixes: #12552
Closes #13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
This refactoring is a follow-up to https://github.com/scylladb/scylladb/pull/13376: move per-keyspace data structures related to topology changes from `token_metadata` to `erm`.
We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`.
At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring.
The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`: we store the endpoints list only once with a special key
instead of duplicating it for each `vnode` token.
The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation.
Closes #13715
* github.com:scylladb/scylladb:
token_metadata_test: add a test for everywhere strategy
token_metadata_test: check read_endpoints when bootstrapping first node
token_metadata_test: refactor tests, extract create_erm
token_metadata: drop has_pending_ranges and migration_info
effective_replication_map: add has_pending_ranges
token_metadata: drop update_pending_ranges
effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading
token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers
vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading
calculate_effective_replication_map: compute pending_endpoints and read_endpoints
vnode_erm: optimize replication_map
vnode_erm::get_range_addresses: use sorted_tokens
abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token
sequenced_set: add extract_vector method
effective_replication_map: clone_endpoints_gently -> clone_data_gently
vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints
stall_free.hh: add clear_gently for rvalues
stall_free.hh: relax Container requirement
token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map
token_metadata: introduce topology_change_info
token_metadata: replace set_topology_transition_state with set_read_new
test_invalid_dcs is a test which has a list of incorrect replication factor
values, and tries to create keyspaces with these incorrect values.
The standard way of creating a NetworkTopologyStrategy keyspace
is to specify the replication factor for each specific datacenter,
but there's also a simpler way - a user can just write: 'replication_factor': X
to convey that all of the current datacenters should have replication_factor X.
This way of creating a NetworkTopologyStrategy wasn't tested by test_invalid_dcs,
let's add it to the test to improve coverage.
Refs: https://github.com/scylladb/scylladb/issues/13986
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A CREATE KEYSPACE query which specifies an empty string ('')
as the replication factor value is currently allowed:
```cql
CREATE KEYSPACE bad_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': ''};
```
This is wrong, it's invalid to have an empty replication factor string.
It creates a keyspace without any replication, so the tables inside of it aren't writable.
Trying to create a `SimpleStrategy` keyspace with such a replication factor throws an error;
`NetworkTopologyStrategy` should do the same.
The problem was in `prepare_options`: it treated an empty replication factor string
as no replication factor. Changing it to `std::optional` fixes the problem.
Now `std::nullopt` means no replication factor, and `make_optional("")` means
that there is a replication factor, but it's described by an empty string.
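The distinction the fix relies on can be shown with a minimal sketch (the function and its default are hypothetical, simplified from the real `prepare_options` logic):

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Sketch: std::optional distinguishes "replication_factor absent from the
// options map" (nullopt) from "replication_factor present but empty"
// (make_optional("")), which must now be rejected.
int parse_replication_factor(const std::optional<std::string>& rf) {
    if (!rf) {
        return 1;  // hypothetical default when no RF is specified at all
    }
    if (rf->empty()) {
        throw std::invalid_argument("invalid replication factor: ''");
    }
    return std::stoi(*rf);
}
```

With a plain string, both cases would collapse into "", which is exactly how the empty-string keyspace slipped through before.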
Fixes: https://github.com/scylladb/scylladb/issues/13986
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
If the executable of a matching unit or boost test is not present, warn
to console and skip.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13949
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.
It also works when you change the number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.
This PR implements the API on top of raft topology, it was originally
implemented using gossiper. It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.
Closes #13683
* github.com:scylladb/scylladb:
docs: update topology-over-raft.md
test: topology_experimental_raft: test `check_and_repair_cdc` API
raft topology: implement `check_and_repair_cdc_streams` API
raft topology: implement global request handling
raft topology: introduce `prepare_new_cdc_generation_data`
raft_topology: `get_node_to_work_on_opt`: return guard if no node found
raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
raft topology: separate `publish_cdc_generation` state
raft topology: non-node-specific `exec_global_command`
raft topology: introduce `start_operation()`
raft topology: non-node-specific `topology_mutation_builder`
topology_state_machine: introduce `global_topology_request`
topology_state_machine: use `uint16_t` for `enum_class`es
raft topology: make `new_cdc_generation_data_uuid` topology-global
0d4ffe1d69 introduced a regression where
it used the sha1 of the local "master" branch instead of the remote's
"master" branch in the title of the commit message.
in this change, let's use the origin/${branch}'s sha1 in the title.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13974
Fixes https://github.com/scylladb/scylladb/issues/13808
This commit moves cloud instance recommendations from the Requirements page to a new dedicated page.
The content of subsections is simply copy-pasted, but I added the introduction and metadata for better
searchability.
Closes #13935
* github.com:scylladb/scylladb:
doc: add a cloud instance recommendations page
Fixes sporadic failures to execute INSERT which follows the restart:
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.194.238.2:9042 datacenter1>: ConnectionShutdown('Connection to 127.194.238.2:9042 is closed')})
There could be system.tablet mutations in the schema commit log. We
need to see them before loading sstables of user tables because we
need sharding information.
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print `counter_shard_view` and `counter_cell_view` without the
help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13967
We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
The function storage_service::update_pending_ranges is
turned into update_topology_changes_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
We already use the new pending_endpoints from erm through
the get_pending_ranges virtual function, in this commit
we update all the remaining places to use the new
implementation in erm, as well as remove the old implementation
in token_metadata.
We want to switch token_metadata_test to the new
implementation of pending_endpoints and read_endpoints in erm.
To do this, it is convenient to have token_metadata and
replication_strategy as shared pointers, as it fits better with the signature
of calculate_effective_replication_map. In this commit we don't
change the logic of the tests, we just migrate them to use pointers.
In this commit we introduce functions to erm for accessing
pending_endpoints and read_endpoints similar to the
corresponding functions in token_metadata. The only
difference - we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.
The get_pending_endpoints already existed. For tablets it
remained unchanged, while for vnode we just changed
it from calling on token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.
For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.
In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
We optimise memory usage of replication_map by
storing endpoints list only once in case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with
special key default_replication_map_key.
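The lookup side of this optimisation can be sketched with a simplified map (types and the sentinel value are illustrative, not Scylla's actual definitions):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Simplified sketch: when natural endpoints do not depend on the token,
// the endpoint list is stored exactly once under a sentinel key instead
// of being duplicated for every vnode token.
using endpoint_list = std::vector<std::string>;

constexpr long default_replication_map_key = -1;  // hypothetical sentinel

const endpoint_list& lookup(const std::unordered_map<long, endpoint_list>& rm,
                            long token, bool depends_on_token) {
    // Token-dependent strategies index by token; token-independent ones
    // (e.g. local/everywhere) always hit the single sentinel entry.
    return depends_on_token ? rm.at(token) : rm.at(default_replication_map_key);
}
```

For a token-independent strategy the map holds one entry regardless of how many vnode tokens exist, which is where the memory saving comes from.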
We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them, as empty sorted_tokens() is not a problem: we
store endpoints with a special key.
The function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by value,
but for for_each_natural_endpoint_until a reference is sufficient.
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.
It's possible that the callers of this function
are ok with a single range in case of
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going in that direction.
We need to account for the new fields in the clone implementation.
The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
Refactor ~vnode_effective_replication_map, use
our new clear_gently overload for rvalue references.
Add new fields _pending_endpoints and _read_endpoints
to the call.
vnode_effective_replication_map::clear_gently is removed as
it was not used.
We don't use the return value of erase, so
we can allow it to return anything. We'll
need this for ring_mapping, since
boost::icl::interval_map::erase(it)
returns void.
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.
At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data and create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.
The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.
The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to a view building bug.
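The wait shape described above can be sketched with plain threads (a simplified deadline rather than the real no-progress tracking, and not seastar code):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Sketch: poll until done() becomes true, sleeping `interval` between
// attempts, and give up if `timeout` elapses first. The real cleanup code
// retries once a second with a 5-minute timeout.
bool wait_until(const std::function<bool()>& done,
                std::chrono::milliseconds interval,
                std::chrono::milliseconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // gave up: no success within the timeout
        }
        std::this_thread::sleep_for(interval);
    }
    return true;
}
```

In the actual fix the timeout is reset whenever progress is made, so only a complete stall (e.g. view building stuck) trips it.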
Fixes #9559
Closes #13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
Fixes https://github.com/scylladb/scylladb/issues/13808
This commit moves cloud instance recommendations from
the Requirements page to a new dedicated page.
The content of subsections is simply copy-pasted, but
I added the introduction and metadata for better
searchability.
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13514
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13931
Without the feature, the system schema doesn't have the table, and the
read will fail with:
Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)
We should not attempt to read tablet metadata if the experimental
feature is not enabled.
Fixes #13946
Closes #13947
Currently temporary directories with incomplete sstables and the pending deletion log are processed by the distributed loader on start. That's not nice, because for s3-backed sstables this code makes no sense (and is currently a no-op because of the incomplete implementation). This garbage collecting should be kept in sstable_directory, where it can off-load this work onto the lister component, which is storage-aware.
Once the g.c. code is moved, it allows us to clean the sstable class's list of static helpers a bit.
refs: #13024
refs: #13020
refs: #12707
Closes #13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbage collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.
`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.
`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.
The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.
It's cumbersome and people forget about it,
so let's change the default to `user`.
There are issues about expressions being printed
in a `strange` way; this fixes them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13916
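The two modes described above can be sketched as a default parameter on the formatting entry point (names such as `print_mode` and `column_ref` are hypothetical, chosen only to mirror the commit's example):

```cpp
#include <cassert>
#include <string>

// Sketch of the two printing modes: `user` output is plain CQL that can
// round-trip through the parser, `debug` output adds internal detail.
// Defaulting to `user` means user-facing messages (e.g. the text of an
// invalid_request_exception) are safe even when the caller forgets to
// pick a mode explicitly.
enum class print_mode { user, debug };

struct column_ref {
    std::string name;
    std::string to_string(print_mode mode = print_mode::user) const {
        if (mode == print_mode::debug) {
            return "unresolved_identifier(" + name + ")";
        }
        return name;  // plain CQL, can be parsed back
    }
};
```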
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes #13933
This implicit link is pretty bad, because the feature service is a low-level
one which lots of other services depend on. System keyspace is the opposite
-- a high-level one that needs e.g. the query processor and database to
operate. This inverse dependency is created by the feature service's need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place where enabled features are
committed is when the gossiper enables features on join or upon receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Although the gossiper already has a system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837
Closes #13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
Some state that is used to fill in the 'peers' table is still propagated
over gossiper. When moving a node into the normal state in the raft
topology code, use the data from the gossiper to populate the peers table,
because storage_service::on_change() will not do it if the node was not in
the normal state at the time it was called.
Fixes: #13911
Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone functions in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which became a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870
Closes #13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfies the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tends to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But in the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13908
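The restored ordering boils down to running the cheap grace-period check first and deferring the expensive max-purgeable lookup. A standalone sketch of that check order (all names illustrative; the real code works on sstable bloom filters, not an injected callback):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Sketch of the restored check order: the cheap gc_before comparison runs
// first, and the expensive max-purgeable computation (bloom-filter checks
// over every uncompacting sstable) is deferred until a tombstone is
// actually old enough to be garbage-collectable.
bool can_purge(std::int64_t tombstone_ts,
               std::int64_t deletion_time,
               std::int64_t gc_before,
               const std::function<std::int64_t()>& get_max_purgeable) {
    if (deletion_time >= gc_before) {
        return false;  // still within grace period: skip the expensive path
    }
    return tombstone_ts < get_max_purgeable();
}
```

In a tombstone-heavy compaction where most tombstones are newer than the grace period, the expensive callback is never invoked, which is exactly what removes the 7x regression described above.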
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13853
There are several places that need to carry a pointer to a table that's shard-wide accessible -- database snapshot and truncate code and distributed loader. The database code uses `get_table_on_all_shards()` returning a vector of foreign lw-pointers, the loader code uses its own global_column_family_ptr class.
This PR generalizes both into global_table_ptr facility.
Closes #13909
* github.com:scylladb/scylladb:
replica: Use global_table_ptr in distributed loader
replica: Make global_table_ptr a class
replica: Add type alias for vector of foreign lw-pointers
replica: Put get_table_on_all_shards() to header
replica: Rewrite get_table_on_all_shards()
instead of encoding the fact that we are using generation identifier
as a hint where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. this helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13846
`ep` is std::move'ed to get_endpoint_state_for_endpoint_ptr
but it's used later for logger.warn()
Fixes #13921
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13920
The loader has very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now all users of global_table know it's a vector and reference its
elements with this_shard_id() index. Making the global_table_ptr a class
makes it possible to stop using operator[] and "index" this_shard_id()
in its -> and * operators.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use sharded<database>::invoke_on_all() instead of the open-coded equivalent.
Also don't access database's _column_families directly, use the
find_column_family() method instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The IS NOT NULL restrictions are handled in a special way.
Instead of putting them together with other restrictions,
statement_restrictions collects all columns restricted
by IS NOT NULL and puts them in the _not_null_columns field.
Add a getter to access this set of columns.
The field is private, so it can't be accessed without
a function that explicitly exposes it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The IS NOT NULL restriction is currently supported
only in CREATE MATERIALIZED VIEW statements.
These restrictions work correctly for columns
that are part of the view's primary key,
but they're silently ignored on other columns.
The following commits will forbid placing
the IS NOT NULL restriction on columns
that aren't a part of the view's primary key.
The tests have to be modified in order
to pass, because some of them have
a useless IS NOT NULL restriction
on regular columns that don't belong
to the view's primary key.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The tempdir for filesystem-based sstables is the {generation}.sstable one.
There are two places that need to know the ".sstable" extension -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself, with
the path certainly being the pending-delete dir. So the method can be made
private and the is_pending_delete_dir() can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When non-static, the call can use the sstable_directory::_sstable_dir path,
not the provided argument. The main benefit is that the method can later
be moved onto lister so that filesystem and ownership-table listers can
process dangling bits differently.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check if the deletion log exists in the handling helper, not outside of
it. This makes next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.
First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert the sstring dir into an fs::path dir and switch the logger.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the loader starts it first scans the directory for sstables'
tempdirs and pending deletion logs. Put both into one call so that it
can be moved more easily later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes #13917
perform_cleanup may be waiting for those sstables
to become eligible for cleanup, so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Time out if there is no progress within 5 minutes
to prevent hanging due to a view building bug.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
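The retry policy described above (retry once a second, time out only after a fixed window with no progress) can be sketched synchronously by injecting the sleep and the work, so the shape is testable without a reactor. All names are illustrative:

```cpp
#include <cassert>
#include <functional>

// Sketch of the cleanup retry policy: keep retrying once per "tick"
// (one second in the real code) while progress is being made, and give
// up only after max_ticks_without_progress ticks with no progress
// (5 minutes = 300 ticks). remaining_after_pass runs one cleanup pass
// and returns how many sstables still need cleanup.
bool retry_until_done(const std::function<int()>& remaining_after_pass,
                      const std::function<void()>& sleep_1s,
                      int max_ticks_without_progress) {
    int last_remaining = remaining_after_pass();
    int idle_ticks = 0;
    while (last_remaining > 0) {
        if (idle_ticks >= max_ticks_without_progress) {
            return false;  // no progress within the timeout: give up
        }
        sleep_1s();
        int remaining = remaining_after_pass();
        if (remaining < last_remaining) {
            idle_ticks = 0;  // any progress resets the timeout window
        } else {
            ++idle_ticks;
        }
        last_remaining = remaining;
    }
    return true;
}
```

Resetting the idle counter on progress is what makes this a "no progress within 5 minutes" timeout rather than a hard 5-minute deadline.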
It is possible that cleanup will be executed
right after repair-based node operations,
in which case we have a 5-minute timer
before off-strategy compaction is started.
After marking the sstables that need cleanup,
perform offstrategy compaction, if needed.
This will implicitly cleanup those sstables
as part of offstrategy compaction, before
they are even passed for view update (if the table
has views/secondary index).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scan all sstables to determine which of them
requires cleanup before calling perform_task_on_all_files.
This allows for cheaper no-op return when
no sstable was identified as requiring cleanup,
and also it will allow triggering offstrategy
compaction if needed, after selecting the sstables
for cleanup, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently callers of `for_each_sstable` need to
use a seastar thread to allow preemption
in the for_each_sstable loop.
Provide for_each_sstable_gently and
for_each_sstable_gently_until to make using this
facility from a coroutine easier, without requiring
a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
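The `for_each_sstable_gently_until` shape can be sketched standalone: visit each element, allow the callback to stop the loop early, and yield between elements. Here the yield point is an injected callback rather than a real `co_await`; the names mirror the commit but the code is an illustration:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the *_gently_until helper shape: visit each element, let the
// predicate stop the loop early, and yield between elements so a
// coroutine-based caller stays preemptible without a seastar::thread.
// Returns true if the loop was stopped early.
template <typename T>
bool for_each_gently_until(const std::vector<T>& items,
                           const std::function<bool(const T&)>& stop_pred,
                           const std::function<void()>& yield) {
    for (const auto& item : items) {
        if (stop_pred(item)) {
            return true;   // stopped early
        }
        yield();           // preemption point between elements
    }
    return false;          // visited everything
}
```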
string_format_test was added in 1b5d5205c8,
so let's add it to CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13912
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13907
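The fix's idea is a one-liner: estimate the final string size and reserve it up front, so joining ~1k names costs one allocation instead of ~log2(n) growth reallocations. A standalone sketch (the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: when joining ~1k sstable names into one log line, growing a
// string from empty costs ~log2(n) reallocations. Estimating the final
// size and reserving once avoids them.
std::string join_for_log(const std::vector<std::string>& names) {
    std::size_t total = 0;
    for (const auto& n : names) {
        total += n.size() + 2;  // element + ", " separator (slight over-estimate)
    }
    std::string out;
    out.reserve(total);         // single allocation up front
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (i) {
            out += ", ";
        }
        out += names[i];
    }
    return out;
}
```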
Currently we may start to receive requests before group0 is configured
during boot. If that happens those requests may try to pull schema and
issue raft read barrier which will crash the system because group0 is
not yet available. Work around it by pretending that raft is disabled in
this case and using the non-raft procedure. The proper fix should make sure
that storage proxy verbs are registered only after group0 is fully
functional.
Message-Id: <ZGOZkXC/MsiWtNGu@scylladb.com>
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes #12462
Closes #13906
CI once failed due to mc being unable to configure the minio server. There are currently no clues as to why it could happen, so let's increase minio.py's verbosity a bit.
refs: #13896
Closes #13901
* github.com:scylladb/scylladb:
test,minio: Run mc with --debug option
test,minio: Log mc operations to log file
Currently, when a user creates a function or a keyspace, no
permissions on functions are updated.
Instead, the user should gain all permissions on the function
that they created, or on all functions in the keyspace they have
created. This is also the behavior in Cassandra.
However, if the user is granted permissions on a function after
performing a CREATE OR REPLACE statement, they may
actually only alter the function but still gain permissions to it
as a result of the approach above, which requires another
workaround added to this series.
Lastly, as of right now, when a user is altering a function, they
need both CREATE and ALTER permissions, which is incompatible
with Cassandra - instead, only the ALTER permission should be
required.
This series fixes the mentioned issues, and the tests are already
present in the auth_roles_test dtest.
Fixes #13747
Closes #13814
* github.com:scylladb/scylladb:
cql: adjust tests to the updated permissions on functions
cql: fix authorization when altering a function
cql: grant permissions on functions when creating a keyspace/function
cql: pass a reference to query processor in grant_permissions_to_creator
test_permissions: make tests pass on cassandra
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.
get_endpoints is used here by gossiper and storage_service for iterations that may preempt,
instead of iterating directly over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`), so as to prevent a use-after-free that may potentially happen if the map is rehashed while the function yields, invalidating the loop iterators.
Fixes #13899
Closes #13900
* github.com:scylladb/scylladb:
storage_service: do not preempt while traversing endpoint_state_map
gossiper: do not preempt while traversing endpoint_state_map
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since [gms: heart_beat_state: use generation_type and version_type](4cdad8bc8b)
(merged in [Merge 'gms: define and use generation and version types'...](7f04d8231d))
Implement min/max correctly.
Fixes #13801
Closes #13880
* github.com:scylladb/scylladb:
storage_service: handle_state_normal: on_internal_error on "owns no tokens"
utils: tagged_integer: implement std::numeric_limits::{min,max}
test: add tagged_integer_test
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.
We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.
Fixes #13807
Closes #13804
The auto& db = proxy.local().get_db() is called a few lines above this
patch, so the &db can be reused for the invoke_on_all() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13896
These two can grow large. The non-jumbo sink is effectively limited to
10000 parts; since each is ~5Mb, the maximum uploadable data/index
happens to be 50Gb, which is too small.
Other components shouldn't grow that big and continue using simple and a
bit faster uploading sink.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the test uses a sequence of 1024-byte buffers. This lets the
minio server actively de-duplicate those blocks by page boundary (it's a
guess, but it's plausible because minio reports back equivalent ETags
for lots of the uploaded parts). Make the buffer size not a power of two
so that when squashed together the resulting 2^X buffers don't come out equal.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It re-uses most of the existing upload sink test, but configures the
jumbo sink with at most 3 parts in each intermediate object, so as not to
upload a 50Gb part before switching to the next one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When uploading a part (and a piece) there can be one or more background
fibers handling the upload. In case the client needs to abort the operation
it calls .close() without flush()ing. In this case the S3 API Abort is
made and the sink can be terminated. It's expected that background
fibers would resolve on their own eventually, but it's not quite the
case.
First, they hold units for the semaphore and the semaphore should be
alive by the time units are returned.
Second, the PUT (or copy) request can finish successfully and it may be
sitting in the reactor queue waiting for its continuation to get
scheduled. The continuation references the sink via a "this" capture to put
the part etag.
Finally, in case of piece uploading the copy fiber needs _client at the
end to issue delete-object API call dropping the no longer needed part.
That said -- background fibers must be waited upon on .close() if the
closing is aborting (if it's a successful close, then the fibers must
have been picked up by the final flush() call).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sink is also in charge of uploading large objects in parts, but this
time each part is put with the help of upload-part-copy API call, not
the regular upload-part one.
To make it work the new sink inherits from the uploading base class, but
instead of keeping memory_data_sink_buffers with parts it keeps a sink
to upload a temporary intermediate object with parts. When the object is
"full", i.e. the number of parts in it hits the limit, the object is
flushed, then copied into the target object with the S3 API call, and
then the intermediate object is deleted.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the buffer manipulations now happen in the upload_sink class and
the respective member can be removed from the base class. The base class
only messes with the buffers in its upload_part() call, but that's
unavoidable, as uploading a part implies sending its contents, which sit
in the buffers.
Now the base class can be re-used for uploading parts with the help of
copy-part API call (next patches)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change has two reasons. The first is to facilitate moving the
memory_data_sink_buffers from base class, i.e. -- continuation of the
previous patch. Also this fixes a corner case -- if final sink flush
happens right after the previous part was sent for uploading, the
finalization doesn't happen and sink closing aborts the upload even if
it was successful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_flush() helper is practically useless because what it does can be
done by the upload_part() itself. This merge also facilitates moving the
memory_data_sink_buffers from the base class to the uploader class in the next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Another sink will appear that implements multipart upload
with the help of the copy-part functionality. The current uploading code is
going to be partially re-used, so this patch moves all of it into the
base class in advance. Next patches will pick needed parts.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently everything minio.py does goes to test.py log, while mc (and
minio) output go to another log file. That's inconvenient, better to
keep minio.py's messages in minio log file.
Also, while at it, print a message if the local alias drop fails (it's a
benign failure, but it's good to have the note anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this option allows the user to use a specified linker instead of the
default one. this is more flexible than adding more linker
candidates to the known linkers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13874
For unknown reasons, clang 16 rejects equality comparison
(operator==) where the left-hand-side is an std::string and the
right-hand-side is an sstring. gcc and older clang versions first
convert the left-hand-side to an sstring and then call the symmetric
equality operator.
I was able to hack sstring to support this asymmetric comparison,
but the solution is quite convoluted, and it may be that it's clang
at fault here. So instead this patch eliminates the three cases where
it happened. With it applied, we can build with clang 16.
Closes #13893
in this change, the type of the "generation" field of "sstable" in the
return value of RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".
this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.
this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData. and "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by the nodetool, which prints out returned SSTable info list with
a pretty formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
the nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, nor does it enforce the type -- it just
tries its best to pretty-print them as a table.
But the fields in CompositeData are typed; when the scylla JMX exporter
translates the returned SSTables from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So, we should be consistent on the type of "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
the dependencies on exactly the same version when shipping deb and rpm
packages, we should be safe when it comes to interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13834
Instead of simply throwing an exception. With just the exception, it is
impossible to find out what went wrong, as this API is very generic and
is used in a variety of places. The backtrace printed by
`on_internal_error()` will help zero in on the problem.
Fixes: #13876
Closes #13883
Although this condition should not happen,
we suspect that certain timing conditions might
lead to this state where the node in handle_state_normal
(possibly during shutdown) has no tokens.
Currently we call on_internal_error_noexcept, so
if abort_on_internal_error is false, we will just
print an error and continue on with handle_state_normal.
Change that to `on_internal_error` so as to throw an
exception in production in this unexpected state.
Refs #13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13857
This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.
After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1.
Closes: #13858
Separate cluster_size into a cluster section and specify this value as
initial_size.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13440
Add a respective unit test.
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since 4cdad8bc8b
(merged in 7f04d8231d)
Implement min/max correctly.
Fixes #13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore. but many users would like to set the driver timers based on
server timers. we need to enable them to configure timeout even
when the server is still running.
in this change,
* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now an updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so `executor::_timeout_in_ms` can
be initialized on a per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
the option to executor in its ctor.
Fixes #12232
Closes #13300
* github.com:scylladb/scylladb:
alternator: split the param list of executor ctor into multi lines
alternator,config: make alternator_timeout_in_ms live-updateable
in this series, instead of hardwiring to an integer, we switch to the generation generator for creating new generations. this should help us migrate to a generation identifier which can also be represented by a UUID, and can potentially improve the testing coverage once we switch over to the UUID-based generation identifier. we will need to parameterize these tests by then, for sure.
Closes #13863
* github.com:scylladb/scylladb:
test: sstable: use generator to generate generations
test: sstable: pass generation_type in helper functions
test: sstable: use generator to generate generations
It is useful to know which node has the error. For example, when a node
has a corrupted sstable, with this patch, repair master node can tell
which node has the corrupted sstable.
```
WARN 2023-05-15 10:54:50,213 [shard 0] repair -
repair[2df49b2c-219d-411d-87c6-2eae7073ba61]: get_combined_row_hash: got
error from node=127.0.0.2, keyspace=ks2a, table=tb,
range=(8992118519279586742,9031388867920791714],
error=seastar::rpc::remote_verb_error (some error)
```
Fixes #13881
Closes #13882
Currently, error injections can be enabled either through HTTP or CQL.
While these mechanisms are effective for injecting errors after a node
has already started, they can't be reliably used to trigger failures
shortly after node start. In order to support this use case, this commit
adds the possibility to enable some error injections via config.
A configuration option `error_injections_at_startup` is added. This
option uses our existing configuration framework, so it is possible to
supply it either via CLI or in the YAML configuration file.
- When passed on the command line, the option is parsed as a
semicolon-separated list of error injection names that should be
enabled. Those error injections are enabled in non-oneshot mode.
The CLI option is marked as not used in release mode and does not
appear in the option list.
Example:
--error-injections-at-startup failure_point1;failure_point2
- When provided in YAML config, the option is parsed as a list of items.
Each item is either a string or a map of parameters. This method is
more flexible as it allows providing parameters for each injection
point. At this time, the only benefit is that it allows enabling
points in oneshot mode, but more parameters can be added in the future
if needed.
Explanatory example:
error_injections_at_startup:
- failure_point1 # enabled in non-oneshot mode
- name: failure_point2 # enabled in oneshot mode
one_shot: true # due to one_shot optional parameter
The primary goal of this feature is to facilitate testing of raft-based
cluster features. An error injection will be used to enable an
additional feature to simulate node upgrade.
Tests: manual
Closes #13861
Constructors of the trace_state class initialize most of the fields in the constructor body with the help of a non-inline helper method. It's possible, and better, to initialize as much as possible with initializer lists.
Closes #13871
* github.com:scylladb/scylladb:
tracing: List-initialize trace_state::_records
tracing: List-initialize trace_state::_props
tracing: List-initialize trace_state::_slow_query_threshold
tracing: Reorder trace_state fields initialization
tracing: Remove init_session_records()
tracing: List-initialize one_session_records::ttl
tracing: List-initialize one_session_records
tracing: List-initialize session_record
There are two layers of sstables deletion -- delete-atomically and wipe. The former is in fact the "API" method; it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner, so that a subsequent boot would pick up the dangling parts and proceed. The latter is a low-level removal function that can also fail in the middle, but handling that is not its concern.
Currently the atomic deletion is implemented with the help of the sstable_directory::delete_atomically() method that commits sstable file names into a deletion log, then calls wipe (indirectly), then drops the deletion log. On boot, all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though a deletion log is overkill for S3 -- it's better implemented with the help of the ownership table. In fact, S3 storage already implements atomic deletion in its wipe method, thus being overly careful.
So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic
fixes: #13016
note: Replaying sstables deletion from the ownership table on boot is not here, see #13024
Closes #13562
* github.com:scylladb/scylladb:
sstables: Implement atomic deleter for s3 storage
sstables: Get atomic deleter from underlying storage
sstables: Move delete_atomically to manager and rename
Add a basic test for tagged_integer arithmetic operations.
Remove const qualifier from `tagged_integer::operator[+-]=`
as these are add/sub-assign operators that need to modify
the value in place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similarly to how we handle Roles and Tables, we do not
allow permissions on non-existent objects, so the CREATE
permission on a specific function is meaningless, because
for the permission to be granted to someone, the function
must be already created.
This patch removes the CREATE permission from the set of
permissions applicable to a specific function.
Fixes #13822
Closes #13824
This is a translation of Cassandra's CQL unit test source file
validation/entities/UFTypesTest.java into our cql-pytest framework.
There are 7 tests, which reproduce one known bug:
Refs #13746: UDF can only be used in SELECT, and abort when used in WHERE, or in INSERT/UPDATE/DELETE commands
And uncovered two previously unknown bugs:
Refs #13855: UDF with a non-frozen collection parameter cannot be called on a frozen value
Refs #13860: A non-frozen collection returned by a UDF cannot be used as a frozen one
Additionally, we encountered an issue that can be treated as either a bug or a hole in documentation:
Refs #13866: Argument and return types in UDFs can be frozen
Closes #13867
Adding new APIs /column_family/tombstone_gc and /storage_service/tombstone_gc, that will allow for disabling tombstone garbage collection (GC) in compaction.
Mimics the existing APIs /column_family/autocompaction and /storage_service/autocompaction.
The column_family variant must specify a single table only, following the existing convention,
whereas the storage_service one can specify an entire keyspace, or a subset of tables in a keyspace.
column_family API usage
-----
```
The table name must be in keyspace:name format
Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
```
storage_service API usage
-----
```
Tables can be specified using a comma-separated list.
Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Enable GC on a subset of tables
curl -s -X POST
"http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"
```
Closes #13793
* github.com:scylladb/scylladb:
test: Test new API for disabling tombstone GC
test: rest_api: extract common testing code into generic functions
Add API to disable tombstone GC in compaction
api: storage_service: restore indentation
api: storage_service: extract code to set attribute for a set of tables
tests: Test new option for disabling tombstone GC in compaction
compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
table: Allow tombstone GC in compaction to be disabled on user request
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This series changes the code to issue a raft read
barrier before the pull, which guarantees that the keyspace is created
before the actual schema pull is performed.
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.
The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.
That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth are timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.
Fixes#13660.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13851
This PR contains some small improvements to the safety of consuming/releasing resources to/from the semaphore:
* reader_permit: make the low-level `consume()/signal()` API private, making the only user (an RAII class) friend.
* reader_resources: split `reset()` into `noexcept` and potentially throwing variant.
* reader_resources::reset_to(): try harder to avoid calling `consume()` (when the new resource amount is smaller than the previous one)
Closes #13678
* github.com:scylladb/scylladb:
reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
reader_permit: split resource_units::reset()
reader_permit: make consume()/signal() API private
It consists of two parts -- a call to do_read_simple() with a lambda, and handling of its results. The PR coroutinizes it in two steps for review simplicity -- first the lambda, then the outer caller. Then it restores indentation.
Closes #13862
* github.com:scylladb/scylladb:
sstables: Restore indentation after previous patches
sstables: Coroutinize read_toc() outer part
sstables: Coroutinize read_toc() inner part
Currently s3::client is created for each sstable::storage. It's later shared between sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle making a new standalone client. Coupled with the seastar's http client spawning connections on demand, this makes it impossible to control the amount of opened connections to object storage server.
In order to put some policy on top of that (as well as apply workload prioritization) s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint on each shard and then share it between all the sstables, files and sinks.
There's one difficulty however, solving which is most of what this PR does. The file handle, used to transfer an sstable's file across shards, should carry everything it needs to re-create the file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager, and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owned by the database independently on each shard.
1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep a sharded peering service referenced by the sstables_manager, initialized in main and passed through the database constructor down to the sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow
This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager gets the object_storage_config updating facilities from the database (which is overloaded even without it already). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections it also needs a patch to seastar's http::client; there's a PR already doing that, and once (if) it's merged there'll come one more fix on top.
refs: #13458
refs: #13369
refs: scylladb/seastar#1652
Closes #13859
* github.com:scylladb/scylladb:
s3: Pick client from manager via handle
s3: Generalize s3 file handle
s3: Live-update clients' configs
sstables: Keep clients shared across sstables
storage_manager: Rewrap config map
sstables, database: Move object storage config maintenance onto storage_manager
sstables: Introduce sharded<storage_manager>
The existing storage::wipe() method of s3 is in fact an atomic deleter --
it commits "deleting" status into ownership table, deletes the objects
from server, then removes the entry from ownership table. So the atomic
deleter does the same and the .wipe() just removes the objects, because
it's not supposed to be atomic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
While the driver isn't known without the sstable itself, we have a
vector of them and can get it from the front element. This is not very
generic, but fortunately all sstables here belong to the same table and,
respectively, to the same storage and even prefix. The latter is also
assert-checked by the sstable_directory atomic deleter code.
For now S3 storage returns the same directory-based deleter, but next
patch will change that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to let manager decide which storage driver to call for atomic
sstables deletion in the next patch. While at it -- rename the
sstable_directory's method into something more descriptive (to make
compiler catch all callers of it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This field needs to call trace_state::ttl_by_type() which, in turn,
looks into _props. The latter should have been initialized already.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It takes props from constructor args and tunes them according to the
constructing "flavor" -- primary or secondary state. Adding two static
helpers code-documents the intent and makes list-initialization possible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction strategies know how to pick files that are most likely to
satisfy tombstone purge conditions (i.e. not shadowing data in files
that are not being compacted).
This logic can be bypassed if tombstone GC was disabled by user,
as it's a waste of effort to proceed with it until re-enabled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.
Not wired into an API yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The instance ptr and props have to be set up early, because other
members' initialization depends on them. It's currently OK, because
other members are initialized in the constructor body, but moving them
into the initializer list would require correct ordering.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now does nothing but wrap the make_lw_shared<one_session_records>()
call. Callers can do it on their own, thus facilitating further
list-initialization patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For that to happen the value evaluation is moved from the
init_session_records() into a private trace_state helper as it checks
the props values initialized earlier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This object is constructed via one_session_records, thus the latter
needs to pass some arguments along.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
Use the recently introduced replica side query utility functions to
query the content of the system tables. This allows us to cut the
dependency of the system keyspace on storage proxy.
The methods still take storage proxy parameter, this will be replaced
with replica::database in the next patch.
There is still one hidden storage proxy dependency left, via
cql3::query_processor. This will be addressed later.
Containing utility methods to query data from the local replica.
Intended to be used to read system tables, completely bypassing storage
proxy in the process.
This duplicates some code already found in storage proxy, but that is a
small price to pay, to be able to break some circular dependencies
involving storage proxy, that have been plaguing us since time
immemorial.
One thing we lose with this is the SMP service level used in storage
proxy. If this becomes a problem, we can create one in database and use
it in these methods too.
Another thing we lose is increasing `replica_cross_shard_ops` storage
proxy stat. I think this is not a problem at all as these new functions
are meant to be used by internal users, which will reduce the internal
noise in this metric, which is meant to indicate users not using
shard-aware clients.
As a result of the preceding patches, permissions on a function
are now granted to its creator. Consequently, some permissions may
appear which we did not expect before.
In the test_udf_permissions_serialization, we create a function
as the superuser, and as a result, when we compare the permissions
we specifically granted to the ones read from the LIST PERMISSIONS
result, we get more than expected - this is fixed by granting
permissions explicitly to a new user and only checking this user's
permissions list.
In the test_grant_revoke_udf_permissions case, we test whether
the DROP permission is enforced on a function that we have previously
created as the same user - as a result we have the DROP permission
even without granting it directly. We fix this by testing the DROP
permission on a function created by a different user.
In the test_grant_revoke_alter_udf_permissions case, we previously
tested that we require both ALTER and CREATE permissions when executing
a CREATE OR REPLACE FUNCTION statement. The new permissions required
for this statement now depend on whether we actually CREATE or REPLACE
a function, so now we test that the ALTER permission is required when
REPLACING a function, and the CREATE permission is required when
CREATING a function. After the changes, the case no longer needs to
be artificially extracted from the previous one, so they are merged
now. Analogous adjustments are made in the test case
test_grant_revoke_alter_uda_permissions.
Currently, when a user is altering a function, they need
both CREATE and ALTER permissions, instead of just ALTER.
Additionally, after altering a function, the user is
treated as an owner of this function, gaining all access
permissions to it.
This patch fixes these two issues, by checking only the ALTER
permission when actually altering, and by not modifying the
user's permissions if the user did not actually create
the function.
When a user creates a function, they should have all permissions on
this function.
Similarly, when a user creates a keyspace, they should have all
permissions on functions in the keyspace.
This patch introduces GRANTs on the missing permissions.
In the following patch, the grant_permissions_to_creator method is going
to be also used to grant permissions on a newly created function. The
function resource may contain user-defined types which need the
query processor to be prepared, so we add a reference to it in advance
in this patch for easier review.
Despite the cql-pytests being intended to pass on both Scylla and
Cassandra, the test_permissions.py case was actually failing on
Cassandra in a few cases. The most common issue was a different
exception type returned by Scylla and Cassandra for an invalid
query. This was fixed by accepting 2 types of exceptions when
necessary.
The second issue was java UDF code that did not compile, which was
fixed simply by debugging the code.
The last issue was a case that was scylla_only with no good reason.
The missing java UDFs were added to that case, and the test was
adjusted so that the ALTER permission was checked in a
CREATE OR REPLACE statement only if the UDF already existed --
Scylla requires it in both cases, which will get resolved in the
next patch.
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.
this change uses generator to generate generations for
`make_sstable_for_all_shards()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
always avoid using generation_type if possible. this helps us to
hide the underlying type of generation identifier, which could also
be a UUID in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.
this change uses generator to create generations for
`make_sstable_for_this_shard()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a global factory onto the client that is
- cross-shard copyable
- generates a client from the local storage_manager for a given endpoint
With that the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the s3 file handle tries to carry the client's info via an
explicit host name and endpoint config pointer. This is buggy: the
latter pointer is shard-local and cannot be transferred across shards.
This patch prepares the fix by abstracting the client handle part.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the client is accessible directly via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.
The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage are immutable for now.
Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is part of the keyspace configuration, and
updating one in object_storage.yaml will have no effect on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently each sstable gets its own instance of an s3::client. This patch
keeps clients on the storage_manager's endpoints map; when creating a
storage for an sstable, it grabs the shared pointer from the map, thus
making one client serve all sstables there (except for those that
duplicated their files with the help of foreign-info, but that's to be
handled by the next patches).
Moving the ownership of a client to the storage_manager level also means
that the client has to be closed on manager's stop, not on sstable
destroy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the map is endpoint -> config_ptr. Wrap the config_ptr into an
s3_endpoint struct. Next patch will keep the client on this new wrapper
struct thus making them shared between sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the map<endpoint, config> sits on the sstables manager, and its
update is governed by the database (because it's peering and can kick
other shards to update it as well).
Having the sharded<storage_manager> at hand lets us free the database
from the need to update configs and keeps the sstables_manager a bit
smaller. Also, this will allow keeping s3 clients shared between
sstables via this map in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager in question keeps track of whatever the sstables_manager needs
to work with the storage (spoiler: only the S3 one). It's a main-local
sharded peering service, so that the container() call can be used by the
next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It just needs to catch the system_error of ENOENT and re-throw it as
malformed_sstable_exception.
Indentation is deliberately left broken. Again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One non-trivial change is the removal of buf temporary variable. That's
because it existed under the same name in the .then() lambda generating
name conflict after coroutinization.
Other than that it's pretty straightforward.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that a schema pull may issue a raft read barrier, it may get stuck if
a majority is not available. Make the operation abortable and abort it
during queries if the timeout is reached.
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
With gc_before set to gc_clock::time_point::max, a row could be dropped
by compaction even if its TTL has not expired yet.
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes #13572
Closes #13800
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized either by special-casing this case
or, better, by skipping forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.
Fixes #13634
Also, add a respective rest_api unit test
Closes #13849
* github.com:scylladb/scylladb:
test: rest_api: test_storage_service: add test_storage_service_keyspace_cleanup_with_no_owned_ranges
compaction_manager: perform_cleanup: handle empty owned ranges
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This patch changes the code to issue a raft read
barrier before the pull, which guarantees that the keyspace is created
before the actual schema pull is performed.
Refs: #3760
Fixes: #13211
Fixes https://github.com/scylladb/scylladb/issues/13805
This commit fixes the redirection required by moving the Glossary
page from the top of the page tree to the Reference section.
As the change was only merged to master (not to branch-5.2),
it is not working for version 5.2, which is now the latest stable
version.
For this reason, "stable" in the path must be replaced with "master".
Closes#13847
the series drops some of the callers using SSTable generation as integer. as the generation of SSTable is but an identifier, we should not use it as an integer out of generation_type's implementation.
Closes #13845
* github.com:scylladb/scylladb:
test: drop unused helper functions
test: sstable_mutation_test: avoid using helper using generation_type::int_t
test: sstable_move_test: avoid using helper using generation_type::int_t
test: sstable_*test: avoid using helper using generation_type::int_t
test: sstable_3_x_test: do not use reuseable_sst() accepting integer
Updates to the compaction_group sstable sets are
never done in place. Instead, the update is done
on a mutable copy of the sstable set, and the lw_shared
result is set back in the compaction_group.
(see for example compaction_group::set_main_sstables)
Therefore, there's currently a risk in perform_cleanup's
`get_sstables` lambda: if it yields while in
set.for_each_sstable, the sstable_set might be replaced
and the copy it is traversing may be destroyed.
This was introduced in c2bf0e0b72.
To prevent that, hold on to set.shared_from_this()
around set.for_each_sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13852
all users of these two helpers have switched to their alternatives,
so there is no need to keep them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of an integer. also, since
`generate_clustered()` is only used by functions in the same
compilation unit, let's take the opportunity to mark it `static`.
and there is no need to pass generation as a template parameter,
we just pass it as a regular parameter.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in helper functions, we just use
`generation_type`. please note, although we'd prefer generating
the generations using the generator, the SSTables used by the tests
modified by this change are stored in the repo; to ensure that the
tests are always able to find the SSTable files, we keep them
unchanged instead of using generation_generator or a random
generation for testing.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type by offering a default parameter, which is a
generation created from 1. this preserves the existing behavior.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type.
also, as no callers are using the last parameter of `make_test_sstable()`,
let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This series fixes an issue with altering permissions on UDFs with
parameter types that are UDTs with quoted names and adds
a test for it.
The issue was caused by the format of the temporary string
that represented the UDT in `auth::resource`. After parsing the
user input to a raw type, we created a string representing the
UDT using `ut_name::to_string()`. The segment of the resulting
string that represented the name of the UDT was not quoted,
making us unable to parse it again when the UDT was being
`prepare`d. Other than for this purpose, `ut_name::to_string()`
is used only for logging, so the solution was to modify it to
quote the UDT name when necessary.
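the "maybe quote" behavior can be sketched as follows; this is an illustrative helper, not the real `ut_name::to_string()` implementation, assuming CQL's rule that an identifier needs quoting unless it is all lower-case letters, digits, and underscores, and that embedded quotes are doubled:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// quote an identifier only when it would not survive re-parsing unquoted
inline std::string maybe_quote(const std::string& name) {
    bool needs_quoting = name.empty() ||
        !std::all_of(name.begin(), name.end(), [](unsigned char c) {
            return std::islower(c) || std::isdigit(c) || c == '_';
        });
    if (!needs_quoting) {
        return name;
    }
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') {
            out += '"';  // CQL doubles embedded quotes
        }
        out += c;
    }
    out += '"';
    return out;
}
```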
Ref: https://github.com/scylladb/scylladb/pull/12869
Closes #13257
* github.com:scylladb/scylladb:
cql-pytest: test permissions for UDTs with quoted names
cql: maybe quote user type name in ut_name::to_string()
cql: add a check for currently used stack in parser
cql-pytest: add an optional name parameter to new_type()
Currently, when creating a UDA, we only check for permissions
for creating functions. However, the creator gains all permissions
to the UDA, including the EXECUTE permission. This enables the
user to also execute the state/reduce/final functions that were
used in the UDA, even if they don't have the EXECUTE permissions
on them.
This patch adds checks for the missing EXECUTE permissions, so
that the UDA can only be created if the user has all required
permissions.
The permissions that are now required when creating a UDA
are granted in the existing UDA test.
Fixes #13818
Closes #13819
Currently, when a function has no arguments, the function_args()
method, which is supposed to return a vector of string_views
representing the function's arguments, returns nullopt instead,
as if it were a functions_resource covering all functions
or all functions in a keyspace. As a result, the functions_resource
can't be properly formatted.
This is fixed in this patch by returning an empty vector instead,
and the fix is confirmed in a cql-pytest.
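a sketch of the fix, with a hypothetical signature for `function_args()` (the real method lives in `auth::resource` and uses a different encoding):

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <vector>

// given the encoded argument segment of a function resource, return an
// empty vector for a zero-argument function; reserve nullopt for
// resources that cover all functions (no argument segment at all)
inline std::optional<std::vector<std::string_view>>
function_args(std::optional<std::string_view> encoded) {
    if (!encoded) {
        return std::nullopt;  // "all functions (in a keyspace)" resource
    }
    std::vector<std::string_view> args;
    std::string_view s = *encoded;
    while (!s.empty()) {
        auto sep = s.find('^');
        args.push_back(s.substr(0, sep));
        s = (sep == std::string_view::npos) ? std::string_view{} : s.substr(sep + 1);
    }
    return args;  // empty, not nullopt, for a zero-argument function
}
```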
Fixes #13842
Closes #13844
data_consume_rows keeps an input_stream member that must be closed,
in particular on the error path, where we may destroy it with
readaheads still in flight.
Fixes #13836
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13840
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized either by special-casing this case
or, better, by using skip-forward to the next owned range,
which will skip to end-of-stream since there are no
owned ranges.
Fixes #13634
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.
This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance
To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the sources
of the values are views into a bytes_ostream. However, luckily
bytes_ostream and managed_bytes_view are mostly compatible, so with a
little effort this can be done.
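the motivation for fragmented values can be illustrated with a toy model; it loosely mirrors the idea behind `managed_bytes` (fixed-size chunks, so no single allocation grows with the value size), not its actual layout:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// a fragmented value: the payload is split into fixed-size chunks,
// so a large blob never forces one large contiguous allocation
class fragmented_value {
    static constexpr size_t chunk_size = 128;
    std::vector<std::vector<uint8_t>> _chunks;
    size_t _size = 0;
public:
    void append(const uint8_t* data, size_t len) {
        _size += len;
        while (len) {
            if (_chunks.empty() || _chunks.back().size() == chunk_size) {
                _chunks.emplace_back();
                _chunks.back().reserve(chunk_size);
            }
            auto& back = _chunks.back();
            size_t n = std::min(len, chunk_size - back.size());
            back.insert(back.end(), data, data + n);
            data += n;
            len -= n;
        }
    }
    size_t size() const { return _size; }
    size_t max_allocation() const {
        size_t m = 0;
        for (auto& c : _chunks) {
            m = std::max(m, c.size());
        }
        return m;
    }
};
```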
The series is neutral wrt performance:
before:
```
222118.61 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
224250.14 tps ( 61.1 allocs/op, 12.1 tasks/op, 43094 insns/op, 0 errors)
224115.66 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
223508.70 tps ( 61.1 allocs/op, 12.1 tasks/op, 43107 insns/op, 0 errors)
223498.04 tps ( 61.1 allocs/op, 12.1 tasks/op, 43087 insns/op, 0 errors)
```
after:
```
220708.37 tps ( 61.1 allocs/op, 12.1 tasks/op, 43118 insns/op, 0 errors)
225168.99 tps ( 61.1 allocs/op, 12.1 tasks/op, 43081 insns/op, 0 errors)
222406.00 tps ( 61.1 allocs/op, 12.1 tasks/op, 43088 insns/op, 0 errors)
224608.27 tps ( 61.1 allocs/op, 12.1 tasks/op, 43102 insns/op, 0 errors)
225458.32 tps ( 61.1 allocs/op, 12.1 tasks/op, 43098 insns/op, 0 errors)
```
Though I expect with some more effort we can eliminate some copies.
Closes #13637
* github.com:scylladb/scylladb:
cql3: untyped_result_set: switch to managed_bytes_view as the cell type
cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
cql3: untyped_result_set: always own data
types: abstract_type: add mixed-type versions of compare() and equal()
utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
utils: managed_bytes: add managed_bytes_view::with_linearized()
utils: managed_bytes: mark managed_bytes_view::is_linearized() const
Currently, when a user has permissions on a function/all functions in
keyspace, and the function/keyspace is dropped, the user keeps the
permissions. As a result, when a new function/keyspace is created
with the same name (and signature), they will be able to use it even
if no permissions on it are granted to them.
Similarly to regular UDFs, the same applies to UDAs.
After this patch, the corresponding permissions on functions are dropped
when a function/keyspace is dropped.
Fixes #13820
Closes #13823
Task manager tasks covering scrub compaction at the top,
shard, and table levels.
At these levels we have common scrub tasks for all scrub
modes, since they share code. Scrub modes will be differentiated
at the compaction group level.
Closes #13694
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test scrub compaction
compaction: add table_scrub_sstables_compaction_task_impl
compaction: add shard_scrub_sstables_compaction_task_impl
compaction: add scrub_sstables_compaction_task_impl
api: get rid of unnecessary std::optional in scrub
compaction: rename rewrite_sstables_compaction_task_impl
in this series, `data_dictionary::storage_options` is refactored so that each dedicated storage option takes care of itself, instead of putting all the logic into `storage_options`. cleaner this way. as the next step, i will add yet another set of options for the tiered_storage, which is backed by the s3_storage and the local filesystem_storage. with this change, we will be able to group the per-option functionalities together by the option they are designed for, instead of sharding them by the actual function.
Closes #13826
* github.com:scylladb/scylladb:
data_dictionary: define helpers in options
data_dictionary: only define operator== for storage options
The validator classes have their definition in a header located in mutation/, while their implementation is located in a .cc in readers/mutation_reader.cc.
This PR fixes this inconsistency by moving the implementation into mutation/mutation_fragment_stream_validator.cc. The only change is that the validator code gets a new logger instance (but the logger variable itself is left unchanged for now).
Closes #13831
* github.com:scylladb/scylladb:
mutation/mutation_fragment_stream_validator.cc: rename logger
readers,mutation: move mutation_fragment_stream_validator to mutation/
On connection setup, the isolation cookie of the connection is matched
to the appropriate scheduling group. This is achieved by iterating over
the known statement tenant connection types as well as the system
connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and
the remote node has a scheduling group the local one doesn't have. To
avoid demoting a scheduling group of unknown importance, in this case the
default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise
version, as the scheduling groups of the enterprise service levels will
match none of the statement tenants and will hence fall back to the
default scheduling group. As a consequence, while the cluster is mixed,
user workload on old (OSS) nodes will be executed under the system
scheduling group and concurrency semaphore. Not only does this mean that
user workloads are directly competing for resources with system ones,
but the two workloads are now sharing the semaphore too, reducing the
available throughput. This usually manifests in queries timing out on
the old (OSS) nodes in the cluster.
This patch proposes to fix this, by recognizing that the unknown
scheduling group is in fact a tenant this node doesn't know yet, and
matching it with the default statement tenant.
With this, order should be restored, with service-level connections
being recognized as user connections and being executed in the statement
scheduling group and the statement (user) concurrency semaphore.
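the matching logic boils down to something like this sketch (the group and cookie names here are illustrative, not the actual scheduling group names):

```cpp
#include <array>
#include <cassert>
#include <string_view>
#include <utility>

enum class group { statement, system, streaming };

// map an isolation cookie to a scheduling group; an unknown cookie is
// now assumed to be an unknown *tenant* (e.g. an enterprise service
// level) and is run as a user (statement) tenant, instead of falling
// back to the system-side default group
inline group classify(std::string_view cookie) {
    static constexpr std::array<std::pair<std::string_view, group>, 3> known = {{
        {"statement", group::statement},
        {"system", group::system},
        {"streaming", group::streaming},
    }};
    for (auto& [name, g] : known) {
        if (cookie == name) {
            return g;
        }
    }
    return group::statement;  // default statement tenant
}
```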
We have a fixed set of connection types for each tenant. The number of
these connection types can change. Although currently these are
hardcoded in a single place, soon (in the next patch) there will be yet
another place where they will be used. To avoid duplicating these
names, which would make future changes error-prone, centralize them in
a const array, generalizing the concept of a tenant connection type.
When new nodes are added or existing nodes are deleted, the topology
state machine needs to shunt reads from the old nodes to the new ones.
This happens in the `write_both_read_new` state. The problem is that
previously this state was not handled in any way in `token_metadata` and
the read nodes were only changed when the topology state machine reached
the final 'owned' state.
To handle `write_both_read_new` an additional `interval_map` inside
`token_metadata` is maintained similar to `pending_endpoints`. It maps
the ranges affected by the ongoing topology change operation to replicas
which should be used for reading. When the topology state machine
reaches the point when it needs to switch reads to a new topology, it
passes `request_read_new=true` in a call to `update_pending_ranges`.
This forces `update_pending_ranges` to compute the ranges based on the
new topology and store them in the `interval_map`. On the data plane,
when a read on the coordinator needs to decide which endpoints to use,
it first consults this `interval_map` in `token_metadata`, and only if
it doesn't contain a range for the current token does it use the normal
endpoints from `effective_replication_map`.
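the lookup order can be sketched with a toy range map (the real code uses a `boost::icl::interval_map` inside `token_metadata`; this stand-in uses a plain sorted map over (start, end] ranges):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using token = int64_t;
using replicas = std::vector<std::string>;

// ranges affected by an ongoing topology change map to the replicas
// that should serve reads; a miss falls through to the normal
// effective_replication_map endpoints
class read_overrides {
    // key = inclusive upper bound of a (start, end] range
    std::map<token, std::pair<token, replicas>> _ranges;
public:
    void add(token start, token end, replicas r) {
        _ranges[end] = {start, std::move(r)};
    }
    std::optional<replicas> find(token t) const {
        auto it = _ranges.lower_bound(t);
        if (it != _ranges.end() && t > it->second.first) {
            return it->second.second;
        }
        return std::nullopt;  // caller uses the normal endpoints
    }
};
```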
Closes #13376
* github.com:scylladb/scylladb:
storage_proxy, storage_service: use new read endpoints
storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
token_metadata: add unit test for endpoints_for_reading
token_metadata: add endpoints for reading
sequenced_set: add extract_set method
token_metadata_impl: extract maybe_migration_endpoints helper function
token_metadata_impl: introduce migration_info
token_metadata_impl: refactor update_pending_ranges
token_metadata: add unit tests
token_metadata: fix indentation
token_metadata_impl: return unique_ptr from clone functions
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change restores the use of raft::internal::tagged_uint64
for the raft types and adds back an idl definition for
it that is not marked as final, similar to the way
raft::internal::tagged_id extends utils::tagged_uuid.
Fixes #13752
Closes #13774
* github.com:scylladb/scylladb:
raft, idl: restore internal::tagged_uint64 type
raft: define term_t as a tagged uint64_t
idl: gossip_digest: include required headers
in this series, we encode the value of the generation using UUID, to prepare for the UUID-based generation identifier. simpler this way, as we don't need two ways of encoding an integer or a timeuuid: a uuid with a zero timestamp, and a variant. also, add a `from_string()` factory method to convert a string to a generation, to hide the underlying type of the value from generation_type's users.
Closes #13782
* github.com:scylladb/scylladb:
sstable: use generation_type::from_string() to convert from string
sstable: encode int using UUID in generation_type
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the driver receives some information about the prepared statement, including names of values bound to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`) there's an expectation that the name assigned to this bind marker will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been unified with the code that handles generic function calls, and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being `"partition key token"`, so a change to `token(p1, p2)` broke some things.
This patch sets the name back to `"partition key token"`. To achieve this we detect any restrictions that match the pattern `token(p1, p2, p3) = X` and set the receiver name for X to `"partition key token"`.
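the detection can be sketched like this, with a hypothetical representation of a restriction (the real code works on prepared expressions):

```cpp
#include <cassert>
#include <string>
#include <vector>

// a binary restriction of the form `token(...) = <bind marker>`,
// reduced to the pieces the check needs (illustrative shape)
struct token_restriction {
    std::string function;           // e.g. "token"
    std::vector<std::string> args;  // e.g. {"p1", "p2"}
    std::string receiver_name;      // the name reported to the driver
};

// if the LHS is token() over exactly the partition key columns,
// force the receiver name the drivers expect
inline void fixup_receiver(token_restriction& r,
                           const std::vector<std::string>& pk_columns) {
    if (r.function == "token" && r.args == pk_columns) {
        r.receiver_name = "partition key token";
    }
}
```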
Fixes: #13769
Closes #13815
* github.com:scylladb/scylladb:
cql-pytest: test that bind marker is partition key token
cql3/prepare_expr: force token() receiver name to be partition key token
in this change,
* instead of using "\d+" to match the generation, use "[^-]",
* let generation_type convert a string to a generation
before this change, we cast the matched string in the SSTable file name
to an integer and then constructed a generation identifier from the
integer. this solution makes a strong assumption that the generation is
represented by an integer; we should not encode this assumption in
sstable.cc, instead we'd better let generation_type itself take care of
it. also, to relax the regex restriction for matching the generation,
let's just use any characters except for the delimiter -- "-".
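the effect of the relaxed pattern can be sketched like this; the surrounding file-name pattern is illustrative (only the not-the-delimiter generation component matters), and does not reproduce the real sstable file-name regex:

```cpp
#include <cassert>
#include <regex>
#include <string>

// extract the generation component of a toy sstable-like file name:
// "[^-]+" matches any run of non-delimiter characters, so both integer
// and UUID-style generations are accepted without assuming an integer
inline std::string extract_generation(const std::string& name) {
    static const std::regex re(R"(^[a-z]+-([^-]+)-.*)");
    std::smatch m;
    if (std::regex_match(name, m, re)) {
        return m[1];
    }
    return {};
}
```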
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we already use a UUID for encoding a bigint in the SSTable
registry table, let's use the same approach for encoding the bigint in
generation_type; it is more consistent and less repetitive this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We use set_topology_transition_state to set read_new state
in storage_service::topology_state_load
based on _topology_state_machine._topology.tstate.
This triggers update_pending_ranges to compute and store new ranges
for read requests. We use this information in
storage_proxy::get_endpoints_for_reading
when we need to decide which nodes to use for reading.
We are going to use remapped_endpoints_for_reading, and we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need: it's used in all read code paths.
Its name, however, did not make this obvious.
Also, we add the parameter ks_name, as we'll need
to pass it to remapped_endpoints_for_reading.
In this patch we add
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new, update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.
Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
instead of dispatching and implementing the per-option handling
right in `storage_option`, define these helpers in the dedicated
option themselves, so `storage_option` is only responsible for
dispatching.
much cleaner this way. this change also makes it easier to add yet
another storage backend.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as the only user of these comparison operators is
`storage_options::can_update_to()`, which just checks if the given
`storage_options` is equal to the stored one, there is no need to
define the <=> operator.
also, no need to add the `friend` specifier: the options are plain
structs, and all the member variables are public.
make the comparison operator a member function instead of a free
function, as in C++20 comparison operators are symmetric.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The validator classes have their definition in a header located in mutation/,
while their implementation is located in a .cc in readers/mutation_reader.cc.
This patch fixes this inconsistency by moving the implementation into
mutation/mutation_fragment_stream_validator.cc. The only change is that
the validator code gets a new logger instance (but the logger variable itself
is left unchanged for now).
this change extracts the storage class and its derived classes
out into their own source files, for a couple of reasons:
- for better readability. sstables.hh is over 1005 lines,
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Closes #13785
* github.com:scylladb/scylladb:
sstables: storage: coroutinize idempotent_link_file()
sstables: extract storage out
When preparing a query each bind marker gets a name.
For a query like:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
The bind marker's name should be `"partition key token"`.
The Java driver relies on this name; having something else,
like `"token(p1, p2)"`, as the name breaks the Java driver.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the driver receives some information
about the prepared statement, including names of values bound
to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`)
there's an expectation that the name assigned to this bind marker
will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been
unified with the code that handles generic function calls,
and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being
`"partition key token"`, so a change to `token(p1, p2)`
broke some things.
This patch sets the name back to `"partition key token"`.
To achieve this we detect any restrictions that match
the pattern `token(p1, p2, p3) = X` and set the receiver
name for X to `"partition key token"`.
Fixes: #13769
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map.
Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
Now update_pending_ranges is quite complex, mainly
because it tries to act efficiently and update only
the affected intervals. However, it uses the function
abstract_replication_strategy::get_ranges, which calls
calculate_natural_endpoints for every token
in the ring anyway.
Our goal is to start reading from the new replicas for
ranges in write_both_read_new state. In the current
code structure this is quite difficult to do, so
in this commit we first simplify update_pending_ranges.
The main idea of the refactoring is to build a new version
of token_metadata based on all planned changes
(join, bootstrap, replace) and then for each token
range compare the result of calculate_natural_endpoints on
the old token_metadata and on the new one.
Those endpoints that are in the new version and
are not in the old version should be added to the pending_ranges.
The add_mapping function is extracted for the
future - we are going to use it to handle read mappings.
Special care is taken when replacing with the same IP.
The coordinator employs the
get_natural_endpoints_without_node_being_replaced function,
which excludes such endpoints from its result. If we compare
the new (merged) and current token_metadata configurations, such
endpoints will also be absent from pending_endpoints since
they exist in both. To address this, we copy the current
token_metadata and remove these endpoints prior to comparison.
This ensures that nodes being replaced are treated
like those being deleted.
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8.
Fixes #13752
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All Raft verbs include `dst_id`, the ID of the destination server, but
it isn't checked. `append_entries` will work even if it arrives at
completely the wrong server (but in the same group). It can cause
problems, e.g. in the scenario of replacing a dead node.
This commit adds verifying if `dst_id` matches the server's ID and if it
doesn't, the Raft verb is rejected.
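the verification amounts to a check like this (illustrative types; the real code lives in raft_group_registry and logs before rejecting the RPC):

```cpp
#include <cassert>
#include <cstdint>

using server_id = uint64_t;

// every raft verb carries the destination server id; the receiving
// node rejects verbs addressed to a different server, e.g. one that
// this node replaced
struct raft_server {
    server_id my_id;
    bool accept(server_id dst_id) const {
        if (dst_id != my_id) {
            return false;  // the real code logs and rejects the verb
        }
        return true;
    }
};
```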
Closes #12179
Testing
---
Testcase and scylla's configuration:
57d3ef14d8
It artificially lengthens the duration of replacing the old node,
which increases the chance of the new node sending an RPC command
to the replaced node.
In the logs of the node that replaced the old one, we can see logs in
the form:
```
DEBUG <time> [shard 0] raft_group_registry - Got message for server <dst_id>, but my id is <my_id>
```
It indicates that the Raft verb with the wrong `dst_id` was rejected.
This test isn't included in the PR because it doesn't catch any specific error.
Closes #13575
* github.com:scylladb/scylladb:
service/raft: raft_group_registry: Add verification of destination ID
service/raft: raft_group_registry: `handle_raft_rpc` refactor
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh, for a couple of reasons:
- for better readability. sstables.hh is over 1005 lines,
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Rename rewrite_sstables_compaction_task_impl to
sstables_compaction_task_impl, as the new name describes this class
of tasks better. Rewriting sstables is a slightly more fine-grained
type of sstable compaction task than the one needed here.
this series prepares for the UUID-based generation by replacing the general `value()` function with a function with a more specific name: `as_int()`.
Closes #13796
* github.com:scylladb/scylladb:
test: drop a reusable_sst() variant which accepts int as generation
treewide: replace generation_type::value() with generation_type::as_int()
It was already outdated before this PR.
Describe the version of the topology state machine implemented in this PR.
Fix some typos and make it proper markdown so it renders nicely on
GitHub etc.
The original API is gossiper-based. Since we're moving CDC generations
handling to Raft-based topology, we need to implement this API as well.
For now the API creates a new generation unconditionally, in a follow-up
I'll introduce a check to skip the creation if the current generation is
optimal.
Refactor the code, taking a bulk of the CDC-specific code used when
there's a bootstrap request to a separate function. We'll use it
elsewhere as well.
Previously the generation committed in `commit_cdc_generation` state
would be published by the coordinator in `write_both_read_old` state.
This logic assumed that we only create new CDC generations during node
bootstrap.
We'll allow committing new generations without bootstrap (without any
node transitions in fact), so we need this separate state.
After publishing the generation, we check whether there is a
transitioning node; if so, we'll enter `write_both_read_old` as next
state, otherwise we'll make the topology non-transitioning.
This function broadcasts a command to cluster members. It takes a
`node_to_work_on`. We'll need a version which works in situations where
there is no 'node to work on'.
This calls `raft_group0_client::start_operation` and checks if the term
is different from the term that the coordinator was initially created
with; if so, we must no longer continue coordinating the topology.
There was one direct call to `raft_group0_client::start_operation`
without a term check; replace it with the introduced function.
The existing `topology_mutation_builder` took a `raft::server_id` in its
constructor and immediately created a clustering row in the
`system.topology` mutation that it was building for the given node.
This does not allow building mutations which only affect the static
columns.
Split the class into two:
- `topology_mutation_builder` doesn't take `raft::server_id` in its
constructor and contains only the methods that are used to set static
columns. It also has a `with_node` method taking a `raft::server_id`
which returns a `topology_node_mutation_builder&`.
- `topology_node_mutation_builder` creates the clustering row and allows
setting its columns.
We'll use `topology_mutation_builder` when we only want to transition
the cluster-global topology state, without affecting any specific nodes'
states.
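the split can be sketched like this, with plain maps standing in for the real mutation machinery:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// per-node builder: bound to one clustering row (one node's state)
class topology_node_mutation_builder {
    std::map<std::string, std::string>& _row;
public:
    explicit topology_node_mutation_builder(std::map<std::string, std::string>& row)
        : _row(row) {}
    topology_node_mutation_builder& set(const std::string& col, const std::string& v) {
        _row[col] = v;
        return *this;
    }
};

// cluster-global builder: only sets static columns; with_node()
// creates the clustering row for a node on demand
class topology_mutation_builder {
    std::map<std::string, std::string> _static_columns;
    std::map<uint64_t, std::map<std::string, std::string>> _rows;
public:
    topology_mutation_builder& set_static(const std::string& col, const std::string& v) {
        _static_columns[col] = v;
        return *this;
    }
    topology_node_mutation_builder with_node(uint64_t server_id) {
        return topology_node_mutation_builder(_rows[server_id]);
    }
    bool has_row(uint64_t server_id) const { return _rows.count(server_id) > 0; }
    bool has_static(const std::string& col) const { return _static_columns.count(col) > 0; }
};
```

this way a builder that only touches static columns never fabricates a clustering row for any node.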
`topology` currently contains the `requests` map, which is suitable for
node-specific requests such as "this node wants to join" or "this node
must be removed". But for requests for operations that affect the
cluster as a whole, a separate request type and field is more
appropriate. Introduce one.
The enum currently contains the option `new_cdc_generation` for requests
to create a new CDC generation in the cluster. We will implement the
whole procedure in later commits.
- make it a static column in `system.topology`
- move it from node-specific `ring_slice` to cluster-global `topology`
We will use it in scenarios where no node is transitioning.
Also make it `std::optional` in topology for consistency with other
fields (previously, the 'no value' state for this field was represented
using default-constructed `utils::UUID`).
In addition to the data file itself. Currently validation avoids the
index altogether, using the crawling reader which only relies on the
data file and ignores the index+summary. This is because a corrupt
sstable usually has a corrupt index too and using both at the same time
might hide the corruption. This patch adds targeted validation of the
index, independent of and in addition to the already existing data
validation: it validates the order of index entries as well as whether
the entry points to a complete partition in the data file.
This will usually result in duplicate errors for out-of-order
partitions: one for the data file and one for the index file.
Fixes: #9611
Closes #11405
* github.com:scylladb/scylladb:
test/cql-pytest: add test_sstable_validation.py
test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
tools/scylla-sstables: write validation result to stdout
sstables/sstable: validate(): delegate to mx validator for mx sstables
sstables/mx/reader: add mx specific validator
mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter
sstables/mx/reader: template data_consume_rows_context_m on the consumer
sstables/mx/reader: move row_processing_result to namespace scope
sstables/mx/reader: use data_consumer::proceed directly
sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic)
compaction/compaction: remove now unused scrub_validate_mode_validate_reader()
compaction/compaction: move away from scrub_validate_mode_validate_reader()
tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
sstables/sstable: add validate() method
compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
compaction: scrub: use error messages from validator
mutation_fragment_stream_validator: produce error messages in low-level validator
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loop relies on admission checks, triggered by a read
releasing resources, to bring queued reads into the _ready_list while
it is executing the current read. But in some cases the current read
might not free any resources and thus fail to trigger an admission
check, and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner case on the off chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
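the corner case and the fix can be modeled with a toy loop: with the explicit admission check after the ready list drains, all queued permits are executed even though executing a read frees no resources (names and shapes are illustrative, not the real reader-concurrency semaphore):

```cpp
#include <cassert>
#include <deque>

// permits wait in _queue until admitted into _ready_list; the fix is
// the maybe_admit() call after the ready list drains, instead of
// relying solely on resource release to trigger admission
class execution_loop {
    std::deque<int> _queue;
    std::deque<int> _ready_list;
    int _capacity;
public:
    explicit execution_loop(int capacity) : _capacity(capacity) {}
    void enqueue(int permit) { _queue.push_back(permit); }
    bool can_admit() const { return (int)_ready_list.size() < _capacity; }
    void maybe_admit() {
        while (!_queue.empty() && can_admit()) {
            _ready_list.push_back(_queue.front());
            _queue.pop_front();
        }
    }
    int run() {
        int executed = 0;
        maybe_admit();
        while (!_ready_list.empty()) {
            _ready_list.pop_front();  // "execute" permit; frees nothing
            ++executed;
            if (_ready_list.empty()) {
                maybe_admit();  // the explicit check added by the fix
            }
        }
        return executed;
    }
};
```

without the explicit check inside run(), only the initially admitted permits would execute and the rest would sit in the queue.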
Fixes: #13540
Closes #13541
The discussion on the thread says that when we reformat a volume with
another filesystem, the kernel and libblkid may skip populating
/dev/disk/by-* since they detect two filesystem signatures, because
mkfs.xxx did not clear the previous filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice: for the target disks and also for the
RAID device. wipefs for the RAID device is needed since wipefs on the
disks doesn't clear filesystem signatures on /dev/mdX (we may see the
previous filesystem signature on /dev/mdX when we construct the RAID
volume multiple times on the same disks).
Also drop the -f option from mkfs.xfs; it will verify that wipefs
worked as expected.
Fixes #13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#13738
With regards to closing the looked-up querier if an exception is thrown.
In particular, this requires closing the querier if a semaphore mismatch
is detected. Move the table lookup above the line where the querier is
looked up, to avoid having to handle the exception from it.
The test performs an `INSERT` followed by a `SELECT`, checking if the
previously inserted data is returned.
This may fail because we're using `ring_delay = 0` in tests and the two
queries may arrive at different nodes, whose `token_metadata` didn't
converge yet (it's eventually consistent based on gossiping).
I illustrated this here:
https://github.com/scylladb/scylladb/issues/12937#issuecomment-1536147455
Ensure that the nodes' token rings are synchronized (by waiting until
the token ring members on each node are the same as the group 0
configuration).
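A rough Python sketch of such a wait (the helper name and fetch function are hypothetical, not the actual test code):

```python
# Hypothetical helper: poll each node until its set of token ring members
# matches the group 0 configuration, or give up after a timeout.
import time

def wait_for_token_ring_sync(nodes, group0_members, fetch_ring_members,
                             timeout=60.0, poll=0.1):
    """fetch_ring_members(node) -> set of host IDs in that node's token ring."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(fetch_ring_members(n) == group0_members for n in nodes):
            return True
        time.sleep(poll)
    return False
```

Only after this returns True is it safe to assume any node can serve both the `INSERT` and the `SELECT` with a converged ring.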
Fixes #12937
Closes #13791
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.
However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.
To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.
Fixes #13788
Closes #13789
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the source codes, and build Wasm when
necessary. This series adds build commands that
compile C/Rust sources to Wasm and uses them for Wasm
programs that we're already using.
After these changes, adding a new program that should be
compiled from Rust requires only adding its source code
and updating the `wasms` and `wasm_deps` lists in
`configure.py`.
All Wasm programs are built by default when building all
artifacts, artifacts in a given mode, or when building
tests. Additionally, a {mode}-wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/{mode}/wasm,
and are accessed in cql-pytests similarly to the way we're
accessing the scylla binary - using glob.
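A minimal sketch of the glob-based lookup (the helper name and exact path layout are assumptions):

```python
# Hypothetical helper: locate a generated Wasm file by globbing the build
# directory, similarly to how the scylla binary is found by the tests.
import glob
import os

def find_wasm_file(build_dir, name):
    matches = glob.glob(os.path.join(build_dir, "*", "wasm", name + ".wasm"))
    if not matches:
        raise FileNotFoundError(f"no {name}.wasm under {build_dir}")
    # Prefer the most recently built mode if several modes were built.
    return max(matches, key=os.path.getmtime)
```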
Closes #13209
* github.com:scylladb/scylladb:
wasm: replace wasm programs with their source programs
build: prepare rules for compiling wasm files
build: set the type of build_artifacts
test: extend capabilities of Wasm reading helper function
token_metadata takes token_metadata_impl as unique_ptr,
so it makes sense to create it that way in the first place
to avoid unnecessary moves.
token_metadata_impl constructor with shallow_copy parameter
was made public for std::make_unique. The effective
accessibility of this constructor hasn't changed though since
shallow_copy remains private.
After recent changes, we are able to store only the
C/Rust source codes for Wasm programs, and only build
them when necessary. This patch utilizes this
opportunity by removing most of the currently stored
raw Wasm programs, replacing them with C/Rust sources
and adding them to the new build system.
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the (C/Rust) source codes, and build Wasm
when necessary. This patch adds build commands that
compile C/Rust sources to Wasm.
After these changes, adding a new program that should be
compiled from Rust requires only adding its source code
and updating the wasms and wasm_deps lists in
configure.py.
All Wasm programs are built by default when building all
artifacts, all artifacts in a given mode, or when building
tests. Additionally, a ninja wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/wasm.
Currently, build_artifacts is of type set[str] | list, which prevents
us from performing set operations on it. In a future patch, we will
want to take set differences and set intersections with it, so we
initialize build_artifacts to a set in all cases.
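To illustrate why a set type helps, a hypothetical sketch of the kind of set algebra a later patch might do (artifact names are invented):

```python
# Set algebra over requested artifacts: difference to detect unknown names,
# intersection to narrow down to valid ones. Names are illustrative only.
all_artifacts = {"scylla", "iotune", "tests", "wasm"}

def artifacts_to_build(requested, default=all_artifacts):
    requested = set(requested)        # normalize a list input to a set
    unknown = requested - default     # set difference
    if unknown:
        raise ValueError(f"unknown artifacts: {sorted(unknown)}")
    return requested & default if requested else set(default)  # intersection
```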
Currently, we require that the Wasm file is named the same
as the function. In the future we may want multiple functions
with the same name, which we can't currently do due to this
limitation.
This patch allows specifying the function name, so that multiple
files can have a function with the same name.
Additionally, the helper method now escapes "'" characters, so
that they can appear in future Wasm files.
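The escaping itself is the standard CQL single-quote doubling; a minimal sketch (the helper name is invented):

```python
# Single quotes inside the Wasm source must be doubled before the source is
# embedded in a CQL string literal, per standard CQL escaping.
def escape_for_cql_string(text):
    return text.replace("'", "''")
```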
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Therefore, let's implement it.
Fixes https://github.com/scylladb/scylladb/issues/13649.
Closes #13713
* github.com:scylladb/scylladb:
s3: Provide timestamps in the s3 file implementation
s3: Introduce get_object_stats()
s3: introduce get_object_header()
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Fixes #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get_object_stats() will be used for retrieving content size and
also last modified.
The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since #13452, we switched most of the call sites from std::regex
to boost::regex. in this change, all occurrences of `#include <regex>`
are dropped unless std::regex is used in the same source file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13765
functionality-wise, `uint64_t_tri_compare()` is identical to the
three-way comparison operator, so there is no need to keep it. in this
change, it is dropped in favor of <=>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13794
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.
Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.
The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
untyped_result_set is used for internal queries, where ease-of-use is more
important than performance. Currently, cells are held either by value or
by reference (managed_bytes_view). An upcoming change will cause the
result set to be built from managed_bytes_view, making it non-owning, but
the source data is not actually held, resulting in a use-after-free.
Rather than chase the source and force the data to be owned in this case,
just drop the possibility for a non-owning untyped_result_set. It's only
used in non-performance-critical paths and safety is more important than
saving a few cycles.
This also results in simplification: previously, we had a variant selecting
monostate (for NULL), managed_bytes_view (for a reference), and bytes (for
owning data); now we only have a bytes_opt since that already signifies
data-or-NULL.
Once result_set transitions to managed_bytes_opt, untyped_result_set
will follow. For now it's easier to use bytes_opt.
compare() and equal() can compare two unfragmented values or two
fragmented values, but a mix of a fragmented value and an unfragmented
value runs afoul of C++ conversion rules. Add more overloads to
make it simpler for users.
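The mixed comparison can be modeled in a few lines of Python (purely conceptual; the real overloads are C++ and avoid materializing either side contiguously):

```python
# Conceptual model: a value is either unfragmented (one contiguous bytes
# object) or fragmented (a sequence of chunks); equality should hold for the
# same byte sequence regardless of form. Names are invented.
from itertools import zip_longest

def iter_bytes(value):
    if isinstance(value, (bytes, bytearray)):
        yield from value          # unfragmented: one contiguous buffer
    else:
        for chunk in value:       # fragmented: iterate chunk by chunk
            yield from chunk

def fragmented_equal(a, b):
    # Compare byte-by-byte without copying either side into one buffer.
    return all(x == y for x, y in zip_longest(iter_bytes(a), iter_bytes(b)))
```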
The codebase evolved to have several different ways to hold a fragmented
buffer: fragmented_temporary_buffer (for data received from the network;
not relevant for this discussion); bytes_ostream (for fragmented data that
is built incrementally; also used for a serialized result_set), and
managed_bytes (used for lsa and serialized individual values in
expression evaluation).
One problem with this state of affairs is that using data in one
fragmented form with functions that accept another fragmented form
requires either a copy, or templating everything. The former is
unpalatable for fast-path code, and the latter is undesirable for
compile time and run-time code footprint. So we'd like to make
the various forms compatible.
In 53e0dc7530 ("bytes_ostream: base on managed_bytes") we changed
bytes_ostream to have the same underlying data structure as
managed_bytes, so all that remains is to add the right API. This
is somewhat difficult as the data is hidden in multiple layers:
ser::buffer_view<> is used to abstract a slice of bytes_ostream,
and this is further abstracted by using iterators into bytes_ostream
rather than directly using the internals. Likewise, it's impossible
to construct a managed_bytes_view from the internals.
Hack through all of these by adding extract_implementation() methods,
and a build_managed_bytes_view_from_internals() helper. These are all
used by new APIs buffer_view_to_managed_bytes_view() that extract
the internals and put them back together again.
Ideally we wouldn't need any of this, but unifying the type system
in this area is quite an undertaking, so we need some shortcuts.
Becomes useful in later patches.
To avoid double-compiling the call to func(), use an
immediately-invoked lambda to calculate the bytes_view we'll be
calling func() with.
The code was incorrectly passing a data_value of type bytes due to
implicit conversion of the result of serialize() (bytes_opt) to a
data_value object of type bytes_type via:
data_value(std::optional<NativeType>);
mutation::set_static_cell() accepts a data_value object, which is then
serialized using column's type in abstract_type::decompose(data_value&):
bytes b(bytes::initialized_later(), serialized_size(*this, value._value));
auto i = b.begin();
value.serialize(i);
Notice that serialized_size() is taken from the column type, but
serialization is done using data_value's type. The two types may have
a compatible CQL binary representation, but may differ in native
types. serialized_size() may incorrectly interpret the native type and
come up with the wrong size. If the size is too small, we end up with
stack or heap corruption later after serialize().
For example, if the column type is utf8 but value holds bytes, the
size will be wrong because even though both use the basic_sstring
type, they have a different layout due to max_size (15 vs 31).
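A toy model of the bug, with the corruption modeled as an exception instead of a stack smash (all names are invented):

```python
# Toy model: decompose() sizes the buffer using the *column's* type, but the
# value serializes itself using its *own* type. When the two layouts differ,
# the preallocated buffer has the wrong size; the real code corrupts the
# stack or heap, here we raise instead.
def decompose(column_serialized_size, value_bytes):
    size = column_serialized_size(value_bytes)  # size per the column's type
    buf = bytearray(size)
    data = value_bytes                          # bytes per the value's own type
    if len(data) != len(buf):
        raise BufferError(
            f"serializing {len(data)} bytes into a {len(buf)}-byte buffer")
    buf[:] = data
    return bytes(buf)
```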
Fixes #13717
Closes #13787
When requesting memory via `reader_permit::request_memory()`, the
requested amount is added to `_requested_memory` member of the permit
impl. This is because multiple concurrent requests may be blocked and
waiting at the same time. When the requests are fulfilled, the entire
amount is consumed and individual requests track their requested amount
with `resource_units` to release later.
There is a corner case related to this: if a reader permit is registered
as inactive while it is waiting for memory, its active requests are
killed with `std::bad_alloc`, but the `_requested_memory` field is not
cleared. If the read survives because the killed requests were part of
a non-vital background read-ahead, a later memory request will also
include the amount from the failed requests. This extra amount will not
be released and hence will cause a resource leak when the permit is
destroyed.
Fix by detecting this corner case and clearing the `_requested_memory`
field. Modify the existing unit test for the scenario of a permit
waiting on memory being registered as inactive, to also cover this
corner case, reproducing the bug.
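A simplified model of the accounting and the fix (the structure is invented; the real code is the reader permit implementation):

```python
# Toy model of the permit's memory bookkeeping. requested_memory tallies
# blocked requests; when they are fulfilled, the whole tally is consumed.
class PermitImpl:
    def __init__(self):
        self.requested_memory = 0   # sum over currently blocked requests
        self.consumed = 0

    def request_memory(self, amount):
        self.requested_memory += amount

    def kill_pending_requests(self):
        # The fix: clear the tally for requests that will never be fulfilled,
        # so a later request does not also consume the dead requests' amount.
        self.requested_memory = 0

    def fulfil(self):
        self.consumed += self.requested_memory
        self.requested_memory = 0
        return self.consumed
```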
Fixes: #13539
Closes #13679
this is one of the changes to reduce the usage of integer-based generations
in tests. in the future, we will need to expand the tests to exercise
UUID-based generations, or at least to be neutral to the underlying
generation's identifier type. so, removing the helpers which only accept
`generation_type::int_t` would help us make this happen.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()
because we will switch over to a UUID-based generation identifier, the member
function and the free function generation_value() cannot fulfill the needs
anymore. so, in this change, they are consolidated and replaced by
"as_int()", whose name is more specific, will still work, and won't be
misleading even after switching to a UUID-based generation identifier,
whereas `value()` would be confusing by then: it could be an integer or a UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For some reason Scylla crashes on `aarch64` in release mode when calling
`fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this
line:
```
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id()));
```
I found this in our configure.py:
```
def get_clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
# we see miscompiles with 1200 and above with format("{}", uuid)
# also coroutine miscompiles with 600
return 300
else:
return 2500
```
but reducing it to `0` didn't help.
I managed to get the following backtrace (with inline threshold 0):
```
void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751
(inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728
(inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255
void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const*, wchar_t const*) at ??:?
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:?
fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:?
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206
(inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572
```
Maybe it's a bug in `fmt` library?
In any case replacing the call with `::format` (i.e. `seastar::format`
from seastar/core/print.hh) helps.
Do it for the entire file for consistency (and avoiding this bug).
Also, for the future, replace `format` calls with `::format` - now it's
the same thing, but the latter won't clash with `std::format` once we
switch to libstdc++13.
Fixes #13707
Closes #13711
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.
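A sketch of a `read_last_line` with the listed properties, assuming the same rough approach as the patched helper (read backwards in blocks, search for the separator in the joined buffer so a separator split between blocks is still found, decode leniently, cap the result length):

```python
# Hypothetical read_last_line: block sizes, character limit, and error
# handling are assumptions, not the actual scylla_cluster.py/util.py code.
import os

def read_last_line(path, block=512, max_chars=256):
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        while pos > 0:
            step = min(block, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
            # Search the joined buffer, so a separator split between two
            # blocks is still found.
            idx = data.rstrip(b"\n").rfind(b"\n")
            if idx != -1:
                data = data[idx + 1:]
                break
    # Decode leniently so a partially-read multi-byte UTF-8 sequence does
    # not raise, then cap the length in characters, not bytes.
    return data.rstrip(b"\n").decode("utf-8", errors="replace")[-max_chars:]
```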
Closes #13427
* github.com:scylladb/scylladb:
pylib_test: add tests for read_last_line
pytest: add pylib_test directory
scylla_cluster.py: fix read_last_line
scylla_cluster.py: move read_last_line to util.py
to avoid the FTBFS after we bump up the Seastar submodule, which bumped its API level to v7. API v7 is a breaking change, so in order to unbreak the build we have to hardwire the API level to 6. `configure.py` also does this.
Closes #13780
* github.com:scylladb/scylladb:
build: cmake: disable deprecated warning
build: cmake: use Seastar API level 6
we introduced the linkage to Boost::unit_test_framework in
fe70333c19; this library is used by
test/lib/test_utils.cc, so update CMake accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13781
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 512 chars;
* the logic of the function made simpler and more
maintainable.
There are two of them currently, with slightly different declarations. Better to leave only one.
Closes #13772
* github.com:scylladb/scylladb:
test: Deduplicate test::filename() static overload
test: Make test::filename return fs::path
The method in question suffers from scylladb/seastar#1298. The PR fixes it and makes it a bit shorter along the way.
Closes #13776
* github.com:scylladb/scylladb:
sstable: Close file at the end
sstables: Use read_entire_stream_cont() helper
This commit changes the configuration in the conf.py
file to make branch-5.2 the latest version and
remove it from the list of unstable versions.
As a result, the docs for version 5.2 will become
the default for users accessing the ScyllaDB Open Source
documentation.
This commit should be merged as soon as version 5.2
is released.
Closes #13681
One is unused; the other is not really required to be public.
Closes #13771
* github.com:scylladb/scylladb:
file_writer: Remove static make() helper
sstable: Use toc_filename() to print TOC file path
The test case consists of two internal sub-test-cases. Making them
explicit kills three birds with one stone
- improves parallelizm
- removes env's tempdir wiping
- fixes code indentation
refs: #12707
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13768
since Seastar now deprecates a bunch of APIs which accept io_priority_class,
we started to get deprecation warnings. before migrating to the v7 API,
let's disable this warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to avoid the FTBFS after we bump up the Seastar submodule,
which bumped its API level to v7. API v7 is a breaking
change, so in order to unbreak the build we have to hardwire
the API level to 6. `configure.py` also does this.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 02d5a0d7c...f94b1bb9c (12):
> Merge 'Unify CPU scheduling groups and IO priority classes' from Pavel Emelyanov
> scripts: addr2line: relax regular expression for matching kernel traces
> add dirs for clangd to .gitignore
> http::client: Log failed requests' body
> build: always quote the ENVIRONMENT with quotes
> exception_hacks: Change guard check order to work around static init fail
> shared_future: remove support for variadic futures
> iotune: Don't close file that wasn't opened
Fixes #13439
> Merge 'Relax per tick IO grab threshold' from Pavel Emelyanov
> future: simplify constraint on then() a little
> Merge 'coroutine: generator: initialize const member variable and enable generator tests' from Kefu Chai
> future: drop libc++ std::tuple compatibility hack
Closes #13777
The thing is that when closing a file input stream the underlying file is
not .close()-d (see scylladb/seastar#1298). The remove_by_toc_name() is
buggy in this sense. Using with_closeable() fixes it and makes the code
shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The remove_by_toc_name() wants to read the whole stream into a sstring.
There's a convenience helper to facilitate that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a follow-up to said PR, completing the name changes: renaming all symbols, names, comments etc., so all is consistent and up-to-date.
Closes #13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
e2c9cdb576 moved the validation of the range tombstone change to the place where it is actually consumed, so we don't attempt to pass purged or discarded range tombstones to the validator. In doing so, however, the validation pass was moved after the consume call, which moves the range tombstone change, so the validator was being passed a moved-from range tombstone. Fix this by moving the validation to before the consume call.
Refs: #12575
Closes #13749
* github.com:scylladb/scylladb:
test/boost/mutation_test: add sanity test for mutation compaction validator
mutation/mutation_compactor: add validation level to compaction state query constructor
mutation/mutation_compactor: validate range tombstone change before it is moved
The current S3 client was tested over minio, and it takes a few more touches to work with Amazon S3.
The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is that the request _body_ may or may not be included in the signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. the request body should be included in the signature calculations), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. it doesn't include the request body in the signature calculation, only the necessary headers and query parameters), but thus also needs HTTPS.
So what this set does is make the existing S3 client code sign requests. In order to sign a request the code needs to get the AWS key and secret (and region) from somewhere, and this somewhere is the conf/object_storage.yaml config file. The signature-generating code was previously merged (moved from alternator code) and updated to suit the S3 client's needs.
In order to properly support HTTPS the PR adds a special connection factory to be used with the seastar http client. The factory does DNS resolving of AWS endpoint names and configures gnutls system trust.
fixes: #13425
Closes #13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT *
FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
```
The reason for the regression is that the call to `do_for_each_sstable` in `collect_all_shared_sstables` to search for sstables that need cleanup caused the list of sstables in the sstable directory to be moved and cleared.
parallel_for_each_restricted moves the container passed to it into a `do_with` continuation. This is required for parallel_for_each_restricted.
However, moving the container is destructive, and so the decision whether to move or not needs to be the caller's, not the callee's.
This patch changes the signature of parallel_for_each_restricted to accept a container rather than an rvalue reference, allowing the callers to decide whether to move or not.
Most callers are converted to move the container, except for `do_for_each_sstable`, which copies `_unshared_local_sstables`, allowing callers to call `dir.do_for_each_sstable` multiple times without moving the list contents.
Closes #13526
* github.com:scylladb/scylladb:
sstable_directory: coroutinize parallel_for_each_restricted
sstable_directory: parallel_for_each_restricted: use std::ranges for template definition
sstable_directory: parallel_for_each_restricted: do not move container
There are two of them currently, both returning fs::path for sstable
components. One is static and can be dropped; callers are patched to use
the non-static one, making the code a tiny bit shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::filename() is private and is not supposed to be used as a
path to open any files. However, tests are different, and they sometimes
use it as one. For that they use a test wrapper that has access to private
members and may make assumptions about the meaning of sstable::filename().
That said, test::filename() should return fs::path, not sstring.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit 1cb95b8cf caused a small regression in the debug printer.
After that commit, range tombstones are printed to stdout,
instead of the target stream.
In practice, this causes range tombstones to appear in test logs
out of order with respect to other parts of the debug message.
Fix that.
Closes #13766
The sstable::write_toc() gets TOC filename from file writer, while it
can get it from itself. This makes the file_writer::get_filename()
private and actually improves logging, as the writer is not required
to have the filename onboard, while sstable always has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All Raft verbs include dst_id, the ID of the destination server, but it isn't checked.
`append_entries` will work even if it arrives at completely the wrong server (as long as it is in the same group).
This can cause problems, e.g. in the scenario of replacing a dead node.
This commit adds verification that `dst_id` matches the server's ID; if it doesn't,
the Raft verb is rejected.
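A minimal model of the added check (the exception type and handler shape are invented):

```python
# Reject a Raft verb whose dst_id does not match this server's own ID,
# instead of silently handling a verb meant for another server.
class WrongDestination(Exception):
    pass

def handle_raft_rpc(my_id, dst_id, handler, *args):
    if dst_id != my_id:
        raise WrongDestination(f"verb for {dst_id} arrived at {my_id}")
    return handler(*args)
```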
Closes #12179
storage_service uses raft_group0, but during shutdown the latter is
destroyed before the former is stopped. This series moves raft_group0
destruction to after storage_service is stopped. For the move to work,
some existing dependencies of raft_group0 are dropped, since they are
not really needed during the object's creation.
Fixes #13522
In case an sstable unit test case is run individually, it would fail
with an exception saying that the S3_... environment is not set. It's better
to skip the test case rather than fail it. If someone wants to run it from
a shell, they will have to prepare an S3 server (minio/AWS public bucket) and
provide the proper environment for the test case.
refs: #13569
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13755
One-way RPC and two-way RPC have different semantics, i.e. with the former the
client doesn't need to wait for an answer.
This commit splits the logic of `handle_raft_rpc` to enable handling differences
in semantics, e.g. error handling.
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times; it only guarantees the
same order of events on all replicas.
When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.
To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.
Fixes #13518 (flaky topology test).
Closes #13756
raft_group0 does not really depend on cdc::generation_service; it needs
it only transiently, so pass it to the appropriate methods of raft_group0
instead of during its creation.
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT * FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
E -1
```
The reason for the regression is that the call to `do_for_each_sstable`
in `collect_all_shared_sstables` to search for sstables that need
cleanup caused the list of sstables in the sstable directory to be
moved and cleared.
parallel_for_each_restricted moves the container passed to it
into a `do_with` continuation. This is required for
parallel_for_each_restricted.
However, moving the container is destructive, and so
the decision whether to move or not needs to be the
caller's, not the callee's.
This patch changes the signature of parallel_for_each_restricted
to accept an lvalue reference to the container rather than an rvalue
reference, allowing the callers to decide whether to move or not.
Most callers are converted to move the container, as they effectively do
today, and a new method, `filter_sstables`, was added for the
`collect_all_shared_sstables` use case, which allows the `func` that
processes each sstable to decide whether the sstable is kept
in `_unshared_local_sstables` or not.
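Conceptually, `filter_sstables` lets the callback decide what stays in the list instead of moving the list out; a rough Python model (names illustrative):

```python
# The per-sstable callback returns True to keep the sstable listed; the list
# is updated in place, so nothing is destructively moved out of it.
def filter_sstables(unshared_local_sstables, func):
    kept = [sst for sst in unshared_local_sstables if func(sst)]
    unshared_local_sstables[:] = kept  # update in place, no destructive move
```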
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the validate command uses the logger to output the result of
validation. This is inconsistent with other commands which all write
their output to stdout and log any additional information/errors to
stderr. This patch updates the validate command to do the same. While at
it, remove the "Validating..." message; it is not useful.
We have a more in-depth validator for the mx format, so delegate to that
if the validated sstable is of that format. For kl/la we fall-back to
the reader-level validator we used before.
Working with the low-level sstable parser and index reader, this
validator also cross-checks the index with the data file, making sure
all partitions are located at the position and in the order the index
describes. Furthermore, if the index also has promoted index, the order
and position of clustering elements is checked against it.
This is above the usual fragment kind order, partition key order and
clustering order checks that we already had with the reader-level
validator.
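The partition-level cross-check can be sketched like this (purely conceptual; the real validator walks sstable internals, not plain lists):

```python
# Cross-check the index against the data file: each indexed partition must
# start at the recorded position, and positions must be strictly increasing.
def validate_index_against_data(index_entries, data_partitions):
    """index_entries: [(key, position)]; data_partitions: {position: key}."""
    errors = []
    last_pos = -1
    for key, pos in index_entries:
        if pos <= last_pos:
            errors.append(f"index out of order at {key!r} (pos {pos})")
        last_pos = pos
        if data_partitions.get(pos) != key:
            errors.append(f"partition {key!r} not found at data position {pos}")
    return errors
```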
it turns out the only places where we have compiler warnings of -Wparentheses-equality are in the source code generated by ANTLR. strictly speaking, this is valid C++ code, just not quite readable from the hygienic point of view. so let's enable this warning in the source tree, but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang, so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
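A sketch of how such per-source suppression can look in CMake (the file paths and the wiring into the real targets are assumptions, not the actual scylladb build files):

```cmake
# enable the warning tree-wide...
add_compile_options(-Wparentheses-equality)
# ...but suppress it for the ANTLR-generated sources only
set_source_files_properties(
  ${CMAKE_BINARY_DIR}/cql3/CqlLexer.cpp
  ${CMAKE_BINARY_DIR}/cql3/CqlParser.cpp
  PROPERTIES
    COMPILE_OPTIONS "-Wno-parentheses-equality")
```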
Closes #13762
* github.com:scylladb/scylladb:
build: only apply -Wno-parentheses-equality to ANTLR generated sources
compaction: disambiguate format_to()
it turns out the only places where we have compiler warnings of
-Wparentheses-equality are in the source code generated by ANTLR. strictly
speaking, this is valid C++ code, just not quite readable from the
hygienic point of view. so let's enable this warning in the source tree,
but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang,
so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Preceding commits in this patch series have extended the MVCC
mechanism to allow for versions with different schemas
in the same entry/snapshot, with on-the-fly and background
schema upgrades to the most recent version in the chain.
Given that, we can perform gentle schema upgrades by simply
adding an empty version with the target schema to the front
of the entry.
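A toy model of this idea, with plain structs standing in for the real MVCC types:

```cpp
#include <cassert>
#include <list>

// Toy model of the mechanism described above (not Scylla's actual MVCC
// types): an entry is a front-to-back chain of versions, each carrying
// its own schema id. A "gentle" upgrade just pushes an empty version
// with the target schema to the front; background merging (not modeled
// here) later collapses the chain to a single version.
struct version {
    int schema_id;
    bool empty;     // a freshly pushed upgrade version holds no data
};

struct entry {
    std::list<version> versions;    // front = newest

    void upgrade(int target_schema) {
        versions.push_front(version{target_schema, true});
    }
    int current_schema() const {
        return versions.front().schema_id;
    }
};
```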
This patch is intended to be the first and only behaviour-changing patch in the
series. Previous patches added code paths for multi-schema snapshots, but never
exercised them, because before this patch two different schemas within a single
MVCC chain never happened. This patch makes it happen and thus exercises all the
code in the series up until now.
Fixes #2577
Each partition_version is allowed to have a different schema now.
As of this patch, all versions reachable from a snapshot/entry always
have the same schema, but this will change in an upcoming patch.
This commit prepares merge_partition_versions() for that.
See code comments added in this patch for a detailed description.
The design chosen in this patch requires adding a bit of information to
partition_version. Due to alignment, it results in a regrettable waste of 8
bytes per partition. If we want, we can recover that in the future by squeezing
the bit into some free bit in other fields, for example the highest or lowest
bits of one of the pointers in partition_version.
After this patch, MVCC should be prepared for replacing the atomic schema
upgrade() of cache/memtable entries with a gentle upgrade().
To avoid reactor stalls during schema upgrades of memtable and cache entries,
we want to do them interruptibly, not atomically. To achieve that, we want
to reuse the existing gentle version merging mechanism. If we generalize
version merging algorithms to handle `mutation_partition`s with different
schemas, a schema upgrade will boil down simply to adding a new empty MVCC
version with the new schema.
In a previous patch, we already generalized the cursor to upgrade rows
on the fly when reading.
But we still have to generalize the other MVCC algorithm: the merging of
superfluous mutation_partition_v2 objects. This patch modifies the two-version
merging algorithm: apply_monotonically(). The next patch will update its caller,
merge_partition_versions(), to make use of the updated apply_monotonically()
properly.
The upcoming schema upgrade change will perform the schema upgrade by adding
a new version (with the new schema) to the partition entry.
To clean a multi-version entry, eviction is not enough - the versions have
to be merged and/or cleared first. drain() does just that.
An upcoming patch will enable multiple schemas within a single entry,
after the entry is upgraded.
partition_entry::squashed isn't prepared for that yet.
This patch prepares it.
To support gentle schema upgrades, each version has its own schema.
Currently this facility is unused, and the schema is equal for
all versions in a snapshot. But in upcoming commits this will change.
In the new design, after an entry upgrade, there will be a transitional
period where two versions with different schemas will coexist in a snapshot.
Eventually, these versions will be merged by mutation_cleaner into one
version with the current schema, but until then reads have to merge
multi-schema snapshots on the fly.
This commit implements support for per-version schemas in the cursor.
When in upcoming patches we allow multiple schema versions within a single
snapshot, reads will have to upgrade rows on the fly.
This also applies to squashed()
When in upcoming patches we allow multiple schema versions within a single
snapshot, reads will have to upgrade rows on the fly.
This also applies to the static row.
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
It will be used in upcoming commits.
A factory function is used, rather than an actual constructor,
because we want to delegate the (easy) case of equal schemas
to the existing single-schema constructor.
And that's impossible to do with only constructors (at least without
invoking a copy/move constructor).
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
After adding a _schema field to each partition version,
the field in memtable_entry is redundant. It can be always recovered
from the latest version. Remove it.
After adding a _schema field to each partition version,
the field in cache_entry is redundant. It can be always recovered
from the latest version. Remove it.
After adding a _schema field to each partition version,
the field in partition_snapshot is redundant. It can be always recovered
from the latest version. Remove it.
Currently, partition_version does not reference its schema.
All partition_versions reachable from an entry/snapshot have the same schema,
which is referenced in memtable_entry/cache_entry/partition_snapshot.
To enable gentle schema upgrades, we want to use the existing background
version merging mechanism. To achieve that, we will move the schema reference
into partition_version, and we will allow neighbouring MVCC versions to have
different schemas, and we will merge them on-the-fly during reads and
persistently during background version merges.
This way, an upgrade will boil down to adding a new empty version with
the new schema.
This patch adds the _schema field to partition_version and propagates
the schema pointer to it from the version's containers (entry/snapshot).
Subsequent patches will remove the schema references from the containers,
because they are now redundant.
We don't have a convention for when to pass `schema_ptr` and when to pass
`const schema&` around.
In general, IMHO the natural convention for such a situation is to pass the
shared pointer if the callee might extend the lifetime of shared_ptr,
and pass a reference otherwise. But we convert between them willy-nilly
through shared_from_this().
While passing a reference to a function which actually expects a shared_ptr
can make sense (e.g. due to the fact that smart pointers can't be passed in
registers), the other way around is rather pointless.
This patch takes one occurrence of that and modifies the parameter to a reference.
Since enable_shared_from_this makes shared pointer parameters and reference
parameters interchangeable, this is a purely cosmetic change.
They will be used in upcoming patches which introduce incremental schema upgrades.
Currently, these variants always copy cells during upgrade.
This could be optimized in the future by adding a way to move them instead.
The comment refers to "other", but it means "pe". Fix that.
The patch also adds a bit of context to the mutation_partition jargon
("evictability" and "continuity"), by reminding how it relates
to the concrete abstractions: memtable and cache.
Most variants of apply() and apply_monotonically() in mutation_partition_v2
are leftovers from mutation_partition, and are unused. Thus they only
add confusion and maintenance burden. Since we will be modifying
apply_monotonically() in upcoming patches, let's clean them up, lest
the variants become stale.
This patch removes all unused variants of apply() and apply_monotonically()
and "manually inlines" the variants which aren't used often enough to carry
their own weight.
In the end, we are left with a single apply_monotonically() and two convenience
apply() helpers.
The single apply_monotonically() accepts two schema arguments. This facility
is unimplemented and unused as of this patch - the two arguments are always
the same - but it will be implemented and used in later parts of the series.
The comment suggests that the order of sentinel insertion is meaningful because
of the resulting eviction order. But the sentinels are added to the tracker
with the two-argument version of insert(), which inserts the second argument
into the LRU right before the (more recent) first argument.
Thus the eviction order of sentinels is decided explicitly, and it doesn't
rely on insertion order.
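A toy model of the two-argument insert semantics described above (the real tracker API differs):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <string>

// Toy LRU: insert(more_recent, item) places `item` immediately before
// `more_recent`, so the relative eviction order of the two is set
// explicitly rather than depending on insertion order.
struct lru {
    std::list<std::string> items;

    void insert(const std::string& more_recent, const std::string& item) {
        auto it = std::find(items.begin(), items.end(), more_recent);
        items.insert(it, item);  // lands right before `more_recent`
    }
};
```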
In some mixed-schema apply helpers for tests, the source mutation
is accidentally copied with the target schema. Fix that.
Nothing seems to be currently affected by this bug; I found it when it
was triggered by a new test I was adding.
we should always qualify `format_to` with its namespace. otherwise
we'd have the following failure when compiling with libstdc++ from GCC-13:
```
/home/kefu/dev/scylladb/compaction/table_state.hh:65:16: error: call to 'format_to' is ambiguous
return format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
^~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13760
Support the AWS_S3_EXTRA environment variable that is ':'-split; the
respective substrings are set as endpoint AWS configuration. This makes
it possible to run the boost S3 test over real S3.
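The ':'-splitting could be sketched like this (a hypothetical helper, not the actual test code):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a ':'-separated string such as the AWS_S3_EXTRA value into its
// substrings, each of which would then be applied to the endpoint
// configuration.
std::vector<std::string> split_extra(const std::string& raw) {
    std::vector<std::string> parts;
    std::istringstream in(raw);
    for (std::string item; std::getline(in, item, ':');) {
        parts.push_back(item);
    }
    return parts;
}
```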
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the endpoint config specifies AWS key, secret and region, all the
S3 requests get signed. Signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes
x-amz-date, x-amz-content-sha256 and host headers into the signing list.
The content can be left unsigned when sent over HTTPS, which is what this
patch does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Existing seastar factories work on socket_address, but in S3 we have
an endpoint name, which is a DNS name in the case of real S3. So this patch
creates the http client for S3 with the custom connection factory that
does two things.
First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).
Since s3 client creation is no-waiting code currently, the above
initialization is spawned in a fiber, and before creating the connection
this fiber is waited upon.
This code probably deserves living in seastar, but for now it can land
next to utils/s3/client.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code temporarily assumes that the endpoint port is 9000.
This is what tests' local minio is started with. This patch keeps the
port number on the endpoint config and makes the test get the port number
from the minio starting code via the environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to the previous patch -- extend the s3::client constructor to take
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are as yet unused too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the client is constructed with a socket_address which is prepared
by the caller from the endpoint string. That's not flexible enough,
because s3 client needs to know the original endpoint string for two
reasons.
First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.
So this patch just relaxes the client constructor to accept the endpoint
string and hard-code port 9000. The latter is temporary -- this is how
the local tests' minio is started -- but the next patch will make it configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch. The sstables::s3_storage gets the
endpoint config instance upon creation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user sstables manager will need to provide endpoint config for
sstables' storage drivers. For that it needs to get it from db::config
and keep it in sync with its updates.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart uploading is banned for unsigned
requests on S3. Also, signed requests over plain HTTP require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so since there's an option to use S3
URLs without a region name in them, which we may want to use some time.
To keep the described configuration, the proposed place is the
object_storage.yaml file with the format:

endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...
When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Allow the validation level to be customized by whoever creates the
compaction state. Add a default value (the previous hardcoded level) to
avoid the churn of updating all call sites.
e2c9cdb576 moved the validation of the
range tombstone change to the place where it is actually consumed, so we
don't attempt to pass purged or discarded range tombstones to the
validator. In doing so however, the validate pass was moved after the
consume call, which moves the range tombstone change, the validator
having been passed a moved-from range tombstone. Fix this by moving the
validation to before the consume call.
Refs: #12575
in this series, we try to use `generation_type` as a proxy to hide its underlying type from the consumers. this paves the road to the UUID-based generation identifier, as by then we cannot assume the type of the `value()` without asking `generation_type` first; better to leave all the formatting and conversions to `generation_type`. also, this series changes the "generation" column of the sstable registry table to "uuid", and converts its value to the original generation_type when necessary. this paves the road to a world with a UUID-based generation id.
Closes #13652
* github.com:scylladb/scylladb:
db: use uuid for the generation column in sstable registry table
db, sstable: add operator data_value() for generation_type
db, sstable: print generation instead of its value
Currently there are only 2 tests for S3 -- the pure client test and the compound object_store test that launches scylla, creates an s3-backed table and CQL-queries it. At the same time there's a whole lot of small unit tests for sstables functionality, part of which can run over S3 storage too.
This PR adds this support and patches several test cases to use it. More test cases are to come later on demand.
Fixes: #13015
Closes #13569
* github.com:scylladb/scylladb:
test: Make resharding test run over s3 too
test: Add lambda to fetch bloom filter size
test: Tune resharding test use of sstable::test_env
test: Make datafile test case run over s3 too
test: Propagate storage options to table_for_test
test: Add support for s3 storage_options in config
test: Outline sstables::test_env::do_with_async()
test: Keep storage options on sstable_test_env config
sstables: Add and call storage::destroy()
sstables: Coroutinize sstable::destroy()
Currently mp_row_consumer_m creates an alias to data_consumer::proceed.
Code in the rest of the file uses both the unqualified name and
mp_row_consumer_m::proceed. Remove the alias and just use
`data_consumer::proceed` directly everywhere, which leads to cleaner code.
Test sstable::validate() instead. Also rename the unit test testing said
method from scrub_validate_mode_validate_reader_test to
sstable_validate_test to reflect the change.
At this point this test should probably be moved to
sstable_datafile_test.cc, but not in this patch.
Sadly this transition means we lose some test scenarios. Since now we
have to write the invalid data to sstables, we have to drop scenarios
which trigger errors on either the write or read path.
To replace the validate code currently in compaction/compaction.cc (not
in this commit). We want to push down this logic to the sstable layer,
so that:
* Non compaction code that wishes to validate sstables (tests, tools)
doesn't have to go through compaction.
* We can abstract how sstables are validated, in particular we want to
add a new more low-level validation method that only the more recent
sstable versions (mx) will support.
Currently said method creates a combined reader from all the sstables
passed to it then validates this combined reader.
Change it to validate each sstable (reader) individually in preparation
of the new validate method which can handle a single sstable at a time.
Note that this is not going to make much of an impact in practice; all
callers already pass a single sstable to this method.
Currently, error messages for validation errors are produced in several
places:
* the high-level validator (which is built on the low-level one)
* scrub compaction and validation compaction (scrub in validate mode)
* scylla-sstable's validate operation
We plan to introduce yet another place which would use the low-level
validator and hence would have to produce its own error messages. To cut
down all this duplication, centralize the production of error messages
in the low-level validator, which now returns a `validation_result`
object instead of bool from its validate methods. This object can be
converted to bool (so it's backwards compatible) and also contains an
error message if validation failed. In the next patches we will migrate
all users of the low level validator (be that direct or indirect) to use
the error messages provided in this result object instead of coming up
with one themselves.
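A minimal sketch of what such a result type could look like (the actual class in Scylla may differ in names and details):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Result of a validation pass: converts to bool for backwards
// compatibility, and carries a centralized error message on failure.
class validation_result {
    std::string _error;     // empty means "valid"
public:
    static validation_result valid() { return {}; }
    static validation_result invalid(std::string msg) {
        validation_result r;
        r._error = std::move(msg);
        return r;
    }
    // implicit conversion keeps existing bool-returning callers working
    operator bool() const { return _error.empty(); }
    const std::string& error_message() const { return _error; }
};
```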
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491
Closes #13563
This mini-series cleans up printing of ranges in utils/to_string.hh
It generalizes the helper function to work on a std::ranges::range,
with some exceptions, and adds a helper for boost::transformed_range.
It also changes the internal interface by moving `join` to the utils
namespace and using std::string rather than seastar::sstring.
Additional unit tests were added to test/boost/json_test
Fixes #13146
Closes #13159
* github.com:scylladb/scylladb:
utils: to_string: get rid of utils::join
utils: to_string: get rid of to_string(std::initializer_list)
utils: to_string: get rid of to_string(const Range&)
utils: to_string: generalize range helpers
test: add string_format_test
utils: chunked_vector: add std::ranges::range ctor
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a huge number, huge allocations and
stalls. This is highly undesirable.
This series adds more tests in this area covering additional corner cases,
and then fixes the issue by adding the missing verification where it's
needed. After the series, all 12 tests in test/alternator/test_number.py now pass.
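The limits described above can be sketched as a check over an already-normalized number; this is a hypothetical illustration, not Scylla's actual get_magnitude_and_precision():

```cpp
#include <cassert>
#include <string>

// A number normalized to significant digits plus a power-of-ten
// exponent for the leading digit (the "magnitude").
struct normalized_decimal {
    std::string digits;  // significant digits only, no leading/trailing zeros
    int exponent;        // decimal exponent (magnitude) of the number
};

// DynamoDB's documented limits: exponent in [-130, 125], and at most
// 38 significant decimal digits.
bool within_dynamodb_limits(const normalized_decimal& n) {
    int precision = static_cast<int>(n.digits.size());
    return precision <= 38 && n.exponent >= -130 && n.exponent <= 125;
}
```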
Fixes #6794
Closes #13743
* github.com:scylladb/scylladb:
alternator: unit test for number magnitude and precision function
alternator: add validation of numbers' magnitude and precision
test/alternator: more tests for limits on number precision and magnitude
test/alternator: reproducer for DoS in unlimited-precision addition
* change the "generation" column of sstable registry table from
bigint to uuid
* add a helper to convert UUID back to the original generation
in the long run, we encourage users to use uuid-based generation
identifiers. but in the transition period, both bigint-based and
uuid-based identifiers are used for the generation. so to cater to
both needs, we use a hackish way to store the integer in a UUID. to
differentiate the was-integer UUID from a genuine UUID, we check the
UUID's most_significant_bits. because we only support serializing
UUID v1, if the timestamp in the UUID is zero, we assume the UUID was
generated from an integer when converting it back to a generation
identifier.
also, please note, the only use case of using generation as a
column is the sstable_registry table, but since its schema is fixed,
we cannot store both a bigint and a UUID as the value of its
`generation` column; the simpler way forward is to use a single type
for the generation. to be more efficient and to preserve the type of
the generation, instead of using types like ascii string or bytes,
we will always store the generation as a UUID in this table. if the
generation's identifier is an int64_t, the value of the integer will
be used as the least significant bits of the UUID.
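The encoding round-trip described above can be sketched as follows (toy types and names, not Scylla's actual utils::UUID API; the "timestamp is zero" check is simplified to "most-significant half is zero"):

```cpp
#include <cassert>
#include <cstdint>

// Toy 128-bit UUID as two 64-bit halves.
struct uuid {
    int64_t msb;
    int64_t lsb;
};

// Store an integer generation in the least-significant bits; a zero
// most-significant half marks "this UUID was generated from an integer".
uuid encode_int_generation(int64_t gen) {
    return uuid{0, gen};
}

bool is_int_generation(const uuid& id) {
    return id.msb == 0;   // genuine v1 UUIDs carry a non-zero timestamp
}

int64_t decode_int_generation(const uuid& id) {
    assert(is_int_generation(id));
    return id.lsb;
}
```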
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cql-pytest
framework.
This test file checks various LWT conditional updates which involve
collections or UDTs (there is a separate test file for LWT conditional
updates which do not involve collections, which I haven't translated
yet).
The tests reproduce one known bug:
Refs #5855: lwt: comparing NULL collection with empty value in IF
condition yields incorrect results
And also uncovered three previously-unknown bugs:
Refs #13586: Add support for CONTAINS and CONTAINS KEY in LWT expressions
Refs #13624: Add support for UDT subfields in LWT expression
Refs #13657: Misformatted printout of column name in LWT error message
Beyond those bona-fide bugs, this test also demonstrates several places
where we intentionally deviated from Cassandra's behavior, forcing me
to comment out several checks. These deviations are known, and intentional,
but some of them are undocumented and it's worth listing here the ones
re-discovered by this test:
1. On a successful conditional write, Cassandra returns just True, Scylla
also returns the old contents of the row. This difference is officially
documented in docs/kb/lwt-differences.rst.
2. Scylla allows the test "l = [null]" or "s = {null}" with this weird
null element (the result is false), whereas Cassandra prints an error.
3. Scylla allows "l[null]" or "m[null]" (resulting in null), Cassandra
prints an error.
4. Scylla allows a negative list index, "l[-2]", resulting in null.
Cassandra prints an error in this case.
5. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
6. Scylla allows "IN null" (the condition just fails), Cassandra prints
an error in this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13663
Now when the test case and used lib/utils code is using storage-agnostic
approach, it can be extended to run over S3 storage as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The resharding test compares bloom filter sizes before and after reshard
runs. For that it gets the filter on-disk filename and stat()s it. That
won't work with S3 as it doesn't have accessible on-disk files.
Some time ago there existed the storage::get_stats() method, but now
it's gone. The new s3::client::get_object_stat() is coming, but it will
take time to switch to it. For now, generalize filter size fetching into
a local lambda. Next patch will make a stub in it for S3 case, and once
the get_object_stat() is there we'll be able to smoothly start using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
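The fix pattern, in a self-contained toy (stand-in types; the real code uses schema and boost::iterator_range):

```cpp
#include <cassert>
#include <vector>

// Stand-in for the real schema type: clustering_key_columns() returns
// a lightweight view object over long-lived member storage.
struct schema_like {
    std::vector<int> cols{7, 8, 9};
    struct view {
        const int* b;
        const int* e;
        const int& front() const { return *b; }
    };
    view clustering_key_columns() const {
        return view{cols.data(), cols.data() + cols.size()};
    }
};

int first_ck(const schema_like& s) {
    // the fix described above: keep the returned range in a local by
    // value, so it outlives the reference we take into it
    auto cks = s.clustering_key_columns();
    const int& first_view_ck = cks.front();
    return first_view_ck;
}
```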
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
The test case in question spawns an async context, then makes the test_env
instance on the stack (and a stopper for it too). There are helpers for the
above steps; better to use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the sstable_datafile test cases are capable of running with S3
storage, so this patch makes the simplest of them do it. Patching the
rest from this file is optional, because mostly the cases test how the
datafile data manipulations work without checking the files
manipulations. So even if making them all run over S3 is possible, it
will just increase the testing time w/o real test of the storage driver.
So this patch makes one test case run over local and S3 storages, more
patches to update more test cases with files manipulations are yet to
come.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach table_for_tests to use any storage options, not just the local one. For
now the only user that passes non-local options is sstables::test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable test case wants to run over S3 storage it needs to
specify that in test config by providing the S3 storage options. So
first thing this patch adds is the helper that makes these options based
on the env left by minio launcher from test.py.
Next, in order to make sstables_manager work with S3 it needs the
plugged system keyspace which, in turn, needs query processor, proxy,
database, etc. All this stuff lives in cql_test_env, so the test case
running with S3 options will run in a sstables::test_env nested inside
cql_test_env. The latter would also need to plug its system keyspace to
the former's sstables manager and turn the experimental feature ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We have known for a long time (see issue #1703) that the quality of our
CQL "syntax error" messages leaves a lot to be desired, especially when
compared to Cassandra. This patch doesn't yet bring us great error
messages with great context - doing this isn't easy and it appears that
Antlr3's C++ runtime isn't as good as the Java one in this regard -
but this patch at least fixes **garbage** printed in some error messages.
Specifically, when the parser can deduce that a specific token is missing,
it used to print
line 1:83 missing ')' at '<missing '
After this patch we get rid of the meaningless string '<missing ':
line 1:83 : Missing ')'
Also, when the parser deduced that a specific token was unneeded, it
used to print:
line 1:83 extraneous input ')' expecting <invalid>
Now we got rid of this silly "<invalid>" and write just:
line 1:83 : Unexpected ')'
Refs #1703. I haven't yet marked that issue "fixed" because I think a
complete fix would also require printing the entire misparsed line and the
point of the parse failure. Scylla still prints a generic "Syntax Error"
in most cases now, and although the character number (83 in the above
example) can help, it's much more useful to see the actual failed
statement and where character 83 is.
Unfortunately some tests enshrine buggy error messages and had to be
fixed. Other tests enshrined strange text for a generic unexplained
error message, which used to say " : syntax error..." (note the two
spaces and ellipses) and after this patch is " : Syntax error". So
these tests are changed. Another message, "no viable alternative at
input" is deliberately kept unchanged by this patch so as not to break
many more tests which enshrined this message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13731
So that it could be set to s3 by the test case on demand. Default is
local storage which uses env's tempdir or explicit path argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The s3_storage leaks the client when the sstable gets destroyed. So far this
went unnoticed, but a debug-mode unit test run over minio captured it. So
here's the fix.
When sstable is destroyed it also kicks the storage to do whatever
cleanup is needed. In case of s3 storage the cleanup is in closing the
on-boarded client. Until #13458 is fixed each sstable has its own
private version of the client and there's no other place where it can be
close()d in a co_await-able manner.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the previous patch we added a limit in Alternator for the magnitude
and precision of numbers, based on a function get_magnitude_and_precision
whose implementation was, unfortunately, rather elaborate and delicate.
Although we did add in the previous patches some end-to-end tests which
confirmed that the final decision made based on this function, to accept or
reject numbers, was a correct decision in a few cases, such an elaborate
function deserves a separate unit test for checking just that function
in isolation. In fact, this unit test uncovered some bugs in the first
implementation of get_magnitude_and_precision() which the other tests
missed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a number with a million digits, huge
allocations and stalls. This is highly undesirable.
After this patch, all tests in test/alternator/test_number.py now
pass. The various failing tests which verify magnitude and precision
limitations in different places (key attributes, non-key attributes,
and arithmetic expressions) now pass - so their "xfail" tags are removed.
Fixes #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have xfailing tests for issue #6794 - the missing checks on
precision and magnitudes of numbers in Alternator - but this patch adds
checks for additional corner cases. In particular we check the case that
numbers are used in a *key* column, which goes to a different code path
than numbers used in non-key columns, so it's worth testing as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As already noted in issue #6794, whereas DynamoDB limits the magnitude
of numbers to between 10^-130 and 10^125, Scylla does not. In this patch
we add yet another test for this problem. But unlike previous tests,
which just showed too much magnitude being allowed - which always sounded
like a benign problem - the test in this patch shows that this "feature"
can be used to DoS Scylla: a user can send a short request that
causes arbitrarily-large allocations, stalls and CPU usage.
The test is currently marked "skip" because it can cause Scylla to
take a very long time and/or run out of memory. It passes on DynamoDB
because the excessive magnitude is simply not allowed there.
Refs #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It's unused.
Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).
Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.
This doesn't break anything since to_string(initializer_list)
wasn't used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kinds of containers.
Modernize the implementation using templates based
on std::ranges::range and using fmt::join.
Extend the unit test to cover formatting different types of ranges,
boost::transformed ranges, and deque.
Fixes #13146
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when altering permissions on a functions resource, we
only check if it's a builtin function and not if it's all functions
in the "system" keyspace, which contains all builtin functions.
This patch adds a check of whether the function resource keyspace
is "system". This check actually covers both "single function"
and "all functions in keyspace" cases, so the additional check
for single functions is removed.
Closes #13596
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
lots of info-level messages to the log and call
`run_offstrategy_compaction`, which wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message at debug
level instead.
Fixes #13466
Also, add a group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
Fixes #13467
Closes #13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
This reverts commit d85af3dca4. It
restores the linear search algorithm, as we expect the search to
terminate near the origin. In this case linear search is O(1)
while binary search is O(log n).
A comment is added so we don't repeat the mistake.
Closes #13704
No point in going through the vector<mutation> entry-point
just to discover at run time that it was called
with a single-element vector, when we know that
in advance.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13733
this is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `cdc::generation_id` and `db_clock::time_point` without the help of `operator<<`.
the formatter of `cdc::generation_id` uses that of `db_clock::time_point` , so these two commits are posted together in a single pull request.
the corresponding `operator<<()` is removed in this change, as all its callers are now using fmtlib for formatting.
Refs #13245
Closes #13703
* github.com:scylladb/scylladb:
db_clock: specialize fmt::formatter<db_clock::time_point>
cdc: generation: specialize fmt::formatter<generation_id>
This series fixes a few issues caused by f1bbf705f9:
- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
- Fixes #13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
- allow the filtering reader now used during resharding to process tokens owned by other shards
Closes #13635
* github.com:scylladb/scylladb:
compaction: make_partition_filter: do not assert shard ownership
distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
table, compaction_manager: prevent cross shard access to owned_ranges_ptr
this series silences the warnings from GCC 13. some of these changes are considered critical fixes, and are posted separately.
see also #13243
Closes #13723
* github.com:scylladb/scylladb:
cdc: initialize an optional using its value type
compaction: disambiguate type name
db: schema_tables: drop unused variable
reader_concurrency_semaphore: fix signed/unsigned comparison
locator: topology: disambiguate type names
raft: disambiguate promise name in raft::awaited_conf_changes
when compiling using GCC-13, it warns that:
```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
224 | sstring parse_multipart_upload_id(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
```
it turns out that `rapidxml::xml_document<>` can be very large, so
let's allocate it on the heap instead of on the stack to address this issue.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13722
`data_dictionary::database::find_keyspace()` returns a temporary
object, and `data_dictionary::keyspace::user_types()` returns a
reference pointing to a member of this temporary object, so we
cannot use the reference after the expression is evaluated. in
this change, we capture the return value of `find_keyspace()` using
a universal reference, and keep the return value of `user_types()`
as a reference, to ensure that we can use it later.
this change silences the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/statements/authorization_statement.cc:68:21: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
68 | const auto& utm = qp.db().find_keyspace(*keyspace).user_types();
| ^~~
```
Fixes#13725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13726
`current_scheduling_group()` returns a temporary value, and `name()`
returns a reference, so we cannot capture the return value by reference
and use the reference after this expression is evaluated. this would
cause undefined behavior. so let's just capture it by value.
this change also silences the following warning from GCC-13:
```
/home/kefu/dev/scylladb/transport/server.cc:204:11: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
204 | auto& cur_sg_name = current_scheduling_group().name();
| ^~~~~~~~~~~
/home/kefu/dev/scylladb/transport/server.cc:204:56: note: the temporary was destroyed at the end of the full expression ‘seastar::current_scheduling_group().seastar::scheduling_group::name()’
204 | auto& cur_sg_name = current_scheduling_group().name();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13724
`data_dictionary::database::find_column_family()` returns a temporary value,
and `data_dictionary::table::get_index_manager()` returns a reference into
this temporary value, so we cannot capture this reference and use it after
the expression is evaluated. in this change, we keep the return value
of `find_column_family()` by value, to extend the lifetime of the return
value of `get_index_manager()`.
this should address the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/restrictions/statement_restrictions.cc:519:15: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
519 | auto& sim = db.find_column_family(_schema).get_index_manager();
| ^~~
```
Fixes#13727
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13728
Let's remove `expr::token` and replace all of its functionality with `expr::function_call`.
`expr::token` is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`, this will be internally represented as an expression which uses `expr::token` to represent the `token(p1, p2)` part.
The situation with `expr::token` is a bit complicated.
On the one hand it's supposed to represent the partition token, but sometimes it's also assumed that it can represent a generic call to the `token()` function, for example `token(1, 2, 3)` could be a `function_call`, but it could also be `expr::token`.
The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this, allowing `token(1, 2, 3)` to be represented as `expr::token` is dangerous - the query planning might think that it is `token(p1, p2, p3)` and plan the query based on this, which would be wrong.
Currently `expr::token` is created only in one specific case.
When the parser detects that the user typed in a restriction which has a call to `token` on the LHS it generates `expr::token`.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token, it stays a `function_call`. During preparation there is no check to see if a `function_call` to `token` could be turned into `expr::token`. This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented as `expr::token` and the query planner handles that, but sometimes it might be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of code duplication between a `function_call` and `expr::token`.
All of the evaluation and preparation is the same for `expr::token` as it's for a `function_call` to the token function.
Currently it's impossible to evaluate `expr::token` and preparation has some flaws, but implementing it would basically consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries.
With `expr::token` we turn a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple tables? The schema is different, so something that is represented as `expr::token` for one schema would be represented as `function_call` in the context of a different schema.
Translating expressions to different tables would require careful manipulation to convert `expr::token` to `function_call` and vice versa. This could cause trouble for index queries.
Overall I think it would be best to remove `expr::token`.
Although having a clear marker for the partition token is sometimes nice for query planning, in my opinion the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things; having two separate representations of the same thing without clear boundaries between them causes trouble.
Instead of having both `expr::token` and `function_call` we can just have the `function_call` and check if it represents a partition token when needed.
Refs: #12906
Refs: #12677
Closes: #12905
Closes #13480
* github.com:scylladb/scylladb:
cql3: remove expr::token
cql3: keep a schema in visitor for extract_clustering_prefix_restrictions
cql3: keep a schema inside the visitor for extract_partition_range
cql3/prepare_expr: make get_lhs_receiver handle any function_call
cql3/expr: properly print token function_call
expr_test: use unresolved_identifier when creating token
cql3/expr: split possible_lhs_values into column and token variants
cql3/expr: fix error message in possible_lhs_values
cql3: expr: reimplement is_satisfied_by() in terms of evaluate()
cql3/expr: add a schema argument to expr::replace_token
cql3/expr: add a comment for expr::has_partition_token
cql3/expr: add a schema argument to expr::has_token
cql3: use statement_restrictions::has_token_restrictions() wherever possible
cql3/expr: add expr::is_partition_token_for_schema
cql3/expr: add expr::is_token_function
cql3/expr: implement preparing function_call without a receiver
cql3/functions: make column family argument optional in functions::get
cql3/expr: make it possible to prepare expr::constant
cql3/expr: implement test_assignment for column_value
cql3/expr: implement test_assignment for expr::constant
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed into `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.
This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.
Closes #13690
* github.com:scylladb/scylladb:
raft topology: rename `update_replica_state` -> `update_topology_state`
raft topology: remove `transition_state::normal`
raft topology: switch on `transition_state` first
raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
raft topology: parse replaced node in `exec_global_command`
raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
storage_service: extract raft topology coordinator fiber to separate class
raft topology: rename `replication_state` to `transition_state`
raft topology: make `replication_state` a topology-global state
this syntax is not supported by the standard. it seems clang
just silently constructs the value from the initializer list and
calls operator=, but GCC complains:
```
/home/kefu/dev/scylladb/cdc/split.cc:392:54: error: converting to ‘std::optional<partition_deletion>’ from initializer list would use explicit constructor ‘constexpr std::optional<_Tp>::optional(_Up&&) [with _Up = const tombstone&; typename std::enable_if<__and_v<std::__not_<std::is_same<std::optional<_Tp>, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::__not_<std::is_same<std::in_place_t, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::is_constructible<_Tp, _Up>, std::__not_<std::is_convertible<_Iter, _Iterator> > >, bool>::type <anonymous> = false; _Tp = partition_deletion]’
392 | _result[t.timestamp].partition_deletions = {t};
| ^
```
to silence the error, and to be more standard-compliant,
let's use emplace() instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Let's remove expr::token and replace all of its functionality with expr::function_call.
expr::token is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`,
this will be internally represented as an expression which uses
expr::token to represent the `token(p1, p2)` part.
The situation with expr::token is a bit complicated.
On the one hand it's supposed to represent the partition token,
but sometimes it's also assumed that it can represent a generic
call to the token() function, for example `token(1, 2, 3)` could
be a function_call, but it could also be expr::token.
The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this, allowing `token(1, 2, 3)` to be represented
as expr::token is dangerous - the query planning
might think that it is `token(p1, p2, p3)` and plan the query
based on this, which would be wrong.
Currently expr::token is created only in one specific case.
When the parser detects that the user typed in a restriction
which has a call to `token` on the LHS it generates expr::token.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token,
it stays a `function_call`. During preparation there is no check
to see if a `function_call` to `token` could be turned into `expr::token`.
This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented
as `expr::token` and the query planner handles that, but sometimes it might
be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of duplication
between a `function_call` and `expr::token`. All of the evaluation
and preparation is the same for `expr::token` as it's for a `function_call`
to the token function. Currently it's impossible to evaluate `expr::token`
and preparation has some flaws, but implementing it would basically
consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries. With `expr::token` we turn
a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple
tables? The schema is different, so something that is represented
as `expr::token` for one schema would be represented as `function_call`
in the context of a different schema.
Translating expressions to different tables would require careful
manipulation to convert `expr::token` to `function_call` and vice versa.
This could cause trouble for index queries.
Overall I think it would be best to remove expr::token.
Although having a clear marker for the partition token
is sometimes nice for query planning, in my opinion
the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things,
having two separate representations of the same thing
without clear boundaries between them causes trouble.
Instead of having expr::token and function_call we can
just have the function_call and check if it represents
a partition token when needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
get_lhs_receiver looks at the prepared LHS of a binary operator
and creates a receiver corresponding to this LHS expression.
This receiver is later used to prepare the RHS of the binary operator.
It's able to handle a few expression types - the ones that are currently
allowed to be on the LHS.
One of those types is `expr::token`, to handle restrictions like `token(p1, p2) = 3`.
Soon token will be replaced by `expr::function_call`, so the function will need
to handle `function_calls` to the token function.
Although we expect there to be only calls to the `token()` function,
as other functions are not allowed on the LHS, it can be made generic
over all function calls, which will help in future grammar extensions.
The function calls it can currently get are calls to the token function,
but they're not validated yet, so it could also be something like `token(pk, pk, ck)`.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Printing for function_call is a bit strange.
When printing an unprepared function it prints
the name and then the arguments.
For a prepared function it prints <anonymous function>
as the name and then the arguments.
Prepared functions have a name() method, but printing
doesn't use it - maybe not all functions have a valid name(?).
The token() function will soon be represented as a function_call
and it should be printable in a user-readable way.
Let's add an if which prints `token(arg1, arg2)`
instead of `<anonymous function>(arg1, arg2)` when printing
a call to the token function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
One test for expr::token uses a raw column identifier.
Let's change it to unresolved_identifier, which is
a standard representation of unresolved column
names in expressions.
Once expr::token is removed it will be possible
to create a function_call with unresolved_identifiers
as arguments.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
possible_lhs_values takes an expression and a column
and finds all possible values for the column that make
the expression true.
Apart from finding column values it's also capable of finding
all matching values for the partition key token.
When a nullptr column is passed, possible_lhs_values switches
into token values mode and finds all values for the token.
This interface isn't ideal.
It's confusing to pass a nullptr column when one wants to
find values for the token. It would be better to have a flag,
or just have a separate function.
Additionally in the future expr::token will be removed
and we will use expr::is_partition_token_for_schema
to find all occurrences of the partition token.
expr::is_partition_token_for_schema takes a schema
as an argument, which possible_lhs_values doesn't have,
so it would have to be extended to get the schema from
somewhere.
To fix these two problems let's split possible_lhs_values
into two functions - one that finds possible values for a column,
which doesn't require a schema, and one that finds possible values
for the partition token and requires a schema:
value_set possible_column_values(const column_definition* col, const expression& e, const query_options& options);
value_set possible_partition_token_values(const expression& e, const query_options& options, const schema& table_schema);
This will make the interface cleaner and enable smooth transition
once expr::token is removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In possible_lhs_values there was a message talking
about is_satisfied_by. It looks like a badly
copy-pasted message.
Change it to possible_lhs_values as it should be.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Just like has_token, replace_token will use
expr::is_partition_token_for_schema to find all instances
of the partition token to replace.
Let's prepare for this change by adding a schema argument
to the function before making the big change.
It's unused at the moment, but having a separate commit
should make it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In the future expr::token will be removed and checking
whether there is a partition token inside an expression
will be done using expr::is_partition_token_for_schema.
This function takes a schema as an argument,
so all functions that will call it also need
to get the schema from somewhere.
Right now it's an unused argument, but in the future
it will be used. Adding it in a separate commit
makes it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The statement_restrictions class has a method called has_token_restriction().
This method checks whether the partition key restrictions contain expr::token.
Let's use this function in all applicable places instead of manually calling has_token().
In the future has_token() will have an additional schema argument,
so eliminating calls to has_token() will simplify the transition.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function to check whether the expression
represents a partition token - that is, a call
to the token function with consecutive partition
key columns as the arguments.
For example for `token(p1, p2, p3)` this function
would return `true`, but for `token(1, 2, 3)` or `token(p3, p2, p1)`
the result would be `false`.
The function has a schema argument because a schema is required
to get the list of partition columns that should be passed as
arguments to token().
Maybe it would be possible to infer the schema from the information
given earlier during prepare_expression, but it would be complicated
and a bit dangerous to do this. Sometimes we operate on multiple tables
and the schema is needed to differentiate between them - a token() call
can represent the base table's partition token, but for an index table
this is just a normal function call, not the partition token.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that can be used to check
whether a given expression represents a call
to the token() function.
Note that a call to token() doesn't mean
that the expression represents a partition
token - it could be something like token(1, 2, 3),
just a normal function_call.
The code for checking has been taken from functions::get.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently trying to do prepare_expression(function_call)
with a nullptr receiver fails.
It should be possible to prepare function calls without
a known receiver.
When the user types in: `token(1, 2, 3)`
the code should be able to figure out that
they are looking for a function with name `token`,
which takes 3 integers as arguments.
In order to support that we need to prepare
all arguments that can be prepared before
attempting to find a function.
Prepared expressions have a known type,
which helps to find the right function
for the given arguments.
Additionally the current code for finding
a function requires all arguments to be
assignment_testable, which requires preparing
some expression types, e.g. column_values.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The method `functions::get` is used to get the `functions::function` object
of the CQL function called using `expr::function_call`.
Until now `functions::get` required the caller to pass both the keyspace
and the column family.
The keyspace argument is always needed, as every CQL function belongs
to some keyspace, but the column family isn't used in most cases.
The only case where having the column family is really required
is the `token()` function. Each variant of the `token()` function
belongs to some table, as the arguments to the function are the
consecutive partition key columns.
Let's make the column family argument optional. In most cases
the function will work without information about the column family.
In the case of the `token()` function there will be a check
and it will throw an exception if the argument is nullopt.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this also silences the warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable]
1489 | auto ts = db_clock::now();
| ^~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
a signed/unsigned comparison can overflow, and GCC-13 rightly points
this out. so let's use `std::cmp_greater_equal()` when comparing
unsigned and signed for greater-or-equal.
```
/home/kefu/dev/scylladb/reader_concurrency_semaphore.cc:931:76: error: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
931 | if (_resources.memory <= 0 && (consumed_resources().memory + r.memory) >= get_kill_limit()) [[unlikely]] {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change silences the `-Wmissing-braces` warning from
clang. in general, we can initialize an object without a constructor
using braces; this is called aggregate initialization.
the standard does allow us to initialize each element using
either copy-initialization or direct-initialization, but in our case
neither of them applies, so clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change, we add the suggested braces around each subobject.
also, take the opportunity to use structured bindings to simplify the
related code.
Closes #13705
* github.com:scylladb/scylladb:
build: reenable -Wmissing-braces
treewide: add braces around subobject
cql3/stats: use zero-initialization
S3 wasn't providing the filter size and accurate sizes for all SSTable components on disk.
First, the filter size is provided by taking advantage of the fact that its in-memory representation is roughly the same as the on-disk one.
Second, the size of all components is provided by piggybacking on the sstable parser and writer, so there is no longer a need to do a separate additional step after Scylla has either parsed or written all components.
Finally, sstable::storage::get_stats() is killed, so the burden is no longer pushed onto the storage type implementation.
Refs #13649.
Closes #13682
* github.com:scylladb/scylladb:
test: Verify correctness of sstable::bytes_on_disk()
sstable: Piggyback on sstable parser and writer to provide bytes_on_disk
sstable: restore indentation in read_digest() and read_checksum()
sstable: make all parsing of simple components go through do_read_simple()
sstable: Add missing pragma once to random_access_reader.hh
sstable: make all writing of simple components go through do_write_simple()
test: sstable_utils: reuse set_values()
sstable: Restore indentation in read_simple()
sstable: Coroutinize read_simple()
sstable: Use filter memory footprint in filter_size()
so we can apply `execute_cql()` on `generation_type` directly without
extracting its value using `generation.value()`. this paves the way to
adding a UUID based generation id to `generation_type`. by then, we
will have both UUID based and integer based `generation_type`, so
`generation_type::value()` will not be able to represent its value
anymore, and this method will be replaced by `operator data_value()` in
this use case.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change prepares for the change to use `variant<UUID, int64_t>`
as the value of `generation_type`. as after this change, the "value"
of a generation would be a UUID or an integer, and we don't want to
expose the variant in generation's public interface. so the `value()`
method would be changed or removed by then.
this change takes advantage of the fact that the formatter of
`generation_type` always prints its value. also, it's better to
reuse `generation_type` formatter when appropriate.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
try_prepare_expression(constant) used to throw an error
when trying to prepare expr::constant.
It would be useful to be able to do this
and it's not hard to implement.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Make it possible to do test_assignment for column_values.
It's implemented using the generic expression assignment
testing function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
test_assignment checks whether a value of some type
can be assigned to a value of different type.
There is no implementation of test_assignment
for expr::constant, but I would like to have one.
Currently there is a custom implementation
of test_assignment for each type of expression,
but generally each of them boils down to checking:
```
type1->is_value_compatible_with(type2)
```
Instead of implementing another type-specific function
I added expression_test_assignment and used it to
implement test_assignment for constant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this change helps to silence the `-Wmissing-braces` warning from
clang. in general, we can initialize an object without a constructor
using braces; this is called aggregate initialization.
the standard also allows us to initialize each element using
either copy-initialization or direct-initialization, but in our case,
neither of them applies, so clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change, braces are added around the initialization of
subobjects. also, take the opportunity to use structured bindings to
simplify the related code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
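The warning and its fix can be reproduced with a minimal standalone snippet (the types here are illustrative, not taken from the tree):

```cpp
#include <cassert>

// `line` aggregates two `point` subobjects. Initializing it with a flat
// list -- `line l = {1, 2, 3, 4};` -- compiles, but clang suggests braces
// around each subobject under -Wmissing-braces. The nested braces below
// make the aggregate structure explicit and silence the warning.
struct point { int x; int y; };
struct line  { point a; point b; };

constexpr line l = {{1, 2}, {3, 4}};
```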
use {} instead of {0ul} for zero initialization. as `_query_cnt`
is a multi-dimensional array, each element in `_query_cnt` is yet
another array, so we cannot initialize it with `{0ul}`. but
to zero-initialize this array, we can just use `{}`, as per
https://en.cppreference.com/w/cpp/language/zero_initialization
> If T is array type, each element is zero-initialized.
so this should recursively zero-initialize all arrays in `_query_cnt`.
this change should silence the following warning:
stats.hh:88:60: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
[statements::statement_type::MAX_VALUE + 1] = {0ul};
^~~
{ }
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
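As a sketch, the same pattern on a hypothetical two-dimensional counter array (the names below are illustrative, not the real `_query_cnt`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// {} value-initializes the outer array, and zero-initialization recurses
// into every inner array element; {0ul} would only initialize the first
// subobject and trips -Wmissing-braces.
constexpr std::size_t kSources = 2;
constexpr std::size_t kStatements = 5;
uint64_t query_cnt[kSources][kStatements] = {};
```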
Closes#13691
* github.com:scylladb/scylladb:
adding documentation for integration with MindsDB
adding documentation for integration with MindsDB
this series syncs the CMake building system with `configure.py`, which was updated to introduce the tablets feature. also, this series includes a couple of cleanups.
Closes#13699
* github.com:scylladb/scylladb:
build: cmake: remove dead code
build: move test-perf down to test/perf
build: cmake: pick up tablets related changes
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `db_clock::time_point` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `generation_id` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loading cores from Scylla executables installed in a non-standard
location can cause gdb to fail reading required libraries.
This is an example of a warning I got after trying to load a core
generated by a dtest jenkins job (using ./scripts/open-coredump.sh):
> warning: Can't open file /jenkins/workspace/scylla-master/dtest-daily-debug/scylla/.ccm/scylla-repository/0d64f327e1af9bcbb711ee217eda6df16e517c42/libreloc/libboost_system.so.1.78.0 during file-backed mapping note processing
Invocations of `scylla threads` command ended with an error:
> (gdb) scylla threads
> Python Exception <class 'gdb.error'>: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
> Error occurred in Python: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
An easy fix for this is to set solib-search-path to
/opt/scylladb/libreloc/.
This commit adds that set command to the suggested gdb command line
arguments. It's a good idea to always suggest setting
solib-search-path to that path, as it can save other people from
wasting time on figuring out why coredump opening does not work.
Closes#13696
the removed CMake script was designed to cater to the needs of the case
where Seastar's CMake script is not included in the parent project, but
this part was never tested and is dysfunctional, as the `target_sources()`
call misses the target parameter. we can add it back when it is actually needed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We may catch exceptions that are not `marshal_exception`.
Print std::current_exception() in this case to provide
some context about the marshalling error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13693
Consider
- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2
```
auto host_id = _gossiper.get_host_id(endpoint);
auto existing = tmptr->get_endpoint_for_host_id(host_id);
```
host_id = new host id
existing = empty
As a result, del_replacing_endpoint() will not be called.
This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.
This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
the replacing node uses a different host id than the node being replaced.
To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3
After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3
Tests: https://github.com/scylladb/scylla-dtest/pull/3126
Closes#13677
bytes_on_disk is the sum of all sstable components.
As read_simple() fetches the file size before parsing the component,
bytes_on_disk can be added incrementally rather than an additional
step after all components were already parsed.
Likewise, write_simple() tracks the offset for each new component,
and therefore bytes_on_disk can also be added incrementally.
This simplifies s3 life as it no longer has to care about feeding
bytes_on_disk, which is currently limited to data and index
sizes only.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all parsing of simple components going through do_read_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all writing of simple components going through do_write_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For S3, filter size is currently set to zero, as we want to avoid
"fstat-ing" each file.
The on-disk representation of the bloom filter is similar to the in-memory
one, therefore let's use the memory footprint in filter_size().
The user of filter_size() is the API implementing "nodetool cfstats", and
it cares about the size of the bloom filter data (that's how it's
described).
This way, we provide the filter data size regardless of the
underlying storage type.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty. The change was made in order to maintain consistency with the
codebase and to add completeness to the empty check. This change has no
impact on other parts of the codebase.
Closes#13656
The new name is more generic and appropriate for topology transitions
which don't affect any specific replica but the entire cluster as a
whole (which we'll introduce later).
Also take `guard` directly instead of `node_to_work_on` in this more
generic function. Since we want `node_to_work_on` to die when we steal
its guard, introduce `take_guard` which takes ownership of the object
and returns the guard.
Previously the code assumed that there was always a 'node to work on' (a
node which wants to change its state) or there was no work to do at all.
It would find such a node, switch on its state (e.g. check if it's
bootstrapping), and in some states switch on the topology
`transition_state` (e.g. check if it's `write_both_read_old`).
We want to introduce transitions that are not node-specific and can work
even when all nodes are 'normal' (so there's no 'node to work on'). As a
first step, we refactor the code so it switches on `transition_state`
first. In some of these states, like `write_both_read_old`, there must
be a 'node to work on' for the state to make sense; but later in some
states it will be optional (such as `commit_cdc_generation`).
The lambdas defined inside the fiber are now methods of this class.
Currently `handle_node_transition` calls `handle_ring_transition`;
in a later commit we will reverse this: `handle_ring_transition` will
call `handle_node_transition`. We won't have to shuffle the functions
around because they are members of the same class, making the change
easier to review. In general, the code will be easier to maintain in
this new form (no need to deal with so many lambda captures etc.)
Also break up some lines which exceeded the 120 character limit (as per
Seastar coding guidelines).
if the visitor clauses are the same, we can just use the generic version
of it by specifying the parameter with `auto&`. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13626
The new name is more generic - it describes the current step of a
'topology saga' (a sequence of steps used to implement a larger topology
operation such as bootstrap).
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.
The `replication_state` column in `system.topology` is now `static`.
This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side: divide the resources of a replica shard into tablets, with the goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the groundwork for them.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy,
they just alter the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined, as the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`!operator==()`, which matches the behavior of the
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the generated operator!=, C++20 also brings us
the defaulted operator==() -- the compiler is able to generate
operator==() as a member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if an operator==() is implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is replaced with the generated
one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the defaulted `operator<=>` is used in place of
the defaulted `operator==`.
sometimes, we failed to mark operator== with the `const`
specifier; in this change, to fulfill the requirements of the C++
standard, and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but this is not always the case: in the
`version` class, we used `version` as the parameter type. to
fulfill the requirements of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantics of the comparison operator, and is a more idiomatic
way to pass a non-trivial struct as a function parameter.
please note, because in C++20 both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric forms of another variant; if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13687
Common compression libraries work on contiguous buffers.
Contiguous buffers are a problem for the allocator. However, as long as they are short-lived,
we can avoid the expensive allocations by reusing buffers across tasks.
This idea is already applied to the compression of CQL frames, but with some deficiencies.
`utils: redesign reusable_buffer` attempts to improve upon it in a few ways. See its commit message for an extended discussion.
Compression buffer reuse also happens in the zstd SSTable compressor, but the implementation is misguided. Every `zstd_processor` instance reuses a buffer, but each instance has its own buffer. This is very bad, because a healthy database might have thousands of concurrent instances (because there is one for each sstable reader). Together, the buffers might require gigabytes of memory, and the reuse actually *increases* memory pressure significantly, instead of reducing it.
`zstd: share buffers between compressor instances` aims to improve that by letting a single buffer be shared across all instances on a shard.
Closes#13324
* github.com:scylladb/scylladb:
zstd: share buffers between compressor instances
utils: redesign reusable_buffer
This commit moves the Glossary page to the Reference
section. In addition, it adds the redirection so that
there are no broken links because of this change
and fixes a link to a subsection of Glossary.
Closes#13664
The zstd implementation of `compressor` has a separate decompression and
compression context per instance. This is unreasonably wasteful. One
decompression buffer and one compression buffer *per shard* is enough.
The waste is significant. There might exist thousands of SSTable readers, each
containing its own instance of `compressor` with several hundred KiB worth of
unneeded buffers. This adds up to gigabytes of wasted memory and gigapascals
of allocator pressure.
This patch modifies the implementation of zstd_processor so that all its
instances on the shard share their contexts.
Fixes#11733
Large contiguous buffers put large pressure on the allocator
and are a common source of reactor stalls. Therefore, Scylla avoids
their use, replacing it with fragmented buffers whenever possible.
However, the use of large contiguous buffers is impossible to avoid
when dealing with some external libraries (i.e. some compression
libraries, like LZ4).
Fortunately, calls to external libraries are synchronous, so we can
minimize the allocator impact by reusing a single buffer between calls.
An implementation of such a reusable buffer has two conflicting goals:
to allocate as rarely as possible, and to waste as little memory as
possible. The bigger the buffer, the more likely that it will be able
to handle future requests without reallocation, but also the more
memory it ties up.
If request sizes are repetitive, the near-optimal solution is to
simply resize the buffer up to match the biggest seen request,
and never resize down.
However, if we anticipate pathologically large requests, which are
caused by an application/configuration bug and are never repeated
again after they are fixed, we might want to resize down after such
pathological requests stop, so that the memory they took isn't tied
up forever.
The current implementation of reusable buffers handles this by
resizing down to 0 every 100'000 requests.
This patch attempts to solve a few shortcomings of the current
implementation.
1. Resizing to 0 is too aggressive. During regular operation, we will
surely need to resize it back to the previous size again. If something
is allocated in the hole left by the old buffer, this might cause
a stall. We prefer to resize down only after pathological requests.
2. When resizing, the current implementation allocates the new buffer
before freeing the old one. This increases allocator pressure for no
reason.
3. When resizing up, the buffer is resized to exactly the requested
size. That is, if the current size is 1MiB, following requests
of 1MiB+1B and 1MiB+2B will both cause a resize.
It's preferable to limit the set of possible sizes so that every
reset doesn't tend to cause multiple resizes of almost the same size.
The natural set of sizes is powers of 2, because that's what the
underlying buddy allocator uses. No waste is caused by rounding up
the allocation to a power of 2.
4. The interval of 100'000 uses is both too low and too arbitrary.
This is up for discussion, but I think that it's preferable to base
the dynamics of the buffer on time, rather than the number of uses.
It's more predictable to humans.
The implementation proposed in this patch addresses these as follows:
1. Instead of resizing down to 0, we resize to the biggest size
seen in the last period.
As long as at least one maximal (up to a power of 2) "normal" request
appears each period, the buffer will never have to be resized.
2. The capacity of the buffer is always rounded up to the nearest
power of 2.
3. The resize down period is no longer measured in number of requests
but in real time.
Additionally, since a shared buffer in asynchronous code is quite a
footgun, some rudimentary refcounting is added to assert that only
one reference to the buffer exists at a time, and that the buffer isn't
downsized while a reference to it exists.
Fixes#13437
Fixes https://github.com/scylladb/scylladb/issues/13578
Now that the documentation is versioned, we can remove
the .. versionadded:: and .. versionchanged:: information
(especially that the latter is hard to maintain and now
outdated), as well as the outdated information about
experimental features in very old releases.
This commit removes that information and nothing else.
Closes#13680
std::rel_ops was deprecated in C++20, as C++20 provides a better solution for defining comparison operators, and all the use cases previously addressed by `using namespace std::rel_ops` have been addressed either by `operator<=>` or the generated `operator!=`.
so, in this series, to avoid using deprecated facilities, let's drop all these `using namespace std::rel_ops`. there are many more cases where we could use either `operator<=>` or the generated `operator!=` to simplify the implementation, but here we care mostly about `std::rel_ops`; we will drop most (if not all) of the explicitly defined `operator!=` and other comparison operators later.
Closes#13676
* github.com:scylladb/scylladb:
treewide: do not use std::rel_ops
dht: token: s/tri_compare/operator<=>/
Currently, the `reset_to()` implementation calls `consume(new_amount)` (if
not zero), then calls `signal(old_amount)`. This means that even if
`reset_to()` is a net reduction in the amount of resources, there is a
call to `consume()` which can now potentially throw.
Add a special case for when the new amount of resources is strictly
smaller than the old amount. In this case, just call `signal()` with the
difference. This not only avoids a potential `std::bad_alloc`, but also
helps relieve memory pressure when it is most needed, by not failing
calls that release memory.
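As a sketch of that special case (toy types below, not seastar's actual semaphore API):

```cpp
#include <cassert>
#include <cstddef>

// Toy semaphore: consume() may allocate (and throw) in the real code,
// while signal() is noexcept and can never fail.
struct toy_semaphore {
    std::size_t consumed = 0;
    void consume(std::size_t n) { consumed += n; }
    void signal(std::size_t n) noexcept { consumed -= n; }
};

// reset_to() with the special case: a net reduction goes through the
// non-throwing signal() path instead of consume() followed by signal().
struct resource_tracker {
    toy_semaphore sem;
    std::size_t amount = 0;
    void reset_to(std::size_t new_amount) {
        if (new_amount < amount) {
            sem.signal(amount - new_amount);   // releasing can't fail
        } else if (new_amount > amount) {
            sem.consume(new_amount - amount);  // may throw std::bad_alloc
        }
        amount = new_amount;
    }
};
```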
Into reset_to() and reset_to_zero(). The latter replaces `reset()` with
the default 0 resources argument, which was often called from noexcept
contexts. Splitting it out from `reset()` allows for a specialized
implementation that is guaranteed to be `noexcept` indeed, and thus
gives peace of mind.
This API is dangerous: all resource consumption should happen via RAII
objects that guarantee that all consumed resources are appropriately
released.
At this point, said API is just a low-level building block for
higher-level, RAII objects. To ensure nobody thinks of using it for
other purposes, make it private and make external users friends instead.
Fix two issues with the replace operation introduced by recent PRs.
Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).
Fixes: #13651
Closes#13655
* github.com:scylladb/scylladb:
test: new suite for testing raft-based topology
test: remove topology_custom/test_custom.py
raft topology: don't require new CDC generation UUID to always be present
raft topology: include shard_count/ignore_msb during replace
std::rel_ops was deprecated in C++20, as C++20 provides a better
solution for defining comparison operators, and all the use cases
previously addressed by `using namespace std::rel_ops` have
been addressed either by `operator<=>` or the generated
`operator!=`.
so, in this change, to avoid using deprecated facilities, let's
drop all these `using namespace std::rel_ops`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that C++20 is able to generate the default comparison
operators for us, there is no need to define them manually. also,
`std::rel_ops::*` is deprecated in C++20.
also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `range_tombstone_list` and `range_tombstone_entry`
without the help of `operator<<`.
the corresponding `operator<<()` for `range_tombstone_entry` is moved
into the test where it is used, and the other one is dropped in this
change, as all its callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13627
there are two variants of `query_processor::for_each_cql_result()`;
both of them perform pagination of the results returned by a CQL
statement. the one which accepts a function returning an immediate
(non-future) value is not used now, so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13675
Now the gossiper doesn't need those two as its dependencies, so they can
be removed, making the code shorter and the dependency graph simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
That's, in fact, an independent change, because the feature enabler doesn't
need this method. So this patch is a "while at it" thing, but on the
other hand it ditches one more qctx usage.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers now have the system keyspace instance at hand.
Unfortunately, this de-static doesn't allow more qctx drops, because
both methods use set_|get_scylla_local_param helpers that do use qctx
and are still in use by other static methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code belongs to the feature service; the system keyspace shouldn't be
aware of any peculiarities of startup feature enabling, only of loading
and saving the feature lists.
For now the move happens only in terms of code declarations, the
implementation is kept in its old place to reduce the patch churn.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is only called by the enabler and is short enough
to be merged into the caller. This kills two birds with one stone --
makes fewer walks over the features list and will make it possible to
de-static the system keyspace features load and save methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just moving the code. This makes the gossiper not
mess with enabling/persisting features, but just gossip them around.
The feature intersection code is still in the gossiper, but can be moved
to a more suitable location any time later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays features are persisted in feature_service::enable() and there
are four callers of it
- feature enabler via gossiper notifications
- boot kicks feature enabler too
- schema loader tool
- cql test env
None but the first case needs to persist features. The loader tool in
fact doesn't do it even now, by keeping qctx uninitialized. The cql test
env wires up the qctx, but it makes no difference for the test cases
themselves whether the features are persisted or not.
Boot-time is a bit trickier -- it loads the feature list from the system
keyspace and may filter out some of its entries, then enables them. In that case
committing the list back into system keyspace makes no sense, as the
persisted list doesn't extend.
The enabler, in turn, can call system keyspace directly via its explicit
dependency reference. This fixes the inverse dependency between system
keyspace and feature service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now knows that it runs inside an async context, but things are
changing and soon it will be moved out of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the enabler that's responsible for enabling the features and,
implicitly, persisting them into the system keyspace. This patch moves
this logic from gossiper to feature_enabler; further patching will make
the persisting code explicit.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a bit hairy. maybe_enable_features() is called from two places --
by the feature_enabler upon notifications from gossiper, and directly by
gossiper from wait_for_gossip_to_settle().
The _latter_ is called only when the wait_for_gossip_to_settle() is
called for the first time because of the _gossip_settled checks in it.
For the first time this method is called by storage_service when it
tries to join the ring (next it's called from main, but that's not of
interest here).
Next, although feature_enabler is registered early -- when the gossiper
instance is constructed by sharded<gossiper>::start() -- it checks that
_gossip_settled is true before taking any action.
Considering both, calling maybe_enable_features() _and_ registering the
enabler after storage_service's call to wait_for_gossip_to_settle()
doesn't break the code logic, but makes further patching possible. In
particular, the feature_enabler will move to feature_service so as not
to pollute gossiper code with anything that's not gossiping.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And rename the guy. These dependencies will be used further, both are
available and started when the enabler is constructed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.
Fixes #13657
Closes #13658
* github.com:scylladb/scylladb:
cql3: fix printing of column_specification::name in some error messages
cql3: fix printing of column_definition::name in some error messages
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `storage_service::mode` without the help of `operator<<`.
the corresponding `operator<<()` for `storage_service::mode` is removed
in this change, as all its callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13640
Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c992f. It's still useful
to keep the test to avoid regressions.
The bug manifests by reader throwing:
std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
The reason is that prior to the rework of the cache reader,
range_tombstone_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.
Closes #13665
raft_group0 does not really depend on migration_manager; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of passing it at creation time.
raft_group0 does not really depend on query_processor; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of passing it at creation time.
Introduce new test suite for testing the new topology coordinator
(runs under `raft` experimental flag). Add a simple test that performs a
basic sequence of topology operations.
raft_group0 does not really depend on storage_service; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of passing it at creation time.
Consolidate `bytes_view_hasher` and abstract_replication_strategy `factory_key_hasher` which are the same into a reusable utils::basic_xx_hasher.
To be used in a followup series for netw:msg_addr.
Closes #13530
* github.com:scylladb/scylladb:
utils: hashing: use simple_xx_hasher
utils: hashing: add simple_xx_hasher
utils: hashers: add HasherReturning concept
hashing: move static_assert to source file
Add a test for getting/enabling/disabling auto_compaction via the
column_family API. Also add log messages for admin operations on that API.
Closes #13566
* github.com:scylladb/scylladb:
api: column_family: add log messages for admin operation
test: rest_api: add test_column_family
After the addition of the rust-std-static-wasm32-wasi target, we're
able to compile the Rust programs to Wasm binaries. However, we're still
only able to handle the Wasm UDFs in the Text format, so we need a tool
to translate the .wasm files to .wat. Additionally, the .wasm files
generated by default are unnecessarily large, which can be helped
using wasm-opt and wasm-strip.
The tool for translating wasm to wat (wasm2wat), and the tool for
stripping the wasm binaries (wasm-strip) are included in the `wabt`
package, and the optimization tool (wasm-opt) is included in the
`binaryen` package. Both packages are added to install-dependencies.sh
Closes #13282
[avi: regenerate frozen toolchain]
Closes #13605
The s3::readable_file::stat() call returns a hand-crafted stat structure
with some fields set to sane values, mostly constants. However, other
fields remain uninitialized, which sometimes leads to trouble. Better to
fill the stat with zeroes and revisit it later for saner values.
fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13650
the temporary directory holding the log file collecting the scylla
subprocess's output is specified by the test itself, and it is
`test_tempdir`. but unfortunately, cql-pytest/run.py is not aware of
this, so `cleanup_all()` is not able to print out the logging messages
at exit. please note, cql-pytest/run.py always collects the "log" file
under the directory created using `pid_to_dir()`, where pid is that of
the spawned subprocess, but `object_store/run` uses the main process's
pid for its reusable tempdir.
so, with this change, we also register a cleanup func to print out the
logging messages when the test exits.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For concurrent schema changes test, log when the different stages of the
test are finished.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13654
this change ensures that `dk._key` is formatted with the "pk" prefix.
as in 3738fcb, the `operator<<` for partition_key was removed, so the
compiler has to find an alternative when this operator is called.
fortunately, from the compiler's perspective, `partition_key` has a
non-explicit `operator managed_bytes_view`, and `managed_bytes_view`
does support `operator<<`. so this ends up with a
change in the format of `decorated_key` when it is printed using
`operator<<`. the code compiles. but unfortunately, the behavior is
changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the
partition_key is supposed to be printed like "pk{010203}" instead of
"010203". the latter is how `managed_bytes_view` is formatted.
a test is added accordingly to avoid future changes which break the
dtest.
Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13653
column_specification::name is a shared pointer, so it should be
dereferenced before printing - because we want to print the name, not
the pointer.
Fix a few instances of this mistake in prepare_expr.cc. Other instances
were already correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Printing a column_definition::name() in an error message is wrong,
because it is "bytes" and printed as hexadecimal ASCII codes :-(
Some error messages in cql3/operation.cc incorrectly used name()
and should be changed to name_as_text(), as was correctly done in
a few other error messages in the same file.
This patch also fixes a few places in the test/cql approval tests which
"enshrined" the wrong behavior - printing things like 666c697374696e74
in error messages - and now needs to be fixed for the right behavior.
Fixes #13657
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
During node replace we don't introduce a new CDC generation, only during
regular bootstrap. Instead of checking that `new_cdc_generation_uuid`
must be present whenever there's a topology transition, only check it
when we're in `commit_cdc_generation` state.
Use simple_xx_hasher for bytes_view and effective_replication_map::factory_key
appending hashers instead of their custom, yet identical implementations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And a more specific HasherReturningBytes for hashers
that return bytes in finalize().
HasherReturning will be used by the following patch
also for simple hashers that return size_t from
finalize().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, it is responsible for injecting mutations of system.tablets
into schema changes.
Note that not all migrations are handled yet: dependent view or CDC
table drops are not handled.
tablet_aware_replication_strategy is a trait class meant to be
inherited by replication strategies which want to work with tablets. The
trait produces a per-table effective_replication_map which looks at
tablet metadata to determine replicas.
No replication strategy is changed to use tablets yet in this patch.
This change puts the reloading into topology_state_load(), which is a
function which reloads token_metadata from system.topology (the new
raft-based topology management). It clears the metadata, so needs to
reload tablet map too. In the future, tablet metadata could change as
part of topology transaction too, so we reload rather than preserve.
It is already set by schema_maker. In tablets_test we will depend on
the id being the same as that set in the schema_builder, so don't
change it to something else.
Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.
This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
This reverts commit 95bf8eebe0.
Later patches will adapt this code to work with token_range_splitter,
and the unit test added by the reverted commit will start to fail.
The unit test asks the query_ranges_to_vnodes_generator to split the range:
[t:end, t+1:start)
around token t, and expects the generator to produce an empty range
[t:end, t:end]
After adapting this code to token_range_splitter, the input range will
not be split because it is recognized as adjacent to t:end, and the
optimization logic will not kick in. Rather than adding more logic to
handle this case, I think it's better to drop the optimization, as it
is not very useful (rarely happens) and not required for correctness.
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. work (not fail) in the presence of keyspaces with per-table
erms; they will just not be replicated by those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
With tablet-based replication strategies it will represent replication
of a single table.
Current vnode_effective_replication_map can be adapted to this interface.
This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
Add a formatter to compaction::table_state that
prints the table ks_name.cf_name and compaction group id.
Fixes #13467
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
Fixes #13466
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now, with f1bbf705f9
(Cleanup sstables in resharding and other compaction types),
we may filter sstables as part of resharding compaction,
so the assertion that all tokens are owned by the current
shard when filtering no longer holds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When distributing the resharding jobs, prefer one of
the sstable shard owners based on foreign_sstable_open_info.
This is particularly important for uploaded sstables
that are resharded since they require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Seen after f1bbf705f9 in debug mode
distributed_loader collect_all_shared_sstables copies
compaction::owned_ranges_ptr (lw_shared_ptr<const
dht::token_range_vector>)
across shards.
Since update_sstable_cleanup_state is synchronous, it can
be passed a const reference to the token_range_vector instead.
It is ok to access the memory read-only across shards
and since this happens on start-up, there are no special
performance requirements.
Fixes #13631
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Update comments, test names, etc. that still use the old terminology for
permit state names; bring them up to date with the recent state name changes.
Similar to the storage_service api, print a log message
for admin operations like enabling/disabling auto_compaction,
running major compaction, and setting the table compaction
strategy.
Note that there is overlap in functionality
between the storage_service and the column_family api entry points.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, the line is 249 chars long, so split it into
multiple lines for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable:
after the executor's default timeout is set, right before the sharded
executor instances are created, they never get updated with this option
again.
in this change,
* alternator_timeout_in_ms is marked as live-updateable
* executor::_s_default_timeout is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. it is now an
updateable_value, so its variable name is updated accordingly. this
value is set in the ctor of executor, and it is disconnected from the
corresponding named_value<> option in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so executor::_timeout_in_ms can
be initialized on a per-shard basis
* executor::set_default_timeout() is dropped, as we already pass
the option to executor in its ctor.
please note, in the ctor of executor, we always update the cached
value of `s_default_timeout` with the value of `_timeout_in_ms`,
and we set the default timeout to 10s in `alternator_test_env`.
this is a design decision to avoid bending the production code for
testing, as in production we always set the timeout with the value
specified either by the default or by the yaml conf file.
Fixes #12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, we only tested whether permissions with UDFs
that have quoted names work correctly. This patch adds
the missing test that confirms that we can also use UDTs
(as UDF parameter types) when altering permissions.
Currently, the ut_name::to_string() is used only in 2 cases:
the first one is in logs or as part of error messages, and the
second one is during parsing, temporarily storing the user
defined type name in the auth::resource for later preparation
with database and data_dictionary context.
This patch changes the string so that the 'name' part of the
ut_name (as opposed to the 'keyspace' part) is now quoted when
needed. This does not make the logging use cases any worse, but it
does help with parsing the resulting string when finishing the
preparation of the auth::resource.
After this modification, a more fitting name for the function
is "ut_name::to_cql_string()", so the function is renamed to that.
While in debug mode, we may switch from the default stack to
a larger one when parsing CQL. We may, however, invoke
the parser recursively, causing us to switch to the big
stack while already using it. After the reset, we
assume that the stack is empty, so after switching to
the same stack, we write over its previous contents.
This is fixed by checking whether we're already using the large
stack, which is achieved by comparing the address of
a local variable to the start and end of the large stack.
"summary":"Retrieve the mapping of endpoint to host ID",
"summary":"Retrieve the mapping of endpoint to host ID of all nodes that own tokens",
"type":"array",
"items":{
"type":"mapper"
@@ -1114,6 +1114,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ranges_parallelism",
"description":"An integer specifying the number of ranges to repair in parallel by user request. If this number is bigger than the max_repair_ranges_in_parallel calculated by Scylla core, the smaller one will be used.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
@@ -1946,7 +1954,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset local schema",
"summary":"Forces this node to recalculate versions of schema objects.",