this mirrors what we have in `configure.py`: build the CqlParser with `-O1`
and disable `-fsanitize-address-use-after-scope` when compiling CqlParser.cc,
in order to prevent the compiler from emitting code which uses a large amount
of stack space at runtime.
Closes scylladb/scylladb#15819
* github.com:scylladb/scylladb:
build: cmake: avoid using a large amount of stack when compiling parser
build: cmake: s/COMPILE_FLAGS/COMPILE_OPTIONS/
There are some schema modifications performed automatically (during bootstrap, upgrade etc.) by Scylla that are announced by multiple calls to `migration_manager::announce` even though they are logically one change. Precisely, they appear in:
- `system_distributed_keyspace::start`,
- `redis::create_keyspace_if_not_exists_impl`,
- `table_helper::setup_keyspace` (for the `system_traces` keyspace).
All these places contain a FIXME telling us to `announce` only once. There are a few reasons for this:
- calling `migration_manager::announce` with Raft is quite expensive -- taking a `read_barrier` is necessary, and that requires contacting a leader, which then must contact a quorum,
- to support concurrent bootstrap in Raft-based topology, we must implement a retrying mechanism for every automatic `announce` in case `group0_concurrent_modification` occurs. Doing it before fixing the FIXMEs mentioned above would be harder, and fixing the FIXMEs later would also be harder.
This PR fixes the first two FIXMEs and improves the situation with the last one by reducing the number of the `announce` calls to two. Unfortunately, reducing this number to one requires a big refactor. We can do it as a follow-up to a new, more specific issue. Also, we leave a new FIXME.
Fixing the first two FIXMEs required enabling the announcement of a keyspace together with its tables. Until now, the code responsible for preparing mutations for a new table could assume the existence of the keyspace. This assumption wasn't necessary, but removing it required some refactoring.
Fixes #15437
Closes scylladb/scylladb#15594
* github.com:scylladb/scylladb:
table_helper: announce twice in setup_keyspace
table_helper: refactor setup_table
redis: create_keyspace_if_not_exists_impl: fix indentation
redis: announce once in create_keyspace_if_not_exists_impl
db: system_distributed_keyspace: fix indentation
db: system_distributed_keyspace: announce once in start
tablet_allocator: update on_before_create_column_family
migration_listener: add parameter to on_before_create_column_family
alternator: executor: use new prepare_new_column_family_announcement
alternator: executor: introduce create_keyspace_metadata
migration_manager: add new prepare_new_column_family_announcement
The topology coordinator should handle failures internally as long as it
remains the coordinator. The raft state monitor is not in a better
position to handle any errors thrown by it; all it can do is restart
the coordinator. This series makes topology_coordinator::run handle all
errors internally and marks the function noexcept so as not to leak
error-handling complexity into the raft state monitor.
* 'gleb/15728-fix' of github.com:scylladb/scylla-dev:
storage_service: raft topology: mark topology_coordinator::run function as noexcept
storage_service: raft topology: do not throw error from fence_previous_coordinator()
it is printed when pytest passes it down as a fixture, as part of
the logging message. this would help with debugging an object_store test.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15817
The function handled all exceptions internally. By making it noexcept we
make sure that the caller (raft_state_monitor_fiber) does not need to
handle any exceptions from the topology coordinator fiber.
Throwing an error kills the topology coordinator monitor fiber. Instead, we
retry the operation until it succeeds or the node loses its leadership.
This is fine because the operation needs a quorum to succeed, and if
the quorum is not available the node should relinquish its leadership.
Fixes #15728
this mirrors what we have in `configure.py`: build the CqlParser with -O1
and disable sanitize-address-use-after-scope when compiling CqlParser.cc,
in order to prevent the compiler from emitting code which uses a large
amount of stack at runtime.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we create a new UUID for a new sstable managed by the s3_storage, and we use the string representation of UUID defined by RFC 4122, like "0aa490de-7a85-46e2-8f90-38b8f496d53b", for naming the objects stored on s3_storage. but this representation is not what we are using for storing sstables on the local filesystem when the option "uuid_sstable_identifiers_enabled" is enabled. instead, we are using a base36-based representation, which is shorter.
to be consistent with the naming of the sstables created for the local filesystem, and more importantly, to simplify the interaction between the local copy of sstables and those stored on object storage, we should use the same string representation of the sstable identifier.
so, in this change:
1. instead of creating a new UUID, just reuse the generation of the sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined by RFC 4122 (implemented by fmt::formatter<utils::UUID>) to the base36-based one (implemented by fmt::formatter<sstables::generation_type>)
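For illustration, the shorter base36 form can be sketched in Python; this is a toy rendering of a UUID's 128 bits that only shows why base36 is shorter than the 36-character RFC 4122 form, not Scylla's actual `fmt::formatter<sstables::generation_type>` encoding:

```python
import uuid

# Toy base36 rendering of a UUID's 128 bits (illustrative only; NOT
# Scylla's actual generation_type formatter).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36(n: int) -> str:
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

u = uuid.UUID("0aa490de-7a85-46e2-8f90-38b8f496d53b")
short = base36(u.int)  # 128 bits fit in at most 25 base36 digits
```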
Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#14406
* github.com:scylladb/scylladb:
sstable: remove _remote_prefix from s3_storage
sstable: switch to uuid identifier for naming S3 sstable objects
this series extracts the env-variable-related functions and removes unused `import`s for better readability.
Closes scylladb/scylladb#15796
* github.com:scylladb/scylladb:
test/pylib: remove duplicated imports
test/pylib: extract the env variable printing into MinIoServer
test/pylib: extract _set_environ() out
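The extraction described above might look roughly like the following sketch; the real helper lives in `test/pylib`, and the name here only illustrates the pattern of replacing module-level statements with a function:

```python
import os

# Sketch of the pattern: gather scattered os.environ assignments into a
# single helper instead of running them as module-level statements.
def _set_environ(env: dict) -> None:
    for name, value in env.items():
        os.environ[name] = value

_set_environ({"DEMO_VAR": "1"})
```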
default_compaction_progress_monitor returns a reference to a static
object. So, it should be read-only, but its users need to modify it.
Delete default_compaction_progress_monitor and use one's own
compaction_progress_monitor instance where it's needed.
Closes scylladb/scylladb#15800
This commit enables publishing documentation
from branch-5.4. The docs will be published
as UNSTABLE (the warning about version 5.4
being unstable will be displayed).
Closes scylladb/scylladb#15762
The log-structured allocator maintains memory reserves so that
operations using log-structured allocator memory can have some
working memory and can allocate. The reserves start small and are
increased if allocation failures are encountered. Before starting
an operation, the allocator first frees memory to satisfy the reserves.
One problem is that if the reserves are set to a high value and
we encounter a stall, then, first, we have no idea what value
the reserves are set to, and second, we have no idea what operation
caused the reserves to be increased.
We fix this problem by promoting the log reports of reserve increases
from DEBUG level to INFO level and by attaching a stack trace to
those reports. This isn't optimal since the messages are used
for debugging, not for informing the user about anything important
for the operation of the node, but I see no other way to obtain
the information.
Ref #13930.
Closes scylladb/scylladb#15153
Said method is called in an allocating section, which will re-try the enclosed lambda on allocation failure. `read_context()` however moves the permit parameter so on the second and later calls, the permit will be in a moved-from state, triggering a `nullptr` dereference and therefore a segfault.
We already have a unit test (`test_exception_safety_of_reads` in `row_cache_test.cc`) which was supposed to cover this, but:
* It only tests range scans, not single partition reads, which is a separate path.
* Turns out allocation failure tests are again silently broken (no error is injected at all). This is because `test/lib/memtable_snapshot_source.hh` creates a critical alloc section which accidentally covers the entire duration of tests using it.
Fixes: #15578
Closes scylladb/scylladb#15614
* github.com:scylladb/scylladb:
test/boost/row_cache_test: test_exception_safety_of_reads: also cover single-partition reads
test/lib/memtable_snapshot_source: disable critical alloc section while waiting
row_cache: make_reader_opt(): make make_context() reentrant
The semantics of major compaction are that all data of a table will be
compacted together, so the user can expect e.g. a recently introduced
tombstone to be compacted with the data it shadows.
Today, it can happen that data in the maintenance set won't be included
in the major compaction until it's promoted into the main set by
off-strategy compaction. So the user might be left wondering why major
compaction is not having the expected effect.
To fix this, let's perform off-strategy compaction first, so data in the
maintenance set will be made available to major compaction. A similar
approach is taken for data in memtables: flush is performed before major
compaction starts.
The only exception is data in staging, which cannot be compacted
until view building is done with it, to avoid inconsistency in view
replicas.
The serialization of reshape jobs in the compaction manager guarantees
correctness if there's an ongoing off-strategy compaction on behalf of
the table.
Fixes #11915.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15792
This patch adds a reproducer for a minor incompatibility between Scylla's
and Cassandra's handling of a prepared statement when a bind marker with
the same name is used more than once, e.g.,
```
SELECT * FROM tbl WHERE p=:x AND c=:x
```
It turns out that Scylla tells the driver that there is only one bind
marker, :x, whereas Cassandra tells the driver that there are two bind
markers, both named :x. This makes no difference if the user passes
a map `{'x': 3}`, but if the user passes a tuple, Scylla accepts only
`(3,)` (assigning both bind markers the same value) and Cassandra
accepts only `(3,3)`.
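A toy model of the difference, in illustrative Python (this is not the servers' or drivers' actual prepared-statement metadata code; it only mimics deduplicated vs. non-deduplicated marker reporting):

```python
import re

# Toy model: how many bind markers a server might report for a statement
# with a repeated named marker (illustrative only).
def reported_markers(cql: str, dedupe: bool) -> list:
    names = re.findall(r":(\w+)", cql)
    return list(dict.fromkeys(names)) if dedupe else names

q = "SELECT * FROM tbl WHERE p=:x AND c=:x"
scylla_like = reported_markers(q, dedupe=True)      # one marker: bind (3,)
cassandra_like = reported_markers(q, dedupe=False)  # two markers: bind (3, 3)
```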
The test added in this patch demonstrates this incompatibility.
It fails on Scylla, passes on Cassandra, and is marked "xfail".
Refs #15559
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#15564
The exception thrown from row_level_repair::run does not show the root
cause of a failure, making it harder to debug.
Add the internal exception contents to the runtime_error message.
After the change the log will mention the real cause (last line), e.g.:
repair - repair[92db0739-584b-4097-b6e2-e71a66e40325]: 33 out of 132 ranges failed,
keyspace=system_distributed, tables={cdc_streams_descriptions_v2, cdc_generation_timestamps,
view_build_status, service_levels}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false,
failed_because=seastar::nested_exception: std::runtime_error (Failed to repair for keyspace=system_distributed,
cf=cdc_streams_descriptions_v2, range=(8720988750842579417,+inf))
(while cleaning up after seastar::abort_requested_exception (abort requested))
Closes scylladb/scylladb#15770
This PR improves how handling failures is documented and made accessible to the user.
- The Handling Failures section is moved from Raft to Troubleshooting.
- Two new topics about failure are added to Troubleshooting with a link to the Handling Failures page (Failure to Add, Remove, or Replace a Node, Failure to Update the Schema).
- A note is added to the add/remove/replace node procedures to indicate that a quorum is required.
See individual commits for more details.
Fixes https://github.com/scylladb/scylladb/issues/13149
Closes scylladb/scylladb#15628
* github.com:scylladb/scylladb:
doc: add a note about Raft
doc: add the quorum requirement to procedures
doc: add more failure info to Troubleshooting
doc: move Handling Failures to Troubleshooting
instead of appending the options to the CMake variables, use the dedicated commands to do this. simpler this way, and as a bonus the options are de-duplicated.
Closes scylladb/scylladb#15797
* github.com:scylladb/scylladb:
build: cmake: use add_link_options() when appropriate
build: cmake: use add_compile_options() when appropriate
this series is one of the steps to remove global statements in `configure.py`.
not only is the script more structured this way, it also allows us to quickly identify the parts which should/can be reused when migrating to the CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15780
* github.com:scylladb/scylladb:
build: update modeval using a dict
build: pass args.test_repeat and args.test_timeout explicitly
build: pull in jsoncpp using "pkgs"
build: extract code fragments into functions
instead of appending to CMAKE_EXE_LINKER_FLAGS*, use
add_link_options() to add more options: CMAKE_EXE_LINKER_FLAGS*
is a string, and is typically set by the user, so let's use
add_link_options() instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of appending to CMAKE_CXX_FLAGS, use add_compile_options()
to add more options: CMAKE_CXX_FLAGS is a string, and is typically
set by the user, so let's use add_compile_options() instead. the
options added by this command are placed before CMAKE_CXX_FLAGS, and
will have lower priority.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we use the sstable.generation() for the remote prefix of
the key of the object for storing the sstable component, there is
no need to set remote_prefix beforehand.
since `s3_storage::ensure_remote_prefix()` and
`system_keyspace::sstables_registry_lookup_entry()` are not used
anymore, they are removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we create a new UUID for a new sstable managed
by the s3_storage, and we use the string representation of UUID
defined by RFC 4122, like "0aa490de-7a85-46e2-8f90-38b8f496d53b", for
naming the objects stored on s3_storage. but this representation is
not what we are using for storing sstables on the local filesystem when
the option "uuid_sstable_identifiers_enabled" is enabled. instead,
we are using a base36-based representation, which is shorter.
to be consistent with the naming of the sstables created for the local
filesystem, and more importantly, to simplify the interaction between
the local copy of sstables and those stored on object storage, we should
use the same string representation of the sstable identifier.
so, in this change:
1. instead of creating a new UUID, just reuse the generation of the
sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As
we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined
by RFC 4122 (implemented by fmt::formatter<utils::UUID>) to the
base36-based one (implemented by
fmt::formatter<sstables::generation_type>)
4. enable the `uuid_sstable_identifiers` cluster feature if it is
enabled in the `test_env_config`, so that the sstable manager
can create UUID-based identifiers for new sstables.
5. throw if the generation of an sstable is not UUID-based when
accessing / manipulating an sstable with the S3 storage backend, as
the S3 storage backend now relies on this option. otherwise
we'd have sstables with keys like s3://bucket/number/basename, which
cannot serve as a unique id for an sstable if the bucket is
shared across multiple tables.
Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The following new commands are implemented:
* stop
* compactionhistory
All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.
Refs: https://github.com/scylladb/scylladb/issues/15588
Closes scylladb/scylladb#15649
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement compactionhistory command
tools/scylla-nodetool: implement stop command
mutation/json: extract generic streaming writer into utils/rjson.hh
test/nodetool: rest_api_mock.py: add support for error responses
This is the continuation of 8c03eeb85d.
Registering API handlers for services needs to
* get the service to handle requests via an argument, not from the http context (the http context, in turn, is not going to depend on anything)
* unset the handlers on stop so that the service is not used after it's stopped (and before the API server is stopped)
This makes task manager handlers work this way
Closes scylladb/scylladb#15764
* github.com:scylladb/scylladb:
api: Unset task_manager test API handlers
api: Unset task_manager API handlers
api: Remove ctx->task_manager dependency
api: Use task_manager& argument in test API handlers
api: Push sharded<task_manager>& down the test API set calls
api: Use task_manager& argument in API handlers
api: Push sharded<task_manager>& down the API set calls
instead of updating `modes` with global statements, update it in
a function, for better readability and to reduce the statements in
global scope.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change adds the "jsoncpp" dependency using "pkgs". simpler this
way. it also helps to remove more global statements.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change extracts `get_warnings_options()`. it helps to
remove more global statements.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The options parameter is redundant. We always use
`_metadata->strategy_options()` and
`keyspace::create_replication_strategy` already assumes that
`_metadata` is set by using its other fields.
Closes scylladb/scylladb#15776
memtable_snapshot_source starts a background fiber in its constructor,
which compacts LSA memory in a loop. The loop's inside is covered with a
critical alloc section. It also contains a wait on a condition variable
and in its present form the critical section also covers the wait,
effectively turning off allocation failure injection for any test using
the memtable_snapshot_source.
This patch disables the critical alloc section while the loop waits on
the condition variable.
Said lambda currently moves the permit parameter, so on the second and
later calls it will possibly run into use-after-move. This can happen if
the allocating section below fails and is re-tried.
the java-related build dependencies are installed by
* tools/java/install-dependencies.sh
* tools/jmx/install-dependencies.sh
respectively. the parent `install-dependencies.sh` always
invokes these scripts, so there is no need to repeat them in the
parent `install-dependencies.sh` anymore.
in addition to deduplicating the build deps, this change also helps to
reduce the size of the build dependencies: by default, `dnf`
installs the weak deps unless `--setopt=install_weak_deps=False`
is passed to it, so this change also reduces the traffic
and footprint of the packages installed for building scylla.
see also 9dddad27bf
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15473
test.py calls `uninstall()` and `stop()` concurrently from exit
artifacts, and `uninstall()` internally calls `stop()`. This leads to
premature releasing of IP addresses from `uninstall()` (returning IPs to
the pool) while the servers using those IPs are still stopping. Then a
server might obtain that IP from the pool and fail to start due to
"Address already in use".
Put a lock around the body of `stop()` to prevent that.
Fixes: scylladb/scylladb#15755
Closes scylladb/scylladb#15763
before this change, we print
marshaling error: Value not compatible with type org.apache.cassandra.db.marshal.AsciiType: '...'
but the wording is not quite user friendly: it is a mapping of the
underlying implementation, and users would have difficulty understanding
"marshaling" and/or "org.apache.cassandra.db.marshal.AsciiType"
when reading this error message.
so, in this change
1. change the error message to:
Invalid ASCII character in string literal: '...'
which should be more straightforward, and easier to digest.
2. update the test accordingly
please note, the quoted non-ASCII string is preserved instead of
being printed in hex, as otherwise the user would not be able to map it
to their input.
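A toy sketch of producing the new message; Scylla's actual validation lives in C++, so this Python fragment only mirrors the wording of the change:

```python
# Toy sketch (illustrative only): reject non-ASCII input with the
# friendlier message introduced by this change, preserving the input.
def check_ascii(s: str) -> str:
    if not s.isascii():
        raise ValueError(f"Invalid ASCII character in string literal: '{s}'")
    return s

try:
    check_ascii("héllo")
except ValueError as e:
    msg = str(e)
```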
Refs #14320
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15678
* seastar bab1625c...17183ed4 (73):
> thread_pool: Reference reactor, not point to
> sstring: inherit publicly from string_view formatter
> circleci: use conditional steps
> weak_ptr: include used header
> build: disable the -Wunused-* warnings for checkheaders
> resource: move variable into smaller lexical scope
> resource: use structured binding when appropriate
> httpd: Added server and client addresses to request structure
> io_queue: do not dereference moved-away shared pointer
> treewide: explicitly define ctor and assignment operator
> memory: use `err` for the error string
> doc: Add document describing all the math behind IO scheduler
> io_queue: Add flow-rate based self slowdown backlink
> io_queue: Make main throttler uncapped
> io_queue: Add queue-wide metrics
> io_queue: Introduce "flow monitor"
> io_queue: Count total number of dispatched and completed requests so far
> io_queue: Introduce io_group::io_latency_goal()
> tests: test the vector overload for when_all_succeed
> core: add a vector overload to when_all_succeed
> loop: Fix iterator_range_estimate_vector_capacity for random iters
> loop: Add test for iterator_range_estimate_vector_capacity
> core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
> memory: s/throw()/noexcept/
> build: enable -Wdeprecated compiler option
> reactor: mark kernel_completion's dtor protected
> tests: always wait for promise
> http, json, net: define-generated copy ctor for polymorphic types
> treewide: do not define constexpr static out-of-line
> reactor: do not define dtor of kernel_completion
> http/exception: stop using dynamic exception specification
> metrics: replace vector with deque
> metrics: change metadata vector to deque
> utils/backtrace.hh: make simple_backtrace formattable
> reactor: Unfriend disk_config_params
> reactor: Move add_to_flush_poller() to internal namespace
> reactor: Unfriend a bunch of sched group template calls
> rpc_test: Test rpc send glitches
> net: Implement batch flush support for existing sockets
> iostream: Configure batch flushes if sink can do it
> net: Added remote address accessors
> circleci: update the image to CircleCI "standard" image
> build: do not add header check target if no headers to check
> build: pass target name to seastar_check_self_contained
> build: detect glibc features using CMake
> build: extract bits checking libc into CheckLibc.cmake
> http/exception: add formatter for httpd::base_exception
> http/client: Mark write_body() const
> http/client: Introduce request::_bytes_written
> http/client: Mark maybe_wait_for_continue() const
> http/client: Mark send_request_head() const
> http/client: Detach setup_request()
> http/api_docs: copy in api_docs's copy constructor
> script: do not inherit from object
> scripts: addr2line: change StdinBacktraceIterator to a function
> scripts: addr2line: use yield instead defining a class
> tests: skip tests that require backtrace if execinfo.h is not found
> backtrace: check for existence of execinfo.h
> core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
> core: add sleep_abortable instantiation for manual_clock
> tls: Return EPIPE exception when writing to shutdown socket
> http/client: Don't cache connection if server advertises it
> http/client: Mark connection as "keep in cache"
> core: fix strerror_r usage from glibc extension
> reactor: access sigevent.sigev_notify_thread_id with a macro
> posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
> reactor: replace __mode_t with mode_t
> reactor: change sys/poll.h to posix poll.h
> rpc: Add unit test for per-domain metrics
> rpc: Report client connections metrics
> rpc: Count dead client stats
> rpc: Add seastar::rpc::metrics
> rpc: Make public queues length getters
io-scheduler fixes
refs: #15312
refs: #11805
http client fixes
refs: #13736
refs: #15509
rpc fixes
refs: #15462
Closes scylladb/scylladb#15774
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc.) is used to estimate the amount of droppable
data (or to determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, even though the user asked for a new one.
Another example is when you play with the value of the grace period, to prevent
data resurrection if repair can't run in a timely manner.
The problem persists until all sstables using the old GC settings are recompacted
or the node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes #15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15746
messaging_service.cc depends on idl, but many source files in
scylla-main do not depend on idl, so let's
* move "message/*" into its own directory and add an inter-library
dependency between it and the "idl" library.
* rename the target of "message" under test/manual to "message_test"
to avoid the name collision
this should address the compilation failure of
```
FAILED: CMakeFiles/scylla-main.dir/message/messaging_service.cc.o
/usr/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -MF CMakeFiles/scylla-main.dir/message/messaging_service.cc.o.d -o CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -c /home/kefu/dev/scylladb/message/messaging_service.cc
/home/kefu/dev/scylladb/message/messaging_service.cc:81:10: fatal error: 'idl/join_node.dist.hh' file not found
^~~~~~~~~~~~~~~~~~~~~~~
```
where the compiler failed to find the included `idl/join_node.dist.hh`,
which is exposed by the idl library as part of its public interface.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15657
this series is one of the steps to remove global statements in `configure.py`.
not only is the script more structured this way, it also allows us to quickly identify the parts which should/can be reused when migrating to the CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15668
* github.com:scylladb/scylladb:
build: move check for NIX_CC into dynamic_linker_option()
build: extract dynamic_linker_option() out
build: move `headers` into write_build_file()
to match the behavior of `configure.py`.
Closes scylladb/scylladb#15667
* github.com:scylladb/scylladb:
build: cmake: pass -dynamic-linker to ld
build: cmake: set CMAKE_EXE_LINKER_FLAGS in mode.common.cmake
`expression` is a std::variant with 16 different variants
that represent different types of AST nodes.
Let's add documentation that explains what each of these
16 types represents. For people who are not familiar with expression
code it might not be clear what each of them does, so let's add
clear descriptions for all of them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#15767
It seems that Scylla's DeleteTable operation returns more values than DynamoDB's.
In this patch I added a table-status check when generating the output.
If we delete the table, the values KeySchema, AttributeDefinitions and CreationDateTime won't be returned.
The test has also been modified to check that these attributes are not returned.
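As an illustration, the status check could look like this toy Python sketch (the function and dict shapes are made up; only the field names follow DynamoDB's API):

```python
# Toy sketch (made-up function and shapes): trim the fields that DynamoDB
# does not return for a table that is being deleted.
def delete_table_description(desc: dict) -> dict:
    trimmed = dict(desc)
    if trimmed.get("TableStatus") == "DELETING":
        for key in ("KeySchema", "AttributeDefinitions", "CreationDateTime"):
            trimmed.pop(key, None)
    return trimmed

resp = delete_table_description({
    "TableName": "t", "TableStatus": "DELETING",
    "KeySchema": [{"AttributeName": "p", "KeyType": "HASH"}],
    "AttributeDefinitions": [], "CreationDateTime": 1.0,
})
```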
Fixes scylladb#14132
Closes scylladb/scylladb#15707
this series
1. lets sstable tests that use test_env use uuid-based sstable identifiers by default
2. lets the tests that require integer-based identifiers keep using them
this should enable us to perform the s3-related tests after enforcing the uuid-based identifier for the s3 backend; otherwise the s3-related tests would fail, as they also utilize `test_env`.
Closes scylladb/scylladb#14553
* github.com:scylladb/scylladb:
test: set use_uuid to true by default in sstables::test_env
test: enable test to set uuid_sstable_identifiers
Now the task manager's API (and test API) use the argument, and this
explicit dependency is no longer required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A memtable object contains two logalloc::allocating_section members
that track memory allocation requirements during reads and writes.
Because these are local to the memtable, each time we seal a memtable
and create a new one, these statistics are forgotten. As a result
we may have to re-learn the typical size of reads and writes, incurring
a small performance penalty.
The solution is to move the allocating_section object to the memtable_list
container. The workload is the same across all memtables of the same
table, so we don't lose discrimination here.
The performance penalty may be increased later if we log changes to
memory reserve thresholds, including a backtrace, so this reduces the
odds of incurring such a penalty.
Closes scylladb/scylladb#15737
`deletion_time` is a part of the `partition_header`, which is in turn
a part of `partition`, and `data_file` is a sequence of `partition`s.
`data_file` represents the *-Data.db component of an SSTable.
see docs/architecture/sstable3/sstables-3-data-file-format.rst.
we always parse the data component via `flat_mutation_reader_v2`, which is in turn
implemented with mx/reader.cc or kl/reader.cc depending on
the version of SSTable to be read.
in other words, we decode `deletion_time` in mx/reader.cc or
kl/reader.cc, not in sstable.cc. so let's drop the parse() overload
for deletion_time: it's not necessary and, more importantly, it's
confusing.
Refs #15116
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15756
Currently distributed_loader starts sharded<sstable_directory> with four sharded parameters. That's quite bulky and can be made much shorter.
Closes scylladb/scylladb#15653
* github.com:scylladb/scylladb:
distributed_loader: Remove explicit sharded<erms>
distributed_loader: Brush up start_subdir()
sstable_directory: Add enlightened construction
table: Add global_table_ptr::as_sharded_parameter()
This commit:
- Removes upgrade guides for versions older than 5.0.
The oldest one is from version 4.6 to 5.0.
- Adds the redirections for the removed pages.
Closes scylladb/scylladb#15709
pytest changes the test's sys.stdout and sys.stderr to the
captured fds when it captures the outputs of the test, so querying
`sys.stdout.fileno()` and `sys.stderr.fileno()` no longer gives us
STDOUT_FILENO and STDERR_FILENO:
their return values are not 1 and 2 anymore, unless pytest
is started with "-s".
so, to ensure that we always redirect the child process's
outputs to the log file, we need to use 1 and 2 for accessing
the well-known fds, which are the ones used by the child
process when it writes to stdout and stderr.
this change should address the problem that the log file is
always empty unless "-s" is specified.
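The fd redirection described above can be sketched as follows. This is a minimal, hypothetical stand-in for the test runner's logging (the name `run_child_logged` is illustrative, not the actual test.py code); the point is that we hand the child the well-known fds instead of asking the possibly replaced `sys.stdout` for them:

```python
import subprocess
import sys
import tempfile

def run_child_logged(cmd, log_path):
    """Run cmd with its stdout and stderr (fds 1 and 2) going to log_path.

    Under pytest's output capture sys.stdout.fileno() no longer returns 1,
    so instead of asking the (replaced) sys.stdout for an fd we pass the
    log file to subprocess, which dup2()s it onto the well-known fds 1
    and 2 in the child before exec.
    """
    with open(log_path, "ab") as log:
        return subprocess.run(cmd, stdout=log, stderr=log).returncode

# Hypothetical usage: the child writes to its own fd 1, which lands in the log.
log_path = tempfile.mktemp()
rc = run_child_logged([sys.executable, "-c", "print('hello from child')"],
                      log_path)
```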
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15560
compaction_read_monitor_generator is an existing mechanism
for monitoring the progress of sstable reads during compaction.
In this change information gathered by compaction_read_monitor_generator
is utilized by task manager compaction tasks of the lowest level,
i.e. compaction executors, to calculate task progress.
compaction_read_monitor_generator has a flag, which decides whether
monitored changes will be registered by compaction_backlog_tracker.
This allows us to pass the generator to all compaction readers without
impacting the backlog.
Task executors have access to compaction_read_monitor_generator_wrapper,
which protects the internals of compaction_read_monitor_generator
and provides only the necessary functionality.
Closes scylladb/scylladb#14878
* github.com:scylladb/scylladb:
compaction: add get_progress method to compaction_task_impl
compaction: find total compaction size
compaction: sstables: monitor validation scrub with compaction_read_generator
compaction: keep compaction_progress_monitor in compaction_task_executor
compaction: use read monitor generator for all compactions
compaction: add compaction_progress_monitor
compaction: add flag to compaction_read_monitor_generator
This is a follow-up for #15279 and it fixes two problems.
First, we restore flushes on writes for the tables that were switched to the schema commitlog if `SCHEMA_COMMITLOG` feature is not yet enabled. Otherwise durability is not guaranteed.
Second, we address the problem with truncation records, which could refer to the old commitlog if any of the switched tables were truncated in the past. If the node crashes later, and we replay schema commitlog, we may skip some mutations since their `replay_position`s will be smaller than the `replay_position`s stored for the old commitlog in the `truncated` table.
It turned out that this problem exists even if we don't switch commitlogs for tables. If the node was rebooted the segment ids will start from some small number - they use `steady_clock` which is usually bound to boot time. This means that if the node crashed we may skip the mutations because their RPs will be smaller than the last truncation record RP.
To address this problem we delete truncation records as soon as commitlog is replayed. We also include a test which demonstrates the problem.
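A toy model of the replay filter (all names here are illustrative, not Scylla's actual replay code) shows why a stale truncation record loses data after a reboot:

```python
def mutations_to_replay(mutations, truncated_rp):
    """Replay skips any mutation at or before the table's recorded
    truncation replay position. Illustrative sketch only: replay
    positions are modeled as plain integers."""
    if truncated_rp is None:
        return list(mutations)
    return [rp for rp in mutations if rp > truncated_rp]

# Before reboot: a truncation was recorded at replay position 1000.
# After reboot, segment ids restart near 0 (they are derived from
# steady_clock, which is bound to boot time), so fresh mutations get
# small replay positions and are all wrongly skipped:
fresh = [1, 2, 3]
lost = mutations_to_replay(fresh, truncated_rp=1000)   # everything skipped

# Deleting the truncation record once replay is done avoids this:
safe = mutations_to_replay(fresh, truncated_rp=None)
```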
Fixes #15354
Closes scylladb/scylladb#15532
* github.com:scylladb/scylladb:
add test_commitlog
system.truncated: Remove replay_position data from truncated on start
main.cc: flush only local memtables when replaying schema commitlog
main.cc: drop redundant supervisor::notify
system_keyspace: flush if schema commitlog is not available
The status is no longer used. The function that referenced it was
removed by 5a96751534, and it had already been unused
for a while back then.
Message-Id: <ZS92mcGE9Ke5DfXB@scylladb.com>
When populating the system keyspace, the sstable_directory forgets to create the upload/ subdir in the tables' datadir because of the way it's invoked from the distributed loader. For non-system keyspaces, directories are created in table::init_storage(), which is self-contained and just creates the whole layout regardless of what already exists.
This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables.
Test included.
fixes: #15708
closes: scylladb/scylla-manager#3603
Closes scylladb/scylladb#15723
* github.com:scylladb/scylladb:
test: Add test for datadir/ layout
sstable_directory: Indentation fix after previous patch
db,sstables: Move storage init for system keyspace to table creation
before this change, they read
> Was local deletion time capped at ...
and
> Was partition tombstone deletion time capped at ...
the "Was" part is confusing, and the first description is not
accurate enough, so let's improve them a little bit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15108
the `utils::UUID` class is not used by the implementation of
`canonical_mutation`, so let's remove the include from this source file.
the `#include` was originally added in
5a353486c6, but that commit did not
add any code using UUID to this file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15731
This commit removes the information about the recommended way of upgrading ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade procedure is not supported (it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.0-to-5.1 upgrade guide and replace it with general info.
- Remove the information from the 4.6-to-5.0 upgrade guide and replace it with general info.
- Remove the information from the 5.x.y-to-5.x.z upgrade guide and replace it with general info.
- Remove the following files as no longer necessary (they were only created to incorporate the (invalid) information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
This PR is a continuation of https://github.com/scylladb/scylladb/pull/15739.
**This PR must be backported to branch-5.2 and branch-5.1.**
Closes scylladb/scylladb#15740
* github.com:scylladb/scylladb:
doc: remove wrong image upgrade info (5.x.y-to-5.x.y)
doc: remove wrong image upgrade info (4.6-to-5.0)
doc: remove wrong image upgrade info (5.0-to-5.1)
This commit removes the information about
the recommended way of upgrading ScyllaDB
images - by updating ScyllaDB and OS packages
in one step.
This upgrade procedure is not supported
(it was implemented, but then reverted).
The scope of this commit:
- Remove the information from the 5.1-to.-5.2
upgrade guide and replace with general info.
- Remove the information from the Image Upgrade
page.
- Remove outdated info (about previous releases)
from the Image Upgrade page.
- Rename "AMI Upgrade" as "Image Upgrade"
in the page tree.
Refs: https://github.com/scylladb/scylladb/issues/15733
Closes scylladb/scylladb#15739
Fixes #14870
(Originally suggested by @avikivity). Use commit log stored GC clock min positions to narrow compaction GC bounds.
(Still requires augmented manual flushes with extensive CL clearing to pass various dtests, but this does not affect "real" execution).
Records the lowest GC clock timestamp whenever a CF is added to a CL segment for the first time. Because GC clock is wall
clock time and only connected to TTL (not cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment. This is then (in a rather ugly way) propagated to tombstone_gc_state to narrow the allowed GC bounds for
a CF, based on what is currently left in CL.
Note: this is a rather unoptimized version - no caching or anything. But even so, should not be excessively expensive,
esp. since various other code paths already cache the results.
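A minimal sketch of the per-segment bookkeeping described above, with illustrative names only (the real implementation lives in commitlog and tombstone_gc_state): remember the lowest GC clock value per (segment, CF), and expose the all-segment minimum per CF to narrow the allowed GC bound.

```python
from collections import defaultdict

class CommitlogGCBounds:
    """Illustrative bookkeeping, not Scylla's actual classes: track the
    lowest GC clock value seen per (segment, column family) and expose
    the all-segment minimum per CF as the narrowed GC bound."""

    def __init__(self):
        # segment id -> {cf name: lowest gc clock value seen}
        self._per_segment = defaultdict(dict)

    def add(self, segment, cf, gc_now):
        seg = self._per_segment[segment]
        # Only the lowest value per segment matters for the bound.
        if cf not in seg or gc_now < seg[cf]:
            seg[cf] = gc_now

    def segment_closed(self, segment):
        # Once a segment is flushed and recycled, its entries stop
        # constraining GC.
        self._per_segment.pop(segment, None)

    def min_gc_time(self, cf):
        times = [m[cf] for m in self._per_segment.values() if cf in m]
        return min(times, default=None)

bounds = CommitlogGCBounds()
bounds.add("seg1", "cf_a", 100)
bounds.add("seg1", "cf_a", 90)      # lower value wins
bounds.add("seg2", "cf_a", 200)
low = bounds.min_gc_time("cf_a")    # bound while seg1 is still live
bounds.segment_closed("seg1")
after = bounds.min_gc_time("cf_a")  # bound relaxes once seg1 is gone
```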
Closes scylladb/scylladb#15060
* github.com:scylladb/scylladb:
main/cql_test_env: Augment compaction mgr tombstone_gc_state with CL GC info
tombstone_gc_state: Add optional callback to augment GC bounds
commitlog: Add keeping track of approximate lowest GC clock for CF entries
database: Force new commitlog segment on user initiated flush
commitlog: Add helper to force new active segment
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.x.y-to-5.x.z upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
In addition, the following files are removed as no longer
necessary (they were only created to incorporate the (invalid)
information about image upgrade into the upgrade guides):
/upgrade/_common/upgrade-image-opensource.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p1.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian-p2.rst
/upgrade/_common/upgrade-guide-v5-patch-ubuntu-and-debian.rst
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 4.6-to-5.0 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
Check that commitlog provides durability in case
of a node reboot:
* truncate table T, truncation_record RP=1000;
* clean shutdown node/reboot machine/restart node, now RP=~0
since segment ids count from boot time;
* write some data to T; crash/restart
* check data is retained
Once we've started clean, and all replaying is done, the truncation
records' commit log replay positions are invalid. We should exorcise
them as soon as possible. Note that we cannot remove truncation data
completely though, since the time stamps stored are used by things like
batch log to determine if it should use or discard old batch data.
Later in the code we have 'replaying schema commit log',
which duplicates this one. Also,
maybe_init_schema_commitlog may skip schema commitlog
initialization if the SCHEMA_COMMITLOG feature is
not yet supported by the cluster, so this notification
can be misleading.
In PR #15279 we removed flushes when writing to a number
of tables from the system keyspace. This was made possible
by switching these tables to the schema commitlog.
Schema commitlog is enabled only when the SCHEMA_COMMITLOG
feature is supported by all nodes in the cluster. Before that
these tables will use the regular commitlog, which is not
durable because it uses db::commitlog::sync_mode::PERIODIC. This
means that we may lose data if a node crashes during upgrade
to the version with schema commitlog.
In this commit we fix this problem by restoring flushes
after writes to the tables if the schema commitlog
is not enabled yet.
The patch also contains a test that demonstrates the
problem. We need flush_schema_tables_after_modification
option since otherwise schema changes are not durable
and node fails after restart.
This commit removes the invalid information about
the recommended way of upgrading ScyllaDB
images (by updating ScyllaDB and OS packages
in one step) from the 5.0-to-5.1 upgrade guide.
This upgrade procedure is not supported (it was
implemented, but then reverted).
Refs https://github.com/scylladb/scylladb/issues/15733
there is no need to explicitly cast an instance of
std::chrono::hours to std::chrono::milliseconds to feed it to a
function which expects std::chrono::milliseconds. the constructor
of std::chrono::milliseconds is able to perform this conversion and
create a new instance of std::chrono::milliseconds from another
std::chrono::duration<> instance.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15734
You can now pass `expected_error` to `ManagerClient.decommission_node`
and `ManagerClient.remove_node`. Useful in combination with error
injections, for example.
Closes scylladb/scylladb#15650
Fixes #14870 (yet another alternative solution)
(Originally suggested by @avikivity). Use stored GC clock min positions from the CL
to narrow compaction GC bounds.
Note: not optimized with caches or anything at this point. Can easily be added
though of course always somewhat risky.
Adds a lowest timestamp of GC clock whenever a CF is added to a CL segment
first. Because GC clock is wall clock time and only connected to TTL (not
cell/row timestamps), this gives a fairly accurate view of GC low bounds
per segment.
Includes of course a function to get the all-segment lowest per CF.
On some environments, such as VMware instances, /dev/disk/by-uuid/<UUID> is
not available, and scylla_raid_setup will fail while mounting the volume.
To avoid failing to mount /dev/disk/by-uuid/<UUID>, fetch all available
paths to mount the disk and fall back to other paths like by-partuuid,
by-id or by-path, or just use the real device path like /dev/md0.
To get the device path, and also to dump device status when the UUID is not
available, this introduces a UdevInfo class which communicates with udev
using pyudev.
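The fallback order can be sketched as follows. `pick_mount_path` and the candidate list are hypothetical names, and the real script additionally queries udev via pyudev; this only shows the "first available path wins" logic:

```python
import os

def pick_mount_path(candidates, exists=os.path.exists):
    """Return the first available path for mounting.

    On environments such as VMware guests /dev/disk/by-uuid/<UUID> may
    be missing, so fall back through the other stable link directories
    and finally the raw device node. Illustrative sketch, not the
    actual scylla_raid_setup code.
    """
    for path in candidates:
        if path and exists(path):
            return path
    raise FileNotFoundError("no usable device path among %r" % (candidates,))

# Hypothetical candidate list, most stable name first.
candidates = [
    "/dev/disk/by-uuid/1234-abcd",
    "/dev/disk/by-partuuid/5678-efgh",
    "/dev/disk/by-id/md-name-raid0",
    "/dev/disk/by-path/pci-0000:00:1f.2",
    "/dev/md0",
]
# Simulate a VMware-like guest where only the raw device node exists.
chosen = pick_mount_path(candidates, exists=lambda p: p == "/dev/md0")
```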
Related #11359
Closes scylladb/scylladb#13803
This PR contains several refactorings related to truncation record handling in the `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`; it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`.
Closes scylladb/scylladb#15583
* github.com:scylladb/scylladb:
system_keyspace: drop truncation_record
system_keyspace: remove get_truncated_at method
table: get_truncation_time: check _truncated_at is initialized
database: add_column_family: initialize truncation_time for new tables
database: add_column_family: rename readonly parameter to is_new
system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
commitlog_replayer: refactor commitlog_replayer::impl::init
system_keyspace: drop redundant typedef
system_keyspace: drop redundant save_truncation_record overload
table: rename cache_truncation_record -> set_truncation_time
system_keyspace: get_truncated_position -> get_truncated_positions
The estimation wrongly accounted for the size of all components
when estimating the number of partitions for each output sstable.
The sstables are split according to the data file size, therefore
the size of other files is irrelevant for the estimation.
With certain data models, like single-row partitions containing small
values, the index can be even larger than the data.
For example, if the index is as large as the data, the estimation
would say that 2x more sstables will be generated, and as a result
each sstable's key count is underestimated by 2x.
Fix it by accounting only for the size of the data file.
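The underestimation can be reproduced with a small worked example (illustrative arithmetic only, not the actual estimation code):

```python
def estimate_partitions_per_sstable(total_partitions, accounted_bytes,
                                    target_sstable_bytes):
    """Split the output so each sstable holds target_sstable_bytes of
    the accounted size; partitions per sstable follow proportionally.
    Illustrative arithmetic, not Scylla's actual code."""
    n_sstables = max(1, accounted_bytes // target_sstable_bytes)
    return total_partitions // n_sstables

# Single-row partitions with small values: index roughly as big as data.
data_bytes = 1_000_000
index_bytes = 1_000_000
partitions = 100_000
target = 500_000

# Buggy: counting the index alongside data doubles the sstable count
# and halves the per-sstable partition estimate.
buggy = estimate_partitions_per_sstable(partitions,
                                        data_bytes + index_bytes, target)
# Fixed: account for the data file only, since splitting follows data size.
fixed = estimate_partitions_per_sstable(partitions, data_bytes, target)
```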
Fixes #15726.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15727
If abort is requested during bootstrap, then a node should exit normally.
To achieve this, abort_requested_exception should be thrown, as main
handles it gracefully.
In data_sync_repair_task_impl::run, exceptions from all shards are
wrapped together into std::runtime_error and so they aren't
handled as they are supposed to be.
Throw abort_requested_exception when shutdown was requested.
Throw abort_requested_exception also if repair::task_manager_module::is_aborted,
so that force_terminate_all_repair_sessions acts the same regardless of
the state of the repair.
To maintain consistency, do the same for user_requested_repair_task_impl.
Fixes: #15710.
Closes scylladb/scylladb#15722
before this change, pytest does not populate its suite's
`scylla_env` down to the forked pytest child process. this works
if the test does not care about the env variables in `scylla_env`,
but object_store is an exception, as it launches scylla instances
by itself. so, without the help of `scylla_env`, `run.find_scylla()`
always finds the newest file globbed by `build/*/scylla`. this is not
always what we expect. on the contrary, if we launch object_store's
pytest using `test.py`, there are good chances that object_store
ends up testing a wrong scylla executable if we have multiple
builds under `build/*/scylla`.
so, in this change, we populate `self.suite.scylla_env` down to
the child process created by `PythonTest`, so that all pytest
based tests can have access to their suite's env variables.
in addition to the 'SCYLLA' env variable, they also include
the env variables required by LLVM code coverage instrumentation.
this is also nice to have.
Fixes #15679
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15682
The stall detector uses glibc backtrace function to
collect backtraces, this causes ASAN failures on ARM.
For now we just disable the stall detector in this
configuration, the ticket about migrating
to libunwind: scylladb/seastar#1878
We increase the value of blocked_reactor_notify_ms to
make sure the stall detector never fires.
Fixes #15389
Fixes #15090
Closes scylladb/scylladb#15720
The test checks that
- for non-system keyspace datadir and its staging/ and upload/ subdirs
are created when the table is created _and_ that the directory is
re-populated on boot in case it was explicitly removed
- for system non-virtual tables it checks that the same directory layout
is created on boot
- for system virtual tables it checks that the directory layout doesn't
exist
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
User and system keyspaces are created and populated slightly
differently.
System keyspace is created via system_keyspace::make() which eventually
calls add_column_family(). Then it's populated via
init_system_keyspace() which calls sstable_directory::prepare() which,
in turn, optionally creates directories in datadir/ or checks the
directory permissions if it exists.
User keyspaces are created with the help of
add_column_family_and_make_directory() call which calls the
add_column_family() mentioned above _and_ calls table::init_storage() to
create directories. When it's populated with init_non_system_keyspaces()
it also calls sstable_directory::prepare() which notices that the
directory exists and then checks the permissions.
As a result, sstable_directory::prepare() initializes storage for system
keyspace only and there's a BUG (#15708) that the upload/ subdir is not
created.
This patch makes directory creation go through table::init_storage()
for _all_ keyspaces. The change only touches the system keyspace, by moving
the creation of directories from sstable_directory::prepare() into
system_keyspace::make().
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We refactor table_helper::setup_keyspace so that it calls
migration_manager::announce at most twice. We achieve it by
announcing all tables at once.
The number of announcements should further be reduced to one, but
it requires a big refactor. The CQL code used in
parse_new_cf_statement assumes the keyspace has already been
created. We cannot have such an assumption if we want to announce
a keyspace and its tables together. However, we shouldn't touch
the CQL code as it would impact user requests, too.
One solution is using schema_builder instead of the CQL statements
to create tables in table_helper.
Another approach is removing table_helper completely. It is used
only for the system_traces keyspace, which Scylla creates
automatically. We could refactor the way Scylla handles this
keyspace and make table_helper unneeded.
In the following commit, we reduce migration_manager::announce
calls in table_helper::setup_keyspace by announcing all tables
together. To do it, we cannot use table_helper::setup_table
anymore, which announces a single table itself. However, the new
code still has to translate CQL statements, so we extract it to the
new parse_new_cf_statement function to avoid duplication.
We refactor system_distributed_keyspace::start so that it takes at
most one group 0 guard and calls migration_manager::announce at
most once.
We remove a catch expression together with the FIXME from
get_updated_service_levels (add_new_columns_if_missing before the
patch) because we cannot treat the service_levels update
differently anymore.
After adding the keyspace_metadata parameter to
migration_listener::on_before_create_column_family,
tablet_allocator doesn't need to load it from the database.
This change is necessary before merging migration_manager::announce
calls in the following commit.
After adding the new prepare_new_column_family_announcement that
doesn't assume the existence of a keyspace, we also need to get
rid of the same assumption in all on_before_create_column_family
calls. After all, they may be initiated before creating the
keyspace. However, some listeners require keyspace_metadata, so we
pass it as a new parameter.
We can use the new prepare_new_column_family_announcement function
that doesn't assume the existence of the keyspace instead of the
previous work-around.
We need to store a new keyspace's keyspace_metadata as a local
variable in create_table_on_shard0. In the following commit, we
use it to call the new prepare_new_column_family_announcement
function.
In the following commits, we reduce the number of the
migration_manager::announce calls by merging some of them in a way
that logically makes sense. Some of these merges are similar --
we announce a new keyspace and its tables together. However,
we cannot use the current prepare_new_column_family_announcement
there because it assumes that the keyspace has already been created
(when it loads the keyspace from the database). Luckily, this
assumption is not necessary as this function only needs
keyspace_metadata. Instead of loading it from the database, we can
pass it as a parameter.
Validation scrub bypasses the usual compaction machinery, though it
still needs to be tracked with compaction_progress_monitor so that
we can reach its progress from the compaction task executor.
Track sstable scrub in validate mode with read monitors.
Keep compaction_progress_monitor in compaction_task_executor and pass a reference
to it further, so that the compaction progress could be retrieved out of it.
Compaction read monitor generators are used in all compaction types.
Classes which did not use _monitor_generator so far create it with
_use_backlog_tracker set to no, so as not to impact the backlog tracker.
In the following patches, compaction_read_monitor_generator will be used
to find the progress of compaction_task_executors. To avoid unnecessary
lifetime prolongation and exposing internals of the class outside of
compaction.cc, compaction_progress_monitor is created.
Compaction class keeps a reference to the compaction_progress_monitor.
Inheriting classes which actually use compaction_read_monitor_generator,
need to set it with set_generator method.
Following patches will use compaction_read_monitor_generator
to track progress of all types of compaction. Some of them should
not be registered in compaction_backlog_tracker.
The _use_backlog_tracker flag, which is set to true by default, is
added to compaction_read_monitor_generator and passed to all
compaction_read_monitors created by this generator.
Currently, when the test is executed too quickly, the timestamp
inserted into the 'my_table' table might be the same as the
timestamp used in the SELECT statement for comparison. However,
the statement only selects rows where the inserted timestamp
is strictly lower than the current timestamp. As a result, when this
comparison fails, we may skip executing the following comparison,
which uses a user-defined function, due to which the statement
is supposed to fail with an error. Instead, the select statement
simply returns no rows and the test case fails.
To fix this, simply use the less-or-equal operator instead
of the strictly-less operator for comparing timestamps.
Fixes #15616
Closes scylladb/scylladb#15699
This is in order to aid investigation of the flakiness of the test, which
fails due to a timeout during a scan after cluster restart in debug mode.
See #14746.
I enable trace-level logging for some scylla-side loggers and
inject logging of sent and received messages on the driver side.
Closes scylladb/scylladb#15696
When a remote view update doesn't succeed there's a log message
saying "Error applying view update...".
This message had log level ERROR, but it's not really a hard error.
View updates can fail for a multitude of reasons, even during normal operation.
A failing view update isn't fatal; it will be saved as a view hint and retried later.
Let's change the log level to WARN. It's something that shouldn't happen too much,
but it's not a disaster either.
ERROR log level causes trouble in tests which assume that an ERROR level message
means that the test has failed.
Refs: https://github.com/scylladb/scylladb/issues/15046#issuecomment-1712748784
For local view updates the log level stays at "ERROR", local view updates shouldn't fail.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#15640
* tools/cqlsh 66ae7eac...426fa0ea (8):
> Updated Scylla Driver[Issue scylladb/scylla-cqlsh#55]
> copyutil: closing the local end of pipes after processes starts
> setup.py: specify Cython language_level explicitly
> setup.py: pass extensions as a list
> setup.py: reindent block in else branch
> setup.py: early return in get_extension()
> reloc: install build==0.10.0
> reloc: add --verbose option to build_reloc.sh
Fixes: https://github.com/scylladb/scylla-cqlsh/issues/37
Closes scylladb/scylladb#15685
test_storage_service_keyspace_cleanup_with_no_owned_ranges
from test_storage_service.py creates snapshots with tags based
on the current time. Thus, if a test runs on the same node twice
within a short enough time interval, there may be a name collision
between the snapshots from two runs. This will cause the second
run to fail on assertions.
Use new_test_snapshot fixture to drop snapshots after the test.
Delete my_snapshot_tags as it's no longer necessary.
Fixes: #15680.
Closes scylladb/scylladb#15683
In 20ff2ae5e1 mutating endpoints were
changed to use PUT. But some of them return a response, and I forgot to
provide `response_type` parameter to `put_json` (which causes
`RESTClient` to actually obtain the response). These endpoints now
return `None`.
Fix this.
Closes scylladb/scylladb#15674
`employ_ld_trickery` is only used by `dynamic_linker_option()`, so
move it into this function.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so that CMakeLists.txt is less cluttered, as we will append the
`--dynamic-linker` option to the LDFLAGS.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The logger is not thread safe, so a multithreaded test can concurrently
write into the log, yielding unreadable XMLs.
Example:
boost/sstable_directory_test: failed to parse XML output '/scylladir/testlog/x86_64/release/xml/boost.sstable_directory_test.sstable_directory_shared_sstables_reshard_correctly.3.xunit.xml': not well-formed (invalid token): line 1, column 1351
The critical (today's unprotected) section is in boost/test/utils/xml_printer.hpp:
```
inline std::ostream&
operator<<( custom_printer<cdata> const& p, const_string value )
{
*p << BOOST_TEST_L( "<![CDATA[" );
print_escaped_cdata( *p, value );
return *p << BOOST_TEST_L( "]]>" );
}
```
The problem is not restricted to xml, but the unreadable xml file caused
the test to fail when trying to parse it, to present a summary.
New thread-safe variants of BOOST_REQUIRE and BOOST_REQUIRE_EQUAL are
introduced to help multithreaded tests. We'll start patching tests of
sstable_directory_test that will call BOOST_REQUIRE* from multiple
threads. Later, we can expand its usage to other tests.
Fixes #15654.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15655
for better coverage of the uuid-based sstable identifier. since this
option is enabled by default, this also matches our tests with the
default behavior of scylladb.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
some of the tests are still relying on the integer-based sstable
identifier, so let's add a method to test_env, so that the tests
relying on this can opt out. we will change the default setting
of sstables::test_env to use the uuid-based sstable identifier in the
next commit. this change does not change the existing behavior;
it just adds a new knob to test_env_config and lets the tests
relying on this customize the test_env_config to disable
use_uuid.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This PR is another step in refactoring the Hinted Handoff module. It aims at modernizing the code by moving to coroutines, using `std::ranges` instead of Boost's ones where possible, and using other features coming with the new C++ standards.
It also tries to make the code clearer and get rid of confusing elements, e.g. using shared pointers where they shouldn't be used or marking methods as virtual even though nothing derives from the class. It also prevents `manager.hh` from giving direct access to internal structures (`hint_endpoint_manager` in this case).
Refs #15358
Closes scylladb/scylladb#15631
* github.com:scylladb/scylladb:
db/hints/manager: Reword comments about state
db/hints/manager: Unfriend space_watchdog
db/hints: Remove a redundant alias
db/hints: Remove an unused namespace
db/hints: Coroutinize change_host_filter()
db/hints: Coroutinize drain_for()
db/hints: Clean up can_hint_for()
db/hints: Clean up store_hint()
db/hints: Clean up too_many_in_flight_hints_for()
db/hints: Refactor get_ep_manager()
db/hints: Coroutinize wait_for_sync_point()
db/hints: Use std::span in calculate_current_sync_point
db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints()
db/hints: Clean up manager::forbid_hints()
db/hints: Clean up manager::allow_hints()
db/hints: Coroutinize compute_hints_dir_device_id()
db/hints: Clean up manager::stop()
db/hints: Clean up manager::start()
db/hints/manager: Clean up the constructor
db/hints: Remove boilerplate drain_lock()
db/hints: Let drain_for() return a future
db/hints: Remove ep_managers_end
db/hints: Remove find_ep_manager
db/hints: Use manager as API for hint_endpoint_manager
db/hints: Don't mark have_ep_manager()'s definition as inline
db/hints: Remove make_directory_initializer()
db/hints/manager: Order constructors
db/hints: Move ~manager() and mark it as noexcept
db/hints: Use reference for storage proxy
db/hints/manager: Explicitly delete copy constructor
db/hints: Capitalize constants
db/hints/manager: Hide declarations
db/hints/manager: Move the definitions of static members to the header
db/hints: Move make_dummy() to the header
db/hints: Don't explicitly define ~directory_initializer()
db/hints: Change the order of logging in ensure_created_and_verified()
db/hints: Coroutinize ensure_rebalanced()
db/hints: Coroutinize ensure_created_and_verified()
db/hints: Improve formatting of directory_initializer::impl
db/hints: Do not rely on the values of enums
db/hints: Move the implementation of directory_initializer
db/hints: Prefer nested namespaces
db/hints: Remove an unused alias from manager.hh
db/hints: Reorder includes in manager.hh and .cc
This commit adds a note to specify
that the information on the Handling
Failures page only refers to clusters
with Raft enabled.
Also, a comment is included to remove
the note in future versions.
The sharded replication map was needed to provide a sharded parameter for
the sstable directory. Now it is obtained via the table reference, and thus
the erms thing becomes unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Drop some local references to class members and line up the arguments for
starting the distributed sstable directory. Purely a cleanup patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing constructor is pretty heavyweight for the distributed
loader to use -- it needs to pass it 4 sharded parameters, which looks
pretty bulky in the text editor. However, 5 constructor arguments are
obtained directly from the table, so the dist. loader code with a global
table pointer at hand can pass _it_ as a sharded parameter and let the
sstable directory extract what it needs.
The sad news is that sstable_directory cannot be switched to just use a
table reference. Tools code doesn't have a table at hand, but needs the
facilities sstable_directory provides.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method returns a seastar::sharded_parameter<> for the global table
that evaluates into a local table reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refactor the code to be more consistent -- we often did the same thing in multiple ways depending on the endpoint, such as how we returned errors (some endpoints would return them through exceptions, others would wrap them into `aiohttp.web.Response`s). Choose the arguably least boilerplate'y way in each case.
Then reduce the boilerplate even further.
Thanks to these refactors, modifying the framework in the future will require less work and it will be more obvious which of the possible ways to modify it should be picked (i.e. consistent with the existing code.)
Closes scylladb/scylladb#15646
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: reduce `aiohttp` boilerplate
test/pylib: always return data as JSON from endpoints
test/pylib: scylla_cluster: catch `HTTPError` in topology change endpoints
test/pylib: scylla_cluster: do sanity/precondition checks through asserts
test/pylib: scylla_cluster: return errors through exceptions
test/pylib: use JSON data to pass `expected_error` in `server_start`
test/pylib: use PUT instead of GET for mutating endpoints
test/pylib: rest_client: make `data` optional in `put_json`
test/pylib: fix some type errors
The current comments should be clearer to someone
not familiar with the module. This commit also makes
them abide by the limit of 120 characters per line.
space_watchdog is a friend of shard hint manager just to
be able to execute one of its functions. This commit changes
that by unfriending the class and exposing the function.
This commit gets rid of boilerplate in the function,
leverages a range pipe and explicit types to make
the code more readable, and changes the logs to
make it clearer what happens.
fmt::to_string should be preferred to seastar::format.
It's clearer and simpler. Besides that, this commit makes
the code abide by the limit of 120 characters per line.
Currently, the function doesn't return anything.
However, if the future doesn't need to be awaited,
the caller can decide that. There is no reason
to make that decision in the function itself.
This commit makes with_file_update_mutex() a method of hint_endpoint_manager
and introduces db::hints::manager::with_file_update_mutex_for() for accessing
it from the outside. This way, hint_endpoint_manager is hidden and no one
needs to know about its existence.
This commit makes db::hints::manager store service::storage_proxy
as a reference instead of a seastar::shared_ptr. The manager is
owned by storage proxy, so it only lives as long as storage proxy
does. Hence, it makes little sense to store the latter as a shared
pointer; in fact, it's very confusing and may be error-prone.
The field never changes, so it's safe to keep it as a reference
(especially because copy and move constructors of db::hints::manager
are both deleted). What's more, we ensure that the hint manager
has access to storage proxy as soon as it's created.
The same changes were applied to db::hints::resource_manager.
The rationale is the same.
If the variables are accessible from the outside, it makes
sense to also expose their initial values to the user.
This commit moves them to the header and marks them as inline.
Make handlers return their results directly, without wrapping them into
`aiohttp.web.Response`s. Instead the wrapping is done in a generic way
when defining the routes.
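A minimal Python sketch of the idea (handler and route names are hypothetical, not the actual `scylla_cluster.py` code): the JSON envelope is applied once, at route-definition time, so handlers just return plain values.

```python
import json

def wrap_json(handler):
    """Hypothetical sketch: one generic wrapper builds the HTTP envelope,
    so handlers return plain values instead of Response objects."""
    def route_entry(request):
        result = handler(request)
        # The single place that decides the envelope: status 200, JSON body.
        return 200, json.dumps(result)
    return route_entry

# A handler no longer builds a Response itself.
def get_host_id(request):
    return {"host_id": "127.0.0.1"}

# The wrapping happens once, when the route is defined.
routes = {("GET", "/cluster/host-id"): wrap_json(get_host_id)}
```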
Some endpoint handlers return JSON, some return text, some return empty
responses.
Reduce the number of different handler types by making the text case a
subcase of the JSON case. This also simplifies some code on the
`ManagerClient` side, which would have to deserialize data from text
(because some endpoint handlers would serialize data into text for no
particular reason). And it will allow reducing boilerplate in later
commits even further.
The new logging order seems to make more sense, i.e.
we first log that we're creating and validating directories,
and only then do we start doing that.
With the previous order, in which those actions were reversed,
the log message didn't match reality: the action was already
done by the time we informed the user of it.
These changes move away from relying on specific
values of enum variants. The code based on arithmetic
on those values is trivial, and there is no reason not to use
operator== and operator!= instead. This should make the code
less error-prone and easier to understand.
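The same principle, sketched in Python with a hypothetical enum (the real code is C++ in db/hints and the state names may differ): compare variants directly instead of doing arithmetic on their values.

```python
from enum import Enum

class State(Enum):
    # Hypothetical states; the names, not the numbers, carry meaning.
    UNINITIALIZED = 0
    CREATED = 1
    VERIFIED = 2

def needs_creation(state: State) -> bool:
    # Plain equality instead of arithmetic like `state.value < 1`,
    # which would silently break if the numbering ever changed.
    return state == State.UNINITIALIZED

def is_ready(state: State) -> bool:
    return state != State.UNINITIALIZED and state != State.CREATED
```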
Exceptions flying from `RESTClient` (which used to communicate with
Scylla's REST API) are in fact not `RuntimeException`s, they are
`HTTPError`s (a type defined in the `rest_client` module). So they would
just fly through our catch branches, and the additional info (such as
log file path) would not be attached. Fix this.
Some of them are already done that way, so turn the rest to asserts as
well for consistency and code terseness, instead of using the `if` +
`raise` pattern.
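A sketch of the pattern, with hypothetical names (not the actual `ScyllaClusterManager` code): a violated assert means a bug in the framework or the test itself, not a user-facing error, so the `if` + `raise` ceremony adds nothing.

```python
def start_server(cluster, server_id):
    """Sketch: precondition checks expressed as asserts instead of the
    `if` + `raise` pattern (names are hypothetical)."""
    assert cluster is not None, "cluster not started"
    assert server_id in cluster, f"unknown server {server_id}"
    cluster[server_id] = "running"
    return cluster[server_id]
```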
Since 2f84e820fd it is possible to return errors from
`ScyllaClusterManager` handlers through exceptions without losing the
contents of these exceptions (the contents arrive at `ManagerClient` and
the test can inspect them, unlike in the past where the client would get
a generic `InternalServerError`).
Change all handlers to return errors through exceptions (like some
already do) and get rid of the `ActionReturn` boilerplate.
When checking for `self.cluster`, do it through assertions, like most of
the handlers already do, instead of using the `if` + `raise` pattern.
`ScyllaClusterManager` registers a bunch of HTTP endpoints which
`ManagerClient` uses to perform operations on a cluster during
a topology test.
The endpoints were inconsistently using verbs, like using GET for
endpoints that would have side effects. Use PUT for these.
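A sketch of the convention, with hypothetical endpoint names: per HTTP semantics, GET must be free of side effects, so route registration can guard against the old mistake.

```python
def add_route(routes, method, path, mutating):
    """Hypothetical helper: refuse to register a GET endpoint that
    has side effects."""
    assert not (method == "GET" and mutating), \
        f"{path} mutates state: use PUT, not GET"
    routes[(method, path)] = mutating

routes = {}
add_route(routes, "GET", "/cluster/servers", False)      # read-only
add_route(routes, "PUT", "/cluster/server/start", True)  # was GET before
add_route(routes, "PUT", "/cluster/server/stop", True)
```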
This is the continuation of 3e74432dbf.
Registering API handlers for services needs to
- happen next to the corresponding service's start
- use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job)
- get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything)
Hints API handlers want to use proxy, but also reference gossiper and capture proxy via the http context. This PR fixes both and removes the http_context -> proxy dependency as no longer needed
Closes scylladb/scylladb#15644
* github.com:scylladb/scylladb:
api: Remove proxy reference from http context
api,hints: Use proxy instead of ctx
api,hints: Pass sharded<proxy>& instead of gossiper&
api,hints: Fix indentation after previous patch
api,hints: Move gossiper access to proxy
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stem from the fact that the resulting reconcilable_result will be large:
1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.
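The paging behaviour can be sketched like this (a toy Python model, not the actual C++ memory accounter):

```python
def pages(rows, memory_limit):
    """Sketch of the fixed behaviour: cut a page whenever the accounted
    memory reaches the limit, even if the page holds only dead rows.
    Before the fix, the replica kept accumulating until it found a live
    row, producing huge reconcilable_results."""
    page, used = [], 0
    for size, live in rows:
        page.append((size, live))
        used += size
        if used >= memory_limit:
            yield page
            page, used = [], 0
    if page:
        yield page

# 25 dead rows followed by one live row, 100 "bytes" each, with a
# 1000-byte page budget: tombstones are paged instead of buffered.
result = list(pages([(100, False)] * 25 + [(100, True)], 1000))
```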
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical duration (a few ms of variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111
Closes scylladb/scylladb#15414
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
* tools/jmx d107758...8d15342 (2):
> Revert "install-dependencies.sh: do not install weak dependencies"
> install-dependencies.sh: do not install weak dependencies
  Especially for Java, we really do not need the tens of packages and MBs it adds, just because Java apps can be built and use sound and graphics and whatnot.
For JSON objects represented as map<ascii, int>, don't treat ASCII keys
as a nested JSON string. We were doing that prior to the patch, which
led to parsing errors.
Included the error offset where JSON parsing failed for
rjson::parse related functions to help identify parsing errors
better.
Fixes: #7949
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closes scylladb/scylladb#15499
Consider the following code snippet:
```c++
future<> foo() {
    semaphore.consume(1024); // may throw std::bad_alloc on OOM kill
    return make_ready_future<>();
}
future<> bar() {
return _allocating_section([&] {
foo();
});
}
```
If the consumed memory triggers the OOM kill limit, the semaphore will throw `std::bad_alloc`. The allocating section will catch this, bump std reserves and retry the lambda. Bumping the reserves will not do anything to prevent the next call to `consume()` from triggering the kill limit. So this cycle will repeat until std reserves are so large that ensuring the reserve fails. At this point LSA gives up and re-throws the `std::bad_alloc`. Beyond the useless time spent on code that is doomed to fail, this also results in expensive LSA compaction and eviction of the cache (while trying to ensure reserves).
Prevent this situation by throwing a distinct exception type which is derived from `std::bad_alloc`. Allocating section will not retry on seeing this exception.
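A toy Python model of the retry loop and the new exception type (the real types are C++: `utils::memory_limit_reached` and the LSA allocating section; names below are illustrative):

```python
class MemoryLimitReached(MemoryError):
    """Analogue of utils::memory_limit_reached: still an OOM-type
    error, but retrying with bigger reserves cannot help."""

def allocating_section(fn, max_retries=3):
    # Sketch of the allocating-section retry loop (much simplified).
    for _ in range(max_retries):
        try:
            return fn()
        except MemoryLimitReached:
            raise        # semaphore kill limit hit: fail fast
        except MemoryError:
            continue     # bump reserves (elided here) and retry
    return fn()          # last attempt, let any error propagate
```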
A test reproducing the bug is also added.
Fixes: #15278
Closes scylladb/scylladb#15581
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add test_cache_reader_semaphore_oom_kill
utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled()
reader_concurrency_semaphore: use utils::memory_limit_reached exception
utils: add memory_limit_reached exception
Now hints endpoints use ctx.sp reference, but it has the direct proxy
reference at hand and should prefer it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
API handlers should try to avoid using any service other than the "main"
one. For hints API this service is going to be proxy, so no gossiper
access in the handler itself.
(indentation is left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The descriptor in question is used to parse an sstable's file path and return the result. The parser, among the "relevant" info, also parses the sstable directory and the keyspace+table names. However, there is (almost) no code that needs those strings, and the need to construct the descriptor with them makes some places obscurely use empty strings.
The PR removes the sstable's directory, keyspace and table names from the descriptor and, while at it, relaxes the sstable directory code that used to make a descriptor out of a real sstable object by (!) parsing its Data file path back.
Closes scylladb/scylladb#15617
* github.com:scylladb/scylladb:
sstables: Make descriptor from sstable without parsing
sstables: Do not keep directory, keyspace and table names on descriptor
sstables: Make tuple inside helper parser method
sstables: Do not use ks.cf pair from descriptor
sstables: Return tuple from parse_path() without ks.cf hints
sstables: Rename make_descriptor() to parse_path()
The only usage is in batchlog_manager, and it
can be replaced with cf.get_truncation_time().
std::optional<std::reference_wrapper<canonical_mutation>>
is replaced with canonical_mutation* since it is
semantically the same but with less type boilerplate.
We want to make table::_truncated_at optional, so that in
get_truncation_time we can assert that it is initialized.
For existing tables this initialisation will happen in
load_truncation_times function, and for new tables we
want to initialize it in add_column_family like we do
with mark_ready_for_writes.
Now add_column_family function has parameter 'readonly', which is
set by the callers to false if we are creating a fresh new table
and not loading it from sstables. In this commit we rename this
parameter to is_new and invert the passed values.
This will allow us in the next commit to initialize _truncated_at field
for new tables.
load_truncation_times() now works only for
schema tables since the rest is not loaded
until distributed_loader::init_non_system_keyspaces.
An attempt to call cf.set_truncation_time
for a non-system table just throws an exception,
which is caught and logged at debug level.
This means that the call cf.get_truncation_time in
paxos_state.cc has never worked as expected.
To fix that we move load_truncation_times()
closer to the point where the tables are loaded.
The function distributed_loader::populate_keyspace is
called for both system and non-system tables. Once
the tables are loaded, we use the 'truncated' table
to initialize _truncated_at field for them.
The truncation_time check for schema tables is also moved
into populate_keyspace since it seems like a more natural
place for it.
When loading an unshared remote sstable, sstable_directory needs to make
a descriptor out of a real sstable. For that it parses the sstable's
Data component path, which is pretty weird. It's simpler to make the
descriptor out of the sstable itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so just
hard-code the empty strings there (because they are really not used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves the std::make_tuple() call into internal static path
parsing helper to make the next patch smaller and nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's only one place that needs the ks.cf pair from the parsed
descriptor -- the sstables loader from tools/. This code already has
ks.cf from the tuple returned after parsing and can use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two path parsers. One of them accepts keyspace and table names
and the other one doesn't. The latter is then supposed to parse the
ks.cf pair from the path and put it on the descriptor. This patch makes
this method return ks.cf so that later it will be possible to remove
these strings from the descriptor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All `sstable_set_impl` subclasses/implementations already keep a `schema_ptr` so we can make `sstable_set_impl::make_incremental_selector` function return both the selector and the schema that's being used by it.
That way, we can use the returned schema in the `sstable_set::make_incremental_selector` function instead of the `sstable_set::_schema` field, which makes the field unused and allows us to remove it altogether and reduce the memory footprint of `sstable_set` objects.
Closes scylladb/scylladb#15570
* github.com:scylladb/scylladb:
sstable_set: Remove unused _schema field
sstable_set_impl: Return also schema from make_incremental_selector
When joining the cluster in raft topology mode, the new node asks some
existing node in the cluster to put its information to the
`system.topology` table. Later, the topology coordinator is supposed to
contact the joining node back, telling it that it was added to group 0
and accepted, or rejected. Due to the fact that the topology coordinator
might not manage to successfully contact the joining node, in order not
to get stuck it might decide to give up and move the node to left state
and forget about it (this does not always happen as of now, but will in the
future). Because of that, the joining node must use a timeout when
waiting for a response because it's not guaranteed that it will ever
receive it.
There is an additional complication: the topology coordinator might be
busy and not notice the request to join for a long time. For example, it
might be migrating tablets or joining other nodes which are in the queue
before it. Therefore, it's difficult to choose a timeout which is long
enough for every case and still not too long.
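The bounded wait can be sketched as follows (a Python/asyncio model; the real code is C++ with different primitives, and the queue name is hypothetical):

```python
import asyncio

async def wait_for_join_response(responses, timeout_s):
    """Sketch: the joining node bounds its wait, since the topology
    coordinator may have already moved it to the left state and will
    never reply. The PR raises the bound from 30 s to 3 min."""
    try:
        return await asyncio.wait_for(responses.get(), timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError("no response from topology coordinator")

async def demo():
    q = asyncio.Queue()
    await q.put("accepted")
    ok = await wait_for_join_response(q, 1.0)
    try:
        await wait_for_join_response(asyncio.Queue(), 0.01)
        timed_out = False
    except RuntimeError:
        timed_out = True
    return ok, timed_out

result = asyncio.run(demo())
```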
Such a failure was observed to happen in ARM tests in debug mode. In
order to unblock the CI the timeout is increased from 30 seconds to 3
minutes. As a proper solution, the procedure will most likely have to be
adjusted in a more significant way.
Fixes: #15600
Closes scylladb/scylladb#15618
The method really parses provided path, so the existing name is pretty
confusing. It's extra confusing in the table::get_snapshot_details()
where it's just called and the return value is simply ignored.
Naming it "parse_..." makes it clear what the method is for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Also rephrase the messages a bit so they are more uniform.
The goal of this change is to make semaphore mismatches easier to
diagnose, by including the table name and the permit name in the
printout.
While at it, add a test for semaphore mismatch, it didn't have one.
Refs: #15485
Closes scylladb/scylladb#15508
Tracing is one of the two global services left out there, with its starting and stopping being pretty hairy. In order to de-globalize it and keep its start-stop under control, the existing start-stop sequence is worth cleaning up. This PR
* removes create_ , start_ and stop_ wrappers to un-hide the global tracing_instance thing
* renames tracing::stop() to shutdown() as it's in fact shutdown
* coroutinizes start/shutdown/stop while at it
Squeezed parts from #14156 that don't reorder start-stop calls
Closes scylladb/scylladb#15611
* github.com:scylladb/scylladb:
main: Capture local tracing reference to stop tracing
tracing: Pack testing code
tracing: Remove stop_tracing() wrapper
tracing: Remove start_tracing() wrapper
tracing: Remove create_tracing() wrapper
tracing: Make shutdown() re-entrant
tracing: Coroutinize start/shutdown/stop
tracing: Rename helper's stop() to shutdown()
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.
Fix that.
Fixes #15632
Closes scylladb/scylladb#15633
The test does (among other things) the following:
1. Create a cache reader with buffer of size 1 and fill the buffer.
2. Update the cache.
3. Check that the reader produces the first mutation as seen before
the update (because the buffer fill should have snapshotted the first
mutation), and produces the other mutations as seen after the update.
However, the test is not guaranteed to stop after the update succeeds.
Even during a successful update, an allocation might have failed
(and been retried by an allocation_section), which will cause the
body of with_allocation_failures to run again. On subsequent runs
the last check (the "3." above) fails, because the first mutation
is snapshotted already with the new version.
Fix that.
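A toy model of why the body must be rerun-safe (a hypothetical driver loosely mimicking `with_allocation_failures`, not the actual test framework):

```python
def with_allocation_failures(body, allocations):
    """Sketch of the failure-injection driver: rerun `body`, failing
    the i-th allocation on the i-th run, until a run completes. The
    body therefore executes several times, and any check it makes on
    shared mutable state must hold on reruns too -- the bug fixed
    here was a check that only held on the first run."""
    for fail_at in range(allocations + 1):
        left = [fail_at]
        def alloc():
            if left[0] == 0:
                left[0] = -1
                raise MemoryError
            left[0] -= 1
        try:
            body(alloc)
        except MemoryError:
            continue
        return

runs = []
def body(alloc):
    runs.append(1)   # shared state observed by every (re)run
    alloc()
    alloc()

with_allocation_failures(body, allocations=2)
```

With two allocations, the body runs three times: once per injected failure plus the final clean run.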
Closes scylladb/scylladb#15634
When a tablet is migrated into a new home, we need to clean its storage (i.e. the compaction group) in the old home. This includes its presence in row cache, which can be shared by multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the storage of the tablet.
Then once the cache is invalidated successfully, we clear the sstable sets of the group in the "execution phase", to prevent any background op from incorrectly picking them and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order to guarantee that a failure midway won't cause data resurrection if the tablet happens to be migrated back into the old home.
Closes scylladb/scylladb#15524
* github.com:scylladb/scylladb:
replica: Clean up storage of tablet on migration
replica: Add async gate to compaction_group
replica: Coroutinize compaction_group::stop()
replica: Make compaction group flush noexcept
Define sstable_set_impl::selector_and_schema_t type as a tuple that
contains both a newly created selector and a schema that the selector
is using.
This will allow removal of _schema field from sstable_set class as
the only place it was used was make_incremental_selector.
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
When a tablet is migrated into a new home, we need to clean its
storage (i.e. the compaction group) in the old home.
This includes its presence in row cache, which can be shared by
multiple tablets living in the same shard.
For exception safety, the following is done first in a "prepare
phase" during cache invalidation.
1) take a compaction guard, to stop and disable compaction
2) flush memtable(s).
3) build a list of all sstables, which represents all the
storage of the tablet.
Then once the cache is invalidated successfully, we clear
the sstable sets of the group in the "execution phase",
to prevent any background op from incorrectly picking them
and also to allow for their deletion.
All the sstables of a tablet are deleted atomically, in order
to guarantee that a failure midway won't cause data resurrection
if the tablet happens to be migrated back into the old home.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is redundant code that should have been gone a long time ago.
The snippet (which lies above the code being deleted):
```c++
db.invoke_on_all([] (replica::database& db) {
    db.get_tables_metadata().for_each_table([] (table_id, lw_shared_ptr<replica::table> table) {
        replica::table& t = *table;
        t.enable_auto_compaction();
    });
}).get();
```
provides the same thing as the code being deleted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15597
The following new commands are implemented:
* disablebackup
* disablebinary
* disablegossip
* enablebackup
* enablebinary
* enablegossip
* gettraceprobability
* help
* settraceprobability
* statusbackup
* statusbinary
* statusgossip
* version
All are associated with tests. All tests (both old and new) pass with both the scylla-native and the cassandra nodetool implementation.
Refs: https://github.com/scylladb/scylladb/issues/15588
Closes scylladb/scylladb#15593
* github.com:scylladb/scylladb:
tools/scylla-nodetool: implement help operation
tools/scylla-nodetool: implement the traceprobability commands
tools/scylla-nodetool: implement the gossip commands
tools/scylla-nodetool: implement the binary commands
tools/scylla-nodetool: implement backup related commands
tools/scylla-nodetool: implement version command
test/nodetool: introduce utils.check_nodetool_fails_with()
test/nodetool: return stdout of nodetool invocation
test/nodetool/rest_api_mock.py: fix request param matching
tools/scylla-nodetool: compact: remove --partition argument
tools/scylla-nodetool: scylla_rest_client: add support delete method
tools/scylla-nodetool: get rid of check_json_type()
tools/scylla-nodetool: log more details for failed requests
tools/scylla-*: use operation_option for positional options
tools/utils: add support for operation aliases
This commit adds new pages with reference to
Handling Node Failures to Troubleshooting.
The pages are:
- Failure to Add, Remove, or Replace a Node
(in the Cluster section)
- Failure to Update the Schema
(in the Data Modeling section)
This commit moves the content of the Handling
Failures section on the Raft page to the new
Handling Node Failures page in the Troubleshooting
section.
Background:
When Raft was experimental, the Handling Failures
section was only applicable to clusters
where Raft was explicitly enabled.
Now that Raft is the default, the information
about handling failures is relevant to
all users.
Checking that nodetool fails with a given message turned out to be a
common pattern, so extract the logic for checking this into a method of
its own. Refactor the existing tests to use it, instead of the
hand-coded equivalent.
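A simplified sketch of such a helper (the real `check_nodetool_fails_with()` invokes the nodetool binary; the names below are illustrative):

```python
def check_fails_with(invoke, args, expected_error):
    """Sketch of the extracted helper: run the invocation, assert it
    fails and that the error message contains the expected text, in
    one place instead of a hand-coded try/except in every test."""
    try:
        invoke(*args)
    except Exception as e:
        assert expected_error in str(e), f"unexpected error: {e}"
        return
    raise AssertionError(f"{args}: command unexpectedly succeeded")
```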
We don't need map_reduce here since get_truncated_positions returns
the same result on all shards.
We remove 'finally' semantics in this commit since it doesn't seem we
really need it. There is no code that relies on the state of this
data structure in case of exception. An exception will propagate
to scylla_main() and the program will just exit.
This is a refactoring commit without observable
changes in behaviour.
There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
This PR updates the information on the ScyllaDB vs. Cassandra compatibility. It covers the information from https://github.com/scylladb/scylladb/issues/15563, but there could be more to fix.
@tzach @scylladb/scylla-maint Please review this PR and the page covering our compatibility with Cassandra and let me know if you see anything else that needs to be fixed.
I've added the updates with separate commits in case you want to backport some info (e.g. about AzureSnitch).
Fixes https://github.com/scylladb/scylladb/issues/15563
Closes scylladb/scylladb#15582
* github.com:scylladb/scylladb:
doc: deprecate Thrift in Cassandra compatibility
doc: remove row/key cache from Cassandra compatibility
doc: add AzureSnitch to Cassandra compatibility
The sentence says that if table args are provided, compaction will run on
all tables. This is ambiguous, so the sentence is rephrased to specify
that compaction will only run on the provided tables.
Closes scylladb/scylladb#15394
There's a finally-chain stopping tracing out there, now it can just use
the deferred stop call and that's it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's confusing, as it doesn't stop tracing, but rather shuts it down
on all shards. The only caller of it can be more descriptive without the
wrapper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today's shutdown() and its stop() peer are very restrictive in the way
callers should use them. There's not much point in that; making shutdown()
re-entrant, as for other services, will let us relax the callers' code here
and in the next patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It may happen that wrapping up a multipart upload fails too. However,
before sending the request the driver clears the _upload_id field, thus
marking the whole process as "all is OK". So in case the finalization
method fails and throws, the upload context remains on the server side
forever.
Fix this by keeping the _upload_id set, so even if finalization throws,
closing the uploader notices this and calls abort.
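The invariant of the fix, as a toy Python model (the real driver is C++ and talks S3; all names here are hypothetical):

```python
class Uploader:
    """Sketch of the fix: _upload_id is only cleared after finalization
    actually succeeds, so close() can still abort a half-finished
    multipart upload and free the server-side context."""
    def __init__(self, send):
        self._send = send
        self._upload_id = "upload-1"
        self.aborted = False

    def finalize(self):
        self._send("complete", self._upload_id)
        self._upload_id = None       # clear only on success

    def close(self):
        if self._upload_id is not None:
            self.aborted = True      # would issue an abort request
            self._upload_id = None
```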
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15521
This check is redundant. Originally it was intended to work around
rapidjson using an assert by default to check that the fields have the
expected type. But it turns out we already configure rapidjson to use a
plain exception in utils/rjson.hh, so check_json_type() is not needed
for graceful error handling.
Use operation_option to describe positional options. The structure used
before -- app_template::positional_option -- was not a good fit for
this, as it was designed to store a description that is immediately
passed to the boost::program_options subsystem and then discarded.
As such, it had a raw pointer member, which was expected to be
immediately wrapped by boost::shared_ptr<> by boost::program_options.
This produced memory leaks for tools, for options that ended up not
being used. To avoid this altogether, use operation_option, converting
to the app_template::positional_option at the last moment.
The test needs to call flush-keyspace API endpoint and currently it does it by hand. Not very convenient.
Also, in the future there will be a need for _background_ API kicking; the currently used requests package cannot do that, while the pylib REST API can
Closes scylladb/scylladb#15565
* github.com:scylladb/scylladb:
test/object_store: Use REST client from pylib
test/pylib: Add flush_keyspace() method to rest client
test/object_store: Wrap yielded managed cluster
Registering API handlers for services needs to
- happen next to the corresponding service's start
- use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job)
- get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything)
The storage proxy handlers don't follow any of those rules; this PR fixes them
Closes scylladb/scylladb#15584
* github.com:scylladb/scylladb:
api: Make storage_proxy handlers use proxy argument
api: Change some static helpers to use proxy instead of ctx
api: Pass sharded<storage_proxy> reference to storage_proxy handlers
api: Start (and stop) storage_proxy API earlier
api: Remove storage_service argument from storage_proxy setup
api: Move storage_proxy/ endpoint using storage_service
api: Remove storage_proxy.hh from storage_service.cc
main: Initialize API server early
Use NullCompactionStrategy for the test_table fixture
rather than using the `no_autocompaction_context`.
Besides being simpler, regular compaction just gets in the way
of all tests that use `SELECT MUTATION_FRAGMENTS`.
The latter would be problematic when we start running cql-pytest
test cases in parallel rather than in serial, since it
would inadvertently affect other test cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15574
And stop using proxy reference from http context. After a while the
proxy dependency will be removed from http context
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some helpers in storage_proxy.cc that get proxy reference from
passed http context argument. Next patch will stop using ctx for that
purpose, so prepare in advance by making the helpers use proxy reference
argument directly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to make the handlers use the proxy argument instead of
keeping proxy as a dependency on the http context (other handlers are
mostly like that already)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The code setting up storage_proxy/ endpoints no longer needs
storage_service and related decoration
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_proxy/get_schema_version endpoint is served by storage_service, so it
should be in storage_service.cc instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Surprisingly, the dependency-less API server context is initialized
somewhere in the middle of main. By that time some "real" services have
already started and should have the ability to register their endpoints,
so the API context should be initialized way ahead. This patch places its
initialization next to the prometheus init.
One thing that's not nice here is that API port listening remains where
it was before the patch, so for the external ... observer, API
initialization doesn't change. Likely the API should start listening for
connections earlier as well, but that's left for future patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR is the second step in refactoring the Hinted Handoff module. It cleans up the contents of the file `hint_storage.cc`. The biggest change is the transition from continuations to coroutines.
Refs #15358
Closes scylladb/scylladb#15496
* github.com:scylladb/scylladb:
db/hints: Alias segment list in hint_storage.cc
db/hints: Rename rebalance to rebalance_hints
db/hints: Clean up rebalance() in hint_storage.cc
db/hints: Coroutinize hint_storage.cc
db/hints: Clean up remove_irrelevant_shards_directories() in hint_storage.cc
db/hints: Clean up rebalance_segments() in hint_storage.cc
db/hints: Clean up rebalance_segments_for() in hint_storage.cc
db/hints: Clean up get_current_hints_segments() in hint_storage.cc
db/hints: Rename scan_for_hints_dirs to scan_shard_hint_directories
db/hints: Clean up scan_for_hints_dirs() in hint_storage.cc
db/hints: Wrap hint_storage.cc in an anonymous namespace
Add a REST API to reload Raft topology state without having to restart a node and use it in `test_fence_hints`. Restarting the node has undesired side effects which cause test flakiness; more details provided in commit messages.
Refactor the test a bit while at it.
Fixes: #15285
Closes scylladb/scylladb#15523
* github.com:scylladb/scylladb:
test: test_fencing.py: enable hints_manager=trace logs in `test_fence_hints`
test: test_fencing.py: reload topology through REST API in `test_fence_hints`
test: refactor test_fencing.py
api: storage_service: add REST API to reload topology state
Allow disabling auto-compaction for given table(s)
using either the ks.table syntax or ks:table (as the
API suggests).
The first syntax would likely be more common since
the test tables we automatically create are named
test_keyspace.test_table, so we can pass that name
to `no_autocompaction_context` as is.
test_tools.system_scylla_local_sstable_prepared was
modified to disable auto-compaction only on
the `system.scylla_local` table rather than
the whole `system` keyspace, since it only relies
on this table. Plus, it helps test this change :)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15575
Enable TRACE level logging on the server that's supposed to send the
hints. Should make it easier to debug failures in the future, if any
happen again.
Restarting a node in order to reload topology may have side effects that
lead to test flakiness. While the node is shutting down, it gives up
leadership. Before it finishes shutting down, another node may become
Raft group 0 leader, then topology coordinator, then send a topology
command, triggering topology state reload on the shutting down node,
causing its topology version to get updated, allowing it to send a
successful hint before it shuts down and restarts. After it restarts, no
more hints will be sent, so the metrics condition we're waiting for (for
a hint to be sent) will never become true (metrics are not persisted
between restarts).
Instead of restarting, reload topology state through the new REST API.
This also makes the test a bit faster.
Fixes #15285
- use `manager.get_cql()` to silence mypy (`manager.cql` is `Optional`)
- extract `metrics.lines_by_prefix('scylla_hints_manager_')` to a helper
function
- when waiting for conditions on metrics, split the condition into
safety and liveness part, and fail early if the safety part does not
hold
- in `exactly_one_hint` send, don't check that `send_errors_metric` is
`0` (it won't be after the next commit)
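The safety/liveness split described above can be sketched as a small helper; the function name and signature below are illustrative, not the actual test code:

```python
import time

def wait_for_metric(get_value, target, safety_check, timeout=10.0, poll=0.1):
    """Wait until get_value() reaches target (liveness), failing early
    if safety_check(value) is ever violated (safety)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = get_value()
        # Safety part: fail immediately instead of timing out much later.
        assert safety_check(value), f"safety violated: {value!r}"
        # Liveness part: the condition we are actually waiting for.
        if value >= target:
            return value
        time.sleep(poll)
    raise TimeoutError("liveness condition not reached in time")
```

Failing fast on the safety part turns a slow, hard-to-diagnose timeout into an immediate assertion with the offending metric value.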
Some tests may want to modify system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.
The API can also be called outside tests; it should not have any
observable effects unless the user modifies the `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (https://github.com/scylladb/scylladb/pull/14722, https://github.com/scylladb/scylladb/pull/14232), and the last piece necessary to replace cluster feature checks in gossip.
The current implementation relies on the gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform a feature check which itself depends on information in group 0. We have a dependency cycle.
In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:
- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
- If the node is accepted, the coordinator transitions it to the `bootstrap`/`replace` state and transitions the topology to the `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that the node was accepted.
- Otherwise, the node is moved to `left` state, told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected and it shuts down.
The procedure is not retryable - if a node fails to do it from start to end and crashes in between, it will not be allowed to retry it with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add it again (so that a new host_id is generated).
More details about the procedure and the RPC are described in `topology-over-raft.md`.
Fixes: #15152
Closes scylladb/scylladb#15196
* github.com:scylladb/scylladb:
tests: mark test_blocked_bootstrap as skipped
storage_service: do not check features in shadow round
storage_service: remove raft_{boostrap,replace}
topology_coordinator: relax the check in enable_features
raft_group0: insert replaced node info before server setup
storage_service: use join node rpc to join the cluster
topology_coordinator: handle joining nodes
topology_state_machine: add join_group0 state
storage_service: add join node RPC handlers
raft: expose current_leader in raft::server
storage_service: extract wait_for_live_nodes_timeout constant
raft_group0: abstract out node joining handshake
storage_service: pass raft_topology_change_enabled on rpc init
rpc: add new join handshake verbs
docs: document the new join procedure
topology_state_machine: add supported_features to replica_state
storage_service: check destination host ID in raft verbs
group_state_machine: take reference to raft address map
raft_group0: expose joined_group0
This commit adds AzureSnitch (together with a link
to the AzureSnitch description) to the Cassandra
compatibility page.
In addition, the Snitches table is fixed.
Test cases kick Scylla by hand to force a keyspace flush (to get the
objects onto the object store). Equip the wrapped cluster object with the
REST API class instance for convenience.
The assertion for the 200 return status code is dropped; the REST client
does it behind the scenes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Test cases use a temporary cluster object which is, in fact, a cql cluster.
In the future there will be a need to perform more actions on it
than just querying it with the cql client, so wrap the cluster with
an extendable object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check that the cache reader reacts correctly to the semaphore's OOM kill
attempt, letting the read fail instead of going berserk, trying to
reserve more and more memory until the reservation cannot be satisfied.
replica::table has the same gate for gating async operations, and
even synchronizes stop of the table with in-flight writes that will
apply into memory.
The compaction group gains the same gate, which will be used when
operations are confined to a single group. The table's gate is kept
for table-wide operations like query, truncate, etc.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit makes the function less compact and abides by the limit
of 120 characters per line; that makes the code more readable.
We start using fmt::to_string instead of seastar::format("{:d}")
to convert integers to strings -- the new way is the preferred one.
The changes also name variables in a more descriptive way.
This commit makes the function less compact and abides by the limit
of 120 characters per line. That makes the code more readable.
It also stops unnecessarily calling c_str() on seastar::sstring.
There is no need to call c_str() on the name of the directory entry.
In fact, the overload of std::stoi() used takes an std::string as its
argument. Providing seastar::sstring instead of const char* is more
efficient because we can allocate just the right amount of memory
and std::memcpy into it, i.e. call std::string(const char*, std::size_t).
Using the overload std::string(const char*) would need to first
traverse the string to find the null byte.
This is a small change, all the more because paths don't tend to
be long, but it's some gain nonetheless.
The commit also inserts a few empty lines to make the code less
compact and improve readability as a result.
An anonymous namespace is a safer mechanism than the static
keyword. When adding a new piece of code, it's easy to
forget to add the static. In that case, that code
gets external linkage. However, when code that is put
in an anonymous namespace is referenced where it should
not be, the linker will immediately detect it (in most
cases), and the programmer will be able to spot and fix
their mistake right away.
Said method catches bad-allocs and retries the passed-in function after
raising the reserves. This does nothing to help the function succeed if
the bad alloc was thrown by the semaphore because the kill limit was
reached. In this case the read should be left to fail and terminate.
Now that the semaphore throws utils::memory_limit_reached in this
case, we can distinguish it and just re-throw the exception.
A distinct exception derived from std::bad_alloc, used in cases when
memory didn't really run out, but the process or task reached the memory
limit allotted to it. Using a distinct type for this case allows LSA
to react to it correctly.
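A Python analogue of the idea (the actual code is C++, with utils::memory_limit_reached deriving from std::bad_alloc; all names below are illustrative):

```python
# MemoryLimitReached derives from the generic MemoryError the same way
# utils::memory_limit_reached derives from std::bad_alloc: callers that
# retry on allocation failure can recognize it and fail fast instead.
class MemoryLimitReached(MemoryError):
    """The task hit its allotted memory limit; memory didn't really run out."""

def run_with_reserves(fn, retries=3):
    """Retry fn on an ordinary allocation failure, but not when the limit
    was deliberately enforced -- retrying cannot help in that case."""
    for _ in range(retries):
        try:
            return fn()
        except MemoryLimitReached:
            raise                 # kill limit reached: let the read fail
        except MemoryError:
            continue              # raise reserves / evict, then retry
    raise MemoryError("out of retries")
```

The derived type keeps existing `except MemoryError` handlers working while letting the retry loop single out the enforced-limit case.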
With the new procedure to join nodes, testing the scenario in
`test_blocked_bootstrap` becomes very tricky. To recap, the test does
the following:
- Starts a 3-node cluster,
- Shuts down node 1,
- Tries to replace node 1 with node 4, but an error injection is
triggered which causes node 4 to fail after it joins group 0. Note
that before the JOIN_NODE handshake, this would only result in node 4
being added to the group 0 config; no modification to the group 0 state
itself was done - the joining node is supposed to write a request
to join.
- Tries to replace node 1 again with node 5, which should succeed.
The bug that this regression test was supposed to check for was that
node 5 would try to resolve all IPs of nodes added to the group 0 config.
Because node 4 shuts down before advertising itself in gossip, node 5
would get stuck.
The new procedure to join group 0 complicates the situation because a
request to join is written first to group 0 and only then the topology
coordinator modifies the group 0 config. It is possible to add an error
injection to the topology coordinator code so that it doesn't change the
group 0 state and proceeds with bootstrapping the node, but it will only
get stuck trying to add the node. If node 5 tries to join in the
meantime, the topology coordinator may switch to it and try to bootstrap
it instead, but this is basically a 50% chance because it depends on the
order of node 4 and node 5's host IDs in the topology_state_machine
struct.
It should be possible to fix the test with error recovery, but until
then it is marked as skipped.
The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.
Currently, `enable_features` requires that there is no topology operation
in progress and there are no nodes waiting to be joined. Now that the new
handshake is implemented, we can drop the second condition because nodes
in the `none` state are not part of group 0 yet.
Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating features to enable.
Currently, information about replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about replaced node needs to be inserted
before the handshake happens.
This commit moves the insertion of the replaced node's data before the
call to `join_group0`.
Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_new_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.
This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.
This PR replaces a link to a section of the ScyllaDB website with little information about ScyllaDB vs. Cassandra with a link to
a documentation section where Cassandra compatibility is covered in detail.
In addition, it removes outdated or irrelevant information about versions from the Cassandra compatibility page.
Now that the documentation is versioned, we shouldn't add such information to the content.
Fixes https://github.com/scylladb/scylla-enterprise/issues/3454
Closes scylladb/scylladb#15562
* github.com:scylladb/scylladb:
doc: remove outdated/irrelevant version info
doc: replace the link to Cassandra compatibility
This commit removes outdated or irrelevant
information about versions from the Cassandra
compatibility page.
Now that the documentation is versioned, we
shouldn't add such information to the content.
`if (.. EQUAL ..)` is used to compare numbers, so if the LHS is not
a number the condition is evaluated as false. This prevents us from
setting -march when building for aarch64 targets. And because the
crc32 implementation in utils/ always uses the crypto extension
intrinsics, this also breaks the build like:
```
In file included from /home/fedora/scylla/utils/gz/crc_combine.cc:40:
/home/fedora/scylla/utils/clmul.hh:60:12: error: always_inline function 'vmull_p64' requires target feature 'aes', but would be inlined into functi
on 'clmul_u32' that is compiled without support for 'aes'
return vmull_p64(p1, p2);
^
```
So, in this change:
* compare the two strings using `STREQUAL`.
* document the reason why we need to set -march to the
specified argument.
see also http://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#g_t-march-and--mcpu-Feature-Modifiers
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15553
this series is one of the steps to remove global statements in `configure.py`.
Not only is the script more structured this way; this also allows us to quickly identify the parts which should/can be reused when migrating to the CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15552
* github.com:scylladb/scylladb:
build: pass `args` explicitly
build: remove `distro_extra_ldflags`
build: remove `distro_extra_cflags`
build: remove `distro_extra_cmake_args`
build: pass variables explicitly
build: do not mutate args.user_cflags
build: do not mutate args.user_ldflags
build: use os.makedirs(exist_ok=True)
Some "Additional Information" section headings
appear in the page tree in the left sidebar
because of their incorrect underlines.
This commit fixes the problem by replacing the title
underline with a section underline.
Closes scylladb/scylladb#15550
This commit replaces a link to a section of
the ScyllaDB website with little information
about ScyllaDB vs. Cassandra with a link to
a documentation section where Cassandra
compatibility is covered in detail.
The storage service API set/unset has two flaws.
First, unset doesn't happen, so after the storage service is stopped its handlers become "local is not initialized"-assertion and use-after-free landmines.
Second, setting up the storage service API carries gossiper and system keyspace references, thus duplicating the knowledge about storage service dependencies.
This PR fixes both by adding the storage service API unsetting and by making the handlers use _only_ the storage service instance, not any externally provided references.
Closes scylladb/scylladb#15547
* github.com:scylladb/scylladb:
main, api: Set/Unset storage_service API in proper place
api/storage_service: Remove gossiper arg from API
api/storage_service: Remove system keyspace arg from API
api/storage_service: Get gossiper from storage service
api/storage_service: Get token_metadata from storage service
Instead of relying on updating `globals()` with `args`, pass the
arguments explicitly; this helps us understand the data dependencies
in this script better.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
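A minimal sketch of the refactoring direction described above, with hypothetical function names and flags (not the actual configure.py code):

```python
import argparse

# Before (sketch): module-level state hid which functions read which args:
#   globals().update(vars(args))
# After: every function takes the parsed args (or the specific values) it
# needs, making the data dependencies explicit and greppable.
def compiler_flags(args: argparse.Namespace) -> list[str]:
    # The function's inputs are now visible in its signature.
    flags = ["-std=c++20"]
    flags.append("-O0" if args.debug else "-O2")
    return flags

parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")

assert compiler_flags(parser.parse_args(["--debug"])) == ["-std=c++20", "-O0"]
assert compiler_flags(parser.parse_args([])) == ["-std=c++20", "-O2"]
```

With explicit parameters, moving a helper into a CMake-driven script later only requires passing the same values, not recreating hidden globals.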
Instead of using `globals()`, pass the used variables explicitly; this
helps us understand the data dependencies in this script better.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Mutating the member variables in `args` after it is returned from
`arg_parser.parse_args()` is confusing. Let's use a new variable
for tracking the updated `user_cflags`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Mutating the member variables in `args` after it is returned from
`arg_parser.parse_args()` is confusing. Let's use a new variable
for tracking the updated `user_ldflags`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Instead of checking for the existence of the directory, use the
`exist_ok` parameter, which was introduced back in Python 3.2
and is already used elsewhere in this script.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
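For illustration, the two styles side by side (paths are hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "build", "release")

    # Before: check-then-create, which is more verbose and racy -- the
    # directory can appear between the check and the makedirs() call.
    if not os.path.exists(target):
        os.makedirs(target)

    # After: one idempotent call; no error if the directory already exists.
    os.makedirs(target, exist_ok=True)

    assert os.path.isdir(target)
```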
When presented with queries that use the same named bind variables twice, like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```
Scylla generated empty `partition_key_bind_indexes` (`pk_indexes`).
`pk_indexes` tell the driver which bind variables it should use to calculate the partition token for a query. Without it, the driver is unable to determine the token and it will send the query to a random node.
Scylla should generate pk_indexes which tell the driver that it can use bind variable with `bind_index = 0` to calculate the partition token for this query.
The problem was that `_target_columns` kept only a single target_column for each bind variable.
In the example above `:x` is compared with both `p` and `c`, but `_target_columns` would contain only one of them, and Scylla wasn't able to tell that this bind variable is compared with a partition key column.
To fix it, let's replace `_target_columns` with `_targets`. `_targets` keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.
A `cql-pytest` reproducer is added.
I also added some comments in `prepare_context.hh/cc` to make it easier to read.
Fixes: https://github.com/scylladb/scylladb/issues/15374
Closes scylladb/scylladb#15526
* github.com:scylladb/scylladb:
cql-pytest/test-prepare: remove xfail marker from *pk_indexes_duplicate_named_variables
cql3/prepare_context: fix generating pk_indexes for duplicate named bind variables
cql3: improve readability of prepare_context
cql-pytest: test generation of pk indexes during PREPARE
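The gist of the fix can be modeled in a few lines of Python; the data structures below are deliberately simplified stand-ins for the real `_target_columns`/`_targets` members, not Scylla's actual code:

```python
# For: SELECT p FROM t WHERE p = :x AND c = :x
partition_key = ["p"]
comparisons = [("p", ":x"), ("c", ":x")]   # (column, bind variable) pairs
bind_order = [":x"]                        # bind variables in order

# Old behaviour: one target column per variable, so the later comparison
# with `c` overwrites the partition-key comparison with `p`.
single_target = {var: {col} for col, var in comparisons}   # {":x": {"c"}}

# New behaviour: remember every column a variable is compared with.
all_targets = {}
for col, var in comparisons:
    all_targets.setdefault(var, set()).add(col)            # {":x": {"p", "c"}}

def pk_indexes(targets):
    """Bind indexes the driver can use to compute the partition token."""
    return [i for i, var in enumerate(bind_order)
            if any(col in partition_key for col in targets.get(var, ()))]

assert pk_indexes(single_target) == []   # token cannot be computed
assert pk_indexes(all_targets) == [0]    # bind index 0 covers column `p`
```

Keeping every comparison means no partition-key binding can be forgotten, which is exactly why the driver can now route the query to the right node.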
There's a dedicated forward_service::shutdown() method that's defer-scheduled in main for very early invocation. That's not nice; the fwd service start-shutdown-stop sequence can be made "canonical" by moving the shutting-down code into an abort source subscription. A similar thing was done for the view updates generator in 3b95f4f107.
refs: #2737
refs: #4384
Closes scylladb/scylladb#15545
* github.com:scylladb/scylladb:
forward_service: Remove .shutdown() method
forward_service: Set _shutdown in abort-source subscription
forward_service: Add abort_source to constructor
We compare the symbol list of the stripped ELF file ($orig.stripped) with
that of the one including debugging symbols ($orig.debug) to get
an ELF file which includes only the necessary bits as the debuginfo
($orig.minidebug).
But we generate the symbol list of the stripped ELF file using the
sysv format, while generating the one from the unstripped file using the
posix format. The former always pads the symbol names with spaces
so that their length is at least that of the section name after
we split the fields with "|".
That's why the diff includes stuff we don't expect, and hence
we have tons of warnings like:
```
objcopy: build/node_exporter/node_exporter.keep_symbols:4910: Ignoring rubbish found on this line
```
when using objcopy to filter the ELF file to keep only the
symbols we are interested in.
So, in this change:
* use the same format when dumping the symbols from the unstripped ELF
file.
* include the symbols in the text area -- the code -- by checking for
"T" and "t" in the dumped symbols. This was achieved by matching
the lines with "FUNC" before this change.
* include the symbols in the .init data section -- the global
variables which are initialized at compile time. They could
also be interesting when debugging an application.
Fixes #15513
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15514
When doing a SELECT CAST(b AS int), Cassandra returns a column named
cast(b as int). Currently, Scylla uses a different name -
system.castasint(b). For Cassandra compatibility, we should switch to
the same name.
Fixes #14508
Closes scylladb/scylladb#14800
Before integration with the task manager, the state of a one-shard repair
was kept in repair_info. The repair_info object was destroyed immediately
after the shard repair finished.
During the integration, repair_info's fields were moved to
shard_repair_task_impl as the two served similar purposes.
However, shard_repair_task_impl isn't destroyed immediately, but is
kept in the task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which delays the repair state change.
Release shard_repair_task_impl resources immediately after the shard
repair is finished.
Fixes: #15505.
Closes scylladb/scylladb#15506
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.
If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.
If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
Like in the non-raft topology path, during the new handshake, the
joining node will wait until all normal nodes are alive. The timeout
used during the wait is extracted to a constant so that it will be
reused in the handshake code, to be introduced in later commits.
Currently, the raft_group0 uses GROUP0_MODIFY_CONFIG RPC to ask an
existing group 0 member to add this node to the group, in case the
joining node was not a discovery leader. The new handshake verbs
(JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a
preparation, this commit abstracts away the handshake process.
We will want to conditionally register some verbs based on whether we
are using raft topology or not. This commit serves as a preparation,
passing the `raft_topology_change_enabled` to the function which
initializes the verbs (although there is _raft_topology_change_enabled
field already, it's only initialized on shard 0 later).
The `join_node_request` and `join_node_response` RPCs are added:
- `join_node_request` is sent from the joining node to any node in the
cluster. It contains some initial parameters that will be verified by
the receiving node, or the topology coordinator - notably, it contains
a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
joining node to tell it about the outcome of the verification.
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.
In order not to duplicate logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features and users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features is quite awkward now.
This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we now can
refer to node's supported features in a more natural way.
The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In the latter case, this triggers on_internal_error.
This commit adds protection to the existing verbs in storage_service:
they check whether the group 0 is running and whether the received
host_id matches the actual recipient's host_id.
None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
Before this commit the primary key was hashed for bloom filter check
for each sstable.
This commit makes the key be hashed once per sstable set and reused
for bloom filter lookups in all sstables in the set.
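An illustrative model of the optimization (the hash function and bloom filter below are stand-ins, not Scylla's actual implementation):

```python
import hashlib

# Each sstable's bloom filter is probed with a hash of the partition key.
# Hashing once per sstable set and reusing the digest avoids re-hashing
# the key for every sstable in the set.
def key_hash(key: bytes) -> int:
    # Placeholder hash; the real code uses its own key hashing scheme.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "little")

class BloomFilter:
    def __init__(self, bits=1024):
        self.bits = bits
        self.slots = set()
    def add_hash(self, h):
        self.slots.add(h % self.bits)
    def may_contain_hash(self, h):
        return (h % self.bits) in self.slots

sstable_filters = [BloomFilter() for _ in range(10)]
sstable_filters[3].add_hash(key_hash(b"pk-42"))

# Before: the key is hashed inside every lookup (10 hash computations).
hits_slow = [f.may_contain_hash(key_hash(b"pk-42")) for f in sstable_filters]

# After: hash computed once per sstable set, reused for every filter.
h = key_hash(b"pk-42")
hits_fast = [f.may_contain_hash(h) for f in sstable_filters]

assert hits_slow == hits_fast
assert hits_fast[3] is True
```

The results are identical; only the number of hash computations per read drops from one per sstable to one per sstable set.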
I tested this change using perf_simple_query with the following modifications:
1. Create more than one sstable to have an sstable set with more than one element
2. Try to prevent compactions (I wasn't 100% successful)
3. Use a key that's not present to avoid reading from disk
```
diff --git a/test/perf/perf_simple_query.cc b/test/perf/perf_simple_query.cc
index 26dbf1e99..6bd460df2 100644
--- a/test/perf/perf_simple_query.cc
+++ b/test/perf/perf_simple_query.cc
@@ -105,6 +105,8 @@ std::ostream& operator<<(std::ostream& os, const test_config& cfg) {
static void create_partitions(cql_test_env& env, test_config& cfg) {
std::cout << "Creating " << cfg.partitions << " partitions..." << std::endl;
+ // Create 10 sstables each with all the data
+ for (unsigned count = 0; count < 10; ++count) {
for (unsigned sequence = 0; sequence < cfg.partitions; ++sequence) {
if (cfg.counters) {
execute_counter_update_for_key(env, make_key(sequence));
@@ -117,6 +119,7 @@ static void create_partitions(cql_test_env& env, test_config& cfg) {
std::cout << "Flushing partitions..." << std::endl;
env.db().invoke_on_all(&replica::database::flush_all_memtables).get();
}
+ }
}
static int64_t make_random_seq(test_config& cfg) {
@@ -137,8 +140,18 @@ static std::vector<perf_result> test_read(cql_test_env& env, test_config& cfg) {
query += " using timeout " + cfg.timeout;
}
auto id = env.prepare(query).get0();
- return time_parallel([&env, &cfg, id] {
- bytes key = make_random_key(cfg);
+ // Always use the same key that is not present
+ // to make sure we don't read from disk and make
+ // the benchmark CPU bounded.
+ int64_t key_value = 6;
+ bytes key(bytes::initialized_later(), 5*sizeof(key_value));
+ auto i = key.begin();
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ return time_parallel([&env, id, key] {
return env.execute_prepared(id, {{cql3::raw_value::make_value(std::move(key))}}).discard_result();
}, cfg.concurrency, cfg.duration_in_seconds, cfg.operations_per_shard, cfg.stop_on_error);
}
@@ -423,6 +436,10 @@ static std::vector<perf_result> do_cql_test(cql_test_env& env, test_config& cfg)
.with_column("C2", bytes_type)
.with_column("C3", bytes_type)
.with_column("C4", bytes_type)
+ // Try to prevent compaction
+ // to keep the number of sstables high
+ .set_compaction_enabled(false)
+ .set_min_compaction_threshold(2000000000)
.build();
}).get();
@@ -539,6 +556,11 @@ int scylla_simple_query_main(int argc, char** argv) {
const auto enable_cache = app.configuration()["enable-cache"].as<bool>();
std::cout << "enable-cache=" << enable_cache << '\n';
db_cfg->enable_cache(enable_cache);
+ // Try to prevent compaction
+ // to keep the number of sstables high
+ db_cfg->concurrent_compactors(1);
+ db_cfg->compaction_enforce_min_threshold(true);
+ db_cfg->compaction_throughput_mb_per_sec(1);
cql_test_config cfg(db_cfg);
return do_with_cql_env_thread([&app] (auto&& env) {
```
The following command showed a 2-3% improvement on my machine, but this
depends on the length of the key and the number of sstables in the set.
```
./build/release/scylla perf-simple-query --bypass-cache --flush -c 1
--random-seed=2068087418 --enable-cache false
```
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
Closes scylladb/scylladb#15538
SSTable runs work hard to keep the disjointness invariant, and are
therefore expensive to build from scratch.
On every insertion, a run keeps its elements sorted by their first key in
order to reject insertions that would introduce overlapping.
Additionally, an sstable run can grow to dozens (or hundreds) of elements.
Therefore, we can also make interaction with compaction strategies more
efficient by not copying runs when building the list of candidates in the
compaction manager, and less fragile by filtering out any sstable runs
that are not completely eligible for compaction.
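The sorted-insertion-with-rejection behavior described above can be sketched as follows. This is a hypothetical, simplified run over integer key ranges, not Scylla's actual sstable_run; the [[nodiscard]] on insert mirrors the tagging of sstable_run::insert() done in this series.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// A fragment of a run, covering an inclusive key range.
struct fragment { int first_key; int last_key; };

class run {
    // sorted by first key; std::map gives O(log n) neighbor lookup
    std::map<int, int> _fragments;  // first_key -> last_key
public:
    // Rejects (returns false) any fragment that would overlap an existing
    // one, preserving the disjointness invariant.
    [[nodiscard]] bool insert(fragment f) {
        auto next = _fragments.lower_bound(f.first_key);
        // overlap with the successor fragment?
        if (next != _fragments.end() && next->first <= f.last_key) {
            return false;
        }
        // overlap with the predecessor fragment?
        if (next != _fragments.begin() && std::prev(next)->second >= f.first_key) {
            return false;
        }
        _fragments.emplace(f.first_key, f.last_key);
        return true;
    }
    std::size_t size() const { return _fragments.size(); }
};
```

Because the fragments are kept sorted, both the overlap check and the insertion are logarithmic, which is what makes rebuilding a run from scratch (a full re-sort) comparatively expensive.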
Previously, ICS had to give up on using runs managed by sstable set due to
fragility of the interface (meaning runs are being built from scratch
on every call to the strategy, which is very inefficient, but that had to
be done for correctness), but now we can restore that.
Closes scylladb/scylladb#15440
* github.com:scylladb/scylladb:
compaction: Switch to strategy_control::candidates() for regular compaction
tests: Prepare sstable_compaction_test for change in compaction_strategy interface
compaction: Allow strategy to retrieve candidates either as sstables or runs
compaction: Make get_candidates() work with frozen_sstable_run too
sstables: add sstable_run::run_identifier()
sstables: tag sstable_run::insert() with nodiscard
sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs
sstables: Simplify sstable_set interface to retrieve runs
Most of the time, only the roots of the task tree should be non-internal.
Change the default implementation of is_internal and delete overrides
consistent with it.
Closes scylladb/scylladb#15353
Distributed loader code still "knows" that table's datadir is a filesystem directory with some structure. For S3-backed sstables this still works because for S3 keyspaces scylla still creates and maintains empty directories in datadir. This set fixes the dist. loader assumptions about that and moves them into sstable directory's lister.
refs: #13020
Closes scylladb/scylladb#15542
* github.com:scylladb/scylladb:
sstable_directory: Indentation fix after previous patch
sstable_directory: Simplify filesystem prepare()
distributed_loader: Remove get_path() method
distributed_loader: Move directory touching to sstable_directory
distributed_loader: Move directory existance checks to sstable_directory
sstable_directory: Move prepare() core to lister
for better readability, and do not create a CMAKE_BUILD_TYPE CACHE
entry if it is already set using `-D`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15543
Currently the storage-service API handlers are set up in a "random" place.
It can happen earlier -- as soon as the storage service itself is ready.
Also, even though the storage service is stopped on shutdown, API handlers
continue to reference it, leading to potential use-after-frees or "local is
not initialized" assertions.
Fix both. Unsetting is pretty bulky; scylladb/seastar#1620 is to help.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some handlers in set_storage_service() have an implicit dependency on
the gossiper. It's not the API that should track it, but the storage
service itself, so get the gossiper from the service, not from the
external argument (it will be removed soon)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API handlers that live in set_storage_service() should be
self-contained and operate on the storage service only. That said, they
should get the token metadata, when needed, from the storage service,
not from somewhere else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There can be 2 waiters now (coordinator and CDC generation publisher),
so signal() is not enough.
The change made in c416c9ff33 missed
updating this site.
Closes scylladb/scylladb#15527
When preparing a `field_selection`, we need to prepare the UDT value,
and then verify that it has this field.
`field_selection_test_assignment` prepares the UDT value using the same
receiver as the whole `field_selection`. This is wrong, this receiver
has the type of the field, and not the UDT.
It's impossible to create a receiver for the UDT. Many different UDTs
can produce an `int` value when the field `a` is selected.
Therefore the receiver should be `nullptr`.
No unit test is added, as this bug doesn't currently cause any issues.
Preparing a column value doesn't do any type checks, so nothing fails.
Still it's good to fix it, just to be correct.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes scylladb/scylladb#14788
Currently the bit is set in the .shutdown() method which is called early on
stop. After the patch the bit is set in the abort-source subscription
callback which is also called early on stop.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now everything is prepared for the switch, let's do it.
Now let's wait for ICS to enjoy the set of changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for upcoming changes that will allow ICS to efficiently
retrieve sstable runs.
Next patch will remove candidates from compaction_strategy's interface
to retrieve candidates using this one instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstable_run may reject insertion of an sstable if it's going
to break the disjointness invariant of the run, but it's important
that the caller is aware of that, so it can act on it, e.g. by
generating a new run id for the sstable so it can be inserted
into another run. The nodiscard tag is important to avoid silent
problems in this area.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of all_sstable_runs() don't want to mutate the runs, but rather
work with their content. So let's avoid copy and make the intention
explicit with the new frozen_sstable_run used as return type
for the interface.
This will guarantee that ICS will be able to fetch uncompacting
runs efficiently.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface selects all runs that store at least one of the
sstables in the vector.
But that's very fragile, to the point that even ICS had to
stop using it. A better interface is to return all runs
managed by the set and allow compaction manager to do its
filtering.
We want to use it in ICS to avoid the overhead of rebuilding
sstable runs which may be expensive as sorting is performed
to guarantee the disjoint invariant.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to CMake based building system.
Refs https://github.com/scylladb/scylladb/issues/15379
Closes scylladb/scylladb#15515
* github.com:scylladb/scylladb:
build: extract get_os_ids() out
build: extract find_ninja() out
build: extract thrift_uses_boost_share_ptr() out
Currently, the tools loosely follow the following convention on
error-codes:
* return 1 if the error is with any of the command-line arguments
* return 2 on other errors
This patch changes the returned error-code on unknown operation/command
to 100 (instead of the previous 1). The intent is to allow any wrapper
script to determine that the tool failed because the operation is
unrecognized and not because of something else. In particular this
should enable us to write a wrapper script for scylla-nodetool, which
dispatches commands still un-implemented in scylla-nodetool, to the java
nodetool.
Note that the tool will still print an error message on an unknown
operation. So such wrapper script would have to make sure to not let
this bleed-through when it decides to forward the operation.
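The error-code convention above can be illustrated with a minimal sketch. The function and set of known operations below are hypothetical, not the actual tool code; the point is only the distinct 1 / 2 / 100 return values a wrapper script can branch on.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative sketch of the convention described above:
//   1   - error with the command-line arguments
//   2   - other errors (returned by the operation itself, not modeled here)
//   100 - unknown operation, so a wrapper can forward it elsewhere
//         (e.g. to the java nodetool)
int dispatch(const std::string& op, const std::set<std::string>& known_ops) {
    if (op.empty()) {
        return 1;   // argument error
    }
    if (known_ops.count(op) == 0) {
        return 100; // unrecognized operation: let the wrapper fall back
    }
    return 0;       // operation recognized
}
```

A wrapper would run the tool, and only on exit code 100 re-dispatch the same command line to the fallback implementation, suppressing the first tool's error message.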
Closes scylladb/scylladb#15517
When the FS lister gets prepared it
- checks if the directory exists
- creates it if it doesn't, or bails out if it's the quarantine one
- goes on to check the directory's owner and mode
The last step is excessive if the directory didn't exist on entry and
was created.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of the previous patch -- when populating a table,
creating directories should be (optionally) performed by the lister
backend, not by the generic loader.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The loader code still "knows" that tables' sstables live in directories
on datadir filesystem, but that's not always so. So whether or not the
directory with sstables exists should be checked by sstable directory's
component lister, not the loader.
After this change a potentially missing quarantine directory will be
processed by the sstable directory with an empty result, but that's OK:
empty directories should already be handled correctly, so whether the
directory lister produces no sstables because it found no files, or
because it just skipped scanning, makes no difference.
Current sstable_directory::prepare() code checks the sstable directory
existence, which only makes sense for filesystem-backed sstables.
S3-backed don't (well -- won't) have any directories in datadir, so the
check should be moved into component lister.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Issue #15374 has been fixed, so these tests can be enabled.
Duplicate bind variable names are now handled correctly.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When presented with queries that use the same named bind variables twice,
like this one:
```cql
SELECT p FROM table WHERE p = :x AND c = :x
```
Scylla generated empty partition_key_bind_indexes (pk_indexes).
pk_indexes tell the driver which bind variables it should use to calculate the partition
token for a query. Without it, the driver is unable to determine the token and it will
send the query to a random node.
Scylla should generate pk_indexes which tell the driver that it can use bind variable
with bind_index = 0 to calculate the partition token for a query.
The problem was that _target_columns keeps only a single target column for each bind variable.
In the example above :x is compared with both p and c, but _target_columns would contain
only one of them, so Scylla wasn't able to tell that this bind variable is compared with
a partition key column.
To fix it, let's replace _target_columns with _targets. _targets keeps all comparisons
between bind variables and other expressions, so none of them will be forgotten/overwritten.
Fixes: #15374
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
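The fix described above can be sketched as follows. The types and function are illustrative, not Scylla's actual code: a _targets-like list keeps every (bind index, column) comparison, so a variable compared with several columns is never forgotten, and pk_indexes is derived by walking the partition-key columns in schema order.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One comparison between a bind variable and a column.
using target = std::pair<int, std::string>;  // bind index, column name

// Returns, for each partition-key column (in schema order), the index of a
// bind variable compared with it; empty if some pk column has no variable.
std::vector<int> pk_indexes(const std::vector<target>& targets,
                            const std::vector<std::string>& pk_columns) {
    std::vector<int> result;
    for (const auto& pk : pk_columns) {
        for (const auto& [idx, col] : targets) {
            if (col == pk) {
                result.push_back(idx);
                break;
            }
        }
    }
    // only a complete index list lets the driver compute the token
    return result.size() == pk_columns.size() ? result : std::vector<int>{};
}
```

For `SELECT p FROM table WHERE p = :x AND c = :x`, keeping both comparisons for `:x` means the partition key `p` is found, yielding pk_indexes [0]; with a single-entry _target_columns, only one of the two comparisons would survive, and the result could come out empty.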
This commits adds a few comments and changes a few variable names
so that it's easier to figure out what the code does.
When I first started looking at this part of the code it wasn't
obvious what's going on - what are _specs, how are they different
from _target_columns? What happens when a variable doesn't have a name?
I hope that this change will make it easier to understand for future readers.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add some tests that test whether `pk indexes` are generated correctly.
When a driver asks to prepare a statement, Scylla's response includes
the metadata for this prepared statement.
In this metadata there's `pk indexes`, which tells the driver which
bind variable values it should use to calculate the partition token.
For a query like:
SELECT * FROM t WHERE p2 = ? AND p1 = ? AND p3 = ?
The correct pk_indexes would be [1, 0, 2], which means
"To calculate the token calculate Hash(bind_vars[1] | bind_vars[0] | bind_vars[2])".
More information is available in the specification:
1959502d8b/doc/native_protocol_v4.spec (L699-L707)
Two tests are marked as xfail because of #15374 - Scylla doesn't correctly handle using the same
named variable in multiple places. This will be fixed soon.
I couldn't find a good place for these tests, so I created a new file - test_prepare.py.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently the datadir is ignored.
Use it to construct the table's base path.
Fixes scylladb/scylladb#15418
Closes scylladb/scylladb#15480
* github.com:scylladb/scylladb:
distributed_loader: populate_keyspace: access cf by ref
distributed_loader: table_populator: use datadir for base_path
distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed
distributed_loader: populate_keyspace: fixup indentation
distributed_loader: populate_keyspace: iterate over datadirs in the inner loop
test: sstable_directory_test: add test_multiple_data_dirs
table: init_storage: create upload and staging subdirs on all datadirs
Issue #10357 is about a SELECT with a filter on a regular column which
incorrectly returns a static row without regular columns set (so the
filter would not have matched). We already have four tests reproducing
this issue, but each of them is a small part of a large tests translated
from Cassandra, making it hard to understand the scope of this bug.
So in this patch we add two new tests, one passing and one xfailing,
which clarify the scope of this bug. It turns out that the bug only
occurs when a partition has no clustering rows and only has a static
row. If the partition does have clustering rows - even if those don't
match the filter - the bug doesn't happen. The xfailing test is just
two statements long - a single INSERT and a single SELECT.
Refs #10357.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#15120
Scylla can crash due to a complicated interaction between service level drop,
evictable readers, and the inactive read registration path.
1) A service level drop invokes a stop of the reader concurrency semaphore,
which will wait for in-flight requests.
2) It turns out the semaphore first closes the gate used for closing readers
that become inactive.
3) It then proceeds to wait for in-flight reads by closing the reader permit gate.
4) One of the evictable reads takes the inactive read registration path, and
finds the gate for closing readers already closed.
5) The flat mutation reader is destroyed, but finds the underlying reader was
not closed gracefully and triggers the abort.
By closing the permit gate first, evictable readers becoming inactive will
be able to properly close the underlying reader, thereby avoiding the
crash.
Fixes #15534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#15535
in this series, we rename s3 credential-related variable and option names so they are more consistent with AWS's official documentation. this should help with maintainability.
Closes scylladb/scylladb#15529
* github.com:scylladb/scylladb:
main.cc: rename aws option
utils/s3/creds: rename aws_config member variables
So far generic describe (`DESC <name>`) followed the Cassandra implementation and only described a keyspace/table/view/index.
This commit adds UDT/UDF/UDA to generic describe.
Fixes: #14170
Closes scylladb/scylladb#14334
* github.com:scylladb/scylladb:
docs:cql: add information about generic describe
cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc
cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe
- s/aws_key/aws_access_key_id/
- s/aws_secret/aws_secret_access_key/
- s/aws_token/aws_session_token/
rename them to more popular names, these names are also used by
boto's API. this should improve the readability and consistency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is no need to hold on to the table's
shared ptr since it's held by the global table ptr
we got in the outer loop.
Simplify the code by just getting the local table reference
from `gtable`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the datadir is ignored.
Use it to construct the table's base path.
Fixes scylladb/scylladb#15418
Note that scylla still doesn't work correctly
with multiple data directories due to #15510.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, mark_ready_for_writes is called too early --
after the first data dir is processed -- so the next
datadir will hit an assert in `table::mark_ready_for_writes`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to iterate over multiple data directories
in the inner loop rather than the outer loop.
Following patch will make use of the datadir in
table_populator.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a basic regression test that starts the cql test env
with multiple data directories.
It fails without the previous patch:
table: init_storage: create upload and staging subdirs on all datadirs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Also, while at it, add copyright/license blurbs for tests that were missing it.
Closes scylladb/scylladb#15495
* github.com:scylladb/scylladb:
test/topology_custom: add copyright/license blurb to tests
test/topology_custom: test_select_from_mutation_fragments.py: use async query api
The current read-loop fails to detect end-of-page: if the query
result builder cuts the page, the loop will just proceed to the next
partition. This results in distorted query results, as the result
builder will request that consumption stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
A unit test reproducing the bug was also added.
Currently if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application just freezing.
Fix this by aborting the parsing thread explicitly in the
json_mutation_stream_parser destructor. If the parser thread exited
successfully, this is a no-op, but on the error path it ensures
that the parser thread doesn't hang.
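The abort-in-destructor pattern described above can be sketched with plain std::thread (the real code is seastar-based; all names here are illustrative): a producer thread blocks when its bounded output queue is full, and the owner's destructor sets an abort flag and wakes it so it can exit instead of hanging once the consumer goes away.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class parser {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::queue<int> _out;
    bool _abort = false;
    static constexpr std::size_t max_queued = 4;
    std::thread _thread;  // declared last so other members init first

    void run() {
        for (int i = 0; i < 1000; ++i) {
            std::unique_lock lk(_mtx);
            // block while the queue is full, unless asked to abort
            _cv.wait(lk, [&] { return _abort || _out.size() < max_queued; });
            if (_abort) {
                return;  // consumer is gone: bail out instead of hanging
            }
            _out.push(i);
            _cv.notify_all();
        }
    }
public:
    parser() : _thread([this] { run(); }) {}
    ~parser() {
        {
            std::lock_guard lk(_mtx);
            _abort = true;  // no-op if the thread already finished
        }
        _cv.notify_all();
        _thread.join();
    }
    std::optional<int> next() {
        std::unique_lock lk(_mtx);
        _cv.wait(lk, [&] { return !_out.empty(); });
        int v = _out.front();
        _out.pop();
        _cv.notify_all();
        return v;
    }
};
```

Without the abort flag, destroying a parser whose consumer stopped early would deadlock in join(), which is the freeze the commit describes.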
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
The partition key size was ignored by the accounter, as was the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory, causing the page
size limiter not to kick in.
Fix by accounting the real size of accumulated frozen_mutation.
Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.
Refs #7933
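The accounting fix above can be sketched as follows. This is a hypothetical, simplified accounter (not the actual memory accounter in the codebase): once the key and tombstone sizes are charged too, a long run of tombstone-only partitions still fills the page and triggers a page cut.

```cpp
#include <cassert>
#include <cstddef>

class memory_accounter {
    std::size_t _used = 0;
    std::size_t _limit;
public:
    explicit memory_accounter(std::size_t limit) : _limit(limit) {}
    void account_partition(std::size_t key_size,
                           std::size_t tombstone_size,
                           std::size_t live_rows_size) {
        // the fix: charge the key and the partition tombstone as well,
        // instead of only the live rows
        _used += key_size + tombstone_size + live_rows_size;
    }
    bool should_cut_page() const { return _used >= _limit; }
};
```

With the old behavior (only live rows counted), the same sequence of tombstone-only partitions would never reach the limit and the page would grow unboundedly.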
Currently, mutation query on replica side will not respond with a result
which doesn't have at least one live row. This causes problems if there
is a lot of dead rows or partitions before we reach a live row, which
stems from the fact that resulting reconcilable_result will be large:
* Large allocations. Serialization of reconcilable_result causes large
allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
side and on the coordinator side causes reactor stalls. This impacts
not only the query at hand. For 1M dead rows, freezing takes 130ms,
unfreezing takes 500ms. Coordinator does multiple freezes and
unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
may fail due to too large mutation size. 1M dead rows is already too
much: Refs #9111.
This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.
My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):
Node1: 1 live row, 1M dead rows
Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of
the query.
Before:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
After:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
Non-reconciling queries have almost identical duration (variations of a few ms
can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before).
Refs #7929
Refs #3672
Refs #7933
Fixes #9111
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
Fixes: #14710.
Closes scylladb/scylladb#15091
* github.com:scylladb/scylladb:
cql3: statements: delete execute override
cql3: statements: call check_restricted_table_properties in prepare
cql3: statements: pass data_dictionary::database to check_restricted_table_properties
more structured this way. and the data dependency is more clear
with this change. this also allows us to quickly identify the parts
which should/can be reused when migrating to the CMake based building
system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to CMake based
building system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix fromJson(null) to return null, not a error as it did before this patch.
We use "null" as the default value when unwrapping optionals
to avoid bad optional access errors.
Fixes: scylladb#7912
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closes scylladb/scylladb#15481
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory) and file descriptor count,
which takes memory away from the OS.
It turns out that this approach also greatly simplifies the
off-strategy impl in the compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, or unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Closes scylladb/scylladb#15400
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
Make sure that all writes started by the old coordinator are completed or
will eventually fail before starting a new coordinator.
Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5MB. When the necessary amount of bytes is accumulated, a part-uploading fiber starts in the background. On flush the sink waits for all the fibers to complete and handles any failures.
Uploading parallelism is nowadays limited by means of the http client's max-connections parameter. However, while a part-uploading fiber waits for its connection it keeps the 5MB+ buffer in the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.
This PR adds a shard-wide limiter on the number of background buffers S3 clients (and their http clients) may use.
Closes scylladb/scylladb#15497
* github.com:scylladb/scylladb:
s3::client: Track memory in client uploads
code: Configure s3 clients' memory usage
s3::client: Construct client with shared semaphore
sstables::storage_manager: Introduce config
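The shard-wide limiter described above can be sketched as a semaphore over memory units (this is an illustrative stand-alone class, not seastar's actual semaphore API): each part upload must acquire units for its buffer before starting, so the total memory held by waiting parts is bounded too.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class memory_limiter {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::size_t _available;
public:
    explicit memory_limiter(std::size_t total_bytes) : _available(total_bytes) {}
    // blocks until enough units are free (what an uploading fiber would await)
    void acquire(std::size_t bytes) {
        std::unique_lock lk(_mtx);
        _cv.wait(lk, [&] { return _available >= bytes; });
        _available -= bytes;
    }
    // non-blocking variant, convenient for demonstration
    bool try_acquire(std::size_t bytes) {
        std::lock_guard lk(_mtx);
        if (_available < bytes) {
            return false;
        }
        _available -= bytes;
        return true;
    }
    void release(std::size_t bytes) {
        {
            std::lock_guard lk(_mtx);
            _available += bytes;
        }
        _cv.notify_all();
    }
};
```

Sizing the limiter per shard caps the number of 5MB+ buffers held simultaneously, whether their parts are uploading or merely waiting for an http connection.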
This new exception type inherits from std::bad_alloc and allows logalloc
code to add additional information about why the allocation failed. We
currently have 3 different throw sites for std::bad_alloc in logalloc.cc
and when investigating a coredump produced by --abort-on-lsa-bad-alloc,
it is impossible to determine which throw-site activated last,
triggering the abort.
This patch fixes that by disambiguating the throw-sites and including
the site in the error message printed right before the abort.
Refs: #15373
Closes scylladb/scylladb#15503
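The idea above can be sketched as follows. The class name and message format are illustrative, not logalloc's actual code: an exception type derived from std::bad_alloc that carries the throw-site's reason, so the abort path can print a disambiguating message while existing bad_alloc handlers keep working.

```cpp
#include <cassert>
#include <new>
#include <string>

// A bad_alloc that records why (and therefore where) the allocation failed.
class lsa_bad_alloc : public std::bad_alloc {
    std::string _what;
public:
    explicit lsa_bad_alloc(const std::string& reason)
        : _what("bad_alloc (LSA): " + reason) {}
    const char* what() const noexcept override { return _what.c_str(); }
};
```

Each of the distinct throw-sites would pass its own reason string, so a coredump's last printed message identifies which site fired.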
pending_replacement list is used by incremental compaction to
communicate to other ongoing compactions about exhausted sstables
that must be replaced in the sstable set they keep for tombstone
GC purposes.
Reshape doesn't enable tombstone GC, so that list will not
be cleared, which prevents incremental compaction from releasing
sstables referenced by that list. It wasn't a problem until now,
when we want reshape to do incremental compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers from a 100% space overhead, as it adopted
a sort of all-or-nothing approach, meaning all input sstables,
living in the maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory) and file descriptor count,
which takes memory away from the OS.
It turns out that this approach also greatly simplifies the
off-strategy impl in the compaction manager, as it no longer has
to maintain new unused sstables and mark them for
deletion on failure, or unlink intermediary sstables
used between reshape rounds.
Fixes #14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's done by inheriting from regular_compaction, which implements
incremental compaction. But reshape still implements its own
methods for creating the writer and reader. One reason is that
reshape is not driven by the controller, as its input sstables
live in the maintenance set. Another reason is customization
of things like sstable origin, etc.
stop_sstable_writer() is extended because that's used by
regular_compaction to check for possibility of removing
exhausted sstables earlier whenever an output sstable
is sealed.
Also, incremental compaction will be unconditionally
enabled for ICS/LCS during off-strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's in preparation to next change that will make reshape
inherit from regular compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The v.u.g. start/stop is now spread heavily over the main() code.
1. sharded<v.u.g.>.start() happens early enough to allow depending services to register staging sstables on it
2. after the system is "more-or-less" alive, invoke_on_all(v.u.g.::start()) is called (conditionally) to activate the generator's background fiber. Not 100% sure why it happens _that_ late, but somehow it's required that generation doesn't happen while scylla is joining the cluster
3. early on stop the v.u.g. is fully stopped
The 3rd step is pretty nasty. It may happen that the v.u.g. is not stopped if scylla's start aborts before the last action is defer-scheduled. Also, when it happens, it leaves stopping dependencies with non-initialized v.u.g. local instances, which is not symmetrical to how they start.
That said, this PR fixes the stopping sequence to happen later, i.e. being defer-scheduled right after sharded<v.u.g.> is started. It also makes sure that terminating the background fiber happens as early as it does now. This is done compaction_manager-style -- the v.u.g. subscribes to the stop-signal abort source and kicks the fiber to stop when it fires.
Closes scylladb/scylladb#15466
* github.com:scylladb/scylladb:
view_update_generator: Stop for real later
view_update_generator: Add logging to do_abort()
view_update_generator: Move abort kicking to do_abort()
view_update_generator: Add early abort subscription
Currently repair shutdown only happens on stop, but it looks like
nodetool drain can call shutdown too, to abort no-longer-relevant repair
tasks if any. This also makes main()'s deferred shutdown/stop paths
a little bit cleaner
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15438
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.
Pass data_dictionary::database to check_restricted_table_properties
as an argument instead of query_processor, as the method will be called
from a context which does not have access to the query processor.
Now the v.u.g.::stop() code waits for the generator background fiber to
stop for real. This can happen much later; all the necessary precautions
not to produce more work for the generator have been taken in do_abort().
This keeps the v.u.g. start-stop in one place, except for the call to
invoke_on_all(v.u.g.::start()) which will be handled separately.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the v.u.g. stops, it first aborts the generation background fiber by
requesting abort on the internal abort source and signalling the fiber
in case it's waiting. Right now v.u.g.::stop() is defer-scheduled last
in main(), so this move doesn't change much -- when stop_signal fires,
it will kick v.u.g.::do_abort() just a bit earlier; nothing that depends
on it happens between that point and the real ::stop() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
when accessing AWS resources, users are allowed to use long-term security
credentials; they can also use temporary credentials. but if the latter
are used, we have to pass a session token along with the keys.
see also https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
so, if we want to get authenticated programmatically, we need to
set the "x-amz-security-token" header,
see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#UsingTemporarySecurityCredentials
so, in this change, we
1. add another member named `token` in `s3::endpoint_config::aws_config`
for storing "AWS_SESSION_TOKEN".
2. populate the setting from "object_storage.yaml" and
"$AWS_SESSION_TOKEN" environment variable.
3. set "x-amz-security-token" header if
`s3::endpoint_config::aws_config::token` is not empty.
this should allow us to test the s3 client and the s3 object store backend
with an S3 bucket, using temporary credentials.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
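The header logic described in points 1-3 above can be sketched as follows. The struct and function names are hypothetical, not the actual s3 client code: "x-amz-security-token" is set only when a session token is configured, as temporary credentials require.

```cpp
#include <cassert>
#include <map>
#include <string>

// Mirrors the renamed aws_config members from this series.
struct aws_config {
    std::string aws_access_key_id;
    std::string aws_secret_access_key;
    std::string aws_session_token;  // empty for long-term credentials
};

std::map<std::string, std::string> auth_headers(const aws_config& cfg) {
    std::map<std::string, std::string> headers;
    // (real requests also carry an Authorization header derived from the
    // access key and secret; elided in this sketch)
    if (!cfg.aws_session_token.empty()) {
        headers["x-amz-security-token"] = cfg.aws_session_token;
    }
    return headers;
}
```

With long-term credentials the token is empty and the header is simply omitted, so the same code path serves both credential kinds.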
Closes scylladb/scylladb#15486
in this series, we use the default values of options specifying the paths to tools for better readability. and also to ease the migration to CMake
Refs #15379
Closes scylladb/scylladb#15500
* github.com:scylladb/scylladb:
build: do not check for args.ragel_exec
build: set default value of --with-antlr3 option
When a piece is uploaded it's first flushed, then the upload-copy is issued.
Both happen in the background, and if the piece flush call resolves with an
exception, the exception remains unhandled. That's OK, since the upload
finalization code checks whether some pieces didn't complete (for whatever
reason) and fails the whole upload; however, the ignored exception is
reported in the logs. Not nice.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#15491
more structured this way. this also allows us to quickly identify the part which should/can be reused when migrating to a CMake-based build system.
Refs #15379
Closes scylladb/scylladb#15501
* github.com:scylladb/scylladb:
build: extract check_for_lz4() out
build: extract check_for_boost() out
build: extract check_for_minimal_compiler_version() out
build: extract write_build_file() out
Sstables in transitional states are marked with the respective 'status' in the registry. Currently there are two such states -- 'creating' and 'removing' -- plus the 'sealed' status for sstables in use.
On boot the distributed loader tries to garbage collect the dangling sstables. For filesystem storage it's done with the help of temporary sstables' dirs and pending deletion logs. For s3-backed sstables, the garbage collection means fetching all non-sealed entries and removing the corresponding objects from the storage.
Test included (last patch)
Fixes #13024
Closes scylladb/scylladb#15318
* github.com:scylladb/scylladb:
test: Extend object_store test to validate GC works
sstable_directory: Garbage collect S3 sstables on reboot
sstable_directory: Pass storage to garbage_collect()
sstable_directory: Create storage instance too
So it would be waited on in shutdown().
Although gossiper::run holds the `_callback_running` semaphore
which is acquired in `do_stop_gossiping`, the gossip messages
it initiates in the background are never waited on.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15493
as `rapidxml::xml_document` is quite large, let's
allocate it on the heap. otherwise GCC 13.2.1 warns us like:
```
utils/s3/client.cc: In function ‘seastar::sstring s3::parse_multipart_copy_upload_etag(seastar::sstring&)’:
utils/s3/client.cc:455:9: warning: stack usage is 66208 bytes [-Wstack-usage=]
455 | sstring parse_multipart_copy_upload_etag(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
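The fix can be sketched with a stand-in type (`big_document` is illustrative and plays the role of rapidxml::xml_document, which embeds a sizable memory pool by value): allocating it with make_unique moves those bytes off the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for rapidxml::xml_document with its large inline memory pool.
struct big_document {
    char pool[64 * 1024]; // ~64 KiB held by value, like rapidxml's pool
};

std::size_t parse_on_heap() {
    auto doc = std::make_unique<big_document>(); // heap, not stack
    return sizeof(*doc);
}
```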
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15472
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to a
CMake-based build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to a
CMake-based build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. this also allows us to quickly identify
the part which should/can be reused when migrating to a
CMake-based build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
more structured this way. also, this will allow us to switch over
to the CMake build system.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
args.ragel_exec defaults to "ragel" already, so unless the user
specifies an empty ragel via `--with-ragel=""`, we won't have an
`args.ragel_exec` which evaluates to `False`, so drop this check.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we don't need to check if this option is specified.
this option will also be used even after switching to CMake.
Refs #15379
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When uploading an object part, the client spawns a background fiber that
keeps the data buffers alive in the http request's write_body() lambda
capture. This generates unbounded memory usage for uploaded buffers,
which is not nice. Even though the s3 client is limited by the http
client's max-connections parallelism, waiting for an available connection
still happens with the buffers held in memory.
This patch makes the client claim the background memory from the
provided semaphore (which, in turn, sits on the shard-wide storage
manager instance). Once body writing is complete, the claimed units are
returned to the semaphore, allowing more background writes.
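A minimal sketch of the idea (this is not Seastar's semaphore API; the names are illustrative): claim units sized to the buffers before a background write starts, and return them once the body has been written, capping total in-flight memory.

```cpp
#include <cassert>
#include <cstddef>

// Toy counting semaphore tracking available "memory units".
class memory_semaphore {
    std::size_t _available;
public:
    explicit memory_semaphore(std::size_t limit) : _available(limit) {}
    // Claim units before starting a background write.
    bool try_claim(std::size_t units) {
        if (units > _available) {
            return false; // caller must wait for units to be returned
        }
        _available -= units;
        return true;
    }
    // Return units once the body has been written.
    void release(std::size_t units) { _available += units; }
    std::size_t available() const { return _available; }
};
```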
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This sets the real limits on the memory semaphore.
- scylla sets it to 1% of total memory, 10 MB min, 100 MB max
- tests set it to 16Mb
- perf test sets it to all available memory
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore will be used to cap memory consumption by the client. This
patch only makes sure a reference to the semaphore is passed as an
argument to the client's constructor, nothing more.
In the scylla binary, the semaphore sits on storage_manager. In tests the
semaphore is some local object. For now the semaphore is unused and is
initialized locked, as this patch just pushes the needed argument all the
way through; the next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just an empty config that's fed to storage_manager when constructed, as a
preparation for further, heavier patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
cql.execute_async() can now execute paged queries, use it instead of a
blocking API.
While at it, clean-up the test:
* remove unneeded wait on ring0 settle
* address flake8 concerns:
- unused imports
- unused variables
- style
Currently, the moved-from object's manager pointer is moved into the
constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.
Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.
The move-assignment operator has a similar problem,
therefore it is deleted in this series, and the function is
renamed to `transfer_backlog`, which just doesn't deal with the
moved-from registration. This is safe since it's only used internally
by the compaction manager.
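The approach can be sketched as follows (illustrative names, not the actual compaction_backlog_tracker definition): delete the move operations of a type whose registration stores a back-pointer to it, so a registered instance can never be relocated out from under its registry.

```cpp
#include <type_traits>

// A tracker whose registry holds a pointer to it; moving it would leave
// the registration dangling, so moves are simply forbidden.
struct backlog_tracker {
    backlog_tracker() = default;
    backlog_tracker(backlog_tracker&&) = delete;
    backlog_tracker& operator=(backlog_tracker&&) = delete;
};

static_assert(!std::is_move_constructible_v<backlog_tracker>);
static_assert(!std::is_move_assignable_v<backlog_tracker>);
```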
Fixes #15248
Closes scylladb/scylladb#15445
* github.com:scylladb/scylladb:
compaction_state: store backlog_track in std::optional
compaction_backlog_tracker: do not allow moving registered trackers
as yaml-cpp returns an invalid node when the node to be indexed
does not exist at all. but it allows us to provide a fallback value
which is returned when the node is not valid. so, let's just use
this helper for accessing a node which does not necessarily exist.
simpler this way
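The idiom being adopted can be modeled with a plain map (yaml-cpp itself is not pulled in here; `get_or` is an illustrative stand-in for indexing a node and asking for a value with a fallback): a missing key yields the fallback instead of an error.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for node[key].as<T>(fallback): return the stored value if the
// key exists, the fallback otherwise.
std::string get_or(const std::map<std::string, std::string>& cfg,
                   const std::string& key,
                   const std::string& fallback) {
    auto it = cfg.find(key);
    return it == cfg.end() ? fallback : it->second;
}
```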
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15488
This PR adds the information that counters do not support data expiration with TTL, plus the link to the TTL page.
Fixes https://github.com/scylladb/scylladb/issues/15479
Closes scylladb/scylladb#15489
* github.com:scylladb/scylladb:
doc: improve TTL limitation info on Counters page
doc: add a note that counters do not support TTL
Currently, the endpoint address is set as the new
endpoint_state RPC_ADDRESS. This is wrong since
it should be assigned the `broadcast_rpc_address`
rather than the `broadcast_address`.
This was introduced in b82c77ed9c
Instead just create an empty endpoint_state.
The RPC_ADDRESS (as well as HOST_ID) application states
are set later.
Fixes scylladb/scylladb#15458
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15475
* seastar 576ee47d...bab1625c (13):
> build: s/{dpdk_libs}/${dpdk_libs}/
> build: build with dpdk v23.07
> scripts: Fix escaping of regexes in addr2line
> linux-aio: print more specific error when setup_aio fails
> linux-aio: correct the error message raised when io_setup() fails
> build: reenable -Warray-bound compiling option
> build: error out if find_program() fails
> build: enable systemtap only if it is available
> build: check if libucontext is necessary for using ucontext functions
> smp: reference correct variable when fetch_or()
> build: use target_compile_definitions() for adding -D...
> http/client: pass tls_options to tls::connect()
> Merge 'build, process: avoid using stdout or stderr as C++ identifiers' from Kefu Chai
Frozen toolchain regenerated for new Seastar dependencies.
configure.py adjusted for new Seastar arch names.
Closes scylladb/scylladb#15476
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).
If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).
When performing a schema change in Raft RECOVERY mode we also extend schema mutations which forces nodes to revert to the old way of calculating schema versions when necessary.
We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.
Fixes: #7620
Fixes: #13957
Closes scylladb/scylladb#15331
* github.com:scylladb/scylladb:
test: add test for group 0 schema versioning
test/pylib: log_browsing: fix type hint
feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
schema_tables: use schema version from group 0 if present
migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
migration_manager: migration_request handler: assume `canonical_mutation` support
system_keyspace: make `get/set_scylla_local_param` public
feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
schema_tables: refactor `scylla_tables(schema_features)`
migration_manager: add `std::move` to avoid a copy
schema_tables: remove default value for `reload` in `merge_schema`
schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
system_keyspace: fix outdated comment
Compaction task executors serve two different purposes: as a compaction
manager entity they execute the compaction operation, and as a task
manager entity they track compaction status.
When one role depends on the other, as it currently is for
compaction_task_impl::done() and compaction_task_executor::compaction_done(),
requirements of both roles need to be satisfied at the same time in each
corner case. Such complexity leads to bugs.
To prevent it, compaction_task_impl::done() of executors no longer depends
on compaction_task_executor::compaction_done().
Fixes: #14912
Closes scylladb/scylladb#15140
* github.com:scylladb/scylladb:
compaction: warn about compaction_done()
compaction: do not run stopped compaction
compaction: modify lowest compaction tasks' run method
compaction: pass do_throw_if_stopping to compaction_task_executor
This series marks multiple high-cardinality counters with the skip_when_empty flag.
After this patch the following counters will not be reported if they were never used:
```
scylla_transport_cql_errors_total
scylla_storage_proxy_coordinator_reads_local_node
scylla_storage_proxy_coordinator_completed_reads_local_node
scylla_transport_cql_errors_total
```
Also marked are the CAS-related CQL operations.
Fixes #12751
Closes scylladb/scylladb#13558
* github.com:scylladb/scylladb:
service/storage_proxy.cc: mark counters with skip_when_empty
cql3/query_processor.cc: mark cas related metrics with skip_when_empty
transport/server.cc: mark metric counter with skip_when_empty
So that replacing it will destroy the previous tracker
and unregister it before assigning the new one and
then registering it.
This is safer than assigning it in place.
With that, the move assignment operator is no longer
used and can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the moved-from object's manager pointer is moved into the
constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.
Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.
With that in mind, when move-assigning a compaction_backlog_tracker
the existing tracker can remain registered.
Fixes scylladb/scylladb#15248
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
these options for disabling warnings are not necessary anymore, for
one of the following reasons:
* the code which caused the warning was either fixed or removed
* the toolchain was updated, so the false alarms do not exist
with the latest frozen toolchain.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15450
do not use a multi-line comment. this silences the warning from GCC:
```
In file included from ./cql3/prepare_context.hh:19,
from ./cql3/statements/raw/parsed_statement.hh:14,
from build/debug/gen/cql3/CqlParser.hpp:62,
from build/debug/gen/cql3/CqlParser.cpp:44:
./cql3/expr/expression.hh:490:1: error: multi-line comment [-Werror=comment]
490 | /// Custom formatter for an expression. Supports multiple modes:\
| ^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15471
this target mirrors the target named `{mode}e-test` in the
`build.ninja` build script created by `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15448
because we now have a frozen toolchain built with Fedora 38, and f38
provides cmake v3.27.4, we can assume the availability of cmake v3.27.4
when building scylla with the toolchain.
in this change, the minimum required CMake version is changed to
3.27.
this also allows us to simplify the implementation of
`add_whole_archive()`, and remove the buggy branch for supporting
CMake < 3.24, as we should have used `${name}` in place of `auth` there.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15446
This PR modularizes `manager.{hh, cc}` by dividing the files into separate smaller units. The changes improve overall readability of code and help reason about it. Each file has a specific purpose now.
This is the first step in refactoring the Hinted Handoff module.
Refs scylladb/scylla#15358
Closes scylladb/scylladb#15378
* github.com:scylladb/scylladb:
db/hints: Remove unused aliases from manager.hh
db/hints: Rename end_point_hints_manager
db/hints: Rename sender to hint_sender
db/hints: Move the rebalancing logic to hint_storage
db/hints: Move the implementation of sender
db/hints: Move the declaration of sender to hint_sender.hh
db/hints: Move sender::replay_allowed() to the source file
db/hints: Put end_point_hints_manager in internal namespace
db/hints: Move the implementation of end_point_hints_manager
db/hints: Move the declaration of end_point_hints_manager
db/hints: Move definitions of functions using shard hint manager
db/hints: Introduce hint_storage.hh
db/hints: Extract the logger from manager.cc
db/hints: Extract common types from manager.hh
That's needed for enabling incremental compaction to operate, and
for subsequent work that enables incremental compaction
for off-strategy, which in turn uses the reshape compaction type.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, when creating a table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
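The check can be sketched as follows (illustrative types, not the actual result_message hierarchy): permissions are granted only when the result message reports an actual schema change, filtering out both the IF NOT EXISTS no-op and the bounce_to_shard response.

```cpp
#include <cassert>

// Stand-in for the kinds of result_message a CREATE TABLE can produce.
enum class result_kind { schema_change, void_result, bounce_to_shard };

// Grant permissions to the creator only if the table was actually created.
bool should_grant_to_creator(result_kind r) {
    return r == result_kind::schema_change;
}
```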
CVE-2023-33972
Fixes #15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
this target mirrors the "dist-server-debuginfo-{mode}" target in the `build.ninja` created by `configure.py`.
Closes scylladb/scylladb#15441
* github.com:scylladb/scylladb:
build: cmake: add "dist-server-debuginfo" target
build: cmake: remove debian dep from relocatable pkg
Populating of non-system keyspaces is now done by listing datadirs and assuming that each subdir found is a keyspace. For S3-backed keyspaces this is also true, but it's a bug (#13020). The loop needs to walk the list of known keyspaces instead, and try to find the keyspace storage later, based on the storage option.
Closes scylladb/scylladb#15436
* github.com:scylladb/scylladb:
distributed_loader: Indentation fix after previous patch
distributed_loader: Generalize datadir parallelizm loop
distributed_loader: Provide keyspace ref to populate_keyspace
distributed_loader: Walk list of keyspaces instead of directories
always initialize returned values. the branches which
return these uninitialized values handle the
unmatched cases, so this change should not have any
impact on the behavior.
ANTLR3's C++ code generator does not assign any value
to the return value if it runs into a failure or encounters
an exception. for instance, the following rule assigns the
value of `isStatic` to `isStaticColumn` only if
nothing goes wrong.
```
cfisStatic returns [bool isStaticColumn]
@init{
bool isStatic = false;
}
: (K_STATIC { isStatic=true; })?
{
$isStaticColumn = isStatic;
}
;
```
as shown in the generated C++ code:
```c++
switch (alt118)
{
case 1:
// build/debug/gen/cql3/Cql.g:989:8: K_STATIC
{
this->matchToken(K_STATIC, &FOLLOW_K_STATIC_in_cfisStatic5870);
if (this->hasException())
{
goto rulecfisStaticEx;
}
if (this->hasFailed())
{
return isStaticColumn;
}
if ( this->get_backtracking()==0 )
{
isStaticColumn=isStatic;
}
}
break;
}
```
when `this->hasException()` or `this->hasFailed()` is true,
`isStaticColumn` is returned right away without being
initialized, because we don't assign any initial value
to it, nor do we customize the exception handling
for this rule.
and, the parser bails out when it smells something bad
after it tries to match the specified rule. also, the
parser is a stateful tokenizer; its failure state is
carried by the parser itself. also, matchToken()
*could* fail when trying to find the matching token --
this is the runtime behavior of the parser, and that's why the
compiler cannot be certain that the error path won't
be taken.
anyway, let's always initialize the return values
explicitly. the return values whose type
is a scoped enum are zero-initialized,
because their types don't provide an
"invalid" value.
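A minimal sketch of the initialization pattern (illustrative names; `oper_t` and `parse_oper` stand in for the generated parser code): the return value is value-initialized up front, so the early-return error paths hand back a well-defined value, and a scoped enum with no "invalid" enumerator is zero-initialized with empty braces.

```cpp
#include <cassert>

// Stand-in for a scoped enum returned by a parser rule.
enum class oper_t { add = 1, sub = 2 };

oper_t parse_oper(bool parse_failed) {
    oper_t op{}; // zero-initialized instead of left indeterminate
    if (parse_failed) {
        return op; // error path now returns a defined value
    }
    op = oper_t::add;
    return op;
}
```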
this change should silence warnings like:
```
clang++ -MD -MT build/debug/gen/cql3/CqlParser.o -MF build/debug/gen/cql3/CqlParser.o.d -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1 -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen --std=gnu++20 -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -DBOOST_TEST_DYN_LINK -DNOMINMAX -DNOMINMAX -fvisibility=hidden -Wall -Werror -Wextra -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-c++11-narrowing -Wno-ignored-qualifiers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-implicit-int-float-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DFMT_DEPRECATED_OSTREAM -Wno-parentheses-equality -O1 -fno-sanitize-address-use-after-scope -c -o build/debug/gen/cql3/CqlParser.o build/debug/gen/cql3/CqlParser.cpp
build/debug/gen/cql3/CqlParser.cpp:26645:28: error: variable 'perm' is uninitialized when used here [-Werror,-Wuninitialized]
return perm;
^~~~
build/debug/gen/cql3/CqlParser.cpp:26616:5: note: variable 'perm' is declared here
auth::permission perm;
^
build/debug/gen/cql3/CqlParser.cpp:52577:28: error: variable 'op' is uninitialized when used here [-Werror,-Wuninitialized]
return op;
^~
build/debug/gen/cql3/CqlParser.cpp:52518:5: note: variable 'op' is declared here
oper_t op;
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15451
instead of checking the availability of a required program, let's
use the `REQUIRED` argument introduced by CMake 3.18, simpler this
way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15447
Handle abort_requested_exception exactly like
sleep_aborted, as an expected error when startup
is aborted mid-way.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#15443
Some time ago populating of tables from sstables was reworked to use sstable states instead of full paths (#12707). Since then, a few places in the populator were left that still operate on the state-based subdirectory name. This PR collects most of those dangling ends.
refs: #13020
Closes scylladb/scylladb#15421
* github.com:scylladb/scylladb:
distributed_loader: Print sstable state explicitly
distributed_loader: Move check for the missing dir upper
distributed_loader: Use state as _sstable_directories key
when compiling the tests with -Wsign-compare, the compiler complains like:
```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/cmake/rust -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -MF test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o.d -o test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -c /home/kefu/dev/scylladb/test/boost/tablets_test.cc
/home/kefu/dev/scylladb/test/boost/tablets_test.cc:1335:53: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
for (int log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
```
in this case, it should be safe to use a signed int as the loop
variable to be compared with `tablet_count_bits`, but let's just
appease the compiler so we can enable the warning option project-wide
to prevent any potential issues caused by signed-unsigned comparison.
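The fix can be sketched as follows (the loop body is illustrative, not the test's actual logic): give the loop variable the same unsigned type as the bound so -Wsign-compare has nothing to complain about.

```cpp
#include <cassert>
#include <cstddef>

// Loop variable matches the unsigned type of the bound, avoiding the
// signed/unsigned comparison warning.
std::size_t count_tablet_doublings(std::size_t tablet_count_bits) {
    std::size_t n = 0;
    for (std::size_t log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
        ++n;
    }
    return n;
}
```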
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15449
We add garbage collection for the `CDC_GENERATIONS_V3` table to prevent
it from endlessly growing. This mechanism is especially needed because
we send the entire contents of `CDC_GENERATIONS_V3` as a part of the
group 0 snapshot.
The solution is to keep a clean-up candidate, which is one of the
already published CDC generations. The CDC generation publisher
introduced in #15281 continually uses this candidate to remove all
generations with timestamps not exceeding the candidate's and sets a new
candidate when needed.
We also add `test_cdc_generation_clearing.py` that verifies this new
mechanism.
Fixes #15323
Closes scylladb/scylladb#15413
* github.com:scylladb/scylladb:
test: add test_cdc_generation_clearing
raft topology: remove obsolete CDC generations
raft topology: set CDC generation clean-up candidate
topology_coordinator: refactor publish_oldest_cdc_generation
system_keyspace: introduce decode_cdc_generation_id
system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
in other words, do not create a bpo::value unless transferring it to an
option_description.
`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer and manages its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.
before this change, we initialized `operations_with_func` as a global
variable using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from it --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked; hence LeakSanitizer warns us.
after this change, we create the option map as a static
local variable in a function so it is created on demand as well.
as an alternative, we could initialize the options map as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.
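The pattern can be sketched as follows (entries are illustrative, and a simple string map stands in for the boost::program_options machinery): the per-subcommand option map lives in a function-local static, so it is constructed only when the subcommand machinery actually asks for it, and stays owned for the program's lifetime instead of leaking.

```cpp
#include <cassert>
#include <map>
#include <string>

// Built lazily on first call; subsequent calls return the same instance.
const std::map<std::string, std::string>& operations_with_func() {
    static const std::map<std::string, std::string> ops = {
        {"compact", "run a major compaction"}, // illustrative entries
        {"status", "print ring status"},
    };
    return ops;
}
```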
this change is quite similar to 374bed8c3d
Fixes https://github.com/scylladb/scylladb/issues/15429
Closes scylladb/scylladb#15430
* github.com:scylladb/scylladb:
tools/scylla-nodetools: reindent
tools/scylla-nodetools: do not create unowned bpo::value
column_stats::update_local_deletion_time() is not used anywhere,
what is being used is
`column_stats::update_local_deletion_time_and_tombstone_histogram(time_point)`.
while `update_local_deletion_time_and_tombstone_histogram(int32_t)`
is only used internally by a single caller.
neither is `column_stats::update(const deletion_time&)` used.
so let's drop them, and merge
`update_local_deletion_time_and_tombstone_histogram(int32_t)`
into its caller.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15189
in this series, we do not assume the existence of the "build" build directory, and prefer using the version files located under the directory specified with the `--build-dir` option.
Refs #15241
Closes scylladb/scylladb#15402
* github.com:scylladb/scylladb:
create-relocatable-package.py: prefer $build_dir/SCYLLA-RELEASE-FILE
create-relocatable-package.py: create SCYLLA-RELOCATABLE-FILE with tempfile
in other words, do not create a bpo::value unless transferring it to an
option_description.
`boost::program_options::value()` creates a new typed_value<T> object
without holding it in a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer and manages its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.
before this change, we initialized `operations_with_func` as a global
variable using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from it --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked; hence LeakSanitizer warns us.
after this change, we create the option map as a static
local variable in a function so it is created on demand as well.
as an alternative, we could initialize the options map as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.
this change is quite similar to 374bed8c3d
Fixes #15429
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
```console
$ journalctl --user start scylla-server -xe
Failed to add match 'start': Invalid argument
```
`journalctl` expects a match filter as its positional arguments.
but apparently, `start` is not a filter. we could use `--unit`
to specify a unit though, like:
```console
$ journalctl --user --unit scylla-server.service -xe
```
but it would flood stdout with the logging messages printed
by scylla. this is not what a typical user expects. probably a better
user experience can be achieved using
```console
$ systemctl --user status scylla-server
```
which also prints the current status reported by the service, and
the command line arguments. these would be more informative in typical
use cases.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15390
The docker/podman tooling is destructive: it will happily
overwrite images locally and on the server. If a maintainer
forgets to update tools/toolchain/image, this can result
in losing an older toolchain container image.
To prevent that, check that the image name is new.
Closes scylladb/scylladb#15397
update commented out experimental_features to reflect the latest
experimental features:
- in 4f23eec4, "raft" was renamed to "consistent-topology-changes".
- in 2dedb5ea, "alternator-ttl" was moved out of experimental features.
- in 5b1421cc, "broadcast-tables" was added to experimental features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#15407
this target mirrors the "dist-server-debuginfo-{mode}" target in
the `build.ninja` created by `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`create-relocatable-package.py` does not use or include
`${CMAKE_CURRENT_BINARY_DIR}/debian`. so there is no
need to include this directory as a dependency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Perform schema changes while mixing nodes in RECOVERY mode with nodes in
group 0 mode:
- schema changes originating from RECOVERY node use
digest-based schema versioning.
- schema changes originating from group 0
nodes use persisted versions committed through group 0.
Verify that schema versions are in sync after each schema change, and
that each schema change results in a different version.
Also add a simple upgrade test, performing a schema change before we
enable Raft (which also enables the new versioning feature) in the
entire cluster, then once upgrade is finished.
One important upgrade test is missing, which we should add to dtest:
create a cluster in Raft mode but in a Scylla version that doesn't
understand GROUP0_SCHEMA_VERSIONING. Then start upgrading to a version
that has this patchset. Perform schema changes while the cluster is
mixed, both on non-upgraded and on upgraded nodes. Such a test is
especially important because we're adding a new column to the
`system.scylla_local` table (which we then redact from the schema
definition when we see that the feature is disabled).
pull_github_pr.sh adds a "Closes #xyz" tag so github can close
the pull request after the next promotion. Convert it to an absolute
reference (scylladb/scylladb#xyz) so the commit can be cherry-picked
into another repository without the reference dangling.
Closes #15424
this change changes `const` to `constexpr`, because the string literal
defined here is not only immutable, but also initialized at
compile time, and can be used by constexpr expressions and functions.
this change is introduced to reduce the size of the change when moving
to compile-time format strings in the future. so far, seastar::format()
does not use compile-time format strings, but we have patches pending
review implementing this. and the author of this change has local
branches implementing the changes on the scylla side to support
compile-time format strings, which practically replace most of the
`format()` calls with `seastar::format()`.
without this change, if we use the compile-time format check, the
compiler fails like:
```
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:276:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
.description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
^
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:276:44: note: in call to 'basic_format_string(description_template)'
.description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
^
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:258:16: note: declared here
const auto description_template =
^
```
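The difference can be demonstrated without fmt at all: a compile-time check can only read variables usable in constant expressions. A minimal sketch (names chosen for illustration, mirroring the `description_template` above):

```cpp
#include <cstddef>

// constexpr: initialized at compile time and readable from constant
// expressions, so a consteval check (like fmt's format-string
// constructor) can inspect it. A plain `const auto` local could not
// be read there, producing the error quoted above.
constexpr const char* description_template = "{} <operation> [options]";

constexpr std::size_t const_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

// Compiles only because description_template is constexpr.
static_assert(const_length(description_template) == 24);
```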
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15432
change the loop variable to `int` to silence a warning like
```
/home/kefu/.local/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -I/home/kefu/dev/scylladb/build/cmake/gen -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o -MF tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o.d -o tools/CMakeFiles/tools.dir/scylla-nodetool.cc.o -c /home/kefu/dev/scylladb/tools/scylla-nodetool.cc
/home/kefu/dev/scylladb/tools/scylla-nodetool.cc:215:28: error: comparison of integers of different signs: 'unsigned int' and 'int' [-Werror,-Wsign-compare]
for (unsigned i = 0; i < argc; ++i) {
~ ^ ~~~~
```
`i` is used as the index into a plain C-style array; it's perfectly
fine to use a signed integer as the index in this case. As per the C++
standard,
> The expression E1[E2] is identical (by definition) to *((E1)+(E2))
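A sketch of the pattern (function name hypothetical) showing that a signed loop variable indexes a C-style array cleanly when the bound is itself signed, like `argc`:

```cpp
// A signed index into a C array is well-defined: E1[E2] is *((E1)+(E2)).
// Using `int i` against a signed bound avoids -Wsign-compare entirely.
inline int count_nonempty(char** args, int argn) {
    int n = 0;
    for (int i = 0; i < argn; ++i) { // signed vs signed: no warning
        if (args[i] != nullptr && args[i][0] != '\0') {
            ++n;
        }
    }
    return n;
}
```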
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15431
As promised in earlier commits:
Fixes: #7620
Fixes: #13957
Also modify two test cases in `schema_change_test` which depend on
the digest calculation method in their checks. Details are explained in
the comments.
Population of keyspaces happens first for system keyspaces, then for
non-system ones. Both methods iterate over the configured datadirs to
populate from all configured directories. This patch generalizes this
loop into the populate_keyspace() method.
(indentation is deliberately left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question tries to find the keyspace reference on the
database by the given keyspace name. However, one of the callers
already has the keyspace reference at hand and can just pass it. The
other callers can find the keyspace on their own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When populating non-system keyspaces the dist. loader lists the
directories with keyspaces in datadirs, then tries to call
populate_keyspace() with the found name. If the keyspace in question is
not found on the database, a warning is printed and population
continues.
S3-backed keyspaces are nowadays populated by this process only
because of bug #13020 -- even such keyspaces still create empty
directories in the datadirs. When the bug gets fixed, population would
omit such keyspaces. This patch prepares for that by making population
walk the known keyspaces from the database. BTW, population of system
keyspaces already works by iterating over the list of known keyspaces,
not the datadir subdirectories.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in the hope of lowering the bar for testing the object store.
* add a language specifier for better readability of the document,
to highlight the config with YAML syntax
* add a more specific comment on the AWS-related settings
* explain that the endpoint in the CREATE KEYSPACE statement should
match the one defined by the YAML configuration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15433
As explained in the previous commit, we use the new
`committed_by_group0` flag attached to each row of a `scylla_tables`
mutation to decide whether the `version` cell needs to be deleted or
not.
The rest of #13957 is solved by pre-existing code -- if the `version`
column is present in the mutation, we don't calculate a hash for
`schema::version()`, but take the value from the column:
```
table_schema_version schema_mutations::digest(db::schema_features sf)
const {
if (_scylla_tables) {
auto rs = query::result_set(*_scylla_tables);
if (!rs.empty()) {
auto&& row = rs.row(0);
auto val = row.get<utils::UUID>("version");
if (val) {
return table_schema_version(*val);
}
}
}
...
```
The issue will therefore be fixed once we enable
`GROUP0_SCHEMA_VERSIONING`.
As described in #13957, when creating or altering a table in group 0
mode, we don't want each node to calculate `schema::version()`s
independently using a hash algorithm. Instead, we want all nodes to
use a single version for that table, committed by the group 0 command.
There's even a column ready for this in `system.scylla_tables` --
`version`. This column is currently being set for system tables, but
it's not being used for user tables.
Similarly to what we did with global schema version in earlier commits,
the obvious thing to do would be to include a live cell for the `version`
column in the `system.scylla_tables` mutation when we perform the schema
change in Raft mode, and to include a tombstone when performing it
outside of Raft mode, for the RECOVERY case.
But it's not that simple because, as it turns out, we're *already*
sending a `version` live cell (and also a tombstone, with the
timestamp decremented by 1) in all `system.scylla_tables` mutations.
But then we delete that cell when doing the schema merge (which begs
the question why we were sending it in the first place, but I digress):
```
// We must force recalculation of schema version after the merge, since the resulting
// schema may be a mix of the old and new schemas.
delete_schema_version(mutation);
```
The above function removes the `version` cell from the mutation.
So we need another way of distinguishing a schema change originating
from group 0 from one originating outside group 0 (e.g. RECOVERY).
The method I chose is to extend `system.scylla_tables` with a boolean
column, `committed_by_group0`, and extend schema mutations to set
this column.
In the next commit we'll decide whether or not the `version` cell should
be deleted based on the value of this new column.
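A minimal sketch of the intended decision (types and names hypothetical; the real code operates on mutations, not plain structs): keep the persisted `version` cell only when the row was committed through group 0, otherwise drop it so the digest-based path takes over.

```cpp
#include <optional>

// Hypothetical flattened view of a system.scylla_tables row.
struct scylla_tables_row {
    std::optional<bool> committed_by_group0;
    std::optional<long> version; // persisted schema version, if any
};

// If the change did not come through group 0 (e.g. RECOVERY mode),
// delete the version cell to force digest recalculation after merge.
inline void maybe_delete_schema_version(scylla_tables_row& row) {
    if (!row.committed_by_group0.value_or(false)) {
        row.version.reset();
    }
}
```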
As promised in the previous commit, if we persisted a schema version
through a group 0 command, use it after a schema merge instead of
calculating a digest.
Ref: #7620
The above issue will be fixed once we enable the
`GROUP0_SCHEMA_VERSIONING` feature.
We extend schema mutations with an additional mutation to the
`system.scylla_local` table which:
- in Raft mode, stores a UUID under the `group0_schema_version` key.
- outside Raft mode, stores a tombstone under that key.
As we will see in later commits, nodes will use this after applying
schema mutations. If the key is absent or has a tombstone, they'll
calculate the global schema digest on their own -- using the old way. If
the key is present, they'll take the schema version from there.
The Raft-mode schema version is equal to the group 0 state ID of this
schema command.
The tombstone is necessary for the case of performing a schema change in
RECOVERY mode. It will force a revert to the old digest-based way.
Note that extending schema mutations with a `system.scylla_local`
mutation is possible thanks to earlier commits which moved
`system.scylla_local` to schema commitlog, so all mutations in the
schema mutations vector still go to the same commitlog domain.
Support for `canonical_mutation`s was added way back in Scylla 3.2. The
migration request handler was checking whether the remote supports
`canonical_mutation`s to handle rolling upgrades, and if not, it would
use `frozen_mutation`s instead.
We no longer need that second branch, since we don't support skipping
versions during upgrades (certainly everything would burn if we tried a
3.2->5.4 upgrade).
Leave a sanity check but otherwise delete the other branch.
Move node_ops related classes to node_ops/ so that they
are consistently grouped and can be accessed from
many modules.
Closes #15351
* github.com:scylladb/scylladb:
node_ops: extract classes related to node operations
node_ops: repair: move node_ops_id to node_ops directory
This feature, when enabled, will modify how schema versions
are calculated and stored.
- In group 0 mode, schema versions are persisted by the group 0 command
that performs the schema change, then reused by each node instead of
being calculated as a digest (hash) by each node independently.
- In RECOVERY mode or before Raft upgrade procedure finishes, when we
perform a schema change, we revert to the old digest-based way, taking
into account the possibility of having performed group0-mode schema
changes (that used persistent versions). As we will see in future
commits, this will be done by storing additional flags and tombstones
in system tables.
By "schema versions" we mean both the UUIDs returned from
`schema::version()` and the "global" schema version (the one we gossip
as `application_state::SCHEMA`).
For now, in this commit, the feature is always disabled. Once all the
necessary code is set up in the following commits, we will enable it
together with Raft.
The `scylla_tables` function gives a different schema definition
for the `system_schema.scylla_tables` table, depending on whether
certain schema features are enabled or not.
The way it was implemented, we had to write `θ(2^n)` amount
of code and comments to handle `n` features.
Refactor it so that the amount of code we have to write to handle `n`
features is `θ(n)`.
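The idea can be sketched as follows (column names from `scylla_tables`, structure hypothetical): build the definition incrementally with one branch per feature, instead of enumerating every feature combination.

```cpp
#include <string>
#include <vector>

// Hypothetical feature flags relevant to the scylla_tables schema.
struct schema_features {
    bool per_table_partitioners = false;
    bool group0_schema_versioning = false;
};

// One `if` per feature: θ(n) code for n features, instead of one
// hand-written variant per subset of features (θ(2^n)).
inline std::vector<std::string> scylla_tables_columns(schema_features f) {
    std::vector<std::string> cols = {"keyspace_name", "table_name", "version"};
    if (f.per_table_partitioners) {
        cols.push_back("partitioner");
    }
    if (f.group0_schema_versioning) {
        cols.push_back("committed_by_group0");
    }
    return cols;
}
```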
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.
Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_scehma` is called on shard different than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.
Fix this.
This commit adds the information that ScyllaDB Enterprise
supports FIPS-compliant systems in versions
2023.1.1 and later.
The information is excluded from OSS docs with
the "only" directive, because the support was not
added in OSS.
This commit must be backported to branch-5.2 so that
it appears on version 2023.1 in the Enterprise docs.
Closes #15415
We make the CDC generation publisher continually remove the
obsolete CDC generation data to prevent CDC_GENERATIONS_V3 from
endlessly growing. To achieve this, we use the clean-up candidate.
If it exists and can be safely removed, we remove it together with
all older CDC generations. We also mark the lack of a new
candidate. The next published CDC generation will become one.
Note this solution does not have any guarantee about "when"
it removes obsolete generations. Formally, it guarantees that
if there is a candidate that can be removed and the CDC generation
publisher attempts to remove it, all generations up to the
candidate are removed. In practice, when a new generation appears,
the publisher makes a new candidate or tries to remove an old
candidate, so obsolete generations can stay for a long time only
if no generation appears for a long time. But it is fine because
we only want to prevent CDC_GENERATIONS_V3 from growing too much.
Moreover, providing any guarantees would require a new wake-up
mechanism for the publisher, which would be hard to implement.
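The clean-up rule can be sketched like this (names hypothetical; the real code goes through group 0 mutations): if a candidate exists and can be safely removed, drop it together with all older generations and clear the candidate, so the next published generation becomes the new one.

```cpp
#include <optional>
#include <set>

using gen_ts = long; // stand-in for a CDC generation timestamp

// Remove the candidate and everything older, if it can be safely
// removed (here: strictly older than `safe_before`).
inline void maybe_clean_up(std::set<gen_ts>& generations,
                           std::optional<gen_ts>& candidate,
                           gen_ts safe_before) {
    if (candidate && *candidate < safe_before) {
        generations.erase(generations.begin(),
                          generations.upper_bound(*candidate));
        candidate.reset(); // next published generation becomes the candidate
    }
}
```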
We want to use the clean-up candidates to remove the obsolete CDC
generation data, but first, we need to set suitable generations as
a candidate when there is no candidate. Since CDC generations must
be published before we remove them, a generation that is being
published is a good candidate.
In the following commits, we add a new task for the CDC generation
publisher -- clearing obsolete CDC generation data. This task
can be done together with the publishing under one group 0 guard.
We refactor publish_oldest_cdc_generation to make it possible.
Now, this function is more like a command builder. It takes guard
by const reference and updates the vector of mutations and the
reason string. The CDC generation publisher uses them directly to
update the topology at the end after finishing building the
command. This logic will be more visible after adding the clearing
task.
This commit renames `end_point_hints_manager` to `hint_endpoint_manager`
to be consistent with other names used in the module (they all start
with `hint_`).
This commit continues modularizing manager.hh.
After moving the declaration of sender to a dedicated
header file, these changes move its implementation to
a separate source file.
This commit is yet another step in modularizing manager.hh.
We move the declaration of sender to a dedicated file.
Its implementation will follow in a future commit.
The premise of these changes is the fact that we cannot have
a cycle of #includes.
Because the declaration of `sender` is going to be moved to
a separate header file in a future commit, and because that
header file is going to be included in the file where
`end_point_hints_manager` is declared, we will need to rely
on `end_point_hints_manager` being an incomplete type there.
A consequence of that is that we cannot access any of
`end_point_hints_manager`'s methods.
This commit prepares the ground for it by moving
the definition of the function to the source file where
`end_point_hints_manager` will be a complete type.
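The constraint can be illustrated in a single translation unit (names mirror the commit, members hypothetical): where only the forward declaration is visible, members of the type cannot be called; the call must live where the type is complete.

```cpp
// Header view: only a forward declaration is available here, so
// sender can hold a reference but cannot call any methods.
struct end_point_hints_manager; // incomplete type

struct sender {
    end_point_hints_manager& mgr;
    bool can_flush() const; // body deferred to the "source file" below
};

// "Source file" view: the type is complete from here on.
struct end_point_hints_manager {
    bool stopping = false;
};

bool sender::can_flush() const {
    return !mgr.stopping; // legal: complete type in this context
}
```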
This commit continues moving end_point_hints_manager to its
dedicated files. After moving the declaration of the class,
these changes move the implementation.
This commit is yet another step in modularizing manager.hh.
We move the declaration of the class to a dedicated header file.
The implementation will follow in a future commit.
We move definitions of inline methods of end_point_hints_manager
and sender accessing shard hint manager to the source file,
effectively un-inlining them. We need to do that to prepare for
moving said structures out of manager.hh. This commit is yet
another step in modularizing manager.hh.
This commit moves types used by shard hint manager
and related to storing hints on disk to another file.
It is yet another step in modularizing manager.hh.
This commit extracts the logger used in manager.cc
to prepare the ground for modularization of manager.hh
into separate smaller files. We want to preserve
the logging behavior (at least for the time being),
which means new files should use the same logger.
These changes serve that purpose.
Currently, data structures used in manager.hh
use their own aliases for gms::inet_address.
It is clear they all should use the same type
and having different names for it only reduces
readability of the code. This commit introduces
a common alias -- endpoint_id -- and gets rid
of the other ones.
This commit is also the first step in modularizing
manager.hh by extracting common types to another
file.
Currently, we only log what was tried w.r.t. obtaining the schema if
it failed. Add a log message to the success path too, so in case the
wrong schema was successfully loaded, the user can find the problem.
The log message is printed at debug level, so it doesn't disturb the
output by default.
Fixes: #15384
Closes #15417
in this series,
- the build of unstripped package is fixed, and
- the targets for building deb and rpm packages are added. these targets build deb and rpm packages from the unstripped package.
Closes #15403
* github.com:scylladb/scylladb:
build: cmake: add targets for building deb and rpm packages
build: cmake: correct the paths used when building unstripped pkg
cassandra-stress connects to "localhost" by default. that's exactly the
use case when we install scylla using the unified installer. so do not
suggest the "-node xxx" option; the "xxx" part is only confusing.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15411
This series introduces a scylla-native nodetool. It is invokable via the main scylla executable, like the other native tools we have. It uses seastar's new `http::client` to connect to the specified node and execute the desired commands.
For now a single command is implemented: `nodetool compact`, invokable as `scylla nodetool compact`. Once all the boilerplate is added to create a new tool, implementing a single command is not too bad in terms of code bloat. Certainly not as clean as a python implementation would be, but good enough. The advantages of a C++ implementation are that all of us in the core team know C++ and that it ships right as part of the scylla executable.
Closes #14841
* github.com:scylladb/scylladb:
test: add nodetool tests
test.py: add ToolTestSuite and ToolTest
tools/scylla-nodetool: implement compact operation
tools/scylla-nodetool: implement basic scylla_rest_api_client
tools: introduce scylla-nodetool
utils: export dns_connection_factory from s3/client.cc to http.hh
utils/s3/client: pass logger to dns_connection_factory in constructor
tools/utils: tool_app_template::run_async(): also detect --help* as --help
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with the highest priority.
Topology changes now have an extra step called "tablet draining",
which calls the load balancer. The step will execute the tablet
migration track as long as there are nodes which require draining. It
will not do regular load balancing.
If the load balancer is unable to find new tablet replicas, because
the RF cannot be met or availability is at risk due to insufficient
node distribution across racks, it will throw an exception. Currently,
the topology change will retry in a loop. We should make this error
cause the topology change to be aborted. There is no infrastructure
for aborts yet, so this is not implemented.
Closes #15197
* github.com:scylladb/scylladb:
tablets, raft topology: Add support for decommission with tablets
tablet_allocator: Compute load sketch lazily
tablet_allocator: Set node id correctly
tablet_allocator: Make migration_plan a class
tablets: Implement cleanup step
storage_service, tablets: Prevent stale RPCs from running beyond their stage
locator: Introduce tablet_metadata_guard
locator, replica: Add a way to wait for table's effective_replication_map change
storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
raft topology: Add break in the final case clause
raft topology: Fix SIGSEGV when trace-level logging is enabled
raft topology: Set node state in topology
raft topology: Always set host id in topology
When a column family's schema is changed, a new compaction
strategy type may be applied.
To make sure that it behaves as expected, the compaction
strategy must contain only the allowed options and values.
Methods throwing exceptions on invalid options are added.
Fixes: #2336
Closes #13956
* github.com:scylladb/scylladb:
test: add test for compaction strategy validation
compaction: unify exception messages
compaction: cql3: validate options in check_restricted_table_properties
compaction: validate options used in different compaction strategies
compaction: validate common compaction strategy options
compaction: split compaction_strategy_impl constructor
compaction: validate size_tiered_compaction_strategy specific options
compaction: validate time_window_compaction_strategy specific options
compaction: add method to validate min and max threshold
compaction: split size_tiered_compaction_strategy_options constructor
compaction: make compaction strategy keys static constexpr
compaction: use helpers in validate_* functions
compaction: split time_window_compaction_strategy_options constructor
compaction: add validate method to compaction_strategy_options
time_window_compaction_strategy_options: make copy and move-able
size_tiered_compaction_strategy_options: make copy and move-able
When populating from a particular directory, the populator code
converts the state to a subdir name, then prints the path. The
conversion is pretty much artificial; it's better to provide a printer
for the state and print the state explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The quarantine directory can be missing from the datadir and that's OK.
In order to check that and skip population, the populator code uses
two-step logic -- first it checks if the directory exists and either
puts the sstable_directory object into the map or not. Later it checks
the map and decides whether to throw if the directory is missing.
Let's keep both the check and the throw in one place for brevity.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The populator maintains a map of path -> sstable_directory pairs, one
for each subdirectory for every sstable state. The "path" is in fact
not used by the logic, as it's just the subdirectory name for the
state, and the rest of the code operates on the state. So it's better
to make the map of directories indexed by the state as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, exceptions thrown from `sst->load` are unhandled,
resulting in, e.g.:
```
ERROR 2023-09-12 08:02:58,124 [shard 0:main] seastar - Exiting on unhandled exception: std::runtime_error (SSTable /home/bhalevy/.dtest/dtest-dxg4xdxg/test/node1/data/ks/cf-a3009f20512911ee8000d81cd2da3fd7/me-3g9b_0e0x_39vtt1y2rcqrffz55j-big-Data.db uses org.apache.cassandra.dht.Murmur3Partitioner partitioner which is different than com.scylladb.dht.CDCPartitioner partitioner used by the database)
```
Log the errors and exit the tool with non-zero status
in this case.
Fixes #15359
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15376
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with the highest priority.
Topology changes now have an extra step called "tablet draining",
which calls the load balancer. The step will execute the tablet
migration track as long as there are nodes which require draining. It
will not do regular load balancing.
If the load balancer is unable to find new tablet replicas, because
the RF cannot be met or availability is at risk due to insufficient
node distribution across racks, it will throw an exception. Currently,
the topology change will retry in a loop. We should make this error
cause the topology change to be paused so that the admin becomes aware
of the problem and issues an abort on the topology change. There is no
infrastructure for aborts yet, so this is not implemented.
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain), and it would fail on the missing
host id mapping (the node is no longer in topology), which is examined
by the tablet sharder when producing sstable sharding metadata,
leading to an abort due to the failed memtable flush.
Example scenario:
1. coordinator A sends RPC #1 to trigger streaming
2. coordinator fails over to B
3. coordinator B performs streaming successfully
4. RPC #1 arrives and starts streaming
5. coordinator B commits the transition to the post-streaming stage
6. coordinator B executes global token metadata barrier
We end up with streaming running despite the fact that the current
coordinator has moved on. Currently, this won't happen, because
streaming holds on to the erm. But we want to change that (see
#14995), so that it does not block barriers for migrations of other
tablets. The same problem applies to tablet cleanup.
The fix is to use tablet_metadata_guard around such long-running
operations, which will keep a hold on the erm so that in the above
scenario coordinator B will wait for it in step 6. The guard ensures
that the erm doesn't block other migrations, because it switches to
the latest erm if it's compatible. If it's not, it signals the guard's
abort_source so that such a stale operation aborts soon and the
barrier in step 6 doesn't wait for long.
Will be used to synchronize long-running tablet operations with
topology coordinator.
It blocks barriers like erm_ptr, but refreshes if the change is
irrelevant, so it behaves as if the erm_ptr's scope were narrowed down
to a single tablet.
The decode_cdc_generations_ids function allows us to decode
a vector of CDC generation IDs. After adding cleanup_candidate
to CDC_GENERATIONS_V3, we need a similar function that decodes
a single ID.
In the following commits, we implement a garbage collection for
CDC_GENERATIONS_V3. The first step is introducing the clean-up
candidate. It will be continually updated by the CDC generation
publisher and used to remove obsolete data.
Before, it was updated only for normal nodes. We need it for
bootstrapping nodes too. Otherwise, algorithms such as the load
balancer will be confused by observing nodes in topology without the
host id set.
This will become a problem when the load balancer is invoked
concurrently with bootstrap, which currently is not the case, but will
be after later patches.
We should maintain the invariant that all nodes in topology have a
host id.
Testing the new scylla nodetool tool.
The tests can be run against both implementations of nodetool: the
scylla-native one and the cassandra one. They all pass with both
implementations.
Equivalent of nodetool compact.
The following arguments are accepted:
* split-output,s (unused)
* user-defined (error is raised)
* start-token,st (unused)
* end-token,et (unused)
* partition (unused)
The partition argument is mentioned only in our docs; our nodetool
doesn't recognize it. I added it nevertheless (it is ignored).
Split-output doesn't work with our current nodetool; the option is
parsed, but an error is raised if it is used.
Add --host and --port parameters, parse and resolve these and
establish a connection to the provided host.
Add a simple get() and post() method, parsing the returned data as json.
Add the following compatibility arguments:
* password,pw
* password-file,pwf
* username,u
* print-port,pp
These are parsed and silently ignored, as they are specific to JMX and
aren't needed when connecting to the REST API.
Since boost program options supports neither multi-char short-form
switches nor the -short=value syntax, the argv has to be massaged into
a form which boost program options can digest. This is achieved by
substituting all incompatible option formats and syntax with the
equivalent boost program options compatible ones.
This mechanism is also used to make sure -h is translated to --host, not
--help. The help message is unfortunately still ambiguous, displaying
both with -h. This will be addressed in a follow-up.
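A minimal sketch of the argv massaging (alias table and function name hypothetical): rewrite multi-char short options and `-opt=value` forms into long options with separate value tokens that boost program options accepts.

```cpp
#include <map>
#include <string>
#include <vector>

// Rewrite JMX-nodetool style options (-st=100, -h, ...) into the long
// forms boost::program_options can parse; unknown tokens pass through.
inline std::vector<std::string>
massage_argv(const std::vector<std::string>& argv) {
    static const std::map<std::string, std::string> aliases = {
        {"-st", "--start-token"},
        {"-et", "--end-token"},
        {"-h", "--host"}, // ensure -h means --host, not --help
    };
    std::vector<std::string> out;
    for (const auto& arg : argv) {
        const auto eq = arg.find('=');
        const std::string head = arg.substr(0, eq);
        const auto it = aliases.find(head);
        out.push_back(it != aliases.end() ? it->second : head);
        if (eq != std::string::npos) {
            out.push_back(arg.substr(eq + 1)); // split -opt=value in two
        }
    }
    return out;
}
```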
We want to publish this class in a header so it can be used by others,
but it uses the s3 logger. We don't want future users to pollute the s3
logs, so allow users to pass their own loggers to the factory.
There are several system tables with strict durability requirements.
This means that if we have written to such a table, we want to be sure
that the write won't be lost in case of node failure. We currently
accomplish this by accompanying each write to these tables with
`db.flush()` on all shards. This is expensive, since it causes all the
memtables to be written to sstables, which causes a lot of disk writes.
These overheads can become painful during node startup, when we write the
current boot state to `system.local`/`system.scylla_local` or during
topology change, when `update_peer_info`/`update_tokens` write to
`system.peers`.
In this series we remove flushes on writes to the `system.local`,
`system.peers`, `system.scylla_local` and `system.cdc_local` tables and
start using schema commitlog for durability.
Fixes: #15133
Closes #15279
* github.com:scylladb/scylladb:
system_keyspace: switch CDC_LOCAL to schema commitlog
system_keyspace: scylla_local: use schema commitlog
database.cc: make _uses_schema_commitlog optional
system_keyspace: drop load phases
database.hh: add_column_family: add readonly parameter
schema_tables: merge_tables_and_views: delay events until tables/views are created on all shards
system_keyspace: switch system.peers to schema commitlog
system_keyspace: switch system.local to schema commitlog
main.cc: move schema commitlog replay earlier
sstables_format_selector: extract listener
sstables_format_selector: wrap when_enabled with seastar::async
main.cc: inline and split system_keyspace.setup
system_keyspace: refactor save_system_schema function
system_keyspace: move initialize_virtual_tables into virtual_tables.hh
system_keyspace: remove unused parameter
config.cc: drop db::config::host_id
main.cc: extract local_info initialization into function
schema.cc: check static_props for sanity
system_keyspace: set null sharder when configuring schema commitlog
system_keyspace: rename static variables
system_keyspace: remove redundant wait_for_sync_to_commitlog
we get the path of the object storage config like:
```c++
db::config::get_conf_sub("object_storage.yaml").native()
```
so, the default path should be $SCYLLA_CONF/object_storage.yaml.
this change corrects it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15406
The added metrics include:
- http client metrics, which include the number of connections, the number of active connections and the number of new connections made so far
- IO metrics that mimic those for traditional IO -- total number of object read/write ops, total number of get/put/uploaded bytes and individual IO request delay (round-trip, including body transfer time)
Fixes: #13369
Closes #14494
* github.com:scylladb/scylladb:
s3/client: Add IO stats metrics
s3/client: Add HTTP client metrics
s3/client: Split make_request()
s3/client: Wrap http client with struct group_client
s3/client: Move client::stats to namespace scope
s3/client: Keep part size local variable
in a0dcbb09c3, the newly introduced unstripped package did not build
at all, as it was using the wrong paths. so let's correct them.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
similar to d9dcda9dd5, we need to
use the version files located under $build_dir instead of "build".
so let's check the existence of $build_dir/SCYLLA-RELEASE-FILE,
and then fall back to the ones under "build".
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change serves two purposes:
1. so we don't assume the existence of the '$PWD/build' directory. we should
not assume this, as the build directory could be any directory; it
does not have to be "build".
2. we don't have to actually create a file under $build_dir. what we
need is just an empty file, so tempfile serves this purpose well.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
a new target "dist-unified" is added, so that CMake can build the unified
package, which is a bundle of all subcomponents, like cqlsh, python3,
jmx and tools.
Fixes #15241
Closes #15398
* github.com:scylladb/scylladb:
build: cmake: build unified package
build: cmake: put stripped_dist_pkg under $build/dist
We remove flush from set_scylla_local_param_as
since it's now redundant. We add it to
save_local_enabled_features as features need to
be available before schema commitlog replay.
We skip the flush if save_local_enabled_features
is called from topology_state_load when the features
are migrated to system.topology and we don't need
strict durability.
This field on the null shard is properly initialized
in the maybe_init_schema_commitlog function; until then
we can't make decisions based on its value. This problem
can happen e.g. if the add_column_family function is called
with readonly=false before maybe_init_schema_commitlog.
It will call commitlog_for to pass the commitlog to
mark_ready_for_writes and commitlog_for reads _uses_schema_commitlog.
In this commit we add protection against this case - we
trigger internal_error if _uses_schema_commitlog is read
before it is initialized.
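The guard described above can be sketched as follows. This is a simplified, hypothetical model (plain C++ standing in for Scylla's internals, with `std::logic_error` standing in for `on_internal_error`), not the actual implementation:

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical sketch: _uses_schema_commitlog becomes an optional, and
// reading it before maybe_init_schema_commitlog() has run triggers an
// internal error instead of silently acting on a default value.
struct database_flags {
    std::optional<bool> uses_schema_commitlog;

    bool get_uses_schema_commitlog() const {
        if (!uses_schema_commitlog.has_value()) {
            // stand-in for Scylla's on_internal_error()
            throw std::logic_error("_uses_schema_commitlog read before initialization");
        }
        return *uses_schema_commitlog;
    }

    void maybe_init_schema_commitlog(bool enabled) {
        uses_schema_commitlog = enabled;
    }
};
```

An early read trips the error; after initialization, reads return the configured value.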
maybe_init_schema_commitlog() was added to cql_test_env
to make boost tests work with the new invariant.
We want to switch system.scylla_local table to the
schema commitlog, but load phases hamper here - schema
commitlog is initialized after phase1,
so a table which is using it should be moved to phase2,
but system.scylla_local contains features, and we need
them before schema commitlog initialization for
SCHEMA_COMMITLOG feature.
In this commit we are taking a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commit log. To achieve this we've added a _readonly bool field
to the table class; it's initialized to true in the table's
constructor. In addition, we changed the table constructor
to always assign nullptr to commitlog, and we trigger
an internal error if table.commitlog() property is accessed
while the table is in readonly mode. Then, after
triggering on_system_tables_loaded notifications on
feature_service and sstable_format_selector, we call
system_keyspace::mark_writable and eventually
table::mark_ready_for_writes which selects the
proper commitlog and marks the table as writable.
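The readonly lifecycle described above can be sketched in miniature. All names here are simplified stand-ins for the real classes, and the exception stands in for the internal error:

```cpp
#include <stdexcept>

// Minimal sketch of the readonly loading mode: a table is constructed
// readonly with a null commitlog, accessing the commitlog while readonly
// is an error, and mark_ready_for_writes() assigns the proper commitlog
// and makes the table writable.
struct commitlog {};

struct table {
    bool readonly = true;      // initialized to true in the constructor
    commitlog* cl = nullptr;   // constructor always assigns nullptr

    commitlog* get_commitlog() const {
        if (readonly) {
            throw std::logic_error("commitlog accessed while table is readonly");
        }
        return cl;
    }

    void mark_ready_for_writes(commitlog* chosen) {
        cl = chosen;           // select the proper commitlog
        readonly = false;      // from now on the table accepts writes
    }
};
```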
In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant,
the table has already been made writable in
env.make_table_for_tests call.
The table::commitlog function either returns the current
commitlog or causes an error if the table is readonly. This
didn't work for virtual tables, since they never called
mark_ready_for_writes. In this commit we add this
call to initialize_virtual_tables.
Previously, creating a table or view in
schema_tables.cc/merge_tables_and_views was a two-step process:
first adding a column family (add_column_family function) and
then marking it as ready for writes (mark_table_as_writable).
There is a yield between these stages, which means
someone could see a table or view for which the
mark_table_as_writable method had not yet been called,
and start writing to it.
This problem was demonstrated by materialised view dtests.
A view is created on all nodes. On some nodes it will be created
earlier than on others and the view rebuild process will start
writing data to that view on other nodes, where mark_table_as_writable
has not yet been called.
In this patch we solve this problem by adding a readonly parameter
to the add_column_family method. When loading tables from disk,
this flag is set to true and the mark_table_as_writable
is called only after all sstables have been loaded.
When creating a new table, this flag is set to false,
mark_table_as_writable is called from inside add_column_family
and the new table becomes visible already as writable.
db.get_notifier().create_view triggers view rebuild; this
process writes to the table on all shards and thus can
access a partially created table, e.g. one where
mark_table_ready_for_writes was not yet called.
Schema commitlog lives only on the zero shard,
so we need to turn on use_null_sharder option.
Also, we remove flushes on writes as durability
is now guaranteed by the commitlog.
We want to switch the system.local table to
schema commitlog, but this table is used
in host_id initialization (initialize_local_info),
so we need to replay schema commitlog beforehand.
In this commit we gather all the actions
related to early system_keyspace initialization
in one place, before initialize_local_info_thread.
The calls to save_system_schema and recalculate_schema_version
are tied to legacy_schema_migrator::migrate and
initialize_virtual_tables calls, so they are done
separately after legacy_schema_migrator::migrate.
In the following commits we want to move schema
commitlog replay earlier, but the current sstable
format should be selected before the replay.
The current sstable format is stored in system.scylla_local,
so we can't read it until system tables are loaded.
This problem is similar to the enabled_features.
To solve this we split sstables_format_selector in two
parts. The lower level part, sstables_format_selector,
knows only about database and system_keyspace. It
will be moved before system_keyspace initialization,
and the on_system_tables_loaded method will
be called on it when the system_keyspace has loaded its tables.
The higher level part, sstables_format_listener, is responsible
for subscribing to feature_service and gossiper and is started
later, at the same place as sstables_format_selector before this commit.
The listener may fire immediately, so we must be in a thread
context for this to work.
In the next commits we are going to move
enable_features_on_startup above
sstables_format_selector::start in scylla_main, so we
need to fix this beforehand.
Our goal is to switch system.local table to schema
commitlog and stop doing flushes when we write to it.
This means it would be incorrect to read from this
table until schema commitlog is replayed.
On the other hand, we need truncation records
to be loaded before we start replaying schema
commitlog, since commitlog_replayer relies on them.
In this commit we inline the system_keyspace::setup
function and split its content into two parts. In
the first part, before schema commitlog replay,
we load truncation records. It's safe to load
them before schema commitlog replay since we intend
to keep the flushes on writes to the system.truncated
table. In the second part, after schema commitlog replay,
we do the rest of the job - build_bootstrap_info and
db::schema_tables::save_system_schema.
We decided to inline this function since there is
very low cohesion between the actions it's performing.
It's just simpler to reason about them individually.
This is a refactoring commit without observable changes
in behaviour.
Previously, there were two related functions in db::schema_tables:
save_system_keyspace_schema(qp) and save_system_schema(qp, ks).
The first called the second passing "system_schema" as
the second argument. Outside of schema_tables module we
don't need two functions, we just need a way to say
'persist system schema objects in the appropriate tables/keyspaces'.
In this commit we change the function save_system_schema
to have this meaning. Internally it calls save_system_schema_to_keyspace
twice with "system_schema" and "system", since that's what we need
in the single call site of this function in system_keyspace::setup.
In subsequent commits we are going to move this call out of the
system_keyspace::setup.
This is a readability refactoring commit without observable changes
in behaviour.
initialize_virtual_tables logically belongs to the virtual_tables module,
and moving it allows making other functions in virtual_tables.cc
(register_virtual_tables, install_virtual_readers)
local to the module, which simplifies matters a bit.
all_virtual_tables() is not needed anymore, all the references to
registered virtual tables are now local to virtual_tables module
and can just use virtual_tables variable directly.
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.
Some tests want a specific host_id, so we add it to cql_test_config
and use it in cql_test_env.
We can't pass host_id to sstables_manager by value since the manager is
initialized in the database constructor and host_id is not loaded yet.
We also prefer not to make a dependency on shared_token_metadata
since in this case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to
sstables_manager constructor.
This is a refactoring commit without observable changes
in behaviour.
The scylla main function is huge and incomprehensible.
There are a lot of hidden dependencies between actions
that it performs, and it's too difficult to reason about
them.
In this commit, we've extracted a small part of it into
its own function. We're hoping that, moving forward,
the rest of the code can be modified in a similar manner.
The schema commitlog lives only on the null shard, so it
makes no sense to set use_schema_commitlog
without use_null_sharder.
We also extract the function enable_schema_commitlog which
sets all the needed properties.
Tables with schema commitlog already sync every
write, so wait_for_sync_to_commitlog makes sense
only for the regular commitlog.
Technically there is nothing wrong with allowing
both options, but it's confusing. Being strict
and accurate about the meaning of the options
reduces the chance of errors due to misunderstanding.
This is preparation for the next commits, where
we will start generating an error if the combination
of options doesn't make sense.
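The option constraints discussed above could be validated along these lines. This is a hedged sketch with illustrative names, not the actual Scylla code:

```cpp
// Sketch of the stricter option checking: since the schema commitlog lives
// only on the null shard, use_schema_commitlog without use_null_sharder
// makes no sense, and wait_for_sync_to_commitlog is meaningful only for
// the regular commitlog.
struct table_options {
    bool use_schema_commitlog = false;
    bool use_null_sharder = false;
    bool wait_for_sync_to_commitlog = false;
};

bool options_make_sense(const table_options& o) {
    if (o.use_schema_commitlog && !o.use_null_sharder) {
        return false; // schema commitlog requires the null sharder
    }
    if (o.use_schema_commitlog && o.wait_for_sync_to_commitlog) {
        return false; // schema commitlog already syncs every write
    }
    return true;
}
```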
a new target "dist-unified" is added, so that CMake can build unified
package, which is a bundle of all subcomponents, like cqlsh, python3,
jmx and tools.
Fixes #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, the API call recalculates only the per-node schema version. To
work around issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has an impact on availability.
Use like this:
curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema
Fixes #15380
Closes #15381
This code is now spread over main and differs in cql_test_env. The PR
unifies both places and makes the manager start-stop look standard.
refs: #2795
Closes #15375
* github.com:scylladb/scylladb:
batchlog_manager: Remove start() method
batchlog_manager: Start replay loop in constructor
main, cql_test_env: Start-stop batchlog manager in one "block"
batchlog_manager: Move shard-0 check into batchlog_replay_loop()
batchlog_manager: Fix drain() reentrancy
Split compaction_strategy_impl constructor into methods that will
be reused for validation.
Add additional checks ensuring that options' values are legal.
Add compaction_strategy_impl::validate_min_max_threshold method
that will be used to validate min and max threshold values
for different compaction methods.
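A helper like the one named above might look as follows. This is a hypothetical sketch; the exact bounds and signature in Scylla may differ:

```cpp
#include <stdexcept>

// Hypothetical sketch of a validate_min_max_threshold helper: reject
// thresholds that cannot drive compaction (fewer than 2 sstables merged)
// or an inverted min/max pair.
void validate_min_max_threshold(long min_threshold, long max_threshold) {
    if (min_threshold < 2) {
        throw std::invalid_argument("min_threshold must be at least 2");
    }
    if (max_threshold < min_threshold) {
        throw std::invalid_argument("max_threshold must not be smaller than min_threshold");
    }
}
```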
Split size_tiered_compaction_strategy_options constructor into
methods that will be reused for validation.
Add additional checks ensuring that options' values are legal.
To be consistent with other compaction_strategy_options,
time_window_compaction_strategy_options uses compaction_strategy_impl::get_value
and cql3::statements::property_definitions::to_long helpers for
parsing.
Add a temporarily empty validate method to compaction_strategy_options.
The method will validate the options and help determine whether
only the allowed options were set.
There's a dedicated call to register migration manager's verbs somewhere
in the middle of main. However, until messaging service listening starts,
it makes no difference when the verbs are registered.
This patch moves the verbs registration into the migration manager
constructor, thus making it happen with sharded<migration_manager>::start().
Unregistration happens in migration_manager::drain() and it's not
touched here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15367
It's only used to be carried along down to a handler and to get
sharded<database> from it. The storage service itself can provide it, and the
handler in question already uses it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15368
there are a couple of places that check that group is not nullptr,
so let's set it to nullptr in the ctor, so shards that don't
have it initialized will trip the assert, instead of failing
with a cryptic segfault.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15330
in this series, `unified/build_unified.sh` is improved in a couple of ways:
1. add a `--build-dir` option, so we don't hardwire the build directory to the naming convention of `build/$mode`.
2. respect `--pkgs`. this allows the caller to specify the paths to the dist tarballs instead of hardwiring the paths defined in this script.
these changes give us more flexibility when building the unified package, and enable us to switch over to the CMake-based build system.
Refs #15241
Closes #15377
* github.com:scylladb/scylladb:
unified: respect --pkgs option
unified: allow passing --pkgs with a semicolon-separated list
unified: prefer SCYLLA-PRODUCT-FILE in build_dir
unified: derive UNIFIED_PKG from --build-dir
unified: add --build-dir option to build_unified.sh
We make the `CDC_GENERATIONS_V3` table single-partition and change the
clustering key from `range_end` to `(id, range_end)`. We also change the
type of `id` to `timeuuid` and ensure that a new generation always has
the highest `id`. These changes allow efficient clearing of obsolete CDC
generation data, which we need to prevent Raft-topology snapshots from
endlessly growing as we introduce new generations over time.
All this code is protected by an experimental feature flag. It includes
the definition of `CDC_GENERATIONS_V3`. The table is not created unless
the feature flag is enabled.
Fixes #15163
Closes #15319
* github.com:scylladb/scylladb:
system_keyspace: rename cdc_generation_id_v2
system_keyspace: change id to timeuuid in CDC_GENERATIONS_V3
cdc: generation: remove topology_description_generator
cdc: do not create uuid in make_new_generation_data
system_keyspace: make CDC_GENERATIONS_V3 single-partition
cdc: generation: introduce get_common_cdc_generation_mutations
cdc: generation: rename get_cdc_generation_mutations
we should not pass "tar" to tar, otherwise we'd have the following error:
```
tar (child): tar: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
```
as "tar" is not the compressed tarball we want to untar.
Fixes #15328
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15383
Node operations will be integrated with the task manager, and so the node_ops
directory needs to be created. To have access to node-ops-related
classes from the task manager and preserve consistent naming, move
the classes to node_ops/node_ops_data.cc.
The `gossiper::reset_endpoint_state_map` function is supposed to acquire
a lock in order to serialize with `replicate_live_endpoints_on_change`.
The `lock_endpoint_update_semaphore` is called, but its result is a
future - and it is not co_awaited. Therefore, the lock has no effect.
This commit fixes the issue by adding the missing co_await.
Fixes: #15361
Closes #15362
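The failure mode can be illustrated with a toy future-like type. This is not the real Seastar API, just a model of why discarding the future means the lock is never taken:

```cpp
#include <functional>

// Illustrative model of the bug: lock_endpoint_update_semaphore() returns
// a future-like value, and discarding it without awaiting means the
// semaphore unit is never actually acquired.
struct lock_future {
    std::function<void()> acquire;
    void await() { acquire(); } // stand-in for co_await
};

struct gossiper_model {
    int held_units = 0;
    lock_future lock_endpoint_update_semaphore() {
        return lock_future{[this] { ++held_units; }};
    }
};
```

Calling the method without awaiting the result leaves the semaphore untouched, which is exactly why the lock "had no effect" in the buggy code.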
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.
The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.
Fixes: #15371
Many of the gossiper internal functions currently use seastar threads for historical reasons,
but since they are short-lived, the cost of spawning a seastar thread for them is excessive,
and they can be simplified and made more efficient using coroutines.
Closes #15364
* github.com:scylladb/scylladb:
gossiper: reindent do_stop_gossiping
gossiper: coroutinize do_stop_gossiping
gossiper: reindent assassinate_endpoint
gossiper: coroutinize assassinate_endpoint
gossiper: coroutinize handle_ack2_msg
gossiper: handle_ack_msg: always log warning on exception
gossiper: reindent handle_ack_msg
gossiper: coroutinize handle_ack_msg
gossiper: reindent handle_syn_msg
gossiper: coroutinize handle_syn_msg
gossiper: message handlers: no need to capture shared_from_this
gossiper: add_local_application_state: throw internal error if endpoint state is not found
gossiper: coroutinize add_local_application_state
Simplify the function. It does not need to spawn
a seastar thread.
While at it, declare it as private since it's called
only internally by the gossiper (and on shard 0).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike handle_syn_msg, the warning is currently printed only
`if (_ack_handlers.contains(from.addr))`.
Unclear why. It is interesting in any case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The handlers future is waited on under `background_msg`,
which is closed in gossiper::stop, so the instance is
already guaranteed to be kept valid.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the function is called too early, the first get_endpoint_state_ptr
would throw an exception that is later caught and degraded
into a warning.
But that endpoint_state should never disappear after yielding,
so call on_internal_error in that case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
let's provide the default value only if the user does not specify --pkgs;
otherwise the --pkgs option is always ignored.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
unlike `configure.py`, the build system created by CMake does not
share the `SCYLLA-PRODUCT-FILE` across different builds, so we cannot
assume that build/SCYLLA-PRODUCT-FILE exists.
so, in this change, we check $BUILD_DIR/SCYLLA-PRODUCT-FILE first,
and fall back to $BUILD_DIR/../SCYLLA-PRODUCT-FILE. this should work
for both the configure.py and CMake build systems.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we should respect --build-dir if --unified-pkg is not specified,
and deduce the path to the unified pkg from BUILD_DIR.
so, in this change, we deduce the path to the unified pkg from BUILD_DIR
unless --unified-pkg is specified.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this allows build_unified.sh to generate the unified pkg in the specified
directory, instead of assuming the naming convention of build/$mode.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
... and sanitize the future used on stop.
The loop in question is now started in .start(), but all callers now
construct the manager late enough, so the loop spawning can be moved.
This also calls for renaming the future member of the class and allows
making it a regular, not shared, future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the starting and stopping of b.m. is spread over main(). Keep them
close to each other.
Another subtlety here is that calling b.m.::start() can only be done
after joining the cluster, because this start() spawns the replay loop
which, in turn, calls token_metadata::count_normal_token_owners() and if
the latter returns zero, the b.m. code uses it as a fraction denominator
and crashes.
With the above in mind, cql_test_env should start batchlog manager after
it "joins the ring" too. For now it doesn't make any difference, but
next patch will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the only caller of it is the batchlog manager itself. It
checks for the shard-id to be zero, calls the method, then the method
asserts that it's run on shard-0.
Moving the check into the method removes the need for assertion and
makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently drain() is called twice -- first from
storage_service::drain() (on shutdown), second via
batchlog_manager::stop(). The routine is unintentionally re-entrant,
because:
- there is an explicit check that avoids aborting the abort source twice
- breaking the semaphore can be done multiple times
- co_await-ing the _started future works because the future is shared
That's not extremely elegant; it's better to make drain() bail out early
if it was already called.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
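The "bail out early" idea reduces to a guard flag. A minimal sketch with simplified names (the real drain() is a coroutine touching the abort source, semaphore, and replay loop):

```cpp
// Sketch of the fix: drain() records that it has already run, so a second
// caller (e.g. batchlog_manager::stop() after storage_service::drain())
// returns immediately instead of re-running the shutdown steps.
struct batchlog_manager_model {
    bool drained = false;
    int shutdown_steps_run = 0;

    void drain() {
        if (drained) {
            return; // already drained, nothing to do
        }
        drained = true;
        ++shutdown_steps_run; // abort source, break semaphore, join loop...
    }
};
```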
in this series, the packaging of the tools modules is improved:
- package cqlsh also, as cqlsh should be redistributed as a part of the unified package
- use ${arch} in the postfix of the python3 package, as the python3 package is not architecture-independent
- set the version with a tilde for `Scylla_VERSION`, so it can be reused elsewhere
Refs #15241
Closes #15369
* github.com:scylladb/scylladb:
build: cmake: build cqlsh as a submodule
build: cmake: always use the version with tilde
build: cmake: build python3 dist tarball with arch postfix
build: cmake: use the default comment message
This function has practically returned true since inception.
In d38deef499
it started using messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.
So there's no point in calling it, since we can assume
that `uses_host_id` is true for all versions.
Closes #15343
* github.com:scylladb/scylladb:
storage_service: fixup indentation after last patch
gossiper: get rid of uses_host_id
since we always use a tilde ("~") in the version number,
let's just cache it as an internal variable in CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that `configure.py` always generates the python3 dist tarball with
the ${arch} postfix, let's mirror this behavior, as `build_unified.sh`
uses this naming convention.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it turns out "Generating submodule python3 in python3" is not
as informative as the default one:
"/home/kefu/dev/scylladb/tools/python3/build/scylla-python3-5.4.0~dev-0.20230908.1668d434e458.noarch.tar.gz"
so let's drop the "COMMENT" argument.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Changing the second value of cdc_generation_id_v2 from uuid_type
to timeuuid_type made the name of cdc_generation_id_v2 unsuitable
because it does not match cdc::generation_id_v2 anymore.
We change the type of IDs in CDC_GENERATIONS_V3 to timeuuid to
give them a time-based order. We also change how we initialize
them so that the new CDC generation always has the highest ID.
This is the last step to enabling the efficient clearing of
obsolete CDC generation data.
Additionally, we change the types of current_cdc_generation_uuid,
new_cdc_generation_data_uuid and the second values of the elements
in unpublished_cdc_generations to timeuuid, so that they match id
in CDC_GENERATIONS_V3.
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
In a future commit, we change how we initialize the uuid of the
new CDC generation in the Raft-based topology. This forces us to
move this initialization out of the make_new_generation_data
function shared between Raft-based and gossiper-based topologies.
We also rename make_new_generation_data to
make_new_generation_description since it only returns
cdc::topology_description now.
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.
After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
In the following commit, we implement the
get_cdc_generation_mutations_v3 function very similar to
get_cdc_generation_mutations_v2. The only differences in creating
mutations between CDC_GENERATIONS_V2 and CDC_GENERATIONS_V3 are:
- a need to set the num_ranges cell for CDC_GENERATIONS_V2,
- different partition keys,
- different clustering keys.
To avoid code duplication, we introduce
get_common_cdc_generation_mutations, which does most of the work
shared by both functions.
this change allows CMake to build the dist tarball for a certain build.
Refs https://github.com/scylladb/scylladb/issues/15241
Closes #15352
* github.com:scylladb/scylladb:
build: cmake: add packaging support
build: cmake: enable build of seastar/apps/iotune
The test-case creates an S3-backed ks, populates it with a table and data,
then forces a flush to make sstables appear on the backend. Then it
updates the registry by marking all the objects as 'removing' so that on
the next boot they will be garbage-collected.
After reboot, it checks that the table is "empty" and also validates that the
backend doesn't have the corresponding objects on board for real.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When booting, there can be dangling entries in the sstables registry as well
as objects on the storage itself. This patch makes the S3 lister list
those entries and then kick the s3_storage to remove the corresponding
objects. At the end, the dangling entries are removed from the registry.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lister method is going to list the dangling objects and then call
storage to actually wipe them (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the directory instance only creates the lister, but the lister is
unaware of exact object manipulations. The storage is, so create it
too; it's going to be used by the garbage collector in the next patches.
This change also needs fixing the way cql_test_env is configured for
schema_change_test. There are cases that try to pick up a keyspace with the S3
storage option from the pre-created sstables, and populating those
needs some (even empty) object storage endpoint to be provided.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently minio starts with a bucket that has public anonymous access. Respectively, all tests use unsigned S3 requests. That was done for simplicity, and it's better to apply some policy to the bucket and, consequently, make tests sign their requests.
Other than the obvious benefit that we test request signing in unit tests, another goal of this PR is to make it possible to simulate and test various error paths locally, e.g. #13745 and #13022.
Closes #14525
* github.com:scylladb/scylladb:
test/s3: Remove AWS_S3_EXTRA usage
test/s3: Run tests over non-anonymous bucket
test/minio: Create random temp user on start
code: Rename S3_PUBLIC_BUCKET_FOR_TEST
This is a workaround for the flakiness of the test where INSERT
statements following the rolling restart fail with "No host available"
exception. The hypothesis is that those INSERTS race with driver
reconnecting to the cluster and if INSERTs are attempted before
reconnection is finished, the driver will refuse to execute the
statements.
The real fix should be in the driver to join with reconnections but
before that is ready we want to fix CI flakiness.
Refs #14746Closes#15355
instead of fabricating an `/etc/passwd` manually, we can just
leave it to podman to add an entry to `/etc/passwd` in the container,
as podman allows us to map the user's account to the same UID in the
container. see
https://docs.podman.io/en/stable/markdown/options/userns.container.html.
this is not only a cosmetic change, it also avoids the permission-denied
failure when accessing `/etc/passwd` in the container when selinux is
enabled. without this change, we would otherwise need to add the
selinux label to the bind volume with the ':Z' option to address failures
like:
```
type=AVC msg=audit(1693449115.261:2599): avc: denied { open } for pid=2298247 comm="bash" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
type=AVC msg=audit(1693449115.263:2600): avc: denied { open } for pid=2298249 comm="id" path="/etc/passwd" dev="tmpfs" ino=5931 scontext=system_u:system_r:container_t:s0:c252,c259 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file permissive=0
```
found in `/var/log/audit/audit.log`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15230
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row, which stem from the fact that the resulting reconcilable_result will be large:
1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque.
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. The coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s.
3. Too-large repair mutations. If reconciliation works on large pages, repair may fail due to too large a mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.
This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.
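The core idea of respecting the memory limit even for dead rows can be sketched as a simplified paging loop. This is an illustrative model, not Scylla's reader code:

```cpp
#include <cstddef>
#include <vector>

struct row { bool live; std::size_t bytes; };

// Simplified sketch of the fix: the page is closed once the accounted
// memory reaches the limit, even if no live row has been seen yet,
// instead of scanning through an unbounded run of dead rows.
std::vector<row> fill_page(const std::vector<row>& input, std::size_t limit) {
    std::vector<row> page;
    std::size_t used = 0;
    for (const auto& r : input) {
        if (used >= limit) {
            break; // memory accounter limit reached: end the page here
        }
        page.push_back(r);
        used += r.bytes;
    }
    return page;
}
```

With a long run of dead rows, each page stays small and bounded, so serialization costs per page are bounded too, which is why the reconciling read above produces many small pages instead of two huge ones.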
My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of the query.
Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```
Non-reconciling queries have almost identical duration (a few-millisecond variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than a quarter of the before duration).
Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111
Closes#14923
* github.com:scylladb/scylladb:
test/topology_custom: add test_read_repair.py
replica/mutation_dump: detect end-of-page in range-scans
tools/scylla-sstable: write: abort parser thread if writing fails
test/pylib: add REST methods to get node exe and workdir paths
test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
service/storage_proxy: add trace points for the actual read executor type
service/storage_proxy: add trace points for read-repair
storage_proxy: Add more trace-level logging to read-repair
database: Fix accounting of small partitions in mutation query
database, storage_proxy: Reconcile pages with no live rows incrementally
scylla redistributes iotune, so let's enable the related build
options, so that we can build iotune on demand.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When the `nodetool disablebinary` command executes, its handler aborts listening sockets, shuts down all client connections _and_ (!) then waits for the connections to stop existing. Effectively, the command tries to make sure that no activity initiated by a CQL query continues, even though the client would never see its result (client sockets are closed).
This sometimes makes the disablebinary command hang for a long time, which is not really nice. The proposal is to wait for the connections to terminate in the background. So once the disablebinary command exits, what's guaranteed is that all client connections are aborted and new connections are not admitted, but some activity started by them may still be running (e.g. up until `nodetool drain` is issued). Driver-side sockets won't get the queries' results anyway.
The behavior of `disablebinary` is not documented wrt whether it should wait for CQL processing to stop or not, so technically we're not breaking anything. However, it may turn out to be a disruptive change, and some setups may behave differently after it.
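A minimal model of the proposed behavior, using made-up `server`/`connection` types rather than the real generic_server classes: stop() aborts connections synchronously, and waiting for their full termination is deferred to a background future that is picked up later:

```cpp
#include <future>
#include <vector>

// Toy stand-ins for the real generic_server classes (names are made up).
struct connection { bool aborted = false; bool terminated = false; };

struct server {
    std::vector<connection*> conns;
    std::future<void> background_wait; // picked up later, e.g. on drain/shutdown

    void stop() {
        // Synchronous part: abort every connection so no new work is admitted.
        for (auto* c : conns) {
            c->aborted = true;
        }
        // Asynchronous part: waiting for full termination happens in the
        // background, so stop() itself returns promptly.
        background_wait = std::async(std::launch::deferred, [this] {
            for (auto* c : conns) {
                c->terminated = true; // stand-in for awaiting real teardown
            }
        });
    }
};
```

After stop() returns, connections are guaranteed to be aborted, but their teardown may still be in flight until `background_wait` is awaited.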
refs: #14031
refs: #14711
Closes#14743
* github.com:scylladb/scylladb:
test/cql-pytest: Add enable|disable-binary test case
test.py: Add suite option to auto-dirty cluster after test
test/pylib: Add nodetool enable|disable-binary commands
transport: Shutdown server on disablebinary
generic_server: Introduce shutdown()
generic_server: Decouple server stopped from connection stopped
transport/controller: Coroutinize do_stop_server()
transport/controller: Coroutinize stop_server()
The test checks that `nodetool disablebinary` makes subsequent queries
fail and `nodetool enablebinary` lets clients establish new
connections.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
ScyllaCluster can be marked as 'dirty', which means that the cluster is
in an unusable state (after a test) and shouldn't be re-used by other tests
launched by test.py. For now this is only implemented via the cluster
manager class, which is only available for topology tests.
Add a less flexible shortcut for cql-pytest suites via suite.yaml marking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... and do the real "sharded::stop" in the background. On node shutdown
we need to pick up all dangling background stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method waits for listening sockets to stop listening and aborts the
connected sockets, but doesn't wait for the established connections to
finish processing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _stopped future resolves when all "sockets" stop -- listening and
connected ones. Future patches will need to wait for listening sockets
to stop separately from connected ones.
Rename `_stopped` to reflect what it is now, while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's better to pass a disengaged optional when
the caller doesn't have the information, rather than
passing the default dc_rack location, so that the latter
will never implicitly override a known endpoint dc/rack location.
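The pattern can be sketched as follows (names are illustrative, not the real locator code): a disengaged optional means "caller doesn't know", so a default location never silently wins over a known one:

```cpp
#include <optional>
#include <string>

// Illustrative sketch (names are hypothetical): a disengaged optional
// means "caller doesn't know", so a default location can never
// implicitly override a known endpoint dc/rack.
struct dc_rack { std::string dc, rack; };

inline dc_rack resolve(std::optional<dc_rack> hint, dc_rack known) {
    return hint ? *hint : known; // only an engaged optional overrides
}
```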
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15300
This function has practically returned true since inception.
In d38deef499 it started using
messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.
So there's no point calling it, since we can assume
that `uses_host_id` is true for all versions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scylladb uses seastar's coding-style.md, so let's adhere to it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15345
this series adds `--node-exporter-dir` and `--build-dir` options to `create-relocatable-package.py`. this enables us to create a relocatable package from arbitrary build directories.
Refs #15241
Closes#15299
* github.com:scylladb/scylladb:
create-relocatable-package.py: add --node-exporter-dir option
build: specify the build dir instead mode
so we can point `debian_files_gen.py` to a build directory other than
'build', and can optionally use another output directory. this would
help to reduce the number of "magic numbers" in our build system.
Refs https://github.com/scylladb/scylladb/issues/15241
Closes#15282
* github.com:scylladb/scylladb:
dist/debian: specify debian/* file encodings
dist/debian: wrap lines whose length exceeds 100 chars
dist/debian: add command line option for builddir
dist/debian: modularize debian_files_gen.py
The current read-loop fails to detect end-of-page: if the query
result builder cuts the page, it will just proceed to the next
partition. This results in distorted query results, as the result
builder will request the consumption to stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
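The shape of the fix can be modeled with a toy scan loop (all names are made up, not the real replica code): the builder may cut the page after any row, and the loop must check for that before advancing to the next partition:

```cpp
#include <cstddef>
#include <vector>

// Toy model of the fixed loop: the result builder may cut the page
// after any row, and the scan must notice that before moving on to
// the next partition.
struct page_builder {
    std::size_t rows_left;
    bool page_cut = false;
    void consume_row() {
        if (rows_left > 0 && --rows_left == 0) {
            page_cut = true; // page limit reached
        }
    }
};

// Returns how many partitions were visited before the page ended.
inline std::size_t scan(const std::vector<std::size_t>& partition_sizes, page_builder& b) {
    std::size_t visited = 0;
    for (auto rows : partition_sizes) {
        ++visited;
        for (std::size_t i = 0; i < rows && !b.page_cut; ++i) {
            b.consume_row();
        }
        if (b.page_cut) {
            break; // the fix: detect end-of-page instead of proceeding
        }
    }
    return visited;
}
```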
A unit test reproducing the bug was also added.
Currently, if writing the sstable fails, e.g. because the input data is
out-of-order, the json parser thread hangs because its output is no
longer consumed. This results in the entire application just freezing.
Fix this by aborting the parsing thread explicitly in the
json_mutation_stream_parser destructor. If the parser thread exited
successfully, this will be a no-op, but on the error path it will
ensure that the parser thread doesn't hang.
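The mechanism can be sketched with std::thread (the real code uses a seastar thread and the json parser; every name here is invented): the worker loops until aborted, and the destructor raises the abort flag before joining, so a stalled consumer can no longer leave the thread hanging:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch only: the worker loops until aborted; the destructor raises
// the abort flag before joining, so an error path that stops consuming
// the parser's output can no longer leave the thread hanging forever.
class parser_thread {
    std::atomic<bool> _abort{false};
    std::atomic<bool>& _exited;
    std::thread _thread;
public:
    explicit parser_thread(std::atomic<bool>& exited)
        : _exited(exited)
        , _thread([this] {
            while (!_abort.load()) {
                // stand-in for "parse next chunk / wait for the consumer"
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            _exited.store(true);
        }) {}

    ~parser_thread() {
        _abort.store(true); // no-op if the thread already finished
        _thread.join();     // guaranteed to return now
    }
};
```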
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
The partition key size was ignored by the accounter, as well as the
partition tombstone. As a result, a sequence of partitions with just
tombstones would be accounted as taking no memory, causing the page
size limiter not to kick in.
Fix by accounting the real size of accumulated frozen_mutation.
Also, break pages across partitions even if there are no live rows.
The coordinator can handle it now.
Refs #7933
Currently, a mutation query on the replica side will not respond with a
result that doesn't contain at least one live row. This causes problems
if there are many dead rows or partitions before we reach a live row,
which stems from the fact that the resulting reconcilable_result will be large:
* Large allocations. Serialization of reconcilable_result causes large
allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
side and on the coordinator side causes reactor stalls. This impacts
not only the query at hand. For 1M dead rows, freezing takes 130ms,
unfreezing takes 500ms. Coordinator does multiple freezes and
unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
may fail due to too large mutation size. 1M dead rows is already too
much: Refs #9111.
This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.
My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):
Node1: 1 live row, 1M dead rows
Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of
the query.
Before:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
After:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
Non-reconciling queries have almost identical duration (a few-millisecond
variation can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than a quarter of the before
duration).
Refs #7929
Refs #3672
Refs #7933
Fixes#9111
In the following commits, we modify the CDC_GENERATIONS_V3 schema
to enable efficient clearing of obsolete CDC generation data.
These modifications make the current get_cdc_generation_mutations
work only for the CDC_GENERATIONS_V2 schema, and we need a new
function for CDC_GENERATIONS_V3, so we add the "_v2" suffix.
actually, we never use its output in our workflow, and the
output is distracting when building the package. so, in this
change, let's print it only on demand. this feature is preserved
just in case some of us would want to use this script for getting
the version number string.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15327
if the user fails to set "CMAKE_BUILD_TYPE", it would be empty, and
CMake would fail with confusing error messages like
```
CMake Error at CMakeLists.txt:21 (list):
list sub-command FIND requires three arguments.
CMake Error at CMakeLists.txt:27 (include):
include could not find requested file:
mode.
```
so, in this change
* set the default CMAKE_BUILD_TYPE to "Release"
* quote ${CMAKE_BUILD_TYPE} when searching for it
  in the allowed build type list.
this should address the issues above.
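a hedged sketch of both fixes (the variable names `allowed_build_types` and `build_type_index` are illustrative, not necessarily the ones in the actual CMakeLists.txt):

```cmake
# 1. default CMAKE_BUILD_TYPE to "Release" when the user leaves it unset
if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE "Release" CACHE STRING "Build type" FORCE)
endif()

# 2. quote the value, so an empty string is still a valid list(FIND) argument
list(FIND allowed_build_types "${CMAKE_BUILD_TYPE}" build_type_index)
if(build_type_index EQUAL -1)
  message(FATAL_ERROR "unknown CMAKE_BUILD_TYPE: ${CMAKE_BUILD_TYPE}")
endif()
```

without the quotes, an empty CMAKE_BUILD_TYPE makes `list(FIND ...)` receive only two arguments, producing the confusing error above.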
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15326
The local node's dc:rack pair is cached in the system keyspace on start. However, most other code doesn't need it, as it gets dc:rack from topology or directly from the snitch. There are a few places left that still mess with the system keyspace cache, but they are easy to patch. So after this patch, all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.
Closes#15280
* github.com:scylladb/scylladb:
system_keyspace: Don't require snitch argument on start
system_keyspace: Don't cache local dc:rack pair
system_keyspace: Save local info with explicit location
storage_service: Get endpoint location from snitch, not system keyspace
snitch: Introduce and use get_location() method
repair: Local location variables instead of system keyspace's one
repair: Use full endpoint location instead of datacenter part
A reviewer noted that test_update_expression_list_append_non_list_arguments
has too much code duplication - the same long API call to run
"SET a = list_append(...)" was repeated many times.
So in this patch we add a short inner function "try_list_append" to
avoid this duplication.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes: #15298
- Adds type for each option.
- Filters out unused / invalid values, moves them to a separate section.
- Adds the term "liveness" to the glossary.
- Removes unused and invalid properties from the docs.
- Updates to the latest version of pyaml.
docs: rename config template directive
Closes#15164
in this series, we try to improve `unified-installer.rst`:
- encourage the user to install a smaller package
- run `./install.sh` directly instead of relying on `sh` pointing to `bash`
Closes#15325
* github.com:scylladb/scylladb:
doc: run install.sh directly
doc: install headless jdk in sample command line
Find progress of repair tasks based on the number of ranges
that have been repaired.
Fixes: [#1156](https://github.com/scylladb/scylla-enterprise/issues/1156).
Closes#14698
* github.com:scylladb/scylladb:
test: repair tasks test
repair: add methods making repair progress more precise
tasks: make progress related methods virtual
repair: add get_progress method to shard_repair_task_impl
repair: add const noexcept qualifiers to shard_repair_task_impl::ranges_size()
repair: log a name of a particular table repair is working on
tasks: delete move and copy constructors from task_manager::task::impl
compaction_done() returns a ready future before compaction_task_executor::run_compaction()
is called, even though the compaction did not start.
Make compaction_done() private and add a comment to warn against
incorrect usage.
Before compaction_task_executor::do_run is called, the executor may
already be aborted. Check if compaction was stopped and set
_compaction_done to an exceptional future.
For compaction_task_executors, unlike for all other task manager
tasks, the run method does not encompass the operations performed in the
scope of a task, but only waits until the shared_future connected with
the operations is resolved.
Apart from breaking task manager task conventions, such a run method
must consider all corner cases so as not to break task manager or
compaction manager functionality.
To fix existing bugs and prevent further ones related to task manager
and compaction manager coexistence, call perform_task inside the
run method and wait for it in the standard way.
Executors that are not going to be reflected in the task manager
call perform_task the old way.
SIGSEGV was caught during tablet streaming, and the reason was
that storage_service::_group0 (via set_group0()) is only set on
shard 0, therefore when streaming ran on any other shard,
it tried to dereference garbage, which resulted in the crash.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15307
before this change, filesystem_storage::open() reuses
`sstable::make_component_file_writer()` to create the
temporary toc, and renames the temporary toc to the
real TOC when sealing the sstable.
but this prevents us from reusing filesystem_storage in
yet another storage backend, as the
1. create temporary
2. rename temporary to toc
dance only applies to filesystem_storage. when
filesystem_storage calls into sstable, it calls `sst.make_component_file_writer()`,
which in turn calls `_storage->make_component_sink()`.
but at this moment, `_storage` is not necessarily `filesystem_storage`
anymore. it could be a wrapper around `filesystem_storage`
which is not aware of the create-rename dance, and could do
a lot more than create a temporary file when asked to
"make_component_sink()".
if we really want to go this way by reusing sstable's API
in `filesystem_storage` to create a temporary toc, we will
have to rename whatever temporary toc component is created
by the wrapper backend to the toc with the seal() func. but
again, this rename op is only implemented in the
filesystem_storage backend. mirroring this operation in
the wrapper backend does not make sense at all -- it
should not have to be aware of filesystem_storage's internals.
so in this change, instead of reusing
`sstable::make_component_file_writer()`, we just inline
its implementation in filesystem_storage to avoid this
problem. this is also an improvement from the design
perspective, as the storage should not call into the
higher abstraction -- sstable.
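the create/rename dance itself can be sketched like this (the component names follow the sstable convention, but the helper functions here are invented, not the real filesystem_storage code):

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Sketch of the dance that is private to filesystem_storage: the TOC is
// first written under a temporary name, and only renamed to the real
// one on seal, so a half-written sstable is never visible.
inline void write_temporary_toc(const fs::path& dir, const std::string& contents) {
    std::ofstream out(dir / "TemporaryTOC.txt");
    out << contents; // closed when 'out' goes out of scope
}

inline void seal(const fs::path& dir) {
    // An atomic rename makes the sstable visible only when complete.
    fs::rename(dir / "TemporaryTOC.txt", dir / "TOC.txt");
}
```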
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14443
seastar has deprecated the overload which accepts `server_name`,
let's use the one which accepts `tls::tls_options`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15324
Currently, the topology coordinator has the
`topology::transition_state::publish_cdc_generation` state responsible
for publishing the already created CDC generations to the user-facing
description tables. This process cannot fail as it would cause some CDC
updates to be missed. On the other hand, we would like to abort the
`publish_cdc_generation` state when bootstrap aborts. Of course, we
could also wait until handling this state finishes, even in the case of
the bootstrap abort, but that would be inefficient. We don't want to
unnecessarily block topology operations by publishing CDC generations.
The solution proposed by this PR is to remove the
`publish_cdc_generation` state completely and introduce a new background
fiber of the topology coordinator -- `cdc_generation_publisher` -- that
continually publishes committed CDC generations.
Apart from introducing the CDC generation publisher, we add
`test_cdc_generation_publishing.py` that verifies its correctness and we
adapt other CDC tests to the new changes.
Fixes#15194
Closes#15281
* github.com:scylladb/scylladb:
test: test_cdc: introduce wait_for_first_cdc_generation
test: move cdc_streams_check_and_repair check
test: add test_cdc_generation_publishing
docs: remove information about publish_cdc_generation
raft topology: introduce the CDC generation publisher
system_keyspace: load unpublished_cdc_generations to topology
raft topology: mark committed CDC generations as unpublished
raft topology: add unpublished_cdc_generations to system.topology
Add tests for gossiper/endpoint/live and gossiper/endpoint/down
which run only in release mode.
Enable test_remove_node_with_concurrent_ddl and fix the types and
variable names used by it, so that they can be reused in the gossiper
test.
Fixes: #15223.
Closes#15244
* github.com:scylladb/scylladb:
test: topology: add gossiper test
test: fix types and variable names in wait_for_host_down
ClangBuildAnalyzer reports cql3/cql_statement.hh as being one of the
most expensive header files in the project - being included (mostly
indirectly) in 129 source files, and costing a total of 844 CPU seconds
of compilation.
This patch is an attempt, only *partially* successful, to reduce the
number of times that cql_statement.hh is included. It succeeds in
lowering the number 129 to 99, but not less :-( One of the biggest
difficulties in reducing it further is that query_processor.hh includes
a lot of templated code, which needs stuff from cql_statement.hh.
The solution should be to un-template the functions in
query_processor.hh and move them from the header to a source file, but
this is beyond the scope of this patch and query_processor.hh appears
problematic in other respects as well.
Unfortunately the compilation speedup by this patch is negligible
(the `du -bc build/dev/**/*.o` metric shows less than 0.01% reduction).
Beyond the fact that this patch only removes 30% of the inclusions of
this header, it appears that most of the source files that no longer
include cql_statement.hh after this patch, included anyway many of the
other headers that cql_statement.hh included, so the saving is minimal.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#15212
strictly speaking, `sh` is not necessarily bash, while `install.sh`
is written in the bash dialect, and it errors out if it is not executed
with bash. also, we don't need to add "-x" when running the script; if
we have to, we should add it in `install.sh`, not ask the user to add
this option. finally, `install.sh` is executable with a shebang line
using bash, so we can just execute it.
so, in this change, we just launch this script in the command line
sample.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in comparison with java-11-openjdk, java-11-openjdk-headless does not
offer audio and video support, and has fewer dependencies. for instance,
java-11-openjdk depends on the X11 libraries, and it also provides
icons representing the JDK. but since scylla is a server-side
application, we don't expect the user to run a desktop on it, so there
is no need to support audio and video.
in this change, we just suggest a "smaller" package, which is
actually also a dependency of java-11-openjdk.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After introducing the CDC generation publisher,
test_cdc_log_entries_use_cdc_streams could (at least in theory)
fail by accessing system_distributed.cdc_streams_descriptions_v2
before the first CDC generation has been published.
To avoid flakiness, we simply wait until the first CDC generation
is published in a new function -- wait_for_first_cdc_generation.
The part of test_topology_ops that tests the
cdc_streams_check_and_repair request could (at least in theory)
fail on
`assert(len(gen_timestamps) + 1 == len(new_gen_timestamps))`
after introducing the CDC generation publisher because we can
no longer assume that all previously committed CDC generations
have been published before sending the request.
To prevent flakiness, we move this part of the test to
test_cdc_generations_are_published. This test allows for ensuring
that all previous CDC generations have been published.
Additionally, checking cdc_streams_check_and_repair there is
simpler and arguably fits the test better.
We add two test cases that test the new CDC generation publisher
to detect potential bugs like incorrect order of publications or
not publishing some generations at all.
The purpose of the second test case --
test_multiple_unpublished_cdc_generations -- is to enforce and test
a scenario when there are multiple unpublished CDC generations at
the same time. We expect that this is a rare case. The main fiber
of the topology coordinator would have to make much more progress
(like finishing two bootstraps) than the CDC generation publisher
fiber. Since multiple unpublished CDC generations might never
appear in other tests but could be handled incorrectly, having
such a test is valuable.
Currently, the topology coordinator has the
topology::transition_state::publish_cdc_generation state
responsible for publishing the already created CDC generations
to the user-facing description tables. This process cannot fail
as it would cause some CDC updates to be missed. On the other
hand, we would like to abort the publish_cdc_generation state when
bootstrap aborts. Of course, we could also wait until handling this
state finishes, even in the case of the bootstrap abort, but that
would be inefficient. We don't want to unnecessarily block topology
operations by publishing CDC generations.
The solution is to remove the publish_cdc_generation state
completely and introduce a new background fiber of the topology
coordinator -- cdc_generation_publisher -- that continually
publishes committed CDC generations.
The implementation of the CDC generation publisher is very similar
to the main fiber of the topology coordinator. One noticeable
difference is that we don't catch raft::commit_status_unknown,
which is handled by raft_group0_client::add_entry.
Note that this modification changes the Raft-based topology a bit.
Previously, the publish_cdc_generation state had to end before
entering the next state -- write_both_read_old. Now, committed
CDC generations can theoretically be published at any time.
Although it is correct because the following states don't depend on
publish_cdc_generation, it can cause problems in tests. For example,
we can't assume now that a CDC generation is published just because
the bootstrap operation has finished.
We extend service::topology with the list of unpublished CDC
generations and load its contents from system.topology. This step
is the last one in making unpublished CDC generations accessible
to the topology coordinator.
Note that when we load unpublished_cdc_generations, we don't
perform any sanity checks, unlike for current_cdc_generation_uuid.
Every unpublished CDC generation was the current generation once,
and we checked it at that moment.
We add committed CDC generations to unpublished_cdc_generations
so that we can load them to topology and properly handle them
in the following commits.
In the following commits, we replace the
topology::transition_state::publish_cdc_generation state with
a background fiber that continually publishes committed CDC
generations. To make these generations accessible to the
topology coordinator, we store them in the new column of
system.topology -- unpublished_cdc_generations.
If a test isn't going to use the task manager or isn't interested in
the statuses of finished tasks, then keeping them in memory
for some time (currently 10s by default) after they are finished
is a waste of memory.
Set default task_ttl value to zero. It can be changed by setting
--task-ttl-in-seconds or through rest api (/task_manager/ttl).
In conf/scylla.yaml set task-ttl-in-seconds to 10.
Closes#15239
Some tests use non-threaded do_with_cql_env() and wrap the inner lambda with seastar::async(). The cql env already provides a helper for that
Closes#15305
* github.com:scylladb/scylladb:
cql_query_test: Fix indentation after previous patch
cql_query_test: Use do_with_cql_env_thread() explicitly
Now that the keys and region can be configured with "standard"
environment variables, the old custom one can be removed. No automation
uses it; it was purely support for manual testing of a client against
AWS's S3 server.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently minio applies anonymous public policy for the test bucket and
all tests just use unsigned S3 requests. This patch generates a policy
for the temporary minio user and removes the anon public one. All tests
are updated respectively to use the provided key:secret pair.
The use-https bit is off by default as minio still starts with plain
http. That's OK for now, all tests are local and have no secret data
anyway
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code was supposed to be moved into
`mutate_live_and_unreachable_endpoints`
in 2c27297dbd
but it looks like the original statements were left
in place outside the mutate function.
This patch just removes the stale code since the required
logic is already done inside `mutate_live_and_unreachable_endpoints`.
Fixes scylladb/scylladb#15296
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15304
The user is going to have rights to access the test bucket. For now,
just create one and export it to the tests via the environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The bucket is going to stop being public; rename the env variable in
advance to make the essential patch smaller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These metrics mimic the existing IO ones -- total number of read
operations, total number of read bytes and total read delay. And the
same for writing.
This patch makes no distinction between writing an object with a plain
PUT vs putting it with multipart uploading. Instead, it "measures"
individual IO writes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the http client exports several "numbers" regarding the
number of transport connections the client uses. This patch exports
those via the S3 client's per-sched-group metrics and prepares the
ground for more metrics in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Another make_request() helper that does mostly the same will appear
later. This split will help avoid code duplication.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The http-client is per-sched-group. The next patch will need to keep
metrics per-sched-group too, and this sched-group -> http-client map is
a good place to put them. A wrapping struct will allow extending it
with metrics.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The stats are about the object, not about the client, so it's better if
they live in namespace scope. This also avoids conflicts with the
client stats that will be reported as metrics (later patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This serves two purposes. First, it fixes a potential use-after-move,
since the bufs are moved into the lambda and bufs.size() is called in
the same statement with no defined evaluation order.
Second, this makes the 'size' variable alive up to the time the request
is complete, thus making it possible to update stats with it (later patch).
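The hazard and the fix can be illustrated in isolation (function names invented; the real code is in the S3 client): C++ leaves the evaluation order of function arguments unspecified, so `f(std::move(bufs), bufs.size())` may read the size of an already-moved-from vector:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustration of the hazard: in a call like
//   do_upload(std::move(bufs), bufs.size())
// bufs.size() may be evaluated after the move.
inline std::size_t do_upload(std::vector<std::string> bufs) {
    return bufs.size(); // stand-in for actually sending the buffers
}

inline std::size_t safe_upload(std::vector<std::string> bufs) {
    auto size = bufs.size();             // read the size before any move
    auto sent = do_upload(std::move(bufs));
    // 'size' stays valid until the request completes, so it can also be
    // used to update stats afterwards.
    return sent == size ? size : 0;
}
```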
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The Alternator tests can run against HTTPS - namely when using
test/alternator/run with the "--https" option (local Alternator
configured with HTTPS) or "--aws" option (DynamoDB, using HTTPS).
In some cases we make these HTTPS requests with verify=False, to avoid
checking the SSL certificates. E.g., this is necessary for Alternator
with a self-signed certificate. Unfortunately, the urllib3 library adds
an ugly warning message when SSL certificate verification is disabled.
In the past we tried to disable these warnings, using the documented
urllib3.disable_warnings() function, but it didn't help. It turns out
that pytest has its own warning handling, so to disable warnings in
pytest we must say so in a special configuration parameter in pytest.ini.
So in this patch, we drop the disable_warnings call from conftest.py
(where it didn't help), and instead put a similar declaration in
pytest.ini. The disable_warnings call in the test/alternator/run
script needs to remain - it is run outside pytest, so pytest.ini
doesn't affect it.
After this patch, running test/alternator/run with --https or --aws
finishes without warnings, as desired.
Fixes#15287
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#15292
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.
Prepare for extending the token_metadata interface
to provide host_id in update_topology.
We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).
Closes#15297
* github.com:scylladb/scylladb:
locator: topology: is_configured_this_node: delete spurious semicolumn
locator: topology: is_configured_this_node: compare host_id first
Some tests use non-threaded do_with_cql_env() and wrap the inner lambda
with seastar::async(). The cql env already provides a helper for that
Indentation is deliberately left broken until next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we assume that node_exporter artifacts are
always located under `build/node_exporter`. but this might not
hold anymore if we want to have a self-contained build, in the sense
that different builds do not share the same set of node_exporter
artifacts. this could be a waste, as the node_exporter artifacts
are identical across different builds, but it makes things
a lot simpler -- different builds do not have to hardwire to
a certain directory.
so, a new option is added to `create-relocatable-package.py`, this
allows us to specify the directory where node_export artifacts
are located.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of specifying the build "mode", and assuming that
the build directory is always located at "build/${mode}", specify
the build directory explicitly. this allows us to use
`create-relocatable-package.py` to package artifacts built
in a build directory whose path does not comply with the
naming convention, for instance, we might want to build
scylla in `build/yet-another-super-feature/release`.
so, in this change, we trade `--mode` for an option named
`--build-dir` and update `configure.py` accordingly.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is thrown in place
of an `assert()`, which is supposed to verify programming
invariants, not to identify corrupted data files.
Fixes #15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15264
Passing the gate_closed_exception to the task promise
ends up with an abandoned exception since no one is waiting
for it.
Instead, enter the gate when the task is made,
so that make_task fails if the gate is already closed.
Fixes scylladb/scylladb#15211
In addition, this series adds a private abort_source for each task_manager module
(chained to the main task_manager::abort_source) and abort is requested on task_manager::module::stop().
Gate holding in compaction_manager is hardened,
and the series makes sure to stop compaction_manager and task_manager in sstable_compaction_test cases.
Closes #15213
* github.com:scylladb/scylladb:
compaction_manager: stop: close compaction_state:s gates
compaction_manager: gracefully handle gate close
task_manager: task: start: fixup indentation
task_manager: module: make_task: enter gate when the task is created
task_manager: module: stop: request abort
task_manager: task::impl: subscribe to module abort_source
test: compaction_manager_stop_and_drain_race_test: stop compaction and task managers
test: simple_backlog_controller_test: stop compaction and task managers
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.
Prepare for extending the token_metadata interface
to provide host_id in update_topology.
We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Improved the coverage of the tests for the list_append() function
in UpdateExpression - test that if one of its arguments is not a list,
including a missing attribute or item, it is reported as an error as
expected.
The new tests pass on both Alternator and DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15291
so we can point debian_files_gen.py to a build directory other than
'build', and can optionally use a different output directory. this would
help reduce the number of hard-coded paths in our build system.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
restructure the script into functions, prepare for the change which
allows us to specify the build directory when preparing the "debian"
packaging recipes.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of flattening the logic into the script, let's structure it into functions, so they can be reused and the script is more maintainable.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15242
* github.com:scylladb/scylladb:
build: early return when appropriate
build: extract generate_compdb() out
Right now, the function allows passing the path to a file as a seastar::sstring,
which is then converted to std::filesystem::path -- implicitly, from the caller's perspective.
However, the function performs I/O, and there is no reason to accept any type
other than std::filesystem::path, especially because the conversion is straightforward:
callers can perform it on their own.
This commit introduces the more constrained API.
Closes #15266
This series cleans up gossiper.
Methods that do not change the gossiper object are marked as const.
Dead code is removed.
Closes #15272
* github.com:scylladb/scylladb:
gossiper: get_current* methods: mark as const
gossiper: get_generation_for_nodes: mark as const
gossiper: examine_gossiper: mark as const
gossiper: request_all, send_all: mark as const
gossiper: do_on_*notifications: mark as const
utils: atomic_vector: mark for_each functions as const
gossiper: compare_endpoint_startup: mark as const
gossiper: get_state_for_version_bigger_than: mark as const
gossiper: make_random_gossip_digest: delete dead legacy code
gossiper: make_random_gossip_digest: mark as const
gossiper: do_sort: mark as const
gossiper: is* methods: mark as const
gossiper: wait_for_gossip and friends: mark as const
gossiper: drop unused dump_endpoint_state_map
gossiper: remove unused shadow version members
On boot, the system keyspace is kicked to insert local info into the system.local
table. Among other things, there's the dc:rack pair, which sys.ks. gets from
its cache, which in turn should have been previously initialized from the
snitch on sys.ks. start. This patch makes the local info updating method
get the dc:rack from the caller via an argument. Callers, in turn, call the snitch
directly, because these are the main and cql_test_env startup routines.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service needs to get local dc:rack pair in some places and it
calls system keyspace's local_dc_rack() method for it. However, the
method returns back the data from sys.ks. cache which, in turn, was
previously initialized from snitch's data. This patch makes storage
service get location from snitch directly, without messing with system
keyspace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are some places out there that generate locator::endpoint_dc_rack
pair out of snitch's get_datacenter() and get_rack() calls. Generalize
those with snitch's new method. It will also be used by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous patch made the full endpoint location available as a local
variable near the places that get this location from the system
keyspace. This patch replaces the sys.ks. calls with the variables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places in repair code that get datacenter from the
topology. Nearby there are calls to update_topology() which, in turn,
needs full location ({dc, rack} pair). This patch makes the former
places obtain full location from topology and get the dc part from it.
This is needed as a preparation to let latter places use that location.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"experimental" option was marked "Unused" in 64bc8d2f7d. but we
chose to keep it in hope that the upgrade test does not fail.
despite that the upgrade tests per-se survived the "upgrade",
after the upgrade, the tests exercising the experimental features
are still failing hard. they have not been updated to set the
"experimental-features" option, and are still relying on
"experimental" to enable all the experimental features under
test.
so, in this change, let's just drop the option so that
scylla can fail early at seeing this "experimental" option.
this should help us to identify the tests relying on it
quicker. as the "experimental" features should only be used
in development environment, this change should have no impact
to production.
Refs #15214
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15233
Make sure the compaction_state:s are idle before
they are destroyed. Although all tasks are stopped
in stop_ongoing_compactions, make sure there is no
fiber holding the compaction_state gate.
compaction_manager::remove now needs to close the
compaction_state gate and to call stop_ongoing_compactions
only if the gate is not closed yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Check if the compaction_state gate is closed
along with _state != state::enabled and return early
in this case.
At this point entering the gate is guaranteed to succeed.
So enter the gate before calling `perform_compaction`
keeping the std::optional<gate_holder> throughout
the compaction task.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Passing the gate_closed_exception to the task promise in start()
ends up with an abandoned exception since no one is waiting
for it.
Instead, enter the gate when the task is made,
so that make_task fails if the gate is already closed.
Fixes scylladb/scylladb#15211
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Have a private abort_source for every module
and request abort on stop() to signal all outstanding
tasks to abort (especially when they are sleeping
for the task_ttl).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than to the top-level task_manager abort_source,
to provide separation between task_manager modules,
so each one can be aborted and stopped independently
of the others (in the next patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since d42685d0cb, the storage service has an on-board query processor ref^w
pointer and can use it to join the cluster.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15236
instead of flattening the logic into the script, let's structure
it into functions, so they can be reused and the script is more
maintainable.
Refs #15241
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replacing `minimum_keyspace_rf` config option with 4 config options:
`{minimum,maximum}_replication_factor_{warn,fail}_threshold`, which
allow us to impose soft limits (issue a warning) and hard limits (not
execute CQL) on RF when creating/altering a keyspace.
The reason to replace rather than extend the `minimum_keyspace_rf` config
option is to be aligned with Cassandra, which did the same and has the
same parameter names.
Only min soft limit is enabled by default and it is set to 3, which means
that we'll generate a CQL warning whenever RF is set to either 1 or 2.
RF's value of 0 is always allowed and means that there will not be any
replicas in a given DC. This was agreed with PM.
Because we don't allow changing guardrails' values while scylla is
running (per PM), there are no tests provided with this PR; dtests will be
provided separately.
Exceeding guardrails' thresholds will be tracked by metrics.
Resolves#8619
Refs #8892 (the RF part, not the replication-strategy part)
Closes #14262
We need to const_cast `this` since the const
container() has no const invoke_on override.
Trying to fix this in seastar sharded.hh breaks
many other call sites in scylla.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They only need to access the _vec_lock rwlock,
so mark it as mutable; otherwise they provide a const
interface to the callers, as the called func receives
the entries by value and cannot modify them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we start group 0 leadership monitor fiber only during
a normal bootstrap. However, we should also do it when we restart
a node (either with or without upgrading it to Raft).
Fixes #15166
Closes #15204
in a4eb3c6e0f, we passed the path of the
"image" file to `dbuild`, but that was wrong: we should pass its content
to this script. so in this change, it is fixed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15247
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See https://github.com/scylladb/scylladb/issues/11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.
This series adds such a configurable limit, sets it to 0.2 of total cache memory by default,
and re-enables index caching by default.
Fixes #15118
Closes #14994
* github.com:scylladb/scylladb:
test: boost/cache_algorithm_test: add cache_algorithm_test
sstables: partition_index_cache: deglobalize stats
utils: cached_file: deglobalize cached_file metrics
db: config: enable index caching by default
config: add index_cache_fraction
utils: lru: add move semantics to list links
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.
This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.
But it also makes sense even without that motive.
Move cached_file metrics from a thread_local variable
to cache_tracker.
This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.
But it also makes sense even without that motive.
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See #11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.
This patch sets such a limit to 0.2 of total cache memory, and re-enables
index caching by default.
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.
This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggered. We will change the default in a future
patch.
Before the patch, fixing list links is done manually in the move constructor of
`evictable`. After the patch, it is done by the move constructors of the links
themselves.
This makes for slightly cleaner code, especially after we add more links in an
upcoming patch.
This series ensures that endpoint state changes (for each single endpoint) are applied to the gossiper endpoint_state_map as a whole and on all shards.
Any failure in the process will keep the existing endpoint state intact.
Note that verbs that modify the endpoint states of multiple endpoints may still succeed in modifying some of them before hitting an error, and those changes are committed to the endpoint_state_map, so we don't ensure atomicity when updating multiple endpoints' states.
Fixes scylladb/scylladb#14794
Fixes scylladb/scylladb#14799
Closes #15073
* github.com:scylladb/scylladb:
gossiper: move endpoint_state by value to apply it
gossiper: replicate: make exception safe
gms: pass endpoint_state_ptr to endpoint_state change subscribers
gossiper: modify endpoint state only via replicate
gossiper: keep and serve shared endpoint_state_ptr in map
gossiper: get_max_endpoint_state_version: get state by reference
api/failure_detector: get_all_endpoint_states: reduce allocations
cdc/generation: get_generation_id_for: get endpoint_state&
gossiper: add for_each_endpoint_state helpers
gossiper: add num_endpoints
gossiper: add my_endpoint_state
This PR collects followups described in #14972:
- The `system.topology` table is now flushed every time feature-related
columns are modified. This is done because of the feature check that
happens before the schema commitlog is replayed.
- The implementation now guarantees that, if all nodes support some
feature as described by the `supported_features` column, then support
for that feature will not be revoked by any node. Previously, in an
edge case where a node is the last one to add support for some feature
`X` in `supported_features` column, crashes before applying/persisting
it and then restarts without supporting `X`, it would be allowed to boot
anyway and would revoke support for the `X` in `system.topology`.
The existing behavior, although counterintuitive, was safe - the
topology coordinator is responsible for explicitly marking features as
enabled, and in order to enable a feature it needs to perform a special
kind of a global barrier (`barrier_after_feature_update`) which only
succeeds after the node has updated its features column - so there is no
risk of enabling an unsupported feature. In order to make the behavior
less confusing, the node now will perform a second check when it tries
to update its `supported_features` column in `system.topology`.
- The `barrier_after_feature_update` is removed and the regular global
`barrier` topology command is used instead. The `barrier` handler now
performs a feature check if the node did not have a chance to verify and
update its cluster features for the second time.
The JOIN_NODE rpc will be sent separately, as it is a big item on its own.
Fixes: #14972
Closes #15168
* github.com:scylladb/scylladb:
test: topology{_experimental_raft}: don't stop gracefully in feature tests
storage_service: remove _topology_updated_with_local_metadata
topology_coordinator: remove barrier_after_feature_update
topology_coordinator: perform feature check during barrier
storage_service: repeat the feature check after read barrier
feature_service: introduce unsupported_feature_exception
feature_service: move startup feature check to a separate function
topology_coordinator: account for features to enable in should_preempt_balancing
group0_state_machine: flush system.topology when updating features columns
Add a document describing in detail how to use clang's "-ftime-trace"
option, and the ClangBuildAnalyzer tool, to find the source files,
header files and templates which slow down Scylla's build the most.
I've used this tool in the past to reduce Scylla build time - see
commits:
fa7a302130 (reduced 6.5%)
f84094320d (reduced 0.1%)
6ebf32f4d7 (reduced 1%)
d01e1a774b (reduced 4%)
I'm hoping that documenting how to use this tool will allow other
developers to suggest similar commits.
Refs #1.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15209
to lower the programmer's cognitive load: a programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.
Refs #15187
Closes #15188
* github.com:scylladb/scylladb:
sstable: add samples of fname to be matched by regex
sstables: change make_descriptor() to accept fs::path
sstables: switch entry_descriptor(sstring..) to std::string_view
sstables: change make_descriptor() to accept fs::path
the dbuild script provided by the branch being debugged might not
include the recent fixes in the current branch, from which
`open-coredump.sh` is launched.
so, instead of using the dbuild script in the repo being debugged,
let's use the dbuild provided by the current branch. also, wrap the
dbuild command line for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15240
Scrub tests use a lot of temporary directories. This is suspected to
cause problems in some cases. To improve the situation, this patch:
* Creates a single root temporary directory for all scrub tests
* All further fixtures create their files/directories inside this root
dir.
* All scrub tests create their temporary directories within this root
dir.
* All temporary directories now use an appropriate "prefix", so we can
tell which temporary directory is part of the problem if a test fails.
Refs: #14309
Closes #15117
server_impl::apply_snapshot() assumes that it cannot receive a snapshot
from the same host until the previous one is handled, and usually this is
true, since a leader will not send another snapshot until it gets a
response to the previous one. But it may happen that the snapshot-sending
RPC fails after the snapshot was sent, but before the reply is received,
because of a connection disconnect. In this case the leader may send
another snapshot, and there is no guarantee that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
* seastar 0784da87...6e80e84a (29):
> Revert "shared_token_bucket: Make duration->tokens conversion more solid"
> Merge 'chunked_fifo: let incremetal operator return iterator not basic_iterator' from Kefu Chai
> memory: diable transparent hugepages if --overprovisioned is specified
Ref https://github.com/scylladb/scylladb/issues/15095
> http/exception: s/<TAB>/ /
> install-dependencies.sh: re-add protobuf
> Merge 'Keep capacity on fair_queue_entry' from Pavel Emelyanov
> Merge 'Fix server-side RPC stream shutdown' from Pavel Emelyanov
Fixes https://github.com/scylladb/scylladb/issues/13100
> smp: make service management semaphore thread local
> tls_test: abort_accept() after getting server socket
> Merge 'Print more IO info with ioinfo app' from Pavel Emelyanov
> rpc: Fix client-side stream registration race
Ref https://github.com/scylladb/scylladb/issues/13100
> tests: perf: shard_token_bucket: avoid capturing unused variables in lambdas
> build: pass -DBoost_NO_CXX98_FUNCTION_BASE to C++ compiler
> reactor: Drop some dangling friend declarations
> fair_queue: Do not re-evaluate request capacity twice
> build: do not use serial number file when signing a cert
> shared_token_bucket: Make duration->tokens conversion more solid
> tests: Add perf test for shard_token_bucket
> Merge 'Make make_file_impl() less yielding' from Pavel Emelyanov
> fair_queue: Remove individual requests counting
> reactor, linux-aio: print value of aio-max-nr on error
> Merge 'build, net: disable implicit fallthough' from Kefu Chai
> shared_token_bucket: Fix duration_for() underflow
> rpc: Generalize get_stats_internal() method
> doc/building-dpdk.md: fix invalid file path of README-DPDK.md
> install-dependencies: add centos9
> Merge 'log: report scheduling group along with shard id' from Kefu Chai
> dns: handle exception in do_sendv for udp
> Merge 'Add a stall detector histogram' from Amnon Heiman
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15218
change another overload of `make_descriptor()` to accept `fs::path`,
in the same spirit as a previous change in this area, so we have
a more consistent API for creating sstable descriptors, and this
new API is simpler to use.
Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so its callers don't need to construct a temporary `sstring` if
the parameter's type is not `sstring`. for instance, before
this change, `entry_descriptor::make_descriptor(const std::filesystem::path...)`
would have to construct two temporary instances of `sstring`
for calling this function.
after this change, it does not have to do so.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to lower the programmer's cognitive load. as programmer might want
to pass the full path as the `fname` when calling
`make_descriptor(sstring sstdir, sstring fname)`, but this overload
only accepts the filename component as its second parameter. a
single `path` parameter would be easier to work with.
Refs #15187
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The current cluster feature tests are stopping nodes in a graceful way.
Doing it gracefully isn't strictly necessary for the test scenarios and
we can switch `server_stop_gracefully` calls to `server_stop`. This only
became possible after a previous commit which causes the `system.topology`
table to be flushed when cluster feature columns are modified, and will
serve as a good test for it.
After removing barrier_after_feature_update, the flag is no longer
needed by anybody. The field in storage_service is removed and
do_update_topology_with_local_metadata is inlined.
The `barrier_after_feature_update` was introduced as a variant of the
`barrier` command, meant to be used by the topology coordinator when
enabling a feature. It was meant to give more guarantees to the topology
coordinator than the regular barrier, but the regular barrier has been
adjusted in the previous commits so that it can be used instead of the
special barrier.
This commit gets rid of `barrier_after_feature_update` and replaces its
uses with `barrier`.
Due to the possible situation where a node applies a command that
advertises support for a feature but crashes before applying it, there
is a period of time where a node might have its group 0 server running
but does not support all of the features. Currently, we solve the issue
by using a special `barrier_after_feature_update` which will not succeed
until the node makes sure to update its `supported_features` column (or,
since the previous commit, shuts down if it doesn't support all required
features).
However, we can make it work with regular barrier after adjusting it
slightly. In case the local metadata was not updated yet, it will
perform a feature check. This will make sure that the global barrier
issued by the topology coordinator before enabling features will not
succeed if the problematic situation occurs.
We would like to guarantee the following property: if all nodes have
some feature X in their `supported_features` column in
`system.topology`, then it's no longer possible for anybody to revoke
support for it. Currently, it is not guaranteed because the following
can happen:
1. A node commits a command that updates its `supported_features`,
marking feature X as supported. It is the last node to do so and now
all nodes support X.
2. Node crashes before applying the command locally.
3. Node is downgraded not to support X and restarted.
4. The feature check in `enable_features_on_startup` passes because it
happens before starting the group 0 server.
5. The `supported_features` column is updated in
`update_topology_with_local_metadata`, removing support for X.
Even though the guarantee does not hold, it's not a problem because
`barrier_after_feature_update` is required to succeed on all nodes
before the topology coordinator moves to enable a feature, and - as the name
suggests - it requires `update_topology_with_local_metadata` to finish.
However, choosing to give this guarantee makes it simpler to reason
about how cluster features on raft work and removes some pathological
cases (e.g. trying to downgrade some other node after step 1 will fail,
but will be again possible after step 5). Therefore, this commit adds a
second check to `update_topology_with_local_metadata` which disallows
removing support for a feature that is supported by everybody - and
stops the boot process if necessary.
The logic responsible for checking supported features against the
currently enabled features (and features that are unsafe to disable) is
moved to a separate function, `check_features`. Currently, it is only
used from `enable_features_on_startup`, but more checks against features
in raft will be added in the commits that follow.
First replicate the new endpoint_state on all shards
before applying the replicated endpoint_state objects
to _endpoint_state_map.
Fixes scylladb/scylladb#14794
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the endpoint_state isn't changed in place,
we do not need to copy it to each subscriber.
We can instead just pass the lw_shared_ptr holding
a snapshot of it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And restrict the accessor methods to return const pointers
or references.
With that, the endpoint_state_ptr:s held in the _endpoint_state_map
point to immutable endpoint_state objects - with one exception:
the endpoint_state update_timestamp may be updated in place,
but the endpoint_state_map is immutable.
replicate() replaces the endpoint_state_ptr in the map
with a new one to maintain immutability.
A later change will also make this exception safe:
replicate will guarantee strong exception safety, so that either all shards
are updated or none of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in place anyway.
While internally, gossiper still has the legacy helpers
to manage the endpoint_state.
Fixes scylladb/scylladb#14799
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
reserve the result vector based on the known
number of endpoints and then move-construct each entry
rather than copying it.
Also, use references to traverse the application_state_map
rather than copying each of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to look up the application_state again using the
endpoint, as both callers already have a reference to
the endpoint_state handy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before changing _endpoint_state_map to hold a
lw_shared_ptr<endpoint_state>, provide synchronous helpers
for users to traverse all endpoint_states with no need
to copy them (as long as the called func does not yield).
With that, gossiper::get_endpoint_states() can be made private.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Return the number of endpoints tracked by gossiper.
This is useful when the caller doesn't need
access to the endpoint states map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Gets or creates the endpoint_state for this node,
instead of accessing _endpoint_state_map directly.
Do this before changing the map to hold a lw_shared_ptr<endpoint_state>
in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
open-coredump.sh allows us to specify --scylla-repo-path, but the
developer's remote repo name is not always "origin" -- "origin"
could be their own remote repo, not the one from which
we want to pull.
so, in this change, assuming that the remote repo to be pulled
from has been added to the local repo, we query the local repo
for its name and pull using that name instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15220
before this change, we don't configure the selinux label when
binding the shared volume, but this results in permission denied
when accessing `/opt/scylladb` in the container when selinux
is enabled. since we are not likely to share the volume with
other containers, we can use `Z` to indicate that the bind
mount is private and unshared. this allows the launched container
to access `/opt/scylladb` even if selinux is enabled.
since selinux is enabled by default on an installation of fedora 38,
this change should improve the user experience of open-coredump
when a developer uses fedora distributions.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15229
Override the methods returning the expected number of children and job size
in repair tasks. With them, the get_progress method will be able to
return a more precise progress value.
Change 6449c59 brought back abort on listener failure, but this causes crashes when listeners hit expected errors like gate_closed.
Detect shutdown using the gossiper _abort_source,
and in this case just log a warning about the errors but do not abort.
Fixes scylladb/scylladb#15031
Closes #15100
* github.com:scylladb/scylladb:
gossiper: apply_new_states: tolerate listener errors during shutdown
gossiper: do_on_change_notifications: check abort source
gossiper: lock_endpoint_update_semaphore: get_units with _abort_source
gossiper: lock_endpoint: get_units with _abort_source
gossiper: is_enabled: consider also _abort_source
We want to stop supporting IPs for `--ignore-dead-nodes` in
`raft_removenode` and `--ignore-dead-nodes-for-replace` for
`raft_replace`. However, we shouldn't remove these features without the
deprecation period because the original `removenode` and `replace`
operations still support them. So, we add them for now.
The `IP -> Raft ID` translation is done through the new
`raft_address_map::find_by_addr` member function.
We update the documentation to inform about the deprecation of the IP
support for `--ignore-dead-nodes`.
Fixes #15126
Closes #15156
* github.com:scylladb/scylladb:
docs: inform about deprecating IP support for --ignore-dead-nodes
raft topology: support IPs for --ignore-dead-nodes
raft_address_map: introduce find_by_addr
Use common code to notify subscribers on_dead
from remove_endpoint() and from mark_dead().
Modeled after do_on_change_notifications.
Refs https://github.com/scylladb/scylladb/pull/15179#discussion_r1306969125
Closes #15206
* github.com:scylladb/scylladb:
gossiper: remove_endpoint: get the endpoint_state before yielding
gossiper: add do_on_dead_notifications
The cql_test_env has a virtual require_column_has_value() helper that better fits the cql_assertions crowd. Also, the helper in question duplicates some existing code, so it can also be made shorter (and one class table helper gets removed afterwards).
Closes #15208
* github.com:scylladb/scylladb:
cql_assertions: Make permit from env
table: Remove find_partition_slow() helper
sstable_compaction_test: Do not re-decorate key
cql_test_env: Move .require_column_has_value
cql_test_env: Use table.find_row() shortcut
Motivation:
The user can bootstrap 3 different clusters and then connect them
(#14448). When these clusters start gossiping, their token rings will be
merged, but there will be 3 different group 0s in there. It results in a
corrupted cluster.
We need to prevent such situations from happening in clusters which
don't use Raft-based topology.
-------
Gossiper service sets its group0 id on startup if it is stored in
`scylla_local` or sets it during joining group0.
Send group0_id (if it is set) when the node tries to initiate the gossip
round. When a node gets gossip_digest_syn it checks if its group0 id
equals the local one and if not, the message is discarded.
Fixes #14448
Performed manual tests with the following scenario:
1. setup a cluster of two nodes (one compiled with and one without this patch)
2. setup a new node
3. create a basic keyspace and table
4. execute simple select and insert queries
Tested 4 scenarios: the seed node was with or without this patch, and the third node was with or without this patch.
These tests didn't detect any errors.
Closes #15004
* github.com:scylladb/scylladb:
tests: raft: cluster of nodes with different group0 ids
gossip: add group0_id attribute to gossip_digest_syn
This field is about to be removed in newer seastar, so it
shouldn't be checked in scylla-gdb
(see also ae6fdf1599)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15203
To call table::find_row() one needs to provide a permit. Tests have a
short and neat helper to create one from cql_test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
these methods are only used by the public methods of this class and
its derived class "memtable_encoding_stats_collector".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15190
The is_partition_dead() local helper accepts a partition key argument and
decorates it. However, its caller gets the partition key from the decorated
key itself, and can just pass it along.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The require_column_has_value() finds the cell in three steps -- finds
partition, then row, then cell. The class table already has a method to
facilitate row finding by partition and clustering key
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main goal of this PR is to stop cdc_generation_service from calling
system_keyspace::bootstrap_complete(). The reason why it's there is that
the gen. service doesn't want to handle generations before the node joined
the ring or after it was decommissioned. The cleanup is done with the help
of the storage_service->cdc_generation_service explicit dependency brought
back, and this, in turn, freed the raft and API code from the
need to carry a cdc gen. service reference around.
Closes #15047
* github.com:scylladb/scylladb:
cdc: Remove bootstrap state assertion from after_join()
cdc: Rework gen. service check for bootstrap state
api: Don't carry cdc gen. service over
storage_service: Use local cdc gen. service in join_cluster()
storage_service: Remove cdc gen. service from raft_state_monitor_fiber()
raft: Do not carry cdc gen. service over
storage_service: Use local cdc gen. service in topo calls
storage_service: Bring cdc_generation_service dependency back
In #14722, a source of work was added to `handle_topology_transition`,
but the `should_preempt_balancing` function was not updated accordingly,
as is suggested by the comment in `handle_topology_transition`. This
omission happened due to a hasty rebase.
This commit fixes the issue, and now `should_preempt_balancing` will
return true if there are some features that should be enabled.
The `supported_features` and `enabled_features` columns from
`system.topology` are read during the feature check that happens early
on boot. The check enforces two properties:
- A node is not allowed to revoke support for a feature after it notices
in its local topology state that the feature is supported by all
nodes.
- Similarly, a node is not allowed to revoke support for a feature after
seeing that it was put to the `enabled_features` column by the
topology coordinator.
However, due to the fact that the check has to happen before (schema)
commitlog replay and the table is not explicitly flushed when
`supported_features` or `enabled_features` columns are modified, the
feature check on boot might operate on old data and not do its job
properly.
In order to fix this, this commit modifies the `group0_state_machine` so
that it flushes the `system.topology` table every time the
`supported_features` or `enabled_features` column is modified, and after
every snapshot transfer.
The reproducer for #14448.
The test starts two nodes with different group0_ids. The second node
is restarted and tries to join the cluster consisting of the first node.
gossip_digest_syn message should be rejected by the first node, so
the second node will not be able to join the cluster.
This test uses repair-based node operations to make verification easier.
If the second node successfully joins the cluster, their token metadata
will be merged and the repair service will allow decommissioning the second node.
If not, decommissioning the second node will fail with an exception
"zero replica after the removal" thrown by the repair service.
Gossiper service sets its group0 id on startup if it is stored in `scylla_local`
or sets it during joining group0.
Send group0_id (if it is set) when the node tries to initiate the gossip round.
When a node gets gossip_digest_syn it checks if its group0 id equals the local
one and if not, the message is discarded.
Fixes #14448.
We want to call the on_dead notifications if the
node was alive and it had endpoint_state.
Get the ep state before we may yield in
mutate_live_and_unreachable_endpoints, similarly
to mark_dead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes #12261
Fixes #15088
Also, add rest_api unit test for those apis
Closes #15093
* github.com:scylladb/scylladb:
test: rest_api: add test_gossiper
gossiper: add get_unreachable_members_synchronized
As was described in the previous patch, this method is explicitly called
by storage service after updating the bootstrap state, so it's unneeded
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The legacy_handle_cdc_generation() checks if the node had bootstrapped
with the help of system_keyspace method. The former is called in two
cases -- on boot via cdc_generation_service::after_join() and via
gossiper on_...() notifications. The notifications, in turn, are set up
in the very same after_join().
The after_join(), in turn, is called from storage_service explicitly
after the bootstrap state is updated to be "complete", so the check for
the state in legacy_handle_...() seems unnecessary. However, there's
still one case when it may be stepped on -- decommission. When performed,
it calls storage_service::leave_ring() which updates the bootstrap state
to be "needed", thus preventing the cdc gen. service from doing anything
inside gossiper's on_...() notifications.
It's more correct to stop the cdc gen. service from handling gossiper
notifications by unsubscribing it, rather than by adding fragile implicit
dependencies on the bootstrap state.
Checks for sys.dist.ks in the legacy_handle_...() are kept in the form
of an on-internal-error. The system distributed keyspace is activated by
storage service even before the bootstrap state is updated and is
never deactivated, but it's good to have this assertion anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a storage_service/cdc_streams_check_and_repair endpoint that
needs to provide cdc gen. service to call storage_service method on. Now
the latter has its own reference to the former and API can stop taking
care of that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question accepts a cdc_generation_service ref argument from
main and cql_test_env, but storage service now has a local cdc gen.
service reference, so this argument and its propagation down the stack
can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a cdc_generation_service ref sitting on group0_state_machine and
the only reason it's there is to call storage_service::topology_...()
mathods. Now when storage service can access cdc gen. service on its
own, raft code can forget about cdc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology_state_load() and topology_transition() both take the cdc gen.
service as an argument, but can work with the local reference. This
makes the next patch possible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It sort of reverts the 5a97ba7121 commit, because storage service now
uses the cdc generation service to serve raft topology updates; previously,
the raft code carried the cdc gen. service around _just_ to pass
it as an argument to storage service topo calls.
Also, the API carries the cdc gen. service for a single call, and there's
an implicit need to kick the cdc gen. service on decommission, which
also needs storage service to reference the cdc gen. service after boot is complete.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.
Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138
The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141
Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15142
This function is too flaky with the 30 seconds timeout.
For example, the following was seen locally with
`test_updated_shards_during_add_decommission_node` in dev mode:
alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node6.log:
```
INFO 2023-08-27 15:47:25,753 [shard 0] gossip - Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:30,754 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:35,761 [shard 0] gossip - (rate limiting dropped 495 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:40,766 [shard 0] gossip - (rate limiting dropped 498 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:45,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
INFO 2023-08-27 15:47:50,768 [shard 0] gossip - (rate limiting dropped 497 similar messages) Waiting for 2 live nodes to show up in gossip, currently 1 present...
ERROR 2023-08-27 15:47:55,758 [shard 0] gossip - Timed out waiting for 2 live nodes to show up in gossip
INFO 2023-08-27 15:47:55,759 [shard 0] init - Shutting down group 0 service
```
alternator_stream_tests.py::TestAlternatorStreams::test_updated_shards_during_add_decommission_node/node1.log:
```
INFO 2023-08-27 15:48:02,532 [shard 0] gossip - InetAddress 127.0.43.6 is now UP, status = UNKNOWN
...
WARN 2023-08-27 15:48:03,552 [shard 0] gossip - failure_detector_loop: Send echo to node 127.0.43.6, status = failed: seastar::rpc::closed_error (connection is closed)
```
Note that node1 saw node6 as UP after node6 already timed out
and was shutting down.
Increase the timeout to 3 minutes in all modes to reduce flakiness.
Fixes #15185
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15186
Permits added to `_ready_list` remain there until
executed by `execution_loop()`.
But `execution_loop()` exits when `_stopped == true`,
even though nothing prevents new permits from being added
to `_ready_list` after `stop()` sets `_stopped = true`.
Thus, if there are reads concurrent with `stop()`,
it's possible for a permit to be added to `_ready_list`
after `execution_loop()` has already quit. Such a permit will
never be destroyed, and `stop()` will forever block on
`_permit_gate.close()`.
A natural solution is to dismiss `execution_loop()` only after
it's certain that `_ready_list` won't receive any new permits.
This is guaranteed by `_permit_gate.close()`. After this call completes,
it is certain that no permits *exist*.
After this patch, `execution_loop()` no longer looks at `_stopped`.
It only exits when `_ready_list_cv` breaks, and this is triggered
by `stop()` right after `_permit_gate.close()`.
Fixes #15198
Closes #15199
It is too early to require that all nodes in normal state
have a non-null host_id.
The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.
In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.
Fixes scylladb/scylladb#15181
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15184
`create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Closes #15141
* github.com:scylladb/scylladb:
storage_proxy: create_write_response_handler: carry effective_replication_map_ptr from paxos_response_handler
storage_proxy: send_to_live_endpoints: throw on_internal_error if node not found
if the endpoint specified when creating a KEYSPACE is not found,
then when flushing a memtable we would throw an `std::out_of_range`
exception when looking up the client in `storage_manager::_s3_endpoints`
by the name of the endpoint, and scylla would crash because of it. so
far, we don't have a good way to error out early. since the
storage option for keyspace is still experimental, we can live
with this, but it would be better if we could spot this error in logging
messages when testing this feature.
also, in this change, `std::invalid_argument` is thrown instead of
`std::out_of_range`. it's more appropriate in this circumstance.
Refs #15074
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15075
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixes scylladb/scylladb#15147
Closes #15149
* github.com:scylladb/scylladb:
storage_proxy: mutate_atomically_result: carry effective_replication_map down to create_write_response_handler
storage_proxy: mutate_atomically_result: keep schema of batchlog mutation in context
Since 75d1dd3a76
gossiper::convict will no longer call `mark_dead`
(e.g. when called from the failure detection loop
after a node is stopped following decommission)
and therefore the on_dead notification won't get called.
To make that explicit, if the node was alive before
remove_endpoint erased it from _live_endpoint,
and it has an endpoint_state, call the on_dead notifications.
These are important to clean up after the node is dead,
e.g. in storage_proxy::on_down which cancels all
respective write handlers.
This is preferred over going through `mark_dead`, as the latter
marks the endpoint as unreachable, which is wrong in this
case since the node left the cluster.
Fixes #15178
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15179
Disabling fstrim.timer was meant to avoid running fstrim on /var/lib/scylla from
both scylla-fstrim.timer and fstrim.timer, but fstrim.timer actually never does
that, since it only looks at fstab entries, not our systemd unit.
To run fstrim correctly on the rootfs and other filesystems not related to
scylla, we should stop disabling fstrim.timer.
Fixes #15176
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #15177
seastar::format() just forwards the parameters to be formatted to
`fmt::format_to()`, which is able to format `std::string`, so there is
no need to cast the `std::string` instance to `std::string_view` for
formatting it.
in this change, the cast is dropped. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15143
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" to the current position of the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.
Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.
Fixes #14782
Closes #14834
New file streaming for tablets will require integration with compaction
groups. So this patch introduces a way for streaming to take a storage
snapshot of a given tablet using its token range. Memtable is flushed
first, so all data of a tablet can be streamed through its sstables.
The interface is compaction group / tablet agnostic, but the user can
easily pick data from a single tablet by using the range in the tablet
metadata for a given tablet.
E.g.:
auto erm = table.get_effective_replication_map();
auto& tm = erm->get_token_metadata();
auto tablet_map = tm.tablets().get_tablet_map(table.schema()->id());
for (auto tid : tablet_map.tablet_ids()) {
auto tr = tablet_map.get_token_range(tid);
auto ssts = co_await table.take_storage_snapshot(tr);
...
}
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15128
We want to stop supporting IPs for --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace. However, we shouldn't remove these features without
the deprecation period because the original removenode and
replace operations still support them. So, we add them for now.
Additionally, we modify test_raft_ignore_nodes.py so that it
verifies the added IP support.
before this change, `checksummed_file_data_sink_impl` just inherits the
`data_sink_impl::flush()` from its parent class. but as a wrapper around
the underlying `_out` data_sink, this is not only an unusual design
decision in a layered design of an I/O system, but also could be
problematic. to be more specific, the typical user of `data_sink_impl`
is a `data_sink`, whose `flush()` member function is called when
the user of `data_sink` want to ensure that the data sent to the sink
is pushed to the underlying storage / channel.
this in general works, as the typical user of `data_sink` is in turn
`output_stream`, which calls `data_sink.flush()` before closing the
`data_sink` with `data_sink.close()`. and the operating system will
eventually flush the data after the application closes the corresponding
fd. to be more specific, almost none of the popular local filesystems
implement the file_operations.op, hence it's safe even if the
`output_stream` does not flush the underlying data_sink after writing
to it. this is the use case when we write to sstables stored on a local
filesystem. but as explained above, if the data_sink is backed by a
network filesystem, a layered filesystem or storage connected via
a buffered network device, then it is crucial to flush in a timely
manner; otherwise we could risk data loss if the application / machine /
network breaks while the data is considered persisted but is
_not_!
but the `data_sink` returned by `client::make_upload_jumbo_sink` is
a little bit different. multipart upload is used under the hood, and
we have to finalize the upload once all the parts are uploaded by
calling `close()`. but if the caller fails / chooses to close the
sink before flushing it, the upload is aborted, and the partially
uploaded parts are deleted.
the default-implemented `checksummed_file_data_sink_impl::flush()`
breaks `upload_jumbo_sink` which is the `_out` data_sink being
wrapped by `checksummed_file_data_sink_impl`. as the `flush()`
calls are shortcircuited by the wrapper, the `close()` call
always aborts the upload. that's why the data and index components
just fail to upload with the S3 backend.
in this change, we just delegate the `flush()` call to the
wrapped class.
Fixes #15079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15134
In the following commit, we add IP support for --ignore-dead-nodes
in raft_removenode and raft_replace. To implement it, we need
a way to translate IPs to Raft IDs. The solution is to add a new
member function -- find_by_addr -- to raft_address_map that
does the IP->ID translation.
The IP support for --ignore-dead-nodes will be deprecated and
find_by_addr shouldn't be called for other reasons, so it always
logs a warning.
We also add some unit tests for find_by_addr.
It's possible that the compaction task is preempted after completion and
before reevaluation, causing pending_tasks to be > 1.
Let's only exit the loop if there are no pending tasks, and also
reduce the 100ms sleep, which is an eternity for this test.
Fixes #14809.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15059
`create_write_response_handler` on this path accepts
an `inet_address_vector_replica_set` that corresponds to the
effective_replication_map_ptr in the paxos_response_handler,
but currently the function retrieves a new
effective_replication_map_ptr
that may not hold all the said endpoints.
Fixes scylladb/scylladb#15138
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Change 6449c59 brought back abort on listener failure,
but this causes crashes when listeners hit expected errors
like gate_closed.
Detect shutdown using the gossiper _abort_source
and in this case just log a warning about the errors
but do not abort.
Fixes scylladb/scylladb#15031
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As Tomasz Grabiec correctly noted:
> We should also ensure that once _abort_source is aborted, we don't attempt to process any further notifications, because that would violate monotonicity due to partially failed notification. Even if the next listener eventually fails too, if this invariant is violated, it can have undesired side effects.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Locking the _endpoint_update_semaphore should be abortable with the
gossiper _abort_source. No further processing should
be done once abort is requested.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Locking an endpoint should be abortable with the
gossiper _abort_source. No further processing should
be done once abort is requested.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Once abort is requested we should not process any more
gossip RPCs to prevent undesired side effects
of partially applied state changes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The effective_replication_map_ptr passed to
`create_write_response_handler` by `send_batchlog_mutation`
must be synchronized with the one used to calculate
_batchlog_endpoints to ensure they use the same topology.
Fixes scylladb/scylladb#15147
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The batchlog mutation is for system.batchlog.
Rather than looking the schema up in multiple places
do that once and keep it in the context object.
It will be used in the next patch to get a respective
effective_replication_map_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sstabledump is deprecated in favor of the `scylla sstable` commands. so
let's reflect this in the document.
Fixes #15020
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15021
Scylla sstable promises to *never* mutate its input sstables. This
promise was broken by `scylla sstable scrub --scrub-mode=validate`,
because validate moves invalid input sstables into quarantine. This is
unexpected and caused occasional failures in the scrub tests in
test_tools.py. Fix by propagating a flag down to
`scrub_sstables_validate_mode()` in `compaction.cc`, specifying whether
validate should quarantine invalid sstables, then setting this flag to false
in `scylla-sstable.cc`. The existing test for validate-mode scrub is
amended to check that the sstable is not mutated. The test now fails
before the fix and passes afterwards.
Fixes: #14309
Closes #15139
Currently the gossiper marks endpoint_state objects as alive/dead.
In some cases the endpoint_state::is_alive function is checked, but in many other cases
gossiper::is_alive(endpoint) is used to determine if the endpoint is alive.
This series removes the endpoint_state::is_alive state and moves all the logic to gossiper::is_alive,
which bases its decision on the endpoint having an endpoint_state and being in the _live_endpoints set.
For that, _live_endpoints is made sure to be replicated to all shards when changed,
and the endpoint_state changes are serialized under lock_endpoint. Also, the
endpoint_state in the _endpoint_states_map is never updated in place; rather, a temporary copy is changed
and then safely replicated using gossiper::replicate.
Refs https://github.com/scylladb/scylladb/issues/14794
Closes #14801
* github.com:scylladb/scylladb:
gossiper: mark_alive: remove local_state param
endpoint_state: get rid of _is_alive member and methods
gossiper: is_alive: use _live_endpoints
gossiper: evict_from_membership: erase endpoint from _live_endpoints
gossiper: replicate_live_endpoints_on_change: use _live_endpoints_version to detect change
gossiper: run: no need to replicate live_endpoints
gossiper: fold update_live_endpoints_version into replicate_live_endpoints_on_change
gossiper: add mutate_live_and_unreachable_endpoints
gossiper: reset_endpoint_state_map: clear also shadow endpoint sets
gossiper: reset_endpoint_state_map: clear live/unreachable endpoints on all shards
gossiper: functions that change _live_endpoints must be called on shard 0
gossiper: add lock_endpoint_update_semaphore
gossiper: make _live_endpoints an unordered_set
endpoint_state: use gossiper::is_alive externally
The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.
Fix by registering the observer before starting group0, and even before
starting the gossiper, to avoid a theoretical case where something may pull
the schema after the start of gossiping and before the observer is registered.
Fixes: #15078
Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
We make every alternative type in the request_param variant
a named struct to make the code more readable. Additionally, this
change will make extending request parameters easier if we decide
to do so in the future.
Closes #15132
Some runtime errors thrown in storage_service::raft_removenode
start with the "raft topology " prefix. Since "raft topology" is
an implementation detail, we don't want to expose this information
through the user API. Only logs should contain it.
Closes #15136
This patch marks per-scheduling-group counters with the skip_when_empty flag.
This reduces metrics reporting for scheduling groups that do not use
those counters.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch marks the conditional metrics counters with the skip_when_empty
flag, to reduce metrics reporting when CAS is not in use.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch marks scylla_transport_cql_errors_total with the skip_when_empty
flag.
It reduces the overhead of metrics for types that are never reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When casting a float or double column to a string with `CAST(f AS TEXT)`,
Scylla is expected to print the number with enough digits so that reading
that string back to a float or double restores the original number
exactly. This expectation isn't documented anywhere, but makes sense,
and is what Cassandra does.
Before commit 71bbd7475c, this wasn't the
case in Scylla: `CAST(f AS TEXT)` always printed 6 digits of precision,
which was slightly less than enough for a float (which can have 7 decimal digits
of precision), and very much not enough for a double (which can need 15
digits). The origin of this magic "6 digits" number was that Scylla uses
seastar::to_sstring() to print the float and double values, and before
the aforementioned commit those functions used sprintf with the "%g"
format - which always prints 6 decimal digits of precision! After that
commit, to_sstring() now uses a different approach (based on fmt) to
print the float and double values, that prints all significant digits.
This patch adds a regression test for this bug: We write float and double
values to the database, cast them to text, and then recover the float
or double number from that text - and check that we get back exactly the
same float or double object. The test *fails* before the aforementioned
commit, and passes after it. It also passes on Cassandra.
Refs #15127
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15131
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Closes #14844
* github.com:scylladb/scylladb:
api: storage_service: improve description of /storage_service/host_id
token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
in this series, we also print the sstable name and pk when writing a tombstone whose local_deletion_time (ldt for short) is greater than INT32_MAX and hence cannot be represented by an int32_t.
Fixes #15015
Closes #15107
* github.com:scylladb/scylladb:
sstable/writer: log sstable name and pk when capping ldt
test: sstable_compaction_test: add a test for capped tombstone ldt
scylla-gdb.py has two methods for iterating over all tables:
* all_tables()
* for_each_table()
Despite this, many places in the code iterate over the column family map
directly. This patch leaves just a single method (for_each_table()) and
migrates the whole codebase to use it, instead of iterating over the raw
map. While at it, the access to the map is made backward compatible with
pre-52afd9d42d code; said commit wrapped database::_column_families in a
tables_metadata object, which broke scylla-gdb.py for older versions.
Closes #15121
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.
The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.
Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.
Fixes #15025 Closes #15113
* github.com:scylladb/scylladb:
test: add test_raft_ignore_nodes
test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
raft topology: pass ignore_nodes to {replace, remove}_with_repair
raft topology: exec_global_command: add ignore_nodes to exclude_nodes
raft topology: exec_global_command: change type of exclude_nodes
topology_state_machine: extend request_param with a set of raft ids
raft topology: set ignore_nodes in raft_removenode and raft_replace
utils: introduce split_comma_separated_list
raft topology: add the ignore_nodes column to system.topology
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.
The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.
Closes#14901
* github.com:scylladb/scylladb:
test_fencing: add test_fence_hints
test.py: output the skipped tests
test.py: add skip_mode decorator and fixture
test.py: add mode fixture
hints: add debug log for dropped hints
hints: send_one_hint: extend the scope of file_send_gate holder
pylib: add ScyllaMetrics
hints manager: add send_errors counter
token_metadata: add debug logs
fencing: add simple data plane test
random_tables.py: add counter column type
raft topology: don't increment version when transitioning to node_state::normal
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.
This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.
Fix the issue by using an exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with a reconnection exception stack trace.
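The capped policy can be sketched as follows (jitter omitted for determinism; the function and its parameter names are illustrative, not the driver's actual API):

```python
def reconnect_intervals(base=1.0, max_interval=4.0, attempts=6):
    """Exponential backoff capped at max_interval seconds."""
    return [min(base * 2 ** i, max_interval) for i in range(attempts)]

# with a 4 s cap the schedule flattens quickly:
print(reconnect_intervals())
# with the default 600 s cap the interval keeps growing for many retries,
# easily exceeding a test's timeout:
print(reconnect_intervals(max_interval=600.0, attempts=10)[-1])
```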
Fixes #15104 Closes #15112
This short series deals with two stall sources in row-level repair `to_repair_rows_list`:
1. Freeing the input `repair_rows_on_wire` in one shot on return (as seen in https://github.com/scylladb/scylladb/issues/14537)
2. Freeing the result `row_list` in one shot on error. This hasn't been seen in testing, but I have no reason to believe it is not susceptible to exactly the same stalls as repair_rows_on_wire, given the same number of rows and mutations.
Fixes https://github.com/scylladb/scylladb/issues/14537 Closes #15102
* github.com:scylladb/scylladb:
repair: reindent to_repair_rows_list
repair: to_repair_rows_list: clear_gently on error
repair: to_repair_rows_list: consume frozen rows gently
We add two tests verifying that --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace are handled correctly.
We need a 7-node cluster to keep a Raft majority. Therefore, these
tests are quite slow, and we want to run them only in the dev mode.
ManagerClient.remove_node allows passing ignore_dead only as
List[IPAddress]. However, raft_removenode currently supports
only host ids. To write a test that passes ignore_dead to
ManagerClient.remove_node in the Raft topology mode, we allow
passing ignore_dead as List[HostId].
Note that we don't want to use List[IPAddress | HostId] because
mixing IP addresses and host ids fails anyway. See
ss::remove_node.set(...) in api::set_storage_service.
To properly stream ranges during the removenode or replace
operation in the Raft topology mode, we pass IPs of the ignored
nodes to replace_with_repair and remove_with_repair in
storage_service::raft_topology_cmd_handler.
We add ignore_nodes to exclude_nodes in exec_global_command
to ignore nodes marked as dead by --ignore-dead-nodes for
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace.
We extend exclude_nodes in exec_global_command with ignore_nodes
in the next commit. Since we already use std::unordered_set to
store ids of the ignored nodes and their number is unknown, we
change the type of exclude_nodes from utils::small_vector to
std::unordered_set.
We add two new alternative types to service::request_param:
removenode_param and replace_param. They allow storing the list
of ignored nodes loaded from the ignore_nodes column of
system.topology. We also remove the raft::server_id type because
it has been only used by the replace operation.
To handle --ignore-dead-nodes in raft_removenode and
--ignore-dead-nodes-for-replace in raft_replace, we set the
ignore_nodes value of the topology mutation in these functions. In
the following commits, we ensure that the topology coordinator
properly makes use of it.
The test makes a write through the first node with
the third node down; this causes a hint for the third
node to be stored on the first node. We increment the version
and fence_version on the third node, restart it,
and expect to see a hint delivery failure
because of the version mismatch. Then we update the versions
of the first node and expect the hint to be successfully
delivered.
The pytest option -rs forces it to print
all the skipped tests along with
the reasons. Without this option we
can't tell why certain tests were skipped;
maybe some of them shouldn't be skipped anymore.
Syntactic sugar for marking tests to be
skipped in a particular mode.
There is skip_in_debug/skip_in_release in suite.yaml,
but they can be applied only to an entire file,
which is unnatural and inconvenient. Also, they
don't allow specifying a reason why the test is skipped.
Separate dictionary skipped_funcs is needed since
we can't use pytest fixtures in decorators.
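A minimal sketch of the decorator described above (the real implementation in test.py may differ in details):

```python
# test function -> list of (mode, reason); a fixture consults this dict
# at collection time, since fixtures cannot be used inside decorators.
skipped_funcs = {}

def skip_mode(mode: str, reason: str):
    """Mark a test to be skipped when running in the given build mode."""
    def wrap(fn):
        skipped_funcs.setdefault(fn, []).append((mode, reason))
        return fn
    return wrap

@skip_mode('debug', 'times out in debug')
def test_something():
    pass
```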
The problem was that the holder in the with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (the segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending some hints led to errors.
Fixes #15110
This patch adds facilities to work
with Scylla metrics from test.py tests.
A new metrics property was added to
ManagerClient; its query method
sends a request to the Scylla metrics
endpoint and returns an object
for conveniently accessing the result.
ScyllaMetrics is copy-pasted from
test_shedding.py. It's difficult
to reuse code between the 'new' and 'old'
styles of tests: we can't just import
pylib in 'old' tests because of some
problems with Python search directories.
A past commit of mine that attempted
to solve this problem was rejected on review.
There was no indication of problems
in the hints manager metrics before.
We need this counter for fencing tests
in the later commit, but it seems to be
useful on its own.
We log the new version when the new token
metadata is set.
Also, the log for fence_version is moved
in shared_token_metadata from storage_service
for uniformity.
The test starts a three node cluster
and manually decrements the version on
the last node. It then tries to write
some data through the last node and
expects to get 'stale topology' exception.
Use the presence of the endpoint in _live_endpoints
as the authoritative source for is_alive,
rather than the endpoint_state::is_alive status.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than keeping expensive copies of `_live_endpoints`
and `_unreachable_endpoints` in shadow members (they aren't
currently used for their content anyhow) just to detect when
their corresponding members change.
With that, the function is renamed to replicate_live_and_unreachable_endpoints.
This still doesn't provide strong exception safety guarantees,
but at least we don't "cheat" about shard state
and we don't mark shard 0 state as "replicated" by
updating the shadow members.
Also, we save some unneeded allocations.
Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As asias@scylladb.com noticed, after the previous
patch that calls replicate_live_endpoints_on_change
in mutate_live_and_unreachable_endpoints, _live_endpoints
are always updated on all shards when they change,
so there's no need anymore to replicate them here.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to propagate any change to _live_endpoints
to all shards. Currently we just update the `_live_endpoints_version`
and `replicate_live_endpoints_on_change` propagates the
change some undetermined time in the future.
To rely on `_live_endpoints` for gossiper::is_alive,
that may be called on any shard, we want to propagate
the change to all shards as soon as it happens.
Use `mutate_live_and_unreachable_endpoints` to update
_live_endpoints and/or _unreachable_endpoints safely,
under `lock_endpoint_update_semaphore`. It is responsible
for incrementing _live_endpoints_version and
calling `replicate_live_endpoints_on_change` to
propagate the change to all shards.
Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for safely modifying _live_endpoints
and/or _unreachable_endpoints and replicating the
new version to all shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Three places handle comma-separated lists similarly:
- ss::remove_node.set(...) in api::set_storage_service,
- storage_service::parse_node_list,
- storage_service::is_repair_based_node_ops_enabled.
In the next commit, the fourth place that needs the same logic
appears -- storage_service::raft_replace. It needs to load
and parse the --ignore-dead-nodes-for-replace param from config.
Moreover, the code in is_repair_based_node_ops_enabled is
different and doesn't seem right. We swap '\"' and '\'' with ' '
but don't do anything with it afterward.
To avoid code duplication and fix is_repair_based_node_ops_enabled,
we introduce the new function utils::split_comma_separated_list.
This change has a small side effect on logging. For example,
ignore_nodes_strs in storage_service::parse_node_list might be
printed in a slightly different form.
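The intended behavior can be sketched in Python (the real helper is C++ in utils/; the exact quote and whitespace handling here is an assumption for illustration):

```python
def split_comma_separated_list(s: str):
    # split on commas, strip surrounding whitespace and quotes,
    # and drop empty entries
    return [tok
            for tok in (t.strip().strip("'\"") for t in s.split(','))
            if tok]
```

For example, a node list like `"n1 , 'n2',, \"n3\""` would come out as `['n1', 'n2', 'n3']`, which also explains the small logging difference mentioned above.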
In the following commits, we add support for --ignore-dead-nodes
in raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace. To make these request parameters accessible for the
topology coordinator, we store them in the new ignore_nodes
column of system.topology.
If we don't clear them, there is a slight chance
that the next update will make `_live_endpoints` or `_unreachable_endpoints`
equal to their shadow counterparts and prevent an update
in `replicate_live_endpoints_on_change`.
Fixes#15003
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Not only on the calling shard (shard 0).
Essentially this change folds `update_live_endpoints_version`
into `reset_endpoint_state_map`.
Acquire the _endpoint_update_semaphore to serialize
this with `replicate_live_endpoints_on_change`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`update_live_endpoints_version` and functions that call it
must be called on shard 0, since it updates the authoritative
`_live_endpoints` and `_live_endpoints_version`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a private helper to acquire the _endpoint_update_semaphore
before calling replicate_live_endpoints_on_change.
Must be called on shard 0.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to maintain as an unordered_set,
and it will be used in a following patch
to determine is_alive(endpoint) in O(1) on average.
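The resulting lookup, in a Python sketch (a hash set stands in for the C++ unordered_set; the endpoint values are illustrative):

```python
live_endpoints = {'127.0.0.1', '127.0.0.2'}  # hash set of live endpoints

def is_alive(endpoint: str) -> bool:
    # membership test in a hash set is O(1) on average,
    # vs O(n) for a linear scan of a vector
    return endpoint in live_endpoints
```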
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prevent destroying potentially large `rows` and `row_list`
in one shot on error, as it might cause a reactor stall.
Instead, use utils::clear_gently on the error return path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although to_repair_rows_list may yield if needed
between rows and mutation fragments, the input
`repair_rows_on_wire` is freed in one shot
and that may cause stalls as seen in qa:
```
| bytes_ostream::free_chain at ././bytes_ostream.hh:163
++ - addr=0x4103be0:
| bytes_ostream::~bytes_ostream at ././bytes_ostream.hh:199
| (inlined by) frozen_mutation_fragment::~frozen_mutation_fragment at ././mutation/frozen_mutation.hh:273
| (inlined by) std::destroy_at<frozen_mutation_fragment> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
| (inlined by) std::__cxx11::_List_base<frozen_mutation_fragment, std::allocator<frozen_mutation_fragment> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
| (inlined by) partition_key_and_mutation_fragments::~partition_key_and_mutation_fragments at ././repair/repair.hh:203
| (inlined by) std::destroy_at<partition_key_and_mutation_fragments> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
| (inlined by) std::__cxx11::_List_base<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
| (inlined by) to_repair_rows_list at ./repair/row_level.cc:597
```
This change consumes the rows and frozen mutation fragments
incrementally, freeing each after being processed.
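The pattern can be sketched in Python (Scylla's version works on the C++ lists above and yields to the reactor between items; `consume_gently` is an illustrative name, not the actual helper):

```python
def consume_gently(rows, process):
    # pop items one at a time so the container shrinks incrementally
    # instead of all of its elements being destroyed in one shot;
    # in Scylla each iteration is also a preemption point
    out = []
    while rows:
        out.append(process(rows.pop(0)))
    return out
```

When the loop finishes, the input container is already empty, so its destructor has nothing left to free at once.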
Fixes scylladb/scylladb#14537
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We already have a test for issue #13533, where an "IN" doesn't work with
a secondary index (the secondary index isn't used in that case, and
instead inefficient filtering is required). Recently a user noticed the
same problem also exists for local secondary indexes - and this patch
includes a reproducing test. The new test is marked xfail, as the issue is
still unfixed. The new test is Scylla-only because local secondary index
is a Scylla-only extension that doesn't exist in Cassandra.
Refs #13533.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15106
The system.tablets table stores replica sets as a CQL set type,
which is sorted. This means that if, in a tablet replica set
[n1, n2, n3] n2 is replaced with n4, then on reload we'll see
[n1, n3, n4], changing the relative position of n3 from the third
replica to the second.
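The reordering is easy to demonstrate; here Python's sorted output stands in for the CQL set type, and the list version mirrors the order-preserving behavior described below:

```python
def replace_in_set(replicas, old, new):
    # a CQL set is stored sorted, so positional information is lost
    return sorted(set(replicas) - {old} | {new})

def replace_in_list(replicas, old, new):
    # a CQL list preserves positions (as locator::replace_replica does)
    return [new if r == old else r for r in replicas]

print(replace_in_set(['n1', 'n2', 'n3'], 'n2', 'n4'))   # n3 slides to slot 2
print(replace_in_list(['n1', 'n2', 'n3'], 'n2', 'n4'))  # n3 stays in slot 3
```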
The relative position of replicas in a replica set is important
for materialized views, as they use it to pair base replicas with
view replicas. To prepare for materialized views using tablets,
change the persistent data type to list, which preserves order.
The code that generates new replica sets already preserves order:
see locator::replace_replica().
While this changes the system schema, tablets are an experimental
feature so we don't need to worry about upgrades.
Closes #15111
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectLimitTest.java into our cql-pytest framework.
The tests reproduce two already-known bugs:
Refs #9879: Using PER PARTITION LIMIT with aggregate functions should
fail as Invalid query
Refs #10357: Spurious static row returned from query with filtering,
despite not matching filter
And also helped discover two new issues:
Refs #15099: Incorrect sort order when combining IN and ORDER BY
Refs #15109: PER PARTITION LIMIT should be rejected if SELECT DISTINCT
is used
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15114
commit 7c8c020 introduced a new type of keyspace, an internal keyspace.
It defined the semantics for this internal keyspace, which is
somewhat a hybrid between a system and a user keyspace.
Here we extend the semantics to also include flushes, meaning that
flushes will be done using the system dirty_memory_manager. This is
in order to allow interdependencies between internal tables and user
tables and prevent deadlocks.
One example of such a deadlock is our `replicated_key_provider`
encryption in the enterprise version. The deadlock occurs because, in some
circumstances, an encrypted user table flush is dependent upon the
`encrypted_keys` table being flushed, but since the requests are
serialized, we get a deadlock.
Tests: unit tests dev + debug
The deadlock dtest reproducer:
encryption_at_rest_test.py::TestEncryptionAtRest::test_reboot
Fixes#14529
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #14547
"
The series fixes a bogus assertion during topology state load and adds a
test that runs rebuild to make sure the code will not regress again.
Fixes#14958
"
* 'gleb/rebuilding_fix_v1' of github.com:scylladb/scylla-dev:
test: add rebuild test
system_keyspace: fix assertion for missing transition_state
Clang and GCC's warning option of `-Wbraced-scalar-init` warns
at seeing superfluous use of braces, like:
```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2187:32: error: braces around scalar initializer [-Werror,-Wbraced-scalar-init]
.snapshot_threshold{1},
^~~
```
usually, this does not hurt. but by taking the braces out, we have
a more readable piece of code and fewer warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15086
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.
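The capping logic, sketched in Python (the function name and log format are illustrative, not the actual C++ code):

```python
INT32_MAX = 2**31 - 1

def cap_local_deletion_time(ldt, logger=None):
    """Cap an out-of-range tombstone deletion time, logging the context."""
    if ldt > INT32_MAX:
        if logger:
            # the real message also includes the sstable name and pk,
            # which is what this series adds
            logger(f"capping tombstone local_deletion_time {ldt} "
                   f"to {INT32_MAX - 1}")
        return INT32_MAX - 1
    return ldt
```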
Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
local_deletion_time (ldt for short) is a timestamp used for the
purpose of purging the tombstone after gc_grace_seconds. if its value
is greater than INT32_MAX, it is capped when being written to an sstable.
this is very likely a signal of bad configuration or even a bug in
scylla. so we keep track of it with a metric named
"scylla_sstables_capped_tombstone_deletion_time".
in this change, a test is added to verify that the metric is updated
upon seeing a tombstone with this abnormal ldt.
because we validate the consistency before and after compaction in
tests, this change adds a parameter to disable this check, otherwise,
because capping the ldt changes the mutation, the validation would
fail the test.
Refs #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow the exception in perform_task for reshape compaction.
Fixes: #15058.
Closes #15067
This argument had been dead since its introduction, and 'discard' was
always configured regardless of its value.
This patch allows actually configuring things using this argument.
Fixes #14963 Closes #14964
This commit updates the description of perftune.py.
It is based on the information in the reported issue (below),
the contents of help for perftune.py, and the input from
@vladzcloudius.
Fixes https://github.com/scylladb/scylladb/issues/14233 Closes #14879
Fixes https://github.com/scylladb/scylla-docs/issues/2467
This commit updates the Networking section. The scope is:
- Removing the outdated content, including the reference to
the super outdated posix_net_conf.sh script.
- Adding the guidelines provided by @vladzcloudius.
- Adding the reference to the documentation for
the perftune.py script.
Closes #14859
The test is flaky on CI in debug builds
on aarch64 (#14752); here we sprinkle more
logs for debug/aarch64, hoping it'll help to
debug it.
Refs #14752 Closes #14822
in load_sstables(), `sst_path` is already an instance of `std::filesystem::path`,
so there is no need to cast it to `std::filesystem::path`. also,
`path.remove_filename()` returns something like
"system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/", with the
trailing slash kept. when we get a component's path in `sstable::filename`,
we always add a "/" in between the `dir` and the filename, so this'd
end up with two slashes in the path, like:
"/var/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//mc-2-big-Data.db"
so, in order to remove the duplicated slash, let's just use
`path.parent_path()` here.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15035
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To prevent a use-after-free as seen in
https://github.com/scylladb/scylladb/issues/15097,
where a temporary schema_ptr retrieved from a global_schema_ptr
got destroyed when the notification function yielded.
Capturing the schema_ptr on the coroutine frame
is inexpensive since it's a shared pointer, and it makes sure
that the schema remains valid throughout the coroutine's
lifetime.
Fixes scylladb/scylladb#15097
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15098
`raft_server` in test/raft/randomized_nemesis_test.cc manages
instances of direct_fd_pinger and direct_fd_clock with unique_ptr<>.
this unique_ptr<> deletes these managed instances using delete.
but since these two classes have virtual functions without a virtual
destructor, the compiler warns when deleting them: in theory, these
pointers could be pointing to derived classes, and deleting them
through a base-class pointer could lead to a leak.
so, to silence the warning and to prevent potential issues, let's
just mark these two classes final.
this should address the warning like:
```
In file included from /home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:9:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/reactor.hh:24:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/aligned_buffer.hh:24:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'direct_fd_pinger<int>' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<direct_fd_pinger<int>>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1400:5: note: in instantiation of member function 'std::unique_ptr<direct_fd_pinger<int>>::~unique_ptr' requested here
~raft_server() {
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: note: in instantiation of member function 'raft_server<ExReg>::~raft_server' requested here
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<raft_server<ExReg>>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1704:24: note: in instantiation of member function 'std::unique_ptr<raft_server<ExReg>>::~unique_ptr' requested here
._server = nullptr,
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:1742:19: note: in instantiation of member function 'environment<ExReg>::new_node' requested here
auto id = new_node(first, std::move(cfg));
^
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2113:39: note: in instantiation of member function 'environment<ExReg>::new_server' requested here
auto leader_id = co_await env.new_server(true);
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15084
to mirror the build modes supported by `configure.py`.
Closes #15085
* github.com:scylladb/scylladb:
build: cmake: support Coverage and Sanitize build modes
build: cmake: error out if specified build type is unknown
On boot several manipulations with system.local are performed.
1. The host_id value is selected from it with key = local.
If not found, system_keyspace generates a new host_id, inserts the
new value into the table and returns it.
2. The cluster_name is selected from it with key = local.
Then it's system_keyspace that either checks that the name matches
the one from db::config, or inserts the db::config value into the
table.
3. The row with key = local is updated with various info like versions,
listen, rpc and bcast addresses, dc, rack, etc., unconditionally.
All three steps are scattered over main: p.1 is called directly, p.2 and
p.3 are executed via system_keyspace::setup(), which happens rather late.
Also there's some touch of this table from the cql_test_env startup code.
The proposal is to collect this setup into one place and execute it
early -- as soon as the system.local table is populated. This frees the
system_keyspace code from the logic of selecting host id and cluster
name leaving it to main and keeps it with only select/insert work.
refs: #2795
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #15082
Tablet migration may execute a global token metadata barrier before
executing updates of system.tablets. If the table is dropped while the
barrier is happening, the updates will bring back rows for migrated
tablets in a table which is no longer there. This will cause tablet
metadata loading to fail with the error:
missing_column (missing column: tablet_count)
Like in this log line:
storage_service - raft topology: topology change coordinator fiber got error raft::stopped_error (Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): std::_Nested_exception<std::runtime_error> (Failed to read tablet metadata): missing_column (missing column: tablet_count)")
The fix is to read and execute the updates in a single group0 guard
scope, and move execution of the barrier later. We can no longer generate
updates in the same handle_tablet_migration() step if the barrier needs to
be executed, so we reuse the mechanism for two-step stage transition
which we already have for handling of streaming. The next pass will
notice that the barrier is not needed for a given tablet and will
generate the stage update.
Fixes #15061 Closes #15069
The DML page is quite long (21 screenfuls on my monitor); split
it into one page per statement to make it more digestible.
The sections that are common to multiple statements are kept
in the main DML page, and references to them are added.
Closes #15053
scylladb overrides some of seastar's logging related options with its
own options by applying them with `logging::apply_settings()`. but
we fail to inherit `with_color` from Seastar as we are using the
designated initializer, so the unspecified members are zero initialized.
that's why we always have logging messages in black and white even
if scylla is running in a tty and `--log-with-color 1` is specified.
so, to make the debugging life more colorful, let's inherit the option
from Seastar, and apply it when setting logging related options.
see also 29e09a3292
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15076
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
```
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```
Closes #14863
* github.com:scylladb/scylladb:
Kill scylla option to configure number of compaction groups
replica: Wire tablet into compaction group
token_metadata: Add this_host_id to topology config
replica: Switch to chunked_vector for storing compaction groups
replica: Generate group_id for compaction_group on demand
this should help the developer to understand what build types are
supported if the specified one is unknown.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this series, the "experimental" option is marked `Unused`, as it has been deprecated for almost 2 years, since scylla 4.6. instead, `experimental_features` is used to specify the desired experimental features explicitly.
Closes #14948
* github.com:scylladb/scylladb:
config: remove unused namespace alias
config: use std::ranges when appropriate
config: drop "experimental" option
test: disable 'enable_user_defined_functions' if experimental_features does not include udf
test: pylib: specify experimental_features explicitly
to note that sstablemetadata is being deprecated and encourage
users to switch over to the native tools.
Fixes#15020
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #15040
in this series, the build of rpm and deb from submodules is fixed:
1. correct the path of reloc package
2. add the dependency of reloc package to deb/rpm build targets
Closes #15062
* github.com:scylladb/scylladb:
build: cmake: correct reloc_pkg's path
build: cmake: build rpm/deb from reloc_pkg
before this change, object_store/test_basic.py creates a config file
for specifying the object storage settings, and passes the path of this
file as the argument of the `--object-storage-config-file` option when
running scylla. we have the same requirement when testing scylla
with a minio server, where we launch a minio server and manually
create the config file and feed it to scylla.
to ease the preparation work, let's consolidate by creating the
config file in `minio_server.py`, so it always creates the config
file and puts it in its tempdir. since object_store/test_basic.py
can also run against an S3 bucket, the fixture implemented in
object_store/conftest.py is updated accordingly to reuse the
helper exposed by MinioServer to create the config file when it
is not available.
Closes#15064
* github.com:scylladb/scylladb:
s3/client: avoid hardwiring env variables names
s3/client: generate config file for tests
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls, which are made
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well, and also move the retry loop up. This
patch does it by introducing a new function, take_guard(), to the
cql_statement class. Schema-altering statements return a group0 guard
while others do not return any guard. The query processor takes this
guard at the beginning of statement execution and retries if
service::group0_concurrent_modification is thrown. The guard is passed
to execute() in the query_state structure.
Fixes: #13942
Message-ID: <ZNsynXayKim2XAFr@scylladb.com>
in 34c3688017, we added a virtual function
to `config_file`, and we allocate and delete `db::config` instances
via `unique_ptr<>`. this makes the compiler nervous, as deleting a
pointer to an instance of a non-final class with virtual functions
but a non-virtual destructor could lead to a leak, if the pointer
actually points to a derived class. so, in order to silence the
warning and to prevent potential problems in the future, let's
mark `db::config` final.
the warning from Clang 16 looks like:
```
In file included from /home/kefu/dev/scylladb/test/lib/test_services.cc:10:
In file included from /home/kefu/dev/scylladb/test/lib/test_services.hh:25:
In file included from /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/memory:78:
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:99:2: error: delete called on non-final 'db::config' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
delete __ptr;
^
/usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/unique_ptr.h:404:4: note: in instantiation of member function 'std::default_delete<db::config>::operator()' requested here
get_deleter()(std::move(__ptr));
^
/home/kefu/dev/scylladb/test/lib/test_services.cc:189:16: note: in instantiation of member function 'std::unique_ptr<db::config>::~unique_ptr' requested here
auto cfg = std::make_unique<db::config>();
^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15071
instead of hardwiring the names in multiple places, let's just
keep them in a single place as variables, and reference them by
these variables instead of their values.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, object_store/test_basic.py created a config file
for specifying the object storage settings, and passed the path of this
file as the argument of the `--object-storage-config-file` option when
running scylla. we have the same requirement when testing scylla
with a minio server, where we launch a minio server and manually
create the config file and feed it to scylla.
to ease the preparation work, let's consolidate by creating the
config file in `minio_server.py`, so it always creates the config
file and puts it in its tempdir. since object_store/test_basic.py
can also run against an S3 bucket, the fixture implemented in
object_store/conftest.py is updated accordingly to reuse the
helper exposed by MinioServer to create the config file when it
is not available.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The option was introduced to bootstrap the project. It's still
useful for testing, but that translates into maintaining an
additional option and code that will not really be used
outside of testing. A possible option is to later map the
option in boost tests to initial_tablets, which may yield
the same effect for testing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
There's a need for compaction_group_manager, as the table will still
support "tabletless" mode, and we don't want to sprinkle ifs here and
there to support both modes. It's not really a manager (it's not even
supposed to store state), but I couldn't find a better name.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as raft topology is pulled later
than new tables are registered and created, and this node is added
to topology even later.
To allow creation of compaction groups to retrieve "my id" from
token metadata early, initialization will now feed local id
into topology config which is immutable for each node anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If `live_updatable_config_params_changeable_via_cql` is set to true, configuration parameters defined with the `liveness::LiveUpdate` option can be updated at runtime with CQL, i.e. by updating the `system.config` virtual table.
If we don't want any configuration parameter to be changed at
runtime by updating the `system.config` virtual table, this option
should be set to false. This option should be set to false e.g. for
cloud users, who can only perform CQL queries and should not be able
to change scylla's configuration on the fly.
The current implementation is generic, but has a small drawback - messages
returned to the user may not be fully accurate; consider:
```
cqlsh> UPDATE system.config SET value='2' WHERE name='task_ttl_in_seconds';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="option is not live-updateable" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
```
where `task_ttl_in_seconds` has been defined with
`liveness::LiveUpdate`, but because `live_updatable_config_params_changeable_via_cql` is set to
`false` in `scylla.yaml`, `task_ttl_in_seconds` cannot be modified at
runtime by updating the `system.config` virtual table.
Fixes#14355
Closes#14382
Before compaction task executors started inheriting from
compaction_task_impl, they were destructed immediately after
compaction finished. Destructors of executors and their
fields performed actions that affected global structures and
statistics and had impact on compaction process.
Currently, task executors are kept in memory much longer, as they
are tracked by the task manager. Thus, destructors are not called just
after the compaction, which results in compaction stats not being
updated, causing e.g. an infinite cleanup loop.
Add release_resources() method which is called at the end
of compaction process and does what destructors used to.
Fixes: #14966.
Fixes: #15030.
Closes#15005
should have used `ignore_errors=True` to ignore
the error. this issue has not popped up, because
we haven't run into the case where the log file
does not exist.
this was a regression introduced by
d4ee84ee1e
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15063
before this change, the filename in path of reloc package looks like:
tools-scylla-5.4.0~dev-0.20230816.2eb6dc57297e.noarch.tar.gz
but it should have been:
scylla-tools-5.4.0~dev-0.20230816.2eb6dc57297e.noarch.tar.gz
so, when repackaging the reloc tarball to rpm / deb, the scripts
just fail to find the reloc tarball.
after this change, the filename is corrected to match with the one
generated using `build_reloc.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, dist-${name}-rpm and dist-${name}-deb targets
do not depend on the corresponding reloc pkg from which these
prebuilt packages are created. so these scripts fail if the reloc
package does not exist.
to address this problem, the reloc package is added as their
dependencies.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
An sstable can be in one of several states -- normal, quarantined, staging, uploading. Right now this "state" is hard-wired into the sstable's path, e.g. a quarantined sstable would sit in a directory like /var/lib/data/ks-cf-012345/quarantine/. Respectively, there's a bunch of directory-name constexprs in sstables.hh defining each "state". Other than being confusing, this approach doesn't work well with the S3 backend. Additionally, there's the snapshot subdir that adds to the confusion, because a snapshot is not quite a state.
This PR converts the "state" from constexpr char* directory names into an enum class and patches the sstable creation, opening and state-changing API to use that enum instead of parsing the path.
refs: #13017
refs: #12707
Closes#14152
* github.com:scylladb/scylladb:
sstable/storage: Make filesystem storage with initial state
sstable: Maintain state
sstable: Make .change_state() accept state, not directory string
sstable: Construct it with state
sstables_manager: Remove state-less make_sstable()
table: Make sstables with required state
test: Make sstables with upload state in some cases
tools: Make sstables with normal state
table: Open-code sstables making streaming helpers
tests: Make sstables with normal state by default
sstable_directory: Make sstable with required state
sstable_directory: Construct with state
distributed_loader: Make sstable with desired state when populating
distributed_loader: Make sstable with upload state when uploading
sstable: Introduce state enum
sstable_directory: Merge verify and g.c. calls
distributed_loader: Merge verify and gc invocations
sstable/filesystem: Put underscores to dir members
sstable/s3: Mark make_s3_object_name() const
sstable: Remove filename(dir, ...) method
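The shape of the state enum can be sketched like this (illustrative Python for brevity -- the actual change is in C++, and the enum members and helper names here are only indicative, not the real API):

```python
from enum import Enum

class SstableState(Enum):
    """Sketch of the sstable state enum replacing hard-wired subdirectory names."""
    NORMAL = ""            # sstable lives directly in the table directory
    STAGING = "staging"
    QUARANTINE = "quarantine"
    UPLOAD = "upload"

def sstable_dir(table_dir: str, state: SstableState) -> str:
    """The filesystem storage driver derives the directory from the state
    on demand, instead of parsing the state back out of the path."""
    if state == SstableState.NORMAL:
        return table_dir
    return f"{table_dir}/{state.value}"
```

With this scheme, an S3-backed storage driver can ignore the directory mapping entirely and track the state as plain metadata.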
and use that in compaction_group, rather than
respective accumulators of its own.
This is part of a larger series to make cache updates exception safe.
Refs #14043
Closes#15052
* github.com:scylladb/scylladb:
sstable_set: maintain total bytes_on_disk
sstable_set: insert, erase: return status
there is a small time window after we find a free port and before
the minio server listens on that port; if another server sneaks
into that window and listens on the port, the minio server can
still fail to start even though there might be a free port for it.
so, in this change, we just retry with a random port a fixed
number of times until the minio server is able to serve.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15042
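The retry scheme can be sketched roughly as follows (a minimal illustration, not the actual `minio_server.py` code; the port range, retry count and function name are made up):

```python
import random
import socket

def start_on_random_port(tries: int = 5) -> socket.socket:
    """Try to bind a listening socket on a random port, retrying a fixed
    number of times if another process sneaks in and takes the port."""
    for _ in range(tries):
        port = random.randint(10000, 20000)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen(1)
            return sock  # bound successfully -- no check-then-use race
        except OSError:
            sock.close()  # port taken; pick another one and retry
    raise RuntimeError(f"could not find a free port after {tries} attempts")
```

Binding directly and retrying on failure avoids the check-then-use window entirely, since the bind itself is the atomic reservation of the port.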
We aim for a large number of tablets, therefore let's switch
to chunked_vector to avoid large contiguous allocs.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There are a few good reasons for this change.
1) compaction_group doesn't have to be aware of # of groups
2) thinking forward to dynamic tablets, # of groups cannot be
statically embedded in group id, otherwise it gets stale.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The use-after-move is not very harmful as it's only used when
handling an exception, so the user would be left with a bogus message.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#15054
before this change, `scylla sstable dump-statistics` prints the
"regular_columns" as a list of strings, like:
```
"regular_columns": [
"name",
"clustering_order",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"column_name_bytes",
"type_name",
"org.apache.cassandra.db.marshal.BytesType",
"name",
"kind",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type",
"name",
"position",
"type_name",
"org.apache.cassandra.db.marshal.Int32Type",
"name",
"type",
"type_name",
"org.apache.cassandra.db.marshal.UTF8Type"
]
```
but according to
https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/scylla-sstable.html#dump-statistics,
> $SERIALIZATION_HEADER_METADATA := {
> "min_timestamp_base": Uint64,
> "min_local_deletion_time_base": Uint64,
> "min_ttl_base": Uint64",
> "pk_type_name": String,
> "clustering_key_types_names": [String, ...],
> "static_columns": [$COLUMN_DESC, ...],
> "regular_columns": [$COLUMN_DESC, ...],
> }
>
> $COLUMN_DESC := {
> "name": String,
> "type_name": String
> }
"regular_columns" is supposed to be a list of "$COLUMN_DESC".
the same applies to "static_columns". this schema makes sense,
as each column should be considered a single object composed
of two properties, but we dump them as a flat list.
so, in this change, we guard each visit() call of `json_dumper()`
with `StartObject()` and `EndObject()` pair, so that each column
is printed as an object.
after the change, "regular_columns" are printed like:
```
"regular_columns": [
{
"name": "clustering_order",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "column_name_bytes",
"type_name": "org.apache.cassandra.db.marshal.BytesType"
},
{
"name": "kind",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
},
{
"name": "position",
"type_name": "org.apache.cassandra.db.marshal.Int32Type"
},
{
"name": "type",
"type_name": "org.apache.cassandra.db.marshal.UTF8Type"
}
]
```
Fixes#15036
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15037
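The shape of the fix, mirrored in Python (the actual change is in the C++ `json_dumper`; this only reproduces the output structure, not the real code):

```python
import json

def dump_columns(columns):
    """Emit each column as its own object with "name"/"type_name" keys,
    instead of flattening all key/value pairs into one long list."""
    return json.dumps(
        [{"name": name, "type_name": type_name} for name, type_name in columns],
        indent=2,
    )
```

In the C++ dumper the equivalent is wrapping each per-column visit() in a `StartObject()`/`EndObject()` pair, so the column boundaries survive in the JSON output.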
This reverts commit 70b5360a73. It generates
a failure in group0_test.test_concurrent_group0_modifications in debug
mode with about 4% probability.
Fixes#15050
and use that in compaction_group, rather than
respective accumulators of its own.
bytes_on_disk is implemented by each sstable_set_impl
and is updated on insert and erase (whether directly
into the sstable_set_impl or via the sstable_set).
Although compound_sstable_set doesn't implement
insert and erase, it overrides `bytes_on_disk()` to return
the sum of all the underlying `sstable_set::bytes_on_disk()`.
Also, added respective unit tests for `partitioned_sstable_set`
and `time_series_sstable_set`, that test each type's
bytes_on_disk, including cloning of the set, and the
`compound_sstable_set` bytes_on_disk semantics.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
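The accounting scheme can be sketched like this (illustrative Python; the real implementation lives in the C++ sstable_set_impl classes, and these class and method names only echo that design):

```python
class SstableSet:
    """Maintains total bytes_on_disk incrementally on insert/erase, so
    compaction_group doesn't need accumulators of its own."""
    def __init__(self):
        self._sstables = {}       # name -> size on disk
        self._bytes_on_disk = 0

    def insert(self, name, size):
        if name not in self._sstables:   # ignore duplicate inserts
            self._sstables[name] = size
            self._bytes_on_disk += size

    def erase(self, name):
        self._bytes_on_disk -= self._sstables.pop(name, 0)

    def bytes_on_disk(self):
        return self._bytes_on_disk

class CompoundSstableSet:
    """Holds no sstables itself; reports the sum of the underlying sets."""
    def __init__(self, sets):
        self._sets = sets

    def bytes_on_disk(self):
        return sum(s.bytes_on_disk() for s in self._sets)
```

Keeping the counter inside the set means every code path that mutates the set updates the total automatically, which is what makes the accounting exception-safe to maintain.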
perform_offstrategy is called from try_perform_cleanup
when there are sstables in the maintenance set that require
cleanup.
The input sstables are inserted into the compaction_state
`sstables_requiring_cleanup` and `try_perform_cleanup`
expects offstrategy compaction to clean them up along
with reshape compaction.
Otherwise, the maintenance sstables that require cleanup
are not cleaned up by cleanup compaction, since
the reshape output sstable(s) are not analyzed again
after reshape compaction, where that would insert
the output sstable(s) into `sstables_requiring_cleanup`
and trigger their cleanup in the subsequent cleanup compaction.
The latter method is viable too, but it is less efficient
since we can do reshape+cleanup in one pass, vs.
reshape first and cleanup later.
Fixes scylladb/scylladb#15041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15043
Document how SELECT clauses are considered. For example, given the query
SELECT * FROM tab WHERE a = 3 LIMIT 1
We'll get different results if we first apply the WHERE clause and then LIMIT
the result set, or if we first LIMIT the result set and then apply the
WHERE clause.
Closes#14990
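The difference between the two orderings can be shown with plain Python list operations (an illustration of the semantics, not ScyllaDB code):

```python
# Rows of a hypothetical table `tab` with a single column `a`.
rows = [{"a": 1}, {"a": 3}, {"a": 3}]

# WHERE a = 3 first, then LIMIT 1: the limit caps the *matching* rows,
# so one matching row comes back.
where_then_limit = [r for r in rows if r["a"] == 3][:1]

# LIMIT 1 first, then WHERE a = 3: the single kept row happens not to
# match, so nothing comes back.
limit_then_where = [r for r in rows[:1] if r["a"] == 3]
```

CQL follows the first ordering: the WHERE clause selects the rows, and LIMIT then caps the filtered result set.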
discard_result ignores only successful futures. Thus, if a
perform_compaction<regular_compaction_task_executor> call fails,
the failure is considered abandoned, causing tests to fail.
Explicitly ignore the failed future.
Fixes: #14971.
Closes#15000
The filesystem storage driver uses different paths depending on sstable
state. It's possible to keep only the table directory _and_ the state,
and construct this path on demand when needed, but it's faster to keep
the full path onboard. All the more so as it's only exported via the
.prefix() call, which is for logs only, but still.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This means -- keep state on sstable, change it with change_state() call
and (!) fix the is_<state>() helpers not to check storage->prefix()
nit: mark requires_view_building() noexcept while at it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pretty cosmetic change, but it will allow S3 to finally support moving
sstables between states (after this patch it still doesn't)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves make_path() call from outside of sstable::sstable()
inside it. Later it will be moved even further. Also, now sstable can
know its state and keep it (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now all callers specify the state they want their sstables in explicitly
and the old API can be removed
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By default it's created with normal state, but there are some places
that need to put it into staging. Do it with new state enum
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As was mentioned in the previous patch, there are a few places in tests
that put sstables in the upload/ subdir and they really mean it. Those
need to use the sstables manager/directory API directly (already done)
and specify the state explicitly (this patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just like tests, the tool opens sstables by full path and doesn't make
any assumptions about sstable state
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of those that call each other to end up calling plain
make_sstable() one. It's simpler to patch both if they just call the
latter directly.
While at it -- drop the unused default argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's assumed that tests are not very specific about which
subdirectory an sstable is in, so they can use the normal state. Places
that need to move sstables between states will use the sstable manager
API explicitly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The state is on sstable_directory, can switch to using the new manager
API. The full path is still there, will be dropped later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to replace full path sitting on this object eventually. For now
they have to co-exist, but state will be used to make_sstable()-s from
manager with its new API
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This still needs to convert state to directory name internally, as
sstable_directory instances are hashed on the populator by subdir string.
Also, the full string path is printed in logs. All this is now internal
to the populate method and will be fixed later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several states between which an sstable can migrate. Nowadays
the state is encoded into the sstable directory, which is not nice. Also,
S3-backed sstables don't support states, keeping sstables only in
"normal". This patch adds the state enum in order to replace the
path-encoded one eventually. A new sstables_manager::make_sstable()
method is added that accepts the table directory (without the
quarantine/ or staging/ component) and the desired initial state
(optional). Next patches will make use of this maker and the existing
one will be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the fs storage driver, which can do dir/file
concatenation on its own. Moreover, this method shouldn't be used even
internally
Both init_system_keyspace() and init_non_system_keyspaces() populate
the keyspaces with the help of distributed_loader::populate_keyspace().
That method, in turn, walks the list of the keyspaces' tables to load
sstables from disk and attach them.
After that, both init_...-s take a second pass over the keyspaces'
tables to call table::mark_ready_for_writes() on each. This marking can
be moved into populate_keyspace(); that's much easier and shorter
because that method already has the shard-wide table pointer and can
just call whatever it needs on the table.
This changes the initialization sequence: before the patch, all tables
were populated before any of them was marked as ready for write. This
looks safe, however, as marking a table for write means resetting its
generation generator, and different tables' generators are independent
from each other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#15026
To add a sharded service to the cql_test_env one needs to patch it in 5 or 6 places
- add cql_test_env reference
- add cql_test_env constructor argument
- initialize the reference in initializer list
- add service variable to do_with method
- pass the variable to cql_test_env constructor
- (optionally) export it via cql_test_env public method
Steps 1 through 5 are annoying; things get much simpler if it looks like
- add cql_test_env variable
- (optionally) export it via cql_test_env public method
This is what this PR does
refs: #2795
Closes#15028
* github.com:scylladb/scylladb:
cql_test_env: Drop local *this reference
cql_test_env: Drop local references
cql_test_env: Move most of the stuff in run_in_thread()
cql_test_env: Open-code env start/stop and remove both
cql_test_env: Keep other services as class variables
cql_test_env: Keep services as class variables
cql_test_env: Construct env early
cql_test_env: De-static fdpinger variable
cql_test_env: Define all services' variables early
cql_test_env: Keep group0_client pointer
We see the abort_requested_exception error from time
to time, instead of sleep_aborted that was expected
and quietly ignored (in debug log level).
Treat abort_requested_exception the same way since
the error is expected on shutdown and to reduce
test flakiness, as seen for example, in
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3033/artifact/logs-full.release.010/1691896356104_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_schema/node2.log
```
INFO 2023-08-13 03:12:29,151 [shard 0] compaction_manager - Asked to stop
WARN 2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted)
INFO 2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Finished main loop
WARN 2023-08-13 03:12:29,152 [shard 0] cdc - Aborted update CDC description table with generation (2023/08/13 03:12:17, d74aad4b-6d30-4f22-947b-282a6e7c9892)
INFO 2023-08-13 03:12:29,152 [shard 1] compaction_manager - Asked to stop
INFO 2023-08-13 03:12:29,152 [shard 1] compaction_manager - Stopped
INFO 2023-08-13 03:12:29,153 [shard 0] init - Signal received; shutting down
INFO 2023-08-13 03:12:29,153 [shard 0] init - Shutting down view builder ops
INFO 2023-08-13 03:12:29,153 [shard 0] view - Draining view builder
INFO 2023-08-13 03:12:29,153 [shard 1] view - Draining view builder
INFO 2023-08-13 03:12:29,153 [shard 0] compaction_manager - Stopped
ERROR 2023-08-13 03:12:29,153 [shard 0] view - start failed: seastar::abort_requested_exception (abort requested)
ERROR 2023-08-13 03:12:29,153 [shard 1] view - start failed: seastar::abort_requested_exception (abort requested)
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#15029
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls, which are made
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well, and also move the retry loop up. This
patch does it by introducing a new function, take_guard(), to the
cql_statement class. Schema-altering statements return a group0 guard
while others do not return any guard. The query processor takes this
guard at the beginning of statement execution and retries if
service::group0_concurrent_modification is thrown. The guard is passed
to execute() in the query_state structure.
Fixes: #13942
Message-ID: <ZNSWF/cHuvcd+g1t@scylladb.com>
There are some asserting checks for keyspace and table existence on cql_test_env that perform some one-line work in a complex manner; tests can do better on their own. Removing them makes cql_test_env simpler
refs: #2795
Closes#15027
* github.com:scylladb/scylladb:
test: Remove require_..._exists from cql_test_env
test: Open-code ks.cf name parse into cdc_test
test: Don't use require_table_exists() in test/lib/random_schema
test: Use BOOST_REQUIRE(!db.has_schema())
test: Use BOOST_REQUIRE(db.has_schema())
test: Use BOOST_REQUIRE(db.has_keyspace())
test: Threadify cql_query_test::test_compact_storage case
test: Threadify some cql_query_test cases
The local auto& foo = env._foo references in run_in_thread() are no
longer needed; the code that uses foo can be switched to use _foo
(this->_foo) instead
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_with() method is static and cannot just access the cql_test_env
variable's fields, using local references instead. To simplify this,
most of the method's content is moved to the non-static run_in_thread()
method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are more services on do_with() stack that are not referenced from
the cql_test_env. Move them to be class variables too
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they are duplicated -- variables exist on the do_with() stack and
the class references some of them. This patch makes it vice-versa -- all
the variables are on the cql_test_env and do_with() references them. The
latter will change soon
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays they are all scattered along the .do_with() function. Keeping
them in one early place makes it possible to relocate them onto the
cql_test_env later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now a reference, but some time later it won't be possible to
initialize it at construction time, so turn it into a pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test uses qualified ks.cf name to find the schema, but it's the only
test case that does it. There's no point in maintaining a dedicated
helper on the cql_test_env just for that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This check is pointless. The subsequent call to find_column_family()
would call on_internal_error() in case schema is not found, and since
cql_test_env sets abort-on-internal-error to true, this would fail just
like that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Surprisingly, there's a dedicated helper for the check opposite to the
one fixed in the previous patch. Fix that one too
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as in the previous patch: the cql_test_env::require_table_exists()
helper does exactly the same, but returns a future and asserts on
failures for no gain
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cql_test_env::require_keyspace_exists() performs exactly the same
check, but is a future-returning function for no reason, and it
assert()s on failure, which is less informative (not that it ever
failed) than BOOST_REQUIRE
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's like the previous patch, and for the same reason, but the change is
a bit more complicated because it uses resolved futures' results in a
few places, so it likely deserves a separate commit
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those two use straight .then-s sequences, no point in keeping them that
long. Being threads makes next patches shorter and nicer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This version increment is not accompanied by a
global_token_metadata_barrier, which means the new
version won't be reflected in fence_version
and basically will have no effect in terms of fencing.
If make_task call in compaction_manager::perform_compaction yields,
compaction_task_executor::_compaction_state may be gone and gate
won't be held.
Hold gate immediately after compaction_task_executor is created.
Add comment not to call prepare_task without preparation.
Refs: #14971.
Fixes: #14977.
Closes#14999
before this change, we build multiple relocatable packages for
different builds in parallel using ninja. all these relocatable
packages are built using the same script,
`create-relocatable-package.py`, but this script always used the
same directory and file for the `.relocatable_package_version`
file.
so there are chances that these jobs building the relocatable
packages can race, writing / accessing the same file at the same
time. so, in this change, instead of using a fixed file path
for this temporary file, we use a NamedTemporaryFile for this purpose.
this should help us avoid build failures like
```
[2023-08-10T09:38:00.019Z] FAILED: build/debug/dist/tar/scylla-unstripped-5.4.0~dev-0.20230810.116c10a2b0c6.x86_64.tar.gz
[2023-08-10T09:38:00.019Z] scripts/create-relocatable-package.py --mode debug 'build/debug/dist/tar/scylla-unstripped-5.4.0~dev-0.20230810.116c10a2b0c6.x86_64.tar.gz'
[2023-08-10T09:38:00.019Z] Traceback (most recent call last):
[2023-08-10T09:38:00.019Z] File "/jenkins/workspace/scylla-master/scylla-ci/scylla/scripts/create-relocatable-package.py", line 130, in <module>
[2023-08-10T09:38:00.019Z] os.makedirs(f'build/{SCYLLA_DIR}')
[2023-08-10T09:38:00.019Z] File "<frozen os>", line 225, in makedirs
[2023-08-10T09:38:00.019Z] FileExistsError: [Errno 17] File exists:
'build/scylla-package'
```
Fixes#15018
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15007
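The fix boils down to the standard `tempfile` pattern; a minimal sketch, assuming a made-up helper name (this is not the actual `create-relocatable-package.py` code):

```python
import tempfile

def write_version_file(version: str) -> str:
    """Write the package version to a unique temporary file instead of a
    fixed path, so parallel ninja jobs cannot race on the same file."""
    with tempfile.NamedTemporaryFile(
            mode="w",
            suffix=".relocatable_package_version",
            delete=False) as f:
        f.write(version + "\n")
        return f.name  # each caller gets its own unique path
```

Since `NamedTemporaryFile` generates a fresh name per call, two concurrent builds can never collide on the version file the way they could with a fixed `build/...` path.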
Before returning the task status, wait_task waits for the task to
finish with the done() method and calls get() on the resulting future.
If the requested task fails, an exception is thrown and the user gets
an internal server error instead of a failed task status.
Now the result of the done() method is ignored.
Fixes: #14914.
Closes#14915
The topology coordinator only marks a replaced node as LEFT
during the replace operation and actually removes it from the
group 0 config in cleanup_group0_config_if_needed. If this
function is called before raft has committed a replacing node as
a voter, it does not remove the replaced node from the group 0
config. Then, the coordinator can decide that it has no work to
do and starts sleeping, leaving us with an outdated config.
This behavior reduces group 0 availability and causes problems in
tests (see #14885). Also, it makes the coordinator's logic
confusing - it claims that it has no work to do when it has some
work to do. Therefore, we modify the coordinator so that it
removes the replaced node earlier in handle_topology_transition.
Fixes#14885
Fixes#14975
Closes#15009
This series fixes a couple of review comments on #14845
Closes#14976
* github.com:scylladb/scylladb:
gossiper: lock_endpoint: fix comment regarding permit_id mismatch
gossiper: lock_endpoint: change assert to on_internal_error
The code assumes that if there is no transition_state, there should be no
nodes currently in transition in a state other than left_token_ring,
but the rebuild operation also creates such nodes, so add a check
for it as well.
there is a chance that the default port of 9000 is already in use on the host
running the test; in that case, we should try to use another available
port.
so, in this change, we try ports in the range [9000, 9000+1000), and
use the first one which is not connectable.
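The probing described above can be sketched in Python (hypothetical helper; the real logic lives in the Scylla test harness):

```python
import socket

def find_free_port(start: int = 9000, count: int = 1000,
                   host: str = "127.0.0.1") -> int:
    """Return the first port in [start, start+count) that nothing is
    listening on, i.e. the first one which is not connectable."""
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.1)
            # connect_ex returns a nonzero errno (e.g. ECONNREFUSED)
            # when nothing is listening, meaning the port is free
            if s.connect_ex((host, port)) != 0:
                return port
    raise RuntimeError("no free port in range")

# occupy a port, then check that the probe skips it
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
busy = srv.getsockname()[1]
assert find_free_port(start=busy, count=10) != busy
srv.close()
```

Note the inherent race: a port that is free at probe time can still be taken before the server binds it, so this is best-effort, like the change it illustrates.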
Fixes #14985
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14997
* github.com:scylladb/scylladb:
test: stop using HostRegistry in MinioServer
s3/client: check for available port before starting minio server
test.py schedules calls to the cluster's .uninstall() and .stop(), making
double calls to it run at the same time. Mark the cluster as not
running early on.
While there, do the same for .stop_gracefully() for consistency.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14987
Currently, `mark_dead` is called with null_permit_id
from `convict`, and in this case, by contract,
it should lock the endpoint, same as `mark_as_shutdown`.
This change somehow escaped #14845 so it amends it.
Fixes #14838
Closes #15001
* github.com:scylladb/scylladb:
gossiper: verify permit_id in all private functions
gossiper: convict: lock_endpoint
Instead of acquiring the permit if the permit_id arg is null,
like in mark_as_shutdown, just assert that the permit_id is
non-null. The functions are both private and we control all callers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, `mark_dead` is called with null_permit_id
from `convict`, and in this case, by contract,
it should lock the endpoint, same as `mark_as_shutdown`.
This change somehow escaped #14845 so it amends it.
Fixes #14838
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
since MinioServer finds a free port by itself, there is no need to
provide it an IP address anymore -- we can always use
127.0.0.1.
so, in this change, we just drop the HostRegistry parameter passed
to the constructor of MinioServer, and pass the host address in place
of it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `system.group0_history` table provides useful descriptions for each
command committed to Raft group 0. One way of applying a command to
group 0 is by calling `migration_manager::announce`. This function has
the `description` parameter set to empty string by default. Some calls
to `announce` use this default value which causes `null` values in
`system.group0_history`. We want `system.group0_history` to have an
actual description for every command, so we change all default
descriptions to reasonable ones.
Going further, we remove the default value for the `description`
parameter of `migration_manager::announce` to avoid using it in the
future. Thanks to this, all commands in `system.group0_history` will
have a non-null description.
Fixes #13370
Closes #14979
* github.com:scylladb/scylladb:
migration_manager: announce: remove the default value of description
test: always pass empty description to migration_manager::announce
migration_manager: announce: provide descriptions for all calls
there is a chance that the default port of 9000 is already in use on the
host running the test; in that case, we should try to use another
available port.
so, in this change, we try ports in the range [9000, 9000+1000),
and use the first one which is not connectable.
Fixes #14985
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently `sstable_requiring_cleanup` is updated using `compacting_sstable_registration`, but that mechanism is not used by offstrategy compaction, leading to #14304.
This series introduces `compaction_manager::on_compaction_completion`, which intercepts the call
to table::on_compaction_completion. This allows us to update `sstables_requiring_cleanup` right before the compacted sstables are deleted, making sure they are not leaked into `sstables_requiring_cleanup`, which would hold a reference to them until cleanup attempts to clean them up.
`cleanup_incremental_compaction_test` was adjusted to observe the sstables' `on_delete` (by adding a new observer event) to detect the case where cleanup attempts to delete the leaked sstables and fails since they were already deleted from the file system by offstrategy compaction. The test fails without the fix and passes with it.
Fixes #14304
Closes #14858
* github.com:scylladb/scylladb:
compaction_manager: on_compaction_completion: erase sstables from sstables_requiring_cleanup
compaction/leveled_compaction_strategy: ideal_level_for_input: special case max_sstable_size==0
sstable: add on_delete observer
compaction_manager: add on_compaction_completion
sstable_compaction_test: cleanup_incremental_compaction_test: verify sstables_requiring_cleanup is empty
before this change, we use a generator expression to initialize CMAKE_CXX_FLAGS_RELEASE; this has three problems:
1. the generator expression is not expanded when setting a regular variable.
2. the ending ">" is missing in one of the generator expressions.
3. the parameters are not separated with ";"
so, to address them, let's just
* use `add_compile_options()` directly, as the corresponding `mode.${build_mode}.cmake` is only included when the "${build_mode}" is used.
* add ";" in between the command line options.
* add the missing closing ">"
Closes #14996
* github.com:scylladb/scylladb:
build: cmake: pass --gc-sections to ld not ar
build: cmake: use add_compile_options() in release build
when it comes to `regular_compaction_task_executor`, we repeat the
compaction until it cannot proceed, so after an iteration
of compaction completes successfully, the task can still continue with
yet another round of compaction as it sees fit. so let's
update the comment to reflect this fact.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14891
quite a few member variables serve as the configuration for
a given compaction; they are immutable over its life cycle,
so for better readability, let's mark them `const`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14981
The only place that still calls it is the static force_blocking_flush() method. It can be made non-static already. Also, while at it, coroutinize some system_keyspace methods and fix a FIXME regarding replica::database access in it.
Closes #14984
* github.com:scylladb/scylladb:
code: Remove query-context.hh
code: Remove qctx
system_keyspace: Use system_keyspace's container() to flush
system_keyspace: Make force_blocking_flush() non-static
system_keyspace: Coroutinize update_tokens()
system_keyspace: Coroutinize save_truncation_record()
ar is not able to tell which sections are to be GC'ed, hence it does
not care about --gc-sections, but ld does. let's add this option
to CMAKE_EXE_LINKER_FLAGS.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we use a generator expression to initialize
CMAKE_CXX_FLAGS_RELEASE; this has three problems:
1. the generator expression is not expanded when setting
a regular variable.
2. the ending ">" is missing in one of the generator
expressions.
3. the parameters are not separated with ";"
so, to address them, let's just
* use `add_compile_options()` directly, as the corresponding
`mode.${build_mode}.cmake` is only included when the
"${build_mode}" is used.
* add ";" in between the command line options.
* add the missing closing ">"
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"experimental" was marked deprecated in 8b917f7c. this change
was included since Scylla 4.6. now that 5.3 has been branched,
this change will be included 5.4. this should be long enough
for the user's turn around if this option is ever used.
the dtests using this option has been audited and updated
accordingly. and the unit testing this option is removed as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"enable_user_defined_functions" is enabled by default by
`make_scylla_conf()` in pylib/scylla_cluster.py, and we've being
using `experimental` = True in this very function. this combination
works fine, as "udf" is enabled by `experimental`. but
since `experimental` is deprecated, if we drop this option and stop
handling it. this peace is broken. "enable_user_defined_function"
requires "udf" experimental feature. but test_boost_after_ip_change
feed the scylla with an empty `experimental_features` list, so
the test fails. to pave for the road of dropping `experimental`
option, let's disable `enable_user_defined_function` as well
in test_boost_after_ip_change.
the same applies to other tests changed in this commit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
"experimental" was marked deprecated in 8b917f7c. so let's
specify the experimental features explicitly using
`experimental_feature` option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In force_blocking_flush() there's an invoke-on-all invocation of
replica::database::flush() and a FIXME to get the replica database from
somewhere else rather than via query-processor -> data_dictionary.
Now that force_blocking_flush() is non-static, the invoke-on-all can
happen via system_keyspace's container and the database can be obtained
directly from the local sys.ks instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Erase retired sstables from compaction_state::sstables_requiring_cleanup
also in on_compaction_completion (in addition to
compacting_sstable_registration::release_compacting),
for offstrategy compaction with piggybacked cleanup
or any other compaction type that doesn't use
compacting_sstable_registration.
Add cleanup_during_offstrategy_incremental_compaction_test,
modeled after cleanup_incremental_compaction_test, to check
that cleanup doesn't attempt to clean up already-deleted
sstables that were left over by offstrategy compaction
in sstables_requiring_cleanup.
Fixes #14304
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prevent div-by-zero by returning a constant level 1
if max_sstable_size is zero, as configured by
cleanup_incremental_compaction_test before it is
extended to also cover offstrategy compaction.
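A sketch of the special case (a Python stand-in for the C++ `ideal_level_for_input`; the fan-out constant and the exact formula are illustrative assumptions, only the zero guard mirrors the fix):

```python
import math

def ideal_level_for_input(input_bytes: int, max_sstable_size: int,
                          fan_out: int = 10) -> int:
    """Pick the LCS level whose capacity fits the input. With
    max_sstable_size == 0 the division below would divide by zero,
    so return a constant level 1 instead, as the fix does."""
    if max_sstable_size == 0:
        return 1
    # level L holds roughly fan_out**L sstables of max_sstable_size each
    sstables = max(1, input_bytes // max_sstable_size)
    return max(1, math.ceil(math.log(sstables, fan_out)))

assert ideal_level_for_input(10**9, 0) == 1   # guarded special case
assert ideal_level_for_input(10**9, 160 * 2**20) >= 1
```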
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass the call through to table::on_compaction_completion
so we can manage the sstables-requiring-cleanup state
along the way.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We remove the default value for the description parameter of
migration_manager::announce to avoid using it in the future.
Thanks to this, all commands in system.group0_history will
have a non-null description.
In the next commit, we remove the default value for the
description parameter of migration_manager::announce to avoid
using it in the future. However, many calls to announce in tests
use the default value. We have to change it, but we don't really
care about descriptions in the tests, so we pass the empty string
everywhere.
The system.group0_history table provides useful descriptions
for each command committed to Raft group 0. One way of applying
a command to group 0 is by calling migration_manager::announce.
This function has the description parameter set to empty string
by default. Some calls to announce use this default value which
causes null values in system.group0_history. We want
system.group0_history to have an actual description for every
command, so we change all default descriptions to reasonable ones.
We can't provide a reasonable description to announce in
query_processor::execute_thrift_schema_command because this
function is called in multiple situations. To solve this issue,
we add the description parameter to this function and to
handler::execute_schema_command that calls it.
While in SQL DISTINCT applies to the result set, in CQL it applies
to the table being selected, and doesn't allow GROUP BY with clustering
keys. So reject the combination like Cassandra does.
While this is not an important issue to fix, it blocks un-xfailing
other issues, so I'm clearing it ahead of fixing those issues.
An issue is unmarked as xfail, and other xfails lose this issue
as a blocker.
Fixes #12479
Closes #14970
Knowing that a server gained or lost leadership in group 0 is
sometimes useful for the purpose of debugging, so we log
information about these events on the INFO level.
Gaining and losing leadership are relatively rare events, so
this change shouldn't flood the logs.
Closes #14877
For `removenode`, we make a removed node a non-voter early. There is no
downside to it because the node is already dead. Moreover, it improves
availability in some situations.
For `decommission`, if we decommission a node when the number of nodes
is even, we make it a non-voter early to improve availability. All
majorities containing this node will remain majorities when we make this
node a non-voter and remove it from the set because the required size of
a majority decreases.
We don't change `decommission` when the number of nodes is odd since
this may reduce availability.
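The availability argument can be checked with simple arithmetic (a strict majority of n voters is n // 2 + 1):

```python
def majority(n: int) -> int:
    """Size of a strict majority of n voters."""
    return n // 2 + 1

# even cluster: demoting one voter before decommission shrinks the
# required majority, so every old majority remains a majority
for n in range(2, 12, 2):
    assert majority(n - 1) < majority(n)   # e.g. 4 voters need 3, 3 need 2

# odd cluster: demoting early keeps the required majority size the same
# while shrinking the voter set, which can only reduce availability
for n in range(3, 13, 2):
    assert majority(n - 1) == majority(n)  # e.g. 5 voters need 3, 4 need 3
```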
Fixes #13959
Closes #14911
* github.com:scylladb/scylladb:
raft: make a decommissioning node a non-voter early
raft: topology_coordinator: implement step_down_as_nonvoter
raft: make a removed node a non-voter early
Rewrite the test that checks whether task_manager/wait_task works properly.
The old version didn't work. Delete the functions used by the old version.
Closes #14959
* github.com:scylladb/scylladb:
test: rewrite wait_task test
test: move ThreadWrapper to rest_util.py
When deleting multiple sstables with the same prefix,
the deletion atomicity is ensured by the pending_delete_log file,
so if scylla crashes in the middle, deletions will be replayed on
restart.
Therefore, we don't have to ensure atomicity of each individual
`unlink`. We just need to sync the directory once, before
removing the pending_delete_log file.
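The protocol can be sketched as follows (a Python stand-in for the C++ sstable deletion code; file names and log format are illustrative assumptions):

```python
import os

def delete_sstable_files(dir_path: str, names: list[str]) -> None:
    """Crash-safe multi-file deletion: the pending-delete log makes the
    whole batch replayable, so individual unlinks need no per-file sync;
    one directory sync before dropping the log is enough."""
    log = os.path.join(dir_path, "pending_delete.log")
    with open(log, "w") as f:                # 1. record the batch durably
        f.write("\n".join(names))
        f.flush()
        os.fsync(f.fileno())
    for name in names:                       # 2. unlink, no fsync per file
        try:
            os.unlink(os.path.join(dir_path, name))
        except FileNotFoundError:
            pass  # already gone: a replayed deletion is a no-op
    dirfd = os.open(dir_path, os.O_RDONLY)   # 3. sync the directory once
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
    os.unlink(log)                           # 4. batch durable, drop the log
```

If a crash happens before step 4, the surviving log names the files to delete again on restart; since unlinking an absent file is a no-op, replay is idempotent.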
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14967
This makes it possible to remove the remaining users of the global qctx.
The thing is that the db::schema_tables code needs to get wasm's engine, alien runner and instance cache to build a wasm context for the merged function, or to drop it from the cache in the opposite case. To get the wasm stuff, this code uses the global qctx -> query_processor -> wasm chain. However, the function (un)merging code already has the database reference at hand, and it's natural to get the wasm stuff from it, not from the q.p., which is not available.
So this PR packs the wasm engine, runner and cache into a sharded<wasm::manager> instance, makes the manager be referenced by both the q.p. and the database, and removes the qctx from the schema tables code.
Closes #14933
* github.com:scylladb/scylladb:
schema_tables: Stop using qctx
database: Add wasm::manager& dependency
main, cql_test_env, wasm: Start wasm::manager earlier
wasm: Shuffle context::context()
wasm: Add manager::remove()
wasm: Add manager::precompile()
wasm: Move stop() out of query_processor
wasm: Make wasm sharded<manager>
query_processor: Wrap wasm stuff in a struct
The metrics are registered on demand when the load balancer is invoked, so that only the leader exports the metrics. When the leader changes, the old leader stops exporting.
The metrics are divided into two levels: per-dc and per-node. In Prometheus, they will have appropriate labels for the dc and host_id values.
Closes #14962
* github.com:scylladb/scylladb:
tablet_allocator: unregister metrics when leadership is lost
tablets: load_balancer: Export metrics
service, raft: Move balance_tablets() to tablet_allocator
tablet_allocator: Start even if tablets feature is not enabled
main, storage_service: Pass tablet allocator to storage_service
Before the patch, tablet metadata update was processed on local schema merge
before table changes.
When a table is dropped, this means that for a while the table will exist
without a corresponding tablet map. This can cause memtable flush for
this table to fail, resulting in intentional abort(). That's because
sstable writing attempts to access tablet map to generate sharding
metadata.
If auto_snapshot is enabled, this is much more likely to happen,
because we flush memtables on table drop.
To fix the problem, process tablet metadata after dropping tables, but
before creating tables.
Fixes #14943
Closes #14954
There are two places in there that need qctx to get the query_processor
from and, in turn, get the wasm::manager. Fortunately, both places have the
database reference at hand and can get the wasm::manager from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The dependency is needed by db::schema_tables to get the wasm manager for
its needs. This patch prepares the ground. Now the wasm::manager is
shared between replica::database and cql3::query_processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed by replica::database and should be available that
early. It doesn't depend on anything and can be moved in the starting
order safely.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a constructor that builds a context out of a const manager reference.
The existing one needs to get the engine and instance cache and does it via
query_processor. This change lets us remove those exports and, finally,
drop the wasm::manager -> cql3::query_processor friendship.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is one of the users of query_processor's export of wasm::manager's
instance cache. Remove it in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the q.p. stops, it also "stops" the wasm manager. Move this call
into main. The cql test env doesn't need this change; it stops the whole
sharded service, which stops instances on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The wasm::manager is just cql3::wasm_context renamed. It now sits in
lang/wasm* and is started as a sharded service in main (and cql test
env). This move also needs some headers shuffling, but it's not severe.
This change is required to make it possible for the wasm::manager to be
shared (by reference) between the q.p. and replica::database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three wasm-only fields on the q.p. -- engine, cache and runner.
This patch groups them in a single wasm_context structure to make it
easier to manipulate them in the next patches.
The 'friend' declaration is temporary and will go away soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.
However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from this table didn't contain the `num_ranges`
parameter.
In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.
Fixes: #14944
Closes #14955
* github.com:scylladb/scylladb:
feature_service: don't load whole topology state to check features
system_keyspace: separate loading topology_features from topology
topology_state_machine: extract features-related fields to a struct
untyped_result_set: add missing_column_exception
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.
Fixes #14907
Also, use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Closes #14952
* github.com:scylladb/scylladb:
raft: group0_state_machine: transfer_snapshot: make abortable
raft: group0_state_machine: transfer_snapshot: hold gate
Currently, feature service uses `system_keyspace::load_topology_state`
to load information about features from the `system.topology` table.
This function implicitly assumes that it is called after schema
commitlog replay and will correspond to the state of the topology state
machine after some command is applied.
However, feature check happens before the commitlog replay. If some
group 0 command consists of multiple mutations that are not applied
atomically, the `load_topology_state` function may fail to construct a
`service::topology` object based on the table state. Moreover, this
function not only checks `system.topology` but also
`system.cdc_generations_v3` - in the case of the issue, the entry that
was loaded from this table didn't contain the `num_ranges`
parameter.
In order to fix this, the feature check code now uses
`load_topology_features_state` which only loads enabled and supported
features from `system.topology`. Only this information is really
necessary for the feature check, and it doesn't have any invariants to
check.
Fixes: #14944
Now, it is possible to load topology_features separately from the
topology struct. It will be used in the code that checks enabled
features on startup.
`enabled_features` and `supported_features` are now moved to a new
`topology::features` struct. This will allow loading this
information independently of the `topology` struct, which will be
needed for feature checking during startup.
Getting a reason argument in task_manager_module::get_progress is deceiving,
as the method works properly only for streaming::stream_reason::repair
(repair::shard_repair_task_impl::nr_ranges_finished isn't updated for
any other reason).
If some state update in database::add_column_family throws,
the info about the column family would be inconsistent.
Undo the already-performed operations in database::add_column_family
when one of them throws.
Fixes: #14666.
Closes #14672
* github.com:scylladb/scylladb:
replica: undo the changes if something fails
replica: start table earlier in database::add_column_family
All compaction task executors, except for the regular compaction one,
become task manager compaction tasks.
The creation and starting of major_compaction_task_executor are modified
to be consistent with the other compaction task executors.
Closes #14505
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to cover compaction group tasks
compaction: turn custom_task_executor into compaction_task_impl
compaction: turn sstables_task_executor into sstables_compaction_task_impl
compaction: change sstables compaction tasks type
compaction: move table_upgrade_sstables_compaction_task_impl
compaction: pass task_info through sstables compaction
compaction: turn offstrategy_compaction_task_executor into offstrategy_compaction_task_impl
compaction: turn cleanup_compaction_task_executor into cleanup_compaction_task_impl
comapction: use optional task info in major compaction
compaction: use perform_compaction in compaction_manager::perform_major_compaction
While describing materialized view, print `synchronous_updates` option
only if the tag is present in schema's extensions map. Previously if the
key wasn't present, the default (false) value was printed.
Fixes: #14924
Closes #14928
we wait for the same condition a couple of lines earlier, so there is
no need to check it again using `BOOST_CHECK_EQUAL()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14921
Currently, when one tries to access a column that an untyped_result_set
does not contain, a `std::bad_variant_access` exception is thrown. This
exception's message provides very little context and it can be difficult
to even figure out where this message is coming from.
In order to improve the situation, a new exception `missing_column` is
introduced which includes the missing column's name in its error
message. The exception derives from `std::bad_variant_access` for
compatibility with existing code that may want to catch it.
`boost::program_options::value()` creates a new typed_value<T> object
without holding it with a shared_ptr. boost::program_options expects the
developer to construct a `bpo::option_description` right away from it,
and `boost::program_options::option_description` takes ownership
of the `typed_value<T>*` raw pointer, managing its life cycle with
a shared_ptr. but before being passed to a `bpo::option_description`,
the pointer created by `boost::program_options::value()` is still
a raw pointer.
before this change, we initialized positional options as global
variables using `boost::program_options::value()`. but unfortunately,
we don't always initialize a `bpo::option_description` from them --
we only do this on demand when the corresponding subcommand is
called.
so, if the corresponding subcommand is not called, the created
`typed_value<T>` objects are leaked, hence LeakSanitizer warns us.
after this change, we create the option vector as a static
local variable in a function, so it is created on demand as well.
as an alternative, we could initialize the options vector as a local
variable where it is used. but to be more consistent with how
`global_option` is specified, and to colocate them in a single
place, let's keep the existing code layout.
Fixes #14929
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14939
Commit bc5f6cf45d
added a reserve call to the `ranges` vector before
inserting all the returned token ranges into it.
However, that reservation is too small as we need
to express size+1 ranges for size tokens with
<unbound, token[0]> and <token[size-1], unbound>
ranges at the front and back, respectively.
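The size+1 requirement is easy to see in a sketch (hypothetical helper; `None` stands for an unbound endpoint):

```python
def token_ranges(tokens: list[int]):
    """Split the keyspace into ranges bounded by the sorted tokens,
    including the open-ended <unbound, token[0]> and <token[-1], unbound>
    ranges at the front and back, respectively."""
    ranges = []  # correct reservation would be len(tokens) + 1 entries
    prev = None
    for t in sorted(tokens):
        ranges.append((prev, t))
        prev = t
    ranges.append((prev, None))  # the extra range the reserve missed
    return ranges

r = token_ranges([10, 20, 30])
assert len(r) == 4                       # size + 1 ranges for size tokens
assert r[0] == (None, 10) and r[-1] == (30, None)
```

Reserving only `size` entries means the final append can reallocate; the fix is purely a capacity correction, not a behavior change.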
Fixes #14849
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14938
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Hold a (newly added) group0_state_machine gate
that is closed and waited on in group0_state_machine::abort()
To prevent use-after-free when destroying the group0_state_machine
while transfer_snapshot runs.
Fixes #14907
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds the ranges_parallelism option to the repair RESTful API.
Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, smaller than the default max_repair_ranges_in_parallel calculated by the Scylla core.
Scylla manager can also use this option to provide more ranges (>N) in a single repair job while repairing only ranges_parallelism ranges in parallel, instead of providing N ranges per repair job.
To make it safer, unlike PR #4848, this patch does not allow users to exceed max_repair_ranges_in_parallel.
Fixes #4847
Closes #14886
* github.com:scylladb/scylladb:
repair: Add ranges_parallelism option
repair: Change to use coroutine in do_repair_ranges
before this change, if the object_store test fails, the tempdir
will be preserved. and if our CI test pipeline is used to perform
the test, the test job would scan for the artifacts; if the
test in question fails, it would take over 1 hour to scan the tempdir.
to alleviate the pain, let's keep just the scylla log file
whether the test fails or succeeds, so that jenkins can scan the
artifacts faster if the test fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14880
Per-table metrics are very valuable for users, but they come with a high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-table metrics that will be reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes
The names of the read/write metrics are based on the histogram convention, so when latency histograms are added, the names will not change.
The metrics are labeled with a specific label __per_table="node" so it will be possible to easily manipulate them.
The metrics will be available even when enable_metrics_reporting (the per-table full metrics flag) is off.
Fixes #2198
Closes #13293
* github.com:scylladb/scylladb:
replica/table.cc: Add node-per-table metrics
config: add enable_node_table_metrics flag
If we decommission a node when the number of nodes is even, we
make it a non-voter early to improve availability. All majorities
containing this node will remain majorities when we make this node
a non-voter and remove it from the set because the required size
of a majority decreases.
We move the logic that makes topology_coordinator a non-voter to
a separate function called step_down_as_nonvoter to avoid code
duplication. We use this function in the next commit.
For removenode, we make a removed node a non-voter early. There is
no downside to it because the node is already dead. Moreover, it
improves availability in some situations. Consider a 4-node
cluster with one dead node. If we make the dead node a non-voter
at the beginning of removenode, group 0 will survive the death
of another node in the middle of removenode.
This series cleans up and hardens the endpoint locking design and
implementation in the gossiper and endpoint-state subscribers.
We make sure that all notifications (except for `before_change`, which
apparently can be dropped) are called under lock_endpoint, as well as
all calls to gossiper::replicate, to serialize endpoint_state changes
across all shards.
An endpoint lock gets a unique permit_id that is passed to the
notifications and passed back by them if the notification functions call
the gossiper back for the same endpoint on paths that modify the
endpoint_state and may acquire the same endpoint lock - to prevent a
deadlock.
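The permit scheme can be sketched like this (a much-simplified, single-threaded Python stand-in; the real gossiper lock is an async per-endpoint lock, and class/method names here are illustrative):

```python
import uuid

class EndpointLock:
    """Sketch of the gossiper's permit scheme: lock() hands out a unique
    permit_id; a re-entrant call that presents the current permit is not
    blocked, so notification callbacks cannot self-deadlock."""
    def __init__(self):
        self._holder = None  # permit_id of the current owner, or None

    def lock(self, permit_id=None):
        if permit_id is not None and permit_id == self._holder:
            return permit_id  # same call chain already holds the lock
        # in the real code this would wait; the sketch just forbids it
        assert self._holder is None, "lock already held by another chain"
        self._holder = uuid.uuid4()
        return self._holder

    def unlock(self, permit_id):
        assert permit_id == self._holder
        self._holder = None

lk = EndpointLock()
permit = lk.lock()
# a notification calls back into the gossiper with the same permit:
assert lk.lock(permit) == permit   # no re-acquisition, no deadlock
lk.unlock(permit)
```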
Fixes scylladb/scylladb#14838
Refs scylladb/scylladb#14471
Closes #14845
* github.com:scylladb/scylladb:
gossiper: replicate: ensure non-null permit
gossiper: add_saved_endpoint: lock_endpoint
gossiper: mark_as_shutdown: lock_endpoint
gossiper: real_mark_alive: lock_endpoint
gossiper: advertise_token_removed: lock_endpoint
gossiper: do_status_check: lock_endpoint
gossiper: remove_endpoint: lock_endpoint if needed
gossiper: force_remove_endpoint: lock_endpoint if needed
storage_service: lock_endpoint when removing node
gossiper: use permit_id to serialize state changes while preventing deadlocks
gossiper: lock_endpoint: add debug messages
utils: UUID: make default tagged_uuid ctor constexpr
gossiper: lock_endpoint must be called on shard 0
gossiper: replicate: simplify interface
gossiper: mark_as_shutdown: make private
gossiper: convict: make private
gossiper: mark_as_shutdown: do not call convict
This PR implements the functionality of the raft-based cluster features
needed to safely manage and enable cluster features, according to the
cluster features on raft design doc.
Enabling features is a two phase process, performed by the topology
coordinator when it notices that there are no topology changes in
progress and there are some not-yet enabled features that are declared
to be supported by all nodes:
1. First, a global barrier is performed to make sure that all nodes saw
and persisted the same state of the `system.topology` table as the
coordinator and see the same supported features of all nodes. When
booting, nodes are now forbidden to revoke support for a feature if all
nodes declare support for it, so a successful barrier makes sure that
no node will restart and disable the features.
2. After a successful barrier, the features are marked as enabled in the
`system.topology` table.
The whole procedure is a group 0 operation and fails if the topology
table is modified in the meantime (e.g. some node changes its supported
features set).
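The decision rule can be modeled in miniature (all names below are illustrative; the real implementation operates on `system.topology` under a group 0 guard, and phase 1 is a global barrier rather than a local check): a feature is enabled only if every node supports it, and the whole step is a no-op if the state version changed underneath the guard.

```cpp
#include <set>
#include <string>
#include <vector>

struct node { std::set<std::string> supported; };

struct topology {
    std::vector<node> nodes;
    long version = 0;                // bumped on every topology change
    std::set<std::string> enabled;
};

// Returns false (no-op) if the guard's version is stale, mimicking a
// group0_concurrent_modification, or if some node lacks support.
inline bool try_enable(topology& t, long guard_version, const std::string& f) {
    if (guard_version != t.version) {
        return false;                // table modified in the meantime
    }
    for (const auto& n : t.nodes) {
        if (!n.supported.count(f)) {
            return false;            // not yet supported by all nodes
        }
    }
    // phase 1 (global barrier) elided; phase 2: mark enabled
    t.enabled.insert(f);
    ++t.version;
    return true;
}
```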
For now, the implementation relies on a gossip shadow round check to
protect against nodes without all features joining the cluster. In a
followup, a new joining procedure will be implemented which involves the
topology coordinator and lets it verify joining node's cluster features
before the new node is added to group 0 and to the cluster.
A set of tests for the new implementation is introduced, containing the
same tests as for the non-raft-based cluster feature implementation plus
one additional test, specific to this implementation.
Closes #14722
* github.com:scylladb/scylladb:
test: topology_experimental_raft: cluster feature tests
test: topology: fix a skipped test
storage_service: add injection to prevent enabling features
storage_service: initialize enabled features from first node
topology_state_machine: add size(), is_empty()
group0_state_machine: enable features when applying cmds/snapshots
persistent_feature_enabler: attach to gossip only if not using raft
feature_service: enable and check raft cluster features on startup
storage_service: provide raft_topology_change_enabled flag from outside
storage_service: enable features in topology coordinator
storage_service: add barrier_after_feature_update
topology_coordinator: exec_global_command: make it optional to retake the guard
topology_state_machine: add calculate_not_yet_enabled_features
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes #12010
Closes #14892
* github.com:scylladb/scylladb:
distributed_loader: process_sstable_dir: do not verify snapshots
utils/directories: verify_owner_and_mode: add recursive flag
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed, except for "test/lib/scylla_test_case.hh"
as it brings some command line options used by scylla tests.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14922
get_compacted_fragments_writer() returns an instance of
`compacted_fragments_writer`, so there is no need to cast it again.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14919
Per-table metrics are very valuable for users, but they come with a
high load on both the reporting and the collecting metrics systems.
This patch adds a small subset of per-table metrics that will be
reported on the node level.
The list of metrics is:
system_column_family_memtable_switch - Number of times flush has
resulted in the memtable being switched out
system_column_family_memtable_partition_writes - Number of write
operations performed on partitions in memtables
system_column_family_memtable_partition_hits - Number of times a write
operation was issued on an existing partition in memtables
system_column_family_memtable_row_writes - Number of row writes
performed in memtables
system_column_family_memtable_row_hits - Number of rows overwritten by
write operations in memtables
system_column_family_total_disk_space - Total disk space used
system_column_family_live_sstable - Live sstable count
system_column_family_read_latency_count - Number of reads
system_column_family_write_latency_count - Number of writes
The names of the read/write metrics are based on the histogram convention,
so when latency histograms are added, the names will not change.
The metrics are labeled with a specific label __per_table="node" so it
will be possible to easily manipulate them.
The metrics will be available when enable_metrics_reporting (the
per-table full metrics flag) is off and enable_node_table_metrics is
true.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.
There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.
This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.
For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0', per-table metrics will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
add fmt formatter for `compaction_task_executor::state` and
`compaction_task_executor` and its derived classes.
this is part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `compaction_task_executor`, its derived classes and
`compaction_task_executor::state` without the help of `operator<<`.
since all of the callers of 'operator<<' of these types now use
formatters, the operator<< overloads are removed in this change. the helpers
like `to_string()` and `describe()` are removed as well, as it'd
be more consistent if we always use fmtlib for formatting instead
of inventing APIs with different names.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14906
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.
This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.
Fixes scylladb/scylladb#14896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14897
Although the implementation of cluster features on raft is not complete
yet, it makes sense to add some tests for the existing implementation.
The `test_raft_cluster_features.py` file includes the same set of tests
as the file with non-raft-based cluster feature tests, plus one
additional test which checks that a node will not allow disabling a
feature if it sees that other nodes support it (even though the feature
is not enabled yet).
The `test_partial_upgrade_can_be_finished_with_removenode` test does not
work because the `cql` variable is used before it is declared. It was
not noticed because the test is marked as skipped, and does not work for
the non-raft cluster feature implementation. The variable declaration is
moved higher and the test now works; it will be used to test the raft
cluster feature implementation.
The first node in the cluster defines it and it does not need to consult
with anybody whether its features should be enabled or not. We can
immediately mark those features as enabled in raft when the first node
inserts its join request to the topology table.
The enable_features_on_join function is now only called if the node does
not use topology over raft, and so the node will not react to changes in
gossip features.
In the future, support for switching to topology coordinator in runtime
will be added and the persistent feature enabler should disconnect
itself during the upgrade procedure. We don't have such procedure yet,
so a bunch of TODOs is added instead.
The enable_features_on_startup method is adjusted for the raft-based
cluster features. In topology coordinator mode:
- Information about enabled features is taken from system.topology
instead of the usual system.scylla_local (`enabled_features` key).
- Features which, according to the local state, are supported by all
nodes but not enabled yet are also checked. Support for such features
cannot be revoked safely because the topology coordinator might have
performed a successful global barrier and might have proceeded with
marking the feature as enabled.
Information about whether we are using topology changes on raft or not
will soon be necessary for the persistent feature enabler, so that it
can do some additional checks based on the local raft topology state.
If the topology coordinator notices that there are no nodes requesting
to be joined, no topology operations in progress and there are some
features that are declared to be supported by all normal nodes but not
enabled yet, the topology coordinator will attempt to enable those
features. This is done in the following way, under a group 0 guard:
- A global `barrier_after_feature_update` is performed to make sure
that:
- All nodes have already updated their supported_features column after
boot and won't attempt to revoke any during the current runtime,
- All nodes saw and persisted the latest topology state so that, after restart,
the feature check won't allow them to revoke support for features
that the topology coordinator is going to enable.
- After the barrier succeeds, the coordinator tries to add the features
to the `enabled_features` column.
Ensure that replicate is called under lock_endpoint
to serialize endpoint state changes on all shards.
Otherwise, we may end up with inconsistent state
across shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state
and calls replicate and mark_dead, therefore it
must ensure this is done under lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates internal state on shard 0
and calls subscribers async callbacks so we should
lock the endpoint to serialize state changes on it.
With that, call get_endpoint_state_for_endpoint_ptr after
locking the endpoint in real_mark_alive, not
before calling it, in the background continuation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state
and calls replicate, therefore it
must ensure this is done under lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function manipulates the endpoint state by
calling remove_endpoint and evict_from_membership
(and possibly yielding in-between), so it should
serialize the state change with lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass permit_id to subscribers when we acquire one
via lock_endpoint. The subscribers then pass it back to
gossiper for paths that acquire lock_endpoint for
the same endpoint, to detect nested locks when the endpoint
is locked with the same permit_id.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adds a variant of the existing `barrier` topology command which requires
all participating nodes to confirm that they have updated their supported
features after boot and won't remove any features until restart. A
successful global barrier of this type gives the topology coordinator a
guarantee that it can safely enable features that were supported by all
nodes at the moment of the barrier.
Currently, exec_global_command takes a group 0 guard, drops it and
retakes it after the command is finished. For current uses this is fine
from the correctness point of view and, given that an operation can
take a long time, shorter duration of the guard improves the odds of the
operation succeeding.
However, this is not sufficient for cluster features because they will
need to execute a global barrier under the group 0 guard.
This commit modifies the interface of `exec_global_command` so that
dropping and retaking the guard is optional (the default is to retake
it).
Adds a function which calculates a set of features that are supported by
all normal nodes but are not enabled yet - according to the state of the
topology state machine.
In this PR we add proper fencing handling to the `counter_mutation` verb.
As for regular mutations, we do the check twice in `handle_counter_mutation`, before and after applying the mutations. The latter is important in case the fence was moved while we were handling the request - some post-fence actions might have already happened at this time, so we can't treat the request as successful. For example, if the topology change coordinator was switching to `write_both_read_new`, streaming might have already started and missed this update.
In `mutate_counters` we can use a single `fencing_token` for all leaders, since all the erms are processed without yields and should underneath share the same `token_metadata`.
We don't pass fencing token for replication explicitly in `replicate_counter_from_leader` since `mutate_counter_on_leader_and_replicate` doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current `token_metadata` version for outgoing requests.
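The double check around applying a mutation can be sketched like this (a toy model; the type and member names are illustrative, not Scylla's actual signatures): the token carries the topology version the coordinator saw, the replica rejects stale tokens before applying, and re-checks after, since the fence may have moved mid-request.

```cpp
#include <stdexcept>

struct fencing_token { long topology_version; };

struct counter_replica {
    long current_version = 1;   // version the fence currently stands at
    bool applied = false;

    void check_fence(fencing_token t) const {
        if (t.topology_version < current_version) {
            throw std::runtime_error("stale topology: request fenced");
        }
    }
    void apply_counter_mutation(fencing_token t) {
        check_fence(t);   // reject requests sent under an old topology
        applied = true;   // ... apply the mutation ...
        check_fence(t);   // post-check: can't report success if the
                          // fence moved while we were applying
    }
    bool try_apply(fencing_token t) {
        try { apply_counter_mutation(t); return true; }
        catch (const std::runtime_error&) { return false; }
    }
};
```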
Closes #14564
* github.com:scylladb/scylladb:
counter_mutation: add fencing
encode_replica_exception_for_rpc: handle the case when result type is a single exception_variant
counter_mutation: add replica::exception_variant to signature
We add the CDC generation optimality check in
`storage_service::raft_check_and_repair_cdc_streams` so that it doesn't
create new generations when unnecessary. Since
`generation_service::check_and_repair_cdc_streams` already has this
check, we extract it to the new `is_cdc_generation_optimal` function to
not duplicate the code.
After this change, multiple tasks could wait for a single generation
change. Calling `signal` on `topology_state_machine.event` wouldn't wake
them all. Moreover, we must ensure the topology coordinator wakes when
its logic expects it. Therefore, we change all `signal` calls on
`topology_state_machine.event` to `broadcast`.
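The difference matters exactly when several tasks wait on the same event. A single-threaded toy model of the condition-variable semantics (simplified, not Scylla's event type):

```cpp
#include <functional>
#include <utility>
#include <vector>

struct cond_event {
    std::vector<std::function<void()>> waiters;

    void signal() {                    // wakes at most one waiter
        if (!waiters.empty()) {
            auto w = std::move(waiters.back());
            waiters.pop_back();
            w();
        }
    }
    void broadcast() {                 // wakes every waiter
        auto ws = std::move(waiters);
        waiters.clear();
        for (auto& w : ws) {
            w();
        }
    }
};
```

With three tasks waiting on one generation change, `signal` leaves two of them stuck; `broadcast` wakes all of them.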
We delay the deletion of the `new_cdc_generation` request to the moment
when the topology transition reaches the `publish_cdc_generation` state.
We need this change to ensure the added CDC generation optimality check
in the next commit has an intended effect. If we didn't make it, it
would be possible that a task makes the `new_cdc_generation` request,
and then, after this request was removed but before committing the new
generation, another task also makes the `new_cdc_generation` request. In
such a scenario, two generations are created, but only one should. After
delaying the deletion of `new_cdc_generation` requests, the second
request would have no effect.
Additionally, we modify the `test_topology_ops.py` test in a way that
verifies the new changes. We call
`storage_service::raft_check_and_repair_cdc_streams` multiple times
concurrently and verify that exactly one generation has been created.
Fixes #14055
Closes #14789
* github.com:scylladb/scylladb:
storage_service: raft_check_and_repair_cdc_streams: don't create a new generation if current one is optimal
storage_service: delay deletion of the new_cdc_generation request
raft topology: broadcast on topology_state_machine.event instead of signal
cdc: implement the is_cdc_generation_optimal function
The `migration_manager` service is responsible for schema convergence in
the cluster - pushing schema changes to other nodes and pulling schema
when a version mismatch is observed. However, there is also a part of
`migration_manager` that doesn't really belong there - creating
mutations for schema updates. These are the functions with `prepare_`
prefix. They don't modify any state and don't exchange any messages.
They only need to read the local database.
We take these functions out of `migration_manager` and make them
separate functions to reduce the dependency of other modules (especially
`query_processor` and CQL statements) on `migration_manager`. Since all
of these functions only need access to `storage_proxy` (or even only
`replica::database`), doing such a refactor is not complicated. We just
have to add one parameter, either `storage_proxy` or `database` and both
of them are easily accessible in the places where these functions are
called.
This refactor makes `migration_manager` unneeded in a few functions:
- `alternator::executor::create_keyspace`,
- `cql3::statements::alter_type_statement::prepare_announcement_mutations`,
- `cql3::statements::schema_altering_statement::prepare_schema_mutations`,
- `cql3::query_processor::execute_thrift_schema_command`,
- `thrift::handler::execute_schema_command`.
We remove the `migration_manager&` parameter from all these functions.
Fixes #14339
Closes #14875
* github.com:scylladb/scylladb:
cql3: query_processor::execute_thrift_schema_command: remove an unused parameter
cql3: schema_altering_statement::prepare_schema_mutations: remove an unused parameter
cql3: alter_type_statement::prepare_announcement_mutations: change parameters
alternator: executor::create_keyspace: remove an unused parameter
service: migration_manager: change the prepare_ methods to functions
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
query_processor::execute_thrift_schema_command and
thrift::handler::execute_schema_command (that calls
query_processor::execute_thrift_schema_command) has been unused.
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
schema_altering_statement::prepare_schema_mutations has been
unused by all classes inheriting from schema_altering_statement.
After changing the prepare_ methods of migration_manager to
functions, the migration_manager& parameter of
alter_type_statement::prepare_announcement_mutations has become
unneeded. However, the function needs access to
service::storage_proxy and data_dictionary::database. Passing
storage_proxy& to it is enough.
This patch adds the ranges_parallelism option to the repair RESTful API.
Users can use this option to optionally specify the number of ranges
to repair in parallel per repair job to a smaller number than the Scylla
core calculated default max_repair_ranges_in_parallel.
Scylla manager can also use this option to provide more ranges (>N) in
a single repair job but only repairing N ranges_parallelism in parallel,
instead of providing N ranges in a repair job.
To make it safer, unlike the PR #4848, this patch does not allow users to
exceed the max_repair_ranges_in_parallel.
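The clamping rule described above can be sketched as follows (the function name is illustrative; the behavior matches the text: the user value is honored only up to the core-calculated maximum):

```cpp
#include <algorithm>
#include <optional>

// Effective per-job parallelism: the user-supplied ranges_parallelism
// if given, but never above max_repair_ranges_in_parallel.
inline unsigned effective_ranges_parallelism(std::optional<unsigned> user_opt,
                                             unsigned max_in_parallel) {
    if (!user_opt) {
        return max_in_parallel;          // default: the computed maximum
    }
    return std::min(*user_opt, max_in_parallel);
}
```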
Fixes #4847
Those without `_as` suffix are just marked non-static
The `..._as` ones are made class methods (now they are local to system_keyspace.cc)
After that the `..._as` ones are patched to use `this->` instead of `qctx`
Closes #14890
* github.com:scylladb/scylladb:
system_keyspace: Stop using qctx in [gs]et_scylla_local_param_as()
system_keyspace: Reuse container() and _db member for flushing
system_keyspace: Make [gs]et_scylla_local_param_as() class methods
system_keyspace: De-static [gs]et_scylla_local_param()
Currently, operation-options are declared in a single global list, then
operations refer to the options they support via name. This system was
born at a time when scylla-sstable had a lot of shared options between
its operations, so it was desirable to declare them centrally and only
add references to individual operations, to reduce duplication.
However, as the dust settled, only 2 options are shared by 2 operations
each. This is a very low benefit. Up to now the cost was also very low
-- shared options meant the same in all operations that used them.
However this is about to change, and this system becomes very awkward to
use as soon as multiple operations want to have an option with the same
name but slightly (or very) different meaning/semantics.
So this patch moves the options to the operations themselves.
Each will declare the list of options it supports, without having to
reference some common list.
This also removes an entire (although very uncommon) class of bugs:
an option name referring to a nonexistent option.
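The new layout can be sketched like this (type and field names are illustrative, not scylla-sstable's actual declarations): each operation owns its option list, so two operations may define same-named options with different descriptions and semantics, and there is no shared registry to mis-reference.

```cpp
#include <string>
#include <vector>

struct option {
    std::string name;
    std::string description;
};

struct operation {
    std::string name;
    std::vector<option> options; // declared locally, no global list
};

// Look up an option declared by this operation only.
inline const option* find_option(const operation& op, const std::string& n) {
    for (const auto& o : op.options) {
        if (o.name == n) {
            return &o;
        }
    }
    return nullptr;
}
```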
Closes #14898
Keep the endpoint address and the caller function name
around and print them in the different lock life cycle
state changes.
While at it, coroutinize gossiper::lock_endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We can't lock an endpoint on arbitrary shards
since collisions will not be detected this way.
Assert that, and while at it, make the method private
as it is only used internally by the class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before making further changes to the endpoint_state_map
implementation, simplify `replicate` by providing only
one variant, replicating complete endpoint_state across
shards, instead of applying finer resolution changes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a `topology_change` command is applied, the topology state is
reloaded and `cdc::generation_service::handle_cdc_generation` is called.
This creates a dependency of group0 on `cdc::generation_service`.
Currently, the group0 server is stopped during `raft_group_registry`
shutdown. However, it is called after `cdc::generation_service`
shutdown, which can result in a segfault.
To prevent this issue, this commit stops the group0 server and removes it
from `raft_group_registry` during `group0_service` shutdown.
Fixes#14397.
Closes #14779
Reproducer:
97d6946e31
It creates two nodes. The second one is forced to stop after joining
group0. It sleeps before calling handle_cdc_generation and sleeps just
before raft_group_registry is stopped. It ensures that
handle_cdc_generation wakes up after starting the second sleep. If the
cdc_generation_service shutdown waits for raft_group_registry to stop,
handle_cdc_generation will be called without any issue. Otherwise, it
will crash since cdc_generation_service won't exist. The test passes
always. If the crash happens it can be seen in the log file of the
second node.
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.
The migration track is interrupted by pending topology change operations.
The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
Overload of shards is limited by the fact that the load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.
Closes #14851
* github.com:scylladb/scylladb:
tablets: load_balancer: Remove double logging
tests: tablets: Check that load balancing is interrupted by topology change
tests: tablets: Add test for load balancing with active migrations
tablets: Balance tablets concurrently with active migrations
storage_service, tablets: Extract generate_migration_updates()
storage_service, tablets: Move get_leaving_replica() to tablets.cc
locator: tablets: Move std::hash definition earlier
storage_service: Advance tablets independently
topology_coordinator: Fix missed notification on abort
tablets: Add formatter for tablet_migration_info
Now those methods are non-static and can start using the object's
reference to the query processor instead of the global qctx thing
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_scylla_local_param_as() wants to flush replica::database on all
shards. For that it uses smp::invoke_on_all() and qctx, but since the
method is now a non-static one of system_keyspace, it can enjoy using
container().invoke_on_all() and this->_db (on the target shard)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These are now two .cc-local templatized helpers, but they are only
called by system_keyspace:: non-static methods, so they can be such as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All same-class callers are now non-static methods of system_keyspace,
all external callers do it via an object at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Skip over verification of owner and mode of the snapshots
sub-directory as this might race with scylla-manager
trying to delete old snapshots concurrently.
Fixes#12010
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow the caller to verify only the top level directories
so that sub-directories can be verified selectively
(in particular, skip validation of snapshots).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.
Fixes: #13290
Closes #13349
* github.com:scylladb/scylladb:
replica: make tables_metadata's attributes private
replica: add methods to get a filtered copy of tables map
replica: add methods to check if given table exists
replica: add methods to get table or table id
replica: api: return table_id instead of const table_id&
replica: iterate safely over tables related maps
replica: pass tables_metadata to phased_barrier_top_10_counts
replica: add methods to safely add and remove table
replica: wrap column families related maps into tables_metadata
replica: futurize database::add_column_family and database::remove
There are three methods in the system_keyspace namespace that run queries over the `system.scylla_table_schema_history` table. For that they use qctx, which is not nice.
Fortunately, all the callers already have a system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to the system keyspace, the latter declares the querying methods as "friends" to let them access the private `query_processor& _qp` member
Closes #14876
* github.com:scylladb/scylladb:
schema_tables: Extract query_processor from system_keyspace for querying
schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
migration_manager: Add system_keyspace argument to get_schema_mapping()
It is possible that the topology will contain nodes that are no longer normal token owners, so they don't need to be sync'ed with.
Fixes scylladb/scylladb#14793
Closes #14798
* github.com:scylladb/scylladb:
storage_service: refresh_sync_nodes: restrict to reachable token owners
storage_service: refresh_sync_nodes: fix log message
locator: topology: node::state: make fine grained
It is possible that the topology would contain nodes that no longer
own tokens or that are unreachable, so they can't be sync'ed with.
Restrict the list to nodes in a normal or being_decommissioned state.
Fixes scylladb/scylladb#14793
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, after triggering the compaction,
compaction_manager_basic_test waits until the triggered compaction
completes. but the regular compaction is run in a loop which
does not stop until the daemon is stopping, there are no
more sstables to be compacted, or the compaction is disabled.
also, we only get the input sstables for compaction after switching
to the "pending" state and acquiring the read lock of the
compaction_state. since acquiring the read lock is implemented as
a coroutine, there is a chance that the coroutine is suspended
and the execution switches to the test. in this case, the test
will find that even after the triggered compaction completes,
there are still one or more pending compactions, hence the test
fails.
to address this problem, instead of just waiting for the compaction
to complete, we also wait until the number of pending compaction tasks
is 0, so that even if the test manages to sneak into the time window,
it won't proceed to check the compaction manager's stats.
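the test-side fix boils down to a bounded polling loop (a sketch; the function name is made up, and the real test would yield to the reactor between polls rather than spin):

```cpp
#include <functional>

// Poll the pending-compaction counter until it reaches zero, giving up
// after max_polls attempts (standing in for the test's timeout).
inline bool wait_for_no_pending(std::function<int()> pending, int max_polls) {
    for (int i = 0; i < max_polls; ++i) {
        if (pending() == 0) {
            return true;
        }
        // in the real test we would sleep/yield here before re-checking
    }
    return false;
}
```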
Fixes#14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14889
Add calls to `maybe_yield` in the per-range loops to prevent stalls
if the loop never yields.
Note: originally the stalls were detected in nested calls
to `query_partition_key_range_concurrent` (see #14008).
This series turned the tail-recursion into iteration,
but still the inner loop(s) never yield and do quite
a lot of computations - so they might stall when called
with a large number of ranges.
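The shape of the fix can be modeled with a yield counter (a toy stand-in for Seastar's `maybe_yield`; the types here are illustrative):

```cpp
#include <vector>

struct scheduler {
    int yields = 0;
    void maybe_yield() { ++yields; } // stand-in for a preemption point
};

// Per-range loop with a yield point on every iteration, so a large
// range count can no longer stall the shard.
inline long process_ranges(scheduler& s, const std::vector<int>& ranges) {
    long total = 0;
    for (int r : ranges) {
        total += r;      // stand-in for the per-range work
        s.maybe_yield(); // give the reactor a chance to preempt us
    }
    return total;
}
```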
Fixes#14008
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
instead of using a loop of std::swap(), let's use std::shift_left()
when appropriate. simpler and more readable this way.
moreover, the pattern of looking for a command and consume it from
the command line resembles what we have in main(), so let's use
similar logic to handle both of them. probably we can consolidate
them in the future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14888
this change changes `const` to `constexpr`, because the string literal
defined here is not only immutable but also initialized at
compile time, and can be used by constexpr expressions and functions.
this change is introduced to reduce the size of the change when moving
to compile-time format string in future. so far, seastar::format() does
not use the compile-time format string, but we have patches pending on
review implementing this. and the author of this change has local
branches implementing the changes on scylla side to support compile-time
format string, which practically replaces most of the `format()` calls
with `seastar::format()`.
to reduce the size of the change and the pain of rebasing, some of the
less controversial changes are extracted and upstreamed. this is one
of them.
this change also addresses the following compilation failure:
```
/home/kefu/dev/scylladb/tools/scylla-sstable.cc:2836:44: error: call to consteval function 'fmt::basic_format_string<char, const char *const &, seastar::basic_sstring<char, unsigned int, 15>>::basic_format_string<const char *, 0>' is not a constant expression
2836 | .description = seastar::format(description_template, app_name, boost::algorithm::join(operations | boost::adaptors::transformed([] (const auto& op) {
| ^
/usr/include/fmt/core.h:3148:67: note: read of non-constexpr variable 'description_template' is not allowed in a constant expression
3148 | FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
| ^
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14887
DynamoDB limits the length of all expressions (ConditionExpression, UpdateExpression,
ProjectionExpression, FilterExpression, KeyConditionExpression) to just
4096 bytes. Until now, Alternator did not enforce this limit, and we had
an xfailing test showing this.
But it turns out that not enforcing this limit can be dangerous: The user
can pass arbitrarily-long and arbitrarily nested expressions, such as:
a<b and (a<b and (a<b and (a<b and (a<b and (a<b and (...))))))
or
(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
and those can cause recursive algorithms in Alternator's parser and
later when applying expressions to recurse very deeply, overflow the
stack, and crash.
This patch includes new tests that demonstrate how Scylla crashes during
parsing before enforcing the 4096-byte length limit on expressions.
The patch then enforces this length limit, and these tests stop crashing.
We also verify that deeply-nested expressions shorter than the 4096-byte
limit are apparently short enough for our recursion ability, and work
as expected.
Unfortunately, running these tests many times showed that the 4096-byte
limit is not low enough to avoid all crashes so this patch needs to do
more:
The parsers created by ANTLR are recursive, and there is no way to limit
the depth of their recursion (i.e., nothing like YACC's YYMAXDEPTH).
Very deep recursion can overflow the stack and crash Scylla. After we
limited the length of expression strings to 4096 bytes this was *almost*
enough to prevent stack overflows. But unfortunately the tests revealed
that even limited to 4096 bytes, the expression can sometimes recurse
too deeply: Consider the expression "((((((....((((" with 4000 parentheses.
To realize this is a syntax error, the parser needs to do a recursive
call 4000 times. Or worse - because of other Antlr limitations (see rants
in comments in expressions.g) it's actually 12000 recursive calls, and
each of these calls has a pretty large frame. In some cases, this
overflows the stack.
The solution used in this patch is not pretty, but works. We add to rules
in alternator/expressions.g that recurse (there are two of those - "value"
and "boolean_expression") an integer "depth" parameter, which we increase
when the rule recurses. Moreover, we add a so-called predicate
"{depth<MAX_DEPTH}?" that stops the parsing when this limit is reached.
When the parsing is stopped, the user will see a special kind of parse
error, saying "expression nested too deeply".
With this last modification to expressions.g, the tests for deeply-nested but
still-below-4096-bytes expressions
(test_limits.py::test_deeply_nested_expression_*) would not fail sporadically
as they did without it.
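The combination of the 4096-byte length limit and the depth predicate can be illustrated with a minimal hand-rolled sketch (Python, not the actual ANTLR-generated code; `MAX_DEPTH`'s concrete value here is an assumption, the real one lives in alternator/expressions.g):

```python
MAX_EXPRESSION_LENGTH = 4096  # DynamoDB's documented limit
MAX_DEPTH = 400               # hypothetical cap, like the grammar's MAX_DEPTH

class ExpressionError(Exception):
    pass

def parse_value(expr: str, pos: int = 0, depth: int = 0) -> int:
    """Parse 'x' or '(<value>)'; return the position after the value."""
    if len(expr) > MAX_EXPRESSION_LENGTH:
        raise ExpressionError('expression exceeds 4096 characters')
    if depth >= MAX_DEPTH:
        # the "{depth < MAX_DEPTH}?" predicate: stop recursing and report
        # a special parse error instead of overflowing the stack
        raise ExpressionError('expression nested too deeply')
    if pos < len(expr) and expr[pos] == '(':
        end = parse_value(expr, pos + 1, depth + 1)  # recursion bumps depth
        if end >= len(expr) or expr[end] != ')':
            raise ExpressionError(f'syntax error at position {end}')
        return end + 1
    if pos < len(expr) and expr[pos] == 'x':
        return pos + 1
    raise ExpressionError(f'syntax error at position {pos}')
```

An input like `"((((((....(((("` now fails with a clean "expression nested too deeply" error (and its position) once the depth counter trips, rather than recursing until the stack is exhausted.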
While adding the "expression nested too deeply" case, I also made the
general syntax-error reporting in Alternator nicer: It no longer prints
the internal "expression_syntax_error" type name (an exception type will
only be printed if some sort of unexpected exception happens), and it
prints the character position where the syntax error (or too-deeply-nested
expression) was recognized.
Fixes#14473
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14477
We have had support for COUNTER columns for quite some time now, but some functionality was left unimplemented - various internal and CQL functions resulted in "unimplemented" messages when used, and the goal of this series is to fix those issues. The primary goal was to add the missing support for CASTing counters to other types in CQL (issue #14501), but we also add the missing CQL `counterasblob()` and `blobascounter()` functions (issue #14742).
As usual, the series includes extensive functional tests for these features, and one pre-existing test for CAST that used to fail now begins to pass.
Fixes#14501
Fixes#14742
Closes#14745
* github.com:scylladb/scylladb:
test/cql-pytest: test confirming that casting to counter doesn't work
cql: support casting of counter to other types
cql: implement missing counterasblob() and blobascounter() functions
cql: implement missing type functions for "counters" type
We add a special mode of load balancing, enabled through error
injection, which causes it to continuously generate plans. This
should keep the topology coordinator continuously in the tablet
migration track.
We enable this mode in test_tablets.py:test_bootstrap before
bootstrapping nodes to see that bootstrap request interrupts
tablet migration track. If that were not the case, the
test would hang.
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decisions which, when executed alongside the active migrations, produce
the desired result.
Shard overload is limited by the fact that the algorithm tracks
streaming concurrency on both the source and target shards of active
migrations and takes the concurrency limit into account when producing
new migrations.
The coordinator executes the load balancer on the edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
This change makes the topology state machine advance each tablet
independently which allows them to finish migrations at different
speeds, not at the speed of the slowest tablet.
It will also open the possibility of starting new transitions concurrently
with already active ones.
This is implemented by having a single transition state "tablet
migration", and handling it by scanning all the transitions and
advancing tablet state machines. Updates and barriers are batched for
all tablets in each cycle.
One complication is the tracking of streaming sessions. The operations
are no longer nested in the scope of a single handle method, and
cannot be waited on explicitly, as that would inhibit progress of the
coordinator, which starts later migrations. They live as independent
fibers, which are associated with tablets in a transient data structure
living within the coordinator instance. This data structure is
consulted for a given tablet in each cycle of the
handle_tablet_migration() pump to check if streaming has finished and
we can move the tablet to the next stage. Only if the pump has no work
does it wait for any streaming to finish, by blocking on the
_topo_sm.event.
If _as is aborted while the coordinator is in the middle of handling,
and decides to go to sleep, it may go to sleep without noticing that
it was aborted. Fix by checking before blocking on the condition
variable.
In general, every condition which can cause signal() should be checked
before when(). This patch doesn't fix all the cases. For example,
signal() can be called when a new topology request arrives. This
can happen after the coordinator checked, because it releases the guard
before calling when().
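The rule "every condition which can cause signal() should be checked before when()" is the classic condition-variable pattern. A minimal sketch (Python with hypothetical names; the real code uses Seastar's condition variable):

```python
import threading

class Coordinator:
    def __init__(self):
        self._cond = threading.Condition()
        self._aborted = False
        self._pending_requests = 0

    def post_request(self):
        with self._cond:
            self._pending_requests += 1
            self._cond.notify_all()  # signal()

    def abort(self):
        with self._cond:
            self._aborted = True
            self._cond.notify_all()

    def wait_for_event(self):
        with self._cond:
            # Re-check every condition that could already have been
            # signalled *before* blocking; otherwise an abort (or request)
            # delivered while we were not waiting would be missed and we
            # would go to sleep anyway.
            self._cond.wait_for(
                lambda: self._aborted or self._pending_requests > 0)
            return self._aborted
```

With the predicate checked under the lock before sleeping, an abort that happened while the coordinator was busy handling work is noticed immediately instead of being lost.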
In the previous patch we implemented CAST operations from the COUNTER
type to various other types. We did not implement the reverse cast,
from different types to the counter type. Should we? In this patch
we add a test that shows we don't need to bother - Cassandra does not
support such casts, so it's fine that we don't either - and indeed the
test shows we don't support them.
It's not a useful operation anyway.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We were missing support in the "CAST(x AS type)" function for the counter
type. This patch adds this support, as well as extensive testing that it
works the same in Scylla as in Cassandra.
We also un-xfail an existing test translated from Cassandra's unit
test. But note that this old test did not cover all the edge-cases that
the new test checks - some missing cases in the implementation were
not caught by the old test.
Fixes#14501
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Code in functions.cc creates the different TYPEasblob() and blobasTYPE()
functions for all type names TYPE. The functions for the "counter" type
were skipped, supposedly because "counters are not supported yet". But
counters are supported, so let's add the missing functions.
The code fix is trivial; the tests that verify that the result behaves
like Cassandra took more work.
After this patch, unimplemented::cause::COUNTERS is no longer used
anywhere in the code. I wanted to remove it, but noticed that
unimplemented::cause is a graveyard of unused causes, so decided not
to remove this one either. We should clean it up in a separate patch.
Fixes#14742
Also includes tests for tangentially related issues:
Refs #12607
Refs #14319
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
types.cc had eight of its functions unimplemented for the counter
type, throwing an "unimplemented::cause::COUNTERS" when used.
A ninth function (validate) was unimplemented for counters but did not
even throw.
Many code paths did not use any of these functions so didn't care, but
some do - e.g., the silly do-nothing "SELECT CAST(c AS counter)" when
c is already a counter column, which causes this operation to fail.
When the types.cc code encounters a counter value, it is (if I understand
it correctly) already a single uint64_t ("long_type") value, so we fall
back to the long_type implementation of all the functions. To avoid mistakes,
I simply copied the reversed_type implementation for all these functions -
whereas the reversed_type implementation falls back to using the underlying
type, the counter_type implementation always falls back to long_type.
After this patch, "SELECT CAST(c AS counter)" for a counter column works.
We'll introduce a test that verifies this (and other things) in a later
patch in this series.
The following patches will also need more of these functions to be
implemented correctly (e.g., blobascounter() fails to validate the size
of the input blob if the validate function isn't implemented for the
counter type).
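Since a counter value is, at this layer, a single signed 64-bit integer, the counter functions can fall back to the long type's behavior. An illustrative sketch (Python; the real code is in types.cc and functions.cc, and the function names here just mirror the CQL-level `counterasblob()`/`blobascounter()`):

```python
import struct

COUNTER_SIZE = 8  # a counter value is one signed 64-bit integer, like 'long'

def counter_as_blob(value: int) -> bytes:
    # counterasblob(): same serialization as the bigint/long type
    return struct.pack('>q', value)

def blob_as_counter(blob: bytes) -> int:
    # blobascounter(): validate() must reject blobs of the wrong size,
    # exactly as the long type does
    if len(blob) != COUNTER_SIZE:
        raise ValueError(
            f'expected {COUNTER_SIZE} bytes for a counter, got {len(blob)}')
    return struct.unpack('>q', blob)[0]
```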
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The scaffolding required to have a working scylla tool app is considerable, leading to a large amount of boilerplate code in each such app. This logic is also very similar across the two tool apps we have and would presumably be very similar in any future app. This PR extracts this logic into `tools/utils.hh` and introduces `tool_app_template`, which is similar to `seastar::app_template` in that it centralizes all the option handling and more in a single class, which each tool has to just instantiate and then call `run()` on to run the app.
This cuts down on the repetition and boilerplate in our current tool apps and makes prototyping new tool apps much easier.
Closes#14855
* github.com:scylladb/scylladb:
tools/utils.hh: remove unused headers
tools/utils: make get_selected_operation() and configure_tool_mode() private
tools/utils.hh: de-template get_selected_operation()
tools/scylla-types: migrate to tools_app_template
tools/scylla-types: prepare for migration to tool_app_template
tools/scylla-sstable.cc: fix indentation
tools/scylla-sstables: migrate to tool_app_template
tools/scylla-sstables: prepare for migration to tool_app_template
tools: extract tool app skeleton to utils.hh
The selector keeps the selected format in system.local and uses the static
db::system_keyspace::(get|set)_scylla_local_param() helpers to access
it. The helpers are being made non-static, so the selector should call
them on a system_keyspace object, not on the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14871
convict doesn't do anything useful in this case
since we're already in mark_as_shutdown and
convict is called after mark_dead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This refreshes clang to 16.0.6 and libstdc++ to 13.1.1.
compiler-rt, libasan, and libubsan are added to install-dependencies.sh
since they are no longer pulled in as dependencies.
Closes#13730
This reverts commit fb05fddd7d. After
1554b5cb61 ("Update seastar submodule"),
which fixed a coroutine bug in Seastar, it is no longer necessary.
Also revert the related "build: drop the warning on -O0 might fail tests"
(894039d444).
* seastar c0e618bbb...0784da876 (11):
> Revert "metrics: Remove registered_metric::operator()"
> build: use new behavior defined by CMP0127
> build: pass -DBOOST_NO_CXX98_FUNCTION_BASE to C++ compiler
> coroutine: fix a use-after-free in maybe_yield
Ref #13730.
> Merge 'sstring: add more accessors' from Kefu Chai
> Merge 'semaphore: semaphore_units: return units when reassigned' from Benny Halevy
> metrics: do not define defaulted copy assignment operator
> HTTP headers in http_response are now case insensitive
> rpc: Make server._proto a reference
> Merge 'Cleanup class metrics::registered_metrics' from Pavel Emelyanov
> core: undefine fallthrough to fix compilation error
Closes#14862
Currently, streaming and repair process and send data as-is. This is wasteful: streaming might be sending data which is expired or covered by tombstones, taking up valuable bandwidth and processing time. Repair could additionally be exposed to artificial differences, due to different nodes being in different states of compactness.
This PR adds opt-in compaction to `make_streaming_reader()`, then opts in all users. The main difference is in how the users choose the compaction time:
* Load'n'stream and streaming use the current time on the local node.
* Repair uses a centrally chosen compaction time, generated on the repair master and propagated to all repair followers. This ensures all repair participants work with the exact same state of compactness.
Importantly, this compaction does *not* purge tombstones (tombstone GC is disabled completely).
Fixes: https://github.com/scylladb/scylladb/issues/3561
Closes#14756
* github.com:scylladb/scylladb:
replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
repair/row_level: opt in to compacting the stream
streaming: opt-in to compacting the stream
sstables_loader: opt-in for compacting the stream
replica/table: add optional compacting to make_multishard_streaming_reader()
replica/table: add optional compacting to make_streaming_reader()
db/config: add config item for enabling compaction for streaming and repair
repair: log the error which caused the repair to fail
readers: compacting_reader: use compact_mutation_state::abandon_current_partition()
mutation/mutation_compactor: allow user to abandon current partition
... instead of the global qctx. The now-used qctx->execute_cql() just calls
query_processor::execute_internal with cache_internal::yes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14874
The schema_tables() column-mapping code runs queries over a system table,
but it needs LOCAL_ONE CL and fine-grained control over caching, so the
regular system_keyspace::execute_cql() won't work here.
However, since schema_tables is somewhat part of system_keyspace, it's
natural to let the former fetch private query_processor& from the latter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The callers all have local sys_ks argument:
- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()
And a test that can get it from cql_test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will need one to pass to db::schema_tables code. The caller is paxos
code with sys_ks local variable at hand
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction_manager_basic_test checks the stats of compaction_manager to
verify that there are no ongoing or pending compactions after triggering
the compaction and waiting for its completion. But in #14865, there are
still active compaction(s) after the compaction_manager's stats show there
is at least one task completed.
To understand this issue better, let's use `BOOST_CHECK_EQUAL()` instead
of `BOOST_REQUIRE()`, so that the test does not error out when the check
fails, and we can have a better understanding of the status when the test
fails.
Refs #14865
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14872
It now has a single user, so it doesn't have to be a template.
For now, make the method inline, so it can stay in the header. It will
be moved to utils.cc in the next patch.
The skeleton of the two existing scylla-native tools (scylla-types and
scylla-sstable) is very similar. By skeleton, I mean all the boilerplate
around creating and configuring a seastar::app_template, representing
operations/command and their options, and presenting and selecting
these.
To facilitate code-sharing and quick development of any new tools,
extract this skeleton from scylla-sstable.cc into tools/utils.hh,
in the form of a new tool_app_template, which wraps a
seastar::app_template and centralizes all the boilerplate logic in a
single place. The extracted code is not a simple copy-paste, although
many elements are simply copied. The original code is not removed yet.
The migration_manager service is responsible for schema convergence
in the cluster - pushing schema changes to other nodes and pulling
schema when a version mismatch is observed. However, there is also
a part of migration_manager that doesn't really belong there -
creating mutations for schema updates. These are the functions with
prepare_ prefix. They don't modify any state and don't exchange any
messages. They only need to read the local database.
We take these functions out of migration_manager and make them
separate functions to reduce the dependency of other modules
(especially query_processor and CQL statements) on
migration_manager. Since all of these functions only need access
to storage_proxy (or even only replica::database), doing such a
refactor is not complicated. We just have to add one parameter,
either storage_proxy or database and both of them are easily
accessible in the places where these functions are called.
These are users of global `qctx` variable or call `(get|set)_scylla_local_param(_as)?` which, in turn, also reference the `qctx`. Unfortunately, the latter(s) are still in use by other code and cannot be marked non-static in this PR
Closes#14869
* github.com:scylladb/scylladb:
system_keyspace: De-static set_raft_group0_id()
system_keyspace: De-static get_raft_group0_id()
system_keyspace: De-static get_last_group0_state_id()
system_keyspace: De-static group0_history_contains()
raft: Add system_keyspace argument to raft_group0::join_group0()
The TLS certificate authenticator registers itself using a
`class_registrator`; that's why CMake is able to build without
compiling this source file. But for the sake of completeness, and
to be in sync with configure.py, let's add it to CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14866
The callers are raft_group0_client, with a sys.ks. dependency reference, and
group0_state_machine, with raft_group0_client exposing its sys.ks.
This makes it possible to instantly drop one more qctx reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The caller is raft_group0_client with a sys.ks. dependency reference.
This allows dropping one qctx reference right away
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method will need one to access db::system_keyspace methods. The
sys.ks. is at hand and in use in both callers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We add the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams so that it
doesn't create new generations when unnecessary.
Additionally, we modify the test_topology_ops.py test in a way
that verifies the new changes. We call
storage_service::raft_check_and_repair_cdc_streams multiple
times concurrently and verify that exactly one generation has been
created.
We delay the deletion of the new_cdc_generation request to the
moment when the topology transition reaches the
publish_cdc_generation state. We need this change to ensure
adding the CDC generation optimality check in the next commit
has the intended effect. Without it, it would be possible
that a task makes the new_cdc_generation request, and then, after
this request was removed but before committing the new generation,
another task also makes the new_cdc_generation request. In such
a scenario, two generations are created, but only one should be.
After the change introduced by this commit, the second request
would have no effect.
After adding the CDC generation optimality check in
storage_service::raft_check_and_repair_cdc_streams in the
following commits, multiple tasks will be waiting for a single
generation change. Calling signal on topology_state_machine.event
won't wake them all. Moreover, we must ensure the topology
coordinator wakes when its logic expects it. Therefore, we change
all signal calls on topology_state_machine.event to broadcast.
In the following commits, we add the CDC generation optimality
check to storage_service::raft_check_and_repair_cdc_streams so
that it doesn't create new CDC generations when unnecessary. Since
generation_service::check_and_repair_cdc_streams already has
this check, we extract it to the new is_cdc_generation_optimal
function to not duplicate the code.
cleanup_compaction_task_executor inherits both from compaction_task_executor
and cleanup_compaction_task_impl.
Add a new version of compaction_manager::perform_task_on_all_files
which accepts only the tasks that are derived from compaction_task_impl.
After all task executors' conversions are done, the new version replaces
the original one.
Debug mode is so slow that the work:poll ratio decreases, leading
to even more slowness as more polling is done for the same amount
of work.
Increase the task quota to recover some performance.
Ref #14752.
Closes#14820
To make it consistent with the upcoming methods, methods triggering
major compaction get std::optional<tasks::task_info> as an argument.
Thanks to that, we can distinguish between a task that has no parent
and a task which won't be registered in the task manager.
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key into the digest. If the same mutations
were compacted prior to the query, because the tombstones expired, we
won't get any mutation at all and won't feed the partition key. So the
schema digest will change once an empty partition of some schema table
is compacted away.
Tombstones expire 7 days after the schema change which introduced them.
If one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs a
schema_ptr of the exact schema version requested by the coordinator. If
it doesn't know that version, it will request it from the coordinator
and perform a full schema merge. This adds latency to every such
request. Schema versions which are not referenced are currently kept in
cache for only 1 second, so if the request flow has a low enough rate,
this situation results in perpetual schema pulls.
After ae8d2a550d (5.2.0), it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, changes
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
c2ba94dc39e4add9db213751295fb17b95e6b962. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
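The idea behind the expiry-insensitive digest can be sketched as follows (illustrative Python; the hash and data layout are heavily simplified): a partition whose mutation compacts away to nothing contributes nothing to the digest, so the digest no longer changes when that empty partition is later physically compacted away.

```python
import hashlib

def table_schema_digest(partitions, insensitive_to_expiry: bool) -> str:
    """partitions: iterable of (partition_key_bytes, rows_after_compaction)."""
    h = hashlib.md5()
    for key, rows in partitions:
        if insensitive_to_expiry and not rows:
            # an empty partition would simply not appear in the query once
            # its expired tombstones are compacted away, so skip it now too
            continue
        h.update(key)
        for row in rows:
            h.update(row)
    return h.hexdigest()
```

With the feature on, the digest computed before and after the empty partition disappears is identical; with it off, the two digests differ, which is exactly the divergence described above.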
Fixes#4485.
Manually tested using ccm on cluster upgrade scenarios and node restarts.
Closes#14441
* github.com:scylladb/scylladb:
test: schema_change_test: Verify digests also with TABLE_DIGEST_INSENSITIVE_TO_EXPIRY enabled
schema_mutations, migration_manager: Ignore empty partitions in per-table digest
migration_manager, schema_tables: Implement migration_manager::reload_schema()
schema_tables: Avoid crashing when table selector has only one kind of tables
This commit adds a requirement to upgrade ScyllaDB
drivers before upgrading ScyllaDB.
The requirement to upgrade the Monitoring Stack
has been moved to the new section so that
both prerequisites are documented together.
NOTE: The information is added to the 5.2-to-5.3
upgrade guide because all future upgrade guides
will be based on this one (as it's the latest one).
If 5.3 is released, this commit should be backported
to branch-5.3.
Refs https://github.com/scylladb/scylladb/issues/13958
Closes#14771
If the cluster isn't empty and all servers are stopped, calling
ScyllaCluster.add_server can start a new cluster. That's because
ScyllaCluster._seeds uses the running servers to calculate the
seed node list, so if all nodes are down, the new node would
select only itself as a seed, starting a new cluster.
As a single ScyllaCluster should describe a single cluster, we
make ScyllaCluster.add_server fail when called on a non-empty
cluster with all its nodes stopped.
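The guard can be sketched like this (illustrative Python; ScyllaCluster internals are reduced to per-server running flags):

```python
class ClusterError(Exception):
    pass

def check_can_add_server(servers_running):
    """servers_running: one bool per existing server in the cluster."""
    if servers_running and not any(servers_running):
        # with every existing node down, the seed list would be empty, so
        # the new node would pick itself as the only seed and bootstrap a
        # brand new cluster instead of joining this one
        raise ClusterError(
            'refusing to add a server: cluster is non-empty but all its '
            'nodes are stopped')
```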
Closes#14804
When we convert a timestamp into a string, it must look like '2017-12-27T11:57:42.500Z'.
This concerns any conversion except the JSON timestamp format:
the JSON string has a space as the time separator and must look like '2017-12-27 11:57:42.500Z'.
Both formats always contain milliseconds and the timezone specification.
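A sketch of the two output shapes (illustrative Python; the real conversion lives in Scylla's C++ type code): both variants carry milliseconds and the 'Z' timezone suffix, differing only in the date/time separator.

```python
from datetime import datetime, timezone

def format_timestamp(ts_ms: int, json_style: bool = False) -> str:
    """ts_ms: non-negative milliseconds since the Unix epoch, UTC."""
    dt = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    sep = ' ' if json_style else 'T'   # JSON uses a space separator
    # both formats always contain milliseconds and the timezone
    return dt.strftime(f'%Y-%m-%d{sep}%H:%M:%S') + f'.{ts_ms % 1000:03d}Z'
```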
Fixes#14518
Fixes#7997
Closes#14726
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure there are no
accidental opt-outs.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be force turned-off.
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being on some undefined state of compaction.
Use locally generated compaction time on each node. This could lead to
different nodes making different decisions on what is expired or not.
But this is already the case for streaming, as what exactly is expired
depends on when compaction last ran.
Do to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.
Call sites are updated, but none opt in just yet.
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!
All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in separate commit per user.
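The opt-in shape can be sketched like this (illustrative Python; the fragment model and parameter names are assumptions, loosely following the commit text): a disengaged `compaction_time` means no compaction, an engaged one drives expiry decisions, and the global config override can force-disable the whole thing.

```python
def make_streaming_reader(fragments, compaction_time=None,
                          compaction_enabled=True):
    """fragments: iterable of (data, expiry) pairs; expiry may be None."""
    if compaction_time is None or not compaction_enabled:
        # disengaged parameter (or global override): stream data as-is
        yield from (data for data, _ in fragments)
        return
    for data, expiry in fragments:
        # drop expired data; tombstones are *not* garbage-collected here
        if expiry is None or expiry > compaction_time:
            yield data
```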
Compacting can greatly reduce the amount of data to be processed by
streaming and repair, but with certain data shapes, its effectiveness
can be reduced and its CPU overhead might outweigh the benefits. This
should very rarely be the case, but leave an off switch in case
this becomes a problem in a deployment.
Not wired yet.
Instead of just a boolean _failed flag, persist the error message of the
exception which caused the repair to fail, and include it in the log
message announcing the failure.
Do this when next_partition() or fast_forward_to() is called, instead
of trying to simulate a properly closed partition by injecting synthetic
mutation fragments.
Currently, the compactor requires a valid stream and thus abandoning a
partition in the middle was not possible. This causes some complications
for the compacting reader, which implements methods such as
`next_partition()` which is possibly called in the middle of a
partition. In this case the compacting reader attempts to close the
partition properly by inserting a synthetic partition-end fragment into
the stream. This is not enough however as it doesn't close any range
tombstone changes that might be active. Instead of piling on more
complexity, add an API to the compactor which allows abandoning the
current partition.
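A minimal sketch of the new API (illustrative Python; the real compactor lives in mutation/mutation_compactor.hh and this state model is simplified): abandoning resets all per-partition state, including still-open range tombstone changes, which a synthetic partition-end fragment alone could not clean up.

```python
class Compactor:
    def __init__(self):
        self._in_partition = False
        self._open_range_tombstones = []

    def consume_partition_start(self, key):
        self._in_partition = True

    def consume_range_tombstone_change(self, rtc):
        self._open_range_tombstones.append(rtc)

    def abandon_current_partition(self):
        # Forget the partition being consumed mid-way: clear any still-open
        # range tombstone changes and partition state, so the caller (e.g.
        # the compacting reader's next_partition()) can jump straight to
        # the next partition without emitting synthetic fragments.
        self._open_range_tombstones.clear()
        self._in_partition = False
```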
Simpler than the "begin, end" iterator pair. Also tighten the type
constraints: the value type is now required to be
sstables::shared_sstable, which matches what we expect in the
implementation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14678
The db.get_version() called that early returns the value the database got
at construction time, i.e. the empty_version. It makes little sense to
commit it into the system keyspace, all the more so because the "real"
version is calculated and updated a few steps after .setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14833
Fortunately, this is pretty simple -- the only caller is storage_service,
which has a sharded<system_keyspace> dependency reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14824
It was found that cached_file dtor can hit the following assert
after OOM
cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed.
cached_file's dtor iterates through all entries and evicts those
that are linked to the LRU, under the assumption that all unused
entries were linked to the LRU.
That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.
share() is responsible for automatically placing the page into
LRU once refcount drops to zero.
If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.
Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.
A page brought in by read ahead will now be linked into the LRU so it
doesn't sit in memory indefinitely. This also allows the cached_file
dtor to clear the whole cache even if some of the pages brought in
advance were never fetched later.
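The fix can be sketched with a toy refcounted page cache (illustrative Python; method names like `populate`/`share` are assumptions mirroring the description, not the cached_file API): read-ahead pages start at refcount 0 and are linked into the LRU immediately, so they stay evictable even if never shared.

```python
class Page:
    def __init__(self):
        self.refs = 0

class CachedFile:
    def __init__(self):
        self.pages = {}  # index -> Page
        self.lru = []    # unused (refcount == 0) pages, safe to evict

    def populate(self, idx):
        # a read may bring in extra pages via read ahead; link them into
        # the LRU right away so they don't sit in memory unlinked if the
        # read is aborted before they are shared
        page = self.pages.setdefault(idx, Page())
        if page.refs == 0 and page not in self.lru:
            self.lru.append(page)
        return page

    def share(self, idx):
        # taking a reference removes the page from the LRU
        page = self.pages[idx]
        if page in self.lru:
            self.lru.remove(page)
        page.refs += 1
        return page
```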
A reproducer was added.
Fixes#14814.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14818
Fixes https://github.com/scylladb/scylla-docs/issues/4028
The goal of this update is to discourage the use of
the default cassandra superuser in favor of a custom
super user - and explain why it's a good practice.
The scope of this commit:
- Adding a new page on creating a custom superuser.
The page collects and clarifies the information
about the cassandra superuser from other pages.
- Remove the (incomplete) information about
superuser from the Authorization and Authentication
pages, and add the link to the new page instead.
Additionally, this update will result in better
searchability and ensures language clarity.
Closes#14829
This is to make m.s. initialization more solid and simplify sys.ks.::setup()
Closes#14832
* github.com:scylladb/scylladb:
system_keyspace: Remove unused snitch arg from setup()
messaging_service: Setup preferred IPs from config
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.
We missed a few such cases in test_gsi.py, and this patch corrects them.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.
There are no functional changes in this patch - most of the lines
changed are just reindents.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14835
Make sure _pending_mark_alive_endpoints is unmarked in
any case, including exceptions.
Fixes#14839
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14840
The view updating consumer uses `_buffer_size` to decide when to flush the accumulated mutations, passing them to the actual view building code. This `_buffer_size` is incremented every time a mutation fragment is consumed. This is not exact, as e.g. range tombstones are represented differently in the mutation object than in the fragment, but it is good enough. There is one flaw however: `_buffer_size` is not incremented when consuming a partition-start fragment. This is when the mutation object is created in the mutation rebuilder. This is not a big problem when partitions have many rows, but if the partitions are tiny, the error in accounting quickly becomes significant. If the partitions are empty, `_buffer_size` is not bumped at all, and any number of these can accumulate in the buffer. We have recently seen this causing stalls and OOM as the buffer grew to an immense size, containing only empty and tiny partitions.
This PR fixes this by accounting the size of the freshly created `mutation` object in `_buffer_size`, after the partition-start fragment is consumed.
Fixes: #14819
Closes #14821
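A minimal Python sketch of the accounting fix described above, with hypothetical names mirroring the C++ consumer; the flush threshold value is illustrative:

```python
FLUSH_THRESHOLD = 1024  # bytes; illustrative, not Scylla's actual threshold

class ViewUpdatingConsumer:
    """Sketch of a consumer that flushes buffered mutations by size."""
    def __init__(self, flush):
        self._flush = flush
        self._buffer = []
        self._buffer_size = 0

    def consume_new_partition(self, base_mutation_size):
        # The fix: account for the freshly created mutation object itself,
        # so a stream of empty partitions still grows _buffer_size.
        self._buffer.append(base_mutation_size)
        self._buffer_size += base_mutation_size
        self._maybe_flush()

    def _maybe_flush(self):
        if self._buffer_size >= FLUSH_THRESHOLD:
            self._flush(self._buffer)
            self._buffer = []
            self._buffer_size = 0
```

Without the bump in `consume_new_partition`, empty partitions would accumulate without ever crossing the threshold.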
* github.com:scylladb/scylladb:
test/boost/view_build_test: add test_view_update_generator_buffering_with_empty_mutations
db/view/view_updating_consumer: account for the size of mutations
mutation/mutation_rebuilder*: return const mutation& from consume_new_partition()
mutation/mutation: add memory_usage()
Population of the messaging service preferred-IPs cache happens inside
the system keyspace setup() call, and it needs m.s. per se and additionally
the snitch. Moving preferred-IP cache population to initial configuration keeps
m.s. start more self-contained and keeps system_keyspace::setup() simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Today, test/*/run always kills Scylla at the end of the test with
SIGKILL (kill -9), so the Scylla shutdown code doesn't run. It was
believed that a clean shutdown would take a long time, but in fact,
it turns out that 99% of the shutdown time was a silly sleep in the
gossip code, which this patch disables with the "--shutdown-announce-in-ms"
option.
After enabling this option, clean shutdown takes (in a dev build on
my laptop) just 0.02 seconds. It's worth noting that this shutdown
has no real work to do - no tables to flush, and so on, because the
pytest framework removes all the tables in its own fixture cleanup
phase.
So in this patch, to kill Scylla we use SIGTERM (15) instead of SIGKILL.
We then wait until a timeout of 10 seconds (much much more than 0.02
seconds!) for Scylla to exit. If for some reason it didn't exit (e.g.,
it hung during the shutdown), it is killed again with SIGKILL, which
is guaranteed to succeed.
This change gives us two advantages:
1. Every test run with test/*/run exercises the shutdown path. It is perhaps
excessive, but since the shutdown is so quick, there is no big downside.
2. In a test-coverage run, a clean shutdown allows flushing the counter
files, which wasn't possible when Scylla was killed with KILL -9.
Fixes #8543
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14825
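The SIGTERM-then-SIGKILL sequence described above can be sketched in Python (`stop_scylla` is a hypothetical helper, not the actual test/*/run code):

```python
import signal
import subprocess

def stop_scylla(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask the process to shut down cleanly; escalate to SIGKILL on timeout."""
    proc.send_signal(signal.SIGTERM)   # request a clean shutdown
    try:
        # Wait much longer than a clean shutdown normally takes.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                    # SIGKILL is guaranteed to succeed
        return proc.wait()
```

The timeout only matters when the process hangs during shutdown; in the common case the wait returns almost immediately.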
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.
The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.
The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.
Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.
Closes #14601
* github.com:scylladb/scylladb:
tests: test_tablets: Add test for bootstrapping a node
storage_service: topology_coordinator: Implement tablet migration state machine
tablets: Introduce tablet_mutation_builder
service: tablet_allocator: Introduce tablet load balancer
tablets: Introduce tablet_map::for_each_tablet()
topology: Introduce get_node()
token_metadata: Add non-const getter of tablet_metadata
storage_service: Notify topology state machine after applying schema change
storage_service: Implement stream_tablet RPC
tablets: Introduce global_tablet_id
stream_transfer_task, multishard_writer: Work with table sharder
tablets: Turn tablet_id into a struct
db: Do not create per-keyspace erm for tablet-based tables
tablets: effective_replication_map: Take transition stage into account when computing replicas
tablets: Store "stage" in transition info
doc: Document tablet migration state machine and load balancer
locator: erm: Make get_endpoints_for_reading() always return read replicas
storage_service: topology_coordinator: Sleep on failure between retries
storage_service: topology_coordinator: Simplify coordinator loop
main: Require experimental raft to enable tablets
A test reproducing #14819, that is, the view update builder not flushing
the buffer when only empty partitions are consumed (with only a
tombstone in them).
All partitions will have a corresponding mutation object in the buffer.
These objects have non-negligible sizes, yet the consumer did not bump
the _buffer_size when a new partition was consumed. This resulted in
empty partitions not moving the _buffer_size at all, and thus they could
accumulate without bounds in the buffer, never triggering a flush just
by themselves. We have recently seen this causing OOM.
This patch fixes that by bumping the _buffer_size with the size of the
freshly created mutation object.
The method is called by db::truncate_table_on_all_shards(); its call-chain, in turn, starts from:
- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests
All of the above can easily provide a system_keyspace reference. This, in turn, allows making the method non-static and using the query_processor reference from the system_keyspace object instead of the global qctx.
Closes #14778
* github.com:scylladb/scylladb:
system_keyspace: Make save_truncation_record() non-static
code: Pass sharded<db::system_keyspace>& to database::truncate()
db: Add sharded<system_keyspace>& to legacy_schema_migrator
The function dispatches a background operation that must be
waited on in stop().
Fixes scylladb/scylladb#14791
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14797
Identifies a tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share the same global_tablet_id.
Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.
It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid: the first node inserts
its tokens as normal without going through the bootstrap procedure.
Make _column_families and _ks_cf_to_uuid private to prevent unsafe
access. The maps can be accessed only through methods which use locks
if preemption is possible.
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
When messaging_service shuts down it first sets _shutting_down to true
and proceeds with stopping clients and servers. Stopping clients, in
turn, is calling client.stop() on each.
Setting _shutting_down is used in two places.
First, when a client is stopped, it may be in the middle of
some operation, which may result in a call to remove_error_rpc_client();
to avoid calling .stop() for the second time, that path just does
nothing if the shutdown flag is set (see 357c91a076).
Second, get_rpc_client() asserts that this flag is not set, so once
shutdown started it can make sure that it will call .stop() on _all_
clients and no new ones would appear in parallel.
However, after shutdown() is complete the _clients vector of maps
remains intact even though all clients from it are stopped. This is not
very debugging-friendly; the clients had better be removed on shutdown.
fixes: #14624
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14632
It is quite common to stop a tested scylla process with ^C, which will
raise KeyboardInterrupt from subprocess.run(). Catch and swallow this
exception, allowing the post-processing to continue.
The interrupted process has to handle the interrupt correctly too --
flush the coverage data even on premature exit -- but this is for
another patch.
Closes #14815
As for regular mutations, we do the check
twice in handle_counter_mutation, before
and after applying the mutations. The last
is important in case fence was moved while
we were handling the request - some post-fence
actions might have already happened at this
time, so we can't treat the request as successful.
For example, if topology change coordinator was
switching to write_both_read_new, streaming
might have already started and missed this update.
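The double check can be sketched as follows (Python; `apply_with_fence` and the exception name are hypothetical stand-ins for the C++ code):

```python
class StaleTopologyException(Exception):
    """Raised when the request's fencing token is behind the current version."""

def apply_with_fence(current_version, fence_token, apply_fn):
    # Check once before applying, for cheap early rejection...
    if fence_token < current_version():
        raise StaleTopologyException()
    apply_fn()
    # ...and once after: the fence may have moved while we were applying,
    # and post-fence actions (e.g. streaming) may have missed this update,
    # so the request cannot be reported as successful.
    if fence_token < current_version():
        raise StaleTopologyException()
```

Note that the mutation is still applied in the raced case; only the success report is withheld.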
In mutate_counters we can use a single fencing_token
for all leaders, since all the erms are processed
without yields and should underneath share the
same token_metadata.
We don't pass the fencing token for replication explicitly in
replicate_counter_from_leader, since
mutate_counter_on_leader_and_replicate doesn't capture the erm;
if the drain on the coordinator timed out, the erm for
replication might be different, and we should use the
corresponding (maybe new) topology version for
outgoing write replication requests. This delayed
replication is similar to any other background activity
(e.g. writing hints) - it takes the current erm and
the current token_metadata version for outgoing requests.
We will need it in later commit to return
exceptions from handle_counter_mutation.
We also add utils::Tuple concept restriction
for add_replica_exception_to_query_result
since its type parameters are always tuples.
We are going to add fencing for counter mutations,
this means handle_counter_mutation will sometimes throw
stale_topology_exception. RPC doesn't marshal exceptions
transparently, exceptions thrown by server are delivered
to the client as a general remote_verb_error, which is not
very helpful.
The common practice is to embed exceptions into handler
result type. In this commit we use already existing
exception_variant as an exception container. We mark
exception_variant with [[version]] attribute in the idl
file, this should handle the case when the old replica
(without exception_variant in the signature) is replying
to the new one.
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.
This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.
The problem happens because the closure was fed into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.
The fix is to use the sstable_predicate type, so there is no
need to construct a temporary object on the stack.
Fixes #14812.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14813
in this series, we use different table names in simple_backlog_controller_test. this test exercises sstables compaction strategies, and it creates and keeps multiple tables in a single test session. but we are going to add metrics on a per-table basis, and will use the table's ks and cf as the counter's labels. as the metrics subsystem does not allow multiple counters to share the same labels, the test would fail when the metrics are added.
to address this problem, in this change
1. a new ctor is added for `simple_schema`, so we can create `simple_schema` with different names
2. use the new ctor in simple_backlog_controller_test
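The idea behind the new constructor can be sketched in Python (names are hypothetical stand-ins for the C++ `simple_schema`):

```python
import uuid

class SimpleSchema:
    """Sketch: a test schema whose keyspace/table names default to ks.cf
    but can be customized so per-table metric labels stay unique."""
    def __init__(self, ks_name: str = "ks", cf_name: str = "cf"):
        self.ks_name = ks_name
        self.cf_name = cf_name

def unique_schema() -> SimpleSchema:
    # One distinct table name per test table, so metric labels never collide.
    return SimpleSchema(cf_name=f"cf_{uuid.uuid4().hex[:8]}")
```

Existing tests keep working with the defaults; only tests that hold several tables at once need distinct names.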
Fixes #14767
Closes #14783
* github.com:scylladb/scylladb:
test: use different table names in simple_backlog_controller_test
test/lib/simple_schema: add ctor for customizing ks.cf
test/lib/simple_schema: do not hardwire ks.cf
This commit adds the information on how to install ScyllaDB
without root privileges (with "unified installer", but we've
decided to drop that name - see the page title).
The content taken from the website
https://www.scylladb.com/download/?platform=tar&version=scylla-5.2#open-source
is divided into two sections: "Download and Install" and
"Configure and Run ScyllaDB".
In addition, the "Next Steps" section is also copied from
the website, and adjusted to be in sync with other installation
pages in the docs.
Refs https://github.com/scylladb/scylla-docs/issues/4091
Closes #14781
In this commit we just pass a fencing_token
through hint_mutation RPC verb.
The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server's stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.
Closes #14580
We don't load gossiper endpoint states in `storage_service::join_cluster` if `_raft_topology_change_enabled`, but gossiper is still needed even in case of `_raft_topology_change_enabled` mode, since it still contains part of the cluster state. To work correctly, the gossiper needs to know the current endpoints. We cannot rely on seeds alone, since it is not guaranteed that seeds will be up to date and reachable at the time of restart.
The problem was demonstrated by the test `test_joining_old_node_fails`, it fails occasionally with `experimental_features: [consistent-topology-changes]` on the line where it waits for `TEST_ONLY_FEATURE` to become enabled on all nodes. This doesn't happen since `SUPPORTED_FEATURES` gossiper state is not disseminated, and feature_service still relies on gossiper to disseminate information around the cluster.
The series also contains a fix for a problem in `gossiper::do_send_ack2_msg`, see commit message for details.
Fixes #14675
Closes #14775
* github.com:scylladb/scylladb:
storage_service: restore gossiper endpoints on topology_state_load fix
gossiper: do_send_ack2_msg fix
If a semaphore mismatch occurs, check whether both semaphores belong
to a user. If so, log a warning, bump the `querier_cache_scheduling_group_mismatches` stat and drop the cached reader instead of throwing an error.
Until now, the semaphore mismatch was only checked in multi-partition queries. The PR pushes the check down to `querier_cache` and performs it in all `lookup_*_querier` methods.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
This patch doesn't solve the problem of mismatched semaphores caused by changes in service levels/scheduling groups, but only mitigates it.
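A sketch of the lookup-time check (Python, with hypothetical names; the real logic lives in `querier_cache` and `replica::database::is_user_semaphore()`):

```python
import logging

class QuerierCache:
    """Sketch: on lookup, a mismatch between two *user* semaphores drops
    the cached reader instead of raising an internal error."""
    def __init__(self, is_user_semaphore):
        self._cache = {}
        self._is_user = is_user_semaphore
        self.scheduling_group_mismatches = 0

    def insert(self, key, semaphore, reader):
        self._cache[key] = (semaphore, reader)

    def lookup(self, key, semaphore):
        entry = self._cache.get(key)
        if entry is None:
            return None
        cached_semaphore, reader = entry
        if cached_semaphore is not semaphore:
            if self._is_user(cached_semaphore) and self._is_user(semaphore):
                # Scheduling group changed mid-query: warn, count, drop.
                logging.warning("semaphore mismatch; dropping cached reader")
                self.scheduling_group_mismatches += 1
                del self._cache[key]
                return None
            raise RuntimeError("semaphore mismatch on non-user semaphore")
        return reader
```

Dropping the reader costs one cold restart of the query; raising would fail it outright.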
Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770
Closes #14736
* github.com:scylladb/scylladb:
querier_cache: add stats of scheduling group mismatches
querier_cache: check semaphore mismatch during querier lookup
querier_cache: add reference to `replica::database::is_user_semaphore()`
replica:database: add method to determine if semaphore is user one
We don't load gossiper endpoint states in
storage_service::join_cluster if
_raft_topology_change_enabled, but gossiper
is still needed even in case of
_raft_topology_change_enabled mode, since it
still contains part of the cluster state.
To work correctly, the gossiper needs to know
the current endpoints. We cannot rely on seeds alone,
since it is not guaranteed that seeds will be
up to date and reachable at the time of restart.
The specific scenario of the problem: cluster with
three nodes, the second has the first in seeds,
the third has the first and second. We restart all
the nodes simultaneously, the third node uses its
seeds as _endpoints_to_talk_with in the first gossiper round
and sends SYN to the first and second. The first node
hasn't started its gossiper yet, so handle_syn_msg
returns immediately after if (!this->is_enabled());
The third node receives ack from the second node and
no communication from the first node, so it fills
its _live_endpoints collection with the second node
and will never communicate with the first node again.
The problem was demonstrated by the test
test_joining_old_node_fails, it fails occasionally with
experimental_features: [consistent-topology-changes]
on the line where it waits for TEST_ONLY_FEATURE
to become enabled on all nodes. This doesn't happen
since SUPPORTED_FEATURES gossiper state is not
disseminated because of the problem described above.
The first commit is needed since add_saved_endpoint
adds the endpoint with some default app states with locally
incrementing versions, and without that fix the gossiper
refuses to fill in the real app states for this endpoint later.
Fixes: #14675
Fixes #14668
In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb` variable is now added to `scylla.yaml` and `db/config`.
Additionally, we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668.
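For reference, a sketch of the resulting `scylla.yaml` entry; the value 128 comes from the discussion above, while the comment text is only a summary:

```yaml
# Maximum size of a schema commitlog segment, in MB. This also caps the
# largest schema mutation that can be written at once; 128 MB accommodates
# schema mutations that are much larger than average (see #13864).
schema_commitlog_segment_size_in_mb: 128
```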
Closes #14704
* github.com:scylladb/scylladb:
replica: do not derive the commitlog sync period for schema commitlog
config: set schema_commitlog_segment_size_in_mb to 128
config: add schema_commitlog_segment_size_in_mb variable
This commit is a first part of the fix for #14675.
The issue is about the test test_joining_old_node_fails
failing occasionally with
experimental_features: [consistent-topology-changes].
The next commit contains a fix for it, here we
solve the pre-existing gossiper problem
which we stumbled upon after the fix.
Local generation for addr may have been
increased since the current node sent
an initial SYN. Comparing versions across different
generations in get_state_for_version_bigger_than
could result in losing some app states with
smaller versions.
More specifically, consider a cluster with nodes
.1, .2, .3, .3 has .1 and .2 as seeds, .2 has .1
as a seed. Suppose .2 receives a SYN from .3 before
its gossiper starts, and it has a
version 0.24 for .1 in endpoint_states.
The digest from .3 contains 0.25 as a version for .1,
so examine_gossiper produces .1->0.24 as a digest
and this digest is sent to .3 as part of the ack.
Before processing this ack, .3 processed an ack from
.1 (scylla sends SYN to many nodes) and updates
its endpoint_states according to it, so now it
has .1->100500.32 for .1. Then
we get to do_send_ack2_msg and call
get_state_for_version_bigger_than(.1, 24).
This returns properties which have version > 24,
ignoring a lot of them with smaller versions
which have been received from .1. Also,
get_state_for_version_bigger_than updates
generation (it copies get_heart_beat_state from
.3), so when we apply the ack in handle_ack2_msg
at .2 we update the generation and now the
skipped app states will only be updated on .2
if somebody changes them and increments their version.
Cassandra behaviour is the same in this case
(see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/GossipDigestAckVerbHandler.java#L86). This is probably less
of a problem for them since most of the time they
send only one SYN in one gossiper round
(save for unreachable nodes), so there is less
room for conflicts.
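The version-only filtering described above can be demonstrated with a small sketch (Python; the app-state names, generations and versions are illustrative, not the real gossiper values):

```python
def states_newer_than(app_states, version):
    # Sketch of get_state_for_version_bigger_than: versions are compared
    # without regard to the generation the requester's digest refers to.
    return {name: (gen, ver)
            for name, (gen, ver) in app_states.items()
            if ver > version}

# .3's view of node .1 after applying .1's own ACK: a newer generation,
# whose versions restarted from small numbers.
node1_states = {
    "STATUS": (100500, 5),
    "SUPPORTED_FEATURES": (100500, 12),
    "HEARTBEAT": (100500, 32),
}
# .2's digest still refers to version 24 of .1's *old* generation, so the
# reply silently omits states whose new-generation versions are <= 24.
reply = states_newer_than(node1_states, 24)
```

Here `SUPPORTED_FEATURES` is exactly the kind of state that gets skipped, matching the symptom seen in the test.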
before this change, add_version_library() is a single function
which accomplishes two tasks:
1. build the scylla-version target
2. add an object library
but this has two problems:
1. we should run `SCYLLA-VERSION-GEN` at configure time, instead
of at build time. otherwise the targets which read from the
SCYLLA-{VERSION, RELEASE, PRODUCT}-FILE cannot access them,
unless they are able to read them in their build rules. but
they always use `file(STRINGS ..)` to read them, and the
`file()` command is executed at configure time. so, this
is a dead end.
2. we repeat the `file(STRINGS ..)` call in multiple places. this is
not ideal if we want to minimize repetition.
so, to address this problem, in this change:
1. use `execute_process()` instead of `add_custom_command()`
for generating these *-FILE files. so they are always ready
at build time. this partially reverts bb7d99ad37.
2. extract `generate_scylla_version()` out of `add_version_library()`.
so we can call the former much earlier than the latter.
this would allow us to reference the variables defined by
the `generate_scylla_version()` much earlier.
3. define cached strings in the extracted function, so that
they can be consumed by other places.
4. reference the cached variables in `build_submodule.cmake`.
also, take this opportunity to fix the version string
used in build_submodule.cmake: we should have used
`scylla_version_tilde`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14769
Previously semaphore mismatch was checked only in multi-partition
queries, and if it happened, an internal error was thrown.
This commit pushes the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.
What's more, if a semaphore mismatch occurs, we check whether both semaphores belong
to a user. If so, we log a warning and drop the cached reader instead of
throwing an error.
The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
Before choosing a function, we prepare the arguments that can be
prepared without a receiver. Preparing an argument makes
its type known, which allows choosing the best overload
among many possible functions.
The function that prepared the argument passes the unprepared
argument by mistake. Let's fix it so that it actually uses
the prepared argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #14786
We allow inserting column values using a JSON value, e.g.:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```
When no JSON value is specified, the query should be rejected.
Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.
A unit test is added to prevent regressions.
Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #14785
in `simple_backlog_controller_test`, we need to have multiple tables
at the same time. but the default constructor of `simple_schema` always
creates a schema with the table name of "ks.cf". we are going to have
per-table metrics, and the new metric group will use the table name
as its counter labels, so we need to either disable the per-table
metrics or use a different table name for each table.
as in the real world we don't have multiple tables with the same name
at the same time, it would be better to stop reusing the same table
name in a single test session. so, in this change, we use a random
cf_name for each of the created tables.
Fixes #14767
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
some low level tests, like the ones exercising sstables, create
multiple tables. and we are going to add per-table metrics, and
the new metrics use the ks.cf as part of their unique id. so,
once the per-table metrics is enabled, the sstable tests would fail.
as the metrics subsystem does not allow registering multiple
metric groups with the same name.
so, in this change, we add a new constructor for `simple_schema`,
so that we can customize the schema's ks and cf when creating
the `simple_schema`. in the next commit, we will use this new
constructor in an sstable test which creates multiple tables.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead, query the names of ks and cf from the schema. this change
prepares us for a simple_schema whose ks and cf can be customized
by its constructor.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The argument goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from
- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One of the class' methods calls db::drop_table_on_all_shards() that will
need sys.ks. in the next patch.
The reference in question is provided from the only caller -- main.cc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in 46616712, we tried to keep the tmpdir only if the test failed,
and keep up to 1 of them using the recently introduced
option of `tmp_path_retention_count`. but it turns out this option
is not supported by the pytest used by our jenkins nodes, where we
have pytest 6.2.5. this is the one shipped along with fedora 36.
so, in this change, the tempdir is removed if the test completes
without failures. as the tempdir contains a huge number of files,
jenkins is quite slow scanning them. after nuking the tempdir,
jenkins will be much faster when scanning for the artifacts.
Fixes #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14772
The mutation compactor has a validator which it uses to validate the stream of mutation fragments that passes through it. This validator is supposed to validate the stream as it enters the compactor, as opposed to its compacted form (output). This was true for most fragment kinds except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by https://github.com/scylladb/scylladb/commit/e2c9cdb576, which itself was a flawed attempt at fixing an error seen because purged tombstones were not terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly: on page-cut, if the validator has an active tombstone, a closing tombstone is generated for it, to avoid the false-positive error. With this, range tombstones can be validated again as they come in.
The existing unit test checking the validation in the compactor is greatly expanded to check all (I hope) different validation scenarios.
Closes #13817
* github.com:scylladb/scylladb:
test/mutation_test: test_compactor_validator_sanity_test
mutation/mutation_compactor: fix indentation
mutation/mutation_compactor: validate the input stream
mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator
readers: reader-from-fragment: don't modify stream when created without range
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).
Easily fixed by removing the empty alternative. A unit test is
added.
Fixes #14705.
Closes #14707
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.
Closes #14765
* github.com:scylladb/scylladb:
s3/test: use seastar::deferred() to perform cleanup
s3/test: close using deferred_close()
The set in question is read-and-delete-only and thus always empty. Originally it was removed by commit c9993f020d (storage_service: get rid of handle_state_replacing), but some dangling ends were left. Consequently, the on_alive() callback can get rid of a few dead if-else branches.
Closes #14762
* github.com:scylladb/scylladb:
storage_service: Relax on_alive()
storage_service: Remove _replacing_nodes_pending_ranges_updater
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.
Closes #14758
* github.com:scylladb/scylladb:
db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
db/system_keyspace: Make paxos methods non-static
service/paxos: Add db::system_keyspace& argument to some methods
test: Optionally initialize proxy remote for cql_test_env
proxy/remote: Keep sharded<db::system_keyspace>& dependency
Task manager tasks covering reshard compaction.
Reattempt of https://github.com/scylladb/scylladb/pull/14044. The bugfix for https://github.com/scylladb/scylladb/issues/14618 is squashed with 95191f4.
Regression test added.
Closes #14739
* github.com:scylladb/scylladb:
test: add test for resharding with non-empty owned_ranges_ptr
test: extend test_compaction_task.py to test resharding compaction
compaction: add shard_reshard_sstables_compaction_task_impl
compaction: invoke resharding on sharded database
compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
compaction: add reshard_sstables_compaction_task_impl
compaction: create resharding_compaction_task_impl
Greatly expand this test to check that the compactor validates the input
stream properly.
The test is renamed (the _sanity_test suffix is removed) to reflect the
expanded scope.
The mutation compactor has a validator which it uses to validate the
stream of mutation fragments that passes through it. This validator is
supposed to validate the stream as it enters the compactor, as opposed
to its compacted form (output). This was true for most fragment kinds
except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by e2c9cdb576, which itself was a flawed
attempt at fixing an error seen because purged tombstones were not
terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly:
on page-cut, if the validator has an active tombstone, a closing
tombstone is generated for it, to avoid the false-positive error. With
this, range tombstones can be validated again as they come in.
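The page-cut handling can be sketched as follows (Python, with hypothetical names standing in for the C++ validator):

```python
class StreamValidator:
    """Sketch: track the active range tombstone; on a page cut, synthesize a
    closing change so a stream cut mid-tombstone isn't flagged as invalid."""
    def __init__(self):
        self.active_tombstone = None
        self.errors = []

    def on_range_tombstone_change(self, bound, tombstone):
        # A change carrying tombstone=None closes the active tombstone.
        self.active_tombstone = tombstone

    def on_page_cut(self):
        if self.active_tombstone is not None:
            # Generate a closing change instead of leaving the tombstone
            # dangling, avoiding a false-positive validation error.
            self.on_range_tombstone_change("page-end", None)

    def on_end_of_stream(self):
        if self.active_tombstone is not None:
            self.errors.append("stream ended with an active range tombstone")
```

With this, the validator can again see every incoming range tombstone, purged or not, without misreporting page cuts.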
The fragment reader currently unconditionally forwards its buffer to the
passed-in partition range. Even if this range is
`query::full_partition_range`, this will involve dropping any fragments
up to the first partition start. This causes problems for test users who
intentionally create invalid fragment streams, that don't start with a
partition-start.
Refactor the reader to not do any modifications on the stream, when
neither slice, nor partition-range was passed by the user.
before this change, there are chances that the temporary sstables
created for collecting the GC-able data created by a certain
compaction can be picked up by another compaction job. this
wastes CPU cycles, adds write amplification, and causes
inefficiency.
in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have a good chance to
overlap with the latter, these GC-only SSTables are assigned
different run ids. but we fail to register them with the
`compaction_manager` when replacing the main sstable set.
that's why future compactions can pick them up while the
compaction which created them is not yet completed.
so, in this change,
* to prevent sstables in the transient stage from being picked
up by regular compactions, a new interface class is introduced
so that an sstable is always registered before it is added to
the sstable set, and unregistered only after it is removed from
the sstable set. the struct helps to consolidate the
registration-related logic in a single place, and makes it more
obvious that the registration lifetime of an sstable should
cover its lifetime in the sstable set.
* use a different run_id for the gc sstable run, as it can
overlap with the output sstable run. the run_id for the
gc sstable run is created only when the gc sstable writer
is created, because the gc sstables are not always created
for all compactions.
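The registration invariant described above can be sketched as an RAII guard. This is an illustrative example with invented names (`compacting_registry`, `registered_sstable`), not the actual Scylla interface: registration happens strictly before the sstable enters the set, and unregistration strictly after it leaves.

```cpp
#include <cassert>
#include <set>

// Hypothetical sketch: an RAII guard that registers an sstable with a
// registry before it is added to the sstable set, and unregisters it only
// after it has been removed from the set. This makes the invariant
// structural: the registration lifetime always covers set membership.
struct compacting_registry {
    std::set<int> registered;  // sstable generation ids, for the sketch
    bool is_registered(int gen) const { return registered.count(gen) > 0; }
};

class registered_sstable {
    compacting_registry& _registry;
    std::set<int>& _sstable_set;
    int _gen;
public:
    registered_sstable(compacting_registry& r, std::set<int>& set, int gen)
            : _registry(r), _sstable_set(set), _gen(gen) {
        _registry.registered.insert(_gen);  // register first...
        _sstable_set.insert(_gen);          // ...then expose via the set
    }
    ~registered_sstable() {
        _sstable_set.erase(_gen);           // remove from the set first...
        _registry.registered.erase(_gen);   // ...then unregister
    }
    registered_sstable(const registered_sstable&) = delete;
};
```

A compaction job scanning the set can then skip any generation the registry marks as part of an in-flight transaction.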
please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` pass a non-empty
`std::function` to this function, so there is no need to check for
emptiness before calling it. so in this change, the check is dropped.
Fixes#14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14725
Since repair is performed on all nodes, each node can just repair the
primary ranges instead of all owned ranges. This avoids repairing
ranges more than once.
Closes#14766
Commit f5e3b8df6d introduced an optimization for
as_mutation_source_excluding_staging() and added a test that
verifies correctness of single key and range reads based
on supplied predicates. This new test aims to improve the
coverage by testing directly both table::as_mutation_source()
and as_mutation_source_excluding_staging(), therefore
guaranteeing that both supply the correct predicate to
sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14763
let's use RAII to remove the object used as a fixture, so we don't
leave objects behind in the bucket used for testing. leftovers could
interfere with other tests which share the same minio server, if a
test fails to do its cleanup because an exception is thrown.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
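The RAII-based teardown in the two commits above can be sketched with a minimal scope guard. This is an assumed helper for illustration (not the actual test code): the cleanup action runs when the scope exits, even if the test body throws.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Minimal scope-guard sketch: run a cleanup action on scope exit, even if
// the guarded body throws, so a failing test cannot leave objects behind
// in shared state (e.g. a bucket on a shared minio server).
class scope_cleanup {
    std::function<void()> _cleanup;
public:
    explicit scope_cleanup(std::function<void()> f) : _cleanup(std::move(f)) {}
    ~scope_cleanup() { if (_cleanup) { _cleanup(); } }
    scope_cleanup(const scope_cleanup&) = delete;
};

// Returns true iff the cleanup ran despite the exception.
inline bool demo_cleanup_on_throw() {
    bool cleaned = false;
    try {
        scope_cleanup guard([&] { cleaned = true; });
        throw std::runtime_error("simulated test failure");
    } catch (const std::runtime_error&) {
        // the guard's destructor already ran during stack unwinding
    }
    return cleaned;
}
```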
Now that there's an always-false local variable, it can also be dropped
and all the associated if-else branches can be simplified
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set in question is always empty, so it can be removed and the only
check for its contents can be constified
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`expression`'s default constructor is dangerous, as it can leak
into computations and generate surprising results. Fix that by
removing the default constructor.
This is made somewhat difficult by the parser generator's reliance
on default construction, and we need to expand our workaround
(`uninitialized<>`) capabilities to do so.
We also remove some incidental uses of default-constructed expressions.
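The idea behind the series can be sketched in a few lines. This is a simplified illustration, not the actual cql3 code: the type's default constructor is deleted so no empty value can leak into computations, while a default-constructible wrapper (standing in for the `uninitialized<>` workaround the parser generator needs) holds the slot until it is assigned.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Illustrative stand-in for cql3's expression: no accidental empty values.
struct expression {
    std::string repr;
    expression() = delete;
    explicit expression(std::string r) : repr(std::move(r)) {}
};

// Default-constructible wrapper a generated parser can use for its value
// slots (simplified sketch of the uninitialized<> idea).
template <typename T>
class uninitialized {
    std::optional<T> _value;
public:
    uninitialized() = default;                   // starts empty, deliberately
    uninitialized& operator=(T v) { _value = std::move(v); return *this; }
    bool has_value() const { return _value.has_value(); }
    T& get() { return *_value; }                 // caller must have set it
};
```

The wrapper makes "not yet parsed" an explicit, queryable state instead of a silently default-constructed expression.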
Closes#14706
* github.com:scylladb/scylladb:
cql3: expr: make expression non-default-constructible
cql3: grammar: don't default-construct expressions
cql3: grammar: improve uninitialized<> flexibility
cql3: grammar: adjust uninitialized<> wrapper
test: expr_test: don't invoke expression's default constructor
cql3: statement_restrictions: explicitly initialize expressions in index match code
cql3: statement_restrictions: explicitly initialize some expression fields
cql3: statement_restrictions: avoid expression's default constructor when classifying restrictions
cql3: expr: prepare_expression: avoid default-constructed expression
cql3: broadcast_tables: prepare new_value without relying on expression default constructor
This template call is only used by system keyspace paxos methods. All
those methods are no longer static and can use the system_keyspace::_qp
reference to the real query processor instead of the global qctx. The
execute_cql_with_timeout() wrapper is moved to system_keyspace to make
it work.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service::paxos_state methods that call those already have system
keyspace reference at hand and can call method on an object
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The paxos_state's .prepare(), .accept(), .learn() and .prune() methods
access system keyspace via its static methods. The only caller of those
(storage_proxy::remote) already has the sharded system k.s. reference
and can pass its .local() one as argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some test cases that use cql_test_env involve paxos state updates. Since
this update now goes via proxy->remote->system_keyspace, those test
cases need cql_test_env to initialize the remote part of the proxy too
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In the lambda passed
to that method, we use both the sharded sstable_directory service
and its local instance.
To make it straightforward that the sharded and local instances are
dependent, we call sharded<replica::database>::invoke_on_all
instead and access the local directory through the sharded one.
Add task manager's task covering resharding compaction.
A struct and some functions are moved from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
This dependency will be needed to make service::paxos_state:: calls, and
all of them are done in storage_proxy::remote() methods only
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
* seastar bac344d58...c0e618bbb (7):
> resource: take kernel min_free_kbytes into account when allocating memory
Fixes#14721
> build: append JOB_POOLS definition instead of setting it
> net: use designated initialization when appropriate
> websocket: print logging message before handle_ping()
> circular_buffer_fixed_capacity_test: enable randomize test to reverse
> prometheus: do not qualify return type with const.
> alien: do not define customized copy ctor
Closes#14755
We don't want to apply the value of the commitlog_sync_period_in_ms
variable to schema commitlog. Schema commitlog runs in batch mode,
so it doesn't need this parameter.
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
this change is a follow-up of 3129ae3c8c.
since in both cases in this change `num_ranges` should always
be greater than zero, there is no need to use `int` for its type,
and the "num_ranges" returned by the CQL query should always be greater
than or equal to zero, so there is no need to check if it is positive.
in this change, we
* change the type of `num_ranges` to `size_t`
* change std::cmp_not_equal() to !=
to avoid the verbose `std::cmp_not_equal()` helper, for better
readability, now that both operands are unsigned.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14754
Regression test for #14257.
It starts two nodes. It introduces a sleep in raft_group_registry::on_alive
(in raft_group_registry.cc) when receiving a gossip notification about HOST_ID
update from the second node. Then it restarts the second node with a different IP.
Due to the sleep, the old notification from the old IP arrives after the second
node has restarted. If the bug is present, this notification overrides the address
map entry and the second read barrier times out, since the first node cannot reach
the second node with the old IP.
Closes#14609.
Closes#14728
this series enables CMake to build submodules. it helps developers to build, for instance, the java tools on demand.
Closes#14751
* github.com:scylladb/scylladb:
build: cmake: build submodules
build: cmake: generate version files with add_custom_command()
the "type_json_test" test was added locally and has not landed
on master, but it somehow spilled into 87170bf07a by accident.
so, let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14749
Now it sits in replica/database.cc, but the latter is already overloaded
with code and worth slimming down, all the more so since the ..._prefix
itself lives in the keys.hh header.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14748
The helper was left from the storage-service shutdown-vs-drain rework
(long ago), now it just occupies space in code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14747
SELECT MUTATION FRAGMENTS is a new select statement sub-type, which allows dumping the underlying mutations making up the data of a given table. The output of this statement is mutation-fragments presented as CQL rows. Each row corresponds to a mutation-fragment. Consequently, the output of this statement has a schema that is different from that of the underlying table. The output schema is derived from the table's schema, as follows:
* The table's partition key is copied over as-is
* The clustering key is formed from the following columns:
- mutation_source (text): the kind of the mutation source, one of: memtable, row-cache or sstable; and the identifier of the individual mutation source.
- partition_region (int): represents the enum with the same name.
- the copy of the table's clustering columns
- position_weight (int): -1, 0 or 1, has the same meaning as that in position_in_partition, used to disambiguate range tombstone changes with the same clustering key, from rows and from each other.
* The following regular columns:
- metadata (text): the JSON representation of the mutation-fragment's metadata.
- value (text): the JSON representation of the mutation-fragment's value.
Data is always read from the local replica, on which the query is executed. Migrating queries between coordinators is forbidden.
More details in the documentation commit (last commit).
Example:
```cql
cqlsh> CREATE TABLE ks.tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));
cqlsh> DELETE FROM ks.tbl WHERE pk = 0;
cqlsh> DELETE FROM ks.tbl WHERE pk = 0 AND ck > 0 AND ck < 2;
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 0, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 1, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (0, 2, 0);
cqlsh> INSERT INTO ks.tbl (pk, ck, v) VALUES (1, 0, 0);
cqlsh> SELECT * FROM ks.tbl;
pk | ck | v
----+----+---
1 | 0 | 0
0 | 0 | 0
0 | 1 | 0
0 | 2 | 0
(4 rows)
cqlsh> SELECT * FROM MUTATION_FRAGMENTS(ks.tbl);
pk | mutation_source | partition_region | ck | position_weight | metadata | mutation_fragment_kind | value
----+-----------------+------------------+----+-----------------+--------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
1 | memtable:0 | 0 | | | {"tombstone":{}} | partition start | null
1 | memtable:0 | 2 | 0 | 0 | {"marker":{"timestamp":1688122873341627},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122873341627}}} | clustering row | {"v":"0"}
1 | memtable:0 | 3 | | | null | partition end | null
0 | memtable:0 | 0 | | | {"tombstone":{"timestamp":1688122848686316,"deletion_time":"2023-06-30 11:00:48z"}} | partition start | null
0 | memtable:0 | 2 | 0 | 0 | {"marker":{"timestamp":1688122860037077},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122860037077}}} | clustering row | {"v":"0"}
0 | memtable:0 | 2 | 0 | 1 | {"tombstone":{"timestamp":1688122853571709,"deletion_time":"2023-06-30 11:00:53z"}} | range tombstone change | null
0 | memtable:0 | 2 | 1 | 0 | {"marker":{"timestamp":1688122864641920},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122864641920}}} | clustering row | {"v":"0"}
0 | memtable:0 | 2 | 2 | -1 | {"tombstone":{}} | range tombstone change | null
0 | memtable:0 | 2 | 2 | 0 | {"marker":{"timestamp":1688122868706989},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1688122868706989}}} | clustering row | {"v":"0"}
0 | memtable:0 | 3 | | | null | partition end | null
(10 rows)
```
Perf simple query:
```
/build/release/scylla perf-simple-query -c1 -m2G --duration=60
```
Before:
```
median 141596.39 tps ( 62.1 allocs/op, 13.1 tasks/op, 43688 insns/op, 0 errors)
median absolute deviation: 137.15
maximum: 142173.32
minimum: 140492.37
```
After:
```
median 141889.95 tps ( 62.1 allocs/op, 13.1 tasks/op, 43692 insns/op, 0 errors)
median absolute deviation: 167.04
maximum: 142380.26
minimum: 141025.51
```
Fixes: https://github.com/scylladb/scylladb/issues/11130
Closes#14347
* github.com:scylladb/scylladb:
docs/operating-scylla/admin-tools: add documentation for the SELECT * FROM MUTATION_FRAGMENTS() statement
test/topology_custom: add test_select_from_mutation_fragments.py
test/boost/database_test: add test for mutation_dump/generate_output_schema_from_underlying_schema
test/cql-pytest: add test_select_mutation_fragments.py
test/cql-pytest: move scylla_data_dir fixture to conftest.py
cql3/statements: wire-in mutation_fragments_select_statement
cql3/restrictions/statement_restrictions: fix indentation
cql3/restrictions/statement_restrictions: add check_indexes flag
cql3/statments/select_statement: add mutation_fragments_select_statement
cql3: add SELECT MUTATION FRAGMENTS select statement sub-type
service/pager: allow passing a query functor override
service/storage_proxy: un-embed coordinator_query_options
replica: add mutation_dump
replica: extract query_state into own header
replica/table: add make_nonpopulating_cache_reader()
replica/table: add select_memtables_as_mutation_sources()
tools,mutation: extract the low-level json utilities into mutation/json.hh
tools/json_writer: fold SstableKey() overloads into callers
tools/json_writer: allow writing metadata and value separately
tools/json_writer: split mutation_fragment_json_writer in two classes
tools/json_writer: allow passing custom std::ostream to json_writer
metrics.json was added in d694a42745, and
`configure.py` was updated accordingly. this change mirrors that
update in the CMake build system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14753
Since ec77172b4b ("Merge 'cql3: convert
the SELECT clause evaluation phase to expressions' from Avi Kivity"),
we rewrite non-aggregating selectors to include an aggregation, in order
to have the rest of the code either deal with no aggregation, or
all selectors aggregating, with nothing in between. This is done
by wrapping column selectors with "first" function calls: col ->
first(col).
This broke non-aggregating selectors that included the ttl() or
writetime() pseudo functions. This is because we rewrote them as
writetime(first(col)), and writetime() isn't a function that operates
on any values; it operates on mutations and so must have access to
a column, not an expression.
Fix by detecting this scenario and rewriting the expression as
first(writetime(col)).
Unit and integration tests are added.
Fixes#14715.
Closes#14716
Allows the caller to turn off checking for indexes. Useful if the
restrictions are applied on a pseudo-table, which has no corresponding
table object, and therefore no index manager (or indexes for that
matter).
Not wired in yet. SELECT * FROM MUTATION_FRAGMENTS($table) is a new
select statement sub-type, which allows dumping the underlying mutations
making up the data of a given table. The output of this statement is
mutation-fragments presented as CQL rows. Each row corresponds to a
mutation-fragment. Consequently, the output of this statement has a
schema that is different from that of the underlying table.
Data is always read from the local replica, on which the query is
executed. Migrating queries between coordinators is not allowed.
SELECT * FROM MUTATION_FRAGMENTS($table) is a new select statement
sub-type. More information will be provided in the patch which introduces
it. This patch adds only the Cql.g changes and what is further strictly
necessary.
To allow paging for requests that don't go through storage-proxy
directly. By default, there is no override and the code falls back to
directly invoking storage_proxy::query() as before.
This file contains facilities to dump the underlying mutations contained
in various mutation sources -- like memtable, cache and sstables -- and
return them as query results. This can be used with any table on the
system. The output represents the mutation fragments which make up said
mutations, and it will be generated according to a schema, which is a
transformation of the table's schema.
This file provides a method, which can be used to implement the backend
of a select-statement: it has a similar signature to regular query
methods.
Allowing reading from each individual memtable which contains the given
token, without exposing the memtables themselves to the caller. Exposing
the memtables directly to any code outside of table is undesired because
they are mutable objects.
Soon, we will want to convert mutation fragments into json inside the
scylla codebase, not just in tools. To avoid scylla-core code having to
include tools/ (and link against it), move the low-level json utilities
into mutation/.
The values of cells are potentially very large and thus, when presenting
row content as json in SELECT * FROM MUTATION_FRAGMENTS($table) queries,
we want to separate metadata and cell values into separate columns, so
users can opt out from the potentially big values being included too.
To support this use-case, write(row) and its downstream write methods
get a new `include_value` flag, which defaults to true. When set to
false, cell values will not be included in the json output. At the same
time, new methods are added to convert only cell values of a row to
json.
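The `include_value` split described above can be sketched with a toy row writer. The names and the JSON shape here are invented for illustration (the real writer has its own structure): the flag defaults to true, and turning it off keeps per-cell metadata while omitting the potentially large values.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Toy sketch of a row writer with an include_value flag (invented API):
// metadata is always written; cell values only when include_value is true.
inline std::string write_row(const std::map<std::string, std::string>& cells,
                             bool include_value = true) {
    std::ostringstream out;
    out << "{";
    bool first = true;
    for (const auto& [name, value] : cells) {
        if (!first) { out << ","; }
        first = false;
        out << "\"" << name << "\":{\"is_live\":true";
        if (include_value) {
            out << ",\"value\":\"" << value << "\"";
        }
        out << "}";
    }
    out << "}";
    return out.str();
}
```

In the actual statement, metadata and values land in separate output columns, so a client can simply not select the value column.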
1) mutation_partition_json_writer - containing all the low-level
utilities for converting sub-fragment level mutation components (such
as rows, tombstones, etc.) and their components into json;
2) mutation_fragment_stream_json_writer - containing all the high-level
logic for converting mutation fragment streams to json.
The latter uses the former behind the scenes. The goal is to enable
reuse of the mutation-fragment-to-json conversion, without being forced
to work around differences in how the mutation fragments are represented
in json at the higher level.
this mirrors what we have in the `build.ninja` generated by
`configure.py`. with this change, we can build, for instance,
`dist-tool-tar` from the `build.ninja` generated by CMake.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using execute_process(), let's use add_custom_command()
to generate the SCYLLA-{VERSION,RELEASE,PRODUCT}-FILE, so that we
can let other targets depend on these generated files, and generate
them on demand.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`gossiper::remove_endpoint` performs `on_remove` callbacks for all
endpoint change subscribers. This was done in the background (with a
discarded future) for the following reason:
```
// We can not run on_remove callbacks here because on_remove in
// storage_service might take the gossiper::timer_callback_lock
```
however, `gossiper::timer_callback_lock` no longer exists, it was
removed in 19e8c14.
Today it is safe to perform the `storage_service::on_remove` callback in
the foreground -- it's only taking the token metadata lock, which is
also taken and then released earlier by the same fiber that calls
`remove_endpoint` (i.e. `storage_service::handle_state_normal`).
Furthermore, we want to perform it in the foreground. First, there
already was a comment saying:
```
// do subscribers first so anything in the subscriber that depends on gossiper state won't get confused
```
it's not too precise, but it argues that subscriber callbacks should be
serialized with the rest of `remove_endpoint`, not done concurrently
with it.
Second, we now have a concrete reason to do them in the foreground. In
issue #14646 we observed that the subscriber callbacks are racing with
the bootstrap procedure. Depending on scheduling order, if
`storage_service::on_remove` is called too late, a bootstrapping node
may try to wait for a node that was earlier replaced to become UP, which
is incorrect. By putting the `on_remove` call into the foreground of
`remove_endpoint`, we ensure that a node that was replaced earlier will
not be included in the set of nodes that the bootstrapping node waits
for (because `storage_service::on_remove` will clear it from
`token_metadata` which we use to calculate this set of nodes).
We also get rid of an unnecessary `seastar::async` call.
Fixes#14646
Closes#14741
The feature check in `enable_features_on_startup` loads the list
of features that were enabled previously, goes over every one of them
and checks whether each feature is considered supported and whether
there is a corresponding `gms::feature` object for it (i.e. the feature
is "registered"). The second part of the check is unnecessary
and wrong. A feature can be marked as supported but its `gms::feature`
object not be present anymore: after a feature is supported for long
enough (i.e. we only support upgrades from versions that support the
feature), we can consider such a feature to be deprecated.
When a feature is deprecated, its `gms::feature` object is removed and
the feature is always considered enabled, which allows removing some
legacy code. We still consider this feature to be supported and
advertise it in gossip, for the sake of the old nodes which, even
though they always support the feature, still check whether other
nodes support it.
The problem with the check as it is now is that it disallows moving
features to the disabled list. If one tries to do it, they will find
out that upgrading the node to the new version does not work:
`enable_features_on_startup` will load the feature, notice that it is
not "registered" (there is no `gms::feature` object for it) and fail
to boot.
This commit fixes the problem by modifying `enable_features_on_startup`
not to look at the registered features list at all. In addition to
this, some other small cleanups are performed:
- "LARGE_COLLECTION_DETECTION" is removed from the deprecated features
list. For some reason, it was put there when the feature was being
introduced. It does not break anything because there is
a `gms::feature` object for it, but it's slightly confusing
and therefore is removed.
- The comment in `supported_feature_set` that invites developers to add
features there as they are introduced is removed. It is no longer
necessary to do so because registered features are put there
automatically. Deprecated features should still be put there,
as indicated by another comment.
Fortunately, this issue does not break any upgrades as of now - since
we added enabled cluster feature persisting, no features were
deprecated, and we only add registered features to the persisted feature
list.
An error injection and a regression test are added.
Closes#14701
* github.com:scylladb/scylladb:
topology_custom: add deprecated features test
feature_service: add error injection for deprecated cluster feature
feature_service: move error injection check to helper function
feature_service: handle deprecated features correctly in feature check
Move `merger` to its own header file. Leave the logic of applying
commands to `group0_state_machine`. Remove `group0_state_machine`
dependencies from `merger` to make it an independent module.
Add a test that checks if `group0_state_machine_merger` preserves
timeuuid monotonicity. `last_id()` should be equal to the largest
timeuuid, based on its timestamps.
This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid
order and uuid order. Consequently, the resulting command should have a
more recent timeuuid.
Fixes#14568
Closes#14682
* github.com:scylladb/scylladb:
raft: group0_state_machine_merger: add test for timeuuid ordering
raft: group0_state_machine: extract merger to its own header
Fixes#10447
This issue is expected behavior. However, `abort_requested_exception` is not handled properly.
-- Why this issue appeared
1. The node is drained.
2. `migration_manager::drain` is called and executes `_as.request_abort();`.
3. The coordinator sends read RPCs to the drained replica. On the replica side, `storage_proxy::handle_read` calls `migration_manager::get_schema_for_read`, which is defined like this:
```cpp
future<schema_ptr> migration_manager::get_schema_for_write(/* ... */) {
if (_as.abort_requested()) {
co_return coroutine::exception(std::make_exception_ptr(abort_requested_exception()));
}
/* ... */
```
So, `abort_requested_exception` is thrown.
4. RPC doesn't preserve information about its type, and it is converted to a string containing its error message.
5. It is rethrown as `std::runtime_error` on the coordinator side, and `abstract_resolve_reader::error()` logs information about it. However, we don't want to report `abort_requested_exception` there. This exception should be caught and ignored:
```cpp
void error(/* ... */) {
/* ... */
else if (try_catch<abort_requested_exception>(eptr)) {
// do not report aborts, they are trigerred by shutdown or timeouts
}
/* ... */
```
-- Proposed solution
To fix this issue, we can add `abort_requested_exception` to `replica::exception_variant` and make sure that if it is thrown by `migration_manager::get_schema_for_write`, `storage_proxy::handle_read` correctly encodes it. Thanks to this change, `abstract_read_resolver::error` can correctly handle `abort_requested_exception` thrown on the replica side by not reporting it.
-- Side effect of the proposed solution
If the replica supports this change but the coordinator doesn't, and all nodes support `feature_service::typed_errors_in_read_rpc`, the coordinator will fail to decode `abort_requested_exception`, and it will be decoded as `unknown_exception`. It will still be rethrown as `std::runtime_error`, however the message will change from *abort requested* to *unknown exception*.
-- Another issue
Moreover, `handle_write` reports abort requests too, which floods the logs for the same reason (this time on the replica side). I don't think that is intended, so I've changed it too. This change is in the last commit.
Closes#14681
* github.com:scylladb/scylladb:
service: storage_proxy: do not report abort requests in handle_write
service: storage_proxy: encode abort_requested_exception in handle_read
service: storage_proxy: refactor encode_replica_exception_for_rpc
replica: add abort_requested_exception to exception_variant
This is a translation of Cassandra's CQL unit test source file
BatchTest.java into our cql-pytest framework.
This test file is an old (2014) and small one, with only a few minimal
tests of mostly error paths in batch statements. All tests pass in
both Cassandra and Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14733
test.py with --x-log2-compaction-groups option rotted a little bit.
Some boost tests added later didn't use the correct header which
parses the option or they didn't adjust suite.yaml.
Perhaps it's time to set up a weekly (or bi-weekly) job to verify
there are no regressions with it. It's important as it stresses
the data plane for tablets reusing the existing tests available.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14732
by default, up to 3 temporary directories are kept by pytest.
but we only run a single session for each $TMPDIR. per
our recent observation, it takes a lot more time for jenkins
to scan the tempdir if we use it for scylla's rundir.
so, to alleviate this symptom, we just keep up to one failed
session in the tempdir. if the test passes, the tempdir
created by pytest will be nuked. normally it is located at
scylladb/testlog/${mode}/pytest-of-$(whoami).
see also
https://docs.pytest.org/en/7.3.x/reference/reference.html#confval-tmp_path_retention_policy
Refs #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14735
[xemul: Withdrawing from PR's comments:
object_store is the only test which uses the tmpdir fixture,
starts / stops scylla by itself,
and puts the rundir of scylla in its own tmpdir.
we don't register the step of cleaning up [the temp dir] using the utilities provided by
cql-pytest. we rely on pytest to perform the cleanup, while cql-pytest performs the
cleanup using a global registry.
]
Some minor fixes reported by `mypy`.
Closes#14693
* github.com:scylladb/scylladb:
test/pylib: fix function attribute
test/pylib: check cmd is defined before using it
test/pylib: fix return type hint
test/pylib: remove redundant method
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under the "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files relied on the incorrectly
included header files, those are updated to #include the header
files they directly use.
if a forward declaration suffices, the declaration is added instead.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Closes#14740
* github.com:scylladb/scylladb:
treewide: remove #includes not used directly
size_tiered_backlog_tracker: do not include removed header
Fixes#9405
The `sync_point` API, when provided with an incorrect sync point id, might
allocate a crazy amount of memory and fail with `std::bad_alloc`.
To fix this, we can check if the encoded sync point has been modified
before decoding. We can achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing it with
a checksum calculated in `db::hints::decode` before decoding.
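The encode/verify-before-decode scheme can be sketched as follows. This is an illustrative toy only (the real code uses its own serialization format and hash, and the function names here are invented): a checksum of the payload is appended at encode time and verified cheaply before any decoding work begins.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <utility>

// Toy checksum: XOR of all payload bytes (the real code would use a
// proper hash; this only illustrates the append-and-verify shape).
inline uint8_t checksum(const std::string& payload) {
    return std::accumulate(payload.begin(), payload.end(), uint8_t{0},
                           [](uint8_t acc, char c) {
                               return uint8_t(acc ^ uint8_t(c));
                           });
}

inline std::string encode(const std::string& payload) {
    return payload + char(checksum(payload));    // payload + trailing checksum byte
}

// Verifies the checksum before decoding; returns false on mismatch so a
// corrupted sync point id is rejected instead of driving a huge allocation.
inline bool decode(const std::string& encoded, std::string& payload_out) {
    if (encoded.empty()) {
        return false;
    }
    std::string payload = encoded.substr(0, encoded.size() - 1);
    if (uint8_t(encoded.back()) != checksum(payload)) {
        return false;                            // corrupted: refuse to decode
    }
    payload_out = std::move(payload);
    return true;
}
```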
Closes#14534
* github.com:scylladb/scylladb:
db: hints: add checksum to sync point encoding
db: hints: add the version_size constant
for faster build times and clear inter-module dependencies, we
should not #include headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.
in this change, the source files under the "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files relied on the incorrectly
included header files, those are updated to #include the header
files they directly use.
if a forward declaration suffices, the declaration is added instead.
see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
By making it independent of the number of units the view update
generator's registration semaphore is created with. We want to increase
this number significantly, and that would destabilize this test.
To prevent this, detach the test from the number of units
completely, while still preserving the original intent behind it, as best
as it could be determined.
Closes#14727
in order to identify the problems caused by integer type promotion when comparing unsigned and signed integers, in this series, we
- address the warnings raised by `-Wsign-compare` compiler option
- add `-Wsign-compare` compiler option to the building systems
Closes#14652
* github.com:scylladb/scylladb:
treewide: use unsigned variable to compare with unsigned
treewide: compare signed and unsigned using std::cmp_*()
This series is based on top of the seastar relabel config API.
The series adds a REST API for the configuration, which allows getting and setting it.
The API is registered under the V2 prefix and uses the swagger 2.0 definition.
After this series to get the current relabel-config configuration:
```
curl -X GET --header 'Accept: application/json' 'http://localhost:10000/v2/metrics-config/'
```
A set config example:
```
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '[ \
{ \
"source_labels": [ \
"__name__" \
], \
"action": "replace", \
"target_label": "level", \
"replacement": "1", \
"regex": "io_que.*" \
} \
]' 'http://localhost:10000/v2/metrics-config/'
```
This is how it looks in the UI

Closes#12670
* github.com:scylladb/scylladb:
api: Add the metrics API
api/config: make it optional if the config API is the first to register
api: Add the metrics.json Swagger file
Preparing for V2 API from files
this series addresses the FTBFS of tests with CMake, and also checks for unknown parameters in `add_scylla_test()`
Closes#14650
* github.com:scylladb/scylladb:
build: cmake: build SEASTAR tests as SEASTAR tests
build: cmake: error out if found unknown keywords
build: cmake: link tests against necessary libraries
before this change, we sorted the sstables under the lock, when we
were about to perform the compaction. but the idea of guarding the
getting and registering as a transaction is to prevent other compactions
from mutating the sstables' state and causing inconsistency.
since the state is tracked on a per-sstable basis, and is not related
to the order in which the sstables are processed by a certain compaction
task, we don't need to guard the "sort()" with this mutually exclusive lock.
for better readability, and probably better performance, let's move the
sort out of the lock, and take this opportunity to use
`std::ranges::sort()` for more concise code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14699
sometimes we initialize a loop variable like
auto i = 0;
or
int i = 0;
but since the type of `0` is `int`, what we get is a variable of
`int` type. later we compare it with an unsigned number, and if we
compile the source code with the `-Werror=sign-compare` option, the
compiler warns on seeing this. in general, this is a false
alarm, as we are not likely to get a wrong comparison result
here. but in order to prevent issues due to integer promotion
for comparison in other places, and to prepare for enabling
`-Werror=sign-compare`, let's use unsigned to silence this warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
when comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned type,
so they can be compared. but sometimes this matters: after the
promotion, the comparison yields the wrong result. this can be
demonstrated with a short sample like:
```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    fmt::print("{}\n", x < y);  // prints "false": x is promoted to unsigned
    return 0;
}
```
this error can be identified by `-Werror=sign-compare`, but before
enabling this compiler option, let's use `std::cmp_*()` to compare
them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This patch adds a metrics API implementation.
The API supports getting and setting the metrics relabel config.
Seastar supports metrics relabeling at runtime, following the Prometheus
relabel_config.
Based on metric and label names, a user can add or remove labels,
disable a metric and set the skip_when_empty flag.
The metrics-config API allows such configuration to be done using the
RESTful API.
As it's a new API, it is placed under the V2 path.
After this patch the following API will be available
'http://localhost:10000/v2/metrics-config/' GET/POST.
For example:
To get the current config:
```
curl -X GET --header 'Accept: application/json' 'http://localhost:10000/v2/metrics-config/'
```
To set a config:
```
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '[ \
{ \
"source_labels": [ \
"__name__" \
], \
"action": "replace", \
"target_label": "level", \
"replacement": "1", \
"regex": "io_que.*" \
} \
]' 'http://localhost:10000/v2/metrics-config/'
```
Until now, only the configuration API was part of the V2 API.
Now, when other APIs are added, it is possible that another API would be
the first to register. The first API to register is different in the
sense that it does not have a leading ',' prepended to it.
This patch adds an option to mark the config API as the first to register.
This patch changes the base path of the V2 of the API to be '/'. That
means that the v2 prefix will be part of the path definition.
Currently, it only affects the config API that is created from code.
The motivation for the change is for Swagger definitions that are read
from a file. Currently, when using the swagger-ui with a doc path set
to http://localhost:10000/v2 and reading the Swagger from a file,
swagger-ui will concatenate the path and look for
http://localhost:10000/v2/v2/{path}
Instead, the base path is now '/' and the /v2 prefix will be added by
each endpoint definition.
From the user perspective, there is no change in current functionality.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The sync point API, when provided with an incorrect sync point id, might
allocate a crazy amount of memory and fail with std::bad_alloc.
To fix this, we can check whether the encoded sync point has been modified
before decoding. We achieve this by calculating a checksum before
encoding, appending it to the encoded sync point, and comparing
it with a checksum calculated in db::hints::decode before decoding.
The next commit changes the format of encoding sync points to V2. The
new format appends the checksum to the encoded sync points and its
implementation uses the checksum_size constant - the number of bytes
required to store the checksum. To increase consistency and readability,
we can additionally add and use the version_size constant.
Definitions of sync_point::decode and sync_point::encode are slightly
changed so that they don't depend on the version_size value and make
implementation of the V2 format easier.
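The append-then-verify idea can be sketched as follows (a hypothetical additive checksum stands in for the real checksum function; only the layout mirrors the described V2 format):

```cpp
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <string>

constexpr size_t checksum_size = 8;  // bytes reserved for the checksum

// Hypothetical stand-in for the real checksum function.
uint64_t simple_checksum(const std::string& data) {
    return std::accumulate(data.begin(), data.end(), uint64_t{0},
        [](uint64_t acc, char c) { return acc * 31 + static_cast<unsigned char>(c); });
}

// Append the checksum of the payload after the encoded bytes.
std::string encode_with_checksum(const std::string& payload) {
    uint64_t sum = simple_checksum(payload);
    std::string out = payload;
    for (size_t i = 0; i < checksum_size; ++i) {
        out.push_back(static_cast<char>((sum >> (8 * i)) & 0xff));
    }
    return out;
}

// Verify the stored checksum before decoding, so a corrupt or truncated
// sync point is rejected instead of driving a huge allocation.
std::string decode_with_checksum(const std::string& encoded) {
    if (encoded.size() < checksum_size) {
        throw std::runtime_error("corrupt sync point: too short");
    }
    std::string payload = encoded.substr(0, encoded.size() - checksum_size);
    uint64_t stored = 0;
    for (size_t i = 0; i < checksum_size; ++i) {
        stored |= uint64_t(static_cast<unsigned char>(encoded[payload.size() + i])) << (8 * i);
    }
    if (stored != simple_checksum(payload)) {
        throw std::runtime_error("corrupt sync point: checksum mismatch");
    }
    return payload;
}
```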
Modify sstable_compaction_test.cc so that it does not depend on
how quickly compaction manager stats are updated after compaction
is triggered.
This is required since in the following changes the context may
switch before the stats are updated.
Compaction task executors which inherit from compaction_task_impl
may stay in memory after the compaction is finished.
Thus, the state switch cannot happen in the destructor.
Switch the state to none in a deferred action in perform_task.
This test checks if `group0_state_machine_merger` preserves timeuuid monotonicity.
`last_id()` should be equal to the largest timeuuid, based on its timestamps.
This test combines two commands in the reverse order of their timeuuids.
The timeuuids yield different results when compared in both timeuuid order and
uuid order. Consequently, the resulting command should have a more recent timeuuid.
Closes#14568
Move `merger` to its own header file. Leave the logic of applying commands to
`group0_state_machine`. Remove `group0_state_machine` dependencies from `merger`
to make it an independent module. Add `static` and `const` keywords to its
method signatures. Change it to a `class`. Add documentation.
With this patch, it is easier to write unit tests for the merger.
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.
Closes#14638
Closes#14723
We don't want to report aborts in storage_proxy::handle_write, because they
can only be triggered by shutdowns and timeouts.
Before this change, such reports flooded logs when a drained node still
received the write RPCs.
storage_proxy::handle_read now makes sure that abort_requested_exception
is encoded in a way that preserves its type information. This allows
the coordinator to properly deserialize and handle it.
Before this change, if a drained replica was still receiving the read
RPCs, it would flood the coordinator's logs with std::runtime_error
reports.
To properly handle abort_requested_exception thrown from
migration_manager::get_schema_for_read in storage_proxy::handle_read (which
we do in the next commit), we have to somehow encode and return it. The
encode_replica_exception_for_rpc function is not suitable for that because
it requires the SourceTuple type (of a value returned by do_query()) which
we don't know when calling get_schema_for_read.
We move the part of encode_replica_exception_for_rpc responsible for
handling exceptions to a new function and rewrite it in a way that doesn't
require the SourceTuple type. As this function fits the name
encode_replica_exception_for_rpc better, we name it this way and rename
the previous encode_replica_exception_for_rpc.
With Raft-topology enabled, test_remove_garbage_group0_members has been
flaky when it should always fail. This has been discussed in #14614.
Disabling Raft-topology in the topology suite is problematic because
the initial cluster size is non-zero, so we have nodes that already use
Raft-topology at the beginning of the test. Therefore, we move
test_topology_remove_garbage_group0.py to the topology_custom suite.
Apart from disabling Raft-topology, we have to start 4 servers instead
of 1 because of the different initial cluster sizes.
Closes#14692
Fixes https://github.com/scylladb/scylladb/issues/13783
This commit documents the nodetool checkAndRepairCdcStreams
operation, which was missing from the docs.
The description is added in a new file and referenced from
the nodetool operations index.
Closes#14700
This is the first phase of providing strong exception safety guarantees by the generic `compaction_backlog_tracker::replace_sstables`.
The goal is for all compaction strategies' backlog trackers' replace_sstables implementations to provide strong exception safety guarantees (i.e. they may throw an exception but must revert on error any intermediate changes they made, restoring the tracker to the pre-update state).
Once this series is merged and the ICS replace_sstables is also made strongly exception safe (using the infrastructure from size_tiered_backlog_tracker introduced here), `compaction_backlog_tracker::replace_sstables` may allow exceptions to propagate back to the caller rather than disabling the backlog tracker on errors.
Closes#14104
* github.com:scylladb/scylladb:
leveled_compaction_backlog_tracker: replace_sstables: provide strong exception safety guarantees
time_window_backlog_tracker: replace_sstables: provide strong exception safety guarantees
size_tiered_backlog_tracker: replace_sstables: provide strong exception safety guarantees
size_tiered_backlog_tracker: provide static calculate_sstables_backlog_contribution
size_tiered_backlog_tracker: make log4 helper static
size_tiered_backlog_tracker: define struct sstables_backlog_contribution
size_tiered_backlog_tracker: update_sstables: update total_bytes only if set changed
compaction_backlog_tracker: replace_sstables: pass old and new sstables vectors by ref
compaction_backlog_tracker: replace_sstables: add FIXME comments about strong exception safety
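The usual way to get the strong guarantee is to do all potentially throwing work on side copies and commit the result with non-throwing operations only; a sketch with hypothetical names (for brevity, an sstable is represented by its size in bytes):

```cpp
#include <set>
#include <utility>

// Stand-in for a backlog tracker: replace_sstables computes the new state
// on local copies (which may throw), then publishes it with noexcept moves,
// so a failure leaves the tracker in its pre-update state.
struct backlog_tracker {
    std::set<long> sstables;  // for brevity, an sstable is just its size
    long total_bytes = 0;

    void replace_sstables(const std::set<long>& old_ssts,
                          const std::set<long>& new_ssts) {
        std::set<long> updated = sstables;  // may throw; tracker untouched
        long updated_bytes = total_bytes;
        for (long s : old_ssts) { updated.erase(s); updated_bytes -= s; }
        for (long s : new_ssts) { updated.insert(s); updated_bytes += s; }
        sstables = std::move(updated);      // noexcept commit
        total_bytes = updated_bytes;
    }
};
```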
add fmt formatter for `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput`.
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput` without the help of `operator<<`.
please note, despite it being more popular to use the IEC prefixes
when presenting the size of storage, i.e., MiB for 1024**2 bytes instead
of MB for 1000**2 bytes, we are still using the SI prefixes as
the default, in order to preserve the existing behavior.
the operator<< for these types are removed.
the tests are updated accordingly.
Refs #13245
Closes#14719
* github.com:scylladb/scylladb:
utils: drop operator<< for pretty printers
utils: add fmt formatter for pretty printers
these comments or docstrings are not in sync with the code they
are supposed to explain, so let's update them accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14545
in this header, none of the exceptions defined by
`exceptions/exceptions.hh` is used. so let's drop the `#include`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14718
`db/query_context.hh` contains the declaration of class
`db::query_context`. but `replica/table.cc` does not use or need
`db::query_context`.
so, in this change, the `#include` is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14717
When running the compaction task test on the same Scylla instance
where other tests are run, some compaction tasks from other test cases may
be left in the task manager. If they stay in memory long enough, they may
get unregistered during the compaction task test and cause a bad_request
status.
Drain old compaction tasks before and after each test.
Fixes: #14584.
Closes#14585
since all callers of these operators have switched to fmt formatters.
let's drop them. the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
add fmt formatter for `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput`.
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print `utils::pretty_printed_data_size` and
`utils::pretty_printed_throughput` without the help of `operator<<`.
please note, despite it being more popular to use the IEC prefixes
when presenting the size of storage, i.e., MiB for 1024**2 bytes instead
of MB for 1000**2 bytes, we are still using the SI prefixes as
the default, in order to preserve the existing behavior.
also, we use the singular form of "byte" when formatting "1". this is
more correct.
the tests are updated accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Split long-running database mutation tests.
At a trade-off with verbosity, split these sub-tests for the long running tests `database_with_data_in_sstables_is_a_mutation_source_`*.
Refs #13905
Closes#14455
* github.com:scylladb/scylladb:
test/lib/mutation_source_test: bump ttl
test/boost/memtable_test: split memtable sub-tests
test/boost/database_test: split mutation sub-tests
it turns out we are creating two tables with the same name in
sstable_expired_data_ratio, and when creating the second table,
we don't destroy the first one.
this does not happen in the real world, so we could tolerate it
in a test. but it matters if we're going to have a system-wide per-table
registry which uses the name of a table as the table's identifier in the
registry; for instance, the metrics names for the tables would conflict.
so, in this series, we use different names for the tables under
test. they can share the same set of sstables though. this fulfills
the needs of the test in question. also, we rename some variables
for better readability in this series.
Fixes https://github.com/scylladb/scylladb/issues/14657
Closes#14665
* github.com:scylladb/scylladb:
test: rename variables with better names
test: use different table names in sstable_expired_data_ratio
test: explicitly capture variables
before this change, if the size being formatted was greater than a petabyte,
`exp` would be 6. we would still use it as the index to find the suffix
in `suffixes`, but the array's size is 6, so we would be referencing
random bits after "PB" for the suffix of the formatted size.
in this change:
* loop over the suffixes for better readability, and to avoid
the off-by-one error.
* add tests for both pretty printers
Branches: 5.1,5.2,5.3
Fixes#14702
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14713
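The loop-based approach can be sketched like this (a hypothetical helper, not the actual Scylla code): iterating over the suffixes clamps at the largest unit, so there is no index to run off the end of the array.

```cpp
#include <array>
#include <cstdio>
#include <string>

// Loop over the suffixes instead of computing an index from a logarithm:
// once "PB" is reached, the loop stops, so sizes beyond a petabyte are
// printed as e.g. "1000PB" rather than reading past the array.
std::string pretty_size(double size) {
    static constexpr std::array<const char*, 6> suffixes = {
        "B", "kB", "MB", "GB", "TB", "PB"};
    size_t i = 0;
    while (size >= 1000.0 && i + 1 < suffixes.size()) {
        size /= 1000.0;
        ++i;
    }
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%.0f%s", size, suffixes[i]);
    return buf;
}
```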
chained comparison is not supported by C++, and does not yield the expected
result: "0 <= d" evaluates to a bool (0 or 1), which is always less than
"magic", so the whole expression is always true. let's avoid using it.
```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2908:23: error: result of comparison of constant 54313 with expression of type 'bool' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
2908 | assert(0 <= d < magic);
| ~~~~~~ ^ ~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14695
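The fix is to spell out both comparisons explicitly (hypothetical helper for illustration):

```cpp
// `0 <= d < magic` parses as `(0 <= d) < magic`: the bool (0 or 1) is then
// compared with magic, so the assertion can never fire. Write both
// comparisons explicitly instead.
bool in_range(int d, int magic) {
    return 0 <= d && d < magic;
}
```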
prepare_expression() already validates the types and computes
the index of the field; no need to redo that work when
evaluating the expression.
The tests are adjusted to also prepare the expression.
Closes#14562
Use a large ttl (2h+) to avoid deletions in database_test.
An actual fix would be to make database_test not ignore query_time,
but this is much harder.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Split long-running memtable tests.
At a trade-off with verbosity, split these sub-tests for the long
running tests
test_memtable_with_many_versions_conforms_to_mutation_source*.
Refs #13905
Split long-running database mutation tests.
At a trade-off with verbosity, split these sub-tests for the long
running tests database_with_data_in_sstables_is_a_mutation_source_*.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
we first use `cf` and then `lcs_table` later on in
`sstable_expired_data_ratio` to represent "tables_for_tests"
with schemas of different compaction strategies.
to improve readability, we rename the variables which are
related to STCS (Size-Tiered Compaction Strategy) to "stcs_*",
to better reflect their roles.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
it turns out we are creating two tables with the same name in
sstable_expired_data_ratio, and when creating the second table,
we don't destroy the first one.
this does not happen in the real world, so we could tolerate it
in a test. but it matters if we're going to have a system-wide per-table
registry which uses the name of a table as the table's identifier in the
registry; for instance, the metrics names for the tables would conflict.
to avoid creating multiple tables with the same ${ks}.${cf},
after this change, we use different names for the tables under
test; they can share the same set of sstables though. this
fulfills the needs of the test in question, and the needs of
having per-table metrics with table id as their identifiers.
Fixes#14657
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is no obvious default expression, so better not to allow
default construction of expressions to prevent unintended values
from leaking in. Resolves a FIXME.
Use uninitialized<expression> for that. Since it's heavily used,
alias it as "uexpression".
To prevent uninitialized<> from leaking into the rest of the
system, change do_with_parser() to unwrap it. We add an
unwrap_uninitialized_t template type alias for that.
Lots of std::move()s are sprinkled around to make things compile,
as uninitialized<T> refuses to convert to T without them.
Fixes https://github.com/scylladb/scylladb/issues/14490
This commit fixes multiple links that were broken
after the documentation was published (but not in
the preview) due to incorrect syntax.
I've fixed the syntax to use the :docs: and :ref:
directive for pages and sections, respectively.
Closes#14664
uninitialized<> is used to work around the parser generator's propensity
to default-construct return values by supplying a default constructor
to otherwise non-default-constructible types. Make it easier to initialize
it not only from the wrapped type, but also from types convertible to
the wrapped type.
This is useful to initialize an uninitialized<expression> from an
expression element (say a binary_operator), without an explicit
conversion.
The grammar generator relies on everything having a default
constructor, and to accommodate it we have an uninitialized<>
template that fakes a default constructor where one doesn't
exist. For convenience we have implicit conversion operators
from uninitialized<T> to T. Currently, we have them for both
rvalue-reference and normal reference wrappers.
It turns out that C++ isn't clever enough to deal with both
of them when templates are involved. When it needs a T but
has an uninitialized_wrapper<T>&&, it sees both conversion
operators and can't pick one.
Aid it by removing the non-rvalue conversion operator. The
rvalue conversion operator is more efficient, and is all that
is needed, since we don't use values more than once in the grammar.
Sprinkle std::move()s on the rest of the grammar to keep it
compiling. In a few places the odd "$production" syntax
is changed to the more common "var=production ... { var }".
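The mechanism can be sketched like this (a simplified stand-in for the real wrapper, not Scylla's actual implementation):

```cpp
#include <optional>
#include <string>
#include <utility>

// Simplified sketch of the idea: fake a default constructor for a
// non-default-constructible T, and unwrap only via a single
// rvalue-qualified conversion operator, so overload resolution in
// templated contexts never sees two candidates.
template <typename T>
class uninitialized {
    std::optional<T> _value;
public:
    uninitialized() = default;                        // what the grammar needs
    uninitialized(T v) : _value(std::move(v)) {}
    operator T&&() && { return std::move(*_value); }  // rvalue-only unwrap
};

// Stand-in for a parser rule's return type without a default constructor.
struct expression {
    std::string text;
    explicit expression(std::string t) : text(std::move(t)) {}
};
```

Because the conversion operator is `&&`-qualified, unwrapping requires a `std::move()` at the call site, which matches the std::move()s sprinkled over the grammar.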
Provide a way to fetch all pages for `run_async`.
While there, move the code to a common helper module.
Fixes https://github.com/scylladb/scylladb/issues/14451
Closes#14688
* github.com:scylladb/scylladb:
test/pylib: handle paged results for async queries
test/pylib: move async query wrapper to common module
This is the last step of deprecation dance of DTCS.
In Scylla 5.1, users were warned that DTCS was deprecated.
In 5.2, altering or creation of tables with DTCS was forbidden.
5.3 branch was already created, so this is targeting 5.4.
Users that refused to move away from DTCS will have Scylla
falling back to the default strategy, either STCS or ICS.
See:
WARN 2023-07-14 09:49:11,857 [shard 0] schema_tables - Falling back to size-tiered compaction strategy after the problem: Unable to find compaction strategy class 'DateTieredCompactionStrategy'
Then user can later switch to a supported strategy with
alter table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#14559
The index match code has some default-initialized expressions. These won't
compile when we remove expression's default constructor, so replace them
with the current default value, an empty conjunction.
An empty conjunction doesn't make any special sense here; the code
should be refactored not to rely on this random initial value. But this
is delicate code and the refactoring shouldn't be done in the middle of
an unrelated series.
_partition_key_restrictions, _clustering_columns_restrictions, and
_nonprimary_key_restrictions are currently default-initialized. As
we're about to remove expression's default constructor, we need
to initialize them with something.
Use conjunction({}). Not only is this what the default constructor does,
that's what those fields' manipulators assume - they adjust field x
using make_conjunction(y, x). This dates to expression's roots as
a replacement for restrictions.
We have some gnarly code that classifies restrictions by the column
they restrict. This uses std::unordered_map::operator[], which uses
the value's default constructor. This happens to be "expression", and
as we're about to remove the default constructor, this won't do.
Fix by using try_emplace(), which makes the code nicer and more
efficient. It could be further improved, but it's better to demolish it
instead.
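The difference can be sketched as follows (hypothetical names; the real mapped type is cql3's expression):

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for cql3's expression once its default constructor is removed.
struct expr {
    std::string text;
    explicit expr(std::string t) : text(std::move(t)) {}  // no default ctor
};

std::unordered_map<std::string, expr> by_column;

// operator[] would not compile here: it default-constructs the mapped value.
// try_emplace constructs it in place only when the key is absent.
void classify(const std::string& column, const std::string& restriction) {
    auto [it, inserted] = by_column.try_emplace(column, restriction);
    if (!inserted) {
        it->second.text += " AND " + restriction;
    }
}
```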
We're about to remove expression's default constructor, so adjust
the usertype_constructor code that checks whether a field has an
initializer or whether we must supply a NULL to not rely on it.
we intend to print the error message, but failed to pass it to the
formatter. if we actually ran into this case, fmtlib would throw.
so in this change, we also print the error when
announcing a schema change fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14623
A broadcast_table modification query consists of the key, the new value,
and the condition. When preparing it, we construct the query with
a default new_value expression, and pass it to
operation::prepare_for_broadcast_tables() to fill .new_value.
Since we're removing expression's default constructor, this won't work.
So instead, the (renamed)
operation::prepare_new_value_for_broadcast_tables() returns the new value,
and we use the return value to fill the query.
Adds an error injection which allows enabling the TEST_ONLY_FEATURE as
a deprecated feature, i.e. it is assumed to be always enabled, but still
considered to be supported by the node and advertised in gossip.
Also, extract the "features_enable_test_feature" literal into a string
constant. This should slightly improve readability and make it more
consistent with the next commit.
The feature check in `enable_features_on_startup` loads the list
of features that were enabled previously, goes over every one of them
and checks whether each feature is considered supported and whether
there is a corresponding `gms::feature` object for it (i.e. the feature
is "registered"). The second part of the check is unnecessary
and wrong. A feature can be marked as supported but its `gms::feature`
object not be present anymore: after a feature is supported for long
enough (i.e. we only support upgrades from versions that support the
feature), we can consider such a feature to be deprecated.
When a feature is deprecated, its `gms::feature` object is removed and
the feature is always considered enabled, which allows removing some
legacy code. We still consider this feature to be supported and
advertise it in gossip, for the sake of the old nodes which, even
though they always support the feature, still check whether other
nodes support it.
The problem with the check as it is now is that it disallows moving
features to the disabled list. If one tries to do it, they will find
out that upgrading the node to the new version does not work:
`enable_features_on_startup` will load the feature, notice that it is
not "registered" (there is no `gms::feature` object for it) and fail
to boot.
This commit fixes the problem by modifying `enable_features_on_startup`
not to look at the registered features list at all. In addition to
this, some other small cleanups are performed:
- "LARGE_COLLECTION_DETECTION" is removed from the deprecated features
list. For some reason, it was put there when the feature was being
introduced. It does not break anything because there is
a `gms::feature` object for it, but it's slightly confusing
and therefore is removed.
- The comment in `supported_feature_set` that invites developers to add
features there as they are introduced is removed. It is no longer
necessary to do so because registered features are put there
automatically. Deprecated features should still be put there,
as indicated by another comment.
Fortunately, this issue does not break any upgrades as of now - since
we added enabled cluster feature persisting, no features were
deprecated, and we only add registered features to the persisted feature
list.
This option allows the user to change the number of ranges to stream in
a batch per stream plan.
Currently, each stream plan streams 10% of the total ranges.
With more ranges per stream plan, the waiting time between
two stream plans is reduced. For example:
stream_plan1: shard0 (t0), shard1 (t1)
stream_plan2: shard0 (t2), shard1 (t3)
We start stream_plan2 after all shards finish streaming in stream_plan1.
If shard0 and shard1 in stream_plan1 finish at different times, one of
the shards will be idle.
If we stream more ranges in a single stream plan, the waiting time will
be reduced.
Previously, we retried the stream plan if one of the stream plans
failed. That's one of the reasons we wanted more stream plans. With RBNO
and 1f8b529e08 (range_streamer: Disable restream logic), the
restream factor is not important anymore.
Also, more ranges in a single stream plan will create bigger but fewer
sstables on the receiver side.
The default value is the same as before: 10% of the total ranges.
Fixes#14191
Closes#14402
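The arithmetic can be sketched with a hypothetical helper (the real option name and plumbing differ):

```cpp
#include <cstddef>

// With a batch fraction f (the previous hard-coded behavior is f = 0.1,
// i.e. 10% of the total ranges per stream plan), a larger fraction means
// fewer, bigger stream plans and less idle time between them.
size_t ranges_per_stream_plan(size_t total_ranges, double batch_fraction) {
    auto n = static_cast<size_t>(static_cast<double>(total_ranges) * batch_fraction);
    return n > 0 ? n : 1;  // always make progress
}
```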
This is a followup to 1545ae2d3b.
A new reader is introduced that automatically closes the underlying sstable reader once it's exhausted after a fast forward call,
allowing us to revert 1fefe597e6, which was fragile.
Closes#14669
* github.com:scylladb/scylladb:
Revert "sstables: Close SSTable reader if index exhaustion is detected in fast forward call"
sstables: Automatically close exhausted SSTable readers in cleanup
If migration_manager::get_schema_for_write is called after
migration_manager::drain, it throws abort_requested_exception.
This exception is not present in replica::exception_variant, which
means that RPC doesn't preserve information about its type. If it is
thrown on the replica side, it is deserialized as std::runtime_error
on the coordinator. Therefore, abstract_read_resolver::error logs
information about this exception, even though we don't want it (aborts
are triggered on shutdown and timeouts).
To solve this issue, we add abort_requested_exception to
replica::exception_variant and, in the next commits, refactor
storage_proxy::handle_read so that abort_requested_exception thrown in
migration_manager::get_schema_for_write is properly serialized. Thanks
to this change, unchanged abstract_read_resolver::error correctly
handles abort_requested_exception thrown on the replica side by not
reporting it.
Provide a flag to fetch all pages for run_async().
Add a simple test to random tables. Runs within 6 seconds in debug mode.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This reverts commit d3034e0fab.
The test modified by this commit
(view_build_test.test_view_update_generator_register_semaphore_unit_leak)
often fails, breaking build jobs.
sharded<service>::invoke_on_all() has the ability to call a method by
pointer with automagical unwrapping of sharded references. This makes
the code shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14684
Fixes https://github.com/scylladb/scylladb/issues/14598
This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)
This commit must be backported to branch-5.3, because
the feature is already on that branch.
Closes#14686
also, remove unnecessary forward declarations.
* compaction_manager_test_task_executor is only referenced
in the friend declaration, but this declaration does not need
a forward declaration of the friend class
* compaction_manager_test_task_executor is not used anywhere.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14680
Add `parameters` map to `injection_shared_data`. Now tests can attach
string data to injections that can be read in injected code via
`injection_handler`.
Closes#14521
Closes#14608
* github.com:scylladb/scylladb:
tests: add a `parameters` argument to code that enables injections
api/error_injection: add passing injection's parameters to enable endpoint
tests: utils: error injection: add test for injection's parameters
utils: error injection: add a string-to-string map of injection's parameters
utils: error injection: rename received_messages_counter to injection_shared_data
before this change, we would have reports in Jenkins like:
```
[Info] - 1 out of 3 times failed: failed.
== [File] - test/boost/commitlog_test.cc
== [Line] - 298
[Info] - passed: release=1, dev=1
== [File] - test/boost/commitlog_test.cc
== [Line] - 298
[Info] - failed: debug=1
== [File] - test/boost/commitlog_test.cc
== [Line] - 298
```
the first section is rendered from an `Info` tag,
created by `test.py`. but the trailing "failed" does not
help in this context, as we already understand it's failing.
so, in this change, it is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14546
Add missing validation of the AttributeDefinitions parameter of the
CreateTable operation in Alternator. This validation isn't needed
for correctness or safety - the invalid entries would have been
ignored anyway. But this patch is useful for user-experience - the
user should be notified when the request is malformed instead of
ignoring the error.
The fix itself is simple (a new validate_attribute_definitions()
function, calling it in the right place), but much of the contents
of this patch is a fairly large set of tests covering all the
interesting cases of how AttributeDefinitions can be broken.
Particularly interesting is the case where the same AttributeName
appears more than once, e.g., attempting to give two different types
to the same key attribute - which is not allowed.
One of the new tests remains xfail even after this patch - it checks
the case that a user attempts to add a GSI to an existing table where
another GSI defined the key's type differently. This test can't
succeed until we allow adding GSIs to existing tables (Refs #11567).
Fixes#13870.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14556
The refactoring in c48dcf607a dropped
the noexcept around listener notification. This is probably
unintentional, as the comment which explains why we need to abort was
preserved.
Closes #14573
we print the stream id in the logging messages, but in this case
we forgot to pass `stream` to `log::debug()` even though the
placeholder for `stream` was added. if the underlying fmtlib actually
formatted the arguments with this format string, it would throw.
fortunately, we don't enable debug level logging often, which is
probably why we haven't spotted this issue yet.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14620
the callers of the constructor do not move a variable into this
parameter, and the constructor itself cannot consume it,
as the parameter is a vector while `compaction_sstable_registration`
uses an `unordered_set` for tracking the sstables being compacted.
so, to avoid creating a temporary copy of the vector, let's just
pass it by reference.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14661
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decommission command
failed with the error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got empty nodes list for single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.
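The bug class can be reduced to a few lines (an illustrative toy, not the actual topology code): in C++ a reference cannot be re-seated, so reusing a reference like `eps` as if it were a fresh binding assigns *through* it and clobbers the `_dc_endpoints` entry it was originally bound to:

```cpp
#include <map>
#include <set>
#include <string>

// Toy reduction of the bug: `eps` was bound to a _dc_endpoints entry, then
// reused to manipulate the racks dictionary. The assignment goes through
// the reference and empties dc_endpoints["dc1"].
std::set<int> buggy_unindex() {
    std::map<std::string, std::set<int>> dc_endpoints{{"dc1", {1, 2, 3, 4}}};
    std::map<std::string, std::set<int>> racks{{"rack2", {}}};
    auto& eps = dc_endpoints["dc1"];
    eps = racks["rack2"];            // BUG: clobbers dc_endpoints["dc1"]
    return dc_endpoints["dc1"];
}
```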
With this fix the test passes both with
--consistent-cluster-management option and
without it.
A specific unit test for this problem was added.
Fixes: #14184
Closes #14673
For now, `received_messages_counter` has only data for messaging the injection.
In the future, there will be more data to keep, for example, a string-to-string
map of the injection's parameters.
Rename this class and its attributes.
Consider
- 10 repair instances take all the 10 _streaming_concurrency_sem
- repair readers are done but the permits are not released since they
are waiting for view update _registration_sem
- view updates trying to take the _streaming_concurrency_sem to make
progress of view update so it could release _registration_sem, but it
could not take _streaming_concurrency_sem since the 10 repair
instances have taken them
- deadlock happens
Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.
To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.
Fixes #14676
Closes #14677
This patch adds to docs/alternator/compatibility.md mentions of three
recently-added DynamoDB features (ReturnValuesOnConditionCheckFailure,
DeletionProtectionEnabled and TableClass) which Alternator does not yet
support.
Each of these mentions also links to the github issue we have on each
feature - issues #14481, #14482 and #10431 respectively.
During a review of this patch, the reviewers didn't like that I used
words like "recent" and "new" to describe recently-added DynamoDB
features, and asked that I use specific dates instead. So this is what
I do in this patch for the new features - and I also went back and
fixed a few pre-existing references to "recent" and "new" features,
and added the dates.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14483
In 10c1f1dc80 I fixed
`make_group0_history_state_id_mutation` to use correct timestamp
resolution (microseconds instead of milliseconds) which was supposed to
fix the flakiness of `test_group0_history_clearing_old_entries`.
Unfortunately, the test is still flaky, although now it's failing at a
later step -- this is because I was sloppy and I didn't adjust this
second part of the test to also use microsecond resolution. The test is
counting the number of entries in the `system.group0_history` table that
are older than a certain timestamp, but it's doing the counting using
millisecond resolution, causing it to give results that are off by one
sometimes.
Fix it by using microseconds everywhere.
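The off-by-one can be reproduced with a toy count (illustrative numbers and a hypothetical helper, not the test's actual timestamps): an entry that is older than the cutoff at microsecond resolution may no longer compare as older after both sides are truncated to milliseconds:

```cpp
#include <cstdint>
#include <vector>

// Count entries strictly older than a cutoff, comparing either at full
// microsecond resolution or after truncating both sides to milliseconds.
int count_older(const std::vector<int64_t>& entries_us, int64_t cutoff_us,
                bool truncate_to_ms) {
    int n = 0;
    for (int64_t e : entries_us) {
        int64_t lhs = truncate_to_ms ? e / 1000 : e;
        int64_t rhs = truncate_to_ms ? cutoff_us / 1000 : cutoff_us;
        if (lhs < rhs) {
            ++n;
        }
    }
    return n;
}
```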
Fixes #14653
Closes #14670
instead of accessing the `feature_service`'s member variable, use
the accessor provided by sstable_manager, so we always access
this setting via a single channel. this should help with
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14658
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.
The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.
At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
Closes #14663
* github.com:scylladb/scylladb:
test: test_node_isolation: use `ManagerClient.con_gen` to create CQL connection
test: manager_client: make `con_gen` for `ManagerClient.__init__` nonoptional
Add a reader that will automatically close the underlying sstable
reader if fast forward is called with a range past the range
spanned by the SSTable. This is only to be used in the context
of fast forward calls in cleanup, as combined reader in full
scans can proactively close the readers that returned EOS.
Regular reads that go through the cache enable fast forwarding to
position range, therefore they won't enable the auto-closing reader.
Compactions don't enable any kind of forward, and they won't
have it enabled either.
The overhead is minimal, with cleanup being able to reach the
same 38MB/s as before this patch.
Refs #12998.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This PR fixes or removes broken links reported by an online link checker.
Fixes https://github.com/scylladb/scylladb/issues/14488
Closes #14462
* github.com:scylladb/scylladb:
doc: update the link to ABRT
doc: fix broken links on the Scylla SStable page
When repair writes a sstable to disk, we check if the sstable needs view
update processing. If yes, the sstable will be placed into the staging
dir for processing, with the _registration_sem semaphore to prevent too
many pending unprocessed sstables.
We have seen multiple cases in the field where view update processing is
inefficient and way too slow, which blocks the base table repair from
finishing on time.
This patch increases the registration_queue_size to a bigger number to
mitigate the problem that slow view update processing blocks repair.
It is better to have a consistent base table + inconsistent view table
than inconsistent base table + inconsistent view table.
Currently, sstables in the staging dir are not compacted, so we cannot
increase _registration_sem to too big a number, to avoid accumulating
too many sstables.
The view_build_test.cc is updated to make the test pass.
Closes #14241
In d2a4079bbe, `merger` was modified so that when we merge a command, `last_group0_state_id` is taken to be the maximum of the merged command's state_id and the current `last_group0_state_id`. This is necessary for achieving the same behavior as if the commands were applied individually instead of being merged -- where we take the maximum state ID from `group0_history` table which was applied until now (because the table is sorted using the state IDs and we take the greatest row).
However, a subtle bug was introduced -- the `std::max` function uses the `utils::UUID` standard comparison operator which is unfortunately not the same as timeuuid comparison that Scylla performs when sorting the `group0_history` table. So in rare cases it could return the *smaller* of the two timeuuids w.r.t. the correct timeuuid ordering. This would then lead to commands being applied which should have been turned to no-ops due to the `prev_state_id` check -- and then, for example, permanent schema desync or worse.
Fix it by using the correct comparison method.
Fixes: #14600
Closes #14616
* github.com:scylladb/scylladb:
utils/UUID: reference `timeuuid_tri_compare` in `UUID::operator<=>` comment
group0_state_machine: use correct comparison for timeuuids in `merger`
utils/UUID: introduce `timeuuid_tri_compare` for `const UUID&`
utils/UUID: introduce `timeuuid_tri_compare` for `const int8_t*`
The definitions of virtual tables make up approximately a quarter of the
huge system_keyspace.cc file (almost 4K lines), pulling in a lot of
headers only used by them.
Move them to a separate source file to make system_keyspace.cc easier
for humans and compilers to digest.
This patch also moves the `register_virtual_tables()`,
`install_virtual_readers()` as well as the `virtual_tables` global.
Closes #14308
The test isolates a node and then connects to it through CQL.
The `connect()` step would often timeout on ARM debug builds. This was
already dealt with in the past in the context of other tests: #11289.
The `ManagerClient.con_gen` function creates a connection in a way that
avoids the problem -- connection timeout settings are adjusted to
account for the slowness. Use it in this test to fix the flakiness.
At the same time, reduce the timeout used for the actual CQL request
(after the driver has already connected), because the test expects this
request to timeout and waiting for 200 seconds here is just a waste of
time.
because `lw_shared_ptr::operator=(T&&)` was deprecated, we started to
get the following warning:
```
/home/kefu/dev/scylladb/test/boost/statement_restrictions_test.cc:394:41: warning: 'operator=' is deprecated: call make_lw_shared<> and assign the result instead [-Wdeprecated-declarations]
394 | definition.column_specification = std::move(specification);
| ^
/home/kefu/dev/scylladb/seastar/include/seastar/core/shared_ptr.hh:346:7: note: 'operator=' has been explicitly marked deprecated here
346 | [[deprecated("call make_lw_shared<> and assign the result instead")]]
| ^
1 warning generated.
```
so, in this change, we use the recommended way to update a lw_shared_ptr.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14648
`ManagerClient` is given a function that is used to create CQL
connections to the Scylla cluster. For some reason it was typed as
`Optional` even though it was never passed `None`. Fix it.
before this change, the format string contains two placeholders,
but only one extra argument is passed in. if we actually format
this logging message, fmtlib would throw.
after this change, we pass the exception's error message as yet
another argument.
this logging message is printed at the "trace" level, which is probably
why we haven't seen fmtlib throw.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14628
The DynamoDB documentation for the size() function claims that it only
works on paths (attribute names or references), but it actually works on
constants from the query (e.g., ":val") as well.
It turns out that Alternator supports this undocumented case already, but
gets the error path wrong: usually, when size() is calculated on the data,
if the data has the wrong type for size() (e.g., an integer), the condition
simply doesn't match. But if the value comes from the query - it should
generate an error that the query is wrong - ValidationException.
This patch fixes this case, and also adds tests for it that pass on both
DynamoDB and Alternator (after this patch).
Fixes #14592
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14593
The -k|--keyspace argument to the tables command is supposed to filter
tables belonging to a specific keyspace, but it doesn't work. Fix it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14634
This is a translation of Cassandra's CQL unit test source file
functions/CastFctsTest.java into our cql-pytest framework.
There are 13 tests, 9 of them currently xfail.
The failures are caused by one recently-discovered issue:
Refs #14501: Cannot Cast Counter To Double
and by three previously unknown or undocumented issues:
Refs #14508: SELECT CAST column names should match Cassandra's
Refs #14518: CAST from timestamp to string not same as Cassandra on zero
milliseconds
Refs #14522: Support CAST function not only in SELECT
Curiously, the careful translation of this test also caused me to
find a bug in Cassandra https://issues.apache.org/jira/browse/CASSANDRA-18647
which the test in Java missed because it made the same mistake as the
implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14528
The Alternator test test_ttl.py::test_ttl_expiration_gsi_lsi was flaky.
The test incorrectly assumes that when we write an already expired item,
it will be visible for a short time until being deleted by the TTL thread.
But this doesn't need to be true - if the test is slow enough, it may go
look for the item after it has already expired!
So we fix this test by splitting it into two parts - in the first part
we write a non-expiring item, and notice it eventually appears in the
GSI, LSI, and base-table. Then we write the same item again, with an
expiration time - and now it should eventually disappear from the GSI,
LSI and base-table.
This patch also fixes a small bug which prevented this test from running
on DynamoDB.
Fixes #14495
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14496
Due to the wrong stopping order of compaction services, shutdown needs
to wait until all compactions are complete, which may take really long.
Moreover, the test version of the compaction manager does not abort the
task manager, which is strictly bound to it, but stops its compaction
module. This results in tests waiting for the compaction task manager's
tasks to be unregistered, which never happens.
Stopping and aborting of the compaction manager and the task manager's
compaction module are now performed in the proper order.
Closes #14461
* github.com:scylladb/scylladb:
tasks: test: abort task manager when wrapped_compaction_manager is destructed
compaction: swap compaction manager stopping order
compaction: modify compaction_manager::stop()
The scylla netw command prints only the clients at index [0], but there
are more of them in the messaging service. Print them all
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14633
The command is to print interesting and/or hard-to-get-by-hand info about individual tables
Closes #14635
* github.com:scylladb/scylladb:
test: Add 'scylla table' cmd test
scylla-gdb: Print table phased barriers
scylla-gdb: Add 'table' command
both tagged_integer_test and tablets_test are driven by
"scylla_test_case", and they use seastar threads. let's make them
"SEASTAR" tests, so they can link against the libraries they use.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Performance tests such as `perf-fast-forward` are executed in our CI
environments in two steps (two invocations of the `scylla` process):
first by populating data directories (with `--populate` option), then by
running the actual test.
These tests are using `cql_test_env`, which did not load the previously
saved (in the populate step) Host ID of this node, but generated a new
one randomly instead.
In b39ca97919 we enabled
`consistent_cluster_management` by default. This caused the perf tests
to hang in `setup_group0` at `read_barrier` step. That's because Raft
group 0 was initialized with old configuration -- the one created during
the populate step -- but the Raft server was started with a newly
generated Host ID (which is used as the server's Raft ID), so the server
considered itself as being outside the configuration.
Fix this by reloading the Host ID from disk, simulating more closely the
behavior of main.cc initialization.
Fixes #14599
Closes #14640
Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in its owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317.
To solve both problems, cleanup will now use a multi range reader to guarantee that it only processes the owned data and, as a result, skips unowned data. This results in cleanup scanning an owned range and then fast forwarding to the next one, until it's done with them all. This significantly reduces the amount of data in the index cache, as the index will only be invoked at each range boundary.
Without further ado,
before:
`INFO 2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.`
after:
`INFO 2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.`
Fixes #12998.
Fixes #14317.
Closes #14469
* github.com:scylladb/scylladb:
test: Extend cleanup correctness test to cover more cases
compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
sstables: Close SSTable reader if index exhaustion is detected in fast forward call
sstables: Simplify sstable reader initialization
compaction: Extend make_sstable_reader() interface to work with mutation_source
test: Extend sstable partition skipping test to cover fast forward using token
Today, SSTable cleanup skips to the next partition, one at a time, when it finds
that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing
nodes lose multiple sequential tokens in its owned ranges. Another inefficiency
comes from fetching index pages spanning all unowned tokens, which was described
in #14317.
To solve both problems, cleanup will now use a multi range reader to guarantee
that it only processes the owned data and, as a result, skips unowned data.
This results in cleanup scanning an owned range and then fast forwarding to the
next one, until it's done with them all. This significantly reduces the amount
of data in the index cache, as the index will only be invoked at each range
boundary.
Without further ado,
before:
... 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.
after:
... 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.
Fixes #12998.
Fixes #14317.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
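The difference can be modeled with a toy (hypothetical helpers; tokens as ints, "index seeks" as a rough stand-in for index cache work, not the real reader code): per-partition skipping inspects every partition, while range fast-forwarding consults the index once per owned range:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy model: tokens are ints, owned ranges are inclusive [start, end] pairs,
// both sorted. Returns {owned tokens found, number of index seeks}.
std::pair<int, int> cleanup_with_ranges(
        const std::vector<int>& tokens,
        const std::vector<std::pair<int, int>>& owned) {
    int found = 0, seeks = 0;
    std::size_t pos = 0;
    for (auto [start, end] : owned) {
        ++seeks;  // one index seek to fast forward to the range start
        while (pos < tokens.size() && tokens[pos] < start) ++pos;
        while (pos < tokens.size() && tokens[pos] <= end) { ++found; ++pos; }
    }
    return {found, seeks};
}

std::pair<int, int> cleanup_per_partition(
        const std::vector<int>& tokens,
        const std::vector<std::pair<int, int>>& owned) {
    int found = 0, seeks = 0;
    for (int t : tokens) {
        ++seeks;  // every partition inspected; unowned ones skipped one by one
        for (auto [start, end] : owned) {
            if (t >= start && t <= end) { ++found; break; }
        }
    }
    return {found, seeks};
}
```

Both variants select the same owned tokens, but the number of seeks grows with the number of owned ranges rather than with the number of partitions.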
When wiring multi range reader with cleanup, I found that cleanup
wouldn't be able to release disk space of input SSTables earlier.
The reason is that the multi range reader fast forwards to the next range,
therefore it enables mutation_reader::forwarding, and as a result,
combined reader cannot release readers proactively as it cannot tell
for sure that the underlying reader is exhausted. It may have reached
EOS for the current range, but it may have data for the next one.
The concept of EOS actually only applies to the current range being
read. A reader that returned EOS will actually get out of this
state once the combined reader fast forwards to the next range.
Therefore, only the underlying reader, i.e. the sstable reader,
can for certain know that the data source is completely exhausted,
given that tokens are read in monotonically increasing order.
For reversed reads, that's not true but fast forward to range
is not actually supported yet for it.
Today, the SSTable reader already knows that the underlying SSTable
was exhausted in fast_forward_to(), after it calls index_reader's
advance_to(partition_range), therefore it disables subsequent
reads. We can take a step further and also check that the index
was exhausted, i.e. reached EOF.
So if the index is exhausted, and there's no partition to read
after the fast_forward_to() call, we know that there's nothing
left to do in this reader, and therefore the reader can be
closed proactively, allowing the disk space of SSTable to be
reclaimed if it was already deleted.
We can see that the combined reader, under the multi range reader,
will incrementally find disjoint SSTables exhausted
as it fast forwards to owned ranges:
1:
INFO 2023-07-05 10:51:09,570 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-4525396453480898112, start},{-4525396453480898112, end}]
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db, start == *end, eof ? true
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - closing reader 0x60100029d800 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-3-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false
2:
INFO 2023-07-05 10:51:09,572 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-2253424581619911583, start},{-2253424581619911583, end}]
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db, start == *end, eof ? true
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - closing reader 0x60100029d400 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == *end, eof ? false
INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == *end, eof ? false
And so on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It's odd that we see things like:
```c++
if (!is_initialized()) {
    return initialize().then([this] {
        if (!is_initialized()) {
```
and
```c++
return ensure_initialized().then([this, &pr] {
    if (!is_initialized()) {
```
One might think initialize() will actually initialize the reader by
setting up the context, and that ensure_initialized() has even stronger
guarantees, meaning that the reader must be initialized by it.
But neither is true.
In the context of a single-partition read, it can happen that
initialize() does not set up the context, meaning is_initialized()
returns false, which is why initialization must be checked even after
we call ensure_initialized().
Let's merge ensure_initialized() and initialize() into a
maybe_initialize() which returns a boolean saying if the reader
is initialized.
It makes the code initializing the reader easier to understand.
This reverts commit 2a58b4a39a, reversing
changes made to dd63169077.
After patch 87c8d63b7a,
table_resharding_compaction_task_impl::run() performs the forbidden
action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard,
which is a data race that can cause a use-after-free, typically manifesting
as allocator corruption.
Note: before the bad patch, this was avoided by copying the _contents_ of the
lw_shared_ptr into a new, local lw_shared_ptr.
Fixes #14475
Fixes #14618
Closes #14641
Fixes #14299
failure_detector can try sending messages to TLS endpoints before start_listen
has been called (why?). We need TLS initialized before this, so do it on
service creation.
Closes #14493
to inspect the sstable generation after uuid-based generation
change. in this change:
* a pretty printer for sstable::generation_type is added
* now that the pretty printer for the generation_type is registered,
  we can just leverage it when printing the sstable name. so,
  instead of checking if the `_generation` member variable contains
  `_value`, we delegate it to `str()`, which is used by
  `str.format()`, as the behavior of `str()` is similar to that of
  the gdb `print` command: it calls `value.format_string()`, which
  in turn calls into `to_string()` if the "value" in question has
  a pretty printer.
after this change, the printer is able to print both the generations
before the uuid change and the ones after the change.
a typical gdb session looks like:
```
(gdb) p generation._value
$5 = f0770b40-1c7c-11ee-b136-bf28f8d18b88
(gdb) p generation
$10 = 3g7g_0bu7_0jpvk2p0mmtlsb8lu0
(gdb) p/x generation._value.least_sig_bits
$7 = 0xb136bf28f8d18b88
(gdb) p/x generation._value.most_sig_bits
$8 = 0xf0770b401c7c11ee
```
if we use `scripts/base36-uuid.py` to encode
the msb and lsb, we'd need to:
```console
scripts/base36-uuid.py -e 0xf0770b401c7c11ee 0xb136bf28f8d18b88
3g7g_0bu7_0jpvk2p0mmtlsb8lu0
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14561
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.
A good example is a compaction which expires 99% of the data; today
the throughput would be calculated on the 1% written, which
misleads the reader into thinking that the compaction was terribly
slow.
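The arithmetic behind this (illustrative numbers, not from a real run): if a compaction reads 2 GB and expires 99% of it in 10 seconds, input-based throughput is 200 MB/s while output-based throughput reports only 2 MB/s:

```cpp
#include <cstdint>

// Throughput in MB/s over the given duration; `bytes` is either the input
// read by the compaction (the fixed behavior) or the output written (the
// old behavior).
double throughput_mb_per_s(uint64_t bytes, double seconds) {
    return bytes / 1e6 / seconds;
}
```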
Fixes #14533.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14615
instead of concatenating strings, let's format using the built-in
support of `log::debug()`, for two reasons:
1. better performance. after this change, we don't need to
materialize the concatenated string if the "debug" level logging
is not enabled. seastar::log only formats when a certain log
level is enabled.
2. better readability. with the format string, it is clear what
is the fixed part, and which arguments are to be formatted.
this also helps us to move to compile-time formatting check,
as fmtlib requires the caller to be explicit when it wants
to use runtime format string.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14627
These barriers show if there's any operation in progress (read, write,
flush or stream). These are crucial to know if stopping fails, e.g. see
issue #13100
These barriers are summarized in the 'scylla memory' command, but they are
also good to know on a per-table basis
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a 'scylla tables' command that lists tables on the given/current
shard, but the list cannot show much information. It prints the
table address so it can be explored by hand, but some data is handier
to parse and print with the script
The syntax is
$ scylla table ks.cf
For now just print the schema version. To be extended in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refs: https://github.com/scylladb/scylla-docs/issues/4091
Fixes https://github.com/scylladb/scylla-docs/issues/3419
This PR moves the installation instructions from the [website](https://www.scylladb.com/download/) to the documentation. Key changes:
- The instructions are mostly identical, so they were squeezed into one page with different tabs.
- I've merged the info for Ubuntu and Debian, as well as CentOS and RHEL.
- The page uses variables that should be updated each release (at least for now).
- The Java requirement was updated from Java 8 to Java 11 following [this issue](https://github.com/scylladb/scylla-docs/issues/3419).
- In addition, the title of the Unified Installer page has been updated to communicate better about its contents.
Closes #14504
* github.com:scylladb/scylladb:
doc: update the prerequisites section
doc: improve the title of Unified Installer page
doc: move package install instructions to the docs
* seastar 2b7a341210...bac344d584 (3):
> tls: Export error_category instance used by tls + some common error codes
> reactor: cast enum to int when formatting it
> cooking: bump up zlib to 1.2.13
In d2a4079bbe, `merger` was modified so
that when we merge a command, `last_group0_state_id` is taken to be the
maximum of the merged command's state_id and the current
`last_group0_state_id`. This is necessary for achieving the same
behavior as if the commands were applied individually instead of being
merged -- where we take the maximum state ID from `group0_history` table
which was applied until now (because the table is sorted using the state
IDs and we take the greatest row).
However, a subtle bug was introduced -- the `std::max` function uses the
`utils::UUID` standard comparison operator which is unfortunately not
the same as timeuuid comparison that Scylla performs when sorting the
`group0_history` table. So in rare cases it could return the *smaller*
of the two timeuuids w.r.t. the correct timeuuid ordering. This would
then lead to commands being applied which should have been turned to
no-ops due to the `prev_state_id` check -- and then, for example,
permanent schema desync or worse.
Fix it by using the correct comparison method.
Fixes: #14600
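A reduced illustration of why this matters (hypothetical helpers operating only on the most significant 64 bits; Scylla's real `timeuuid_tri_compare` also handles signedness and the full UUID): in the v1 timeuuid layout, time_low is stored before time_mid and time_hi, so numeric comparison of the raw bits can disagree with timestamp order:

```cpp
#include <cstdint>

// Build the msb of a v1 timeuuid from a 60-bit timestamp:
// msb = time_low(32) | time_mid(16) | version(4, = 1) | time_hi(12).
uint64_t make_v1_msb(uint64_t ts) {
    uint64_t time_low = ts & 0xffffffff;
    uint64_t time_mid = (ts >> 32) & 0xffff;
    uint64_t time_hi  = (ts >> 48) & 0x0fff;
    return (time_low << 32) | (time_mid << 16) | 0x1000 | time_hi;
}

// Recover the timestamp -- the key that the correct (timeuuid) ordering uses.
uint64_t v1_timestamp(uint64_t msb) {
    return ((msb & 0x0fff) << 48) | (((msb >> 16) & 0xffff) << 32) | (msb >> 32);
}
```

With a small timestamp of 2 and a large one of 2^32 + 1, the large timestamp yields the numerically *smaller* msb, which is exactly the kind of disagreement `std::max` over plain UUID comparison runs into.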
The existing `timeuuid_tri_compare` operates on UUIDs serialized in byte
buffers. Introduce a version which operates directly on the
`utils::UUID` type.
To reuse existing comparison code, we serialize to a buffer before
comparing. But we avoid allocations by using `std::array`. Since the
serialized size needs to be known at compile time for `std::array`, mark
`UUID::serialized_size()` as `constexpr`.
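The idea can be sketched as follows (a toy type, not Scylla's actual `utils::UUID`): with a `constexpr` size, serialization can target a stack `std::array` instead of a heap-allocated buffer:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy UUID: two 64-bit halves and a compile-time serialized size.
struct uuid_sketch {
    uint64_t msb, lsb;
    static constexpr std::size_t serialized_size() { return 16; }
};

// Big-endian serialization into a stack array -- no allocation needed,
// because the array size is known at compile time.
std::array<int8_t, uuid_sketch::serialized_size()>
serialize(const uuid_sketch& u) {
    std::array<int8_t, uuid_sketch::serialized_size()> out{};
    for (int i = 0; i < 8; ++i) {
        out[i]     = int8_t(u.msb >> (56 - 8 * i));
        out[8 + i] = int8_t(u.lsb >> (56 - 8 * i));
    }
    return out;
}
```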
`timeuuid_tri_compare` takes `bytes_view` parameters and converts them
to `const int8_t*` before comparing.
Extract the part that operates on `const int8_t*` to separate function
which we will reuse in a later commit.
with tagging ops, we will be able to attach kv pairs to an object.
this will allow us to mark sstable components with taggings, and
filter them based on them.
* test/pylib/minio_server.py: enable anonymous user to perform
more actions. because the tagging related ops are not enabled by
"mc anonymous set public", we have to enable them using "set-json"
subcommand.
* utils/s3/client: add methods to manipulate taggings.
* test/boost/s3_test: add a simple test accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14486
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as a shared_ptr should behave
like a pointer, not like the value it points to. and this could be
error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, and expect that the value pointed to by `p` is cleared
and that all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string at a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so, before deprecating and removing
this operator, let's stop using it.
in this change, we update two call sites of
`lw_shared_ptr::operator=(T&&)`. instead of creating a new
pointee in-place, a new instance of lw_shared_ptr is
created and assigned to the existing shared_ptr.
Closes#14470
* github.com:scylladb/scylladb:
sstables: use try_emplace() when appropriate
replica,sstable: do not assign a value to a shared_ptr
we added pretty_printers.cc back in
83c70ac04f, in which configure.py is
updated. so let's sync the CMake building system accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14442
As the goal is to make compaction filter to the next owned range,
make_sstable_reader() should be extended to create a reader with
parameters forwarded from mutation_source interface, which will
be used when wiring cleanup with multi range reader.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Our usage of inodes is dual:
- the Index.db and Data.db components are pinned in memory as
the files are open
- all other components are read once and never looked at again
As such, tune the kernel to prefer evicting dcache/inodes to
memory pages. The default is 100, so the value of 2000 increases
it by a factor of 20.
Ref https://github.com/scylladb/scylladb/issues/14506
Closes#14509
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.
Existing intended cases were annotated.
Closes#14607
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.
* seastar 99d28ff057...2b7a341210 (23):
> Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes#8828.
> reactor: use structured binding when appropriate
> Simplify payload length and mask parsing.
> memcached: do not used deprecated API
> build: serialize calls to openssl certificate generation
> reactor: epoll backend: initialize _highres_timer_pending
> shared_ptr: deprecate lw_shared_ptr operator=(T&&)
> tests: fail spawn_test if output is empty
> Support specifying the "build root" in configure
> Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
> build: correct the syntax error in comment
> util: print_safe: fix hex print functions
> Add code examples for handling exceptions
> smp: warn if --memory parameter is not supported
> Merge 'gate: track holders' from Benny Halevy
> file: call lambda with std::invoke()
> deleter: Delete move and copy constructors
> file: fix the indent
> file: call close() without the syscall thread
> reactor: use s/::free()/::io_uring_free_probe()/
> Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
> reactor: Don't re-evaliate local reactor for thread_pool
> Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
Closes#14602
The script gets the build id on its own by eu-unstrip-ing the core file
and searching for the necessary value in the output. This can be a
somewhat lengthy operation, especially on huge core files. Sometimes
(e.g. in tests) the build id is known and can be just provided as an
argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14574
Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.
Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).
Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
replaced or changed their IPs. These entries are still using status
`NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
`handle_state_normal` is also called for it). The
`_gossiper.wait_alive` call would wait for those entries too and
eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
may not be populated yet, which is required to do the IP<->Host ID
translations -- and populating `token_metadata` happens inside
`handle_state_normal`, so we have a chicken-and-egg problem here.
It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs. `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as parameter and is modified appropriately, it's now calculating
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).
Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.
This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.
We need a better way to calculate `sync_nodes`, one which detects and
ignores obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.
In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.
We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.
We also introduce regression tests with necessary extensions
to the test framework.
Fixes#14468
Fixes#14487
Closes#14507
* github.com:scylladb/scylladb:
test: rename `test_topology_ip.py` to `test_replace.py`
test: test bootstrap after IP change
test: scylla_cluster: return the new IP from `change_ip` API
test: node replace with `ignore_dead_nodes` test
test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
storage_service: remove `get_nodes_to_sync_with`
storage_service: use `token_metadata` to calculate nodes waited for to be UP
storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.
so, let's make the percentage a float before formatting it.
Fixes#14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14588
Michał Chojnowski noted that this is not true. -O0 almost doubles
the run time of `./test.py --mode=debug`. but it does not fail
any of the tests.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14456
Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair, we iterate over all (shard, table, range) tuples, read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for a small num_tokens (16 in the tests), the number of affected ranges (those we need to consider) amounts to the total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row-level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard; there are `parallel_for_each` calls in the code, but they are throttled by the chosen memory constraints and in fact execute sequentially.
When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If the sharding parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard; this is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contains data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly because of `smp::submit_to`.
In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. We do this if the estimated number of partitions is small (<=100).
This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`:
```
with this patch
20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2...
20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859
without this patch
21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2...
21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140
```
Fixes: #14093
Closes#14178
* github.com:scylladb/scylladb:
repair_test: add test_reader_with_different_strategies
repair: extract repair_reader declaration into reader.hh
repair_meta: get_estimated_partitions fix
repair_meta: use multishard_filter reader if the number of partitions is small
repair_meta: delay _repair_reader creation
database.hh: make_multishard_streaming_reader with range parameter
database.cc: extract streaming_reader_lifecycle_policy
In get_sstables_for_key in api/column_family.cc a set of lw_shared_ptrs
to sstables is passed to the reducer of map_reduce0. The reducer then
accesses these shared pointers. As the reducer is invoked on the same
shard map_reduce0 is called on, we have an illegal access to a shared
pointer on a non-owner cpu.
The set of shared pointers to sstables is now transformed in the map
function, which is guaranteed to be invoked on a shard associated with
the service.
Fixes: #14515.
Closes#14532
fmtlib uses `{}` as the placeholder for the formatted argument, not
`{}}`.
so let's correct it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14586
when formatting the error message for `api_error::validation`, we
always include the caller in the error message, but in this case,
forgot to pass the `caller` to `seastar::format()`. if fmtlib
actually formats them, it would throw.
so let's pass `caller` to `seastar::format()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14589
When the task manager is not aborted, the tasks are stored in memory,
not allowing the tasks' gate to be closed.
When wrapped_compaction_manager is destructed, the task manager gets
aborted, so that the system can shut down.
task_manager::module::stop() waits till all compactions are complete.
Thus, ongoing compactions should be aborted before stop() is called
not to prolong shutdown process.
Task manager's compaction module is stopped after
compaction_manager::do_stop(), which aborts ongoing compactions,
is called.
The test has about 1/2500000 chance to fail due to a conflict of random
values. And it recently did, just to spite us.
Fight back.
Fixes#14563
Closes#14576
before this change, we formatted an sstring with `{:d}`; fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.
so, in this change, we just print the int8_t using `seastar::format()`
in a single pass, with the format specifier `#02x` instead of
adding the "0x" prefix manually.
Fixes#14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14578
fmtlib allows us to specify the field width dynamically, so specifying
the field width in the same statement that formats the argument improves
readability. and using a constexpr format string allows us to switch
to the compile-time formatting supported by fmtlib v8.
this change also uses `fmt::print()` to format the argument right into
the output ostream, instead of creating a temporary sstring and
copying it to the output ostream.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14579
Fixes https://github.com/scylladb/scylladb/issues/14565
This commit improves the description of ScyllaDB configuration
via User Data on AWS.
- The info about experimental features and developer mode is removed.
- The description of User Data is fixed.
- The example in User Data is updated.
- The broken link is fixed.
Closes#14569
The CDC generation data can be large and not fit in a single command.
This PR splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.
Commands are sent sequentially to avoid concurrency problems.
Topology snapshots contain only the mutation of the current CDC
generation data, but don't contain any previous or future generations.
If a new generation of data is being broadcast but hasn't been entirely
applied yet, the applied part won't be sent in a snapshot. New or
delayed nodes can never get the applied part in this scenario.
Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.
A mechanism to remove old CDC generations will be introduced as a
follow-up.
Closes#13962
* github.com:scylladb/scylladb:
test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
raft: add release_guard
raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
raft topology: make `mutation_size_threshold` depends on `max_command_size`
raft: reduce max batch size of raft commands and raft entries
raft: add description argument to add_entry_unguarded
raft: introduce `write_mutations` command
raft: refactor `topology_change` applying
Avoid pinging self in direct failure detector; this adds confusing noise and constant overhead.
Fixes#14388
Closes#14558
* github.com:scylladb/scylladb:
direct_fd: do not ping self
raft: initialize raft_group_registry with host id early
raft: code cleanup
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size`
is limited to less than 1 MB. It adds an injection which copies the mutations
generated by `get_cdc_generation_mutations` n times, where n is picked such
that the memory size of all mutations exceeds `max_command_size`.
This test passes if the cdc generation data is committed by raft in multiple
commands. If all the data is committed in a single command, the leader node
will loop trying to send the raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
When the topology_coordinator fiber gets `raft::commit_status_unknown`, it
prints an error. This exception is not an error in this case, and it can be
thrown when the leader has changed. It can happen in `add_entry_unguarded`
while sending a part of the CDC generation data in the `write_mutations` command.
Catch this exception in `topology_coordinator::run` and print a warning instead.
Broadcasts all mutations returned from `prepare_new_cdc_generation_data`
except the last one. Each mutation is sent in a separate raft command. It takes
`group0_guard`, and if the number of mutations is greater than one, the guard
is dropped, and a new one is created and returned, otherwise the old one will
be returned. Commands are sent in parallel and unguarded (the guard used for
sending the last mutation will guarantee that the term hasn't been changed).
Returns the generation's UUID, guard and last mutation, which will be sent
with additional topology data by the caller.
If we sent the last mutation in a `write_mutations` command too, we would use a
total of `n + 1` commands instead of `(n - 1) + 1` (where `n` is the number of
mutations), so it's better to send it in `topology_change` (we need to send
it after all `write_mutations`) together with some small metadata.
With the default commitlog segment size, `mutation_size_threshold` will be 4 MB.
In large clusters (e.g. 100 nodes, 64 shards per node, 256 vnodes), CDC generation
data can reach a size of 30 MB, thus there will be no more than 8 commands.
In a multi-DC cluster with 100ms latencies between DCs, this operation should
take about 200ms since we send the commands concurrently, but even if the commands
were replicated sequentially by Raft, it should take no more than 1.6s, which is
incomparably smaller than the bootstrapping operation (bootstrapping is quick if
there is no data in the cluster, but usually if one has 100 nodes they have tons
of data, so streaming/repair will indeed take much longer (hours/days)).
Fixes FIXME in pr #13683.
If `group0_state_machine` applies all commands individually (without batching),
the resulting current `state_id` -- which will be compared with the
`prev_state_id` of the next command if it is a guarded command -- equals the
maximum of the `next_state_id` of all commands applied up to this point.
That's because the current `state_id` is obtained from the history table by
taking the row with the largest clustering key.
When `group0_state_machine::apply` is called with a batch of commands, the
current `state_id` is loaded from `system.group0_history` to `merger::last_group0_state_id`
only once. When a command is merged, its `next_state_id` overwrites
`last_group0_state_id`, regardless of their order.
Let's consider the following situation:
The leader sends two unguarded `write_mutations` commands concurrently, with
timeuuids T1 and T2, where T1 < T2. The leader waits for them to be applied and
sends a guarded `topology_change` with `prev_state_id` equal to T2.
Suppose that the command with timeuuid T2 is committed first, and these commands
are small enough that all of the `write_mutations` could be merged into one command.
A follower can get all three of these commands before its `fsm` polls them.
In this situation, `group0_state_machine::apply` is called with all three of
them and `merger` will merge both `write_mutations` into one command. After that,
`merger::last_group0_state_id` will be equal to T1 (this command was committed
as the second one). When it processes the `topology_change` command, it will
compare its `prev_state_id` and `merger::last_group0_state_id`, resulting in
making this command a no-op (which wouldn't happen if the commands were applied
individually).
Such a scenario leads to inconsistency: one replica applies `topology_change`,
while another makes it a no-op.
Topology snapshots contain only the mutation of the current CDC generation data
but don't contain any previous or future generations. If a new generation of data
is being broadcast but hasn't been entirely applied yet, the applied part won't be
sent in a snapshot. In this scenario, new or delayed nodes can never get the applied part.
Send entire cdc_generations_v3 table in the snapshot to resolve this problem.
As a follow-up, a mechanism to remove old CDC generations will be introduced.
`get_cdc_generation_mutations` splits data into mutations of maximal size
`mutation_size_threshold`. Before this commit it was hardcoded to 2 MB.
Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information, it
can exceed this threshold, resulting in an exception being thrown when
writing the mutation to the commitlog.
A batch of raft commands has a size of at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It doesn't have additional metadata, but
it may have a size of exactly `max_mutation_size`. It shouldn't cause any
trouble, but it is preferable to be careful.
Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.
Fixed typo "1204" => "1024".
Provide useful description for `write_mutations` and
`broadcast_tables_query` that is stored in `system.group0_history`.
Reduces scope of issue #13370.
Fixes https://github.com/scylladb/scylladb/issues/13877
This commit adds the information about Rust CDC Connector
to the documentation. All relevant pages are updated:
the ScyllaDB Rust Driver page, and other places in
the docs where Java and Go CDC connectors are mentioned.
In addition, the drivers table is updated to indicate
Rust driver support for CDC.
Closes#14530
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory, so it won't be empty and
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:
```
WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```
It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.
And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.
To confirm that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.
Fixes#13538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14557
DEFAULT_MIN_SSTABLE_SIZE is defined as `50L * 1024L * 1024L`
which is 50 MB, not 50 bytes.
Fixes#14413
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14414
We have plenty of code marked with #if 0. Once it was an indication
of missing functionality, but the code has evolved so much it's
useless as an indication and only a distraction.
Delete it.
Closes#14511
Fixes https://github.com/scylladb/scylladb/issues/14459
This PR removes the (dead) link to the unirestore tool in a private repository. In addition, it adds minor language improvements.
Closes#14519
* github.com:scylladb/scylladb:
doc: minor language improvements on the Migration Tools page
doc: remove the link to the private repository
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific endpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.
This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:
1. Empty string (the default) leaves the current behavior -
DescribeEndpoints echoes the request's "Host" header.
2. The string "disabled" disables the DescribeEndpoints (it will return
an UnknownOperationException). This is how DynamoDB Local behaves,
and the AWS C++ SDK and the Dynamodb Shell work well in this mode.
3. Any other string is a fixed string to be returned by DescribeEndpoints.
It can be useful in setups that should return a known address.
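Assuming standard scylla.yaml syntax, the three modes could be selected like this (the option name comes from the patch; the fixed address is a hypothetical value):

```yaml
# scylla.yaml -- pick one of the three modes described above:
# alternator_describe_endpoints: ""          # default: echo the request's "Host" header
# alternator_describe_endpoints: disabled    # return UnknownOperationException, like DynamoDB Local
alternator_describe_endpoints: "alternator.example.com:8000"  # fixed address (hypothetical value)
```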
Note that this patch does not, by default, change the current behavior
of DescribeEndpoints. But it lets us override its behavior in the future,
if a user experiences problems in the field - without code changes.
Fixes#14410.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14432
Earlier, when local query processor wasn't available at
the beginning of system start, we couldn't query our own
host id when initializing the raft group registry. The local
host id is needed by the registry since it is responsible
for routing RPC messages to specific raft groups, and needs
to reject messages destined for a different host.
Now that the host id is known early at boot, remove the optional
and pass host id in the constructor. Resolves an earlier fixme.
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.
This PR adds the `inject_with_handler` method that executes the injected function and passes an `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.
Closes#14357.
Closes#14460
* github.com:scylladb/scylladb:
tests: introduce InjectionHandler class for communicating with injected code
api/error_injection: add message_injection endpoint
tests: utils: error injections: add test for inject_with_handler
utils: error injection: add inject_with_handler for interactions with injected code
utils: error injection: create structure for error injections data
Currently, it is hard for injected code to wait for some events, for example,
requests on some REST endpoint.
This commit adds the `inject_with_handler` method that executes the injected
function and passes an `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers
associated with the injection on all shards about the received message.
There is a counter of received messages, `received_messages_counter`; it is shared
between the injection_data, which is created once when enabling an injection on
a given shard, and all `injection_handler`s, which are created separately for each
firing of this injection. The counter is incremented when a message is received from
the REST endpoint, and the condition variable is signaled.
Each `injection_handler` (separate for each firing) stores its own private counter,
`_read_messages_counter`. That private counter is incremented whenever we wait for a
message, and compared to the received counter. We sleep on the condition variable
if not enough messages were received.
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
Regression test for #14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.
In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.
In #14468 and #14487 we observed that this set may contain entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.
We need a better way to calculate `sync_nodes`, one which detects and
ignores obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.
In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.
We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.
Fixes#14468
Fixes#14487
Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.
Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).
Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
replaced or changed their IPs. These entries are still using status
`NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
`handle_state_normal` is also called for it). The
`_gossiper.wait_alive` call would wait for those entries too and
eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
may not be populated yet, which is required to do the IP<->Host ID
translations -- and populating `token_metadata` happens inside
`handle_state_normal`, so we have a chicken-and-egg problem here.
The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as a parameter and is modified appropriately; it now recalculates
the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).
Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chicken-and-egg problem.
This addresses the immediate failure described in #14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
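The recompute-on-each-retry wait loop described above can be sketched as follows. This is an illustrative Python model, not the real C++ code; the names `normal_or_shutdown_endpoints` and `is_normal_state_handled` are hypothetical stand-ins for the gossiper's state:

```python
import time

def wait_for_normal_state_handled(gossiper, deadline, poll=0.1):
    """Wait until handle_state_normal has finished for every NORMAL/shutdown
    entry currently in gossiper state.  The set is recomputed on every
    iteration, so entries garbage-collected while we wait stop blocking us."""
    while True:
        # Recompute the set each time: it may shrink while we wait.
        pending = {ep for ep in gossiper.normal_or_shutdown_endpoints()
                   if not gossiper.is_normal_state_handled(ep)}
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"normal state not handled for: {pending}")
        time.sleep(poll)
```

The key property is that no fixed `sync_nodes` set is computed up front, so obsolete entries that disappear from gossiper state mid-wait cannot cause a timeout.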
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.
This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.
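The close/reopen behaviour can be modelled with a small sketch. This is a simplified Python illustration (fragments are plain tuples, positions are integers), not the actual mutation_fragment_v2 machinery:

```python
def split_with_rt_fixup(fragments, max_per_chunk):
    """Split a flat fragment stream into chunks of roughly max_per_chunk
    fragments.  If a flush happens while a range tombstone is open, close
    it at the last seen position in the current chunk and reopen it at the
    same position in the next chunk, so every chunk is self-contained."""
    chunks, cur, open_rt = [], [], None

    def flush():
        nonlocal cur
        if open_rt is not None:
            pos, ts = open_rt
            cur.append(("rt_close", pos, ts))   # close at last seen position
        chunks.append(cur)
        cur = []
        if open_rt is not None:
            cur.append(("rt_open", pos, ts))    # reopen in the next chunk

    for frag in fragments:
        kind = frag[0]
        if kind == "rt_open":
            open_rt = (frag[1], frag[2])
        elif kind == "rt_close":
            open_rt = None
        elif kind == "row" and open_rt is not None:
            open_rt = (frag[1], open_rt[1])     # advance last seen position
        cur.append(frag)
        if len(cur) >= max_per_chunk:
            flush()
    if cur:
        chunks.append(cur)
    return chunks
```

Each emitted chunk then has balanced tombstone bounds, which is the invariant the bug was violating.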
Fixes https://github.com/scylladb/scylladb/issues/14503
Closes #14502
* github.com:scylladb/scylladb:
test: view_build_test: add range tombstones to test_view_update_generator_buffering
test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
view_updating_consumer: make buffer limit a variable
view: fix range tombstone handling on flushes in view_updating_consumer
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
this change has no impact on `build.ninja` generated by `configure.py`,
as we are using a `set` for tracking the tests to be built. but it's
still an improvement, as we should not add duplicate entries to a set
when initializing it.
there were two occurrences of `test/boost/double_decker_test`; the one
grouped with the local cluster of collection tests - bptree,
btree, radix_tree and double_decker - is preserved.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14478
in this series, test/object_storage is restructured into a pytest based test. this paves the road to a test suite covering more use cases, so we can add more lower-level tests for the tiered/caching-store.
Closes #14165
* github.com:scylladb/scylladb:
s3/test: do not return ip in managed_cluster()
s3/test: verify the behavior with asserts
s3/test: restructure object_store/run into a pytest
s3/test: extract get_scylla_with_s3_cmd() out
s3/test: s/restart_with_dir/kill_with_dir/
s3/test: vendor run_with_dir() and friends
s3/test: remove get_tempdir()
s3/test: extract managed_cluster() out
Currently we hold group0_guard only during a DDL statement's execute()
function, but unfortunately some statements access the underlying schema
state also during the check_access() and validate() calls which are made
by the query_processor before it calls execute(). We need to cover those
calls with group0_guard as well and also move the retry loop up. This patch
does it by introducing a new function, take_guard(), to the cql_statement class.
Schema-altering statements return a group0 guard while others do not
return any guard. The query processor takes this guard at the beginning of
statement execution and retries if service::group0_concurrent_modification
is thrown. The guard is passed to execute() in the query_state structure.
Fixes: #13942
Message-Id: <ZJ2aeNIBQCtnTaE2@scylladb.com>
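The control flow above can be sketched as a small model. This is an illustrative Python sketch of the retry loop, not the actual query processor; `take_guard`, `check_access`, `validate` and `execute` mirror the names in the commit message, and the retry bound is an assumption:

```python
class Group0ConcurrentModification(Exception):
    pass

def process_ddl(statement, group0_client, max_retries=10):
    """Retry loop covering check_access()/validate()/execute() under one
    guard, so the schema state the statement saw cannot change mid-flight.
    On a concurrent modification the guard is retaken and everything reruns."""
    for _ in range(max_retries):
        guard = statement.take_guard(group0_client)   # None for non-DDL statements
        try:
            statement.check_access()
            statement.validate()
            return statement.execute(guard)           # guard travels in the query state
        except Group0ConcurrentModification:
            continue  # someone else committed first: retake guard and retry
    raise Group0ConcurrentModification("too many retries")
```

The point is that the whole check/validate/execute sequence sits inside the retry loop, not just execute().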
there is a chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.
in this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out. it also allows us to
ignore the failure or specify the timeout. this should ensure the
minio server is ready before tests start to connect to it.
also, in this change, instead of hardwiring the alias of "local" in the code,
define a variable for it. less repeating this way.
Fixes https://github.com/scylladb/scylladb/issues/1719
Closes #14517
* github.com:scylladb/scylladb:
test/pylib: do not hardwire alias to "local"
test/pylib: retry if minio_server is not ready
let's just use cluster.contact_points for retrieving the IP address
of the scylla node in this single-node cluster. so the name of
managed_cluster() is less weird.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using a single run to perform the test, restructure
it into a pytest based test suite with a single test case.
this should allow us to add more tests exercising the object-storage
and cached/tiered storage in the future.
* add fixtures so they can be reused by tests
* use tmpdir fixture for managing the tmpdir, see
https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture
* perform part of the teardown in the "test_tempdir()" fixture
* change the type of test from "Run" to "Python"
* rename "run" to "test_basic.py"
* optionally start the minio server if the settings are not
found in command line or env variables, so that the tests are
self-contained without the fixture setup by test.py.
* instead of sys.exit(), use assert statement, as this is
what pytest uses.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* define a dedicated S3_server class which duck types MinioServer.
it will be used to represent S3 server in place of MinioServer if
S3 is used for testing
* prepare object_storage.yaml in get_scylla_with_s3(), so it is more
clear that we are using the same set of settings for launching
scylla
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
replace the restart_with_dir() with kill_with_dir(), so
that we can simplify the usage of managed_cluster() by enabling it
to start and stop the single-node cluster. with this change, the caller
does not need to run the scylla and pass its pid to this function
any more.
since the restart_with_dir() call is superseded by managed_cluster(),
which tears down the cluster, teardown() is now only responsible for
printing out the log file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to match with another call of managed_cluster(), so it's clear that
we are just reusing test_tempdir.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
for setting up the cluster and tearing it down.
this helps to indent the code so that the lifecycle of the cluster
is visually explicit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
there is a chance that minio_server is not ready to serve after
launching the server executable process. so we need to retry until
the first "mc" command is able to talk to it.
in this change, a method `mc()` is added to run the minio client,
so we can retry the command before it times out. it also allows us to
ignore the failure or specify the timeout. this should ensure the
minio server is ready before tests start to connect to it.
Fixes #1719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
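A minimal sketch of such a retry wrapper, assuming a simple retry-until-deadline policy (the `binary` parameter is added here purely for illustration; the real helper invokes `mc` directly):

```python
import subprocess
import time

def mc(args, timeout=30, ignore_failure=False, binary="mc"):
    """Run a minio client command, retrying until it succeeds or the
    timeout expires.  Used to wait for the server to become ready
    before tests connect to it."""
    deadline = time.monotonic() + timeout
    while True:
        res = subprocess.run([binary, *args], capture_output=True)
        if res.returncode == 0:
            return res
        if time.monotonic() > deadline:
            if ignore_failure:
                return res          # caller opted to tolerate the failure
            raise TimeoutError(f"mc {args} failed: {res.stderr!r}")
        time.sleep(0.1)             # server may still be starting up
```

Calling this once with a cheap command (e.g. listing the alias) before the tests run is enough to block until the server answers.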
It is possible that a gossip message from an old node is delivered
out of order during a slow boot and the raft address map overwrites
a new IP address with an obsolete one, from the previous incarnation
of this node. Take into account the node restart counter when updating
the address map.
A test case requires a parameterized error injection, which
we don't support yet. Will be added as a separate commit.
Fixes#14257
Refs #14357
Closes #14329
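The restart-counter check can be modelled with a small sketch. This is an illustrative Python model of the rule, not the real raft address map; the map stores `(ip, generation)` per host ID, where the generation plays the role of the restart counter:

```python
def update_address_map(amap, host_id, ip, generation):
    """Only accept an IP for a host if it comes from a generation (restart
    counter) at least as new as the one already recorded; a delayed gossip
    message from a previous incarnation of the node is dropped."""
    cur = amap.get(host_id)
    if cur is not None and generation < cur[1]:
        return False                 # stale message from an old incarnation
    amap[host_id] = (ip, generation)
    return True
```

With this guard, an out-of-order message carrying the old IP can no longer overwrite the new one.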
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.
To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in an improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.
This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).
The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.
Fixes #14503
This command is used to send mutations over raft.
In later commits if `topology_change` doesn't fit the max command size,
it will be split into smaller mutations and sent over multiple raft
commands.
Split up the `topology_change` command's logic to apply mutations and reload
the topology state in separate functions.
This aims to extract the logic of applying mutations to use it in future raft commands.
This commit moves the installation instructions with Linux
packages from the website to the docs.
The scope:
- Added the install-on-linux.rst file that has information
about all supported Linux platforms. The replace variables
in the file must be updated per release.
- Updated the index page to include the new file.
Refs: scylladb/scylla-docs#4091
this would allow developers to run a minio server for testing, for instance, s3_test.
Closes #14485
* github.com:scylladb/scylladb:
test/pylib: chmod +x minio_server.py
test/pylib: allow run minio_server.py as a stand-alone tool
We replace is_local_reader bool_class with the
read_strategy enum, since now we have three options.
We choose our new multishard_streaming_reader if
the number of partitions is less than the
number of master subranges.
In later commits we will need the estimated number
of partitions in _repair_reader creation,
so in this commit we delay it until the
reader is first used in read_rows_from_disk
function. read_rows_from_disk is used
in get_sync_boundary, which is called
by master after set_estimated_partitions.
We add an overload of make_multishard_streaming_reader
which reads all the data in the given range. We will use it later
in row level repair if --smp is different on the
nodes and the number of partitions is small.
We are going to use it later in a new
make_multishard_streaming_reader overload.
In this commit we just move it outside
into the anonymous namespace, no other code changes
were made.
so we don't have to search in the unordered_map twice. and it's
more readable, as we don't need to compare an iterator with the
sentinel.
also, take the opportunity to simplify the code by using the
temporary `s3_cfg` when possible instead of `it->second.cfg`
which is less readable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of using operator=(T&&) to assign an instance of `T` to a
shared_ptr, assign a new instance of shared_ptr to it.
unlike std::shared_ptr, seastar::shared_ptr allows us to move a value
into the existing value pointed by shared_ptr with operator=(). the
corresponding change in seastar is
319ae0b530.
but this is a little bit confusing, as the behavior of a shared_ptr
should look like that of a pointer instead of the value pointed to by
it. and this could be error-prone, because a user could write something like
```c++
p = std::string();
```
by accident, and expect that the value pointed to by `p` is cleared
and all copies of this shared_ptr are updated accordingly. what
they really want is:
```c++
*p = std::string();
```
and the code compiles, while the outcome of the statement is that
the pointee of `p` is destructed, and `p` now points to a new
instance of string with a new address. the copies of this
instance of shared_ptr still hold the old value.
this behavior is not expected. so, before deprecating and removing
this operator, let's stop using it.
in this change, we update two call sites of
`lw_shared_ptr::operator=(T&&)`. instead of creating a new
pointee of the pointer in-place, a new instance of lw_shared_ptr is
created, and is assigned to the existing shared_ptr.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
add a shebang line. so we can just launch
a minio_server using
```console
test/pylib/minio_server.py --host 127.0.0.1
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this would allow developers to run a minio server for testing, for
instance, s3_test, using something like:
```console
$ python3 test/pylib/minio_server.py --host 127.0.0.1
tempdir='/tmp/tmpfoobar-minio'
export S3_SERVER_ADDRESS_FOR_TEST=127.0.0.1
export S3_SERVER_PORT_FOR_TEST=900
export S3_PUBLIC_BUCKET_FOR_TEST=testbucket
```
and the developer is supposed to copy-and-paste the `export` commands
to prepare the environment variables for the test using the
minio server. the tempdir is used for the rundir of minio, and it
is also used for holding the log file of this tool. one might want
to check it when necessary.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).
We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
bootstrap procedure, before it sets up group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.
Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
feature enabled (which means we should start upgrading to use Raft),
or restarting in the middle of Raft upgrade procedure, schema pulls must
remain enabled until the Raft upgrade procedure finishes.
This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
perform a schema pull.
The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`, we just need
to set the bool. Also print a log message when schema pulls are
disabled.
Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.
Correct some typos in comments and update the comments.
Fixes #12870
Closes #14428
* github.com:scylladb/scylladb:
raft_group_registry: remove `has_group0()`
raft_group0_client: remove `using_raft()`
migration_manager: disable schema pulls when schema is Raft-managed
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key into the digest. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of the exact schema version requested by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d, it is more likely to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change introduces a cluster feature which, when enabled, will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes #4485.
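The expiry-insensitive digest rule described above can be sketched as follows. This is an illustrative Python model (partitions are `(key, rows_after_compaction)` pairs, hashing via sha256), not the actual schema digest code:

```python
import hashlib

def schema_digest(partitions, skip_empty=True):
    """partitions: list of (partition_key, rows_after_compaction).
    With skip_empty (the behavior gated by the new cluster feature), a
    partition whose content compacted away contributes nothing, so the
    digest no longer changes when expired tombstones are finally purged."""
    h = hashlib.sha256()
    for key, rows in partitions:
        if skip_empty and not rows:
            continue                 # ignore empty partitions entirely
        h.update(key.encode())       # old behavior fed this even when rows == []
        for r in rows:
            h.update(r.encode())
    return h.hexdigest()
```

With the old behavior, the digest depends on whether an all-tombstone partition has already been compacted away; with the new rule it does not.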
SELECT clause components (selectors) are currently evaluated during query execution
using a stateful class hierarchy. This state is needed to hold intermediate state while
aggregating over multiple rows. Because the selectors are stateful, we must re-create
them each query using a selector_factory hierarchy.
We'd like to convert all of this to the unified expression evaluation machinery, so we can
have just one grammar for expressions, and just one way to evaluate expressions, but
the statefulness makes this complex.
In commit 59ab9aac44 "(Merge 'functions: reframe aggregate functions in terms
of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving
their state to aggregate_function_selector::_accumulator, and therefore into the
class hierarchy we're addressing now. Another reason for keeping state is that selectors
that aren't aggregated capture the first value they see in a GROUP BY group.
Since expressions can't contain state directly, we break apart expressions that contain
aggregate functions into two: an inner expression that processes incoming rows within
a group, and an outer expression that generates the group's output. The two expressions
communicate via a newly introduced expression element: a temporary.
The problem of non-aggregated columns requiring state is solved by encapsulating
those columns in an internal aggregate function, called the "first" function.
In terms of performance, this series has little effect, since the common case of selectors
that only contain direct column references without transformations is evaluated via a fast
path (`simple_selection`). This fast-path is preserved with almost no changes.
While the series makes it possible to start to extend the grammar and unify expression
syntaxes, it does not do so. The grammar is unchanged. There is just one breaking change:
the `SELECT JSON` statement generates json object field names based on the input selectors.
In one case the name of the field has changed, but it is an esoteric case (where a function call
is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra.
Closes#14467
* github.com:scylladb/scylladb:
cql3: selection: drop selector_factories, selectables, and selectors
cql3: select_statement: stop using selector_factories in SELECT JSON
cql3: selection: don't create selector_factories any more
cql3: selection: collect column_definitions using expressions
cql3: selection: reimplement selection::is_aggregate()
cql3: selection: evaluate aggregation queries via expr::evaluate()
cql3: selection, select_statement: fine tune add_column_for_post_processing() usage
cql3: selection: evaluate non-aggregating complex selections using expr::evaluate()
cql3: selection: store primary key in result_set_builder
cql3: expression: fix field_selection::type interpretation by evaluate()
cql3: selection: make result_set_builder::current non-optional<>
cql3: selection: simplify row/group processing
cql3: selection: convert requires_thread to expressions
cql: selection: convert used_functions() to expressions
cql3: selection: convert is_reducible/get_reductions to expressions
cql3: selection: convert is_count() to expressions
cql3: selection convert contains_ttl/contains_writetime to work on expressions
cql3: selection: make simple_selectors stateless
cql3: expression: add helper to split expressions with aggregate functions
cql3: selection: short-circuit non-aggregations
cql3: selection: drop validate_selectors
cql3: select_statement: force aggregation if GROUP BY is used
cql3: select_statement: levellize aggregation depth
cql3: selection: skip first_function when collecting metadata
cql3: select_statement: explicitly disable automatic parallelization with no aggregates
cql3: expression: introduce temporaries
cql3: select_statement: use prepared selectors
cql3: selection: avoid selector_factories in collect_metadata()
cql3: expressions: add "metadata mode" formatter for expressions
cql3: selection: convert collect_metadata() to the prepared expression domain
cql3: selection: convert processes_selection to work on prepared expressions
cql3: selection: prepare selectors earlier
cql3: raw_selector: deinline
cql3: expression: reimplement verify_no_aggregate_functions()
cql3: expression: add helpers to manage an expression's aggregation depth
cql3: expression: improve printing of prepared function calls
cql3: functions: add "first" aggregate function
Will recreate schema_ptr's from schema tables like during table
alter. Will be needed when digest calculation changes in reaction to
cluster feature at run time.
SELECT JSON uses selector_factories to obtain the names of the
fields to insert into the json object, and we want to drop
selector_factories entirely. Switch instead to the ":metadata" mode
of printing expressions, which does what we want.
Unfortunately, the switch changes how system functions are converted
into field names. A function such as unixtimestampof() is now rendered
as "system.unixtimestampof()"; before it did not have the keyspace
prefix.
This is a compatibility problem, albeit an obscure one. Since the new
behavior matches Cassandra, and the odds of hitting this are very low,
I think we can allow the change.
The replica needs to know which columns we're interested in. Iterate
and recurse into all selector expressions to collect all mentioned columns.
We use the same algorithm that create_factories_and_collect_column_definitions()
uses, even though it is quadratic, to avoid causing surprises.
When constructing a selection_with_processing, split the
selectors into an inner loop and an outer loop with split_aggregation().
We can then reimplement add_input_row() and get_output_row() as follows:
- add_input_row(): evaluate the inner loop expressions and store
the results in temporaries
- get_output_row(): evaluate the outer loop expressions, pulling in
values from those temporaries.
reset(), which is called between groups, simply copies the initial
values gathered by split_aggregation() into the temporaries.
The only complexity comes from add_column_for_post_query_processing(),
which essentially re-does the work of split_aggregation(). It would
be much better if we added the column before split_aggregation() was
called, but some refactoring has to take place before that happens.
In three cases we need to consult a column that's possibly not explicitly
selected:
- for the WHERE clause
- for GROUP BY
- for ORDER BY
The return value of the function is the index where the newly-added
column can be found. Currently, the index is correct for both
the internal column vector and the result set, but soon it won't
be.
In the first two cases (WHERE clause and GROUP BY), we're interested
in the column before grouping; in the last case (ORDER BY) we're interested
in the column after grouping, so we need to distinguish between the two.
Since we already have selection::index_of() that returns the pre-grouping
index, choose the post-grouping index for the return value of
selection::add_column_for_post_processing(), and change the GROUP BY
code to use index_of(). Comments are added.
Now that everything is in place, implement the fast-path
transform_input_row() for selection_with_processing. It's a
straightforward call to evaluate() in a loop.
We adjust add_column_for_post_processing() to also update _selectors,
otherwise ORDER BY clauses that require an additional column will not
see that column.
Since every sub-class implements transform_input_row(), mark
the base class declaration as pure virtual.
expr::evaluate() expects an exploded primary key in its
evaluation_inputs structure (this dates back from the conversion
of filtering to expressions). But right now, the exploded primary
key is only available in the filter.
That's easy to fix however: move the primary key containers
to result_set_builder and just keep references in the filter.
After this, we can evaluate column_value expressions that
reference the primary key.
field_selection::type refers to the type of the selection operation,
not the type of the structure being selected. This is what
prepare_expression() generates and how all other expression elements
work, but evaluate() for field_selection thinks it's the type
of the structure, and so fails when it gets an expression
from prepare_expression().
Fix that, and adjust the tests.
Previously, we used the engagedness of the optional
result_set_builder::current as a flag, but the previous patch eliminated
that and it's always engaged. Remove the optional wrapper to reduce noise.
Processing a result set relies on calling result_set_builder::new_row().
This function is quite complex as it has several roles:
- complete processing of the previously computed row, if any
- determine if GROUP BY grouping has changed, and flush the previous group
if so
- flush the last group if that's the case
This works now, but won't work with expr::evaluate. The reason is that
new_row() is called after the partition key and clustering key of the
new row have been evaluated, so processing of the previous row will see
incorrect data. It works today because we copy the partition key and
clustering key into result_set_builder::current, but expr::evaluate
uses the exploded partition key and clustering key, which have been
clobbered.
The solution is to separate the roles. Instead of new_row() that's
responsible for completing the previous row and starting a new one,
we have start_new_row() that's responsible for what its name says,
and complete_row() that's responsible for completing the row and
checking for group change. The responsibility for flushing the final
group is moved to result_set_builder::build(). This removes the
awkward "more_rows_coming" parameter that makes everything more
complicated.
result_set_builder::current is still optional, but it's always
engaged. The next patch will clean that up.
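The split of roles described above can be modelled with a small sketch. This is a simplified Python illustration (rows as dicts, grouping by a key function), not the real result_set_builder:

```python
class ResultSetBuilder:
    """Model of the split: start_new_row() only starts a row,
    complete_row() finishes it and checks for a group change, and build()
    flushes the final group -- removing the old 'more_rows_coming' flag."""
    def __init__(self, group_key):
        self._group_key = group_key
        self._current = None
        self._last_group = None
        self._group_rows = []
        self._out = []

    def start_new_row(self, row):
        self._current = row

    def complete_row(self):
        key = self._group_key(self._current)
        if self._last_group is not None and key != self._last_group:
            self._flush_group()          # grouping changed: flush previous group
        self._last_group = key
        self._group_rows.append(self._current)
        self._current = None

    def _flush_group(self):
        self._out.append(list(self._group_rows))
        self._group_rows = []

    def build(self):
        if self._group_rows:
            self._flush_group()          # the final group is flushed here
        return self._out
```

Because completing the previous row happens before the next row's keys are evaluated, nothing gets clobbered between rows.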
used_functions() is used to check whether prepared statements need
to be invalidated when user-defined functions change.
We need to skip over empty scalar components of aggregates, since
these can be defined by users (with the same meaning as if the
identity function was used).
The current version of automatic query parallelization works when all
selectors are reducible (e.g. have a state_reduction_function member),
and all the inputs to the aggregates are direct column selectors without
further transformation. The actual column names and reductions need to
be packed up for forward_service to be used.
Convert is_reducible()/get_reductions() to the expression world. The
conversion is fairly straightforward.
contains_ttl/contains_writetime are two attributes of a selection. If a selection
contains them, we must ask the replica to send them over; otherwise we don't
have data to process. Not sending ttl/writetime saves some effort.
The implementation is a straightforward recursive descent using expr::find_in_expression.
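A recursive descent of this kind can be sketched over a toy expression tree. This is a hypothetical Python model (nodes are tuples: `("column", name)` or `("fn", name, [args])`), not the actual expr::find_in_expression API:

```python
def find_in_expression(expr, pred):
    """Return True if any node in the expression tree satisfies pred.
    A node is either a leaf ('column', name) or a call ('fn', name, [args])."""
    if pred(expr):
        return True
    if expr[0] == "fn":
        return any(find_in_expression(a, pred) for a in expr[2])
    return False

# Does the selection mention writetime() anywhere, however deeply nested?
def contains_writetime(expr):
    return find_in_expression(
        expr, lambda n: n[0] == "fn" and n[1] == "writetime")
```

The same shape of traversal answers contains_ttl, used_functions and similar selection attributes.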
Now that we push all GROUP BY queries to selection_with_processing,
we always process rows via transform_input_row() and there's no
reason to keep any state in simple_selectors.
Drop the state and raise an internal error if we're ever
called for aggregation.
Aggregate functions cannot be evaluated directly, since they implicitly
refer to state (the accumulator). To allow for evaluation, we
split the expression into two: an inner expression that is evaluated
over the input vector (once per element). The inner expression calls
the aggregation function, with an extra input parameter (the accumulator).
The outer expression is evaluated once per input vector; it calls
the final function, and its input is just the accumulator. The outer
expression also contains any expressions that operate on the result
of the aggregate function.
The accumulator is stored in a temporary.
Simple example:
sum(x)
is transformed into an inner expression:
t1 = (t1 + x) // really sum.aggregation_function
and an outer expression:
result = t1 // really sum.state_to_result_function
Complicated example:
scalar_func(agg1(x, f1(y)), agg2(x, f2(y)))
is transformed into two inner expressions:
t1 = agg1.aggregation_function(t1, x, f1(y))
t2 = agg2.aggregation_function(t2, x, f2(y))
and an outer expression
output = scalar_func(agg1.state_to_result_function(t1),
agg2.state_to_result_function(t2))
There's a small wart: automatically parallelized queries can generate
"reducible" aggregates that have no state_to_result function, since we
want to pass the state back to the coordinator. Detect that and short
circuit evaluation to pass the accumulator directly.
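The inner/outer split above can be modelled with a tiny sketch. This is an illustrative Python model of the evaluation scheme (temporaries as a dict, inner and outer expressions as plain functions), not the real expression machinery:

```python
def evaluate_aggregation(rows, inner, outer, initial):
    """Model of the split: 'inner' runs once per input row and updates the
    temporaries (add_input_row), 'outer' runs once per group and turns the
    temporaries into the output (get_output_row).  'initial' holds the
    starting values that reset() copies in between groups."""
    temps = dict(initial)            # reset(): copy initial values
    for row in rows:
        temps = inner(temps, row)    # inner expression, once per row
    return outer(temps)              # outer expression, once per group

# sum(x): the inner expression accumulates into t1,
# the outer expression just reads the accumulator out.
inner_sum = lambda t, row: {"t1": t["t1"] + row["x"]}
outer_sum = lambda t: t["t1"]
```

The "complicated example" in the text works the same way, just with one temporary per aggregate and an outer expression that combines them.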
Currently, selector evaluation assumes the most complex case
where we aggregate, so multiple input rows combine into one output row.
In effect the query either specifies an outer loop (for the group)
and an inner loop (for input rows), or it only specifies the inner loop;
but we always perform the outer and inner loop.
Prepare to have a separate path for the non-aggregation case by
introducing transform_input_row().
GROUP BY is typically used with aggregation. In one case the aggregation
is implicit:
SELECT a, b, c
FROM tab
GROUP BY x, y, z
One row will appear from each group, even though no aggregation
was specified. To avoid this irregularity, rewrite this query as
SELECT first(a), first(b), first(c)
FROM tab
GROUP BY x, y, z
This allows us to have different paths for aggregations and
non-aggregations, without worrying about this special case.
Avoid mixed aggregate/non-aggregate queries by inserting
calls to the first() function. This allows us to avoid internal
state (simple_selector::_current) and make selector evaluation
stateless apart from explicit temporaries.
We plan to rewrite aggregation queries that have a non-aggregating
selector using the first function, so that all selectors are
aggregates (or none are). Prevent the first function from affecting
metadata (the auto-generated column names), by skipping over the
first function if detected. The input and output types are unchanged,
so this only affects the name.
A query of the form `SELECT foo, count(foo) FROM tab` returns the first
value of the foo column along with the count. This can't be parallelized
today since the first selector isn't an aggregate.
We plan to rewrite the query internally as `SELECT first(foo), count(foo)
FROM tab`, in order to make the query more regular (no mixing of aggregates
and non-aggregates). However, this will defeat the current check since
after the rewrite, all selectors are aggregates.
Prepare for this by performing the check on a pre-rewrite variable, so
it won't be affected by the query rewrite in the next patch.
Note that even though we could add support for running
first() in parallel, it's not possible to get the correct results,
since first() is not commutative and we don't reduce in order. It's
also not a particularly interesting query.
Temporaries are similar to bind variables - they are values provided from
outside the expression. While bind variables are provided by the user, temporaries
are generated internally.
The intended use is for aggregate accumulator storage. Currently aggregates
store the accumulator in aggregate_function_selector::_accumulator, which
means the entire selector hierarchy must be cloned for every query. With
expressions, we can have a single expression object reused for many computations,
but we need a way to inject the accumulator into an aggregation, which this
new expression element provides.
Change one more layer of processing to work on prepared
rather than raw selectors. This moves the call to prepare
the selectors early in select_statement processing. In turn
this changes maybe_jsonize_select_clause() and forward_service's
mock_selection() to work in the prepared realm as well.
This moves us one step closer to using evaluate() to process
the select clause, as the prepared selectors are now available
in select_statement. We can't use them yet since we can't evaluate
aggregations.
When returning a result set (and when preparing a statement), we
return metadata about the result set columns. Part of that is the
column names, which are derived from the expressions used as selectors.
Currently, they are computed via selector::column_name(), but as
we're dismantling that hierarchy we need a different way to obtain
those names.
It turns out that the expression formatter is close enough to what
we need. To avoid disturbing the current :user mode, add a new
:metadata mode and apply the adjustments needed to bring it in line
with what column metadata looks like today.
Note that column metadata is visible to applications and they can
depend on it; e.g. the Python driver allows choosing columns based on
their names rather than ordinal position.
processes_selection() checks whether a selector passes through a column
or applies some form of processing (like a case or function application).
It's more sensible to do this in the prepared domain as we have more
information about the expression. It doesn't really help here, but
it does help the refactoring later in the series.
Currently, each selector expression is individually prepared, then converted
into a selector object that is later executed. This is done (on a vector
of raw selectors) by cql3::selection::raw_selector::to_selectables().
Split that into two phases. The first phase converts raw_selector into
a new struct prepared_selector (a better name would be plain 'selector',
but it's taken for now). The second phase continues the process and
converts prepared_selector into selectables.
This gives us a full view of the prepared expressions while we're
preparing the select clause of the select statement.
Most clauses in a CQL statement don't tolerate aggregate functions,
and so they call verify_no_aggregate_functions(). It can now be
reimplemented in terms of aggregation_depth(), removing some code.
We define the "aggregation depth" of an expression by how many
nested aggregation functions are applied. In CQL/SQL, legal
values are 0 and 1, but for generality we deal with any aggregation depth.
The first helper measures the maximum aggregation depth along any path
in the expression graph. If it's 2 or greater, we have something like
max(max(x)) and we should reject it (though these helpers don't). If
we get 1 it's a simple aggregation. If it's zero then we're not aggregating
(though CQL may decide to aggregate anyway if GROUP BY is used).
The second helper edits an expression to make sure the aggregation depth
along any path that reaches a column is the same. Logically,
`SELECT x, max(y)` does not make sense, as one is a vector of values
and the other is a scalar. CQL resolves the problem by defining x as
"the first value seen". We apply this resolution by converting the
query to `SELECT first(x), max(y)` (where `first()` is an internal
aggregate function), so both selectors refer to scalars that consume
vectors.
When a scalar is consumed by an aggregate function (for example,
`SELECT max(x), min(17)`), we don't have to bother, since a scalar
is implicitly promoted to a vector by evaluating it for every row. There
is some ambiguity if the scalar is a non-pure function (e.g.
`SELECT max(x), min(random())`), but it's not worth pursuing.
A small unit test is added.
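The first helper can be sketched as a recursive walk over the expression tree (the `expr` type below is invented for illustration; the real helper works on Scylla's expression variant):

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical sketch of the first helper: the aggregation depth of an
// expression is the maximum number of nested aggregate function calls
// along any path from the root to a leaf.
struct expr {
    bool is_aggregate = false;
    std::vector<std::shared_ptr<expr>> children;
};

unsigned aggregation_depth(const expr& e) {
    unsigned child_max = 0;
    for (const auto& c : e.children) {
        child_max = std::max(child_max, aggregation_depth(*c));
    }
    // an aggregate node adds one level on top of its deepest child
    return child_max + (e.is_aggregate ? 1 : 0);
}
```

A caller would then reject depth >= 2 (e.g. `max(max(x))`), treat 1 as a simple aggregation, and 0 as no aggregation.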
Currently, a prepared function_call expression is printed as an
"anonymous function", but it's not really anonymous - the name is
available. Print it out.
This helps in a unit test later on (and is worthwhile by itself).
Split the long-running test
test_memtable_with_many_versions_conforms_to_mutation_source into two
tests, for _plain and _reverse.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14447
The formatter for sstables::generation_type does not support the "d"
specifier, so we should not use "{:d}" for printing it. This worked
before d7c90b5239, but after that change, generation_type is no longer
an alias of int64_t, and its formatter does not support "d". So we
should either specialize fmt::formatter<generation_type> to support it
or just drop the specifier.
Since seastar::format() uses
```c++
fmt::format_to(fmt::appender(out), fmt::runtime(fmt), std::forward<A>(a)...);
```
to print the arguments with the given format string, we cannot catch
this kind of error at compile time.
At runtime, if we hit an issue like this, {fmt} throws an exception
like:
```
terminate called after throwing an instance of 'fmt::v9::format_error'
what(): invalid format specifier
```
when constructing the `std::runtime_error` instance.
So, in this change, the "d" specifier is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14427
Also, take this opportunity to make `handle_mutation_fragment()` return void, for better readability.
Closes #14258
* github.com:scylladb/scylladb:
repair: do not check retval of handle_mutation_fragment()
repair: coroutinize move_row_buf_to_working_row_buf()
repair: coroutinize read_rows_from_disk()
repair: coroutinize get_sync_boundary()
first(x) returns the first x it sees in the group. This is useful
for SELECT clauses that return a mix of aggregates and non-aggregates,
for example
SELECT max(x), x
with inputs of x = { 1, 2, 3 } is expected to return (3, 1).
Currently, this behavior is handled by individual selectors,
which means they need to contain extra state for this, which
cannot be easily translated to expressions. The new first function
allows translating the SELECT clause above to
SELECT max(x), first(x)
so all selectors are aggregations and can be handled in the same
way.
The first() function is not exposed to users.
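A minimal sketch of first()'s semantics as an aggregate (names invented here): the accumulator remembers only the first value seen in the group, and the final function is the identity:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch of first(): an aggregate whose state is the first
// value seen in the group; every later value is ignored.
std::optional<int64_t> first_aggregate(const std::vector<int64_t>& group) {
    std::optional<int64_t> state;      // accumulator: empty until first row
    for (int64_t x : group) {
        if (!state) {
            state = x;                 // keep the first value only
        }
    }
    return state;                      // state_to_result is the identity
}
```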
Boost's zip_iterator implements operator==() to return true only if
all the iterators in the enclosed tuple are equal, and the zip_iterator
always advances all the iterators in the tuple. But in our case, some
components might be missing; in other words, the size of the
`components` range might be smaller than that of the `types` range. So,
when the zip_iterator advances past the end of the components, Scylla
starts reading out of bounds.
Because zip_iterator does not allow us to customize how it implements
the equality operator, and we cannot deduce the size of the components
without reading all of them, this change partially reverts
3738fcbe05: instead of using fmt::join(), we just iterate through the
components manually. This avoids the out-of-bounds reads and also
preserves the original behavior.
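The shape of the fix can be sketched like this (function and formatting are invented for illustration): the loop is bounded by the shorter `components` range, so it never advances past its end even though `types` may be longer:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: iterate the components manually instead of zipping
// them with types; the loop bound is the size of the (possibly shorter)
// components range, so no out-of-bounds read can happen.
std::string format_components(const std::vector<std::string>& types,
                              const std::vector<std::string>& components) {
    std::string out;
    for (size_t i = 0; i < components.size() && i < types.size(); ++i) {
        if (i > 0) {
            out += ":";
        }
        out += types[i] + "=" + components[i];
    }
    return out;
}
```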
Branches: 5.3
Fixes #14435
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14457
db::config is pretty large (~32k) and there are four of them, blowing the stack. Fix by
allocating them on the heap.
It's not clear why this shows up on my system (clang 16) and not in the frozen toolchain.
Perhaps clang 16 is less able to reuse stack space.
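The fix pattern is simply to move the large objects behind a pointer (the type below is invented to mirror the ~32k size, not the real db::config):

```cpp
#include <cassert>
#include <memory>

// Hypothetical sketch: a large object (~32k, like db::config) is allocated
// on the heap instead of the stack, so several live instances no longer
// blow the stack.
struct big_config {
    char blob[32 * 1024] = {};
};

std::unique_ptr<big_config> make_config() {
    return std::make_unique<big_config>();  // heap allocation, tiny stack cost
}
```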
Closes #14464
In our CQL documentation, there was no information that the type can
be omitted in a DESCRIBE statement.
Added this information along with the order in which the element is
looked up.
So far generic describe (`DESC <name>`) followed Cassandra
implementation and it only described keyspace/table/view/index.
This commit adds UDT/UDF/UDA to generic describe.
Fixes: #14170
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.
In some unlucky circumstances, a stack overflow can occur:
- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
combined reader (~500),
- All of the above needs to happen within a single task quota.
To prevent such a situation, the code of both reader merger classes
was modified not to perform recursion at all. Most of the code of
operator()() was moved to maybe_produce_batch(), which does not recur:
if it cannot produce a fragment, it returns std::nullopt, and
operator()() calls this method in a loop via
seastar::repeat_until_value.
A regression test is added.
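The restructuring can be sketched as follows (names simplified and invented; the real code returns futures and respects preemption): the helper signals "no progress" with std::nullopt and the caller loops, keeping stack depth constant no matter how many readers must be advanced:

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch: instead of operator()() recursing until a fragment
// is produced, maybe_produce_batch() returns std::nullopt when it cannot
// make progress, and the caller retries in a loop.
std::optional<int> maybe_produce_batch(int& readers_left) {
    if (readers_left > 0) {
        --readers_left;        // e.g. open or advance another reader
        return std::nullopt;   // no fragment yet; caller will retry
    }
    return 42;                 // finally produced a "fragment"
}

int produce_fragment(int readers_left) {
    // stands in for seastar::repeat_until_value
    while (true) {
        if (auto v = maybe_produce_batch(readers_left)) {
            return *v;
        }
    }
}
```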
Fixes: scylladb/scylladb#14415
Closes #14452
We want to disable `migration_manager` schema pulls and make schema
managed only by Raft group 0 if Raft is enabled. This will be important
with Raft-based topology, when schema will depend on topology (e.g. for
tablets).
We solved the problem partially in PR #13695. However, it's still
possible for a bootstrapping node to pull schema in the early part of
bootstrap procedure, before it sets up group 0, because of how the
currently used `_raft_gr.using_raft()` check is implemented.
Here's the list of cases:
- If a node is bootstrapping in non-Raft mode, schema pulls must remain
enabled.
- If a node is bootstrapping in Raft mode, it should never perform a
schema pull.
- If a bootstrapped node is restarting in non-Raft mode but with Raft
feature enabled (which means we should start upgrading to use Raft),
or restarting in the middle of Raft upgrade procedure, schema pulls must
remain enabled until the Raft upgrade procedure finishes.
This is also the case of restarting after RECOVERY.
- If a bootstrapped node is restarting in Raft mode, it should never
perform a schema pull.
The `raft_group0` service is responsible for setting up Raft during boot
and for the Raft upgrade procedure. So this is the most natural place to
make the decision that schema pulls should be disabled. Instead of
trying to come up with a correct condition that fully covers the above
list of cases, store a `bool` inside `migration_manager` and set it from
`raft_group0` function at the right moment - when we decide that we
should boot in Raft mode, or restart with Raft, or upgrade. Most of the
conditions are already checked in `setup_group0_if_exist`, we just need
to set the bool. Also print a log message when schema pulls are
disabled.
Fix a small bug in `migration_manager::get_schema_for_write` - it was
possible for the function to mark schema as synced without actually
syncing it if it was running concurrently to the Raft upgrade procedure.
Correct some typos in comments and update the comments.
Fixes #12870
Modify task_manager::task::impl::get_progress method so that,
whenever relevant, progress is calculated based on children's
progress. Otherwise progress indicates only whether the task
is finished or not.
The method may be overridden in inheriting classes.
Closes #14381
* github.com:scylladb/scylladb:
tasks: delete task_manager::task::impl::_progress as it's unused
tasks: modify task_manager::task::impl::get_progress method
tasks: add is_complete method
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).
This PR fixes a bug and must be backported to branch-5.3, branch-5.2, and branch-5.1.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
- [x] 5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x] 4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)
Closes #14444
* github.com:scylladb/scylladb:
doc: fix rollback in 4.3-to-2021.1 upgrade guide
doc: fix rollback in 5.0-to-2022.1 upgrade guide
doc: fix rollback in 5.1-to-2022.2 upgrade guide
Fixes https://github.com/scylladb/scylladb/issues/14033
This PR:
- replaces the OUTDATED list of platforms supported by Unified Installer with a link to the "OS Support" page. In this way, the list of supported OSes will be documented in one place, preventing outdated documentation.
- improves the language and syntax, including:
- Improving the wording.
- Replacing "Scylla" with "ScyllaDB"
- Fixing language mistakes
- Fixing heading underline so that the headings render correctly.
Closes #14445
* github.com:scylladb/scylladb:
doc: update the language - Unified Installer page
doc: update Unified Installer support
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.
In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.
To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
It's unnecessary to do this for a cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.
Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).
Fixes: #14066
Closes #14336
This series aims at hardening schema merges and preventing inconsistencies across shards by
updating the database shards before calling the notification callback.
As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux.
In addition, any error while updating the keyspace will cause an abort, so as not to leave the database shards in an inconsistent state.
Other changes optimize this path by:
- updating shard 0 first, to seed the effective_replication_map.
- executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change.
Fixes #13137
Closes #14158
* github.com:scylladb/scylladb:
migration_manager: propagate listener notification exceptions
storage_service: keyspace_changed: execute only on shard 0
database: modify_keyspace_on_all_shards: execute func first on shard 0
database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
database: add modify_keyspace_on_all_shards
schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
database: create_keyspace_on_all_shards
database: update_keyspace_on_all_shards
database: drop_keyspace_on_all_shards
This commit improves the language and syntax on
the Unified Installer page. The changes cover:
- Improving the wording.
- Replacing "Scylla" with "ScyllaDB"
- Fixing language mistakes
- Fixing heading underline so that the headings
render correctly.
This commit replaces the OUTDATED list of platforms supported
by Unified Installer with a link to the "OS Support" page.
In this way, the list of supported OSes will be documented
in one place, preventing outdated documentation.
Modify task_manager::task::impl::get_progress method so that,
whenever relevant, progress is calculated based on children's
progress. Otherwise progress indicates only whether the task
is finished or not.
Reduce test string value size, parallelize inserts, and use a prepared
statement. The debug running time for this test is reduced from 13:18
to 7:52.
Refs #13905
Closes #14380
* github.com:scylladb/scylladb:
test/boost/index_with_paging_test: parallel insert
test/boost/index_with_paging_test: prepared statement
test/boost/index_with_paging_test: reduce running time
handle_mutation_fragment() does not return `stop_iteration::yes`
anymore after fbbc86e18c, so let's stop checking its return value and
make it return void.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.
As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.
This should improve CI stability.
Fixes: #12972
Refs: #14042
Closes #14354
* github.com:scylladb/scylladb:
messaging_service: print which connections are dropped due to missing topology info
storage_service: wait for nodes to be UP on bootstrap
storage_service: wait for NORMAL state handler before `setup_group0()`
storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
A GROUP BY combined with aggregation should produce a single
row per group, except for empty groups. This is in contrast
to an aggregation without GROUP BY, which produces a single
row no matter what.
The existing code only considered the case of no grouping
and forced a row into the result, but this caused an unwanted
row if grouping was used.
Fix by refining the check to also consider GROUP BY.
XFAIL tests are relaxed.
Fixes #12477.
Note, forward_service requires that aggregation produce
exactly one row, but since it can't work with grouping,
it isn't affected.
Closes #14399
Since most group 0 commands are just mutations, it is easy and more
efficient to combine them before passing them to the subsystem they are
destined for. The logic that handles those mutations in a subsystem
will then run once per batch of commands instead of once per individual
command. This is especially useful when a node catches up to a leader
and receives a lot of commands at once.
The patch here does exactly that. It combines commands into a single
command when possible, but it preserves the order between commands: each
time it encounters a command for a different subsystem, it flushes the
already combined batch and starts a new one. This extra safety assumes
that there are dependencies between the subsystems managed by group 0,
so the order matters. That may not be the case now, but we prefer to be
on the safe side.
Broadcast table commands are not mutations, so they are never combined.
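The merging rule described above can be sketched like this (the `command` type and string-keyed subsystems are invented for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: consecutive commands destined for the same subsystem
// are combined into one batch; a command for a different subsystem flushes
// the running batch first, preserving the order between subsystems.
struct command {
    std::string subsystem;
    int payload;
};

std::vector<std::vector<command>> merge_commands(const std::vector<command>& cmds) {
    std::vector<std::vector<command>> batches;
    for (const auto& c : cmds) {
        if (batches.empty() || batches.back().back().subsystem != c.subsystem) {
            batches.emplace_back();    // flush: start a batch for the new subsystem
        }
        batches.back().push_back(c);   // combine with the running batch
    }
    return batches;
}
```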
* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
test: add test for group0 raft command merging
service: raft: respect max mutation size limit when persisting raft entries
group0_state_machine: merge commands before applying them whenever possible
This connection dropping caused us to spend a lot of time debugging.
Those debugging sessions would be shorter if Scylla logs indicated that
connections are being dropped and why.
Connection drops for a given node are a one-time event - we only do it
if we establish a connection to a node without topology info, which
should only happen before we handle the node's NORMAL status for the
first time. So it's a rare thing and we can log it on INFO level without
worrying about log spam.
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.
Refs: #14042
This doesn't completely fix #14042 yet because it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
We do it only in non-raft-topology mode, because with Raft-based
topology, node state changes are propagated to the cluster through
explicit global barriers and we plan to remove node statuses from
gossiper altogether.
Fixes: #12972
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.
This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.
Refs: https://github.com/scylladb/scylla-enterprise/issues/3046
Closes #14373
Parallelize inserts for long-running test_index_with_paging.
Run time in debug mode reduced by 1 minute 48 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Reduce the test string value size for test_index_with_paging from 4096
to 100. With 100 bytes, the base row should still be significantly
larger than the key, so the test will exercise both types of paging in
the scanning code.
The debug running time for this test is reduced from 9 minutes to 6
minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.
This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```
Cassandra rejects such queries at the preparation stage: it doesn't allow two `EQ` restrictions on the same clustering column when an IF is involved.
We reject them at runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?` and then run it, but it will unexpectedly throw an `invalid_request_exception` when the two bound variables are different.
We could ban such queries as well; we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change.
A better solution would be to allow empty ranges in `LWT` statements: when an empty range is detected, we just wouldn't apply the change. That would be a larger change; for now, let's just fix the crash.
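The shape of the runtime guard can be sketched like this (types and the message are simplified/invented; the real check lives in the modification statement's condition path):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: validate that the clustering ranges are non-empty
// before touching ranges.front(), mirroring the existing partition-key
// check, instead of segfaulting.
struct clustering_range {
    int lo, hi;
};

const clustering_range& checked_front(const std::vector<clustering_range>& ranges) {
    if (ranges.empty()) {
        // reject at runtime instead of crashing on ranges.front()
        throw std::invalid_argument("No clustering ranges found for the conditional statement");
    }
    return ranges.front();
}
```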
Fixes: https://github.com/scylladb/scylladb/issues/13129
Closes #14429
* github.com:scylladb/scylladb:
modification_statement: reject conditional statements with empty clustering key
statements/cas_request: fix crash on empty clustering range in LWT
This piece of `storage_service::wait_for_ring_to_settle()` will be
performed earlier in the boot procedure in follow-up commits.
Make it more generic, to be able to wait for `n` nodes to show up. Here
we wait for `2` nodes - ourselves and at least one other.
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In the lambda
passed to that method, we use both the sharded sstable_directory
service and its local instance.
To make it clear that the sharded and local instances are dependent,
we call sharded<replica::database>::invoke_on_all instead and access
the local directory through the sharded one.
As a preparation for integrating resharding compaction with task manager
a struct and some functions are copied from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
`modification_statement::execute_with_condition` validates that a query with
an IF condition can be executed correctly.
There's already a check for empty partition key ranges, but there was no check
for empty clustering ranges.
Let's add a check for the clustering ranges as well, they're not allowed to be empty.
After this change Scylla outputs the same type of message for empty partition and
clustering ranges, which improves UX.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
LWT queries with empty clustering range used to cause a crash.
For example in:
```cql
UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3
```
The range of `c` is empty - there are no valid values.
This caused a segfault when accessing the `first` range:
```c++
op.ranges.front()
```
To fix it, let's throw an exception when the clustering range
is empty. Cassandra also rejects queries with `c = 1 AND c = 2`.
There's also a check for an empty partition range; it used to crash
in the past, so it can't hurt to add it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.
There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to detect this case.
After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.
Fixes #13491
Closes #14375
* github.com:scylladb/scylladb:
readers: evictable_reader: don't accidentally consume the entire partition
test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position
This script provides a tool to decode a base36-encoded timeuuid
to the underlying msb and lsb bits, and to encode msb and lsb
to a base36 string.
Both Scylla and Cassandra 4.x support this new SSTable identifier used
in SSTable names, like "nb-3fw2_0tj4_46w3k2cpidnirvjy7k-big-Data.db".
Since this is a new way to print a timeuuid, and, unlike the
representation defined by RFC 4122, it is not straightforward to
connect the in-memory representation (0x6636ac00da8411ec9abaf56e1443def0)
to its string representation in SSTable identifiers, like
"3fw2_0tj4_46w3k2cpidnirvjy7k", it is handy to have this tool to
encode/decode the number/string for debugging purposes.
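As a rough sketch of the encoding direction (the digit alphabet and ordering here are a common base36 convention, assumed for illustration rather than taken from the script):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of base36 encoding: repeatedly divide by 36 and map
// the remainders to the digits [0-9a-z], most significant digit first.
std::string to_base36(uint64_t v) {
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    std::string out;
    do {
        out.push_back(digits[v % 36]);
        v /= 36;
    } while (v != 0);
    std::reverse(out.begin(), out.end());  // digits were produced in reverse
    return out;
}
```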
For more context on the new SSTable identifier, please
see
https://cassandra.apache.org/_/blog/Apache-Cassandra-4.1-New-SSTable-Identifiers.html
and https://issues.apache.org/jira/browse/CASSANDRA-17048
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14374
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the previous buffer-fill. Otherwise, the
reader could get stuck in an infinite loop between buffer fills, if the
reader is evicted in-between.
The code guaranteeing this forward progress had a bug: the comparison
between the position after the last buffer-fill and the current
last fragment position was done in the wrong direction.
So if the condition that we wanted to achieve was already true, we would
continue filling the buffer until partition end which may lead to OOMs
such as in #13491.
There was already a fix in this area to handle `partition_start`
fragments correctly - #13563 - but it missed that the position
comparison was done in the wrong order.
Fix the comparison and adjust one of the tests (added in #13563) to
detect this case.
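A toy model of why the direction of the comparison matters (hypothetical names, not the actual reader code):

```python
# Hedged sketch of the forward-progress check: keep filling until the
# buffer's last fragment position exceeds the position reached by the
# previous fill. With the comparison reversed, the "made progress"
# condition never holds and the whole partition is consumed.
def fill_buffer(fragments, start, prev_last_pos, buf_size, buggy=False):
    buf, i = [], start
    while i < len(fragments):
        buf.append(fragments[i])
        i += 1
        made_progress = (prev_last_pos > fragments[i - 1]) if buggy \
                        else (fragments[i - 1] > prev_last_pos)
        if len(buf) >= buf_size and made_progress:
            break
    return buf

frags = list(range(10))
# Correct direction: the fill stops as soon as the buffer is full and
# has moved past the previous fill's last position.
assert fill_buffer(frags, 0, prev_last_pos=-1, buf_size=3) == [0, 1, 2]
# Reversed direction: the fill never detects progress and consumes
# every fragment up to partition end (the OOM scenario).
assert fill_buffer(frags, 0, prev_last_pos=-1, buf_size=3, buggy=True) == frags
```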
Fixes #13491
test_range_tombstones_v2 is too strict for this reader -- it expects a
particular sequence of `range_tombstone_change`s, but
multishard_combining_reader, when tested with a small buffer, may
generate -- as expected -- additional (redundant) range tombstone change
pairs (end+start).
Currently we don't observe these redundant fragments due to a bug in
`evictable_reader_v2` but they start appearing once we fix the bug and
the test must be prepared first.
To prepare the test, modify `flat_reader_assertions_v2` so it squashes
redundant range tombstone change pairs. This happens only in non-exact
mode.
Enable exact mode in `test_sstable_reversing_reader_random_schema` for
comparing two readers -- the squashing of `r_t_c`s may introduce an
artificial difference.
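A simplified sketch of the squashing the assertions now perform (positions and tombstones here are stand-ins for the real fragment types):

```python
# Hedged sketch: two range_tombstone_change fragments at the same
# position are redundant; squash them, letting the later one win.
def squash(changes):
    out = []
    for pos, tomb in changes:
        if out and out[-1][0] == pos:
            out[-1] = (pos, tomb)  # later change at the same position wins
        else:
            out.append((pos, tomb))
    return out

# A redundant end+start pair at position 5 collapses into one change.
assert squash([(1, "t1"), (5, None), (5, "t1"), (9, None)]) == \
       [(1, "t1"), (5, "t1"), (9, None)]
```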
Add a test that submits 3 large commands, each a little larger
than 1/3 of the maximum mutation size. Check that in the end 2 commands
were executed (the first 2 were merged and the third was executed separately).
The code that preserves raft entries builds one batch statement to store
all of them, but the batch statement's execute() merges all of the
statements into one mutation and passes it to the database. The mutation
can be larger than the max mutation size limit, and the write will fail.
Fix it by splitting the write into multiple batch statements if needed.
Since most group0 commands are just mutations, it is easy to combine them
before passing them to the subsystem they are destined to, which is more
efficient: the logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.
The patch here does exactly that. It combines commands into a single
command if possible, but it preserves the order between commands, so each
time it encounters a command for a different subsystem it flushes the
already combined batch and starts a new one. This extra safety assumes
that there are dependencies between subsystems managed by group0, so the
order matters. This may not be the case now, but we prefer to be on the
safe side.
Broadcast table commands are not mutations, so they are never combined.
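The combining rule described above can be sketched as follows (the subsystem tags and sizes are illustrative, not the real command format):

```python
# Hedged sketch: merge consecutive commands for the same subsystem,
# flushing a batch when the subsystem changes or when adding a command
# would exceed the maximum mutation size.
def combine(commands, max_size):
    batches, cur, cur_size, cur_sub = [], [], 0, None
    for sub, size in commands:
        if cur and (sub != cur_sub or cur_size + size > max_size):
            batches.append(cur)
            cur, cur_size = [], 0
        cur.append((sub, size))
        cur_size += size
        cur_sub = sub
    if cur:
        batches.append(cur)
    return batches

# Three commands, each a bit over 1/3 of the limit: the first two merge,
# the third goes into its own batch -- two executions in total.
batches = combine([("raft", 35), ("raft", 35), ("raft", 35)], max_size=100)
assert len(batches) == 2
```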
Fixes: #12581
… -> ScyllaDB University
Closes #14385
* github.com:scylladb/scylladb:
Update docs/operating-scylla/procedures/backup-restore/index.rst
Fixing broken links to ScyllaDB University lessons, Scylla University -> ScyllaDB University
Fixes#10099
Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator, will extract roles from TLS authentication certificate (not wire cert - those are server side) subject, based on configurable regex.
Example:
scylla.yaml:
```
authenticator: com.scylladb.auth.CertificateAuthenticator
auth_superuser_name: <name>
auth_certificate_role_query: CN=([^,\s]+)
client_encryption_options:
  enabled: True
  certificate: <server cert>
  keyfile: <server key>
  truststore: <shared trust>
  require_client_auth: True
```
In a client, then use a certificate signed with the <shared trust> store as auth cert, with the common name <name>. I.e. for cqlsh set "usercert" and "userkey" to these certificate files.
No user/password needs to be sent, but the role will be picked up from the auth certificate. If none is present, the transport will reject the connection. If the certificate subject does not contain a recognized role name (from config or set in tables), the authenticator mechanism will reject it.
Otherwise, the connection assumes the role described.
To facilitate this, this also contains the addition of allowing setting super user name + salted passwd via command line/conf + some tweaks to SASL part of connection setup.
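The role-extraction step can be illustrated with the configured regex. This is a sketch only; the real authenticator operates on the parsed certificate subject inside the C++ TLS layer:

```python
import re

# Hedged sketch: apply the configured auth_certificate_role_query regex
# to the client certificate's subject and use the first capture group
# as the role name; no match means the connection is rejected.
ROLE_QUERY = re.compile(r"CN=([^,\s]+)")

def role_from_subject(subject: str):
    m = ROLE_QUERY.search(subject)
    return m.group(1) if m else None

assert role_from_subject("O=ScyllaDB, CN=cassandra") == "cassandra"
assert role_from_subject("O=ScyllaDB") is None
```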
Closes #12214
* github.com:scylladb/scylladb:
docs: Add documentation of certificate auth + auth_superuser_name
auth: Add TLS certificate authenticator
transport: Try to do early, transport based auth if possible
auth: Allow for early (certificate/transport) authentication
auth: Allow specifying initial superuser name + passwd (salted) in config
roles-metadata: Coroutinize some helpers
Otherwise regular compaction can sneak in and
see !cs.sstables_requiring_cleanup.empty() with
cs.owned_ranges_ptr == nullptr and trigger
the internal error in `compaction_task_executor::compact_sstables`.
Fixes scylladb/scylladb#14296
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14297
View building from staging creates a reader from scratch (memtable
\+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.
perf shows that the reader creation is very expensive:
```
+ 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+ 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+ 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare
+ 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare
+ 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+ 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping
+ 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment
+ 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+ 2.07% 2.07% reactor-3 scylla [.] logalloc::region_impl::free
+ 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+ 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64
+ 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe
+ 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+ 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small
+ 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects
+ 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run
+ 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate
+ 1.19% 0.05% reactor-3 libc.so.6 [.] syscall
+ 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+ 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert
```
That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).
The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.
This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.
This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.
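The idea can be sketched as follows (hypothetical names; in the real code the predicate is passed down to the sstable set's reader factories):

```python
# Hedged sketch: instead of rebuilding the sstable set without staging
# sstables for every partition, create the reader from the full set
# with a predicate that filters staging sstables out at read time.
class SSTable:
    def __init__(self, name, staging):
        self.name, self.staging = name, staging

def make_reader(sstables, predicate):
    # stand-in for creating a single-key or range-scan reader over
    # only the sstables that pass the predicate
    return [s.name for s in sstables if predicate(s)]

ssts = [SSTable("a", False), SSTable("b", True), SSTable("c", False)]
assert make_reader(ssts, lambda s: not s.staging) == ["a", "c"]
```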
With this improvement, view building was measured to be 3x faster.
from
`INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s`
to
`INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s`
Refs https://github.com/scylladb/scylladb/issues/14089.
Fixes scylladb/scylladb#14244.
Closes #14364
* github.com:scylladb/scylladb:
table: Optimize creation of reader excluding staging for view building
view_update_generator: Dump throughput and duration for view update from staging
utils: Extract pretty printers into a header
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Very helpful for users to understand how fast view update generation
is processing the staging sstables. Today, the logs are completely
silent on that. It's not uncommon for operators to peek into the
staging dir and deduce the throughput based on the removal of files,
which is terrible.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It seems like the current 1-second TTL is too
small for debug build on aarch64 as seen in
https://jenkins.scylladb.com/job/scylla-master/job/build/1513/artifact/testlog/aarch64/debug/cql-pytest.test_using_timestamp.1.log
```
k = unique_key_int()
cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v1}) USING TIMESTAMP {ts} and TTL 1")
cql.execute(f"INSERT INTO {table} (k, v) VALUES ({k}, {v2}) USING TIMESTAMP {ts}")
> assert_value(k, v1)
test/cql-pytest/test_using_timestamp.py:140:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
k = 10, expected = 2
def assert_value(k, expected):
select = f"SELECT k, v FROM {table} WHERE k = {k}"
res = list(cql.execute(select))
> assert len(res) == 1
E assert 0 == 1
E + where 0 = len([])
```
Increase the TTL used to write data to de-flake the test
on slow machines running debug build.
Ref #14182
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14396
1e29b07e40 claimed
to make event notification exception safe,
but swallowing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards. Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.
We keep notifying other listeners even if any of them failed.
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```
The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/event_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).
Refs #3389
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.
Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When creating or altering a keyspace, we create a new
effective_replication_map instance.
It is more efficient to do that first on shard 0
and then on all other shards, otherwise multiple
shards might need to calculate the new e_r_m (and reach
the same result). When the new e_r_m is "seeded" on
shard 0, other shards will find it there and clone
a local copy of it - which is more efficient.
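A minimal model of the seeding pattern (names are illustrative, not the actual e_r_m API):

```python
# Hedged sketch: compute the expensive value once on shard 0, then
# have every other shard clone the seed instead of recomputing it.
def update_all_shards(shards, compute, clone):
    seed = compute()              # expensive calculation, done once
    shards[0] = seed
    for i in range(1, len(shards)):
        shards[i] = clone(seed)   # cheap per-shard copy of the seed
    return shards

calls = []
def compute():
    calls.append(1)
    return {"rf": 3}

shards = update_all_shards([None] * 4, compute, dict)
# the calculation ran exactly once, yet every shard got the result
assert len(calls) == 1
assert all(s == {"rf": 3} for s in shards)
```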
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When creating, updating, or dropping keyspaces,
first execute the database internal function to
modify the database state, and only when all shards
are updated, run the listener notifications,
to make sure they would operate when the database
shards are consistent with each other.
Fixes #13137
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Run all keyspace create/update/drop ops
via `modify_keyspace_on_all_shards` that
will standardize the execution on all shards
in the coming patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similar to create_keyspace_on_all_shards,
`extract_scylla_specific_keyspace_info` and
`create_keyspace_from_schema_partition` can be called
once in the upper layer, passing keyspace_metadata&
down to database::update_keyspace_on_all_shards
which now would only make the per-shard
keyspace_metadata from the reference it gets
from the schema_tables layer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes#10099
Adds the com.scylladb.auth.CertificateAuthenticator type. If set as authenticator,
will extract roles from TLS authentication certificate (not wire cert - those are
server side) subject, based on configurable regex.
Example:
scylla.yaml:
authenticator: com.scylladb.auth.CertificateAuthenticator
auth_superuser_name: <name>
auth_certificate_role_queries:
  - source: SUBJECT
    query: CN=([^,\s]+)
client_encryption_options:
  enabled: True
  certificate: <server cert>
  keyfile: <server key>
  truststore: <shared trust>
  require_client_auth: True
In a client, then use a certificate signed with the <shared trust>
store as auth cert, with the common name <name>. I.e. for cqlsh
set "usercert" and "userkey" to these certificate files.
No user/password needs to be sent, but the role will be picked up
from the auth certificate. If none is present, the transport will
reject the connection. If the certificate subject does not
contain a recognized role name (from config or set in tables),
the authenticator mechanism will reject it.
Otherwise, connection becomes the role described.
Instead of locking this to "cassandra:cassandra", allow setting in scylla.yaml
or commandline. Note that config values become redundant as soon as auth tables
are initialized.
Currently, when creating a table, permissions may be mistakenly
granted to the user even if the table already exists. This
can happen in two cases:
1. The query has an IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
2. The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
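A sketch of the resulting rule (illustrative types, not the actual result_message hierarchy):

```python
# Hedged sketch: permissions are granted only when the CREATE actually
# changed the schema, i.e. the result is a schema_change message --
# not an "already exists" no-op under IF NOT EXISTS, and not a
# bounce_to_shard redirect to shard 0.
class Msg: ...
class SchemaChange(Msg): ...
class BounceToShard(Msg): ...

def maybe_grant(result, grant):
    if isinstance(result, SchemaChange):
        grant()
        return True
    return False

granted = []
assert maybe_grant(SchemaChange(), lambda: granted.append(1)) is True
assert maybe_grant(BounceToShard(), lambda: granted.append(1)) is False
assert granted == [1]   # granted exactly once, for the real creation
```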
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch
Refs #2252, #6033
Closes #12917
By making all changes on temporary variables
and eventually moving them back into the tracker members
in a noexcept block the function can safely throw
until the changes are committed.
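The pattern can be sketched as follows (illustrative, not the actual tracker code):

```python
# Hedged sketch of the strong-exception-safety pattern: all throwing
# work happens on a temporary; the commit is a plain assignment that
# cannot fail, so the tracker is never left half-updated.
def refresh(tracker, compute_contribution):
    new_value = compute_contribution(tracker)  # may throw; tracker untouched
    tracker["contribution"] = new_value        # commit step: cannot throw
    return tracker

def failing(_):
    raise RuntimeError("simulated allocation failure")

t = {"contribution": 1}
try:
    refresh(t, failing)
except RuntimeError:
    pass
assert t == {"contribution": 1}                 # unchanged after a throw
assert refresh(t, lambda _: 2) == {"contribution": 2}
```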
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of providing refresh_sstables_backlog_contribution
that updates the tracker in place, provide a static function
calculate_sstables_backlog_contribution that doesn't change
the tracker state to facilitate exception safety in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Encapsulate the contribution-related members in
struct contribution, to be used for strong exception safety.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although replace_sstables is supposed to be called
only once per {old_ssts, new_ssts}, it is safer
to update `_total_bytes` with `sst->data_size()`
only if the sst was inserted/erased successfully.
Otherwise _total_bytes may go out of sync with the
contents of _all.
That said, the next step should be to refer to the
compaction_group's main sstable set directly rather
than maintaining a "shadow" set in the tracker.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In the next patch, we will want to observe when the result
message is a schema change and handle it differently than
when it is not. This patch adds a helper method for that,
which should be more readable than a dynamic_pointer_cast
and a comparison with nullptr.
Task manager's tasks covering resharding compaction
on top and shard level.
Closes #14112
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test reshaping compaction
compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run()
compaction: add shard_reshaping_compaction_task_impl
replica: delete unused function
compaction: add table_reshaping_compaction_task_impl
compaction: copy reshape to task_manager_module.cc
compaction: add reshaping_compaction_task_impl
Split long running test_aggregate_functions to one case per type.
This allows test.py to run them in parallel.
Before this it would take 18 minutes to run in debug mode. Afterwards
each case takes 30-45 seconds.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14368
Split long running test test_schema_changes in 3 parts, one for each
writable_sstable_versions so it can be run in parallel by test.py.
Add static checks to alert if the array of types changed.
Original test takes around 24 minutes in debug mode, and each new split
test takes around 8 minutes.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14367
Remove previous configuration blocking parallel run.
Test cases run fine in local debug.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14369
Split long running tests
test_database_with_data_in_sstables_is_a_mutation_source_plain and
test_database_with_data_in_sstables_is_a_mutation_source_reverse.
They run with x_log2_compaction_groups of 0 and 1, each one taking
10 to 15 minutes in debug mode, for a total of 28 and 22 minutes.
Split the test cases to run with 0 and 1, so test.py can run them in
parallel.
Refs #13905
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #14356
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after a reboot.
This is incorrect behavior; we should start it during setup.
Also, unmask is unnecessary for enabling the timer.
Fixes #14249
Closes #14252
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.
The problem was introduced in this commit
2d791a5ed4. The
function `encode_replica_exception_for_rpc` used
`default_tuple_maker` callback to create tuples
containing exceptions. Callers returned pointers
to default-constructed values in this callback,
e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
The commit changed this to just `SourceTuple{}`,
which means nullptr for pointer types.
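The idea behind the fix, modeled as a sketch (Python stand-ins for the C++ smart-pointer types):

```python
# Hedged sketch of the make_default_rpc_tuple idea: for pointer-like
# tuple fields, substitute a default-constructed value instead of a
# null, so the RPC layer never has to serialize a null.
def make_default_tuple(field_types):
    # the real function distinguishes smart-pointer types; here every
    # field is simply default-constructed
    return tuple(t() for t in field_types)

class ReconcilableResult:
    """Stand-in for foreign_ptr(make_lw_shared<reconcilable_result>())."""
    def __init__(self):
        self.rows = []

vals = make_default_tuple([ReconcilableResult, int])
assert isinstance(vals[0], ReconcilableResult) and vals[0].rows == []
assert vals[1] == 0   # never a null/None field
```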
Fixes: #14282
Closes #14352
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036
This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.
This commit fixes a bug and must be backported to
branch-5.3 and branch-5.2.
Closes #14372
Compaction tasks covering table major, cleanup, offstrategy,
and upgrade sstables compaction inherit sequence number from their
parents. Thus they do not need to have a new sequence number
generated as it will be overwritten anyway.
Closes #14379
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Part of moving the responsibility for applying
and notifying keyspace schema changes from
schema_tables to the database so that the
database can control the order of applying the changes
across shards and when to notify its listeners.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The series contains mostly cleanups for query processor and no functional
change. The last patch is a small cleanup for the storage_proxy.
* 'qp-cleanup' of https://github.com/gleb-cloudius/scylla:
storage_proxy: remove unused variable
client_state: co-routinise has_column_family_access function
query_processor: get rid of internal_state and create individual query_state for each request
cql3: move validation::validate_column_family from client_state::has_column_family_access
client_state: drop unneeded argument from has.*access functions
cql3: move check for dropping cdc tables from auth to the drop statement code itself
query_processor: co-routinise execute_prepared_without_checking_exception_message function
query_processor: co-routinize execute_direct_without_checking_exception_message function
cql3: remove empty statement::validate functions
cql3: remove empty function validate_cluster_support
cql3/statements: fix indentation and spurious white spaces
query_processor: move statement::validate call into execute_with_params function
query_processor: co-routinise execute_with_params function
query_processor: execute statement::validate before each execution of internal query instead of only during prepare
query_processor: get rid of shared internal_query_state
query_processor: co-routinize execute_paged_internal function
query_processor: co_routinize execute_batch_without_checking_exception_message function
query_processor: co-routinize process_authorized_statement function
It's very annoying to add a declaration to expression.hh and watch
the whole world get recompiled. Improve that by moving less-common
functions to a new header expr-utils.hh. Move the evaluation machinery
to a new header evaluate.hh. The remaining definitions in expression.hh
should not change as often, and thus cause less frequent recompiles.
Closes #14346
* github.com:scylladb/scylladb:
cql3: expr: break up expression.hh header
cql3: expr: restrictions.hh: protect against double inclusions
cql3: constants: deinline
cql3: statement_restrictions: deinline
cql3: deinline operation::fill_prepare_context()
There was a bug in describe_statement. When executing `DESC FUNCTION <uda name>` or `DESC AGGREGATE <udf name>`, Scylla was crashing because the function was found (`functions::find()` searches both UDFs and UDAs) but it was the wrong kind of function and the pointer wasn't checked after the cast.
Added a test for this.
Fixes: #14360
Closes #14332
* github.com:scylladb/scylladb:
cql-pytest:test_describe: add test for filtering UDF and UDA
cql3:statements:describe_statement: check pointer to UDF/UDA
Adding a function declaration to expression.hh causes many
recompilations. Reduce that by:
- moving some restrictions-related definitions to
the existing expr/restrictions.hh
- moving evaluation related names to a new header
expr/evaluate.hh
- moving utilities to a new header
expr/expr-utilities.hh
expression.hh contains only expression definitions and the most
basic and common helpers, like printing.
To reduce future header fan-in, deinline all non-trivial functions.
While these are on the hot path, they can't be inlined as they're
virtual, and they're quite heavy anyway.
Checking keyspace/table presence should not be part of authorization code
and it is not done consistently today. For instance keyspace presence
is not checked in "alter keyspace" during authorization, but during
statement execution. Make it consistent.
Checking if a table is CDC log and cannot be dropped should not be done
as part of authentication (this has nothing to do with auth), but in the
drop statement itself. Throwing unauthorized_exception is wrong as well,
but unfortunately it is enshrined with a test. Not sure if it is a good
idea to change it now.
There is a discrepancy on how statement::validate is used. On a regular
path it is called before each execution, but on internal execution
path it is called only once during prepare. Such a discrepancy makes it
hard to reason about what can and cannot be done during the call. Call it
uniformly before each execution. This allows validate to check state that
can change after prepare.
internal_query_state was passed in a shared_ptr since the Java
translation times. It can be a regular C++ type with a lifetime
bound to the execution of the function it was created in.
Make evaluate()'s body more regular, then exploit it by
replacing the long list of branches with a lambda template.
Closes #14306
* github.com:scylladb/scylladb:
cql3: expr: simplify evaluate()
cql3: expr: standardize evaluate() branches to call do_evaluate()
cql3: expr: rename evaluate(ExpressionElement) to do_evaluate()
This is V2 of https://github.com/scylladb/scylladb/pull/14108
This commit moves the installation instructions for the cloud from the [website](https://www.scylladb.com/download/) to the docs.
The scope:
* Added new files with instructions for AWS, GCP, and Azure.
* Added the new files to the index.
* Updating the "Install ScyllaDB" page to create the "Cloud Deployment" section.
* Adding new bookmarks in other files to create stable links, for example, ".. _networking-ports:"
* Moving common files to the new "installation-common" directory. This step is required to exclude the open source-only files in the Enterprise repository.
In addition:
- The Configuration Reference file was moved out of the installation section (it's not about installation at all)
- The links to creating a cluster were removed from the installation page (as not related).
Related: https://github.com/scylladb/scylla-docs/issues/4091
Closes #14153
* github.com:scylladb/scylladb:
doc: remove the rpm-info file (What is in each RPM) from the installation section
doc: move cloud deployment instruction to docs -v2
Spans are slightly cleaner, slightly faster (as they avoid an indirection),
and allow for replacing some of the arguments with small_vector:s.
Closes #14313
There was a bug that caused aggregates to fail when used on case-sensitive column names.
For example:
```cql
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non-case-sensitive column names we convert them to lowercase, but for case-sensitive names we have to preserve the name as originally written.
The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
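The distinction can be sketched with a toy identifier type (hypothetical names and shape, not Scylla's actual `column_identifier` class): a case-sensitive (quoted) name keeps its exact spelling, while a non-case-sensitive one is folded to lowercase.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Sketch only: a CQL-style identifier keeps its exact spelling when marked
// case-sensitive (as for a quoted name like "SomeColumn"), and is folded
// to lowercase otherwise.
struct column_identifier {
    std::string text;
    column_identifier(std::string raw, bool keep_case) {
        if (!keep_case) {
            std::transform(raw.begin(), raw.end(), raw.begin(),
                           [](unsigned char c) { return std::tolower(c); });
        }
        text = std::move(raw);
    }
};
```

The bug amounted to constructing the identifier with `keep_case == false` for a name that was originally quoted, so the lowercase form no longer matched the stored schema.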
Fixes: https://github.com/scylladb/scylladb/issues/14307
Closes #14340
* github.com:scylladb/scylladb:
service/forward_service.cc: make case-sensitivity explicit
cql-pytest/test_aggregate: test case-sensitive column name in aggregate
forward_service: fix forgetting case-sensitivity in aggregates
Task manager task covering compaction group major
compaction.
Uses multiple inheritance on the already existing
major_compaction_task_executor to keep track of
the operation with the task manager.
Closes #14271
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py
test: use named variable for task tree depth
compaction: turn major_compaction_task_executor into major_compaction_task_impl
compaction: take gate holder out of task executor
compaction: extend signature of some methods
tasks: keep shared_ptr to impl in task
compaction: rename compaction_task_executor methods
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.
This works not only for nodes that restart, but also for nodes that were running behind a network partition and were removed. Even when the partition resolves, the existing nodes will effectively firewall that node off.
Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.
Closes #13850
* github.com:scylladb/scylladb:
test: pylib: increase checking period for `get_alive_endpoints`
test: add node banning test
test: pylib: manager_client: `get_cql()` helper
test: pylib: ScyllaCluster: server pause/unpause API
raft topology: ban left nodes
raft topology: skip `left_token_ring` state during `removenode`
raft topology: prepare decommission path for node banning
raft topology: introduce `left_token_ring` state
raft topology: `raft_topology_cmd` implicit constructor
messaging_service: implement host banning
messaging_service: exchange host IDs and map them to connections
messaging_service: store the node's host ID
messaging_service: don't use parameter defaults in constructor
main: move messaging_service init after system_keyspace init
Fixes https://github.com/scylladb/scylladb/issues/14333
This commit replaces the documentation landing page with
the Open Source-only documentation landing page.
This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.
Closes #14343
Make it explicit that the boolean argument determines case-sensitivity. It emphasizes its importance.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug which made aggregates fail when used with case-sensitive
column names.
Add a test to make sure that this doesn't happen in the future.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug that caused aggregates to fail when
used on case-sensitive column names.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non-case-sensitive column names we convert them to lowercase,
but for case-sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The chunk size used in sstable compression can be set when creating a
table, using the "chunk_length_in_kb" parameter. It can be any power-of-two
multiple of 1KB. Very large compression chunks are not useful - they
offer diminishing returns on compression ratio, and require very large
memory buffers and reading a very large amount of disk data just to
read a small row. In fact, small chunks are recommended - Scylla
defaults to 4 KB chunks, and Cassandra lowered their default from 64 KB
(in Cassandra 3) to 16 KB (in Cassandra 4).
Therefore, allowing arbitrarily large chunk sizes is just asking for
trouble. Today, a user can ask for a 1 GB chunk size, and crash or hang
Scylla when it runs out of memory. So in this patch we add a hard limit
of 128 KB for the chunk size - anything larger is refused.
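The validation described above can be sketched as follows (a sketch under assumed names, not the actual Scylla code): reject anything that is not a power-of-two number of kilobytes, and cap the result at 128 KB.

```cpp
#include <stdexcept>
#include <string>

// Assumed constant matching the limit described in the commit message.
constexpr unsigned max_chunk_length_kb = 128;

// Validates "chunk_length_in_kb" and returns the chunk size in bytes.
unsigned validate_chunk_length_kb(unsigned kb) {
    // A power of two has exactly one bit set: kb & (kb - 1) == 0.
    if (kb == 0 || (kb & (kb - 1)) != 0) {
        throw std::invalid_argument("chunk_length_in_kb must be a power of 2");
    }
    if (kb > max_chunk_length_kb) {
        throw std::invalid_argument("chunk_length_in_kb must not exceed " +
                                    std::to_string(max_chunk_length_kb));
    }
    return kb * 1024;
}
```

With this check in place, a request for a 1 GB chunk (`chunk_length_in_kb = 1048576`) is refused at table-creation time instead of exhausting memory at read time.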
Fixes #9933
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14267
This reverts commit 562087beff.
The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.
Fixes #10459
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
4. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
5. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
6. Sstable cleanup is not wired properly yet
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make life easier, support was added to obtain the table pointer from the schema pointer:
```
schema_ptr s;
s->table().shard_of(...)
```
Closes #13939
* github.com:scylladb/scylladb:
locator: network_topology_strategy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes #11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics is executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to the creating scheduling group and using it, not the
active one, when/if creating new metrics.
Closes #14294
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.
Takes load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
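The core of the idea can be sketched like this (a simplified model under assumed names, not the real load_sketch class): keep a per-shard tablet count, always pick the shard with the lowest count, and charge it for the new tablet.

```cpp
#include <cstddef>
#include <vector>

// Toy model of least-loaded-shard allocation: every tablet, whether new
// or already allocated for another table, has equal weight (a count of 1).
struct load_sketch_sim {
    std::vector<size_t> per_shard_tablets;
    explicit load_sketch_sim(size_t shards) : per_shard_tablets(shards, 0) {}

    // Returns the least-loaded shard and records the new tablet on it.
    size_t next_shard() {
        size_t best = 0;
        for (size_t s = 1; s < per_shard_tablets.size(); ++s) {
            if (per_shard_tablets[s] < per_shard_tablets[best]) {
                best = s;
            }
        }
        ++per_shard_tablets[best];
        return best;
    }
};
```

Starting from an even load, this degenerates to round-robin; starting from an uneven load (pre-existing tablets), it first fills the emptier shards.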
We still use it in many places in unit tests, which is ok because
those tables are vnode-based.
We want to check incorrect uses in production as they may lead to hard
to debug consistency problems.
This is in order to prevent new incorrect uses of dht::shard_of() from
being accidentally added. It also makes sure that all current uses are
caught by the compiler and require an explicit rename.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
This is not strictly necessary, as the multishard reader will later be
avoided altogether for tablet-based tables, but it is a step towards
converting all code to use erm->get_sharder() instead of
schema::get_sharder().
schema::get_sharder() does not use the correct sharder for
tablet-based tables. Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.
Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:
```
start_token = sharder.token_for_next_shard(end_token, shard);
```
If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
Soon, compute_shards_for_this_sstable() will need to take a sharder object.
open_data() is called indirectly from sstable::load() and directly
after writing an sstable from various paths. The latter don't really
need to compute shards, since the field is already set by the writer. In
order to reduce code churn, move compute_shards_for_this_sstable() to
the load() path only so that only load() needs to take the sharder.
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather than from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.
Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
For tablets, sharding depends on replication map, so the scope of the
sharder should be effective_replication_map rather than the schema
object.
Existing users will be transitioned incrementally in later patches.
The logic was extracted from ring_position_range_sharder::next(), and
the latter was changed to rely on sharder::next_shard().
The tablet sharder will have a different implementation for
next_shard(). This way, ring_position_range_sharder can work with both
current sharder and the tablet sharder.
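The shape of the abstraction can be sketched as follows (a hypothetical interface, not Scylla's actual `dht::sharder`): once "which shard owns this token" and "where does the next range start" are virtual, a range splitter written against the base class works with both a static round-robin sharder and a tablet sharder.

```cpp
#include <cstdint>

// Minimal sharder interface sketch: next_shard()-style iteration is
// expressed as a range-boundary query so different sharding schemes
// can plug in.
struct sharder {
    virtual ~sharder() = default;
    virtual unsigned shard_of(uint64_t token) const = 0;
    virtual uint64_t next_shard_boundary(uint64_t token) const = 0;
};

// The "current" scheme: fixed-size ranges assigned round-robin to shards.
struct round_robin_sharder : sharder {
    unsigned shards;
    uint64_t range_size;
    round_robin_sharder(unsigned s, uint64_t r) : shards(s), range_size(r) {}
    unsigned shard_of(uint64_t token) const override {
        return static_cast<unsigned>((token / range_size) % shards);
    }
    uint64_t next_shard_boundary(uint64_t token) const override {
        return (token / range_size + 1) * range_size;
    }
};
```

A tablet sharder would implement the same two methods by looking up tablet metadata instead of computing a modulus; notably, its adjacent ranges may map to the same shard, which the round-robin scheme never produces.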
Querying the virtual table system.token_ring fails if there is a
tablet-based table, due to an attempt to obtain a per-keyspace erm.
Fix by not showing such keyspaces.
This will make it easier to access table properties in places which
only have a schema_ptr. This is particularly useful when replacing
dht::shard_of() uses with s->table().shard_of(), now that sharding is
no longer static, but table-specific.
Also, it allows us to install a guard which catches invalid uses of
schema::get_sharder() on tablet-based tables.
It will be helpful for other uses as well. For example, we can now get
rid of the static_props hack.
The entry may exist, but its schema may not yet be loaded. learn()
didn't take that into account. This problem is not reachable in
production code, which currently always calls get_or_load() before
learn(), except for boot, but there's no concurrency at that point.
Exposed by unit test added later.
System tables have static schemas and code uses those static schemas
instead of looking them up in the database. We want those schemas to
have a valid table() once the table is created, so we need to attach
registry entry to the target schema rather than to a schema duplicate.
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.
Two new columns are added to `system.topology`:
- `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation, the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.
These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.
During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.
Closes #14232
* github.com:scylladb/scylladb:
storage_service: update supported cluster features in group0 on start
storage_service: add methods for features to topology mutation builder
storage_service: use explicit ::set overload instead of a template
storage_service: reimplement mutation builder setters
storage_service: introduce topology_mutation_builder_base
topology_state_machine: include information about features
system_keyspace: introduce deserialize_set_column
db/system_keyspace: add storage for cluster features managed in group 0
Now, when a node starts, it will update its `supported_features` row in
`system.topology` via `update_topology_with_local_metadata`.
At this point, the functionality behind cluster features on raft is
mostly incomplete and the state of the `supported_features` column does
not influence anything so it's safe to update this column
unconditionally. In the future, the node will only join / start group0
server if it is sure that it supports all enabled features and it can
safely update the `supported_features` parameter.
The newly added `supported_features` and `enabled_features` columns can
now be modified via topology mutation builders:
- `supported_features` can now be overwritten via a new overload of
`topology_node_mutation_builder::set`.
- `enabled_features` can now be extended (i.e. more elements can be
added to it) via `topology_mutation_builder::add_enabled_features`. As
the set of enabled features only grows, this should be sufficient.
The `topology_node_mutation_builder::set` function has an overload which
accepts any type which can be converted to string via `::format`. Its
presence can lead to easy mistakes which can only be detected at runtime
rather than at compile time. A concrete example: I wrote a function that
accepts an std::set<S> where S is convertible to sstring; it turns out
that std::string_view is not std::convertible_to sstring and overload
resolution fell back to the catch-all overload.
This commit gets rid of the catch-all overload and replaces it with
explicit ones. Fortunately, it was used for only two enums, so it wasn't
much work.
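The pitfall described above can be reproduced with a tiny example (hypothetical names, using std::string in place of sstring): the specific overload only wins for arguments that convert implicitly, and anything else silently falls into the catch-all template.

```cpp
#include <string>
#include <string_view>

// Specific overload: intended for string-like arguments.
std::string set_value(const std::string& v) { return "string:" + v; }

// Catch-all template overload, analogous to the one removed by the commit.
// It wins whenever the argument is not implicitly convertible to
// std::string - including std::string_view, whose conversion to
// std::string is explicit.
template <typename T>
std::string set_value(const T&) { return "formatted"; }
```

Calling `set_value(std::string_view("x"))` compiles without complaint but picks the catch-all, which is exactly the kind of silent runtime surprise that motivated replacing the template with explicit overloads.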
As promised in the previous commit which introduced
topology_mutation_builder_base, this commit adjusts existing setters of
topology mutation builder and topology node mutation builder to use
helper methods defined in the base class.
Note that the `::set` method for the unordered set of tokens no longer
deletes the column when an empty value is set; instead it just
writes an empty set. This semantic is arguably clearer given that we
have an explicit `::del` method, and it shouldn't affect the existing
implementation - we never intentionally insert an empty set of tokens.
Introduces `topology_mutation_builder_base` which will be a base class
for both topology mutation builder and topology node mutation builder.
Its purpose is to abstract away some details about setting/deleting/etc.
columns in the mutation; the actual topology (node) mutation builders will
only have to care about converting types and/or allowing only particular
columns to be set. The class uses CRTP: derived classes provide
access to the row being modified, the schema, and the timestamp.
For the sake of commit diff readability, this commit only introduces this
class and changes the builders to derive from it but no setter
implementations are modified - this will be done in the next commit.
There are three places in system_keyspace.cc which deserialize a column
holding a set of tokens and convert it to an unordered set of
dht::token. The deserialization process involves a small number of steps
that are the same in all of those places, therefore they can be
abstracted away.
This commit adds `deserialize_set_column` function which takes care of
deserializing the column to `set_type_impl::native_type` which can be
then passed to `decode_tokens`. The new function will also be useful for
decoding set columns with cluster features, which will be handled in the
next commit.
The lambda is defined to return a coordinator_result<stop_iteration>,
but in fact only returns successful outcomes, never failures.
Change it to return a plain stop_iteration, so its callers don't have
to check for failure.
`server_sees_others` and similar functions periodically call
`get_alive_endpoints`. The period was `.1` seconds, increase it to `.5`
to reduce the log spam (I checked empirically that `.5` is usually how
long it takes in dev mode on my laptop.)
The "tell the node to shut down" RPC would fail every time in the
removenode path (since the node is dead), which is kind of awkward.
Besides, for removenode we don't really need the `left_token_ring`
state, we don't need to coordinate with the node - writes destined for
it are failing anyway (since it's dead) and we can ban the node
immediately.
Remove the node from group 0 while in `write_both_read_new` transition
state (even when we implement abort, in this state it's too late to
abort, we're committed to removing the node - so it's fine to remove it
from group 0 at this point).
Currently the decommissioned node waits until it observes that it was
moved to the `left` state, then proceeds to leave group 0 and shut down.
Unfortunately, this strategy won't work once we introduce banning nodes
that are in `left` state - there is no guarantee that the
decommissioning node will observe that it entered `left` state. The
replication of Raft commands races with the ban propagating through the
cluster.
We also can't make the node leave as soon as it observes the
`left_token_ring` state, which would defeat the purpose of
`left_token_ring` - allowing all nodes to observe that the node has left
the token ring before it shuts down.
We could introduce yet another state between `left_token_ring` and
`left`, which the node waits for before shutting down; the coordinator
would request a barrier from the node before moving to `left` state.
The alternative - which we chose here - is to have the coordinator
explicitly tell the node to shutdown while we're in `left_token_ring`
through a direct RPC. We introduce
`raft_topology_cmd::command::shutdown` and send it to the node while in
`left_token_ring` state, after we requested a cluster barrier.
We don't require the RPC to succeed; we need to allow it to fail to
preserve availability. This is because an earlier incarnation of the
coordinator may have requested the node to shut down already, so the
new coordinator will fail the RPC as the node is already dead. This also
improves availability in general - if the node dies while we're in
`left_token_ring`, we can proceed.
We don't lose safety from that, since we'll ban the node (later commit).
We only lose a bit of user experience if there's a failure at this
decommission step - the decommissioning node may hang, never receiving
the RPC (it will be necessary to shut it down manually).
Another complication arising from banning the node is that it won't be
able to leave group 0 on its own; by the time it tries that, it may have
already been banned by the cluster (the coordinator moves the node to
`left` state after telling it to shut down). So we get rid of the
`leave_group0` step from `raft_decommission()` (which simplifies the
function too), putting a `remove_from_raft_config` inside the
coordinator code instead - after we told the node to shut down.
(Removing the node from configuration is also another reason why we need
to allow the above RPC to fail; the node won't be able to handle the
request once it's outside the configuration, because it handles all
coordinator requests by starting a read barrier.)
Finally, a complication arises when the coordinator is the
decommissioning node. The node would shut down in the middle of handling
the `left_token_ring` state, leading to harmless but awkward errors even
though there were no node/network failures (the original coordinator
would fail the `left_token_ring` state logic; a new coordinator would take
over and do it again, this time succeeding). We fix that by checking if
we're the decommissioning node at the beginning of `left_token_ring`
state handler, and if so, stepping down from leadership by becoming a
nonvoter first.
We want the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that node after it has already shut down,
leading to unnecessary failures on the data path (e.g. for CL=ALL writes).
Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.
After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds we
allow the node to shut down.
Saves some redundant typing when passing `raft_topology_cmd` parameters,
so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
be immediately dropped) when receiving the `CLIENT_ID` verb.
When a node first establishes a connection to another node, it always
sends a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.
Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.
This mapping will be used later when we ban nodes that we remove from
the cluster.
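The bookkeeping described in these two commits can be sketched as follows (a hypothetical data structure, not messaging_service's real interface): map each host ID to its live connections on `CLIENT_ID`, and on ban, drop them all and remember the host so later connection attempts are rejected.

```cpp
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model: connections are represented by integer handles.
struct connection_registry {
    std::unordered_map<std::string, std::vector<int>> by_host; // host_id -> conns
    std::set<std::string> banned;
    std::vector<int> dropped; // stands in for actually closing connections

    // Called when a CLIENT_ID verb arrives carrying the sender's host_id.
    // Returns false if the connection must be rejected (host is banned).
    bool on_client_id(const std::string& host_id, int conn) {
        if (banned.count(host_id)) {
            return false;
        }
        by_host[host_id].push_back(conn);
        return true;
    }

    // Ban a host that left the cluster: drop its existing connections and
    // refuse any future ones.
    void ban_host(const std::string& host_id) {
        banned.insert(host_id);
        auto it = by_host.find(host_id);
        if (it != by_host.end()) {
            dropped.insert(dropped.end(), it->second.begin(), it->second.end());
            by_host.erase(it);
        }
    }
};
```

Because the ban set is keyed by host ID rather than address, it keeps working even if the removed node comes back from behind a partition with the same identity.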
This PR fixes some problems found after the PR was merged:
* missed `node_to_work_on` assignment in `handle_topology_transition`;
* change error reporting in `update_fence_version` from `on_internal_error` to regular exceptions, since these exceptions can happen during normal operation.
* `update_fence_version` has been moved after `group0_service.setup_group0_if_exist` in `main.cc`; otherwise we use an uninitialized `token_metadata::version` and get an error.
Fixes: #14303
Closes #14292
* github.com:scylladb/scylladb:
main.cc: move update_fence_version after group0_service.setup_group0_if_exist
shared_token_metadata: update_fence_version: on_internal_error -> throw
storage_service: handle_topology_transition: fix missed node assignment
major_compaction_task_executor inherits both from compaction_task_executor
and major_compaction_task_impl.
Thanks to that an executed operation is represented in task manager.
In the following commits, classes deriving from compaction_task_executor
will be alive longer than they are kept in compaction_manager::_tasks.
Thus, the compaction_task_executor::_gate_holder would be held,
blocking other compactions.
compaction_task_executor::_gate_holder is moved outside of
compaction_task_executor object.
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their values first,
before checking if either of them is expiring.
Only if both are expiring do we compare their expiration times
and ttl values to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is threefold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before the expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one of the upserts, as all cells in the respective upsert carry the same expiration time.
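The new merge order can be modeled with a small sketch (simplified types and names; the real logic, which also folds in ttl, lives in compare_atomic_cell_for_merge): for equal timestamps, the expiration is considered before the value, so all cells of one upsert win or lose together.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Toy cell: absent expiry means the cell never expires.
struct cell {
    int64_t timestamp;
    std::optional<int64_t> expiry;
    std::string value;
};

// Returns true if b wins the merge over a, following the ordering
// described above: timestamp first, then expiration, then value.
bool b_wins(const cell& a, const cell& b) {
    if (a.timestamp != b.timestamp) {
        return b.timestamp > a.timestamp;
    }
    // Later (or absent) expiry wins before the value is even looked at.
    int64_t ea = a.expiry.value_or(INT64_MAX);
    int64_t eb = b.expiry.value_or(INT64_MAX);
    if (ea != eb) {
        return eb > ea;
    }
    return b.value > a.value; // value is only the last tie-breaker
}
```

Under the old order, a cell with value "zzz" and an earlier expiry could beat a cell with value "aaa" and a later expiry, yielding a null after the winner expired; with expiration compared first, the later-expiring cell wins.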
Fixes #14182
Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182
Closes #14183
* github.com:scylladb/scylladb:
docs: dml: add update ordering section
cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
atomic_cell: compare_atomic_cell_for_merge: update and add documentation
compare_atomic_cell_for_merge: compare value last for live cells
mutation_test: test_cell_ordering: improve debuggability
Otherwise, the validation
```
new_fence_version <= token_metadata::version
```
inside update_fence_version will use an uninitialized
token_metadata::version == 0
and we will get an error.
The test_topology_ops was improved to
catch this problem.
Fixes: #14303
on_internal_error is wrong for a fence_version
condition violation: when the topology change
coordinator migrates to another node, we can have
a raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
It is currently located in query_class_config.hh, which is named after a
now defunct struct. This arrangement is unintuitive and there is no
upside to it. The main user of max_result_size is query_command, so
colocate it next to the latter.
Closes #14268
and add docs/dev/timestamp-conflict-resolution.md
to document the details of the conflict resolution algorithm.
Refs scylladb/scylladb#14063
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extend the signatures of table::compact_all_sstables and
compaction_manager::perform_major_compaction so that they get
the info of a covering task.
This allows us to easily create child tasks that cover compaction group
compaction.
Keep seastar::shared_ptr to task::impl instead of std::unique_ptr
in task. Some classes deriving from task::impl may be used outside
task manager context.
Add reproducers for #14182:
test_rewrite_different_values_using_same_timestamp verifies
expiration-based cell reconciliation.
test_rewrite_different_values_using_same_timestamp_and_expiration
is a scylla_only test, verifying that when
two cells with same timestamp and same expiration
are compared, the one with the lesser ttl prevails.
test_rewrite_using_same_timestamp_select_after_expiration
reproduces the specific issue hit in #14182
where a cell is selected after it expires since
it has a lexicographically larger value than
the other cell with later expiration.
test_rewrite_multiple_cells_using_same_timestamp verifies
atomicity of inserts of multiple columns, with a TTL.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As in compare_atomic_cell_for_merge, we want to consider
the row marker ttl for ordering, in case both are expiring
and have the same expiration time.
This was missed in a57c087c89
and a085ef74ff.
With that in mind, add documentation to compare_row_marker_for_merge
and a mutual note to both functions about their
equivalence.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and, if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was changed in CASSANDRA-14592
for consistency with the preference for dead cells over live cells,
as expiring cells will become tombstones at a future time
and then they'd win over live cells with the same timestamp,
hence they should win also before expiration.
In addition, comparing the cell value before expiration
can lead to unintuitive corner cases where rewriting
a cell using the same timestamp but different TTL
may cause scylla to return the cell with null value
if it expired in the meanwhile.
Also, when multiple columns are written using two upserts
using the same write timestamp but with different expiration,
selecting cells by their value may return a mixed result
where each cell is selected individually from either upsert,
by picking the cells with the largest values for each column,
while using the expiration time to break the tie will lead
to a more consistent result where a set of cells from
only one of the upserts will be selected.
Fixes scylladb/scylladb#14182
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, it is hard to tell which of the many sub-cases
fails in this unit test, in case any of them does.
This change uses logging at the debug and trace levels
to help with that: the error can be reproduced
with --logger-log-level testlog=trace.
(The cases are deterministic, so reproducing should not
be a problem.)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Split off do_execute() into a fast path and slow(ish) path, and
coroutinize the latter.
perf-simple-query shows no change in performance (which is
unsurprising since it picks the fast path which is essentially unchanged).
Closes #14246
* github.com:scylladb/scylladb:
cql3: select_statement: reindent execute_without_checking_exception_message_aggregate_or_paged()
cql3: select_statement: coroutinize execute_without_checking_exception_message_aggregate_or_paged()
cql3: select_statement: split do_execute into fast-path and slow/slower paths
cql3: select_statement: disambiguate execute() overloads
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code takes
the smallest heat metric of the intersected replica set and compares it
to the smallest heat metrics of the replica sets calculated separately for
each vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was caught by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
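The fixed check can be sketched like this (a simplified Python model; the function name, the `tolerance` knob and the exact acceptance criterion are mine, not the real heat-metric logic):

```python
def can_merge_vnode_queries(replica_sets, heat, cl, tolerance=0.0):
    """replica_sets: per-vnode replica lists; heat: replica -> heat metric;
    cl: number of replicas required by the consistency level."""
    merged = set(replica_sets[0])
    for rs in replica_sets[1:]:
        merged &= set(rs)
    # The fix: bail out early on an empty (or too small) intersection
    # instead of taking min() over an empty range.
    if len(merged) < cl:
        return False
    merged_min = min(heat[r] for r in merged)
    # Accept the merged query only if its coldest replica is about as
    # good as each vnode's own coldest replica (criterion simplified).
    return all(merged_min <= min(heat[r] for r in rs) + tolerance
               for rs in replica_sets)
```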
Fixes #14284
Closes #14300
In its current state s3 client uses a single default-configured http client thus making different sched classes' workload compete with each other for sockets to make requests on. There's an attempt to handle that in upload-sink implementation that limits itself with some small number of concurrent PUT requests, but that doesn't help much as many sinks don't share this limit.
This PR makes S3 client maintain a set of http clients, one per sched-group, configures maximum number of TCP connections proportional to group's shares and removes the artificial limit from sinks thus making them share the group's http concurrency limit.
As a side effect, this fixes the upload-sink's no-writes-after-flush protection -- if it's violated, a write now results in an exception, while previously it just hung on a semaphore forever.
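The shares-proportional split can be sketched as (a Python model; the function name and the integer-division rounding are my assumptions, not the real allocation code):

```python
def connections_per_group(shares, total_connections):
    """Split a global TCP-connection budget between scheduling groups in
    proportion to their shares; every group gets at least one connection."""
    total_shares = sum(shares.values())
    return {group: max(1, total_connections * s // total_shares)
            for group, s in shares.items()}
```

With such a split, all sinks running in one scheduling group draw from that group's connection budget instead of each imposing its own artificial limit.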
fixes: #13458
fixes: #13320
fixes: #13021
Closes #14187
* github.com:scylladb/scylladb:
s3/client: Replace sink flush semaphore with gate
s3/client: Configure different max-connections on http clients
s3/client: Maintain several http clients on-board
s3/client: Remove now unused http reference from sink and file
s3/client: Add make_request() method
This patch adds some minimal tests for the "with compression = {..}" table
configuration. These tests reproduce three known bugs:
Refs #6442: Always print all schema parameters (including default values)
Scylla doesn't return the default chunk_length_in_kb, but Cassandra
does.
Refs #8948: Cassandra 3.11.10 uses "class" instead of "sstable_compression"
for compression settings by default
Cassandra switched, long ago, the "sstable_compression" attribute's
name to "class". This can break Cassandra applications that create
tables (where we won't understand the "class" parameter) and applications
that inquire about the configuration of existing tables. This patch adds
tests for both problems.
Refs #9933: ALTER TABLE with "chunk_length_kb" (compression) of 1MB caused a
core dump on all nodes
Our test for this issue hangs Scylla (or crashes, depending on the test
environment configuration), when a huge allocation is attempted during
memtable flush. So this test is marked "skip" instead of xfail.
The tests included here also uncovered a new minor/insignificant bug,
where Scylla allows floating point numbers as chunk_length_in_kb - this
number is truncated to an integer, and allowed, unlike Cassandra or
common sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14261
Now that all branches in the visitor are uniform and consist
of a single call to do_evaluate() overloads, we can simplify
by calling a lambda template that does just that.
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.
However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.
This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.
This series fixes the bug and adds a unit test which reproduces the issue.
Fixes #14110
Closes #14305
* github.com:scylladb/scylladb:
test: boost/row_cache_test: add a reproducer for #14110
cache_flat_mutation_reader: use the correct schema in prepare_hash
mutation: mutation_cleaner: add pause()
evaluate(expression) calls the various evaluate(ExpressionElement)
overloads to perform its work. However, if we add an ExpressionElement
and forget to implement its evaluate() overload, we'll end up in
with infinite recursion. It will be caught immediately, but better to
avoid it.
Also sprinkle static on the do_evaluate() overloads where it was missing.
Since `mvcc: make schema upgrades gentle` (51e3b9321b),
rows pointed to by the cursor can have different (older) schema
than the schema of the cursor's snapshot.
However, one place in the code wasn't updated accordingly,
causing a row to be processed with the wrong schema in the right
circumstances.
This passed through unit testing because it requires
a digest-computing cache read after a schema change,
and no test exercised this.
Fixes #14110
In unit tests, we would want to delay the merging of some MVCC
versions to test the transient scenarios with multiple versions present.
In many cases this can be done by holding snapshots to all versions.
But sometimes (i.e. during schema upgrades) versions are added and
scheduled for merge immediately, without a window for the test to
grab a snapshot to the new version.
This patch adds a pause() method to mutation_cleaner, which ensures
that no asynchronous/implicit MVCC version merges happen within
the scope of the call.
This functionality will be used by a test added in an upcoming patch.
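The intended semantics of pause() can be modeled like this (a toy Python model of the described behavior, not the real mutation_cleaner API):

```python
from contextlib import contextmanager

class CleanerModel:
    """Toy model of a background MVCC-version merger with pause()."""

    def __init__(self):
        self._paused = 0
        self._pending = []
        self.merged = []

    def schedule_merge(self, version):
        # versions scheduled for merge are normally merged right away
        self._pending.append(version)
        self._drain()

    def _drain(self):
        if self._paused:
            return  # no implicit merges happen within a pause() scope
        self.merged.extend(self._pending)
        self._pending.clear()

    @contextmanager
    def pause(self):
        self._paused += 1
        try:
            yield
        finally:
            self._paused -= 1
            self._drain()  # deferred merges resume once the pause ends
```

Inside the `pause()` scope a test can grab a snapshot of a freshly added version before any merge happens.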
Exposing scrub compaction to the command-line. Allows for offline scrub of sstables, in cases where online scrubbing (via scylla itself) is not possible or not desired. One such case recently was an sstable from a backup which turned out to be corrupt, `nodetool refresh --load-and-stream` refusing to load it.
Fixes: #14203
Closes #14260
* github.com:scylladb/scylladb:
docs/operating-scylla/admin-tools: scylla-sstable: document scrub operation
test/cql-pytest: test_tools.py: add test for scylla sstable scrub
tools/scylla-sstable: add scrub operation
tools/scylla-sstable: write operation: add none to valid validation levels
tools/scylla-sstable: handle errors thrown by the operation
test/cql-pytest: add option to omit scylla's output from the test output
tools/scylla-sstable: s/option/operation_option/
tool/scylla-sstable: add missing comments
This mini-series updates the expected errors in `test/cql-pytest/test-timestamp.py`
to the ones changed in b7bbcdd178.
Then, it renames the test to `test_using_timestamp.py` so it
runs automatically with `test.py`.
Closes #14293
* github.com:scylladb/scylladb:
cql-pytest: rename test-timestamp.py to test_using_timestamp.py
cql-pytest: test-timestamp: test_key_writetime: update expected errors
This reverts commit 8a54e478ba. As
commit 7dadd38161 ("Revert "configure: Switch debug build from
-O0 to -Og") was reverted (by b7627085cb, "Revert "Revert
"configure: Switch debug build from -O0 to -Og"""), we do the
same to cmake to keep the two build systems in sync.
Closes #14286
There are tons of wrappers that help test cases make sstables for their needs, and lots of code duplication in test cases that do parts of those helpers' work on their own. This set cleans up some of that.
Closes #14280
* github.com:scylladb/scylladb:
test/utils: Generalize making memtable from vector<mutation>
test/util: Generalize make_sstable_easy()-s
test/sstable_mutation: Remove useless helper
test/sstable_mutation: Make writer config in make_sstable_mutation_source()
test/utils: De-duplicate make_sstable_containing-s
test/sstable_compaction: Remove useless one-line local lambda
test/sstable_compaction: Simplify sstable making
test/sstables*: Make sstable from vector of mutations
test/mutation_reader: Remove create_sstable() helper from test
It's excessive; a test case that needs it can get the storage prefix
without this fancy wrapper-helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14273
Commit 4e205650 (test: Verify correctness of sstable::bytes_on_disk())
added a test to verify that sstable::bytes_on_disk() is equal to the
real size of real files. The same test case makes sense for S3-backed
sstables as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14272
Followup to 9bfa63fe37. Like in
`test_downgrade_after_successful_upgrade_fails`, the test
`test_joining_old_node_fails` also restarts all nodes at once and is prone
to a bug in the Python driver which can prevent the session from
reconnecting to any of the nodes. This commit applies the same
workaround to the other test (manual reconnect by recreating the Python
driver session).
Closes #14291
Another node can stop after it has joined group 0 but before it has
advertised itself in gossip. `get_inet_addrs` will try to resolve all
IPs and `wait_for_peers_to_enter_synchronize_state` will loop
indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one
of the nodes confirms that the upgrade procedure has finished. For that,
it doesn't need the IPs of all group 0 members - only the IP of some
nodes which can do the confirmation.
This PR restructures the code so that IPs of nodes are resolved inside
the `max_concurrent_for_each` that
`wait_for_peers_to_enter_synchronize_state` performs. Then, even if some
IPs won't be resolved, but one of the nodes confirms a successful
upgrade, we can continue.
Fixes #13543
Closes #14046
* github.com:scylladb/scylladb:
raft topology: test: check if aborted node replacing blocks bootstrap
raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs
1. Otherwise test.py doesn't recognize it.
2. As it represents what the test does in a better way.
3. Following `test_using_timeout.py` naming convention.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The error messages were changed in
b7bbcdd178.
Extend the `match` regular expression param
of pytest.raises to include both the old and new messages
to remain backward compatible also with Cassandra,
as this test is run against both Cassandra and Scylla.
Note that the test didn't run automatically
since it's named `test-timestamp.py` and test.py
looks up only test scripts beginning with `test_`.
The test will be renamed in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is V2 of https://github.com/scylladb/scylladb/pull/14108
This commit moves the installation instruction for the cloud from the [website](https://www.scylladb.com/download/) to the docs.
The scope:
* Added new files with instructions for AWS, GCP, and Azure.
* Added the new files to the index.
* Updated the "Install ScyllaDB" page to create the "Cloud Deployment" section.
* Added new bookmarks in other files to create stable links, for example, ".. _networking-ports:".
* Moved common files to the new "installation-common" directory. This step is required to exclude the open source-only files
in the Enterprise repository.
In addition:
- The Configuration Reference file was moved out of the installation
section (it's not about installation at all)
- The links to creating a cluster were removed from the installation
page (as not related).
Related: https://github.com/scylladb/scylla-docs/issues/4091
In preparation for converting selectors to evaluate expressions,
add support for evaluating column_mutation_attribute (representing
the WRITETIME/TTL pseudo-functions).
A unit test is added.
Fixes #12906
Closes #14287
* github.com:scylladb/scylladb:
test: expr: test evaluation of column_mutation_attribute
test: lib: enhance make_evaluation_inputs() with support for ttls/timestamps
cql3: expr: evaluate() column_mutation_attribute
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.
Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.
Refs: #14283
Initialization of `system_keyspace` is now done in a single place instead of
being spread out through the entire procedure. `system_keyspace` is also
available for queries much earlier, which allows, for example, loading our Host
ID before we initialize any of the distributed services (like gossiper,
messaging_service, etc.). This is doable because `query_processor` is now
available early. A couple of FIXMEs have been resolved.
Refs: #14202
Closes #14285
* github.com:scylladb/scylladb:
main, cql_test_env: simplify `system_keyspace` initialization
db: system_keyspace: take simpler service references in `make`
db: system_keyspace: call `initialize_virtual_tables` from `main`
db: system_keyspace: refactor virtual tables creation
db: system_keyspace: remove `system_keyspace_make`
db: system_keyspace: refactor local system table creation code
replica: database: remove `is_bootstrap` argument from create_keyspace
replica: database: write a comment for `parse_system_tables`
replica: database: remove redundant `keyspace::get_erm_factory()` getter
db: system_keyspace: don't take `sharded<>` references
There's no way to evaluate a column_mutation_attribute via CQL
yet (the only user uses old-style cql3::selection::selector), so
we only supply a unit test.
While remaining backwards compatible, allow supplying custom timestamp/ttl
with each fake column value.
Note: I tried to use a formatter<> for the new data structure, but
got entangled in a template loop.
Enhance evaluation_inputs with timestamps and ttls, and use
them to evaluate writetime/ttl.
The data structure is compatible with the current way of doing
things (see result_set_builder::_timestamps, result_set_builder::_ttls).
We use std::span<> instead of std::vector<> as it is more general
and a tiny bit faster.
The algorithm is taken from writetime_or_ttl_selector::add_input().
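The lookup can be sketched as (a simplified Python model; the null conventions here are my simplification of the real add_input() logic, not its exact semantics):

```python
def evaluate_mutation_attribute(kind, column_index, timestamps, ttls):
    """Evaluate WRITETIME/TTL for a column from per-column input rows.
    A missing timestamp/ttl is represented by None and yields null."""
    if kind == "writetime":
        return timestamps[column_index]
    if kind == "ttl":
        ttl = ttls[column_index]
        # a missing or non-positive ttl means the cell is not expiring
        return ttl if ttl is not None and ttl > 0 else None
    raise ValueError(f"unknown attribute kind: {kind}")
```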
Low CPU utilization is a major contributor to high test time.
Low CPU utilization can happen due to tests sleeping, or lack
of concurrency due to Amdahl's law.
Utilization is computed by dividing the utilized CPU by the available
CPU (CPU count times wall time).
Example output:
Found 134 tests.
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
[134/134] boost dev [ PASS ] boost.json_cql_query_test.test_unpack_decimal.1
------------------------------------------------------------------------------
CPU utilization: 4.8%
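The reported figure is a direct transcription of the formula above:

```python
def cpu_utilization(utilized_cpu_seconds, cpu_count, wall_seconds):
    """Utilized CPU time divided by the available CPU time
    (CPU count times wall-clock time)."""
    return utilized_cpu_seconds / (cpu_count * wall_seconds)
```

For example, 19.2 CPU-seconds used on an 8-CPU machine over 50 seconds of wall time gives the 4.8% shown in the example output.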
Closes #14251
Initialization of `system_keyspace` is now all done at once instead of
being spread out through the entire procedure. This is doable because
`query_processor` is now available early. A couple of FIXMEs have been
resolved.
Take references to services which are initialized earlier. The
references to `gossiper`, `storage_service` and `raft_group0_registry`
are no longer needed.
This will allow us to move the `make` step right after starting
`system_keyspace`.
`initialize_virtual_tables` was called from `system_keyspace::make`,
which caused this `make` function to take a bunch of references to
late-initialized services (`gossiper`, `storage_service`).
Call it from `main`/`cql_test_env` instead.
Note: `system_keyspace::make` is called from
`distributed_loader::init_system_keyspace`. The latter function contains
additional steps: populate the system keyspaces (with data from
sstables) and mark their tables ready for writes.
None of these steps apply to virtual tables.
There exists at least one writable virtual table, but writes into
virtual tables are special and the implementation of writes is
virtual-table specific. The existing writable virtual table
(`db_config_table`) only updates in-memory state when written to. If a
virtual table would like to create sstables, or populate itself with
sstable data on startup, it will have to handle this in its own
initialization function.
Separating `initialize_virtual_tables` like this will allow us to
simplify `system_keyspace` initialization, making it independent of
services used for distributed communication.
Split `system_keyspace::make` into two steps: creating regular
`system` and `system_schema` tables, then creating virtual tables.
This will allow, in later commit, to make `system_keyspace`
initialization independent of services used for distributed
communication such as `gossiper`. See further commits for details.
`system_keyspace_make` would access private fields of `database` in
order to create local system tables (creating the `keyspace` and
`table` in-memory structures, creating directory for `system` and
`system_schema`).
Extract this part into `database::create_local_system_table`.
Make `database::add_column_family` private.
Take `query_processor` and `database` references directly, not through
`sharded<...>&`. This is now possible because we moved `query_processor`
and `database` construction early, so by the time `system_keyspace` is
started, the services it depends on were also already started.
Calls to `_qp.local()` and `_db.local()` inside `system_keyspace` member
functions can now be replaced with direct uses of `_qp` and `_db`.
Runtime assertions for dependent services being initialized are gone.
Implement `expr::evaluate()` for `expr::field_selection`.
`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as a `field_selection`, which selects the field `a`.
Evaluating such an expression consists of finding the right element's value in a serialized UDT value and returning it.
Note that it's still not possible to use `field_selection` inside the `WHERE` clause. Enabling it would require changes to the grammar as well as query planning. Currently, `statement_restrictions` just reacts with `on_internal_error` when it encounters a `field_selection`.
Nonetheless, it's a step towards relaxing the grammar, and now it's finally possible to evaluate all kinds of prepared expressions (#12906).
Fixes: https://github.com/scylladb/scylladb/issues/12906
Closes #14235
* github.com:scylladb/scylladb:
boost/expr_test: test evaluate(field_selection)
cql3/expr: fix printing of field_selection
cql3/expression: implement evaluate(field_selection)
types/user: modify idx_of_field to use bytes_view
column_identifier: add column_identifier_raw::text()
types: add read_nth_user_type_field()
types: add read_nth_tuple_element()
This reverts commit 7dadd38161.
The latest revert cited debuggability trumping performance, but the
performance loss is so huge here that debug builds are unusable and
next promotions time out.
In the interest of progress, pick the lesser of two evils.
Both, make_sstable_easy() and make_sstable_containing() prepare memtable
by allocating it and applying mutations from vector. Make a local
helper. Many test cases could probably benefit from it too, but they
often do more stuff before applying mutations to the memtable, so this
is left for future patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them, one making sstable from memtable and the other
one doing the same from a custom reader. The former can just call the
latter with the memtable's flat reader.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two make_sstable_mutation_source() helpers that call one
another and test cases only need one of them, so leave just one that's
in use.
Also don't pass env's tempdir to make_sstable() util call, it can get
env's tempdir on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These local helpers accept a writer config which is made the same way by
all callers, so the helpers can do it on their own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The function that prepares a memtable from a mutations vector can call its
overload that writes this memtable into an sstable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_usable_sst() wrapper lambda is not needed; calling
make_sstable_containing() directly is shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a temporary memtable and an on-stack lambda that makes the
mutation. Both are overkill: make_sstable_containing() can work on just
a plain on-stack-constructed mutation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are many cases that want to call make_sstable_containing() with
a vector of mutations at hand. For that, they apply it to a temporary
memtable, but the sstable utils can work with the mutations vector as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test `test_downgrade_after_successful_upgrade_fails` shuts down the whole cluster, reconfigures the nodes and then restarts. Apparently, the python driver sometimes does not handle this correctly; in one test run we observed that the driver did not manage to reconnect to any of the nodes, even though the nodes managed to start successfully.
More context can be found on the python driver issue.
This PR works around this issue by using the existing `reconnect_driver` function (which is a workaround for a _different_ python driver issue already) to help the driver reconnect after the full cluster restart.
Refs: scylladb/python-driver#230
Closes #14276
* github.com:scylladb/scylladb:
tests/topology: work around python driver issue in cluster feature tests
test/topology{_raft_disabled}: move reconnect_driver to topology utils
In https://github.com/scylladb/scylladb/pull/14231 we split `storage_proxy` initialization into two phases: for local and remote parts. Here we do the same with `query_processor`. This allows performing queries for local tables early in the Scylla startup procedure, before we initialize services used for cluster communication such as `messaging_service` or `gossiper`.
Fixes: #14202
As a follow-up we will simplify `system_keyspace` initialization, making it available earlier as well.
Closes #14256
* github.com:scylladb/scylladb:
main, cql_test_env: start `query_processor` early
cql3: query_processor: split `remote` initialization step
cql3: query_processor: move `migration_manager&`, `forwarder&`, `group0_client&` to a `remote` object
cql3: query_processor: make `forwarder()` private
cql3: query_processor: make `get_group0_client()` private
cql3: strongly_consistent_modification_statement: fix indentation
cql3: query_processor: make `get_migration_manager` private
tracing: remove `qp.get_migration_manager()` calls
table_helper: remove `qp.get_migration_manager()` calls
thrift: handler: move implementation of `execute_schema_command` to `query_processor`
data_dictionary: add `get_version`
cql3: statements: schema_altering_statement: move `execute0` to `query_processor`
cql3: statements: pass `migration_manager&` explicitly to `prepare_schema_mutations`
main: add missing `supervisor::notify` message
The `topology_coordinator::cleanup_group0_config_if_needed` function first checks whether the number of group 0 members is larger than the number of non-left entries in the topology table, then attempts to remove nodes in left state from group 0 and prints a warning if no such nodes are found. There are some problems with this check:
- Currently, a node is added to group 0 before it inserts its entry to the topology table. Such a node may cause the check to succeed but no nodes will be removed, which will cause the warning to be printed needlessly.
- Cluster features on raft will reverse the situation and it will be possible for an entry in system.topology to exist without the corresponding node being a part of group 0. This, in turn, may cause the check not to pass when it should and nodes could be removed later than necessary.
This commit gets rid of the optimization and the warning, and the topology coordinator will always compute the set of nodes that should be removed. Additionally, the set of nodes to remove is now computed differently: instead of iterating over left nodes and including only those that are in group 0, we now iterate over group 0 members and include those that are in `left` state. As the number of left nodes can potentially grow unbounded and the number of group 0 members is more likely to be bounded, this should give better performance in long-running clusters.
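The new set computation can be sketched as (a Python model; the names are mine, not the real data structures):

```python
def group0_members_to_remove(group0_members, topology_state):
    """Iterate over the (bounded) set of group 0 members instead of the
    potentially unbounded set of left nodes, and pick the members whose
    topology entry is in `left` state."""
    return {m for m in group0_members if topology_state.get(m) == "left"}
```

A member with no topology entry yet (it joined group 0 before inserting its row) simply isn't selected, which matches the first problem described above.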
Closes #14238
* github.com:scylladb/scylladb:
storage_service: fix indentation after previous commit
storage_service: remove optimization in cleanup_group0_config_if_needed
The test `test_downgrade_after_successful_upgrade_fails` stops all
nodes, reconfigures them to support the test-only feature and restarts
them. Unfortunately, it looks like the python driver sometimes does not
handle this properly and might not reconnect after all nodes are shut
down.
This commit adds a workaround for scylladb/python-driver#230 - the test
re-creates python driver session right after nodes are restarted.
The `reconnect_driver` function will be useful outside the
`topology_raft_disabled` test suite - namely, for cluster feature tests
in `topology`. The best course of action for this function would be to
put it into pylib utils; however, the function depends on ManagerClient
which is defined in `test.pylib.manager_client` that depends on
`test.pylib.utils` - therefore we cannot put it there as it would cause
an import cycle. The `topology.utils` module sounds like the next best
thing.
In addition, the docstring comment is updated to reflect that this
function will now be used to work around another issue as well.
Pass `migration_manager&`, `forward_service&` and `raft_group0_client&`
in the remote init step which happens after the constructor.
Add a corresponding uninit remote step.
Make sure that any use of the `remote` services is finished before we
destroy the `remote` object by using a gate.
Thanks to this in a later commit we'll be able to move the construction
of `query_processor` earlier in the Scylla initialization procedure.
These services are used for performing distributed queries, which
require remote calls. As a preparation for 2-phase initialization of
`query_processor` (for local queries vs for distributed queries), move
them to a separate `remote` object which will be constructed in the
second phase.
Replace the getters for the different services with a single `remote()`
getter. Once we split the initialization into two phases, `remote()`
will include a safety protection.
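The two-phase shape can be modeled like this (a toy Python stand-in for the gate-based protection; names are mine, not the real API):

```python
class QueryProcessorModel:
    """Toy model of two-phase initialization: the remote part is attached
    after construction, and any access outside that window fails fast."""

    def __init__(self):
        self._remote = None  # local-only phase: no distributed services yet

    def start_remote(self, services):
        self._remote = services

    def stop_remote(self):
        self._remote = None

    def remote(self):
        if self._remote is None:
            raise RuntimeError("remote part of query_processor not initialized")
        return self._remote
```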
After previous commits it's no longer used outside `query_processor`.
Also remove the `const` version - not needed for anything.
Use the getter instead of directly accessing `_mm` in `query_processor`
methods. Later we will put `_mm` in a separate object.
The `system.topology` table is extended with two new columns that will
be used to manage cluster features:
- `supported_features set<text>` is a new clustering column which holds
the features that given node advertises as supported. It will be first
initialized when the node joins the cluster, and then updated every
time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the
features that are considered enabled by the cluster. Unlike in the
current gossip-based implementation, the features will not be enabled
implicitly when all nodes support a feature, but rather via an
explicit action of the topology coordinator.
Expose scrub compaction on the command line. Scrubbed sstables are
written into a directory specified by the `--output-directory` command
line parameter. This directory is expected to be empty, to avoid
clashes with any pre-existing sstables. This can be overridden by the
user if they wish.
Instead of letting the runtime catch them. Also, make sure all exceptions
thrown due to bad arguments are instances of `std::invalid_argument`;
these are now reported differently from other, runtime errors.
Remove the now-extraneous `error:` prefix from all exception messages.
Scylla's output is often unnecessary to debug a failed test, or even
detrimental because one has to scroll back in the terminal after each
test run to see the actual test's output. Add an option,
--omit-scylla-output; when it is present on the command line of `run`,
the output of scylla is omitted from the test output.
Also, to help discover this option (and others), don't run the tests
when either -h or --help is present on the command line. Just invoke
pytest (with said option) and exit.
This is the initial implementation of [this spec](https://docs.google.com/document/d/1X6pARlxOy6KRQ32JN8yiGsnWA9Dwqnhtk7kMDo8m9pI/edit).
* the topology version (int64) was introduced, it's stored in topology table and updated through RAFT at the relevant stages of the topology change algorithm;
* when the version is incremented, a `barrier_and_drain` command is sent to all the nodes in the cluster, if some node is unavailable we fail and retry indefinitely;
* the `barrier_and_drain` handler first issues a `raft_read_barrier()` to obtain the latest topology, and then waits until all requests using previous versions are finished; if this round of RPCs is finished the topology change coordinator can be sure that there are no requests inflight using previous versions and such requests can't appear in the future.
* after `barrier_and_drain` the topology change coordinator issues the `fence` command, it stores the current version in local table as `fence_version` and blocks requests with older versions by throwing `stale_topology_exception`; if a request with older version was started before the fence, its reply will also be fenced.
* the fencing part of the PR is for the future, when we relax the requirement that all nodes are available during topology change; it should protect the cluster from requests with stale topology from the nodes which were unavailable during topology change and which were not reached by the `barrier_and_drain()` command;
* currently, fencing is implemented for `mutation` and `read` RPCs, other RPCs will be handled in the follow-ups; since currently all nodes are supposed to be alive, the missing parts of the fencing don't break correctness;
* along with fencing, the spec above also describes error handling, isolation and `--ignore_dead_nodes` parameter handling, these will be also added later; [this ticket](https://github.com/scylladb/scylladb/issues/14070) contains all that remains to be done;
* we don't worry about compatibility when we change topology table schema or `raft_topology_cmd_handler` RPC method signature since the raft topology code is currently hidden by `--experimental raft` flag and is not accessible to the users. Compatibility is maintained for other affected RPCs (mutation, read) - the new `fencing_token` parameter is `rpc::optional`, we skip the fencing check if it's not present.
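The fencing check described above can be sketched as follows. This is a minimal Python model with hypothetical names, not Scylla's actual types: each request carries the topology version its coordinator observed, and a node rejects it once the local `fence_version` has moved past it, skipping the check when the optional token is absent.

```python
# Minimal model of the fencing check (names are illustrative, not Scylla's).
class StaleTopologyException(Exception):
    pass

class FencingState:
    def __init__(self):
        self.fence_version = 0        # 0: topology versions not yet supported

    def advance_fence(self, version):
        # applied by the 'fence' command, after barrier_and_drain completed
        self.fence_version = max(self.fence_version, version)

    def check(self, fencing_token):
        # An absent token (rpc::optional) skips the check for compatibility.
        if fencing_token is None:
            return
        if fencing_token < self.fence_version:
            raise StaleTopologyException(
                f"token {fencing_token} < fence {self.fence_version}")

node = FencingState()
node.advance_fence(3)
node.check(3)                         # current version: accepted
node.check(None)                      # no token: accepted for compatibility
try:
    node.check(2)                     # stale version: rejected
    raise AssertionError("expected rejection")
except StaleTopologyException:
    pass
```

The same check runs on replies, so a request started before the fence is also fenced on its way back.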
Closes #13884
* github.com:scylladb/scylladb:
storage_service: warn if can't find ip for server
storage_proxy.cc: add and use global_token_metadata_barrier
storage_service: exec_global_command: bool result -> exceptions
raft_topology: add cmd_index to raft commands
storage_proxy.cc: add fencing to read RPCs
storage_proxy.cc: extract handle_read
storage_proxy.cc: refactor encode_replica_exception_for_rpc
storage_proxy: fix indentation
storage_proxy: add fencing for mutation
storage_servie: fix indentation
storage_proxy: add fencing_token and related infrastructure
raft topology: add fence_version
raft_topology: add barrier_and_drain cmd
token_metadata: add topology version
Scenario:
1. Start a cluster with nodes node1, node2, node3
2. Start node4, replacing node2
3. Stop node4 after it joined group 0 but before it advertised itself in gossip
4. Start node5, replacing node2
The test simulates the behavior described in #13543.
The test passes only if `wait_for_peers_to_enter_synchronize_state` doesn't need to
resolve all IPs to return early. If it does, node5 will hang trying to resolve the
IP of node4:
```
raft_group0_upgrade - : failed to resolve IP addresses of some of the cluster members ([node4's host ID])
```
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what sched group is can be relaxed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14266
memory_footprint_test fails with:
`sstable - writing sstables with too old format`
because it attempts to write the obsolete sstables formats,
for which the writer code has been long removed.
Fix that.
Closes #14265
repair_node_state::state is only for debugging purposes; see
ab57cea783, which introduced it.
So this change does not impact the behavior of Scylla, but it can
improve the debugging experience by reflecting a more accurate state
of repair when we are actually inspecting it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14255
Add a unit test which tests evaluating field selections.
Alas, at the moment it's impossible to add a cql-pytest,
as the grammar and query planning don't handle field
selections inside the WHERE clause.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
expression printing has two modes: debug and user.
The user mode should output standard CQL that can be
parsed back to an expression.
In debug mode there can be some additional information
that helps with debugging stuff.
The code for printing `field_selection` didn't distinguish
between user mode and debug mode. It just always printed
in debug mode, with extra parentheses around the field selection.
Let's change it so that it emits valid CQL in user mode.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Implement expr::evaluate() for expr::field_selection.
`field_selection` is used to represent access to a struct field.
For example, with a UDT value:
```
CREATE TYPE my_type (a int, b int);
```
The expression `my_type_value.a` would be represented as
a field_selection, which selects the field 'a'.
Evaluating such an expression consists of finding the
right element's value in a serialized UDT value
and returning it.
Fixes: https://github.com/scylladb/scylladb/issues/12906
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Let's change the argument type from `bytes`
to `bytes_view`. Sometimes it's possible to get
an instance of `bytes_view`, but getting `bytes`
would require a copy, which is wasteful.
`bytes_view` allows us to avoid copies.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
I would like to be able to get a reference to the
string inside `column_identifier_raw`, but there was
no such function. There was only `to_string()`, which
copies the entire string, which is wasteful.
Let's add the method `text()`, which returns a reference
instead of a copy. `column_identifier` already has
such a method, so `column_identifier_raw` can have one
as well.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function which can be used to read the nth
field of a serialized UDT value.
We could deserialize the whole value and then choose
one of the deserialized fields, but that would be wasteful.
Sometimes we only need the value of one field, not all of them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
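The single-field read described above can be sketched like this, assuming the usual tuple/UDT wire format (each field is a 4-byte signed big-endian length followed by that many bytes, -1 means null, and trailing fields may be absent). `read_nth_field` is an illustrative stand-in, not Scylla's actual helper.

```python
import struct

def read_nth_field(serialized: bytes, n: int):
    """Return the raw bytes of field n, or None if it is null/absent."""
    pos = 0
    for i in range(n + 1):
        if pos >= len(serialized):
            return None               # trailing field absent
        (length,) = struct.unpack_from(">i", serialized, pos)
        pos += 4
        if i == n:
            return None if length < 0 else serialized[pos:pos + length]
        if length > 0:
            pos += length             # skip earlier fields without copying

# my_type (a int, b int) with a=1, b=2:
blob = (struct.pack(">i", 4) + struct.pack(">i", 1)
        + struct.pack(">i", 4) + struct.pack(">i", 2))
assert read_nth_field(blob, 0) == struct.pack(">i", 1)
assert read_nth_field(blob, 1) == struct.pack(">i", 2)
```

This is why deserializing the whole value is unnecessary: earlier fields are skipped by length, never copied.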
Another node can stop after it joined group 0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.
This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs can't be resolved, as long as one of the nodes confirms a
successful upgrade, we can continue.
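The restructuring described above can be sketched with asyncio (all names hypothetical): IP resolution happens per peer, inside the concurrent loop, so a peer that never becomes resolvable cannot block the early return taken as soon as any other peer confirms the upgrade finished.

```python
import asyncio

async def wait_for_peers(peers, resolve, confirm_finished, poll_delay=0.001):
    done = asyncio.Event()

    async def probe(peer):
        while not done.is_set():
            ip = await resolve(peer)       # may keep failing for this peer
            if ip is not None and await confirm_finished(ip):
                done.set()                 # one confirmation is enough
                return
            await asyncio.sleep(poll_delay)

    tasks = [asyncio.create_task(probe(p)) for p in peers]
    await done.wait()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

async def demo():
    async def resolve(peer):
        # node4 stopped before advertising itself: resolution never succeeds
        return None if peer == "node4" else f"ip-of-{peer}"

    async def confirm_finished(ip):
        return True                        # any resolvable peer confirms

    await wait_for_peers(["node1", "node4"], resolve, confirm_finished)
    return True

assert asyncio.run(demo()) is True
```

With resolution outside the loop, the unresolvable node4 would block forever; inside the loop, node1's confirmation ends the wait.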
Fixes #13543
CQL statements carry expressions in many contexts: the SELECT, WHERE, SET, and IF clauses, plus various attributes. Previously, each of these contexts had its own representation for an expression, and another one for the same expression but before preparation. We have been gradually moving towards a uniform representation of expressions.
This series tackles SELECT clause elements (selectors), in their unprepared phase. It's relatively simple since there are only five types of expression components (column references, writetime/ttl modifiers, function calls, casts, and field selections). Nevertheless, there isn't much commonality with previously converted expression elements so quite a lot of code is involved.
After the series, we are still left with a custom post-prepare representation of expressions. It's quite complicated since it deals with two passes, for aggregation, so it will be left for another series.
Closes #14219
* github.com:scylladb/scylladb:
cql3: seletor: drop inheritance from assignment_testable
cql3: selection: rely on prepared expressions
cql3: selection: prepare selector expressions
cql3: expr: match counter arguments to function parameters expecting bigint
cql3: expr: avoid function constant-folding if a thread is needed
cql3: add optional type annotation to assignment_testable
cql3: expr: wire unresolved_identifier to test_assignment()
cql3: expr: support preparing column_mutation_attribute
cql3: expr: support preparing SQL-style casts
cql3: expr: support preparing field_selection expressions
cql3: expr: make the two styles of cast expressions explicit
cql3: error injection functions: mark enabled_injections() as impure
cql3: eliminate dynamic_cast<selector> from functions::get()
cql3: test_assignment: pass optional schema everywhere
cql3: expr: prepare_expr(): allow aggregate functions
cql3: add checks for aggregation functions after prepare
cql3: expr: add verify_no_aggregate_functions() helper
test: add regression test for rejection of aggregates in the WHERE clause
cql3: expr: extract column_mutation_attribute_type
cql3: expr: add fmt formatter for column_mutation_attribute_kind
cql3: statements: select_statement: reuse to_selectable() computation in SELECT JSON
Currently, even though we are moving from Java 8 to
Java 11, we still support both Java versions, and the
docker image used for testing the DataStax driver has
not been updated to install java-11.
The "java" executable provided by openjdk-java-8 does
not support the "--version" command line argument, while java-11
accepts both "-version" and "--version". So, to cater to
the needs of the outdated docker image, we pass
"-version" to the selected java; the test then passes
even if only java-8 is found. A better fix would be to update the
docker image to install java-11, though.
The output of "java -version" and "java --version" is
attached here as a reference:
```console
$ /usr/lib/jvm/java-1.8.0/bin/java --version
Unrecognized option: --version
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```
```console
$ /usr/lib/jvm/java-1.8.0/bin/java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
```
```console
$ /usr/lib/jvm/jre-11/bin/java --version
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```
```console
$ /usr/lib/jvm/jre-11/bin/java -version
openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc38) (build 11.0.19+7, mixed mode, sharing)
```
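A small sketch of why "-version" is the portable spelling: both JDKs print an `openjdk version "..."` line on `-version`, so one parser covers java-8 and java-11. The helper name is hypothetical; the sample strings come from the transcripts above.

```python
import re

def java_major_version(version_output: str) -> int:
    # Parse the `openjdk version "..."` line that `java -version` prints.
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not m:
        raise ValueError("unrecognized `java -version` output")
    major = int(m.group(1))
    if major == 1 and m.group(2):
        return int(m.group(2))        # pre-9 JDKs report "1.x"
    return major

assert java_major_version('openjdk version "1.8.0_362"') == 8
assert java_major_version('openjdk version "11.0.19" 2023-04-18') == 11
```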
Fixes #14253
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14254
This series adds an option named "uuid_sstable_identifier_enabled", and a related cluster feature bit, which is set once all nodes in the cluster set this option to "true". The sstable subsystem will then start using a timeuuid instead of a plain integer as the sstable identifier. A timeuuid is a better choice for identifiers, as we don't need to worry about id conflicts anymore. But we still have quite a few tests using static sstables with integers in their names; those tests are not changed in this series. We will create some tests to exercise the sstable subsystem with this option set.
A very simple interop test with Cassandra 4.1.1 was also performed to verify that the generated sstables can be read by Cassandra:
1. Start Scylla, connect to it with cqlsh, run the following commands, and stop it:
```
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
cqlsh> SELECT * FROM ks.cf;
```
2. Enable Cassandra's `uuid_sstable_identifiers_enabled`, start Cassandra 4.1.1, connect to it with cqlsh, run the following commands, and stop it:
```
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy','replication_factor':1} ;
cqlsh> CREATE TABLE ks.cf ( name text primary key, value text );
cqlsh> INSERT INTO ks.cf (name, value) VALUES ('1', 'one');
cqlsh> SELECT * FROM ks.cf;
```
3. Move away the sstables generated by Cassandra, and replace them with the sstables generated by Scylla:
```console
$ mv ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49 /tmp
$ mv ~/scylla/ks/cf-db47a12009d611eea6b8b179df3a2d5d ~/cassandra/data/data/ks/cf-b29d23a009d911eeb5fed163c4d0af49
```
4. Start Cassandra 4.1.1 again, connect to it with cqlsh, and run the following commands:
```
cqlsh> SELECT * FROM ks.cf;
name | value
------+-------
1 | one
```
Fixes https://github.com/scylladb/scylladb/issues/10459
Closes #13932
* github.com:scylladb/scylladb:
replica,sstable: introduce invalid generation id
sstables, replica: pass uuid_sstable_identifiers to generation generator
gms/feature_service: introduce UUID_SSTABLE_IDENTIFIERS cluster feature
db: config: add uuid_sstable_identifiers_enabled option
sstables, replica: support UUID in generation_type
This allows us to reflect cause-and-effect
relationships in the logs: if some command
failed, we write to the log at the top level
of the topology state machine. The log message
includes the current state of the state
machine and a description of what
exactly went wrong.
Note that in the exec_global_command overload
returning node_to_work_on we don't call retake_node()
if the nested exec_global_command failed.
This is fine, since all the callers
just log/break in this case.
In this commit we add logic to protect against
raft commands reordering. This way we can be
sure that the topology state
(_topology_state_machine._topology) on all the
nodes processing the command is consistent
with the topology state on the topology change
coordinator. In particular, this allows
us to simply use _topology.version as the current
version in barrier_and_drain instead of passing it
along with the command as a parameter.
Topology coordinator maintains an index of the last
command it has sent to the cluster. This index is
incremented for each command and sent along with it.
The receiving node compares it with the last index
it received in the same term and returns an error
if it's not greater. We are protected
against topology change coordinator migrating
to other node by the already existing
terms check: if the term from the command
doesn't match the current term we return an error.
On the call site we use the version captured in
read_executor/erm/token_metadata. In the handlers
we use apply_fence twice just like in mutation RPC.
Fencing was also added to local query calls, such as
query_result_local in make_data_request. This is for
the case when query coordinator was isolated from
topology change coordinator and didn't receive
barrier_and_drain.
We are going to add fencing to read RPCs; it would be easier
to do it once for all three of them. This refactoring
enables that, since it allows using
encode_replica_exception_for_rpc for handle_read_digest.
At the call site, we use the version captured
in erm/token_metadata. In the handler, we use
double checking: apply_fence after the local
write guarantees that no mutations
succeed on coordinators if the fence version
has been updated on the replica during the write.
Fencing was also added to mutate_locally calls
on the request coordinator, for the case
where this coordinator was isolated from the
topology change coordinator and missed the
barrier_and_drain command.
A new stale_topology_exception was introduced;
it's raised in apply_fence when an RPC comes
with a stale fencing_token.
An overload of apply_fence with future will be
used to wrap the storage_proxy methods which
need to be fenced.
It's stored outside of the topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata,
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
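The phased-barrier drain described above can be modeled roughly like this (a hypothetical Python stand-in for utils::phased_barrier): every request registers an operation under the current phase; bumping the version starts a new phase, and the barrier completes only when all operations from older phases have finished.

```python
import asyncio

class PhasedBarrier:
    def __init__(self):
        self._phase = 0
        self._pending = {}            # phase -> live operation count
        self._waiters = []            # (phase, event) pairs

    def start_operation(self):
        self._pending[self._phase] = self._pending.get(self._phase, 0) + 1
        return self._phase

    def end_operation(self, phase):
        self._pending[phase] -= 1
        if self._pending[phase] == 0:
            del self._pending[phase]
        self._wake()

    def _wake(self):
        for phase, ev in self._waiters:
            # drained once nothing from this phase or older is still running
            if all(p > phase for p in self._pending):
                ev.set()
        self._waiters = [(p, e) for p, e in self._waiters if not e.is_set()]

    async def advance_and_await(self):
        barrier_phase = self._phase
        self._phase += 1              # operations from now on: new phase
        ev = asyncio.Event()
        self._waiters.append((barrier_phase, ev))
        self._wake()                  # maybe nothing was pending at all
        await ev.wait()

async def demo():
    pb = PhasedBarrier()
    op = pb.start_operation()         # an in-flight request on phase 0

    async def finish_later():
        await asyncio.sleep(0.001)
        pb.end_operation(op)

    asyncio.get_running_loop().create_task(finish_later())
    await pb.advance_and_await()      # returns once the old request is done
    return True

assert asyncio.run(demo()) is True
```

Requests started after the advance belong to the new phase and do not delay the drain, which is exactly what lets barrier_and_drain conclude that no request with an older version is still in flight.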
It's stored as a static column in the topology table and
will be updated at various steps of the topology
change state machine.
The initial value is 1; zero means that topology
versions are not yet supported, which will be
used in RPC handling.
The invalid sstable id is the NULL of sstable identifiers. With
this concept, it is a lot simpler to find/track the greatest
generation. The complexity is hidden in generation_type, which
compares a) integer-based identifiers, b) uuid-based identifiers, and
c) the invalid identifier in different ways.
So, in this change:
* the default constructor of generation_type is
now public.
* we don't check for empty generations anymore when loading
SSTables or enumerating them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
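The comparison scheme can be sketched with a hypothetical Python stand-in for sstables::generation_type: the invalid generation (the NULL) sorts below everything, integer generations compare numerically, and uuid-based generations compare by their timeuuid timestamp. The relative order of the int and uuid tiers here is an assumption for illustration only.

```python
import functools
import uuid

@functools.total_ordering
class Generation:
    def __init__(self, value=None):
        self.value = value            # None (invalid), int, or uuid.UUID

    def _key(self):
        if self.value is None:
            return (0, 0)             # invalid: the NULL, sorts first
        if isinstance(self.value, int):
            return (1, self.value)
        return (2, self.value.time)   # timeuuid: order by creation time

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

# tracking the greatest generation no longer needs empty-checks:
assert max([Generation(), Generation(3), Generation(17)]).value == 17
assert Generation() < Generation(0) < Generation(uuid.uuid1())
```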
Before this change, we assumed that the generation is always integer-based.
In order to enable the UUID-based generation identifier when the related
option is set, we need to plumb this option down to the generation generator.
Because we don't have access to the cluster features in some places where
a new generation is created, a new accessor exposing feature_service from
the sstable manager is added.
Fixes #10459
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
UUID_SSTABLE_IDENTIFIERS is a new cluster-wide feature. If it is
enabled, all nodes will generate new sstables with a UUID as their
generation identifier. This feature is configured using the
"uuid_sstable_identifiers_enabled" config option.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Unlike in Cassandra 4.1, this option is true by default; it will be used
for enabling the "UUID_SSTABLE_IDENTIFIERS" cluster feature. Not wired yet.
Please note: because we are still using sstableloader and
sstabledump based on the 3.x branch, while Cassandra upstream
introduced the uuid sstable identifier in its 4.x branch, these tools
fail to work with sstables with uuid identifiers, so this option
is disabled when performing these tests. We will enable it once
these tools are updated to support uuid-based sstable identifiers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This change generalizes the value of generation_type so it also
supports UUID-based identifiers.
* sstables/generation_type.h:
- add a formatter and a parser for UUID. Please note, Cassandra uses
a different format for stringifying the SSTable identifier, and
that formatter suits our needs: it uses underscore "_" as the
delimiter, while the file names of components use dash "-" as the
delimiter. So instead of reinventing the formatting or just using
another delimiter in the stringified UUID, we choose to use
Cassandra's formatting.
- add accessors for the type and value of generation_type
- add constructors for creating a generation_type from a UUID and
from a string.
- use a hash for placing sstables with uuid identifiers into shards,
for a more uniform distribution of sstables across shards.
* replica/table.cc:
- only update the generator if the given generation contains an
integer
* test/boost:
- add a simple test to verify that generation_type is able to
parse and format
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
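The shard-placement rule mentioned above can be sketched like this (the function name is hypothetical, and it assumes the pre-existing rule for integer generations is a plain modulo): a uuid generation is hashed first so sstables still spread uniformly across shards.

```python
import uuid
import zlib

def shard_for_generation(gen, shard_count):
    if isinstance(gen, int):
        return gen % shard_count       # assumed legacy rule for integers
    # any stable hash of the raw uuid bytes will do; crc32 is used here
    # purely for illustration
    return zlib.crc32(gen.bytes) % shard_count

assert shard_for_generation(10, 8) == 2
shards = {shard_for_generation(uuid.uuid1(), 8) for _ in range(64)}
assert shards <= set(range(8))
```

Taking the uuid's raw value modulo the shard count would bias placement, since timeuuids generated close together share most of their bits; hashing removes that bias.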
Push those calls up the call stack, to `trace_keyspace_helper` module.
Pass `migration_manager` reference around together with
`query_processor` reference.
It's now named `execute_thrift_schema_command` in `query_processor`.
This allows us to remove yet another
`query_processor::get_migration_manager()` call.
Now that `execute_thrift_schema_command` sits near
`execute_schema_statement` (the latter used for CQL), we can see a
certain similarity. The Thrift version should also in theory get a retry
loop like the one CQL has, so the similarity would become even stronger.
Perhaps the two functions could be refactored to deduplicate some logic
later.
Rename it to `execute_schema_statement`.
This allows us to remove a call to
`query_processor::get_migration_manager`, the goal being to make it a
private member function.
We want to stop relying on `qp.get_migration_manager()`, so we can make
the function private in the future. This in turn is a prerequisite for
splitting `query_processor` initialization into two phases, where the
first phase will only allow local queries (and won't require
`migration_manager`).
This patch adds an xfailing test reproducing the bug in issue #12762:
When a SELECT uses a secondary index to list rows, if there is also a
PER PARTITION LIMIT given, Scylla forgets to apply it.
The test shows that the PER PARTITION LIMIT is correctly applied when
the index doesn't exist, but forgotten when the index is added.
In contrast, both cases work correctly in Cassandra.
This patch also adds a second variant of this test, which adds filtering
to the mix, and ensures that PER PARTITION LIMIT 1 doesn't give up on the
first row of each partition - but rather looks for the first row that
passes the filter, and only then moves on to the next partition.
Refs #12762.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14248
The `topology_coordinator::cleanup_group0_config_if_needed` function
first checks whether the number of group 0 members is larger than the
number of non-left entries in the topology table, then attempts to
remove nodes in left state from group 0 and prints a warning if no such
nodes are found. There are some problems with this check:
- Currently, a node is added to group 0 before it inserts its entry to
the topology table. Such a node may cause the check to succeed but no
nodes will be removed, which will cause the warning to be printed
needlessly.
- Cluster features on raft will reverse the situation and it will be
possible for an entry in system.topology to exist without the
corresponding node being a part of group 0. This, in turn, may cause
the check not to pass when it should and nodes could be removed later
than necessary.
This commit gets rid of the optimization and the warning, and the
topology coordinator will always compute the set of nodes that should be
removed. Additionally, the set of nodes to remove is now computed
differently: instead of iterating over left nodes and including only
those that are in group 0, we now iterate over group 0 members and
include those that are in `left` state. As the number of left nodes can
potentially grow unbounded and the number of group 0 members is more
likely to be bounded, this should give better performance in
long-running clusters.
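The new computation can be sketched as follows (the data shapes are hypothetical: group 0 membership as a set of host ids, topology state as a dict): iterate over the bounded group 0 membership and pick members whose topology entry is in `left` state, instead of scanning the potentially unbounded set of left nodes.

```python
def nodes_to_remove(group0_members, topology_state):
    # topology_state maps host id -> state ("normal", "left", ...);
    # members without an entry yet (freshly joined) are simply skipped
    return {n for n in group0_members if topology_state.get(n) == "left"}

topology = {"n1": "normal", "n2": "left", "n3": "left"}
# n3 was already removed from group 0; n4 joined group 0 but has no
# topology entry yet and causes no spurious work or warning
assert nodes_to_remove({"n1", "n2", "n4"}, topology) == {"n2"}
```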
It is simpler this way, and it also matches the "add" call at the
beginning of this function.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14239
Move the initialization of `storage_proxy` early in the startup procedure, before starting
`system_keyspace`, `messaging_service`, `gossiper`, `storage_service` and more.
As a follow-up, we'll be able to move initialization of `query_processor` right
after `storage_proxy` (but this requires a bit of refactoring in
`query_processor` too).
Local queries through `storage_proxy` can be done after the early initialization step.
In a follow-up, when we do a similar thing for `query_processor`, we'll be able
to perform local CQL queries early as well. (Before starting `gossiper` etc.)
Closes #14231
* github.com:scylladb/scylladb:
main, cql_test_env: initialize `storage_proxy` early
main, cql_test_env: initialize `database` early
storage_proxy: rename `init_messaging_service` to `start_remote`
storage_proxy: don't pass `gossiper&` and `messaging_service&` during initialization
storage_proxy: prepare for missing `remote`
storage_proxy: don't access `remote` during local queries in `query_partition_key_range_concurrent`
db: consistency_level: remove overload of `filter_for_query`
storage_proxy: don't access `remote` when calculating target replicas for local queries
storage_proxy: introduce const version of `remote()`
replica: table: introduce `get_my_hit_rate`
storage_proxy: `endpoint_filter`: remove gossiper dependency
try_emplace() is
- simpler than the lookup-and-insert dance,
- presumably more efficient, and
- most importantly, simpler to read.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14237
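The same idea in Python terms, purely as an analogy: C++ `map.try_emplace(k, v)` is to the find-then-insert dance roughly what `dict.setdefault` is to an explicit `if key not in d` check: a single lookup that inserts on a miss. (Unlike try_emplace, setdefault evaluates its default argument eagerly.)

```python
counters = {}

# the lookup-and-insert dance: two hash lookups on the miss path
if "reads" not in counters:
    counters["reads"] = 0
counters["reads"] += 1

# the try_emplace-style equivalent: one lookup, insert-if-absent
counters["writes"] = counters.setdefault("writes", 0) + 1

assert counters == {"reads": 1, "writes": 1}
```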
This function will always have a continuation, so converting it to a
coroutine will not cause extra allocations.
wrap_result_to_error_message(), which is used to convert a
coordinator_result-unaware continuation to a coordinator_result-aware
continuation, is converted to a traditional check-error-and-return.
select_statement::do_execute() has a fast path where it forwards
to execute_without_checking_exception_message_non_aggregate_unpaged().
In this fast path, we aren't paging (a good reason for that is reading
a partition without clustering keys) and in the slow/slower paths
we page and/or perform complex processing like aggregation.
The fast path doesn't need any continuations, but the slow/slower paths
do. Split them off so that the slow/slower paths can be coroutinized
without impacting the fast path.
This reverts commit 9526258b89.
Because the issue (#14213) it was supposed to fix only exists in the
enterprise branch, and that issue has been fixed in a different
way in a different place.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14234
To reduce the indentation level, and to improve readability. Also, take this opportunity to give some variables descriptive names.
Closes #14236
* github.com:scylladb/scylladb:
migration_manager: coroutineize migration_manager::do_merge_schema_from()
migration_manager: coroutineize migration_manager::sync_schema()
Currently, the mapping is initialized from the gossiper state when
group0 server is started and updated from a gossiper change
listener. Gossiper state is restored from system.peers in
storage_service::join_cluster(), which is later than
setup_group0_if_exists() is called.
The restarted server will hang in
group0_service.setup_group0_if_exist(), which waits for snapshot
loading, which waits for storage_service::topology_state_load(), which
waits for IP mapping for servers mentioned in the topology, and
produces logs like this:
WARN 2023-06-12 15:45:21,369 [shard 0] storage_service - (rate limiting dropped 196 similar messages) raft topology: cannot map c94ae68f-869d-4727-8b2f-d40814e395f0 to ip, retrying.
This is a regression after f26179c, where group0 server is initialized
before the gossiper is started.
The fix is to load the mapping from system.peers before group0 is
started. Gossiper state is not available at this point, so we read the
mapping directly from system keyspace. This change will also be needed
to implement messaging by host id, even if raft is disabled, where we
will need to restore the mapping early.
Fixes #14217
Closes #14220
This is another part of splitting Scylla initialization into two phases:
local and remote parts. Performing queries is done with `storage_proxy`,
so for local queries we want to initialize it before we initialize
services specific to cluster communication such as `gossiper`,
`messaging_service`, `storage_service`.
`system_keyspace` should also be initialized after `storage_proxy` (and
is after this patch) so in the future we'll be able to merge the
multiple initialization steps of `system_keyspace` into one (it only
needs the local part to work).
We want to separate two phases of Scylla service initialization: first
we initialize the local part, which allows performing local queries,
then a remote part, which requires contacting other nodes in a cluster
and allows performing distributed queries.
The `database` object is crucial for both remote and local queries, but it
was created pretty late, after services such as `gossiper` or
`storage_service` which are used for distributed operations.
Fortunately we can easily move `database` initialization and all of its
prerequisites early in the init procedure.
These services are now passed during `init_messaging_service`, and
that's when the `remote` object is constructed.
The `remote` object is then destroyed in `uninit_messaging_service`.
Also, `migration_manager*` became `migration_manager&` in
`init_messaging_service`.
Prepare the users of `remote` for the possibility that it's gone.
The `remote()` accessor throws an error if it's gone. Observe that
`remote()` is only used in places where it's verified that we really
want to send a message to a remote node, with a small exception:
`truncate_blocking`, which truncates locally by sending an RPC to
ourselves (and truncate always sends RPC to the whole cluster; we might
want to change this behavior in the future, see #11087). Other places
are easy to check (it's either implementations of `apply_remotely`
which is only called for remote nodes, or there's an `if` that checks
we don't apply the operation to ourselves).
There is one direct access to `_remote` which checks first if `_remote`
is available: `storage_proxy::is_alive`. If `_remote` is unavailable, we
consider nodes other than us dead. Indeed, if `gossiper` is unavailable,
we didn't have a chance to gossip with other nodes and mark them alive.
In `query_partition_key_range_concurrent` there's a calculation of cache
hit rates which requires accessing `gossiper` through `remote`. We want
to support local queries when `remote` is unavailable.
Check if it's a local query and only if not, fetch `gossiper` from `remote`.
We only want to access `remote` when it's necessary - when we're
performing a query that involves remote nodes. We want to support local
queries when `remote` (in particular, `gossiper&`) is unavailable.
Add a helper, `storage_proxy::filter_replicas_for_read`, which will
check if it's a local query and return early in that case without
accessing `remote`.
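The helper described above can be sketched with hypothetical names: when every target replica is the local node, the gossiper-backed liveness filter in `remote` is never consulted, so purely local queries keep working before the remote services are initialized.

```python
LOCAL = "127.0.0.1"

def filter_replicas_for_read(replicas, remote=None):
    # local query: return early without touching `remote` at all
    if all(r == LOCAL for r in replicas):
        return list(replicas)
    if remote is None:
        raise RuntimeError("remote services not initialized yet")
    return [r for r in replicas if remote.is_alive(r)]

assert filter_replicas_for_read([LOCAL]) == [LOCAL]
try:
    filter_replicas_for_read([LOCAL, "10.0.0.2"])
    raise AssertionError("expected failure without remote")
except RuntimeError:
    pass
```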
The method sits on sstable, but it is called only from filesystem
storage, which is the only place that really needs it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14230
After recent changes, all wasm related logic has been moved from
the database class to the query_processor. As a result, the wasm
headers no longer need to be included there, and in particular,
files that include replica/database.hh no longer need to wait
on the generated header rust/wasmtime_bindings.hh to compile.
Fixes #14224
Closes #14223
This is a translation of Cassandra's CQL unit test source file
validation/operations/UpdateTest.java into our cql-pytest framework.
There are 18 tests, and they did not reproduce any previously-unknown
bug, but did provide additional reproducers for two known issues:
Refs #12243: Setting USING TTL of "null" should be allowed
Refs #12474: DELETE/UPDATE print misleading error message suggesting
ALLOW FILTERING would work
Note that we knew about this issue for the DELETE operation, and
the new test shows the same issue exists for UPDATE.
I had to modify some of the tests to allow for different error messages
in ScyllaDB (in cases where the different message makes sense), as well
as cases where we decided to allow in Scylla some behaviors that are
forbidden in Cassandra - namely Refs #12472.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14222
When scylla starts it collects dangling sstables from the datadir. This includes temporary sstable directories and the pending-deletion log. S3-backed sstables cannot be garbage-collected like that; instead, "garbage" entries from the ownership table should be processed. Currently the g.c. code is unaware of storage and scans the datadir for whatever sstable it's called for.
This PR prepares the garbage_collect() call to become virtual, but no-op for ownership-table lister. Proper S3 garbage-collecting is not yet here, it needs an extra patch to seastar http client.
refs: #13024
Closes #14023
* github.com:scylladb/scylladb:
sstable_directory: Do not collect filesystem garbage for S3-backed sstables
sstable_directory: Deduplicate .process() location argument
sstable_directory: Keep directory lister on stack
sstable_directory: Use directory_lister API directly
Fixes https://github.com/scylladb/scylladb/issues/14084
This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.
Closes #14228
* github.com:scylladb/scylladb:
doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
doc: add OS support for ScyllaDB 5.3
to reduce the indentation level, and to improve the readability.
also, take this opportunity to name some variables for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a function which retrieves the value of the nth
field from a serialized tuple value.
I tried to make it as efficient as possible.
Other functions, like evaluate(subscript) tend to
deserialize the whole structure and put all of its
elements in a vector. Then they select a single element
from this vector.
This is wasteful, as we only need a single element's value.
This function goes over the serialized fields
and directly returns the one that is needed.
No allocations are needed.
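The idea can be sketched as below. This is an illustrative model (not Scylla's actual API or names) assuming the CQL tuple wire format: each field is a 32-bit big-endian length followed by the field bytes, with a length of -1 encoding NULL. Fields are skipped until the nth one, so no vector of all elements is built and no allocation happens.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

static int32_t read_be32(const unsigned char* p) {
    return int32_t((uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                   (uint32_t(p[2]) << 8) | uint32_t(p[3]));
}

// Returns a view into the serialized buffer; nullopt on NULL or out of range.
std::optional<std::string_view> nth_tuple_field(std::string_view buf, size_t n) {
    auto* p = reinterpret_cast<const unsigned char*>(buf.data());
    auto* end = p + buf.size();
    for (size_t i = 0; ; ++i) {
        if (end - p < 4) {
            return std::nullopt;                    // fewer than n+1 fields
        }
        int32_t len = read_be32(p);
        p += 4;
        if (i == n) {
            if (len < 0 || end - p < len) {
                return std::nullopt;                // NULL (or malformed) field
            }
            return std::string_view(reinterpret_cast<const char*>(p), size_t(len));
        }
        if (len > 0) {
            if (end - p < len) {
                return std::nullopt;                // malformed buffer
            }
            p += len;                               // skip this field's bytes
        }
    }
}

// helper for the usage example: serialize the tuple ("ab", NULL, "xyz")
std::string make_sample_tuple() {
    std::string s;
    auto put32 = [&](int32_t v) {
        for (int i = 3; i >= 0; --i) {
            s.push_back(char((uint32_t(v) >> (8 * i)) & 0xff));
        }
    };
    put32(2); s += "ab";
    put32(-1);                                      // NULL middle field
    put32(3); s += "xyz";
    return s;
}
```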
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Since all function overload selection is done by prepare_expression(),
we no longer need to implement the assignment_testable interface, so
drop it.
Since there's now just one implementation of assignment_testable,
we can drop it and replace it by the implementation (expressions),
but that is left for later.
Now that selector expressions are prepared, we can avoid doing
the work ourselves:
- function_name:s are resolved into functions, so we can error
out if we see a function_name (and drop the with_function class)
- casts are converted to anonymous functions, so we can error
out if we see them (and drop with with_cast class)
- field_selection:s can rely on the prepared field_idx
Call prepare_expression() on selector expressions to resolve types. This
leaves us with just one way to move from the unprepared domain to the
prepared domain.
The change is somewhat awkward since do_prepare_selectable() is re-doing
work that is done by prepare_expression(), but somehow it all works. The
next patch will tear down the unnecessary double-preparation.
assignment_testable is used to convey type information to function overload
selection. The implementation for `selector` recognizes that counters are
really bigints and special cases them. The equivalent implementation for
expressions doesn't, so bring over that nuance here too.
With this, things like sum(counter_column) match the overload for
sum(bigint) rather than failing.
Our prepare phase performs constant-folding: if an expression
is composed of constants, and is pure, it is evaluated during
the preparation phase rather than during query execution.
This however can't work for user-defined functions as these require
running in a thread, and we aren't running in a thread during
preparation time. Skip the optimization in this case.
Before this series, function overload resolution peeked
at function arguments to see if they happened to be selectors,
and if so grabbed their type. If they did not happen to be
selectors, we wouldn't know their type, but as it happened
all generic functions are aggregates, and aggregates are only
legal in the SELECT clause, so that didn't matter.
In a previous patch, we changed assignment_testable to carry
an optional type and wired it to selector, so we wouldn't
need to dynamic_cast<selector>.
Now, we wire the optional type to assignment_testable_expression,
so overload resolution of generic functions can happen during
expression preparation.
The code that bridges the function argument expressions to
assignment_testable is extracted into a function, since it's
too complicated to be written as a transform.
The field_selection structure is augmented with the field
index so that does not need to be done at evaluation time,
similar to the current with_field_selection selectable.
CQL supports two cast styles:
- C-style: (type) expr, used for casts between binary-compatible types
and for type hinting of bind variables
- SQL-style: (expr AS type), used for real type conversions
Currently, the expression system differentiates them by the cast::type
field, which is a data_type for SQL-style casts and a cql3_type::raw
for C-style casts, but that won't work after the prepare phase is applied
to SQL-style casts when the type field will be prepared into a data_type.
Prepare for this by adding a separate enum to distinguish between the
two styles.
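The separation can be sketched with an explicit enum; the names below are hypothetical (not the actual field layout), but they show how the style is distinguished independently of whether the target type is still a raw cql3 type or an already-prepared data_type.

```cpp
#include <cassert>

// Hypothetical sketch, not Scylla's actual representation.
enum class cast_style {
    c,    // (type) expr -- binary-compatible reinterpretation or bind-variable type hint
    sql,  // (expr AS type) -- a real type conversion
};

bool performs_conversion(cast_style s) {
    return s == cast_style::sql;    // only SQL-style casts convert values
}
```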
A pure function should return the same value on every invocation,
but enabled_injections() returns true or false depending on global
state.
Mark it impure to reflect that.
Currently, the bug has no effect, but once we prepare selectors,
the prepare_function_call() will constant-fold calls to pure
functions, so we'll capture global state at prepare time rather
than evaluate it each time anew.
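A toy model of the folding hazard (hypothetical names, not Scylla code) makes this concrete: prepare() folds pure calls once at prepare time, while impure calls are kept as-is and re-evaluated on every execution.

```cpp
#include <cassert>
#include <functional>
#include <utility>

static bool g_injection_enabled = false;

struct func {
    bool pure;
    std::function<bool()> eval;
};

std::function<bool()> prepare(const func& f) {
    if (f.pure) {
        bool folded = f.eval();              // global state captured at prepare time
        return [folded] { return folded; };
    }
    return f.eval;                           // evaluated anew on each call
}

// an enabled_injections() analogue: it reads global state, so marking it
// pure would freeze the prepare-time answer forever.
func enabled_injections_fn(bool marked_pure) {
    return func{marked_pure, [] { return g_injection_enabled; }};
}

// returns {what a wrongly-pure function reports, what an impure one reports}
// after the global state flips between prepare time and execution time.
std::pair<bool, bool> demo() {
    g_injection_enabled = false;
    auto stale = prepare(enabled_injections_fn(true));   // the bug
    auto live = prepare(enabled_injections_fn(false));   // the fix
    g_injection_enabled = true;
    return {stale(), live()};
}
```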
Type inference for function calls is a bit complicated:
- a function argument can be inferred from the signature: a call to
my_func(:arg) will infer :arg's type from the function signature
- a function signature can be inferred from its argument types:
a call to max(my_column) will select the correct max() signature
(as max is generic) from my_column's type
Currently, functions::get() implements this by invoking
dynamic_cast<selector*> on the argument. If the caller of
functions::get() is the SELECT clause preparation, then the
cast will succeed and we'll be able to find the type. If not,
we fail (and fall back to inferring the argument types from a
non-generic function signature).
Since we're about to move selectors to expressions, the dynamic_cast
will fail, so we must replace it with a less fragile approach.
The fix is to augment assignment_testable (the interface representing
a function argument) with an intentionally-awkwardly-named
assignment_testable_type_opt(), that sees whether we happen to know
the type for the argument in order to implement signature-from-argument
inference.
A note about assignment_testable: this is a bridge interface
that is the least common denominator of anything that calls functions.
Since we're moving towards expressions, there are fewer implementations of
the interface as the code evolves.
test_assignment() and related functions check for type compatibility between
a right-hand-side and a left-hand-side.
It started its life with a limited functionality for INSERT and UPDATE,
but now it's about to be used for cast expression in selectors, which
can cast a column_value. A column_value is still an unresolved_identifier
during the prepare phase, and cannot be resolved without a schema.
To prepare for this, pass an optional schema everywhere.
Ultimately, test_assignment likely needs to be folded into prepare_expr(),
but before that prepare_expr() has to be used everywhere.
prepare_expr() began its life as a replacement for the WHERE clause,
so it shares its restrictions, one of which is not supporting aggregate
functions.
In previous patches, we added an explicit check to all users, so we can
now remove the check here, so that we can later prepare selectors.
In addition to dropping the check, we drop the dynamic_cast<scalar_function>,
as it can now fail. It turns out it's unnecessary since everything is available
from the base class.
Note we don't allow constant folding involving aggregate functions: first,
our evaluator doesn't support it, and second, we don't have the iteration count
at prepare time.
Since we don't yet prepare selectors, all calls to prepare_expr()
are adjusted.
Note that missing a check isn't fatal - it will be trapped at runtime
because evaluate(aggregate) will throw.
Aggregate functions are only allowed in certain contexts (the
SELECT clause and the HAVING clause, which we don't yet have).
prepare_expr() currently rejects aggregate functions, but that means
we cannot use it to prepare selectors.
To prepare for the use of prepare_expr() in selectors, we'll have to
move the check out of prepare_expr(). This helper is the beginning of
that change.
I considered adding a parameter to prepare_expr(), but that is even
more noisy than adding a call to the helper.
column_mutation_attribute_type() returns int32_type or long_type
depending on whether TTL or WRITETIME is requested.
Will be used later when we prepare column_mutation_attribute
expressions.
There are two execute() overloads, but they don't do the same thing - one is a
partial implementation of the other.
The same is true of two execute_without_checking_exception_message() overloads.
Change the name of the subordinate overload to indicate its role. Overloads should
be used when the only difference between overloads is the argument type, not when
one does a subset of the other.
per clang's documentation, -Og is like -O1, which is in turn an optimization
level between -O0 and -O2. -O0 "generates the most debuggable code".
for instance, with -O0, presumably, the variables are not allocated in
registers where they could later get overwritten; they are always allocated on
the stack. this helps with the debugging.
in this change, -O0 is used for better debugging experience. the
downside is that the emitted code size will be greater than the one
emitted from -Og, and the executable will be slower.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14210
This reverts commit 7e68ed6a5d. With -Og
simple 'gdb build/debug/scylla -ex start' shows the function parameters
as "optimized out" while with -O0 they display fine. This applies to
all variables, not just main's parameters.
Bisect revealed that this behavior started with the reverted commit; it's
not due to a later toolchain update.
Fixes #14196.
Closes #14197
This PR adds a set of tests for cluster features. The set covers those tests from the test plan for the upcoming "cluster features on raft" that do not depend on implementation-specific details. Those tests are applicable to the existing gossip-based implementation, so they can be useful for us right now.
The tests simulate cluster upgrades by conditionally marking support for the TEST_ONLY_FEATURE on the nodes via error injection. Therefore, the tests work only in non-release mode.
The `test_partial_upgrade_can_be_finished_with_removenode` test is marked as skipped because of a bug in gossip that prevents features from being enabled if a node was removed within the last 3 days (#14194).
Closes #14211
* github.com:scylladb/scylladb:
test/topology: add cluster feature tests
test: introduce get_supported_features/get_enabled_features
test: move wait_for_feature to pylib utils
feature_service: add a test-only cluster feature
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.
The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.
As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.
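The intended semantics can be sketched generically (this is an illustration of the behavior, not Scylla's aggregate machinery): count(col) only checks whether each input is NULL, so one generic implementation covers scalars, collections, tuples and UDTs alike.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// NULL is modeled as an empty optional; the element type T is irrelevant.
template <typename T>
size_t count_column(const std::vector<std::optional<T>>& rows) {
    size_t n = 0;
    for (const auto& v : rows) {
        if (v.has_value()) {
            ++n;                 // NULLs are skipped, regardless of T
        }
    }
    return n;
}
```

Note that an empty collection still counts as a value; only NULL is skipped.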
A unit test is added. It passes with Cassandra as well.
Fixes #14198.
Closes #14199
This commit introduces a new boolean flag, `shutdown`, to the
forward_service, along with a corresponding shutdown method. It also
adds checks throughout the forward_service to verify the value of the
shutdown flag before retrying or invoking functions that might use the
messaging service under the hood.
The flag is set before messaging service shutdown, by invoking
forward_service::shutdown in main. By checking the flag before each call
that potentially involves the messaging service, we can ensure that the
messaging service is still operational. If the flag is false, indicating
that the messaging service is still active, we can proceed with the
call. In the event that the messaging service is shutdown during the
call, appropriate exceptions should be thrown somewhere down in called
functions, avoiding potential hangs.
This fix should resolve the issue where forward_service retries could
block the shutdown.
Fixes #12604
Closes #13922
The `topology_node_mutation_builder::set` function, when passed a
non-empty set of tokens, will construct a mutation that adds given
tokens to the column instead of overwriting them. This is not a problem
today because we are always calling `set` on an empty column, but given
the fact that the function is called `set` not `add` and other overloads
of `set` do overwrite, the function might be misused in the future.
This commit fixes the problem by initializing the tombstone in
`collection_mutation_description` properly, causing the previous state
to be dropped before applying new tokens. The tombstone has a timestamp
which is one less than the timestamp of the added cells, mimicking the
CQL behavior which happens when a non-frozen collection is overwritten.
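A toy model of the fix (illustrative, not the real mutation machinery): the collection mutation carries a tombstone whose timestamp is one less than the new cells' write timestamp, so every older cell is dropped while the freshly written cells survive.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

struct collection_state {
    long tomb = -1;                       // tombstone timestamp, -1 = none
    std::map<std::string, long> cells;    // token -> write timestamp
};

void apply_set(collection_state& s, const std::map<std::string, long>& new_cells,
               long write_ts) {
    long tomb = write_ts - 1;             // one less than the added cells
    if (tomb > s.tomb) {
        s.tomb = tomb;
    }
    // cells at or below the tombstone timestamp are shadowed and dropped
    for (auto it = s.cells.begin(); it != s.cells.end();) {
        it = (it->second <= s.tomb) ? s.cells.erase(it) : std::next(it);
    }
    for (const auto& [k, ts] : new_cells) {
        s.cells[k] = ts;
    }
}

// demo: the second set() fully replaces the first instead of adding to it
size_t tokens_after_second_set() {
    collection_state s;
    apply_set(s, {{"t1", 5}, {"t2", 5}}, 5);
    apply_set(s, {{"t3", 10}}, 10);
    return s.cells.size();
}
```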
Closes #14216
The function used `gossiper&` to check whether an endpoint is considered
alive. Abstract this out through `noncopyable_function`.
This will allow us to use `endpoint_filter` during local queries when
`remote` (which contains the `gossiper` reference) is unavailable.
when compiling the generated source files, sometimes, we can run
into the FTBFS like:
02:18:54 FAILED: build/release/gen/cql3/CqlParser.o
02:18:54 clang++ ... -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
...
02:18:54 In file included from build/release/gen/cql3/CqlParser.cpp:44:
02:18:54 In file included from build/release/gen/cql3/CqlParser.hpp:75:
02:18:54 In file included from ./cql3/statements/create_function_statement.hh:12:
02:18:54 In file included from ./cql3/functions/user_function.hh:16:
02:18:54 ./lang/wasm.hh:15:10: fatal error: 'rust/wasmtime_bindings.hh' file not found
02:18:54 #include "rust/wasmtime_bindings.hh"
02:18:54 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
CqlParser.cc is a source file generated from cql3/Cql.g; this source in
turn includes another source file generated from
wasmtime_bindings/src/lib.rs. but we failed to set up this dependency in
the build.ninja rules -- we only taught ninja that "to compile the
grammar source files, please prepare the `serializers` source files
first". but this is not enough.
so, in this change, we just replace `serializers` with `gen_headers`,
as the latter is a superset of the former, and should fulfill the needs
of CqlParser.cc.
Fixes#14213
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14214
This commit adds a set of tests for cluster features. The set covers
those tests from the test plan for the upcoming "cluster features on
raft" that do not depend on implementation-specific details. Those tests
are applicable to the existing gossip-based implementation, so they can
be useful for us right now.
The tests simulate cluster upgrades by conditionally marking support for
the TEST_ONLY_FEATURE on the nodes via error injection. Therefore, the
tests work only in non-release mode.
The `test_partial_upgrade_can_be_finished_with_removenode` test is
marked as skipped because of a bug in gossip that prevents features from
being enabled if a node was removed within the last 3 days (#14194).
Introduces two helper functions that allow getting information about
supported/enabled features on a node, according to its system tables.
As a bonus, the `wait_for_feature` function is refactored to use
`get_enabled_features`.
Fixes https://github.com/scylladb/scylladb/issues/14097
This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.
The update is in sync with the change made for
ScyllaDB 5.2.
This commit must be backported to branch-5.2 and
branch-5.3.
Closes #14118
Adds a cluster feature called TEST_ONLY_FEATURE. It can only be marked
as supported via error injection "features_enable_test_feature", which
can be enabled on node startup via CLI option or YAML configuration.
The purpose of this cluster feature is to simulate upgrading a node to a
version that supports a new feature. This allows us to write tests which
verify that the cluster feature mechanism works.
The fact that TEST_ONLY_FEATURE can only be enabled via error injection
should make it impossible to accidentally enable it in release mode and,
consequently, in production.
Prior to this, `table_name` was validated for every request in `find_table_name`, leading to unnecessary overhead (small, but unnecessary). Now, the `table_name` is only validated during the creation request, and in other requests only if the table does not exist (to keep compatibility with DynamoDB's exceptions).
Fixes: #12538
Closes #13966
because we build stripped package and non-stripped package in parallel
using ninja. there are chances that the non-stripped build job could
be adding build/node_exporter directory to the tarball while the job
building stripped package is using objcopy to extract the symbols from
the build/node_exporter/node_exporter executable. but objcopy creates
temporary files when processing the executables. and the temporary
files can be spotted by the non-stripped build job. there are two
consequences:
1. the non-stripped build job includes the temporary files in its tarball,
even though they are not supposed to be distributed
2. the non-stripped build job fails to include the temporary file(s), as
they are removed after objcopy finishes its job. but the job did spot
them when preparing the tarball. so when the tarfile python module
tries to include the previously found temporary file(s), it throws.
neither of these consequences is expected. but fortunately, this only
happens when packaging the non-stripped package. when packaging the
stripped package, the build/node_exporter directory is not in flux
anymore. as ninja ensures the dependencies between the jobs.
so, in this change, we do not add the whole directory when packaging
the non-stripped version, as all its ingredients have been added
separately as regular files. and when packaging the stripped version,
we still use the existing step, as we don't have to list all the
files created by strip.sh:
node_exporter{,.debug,.dynsyms,.funcsyms,.keep_symbols,.minidebug.xz}
we could do so in this script, but the repetition is unnecessary and
error-prone. so, let's keep including the whole directory recursively,
so all the debug symbols are included.
Fixes https://github.com/scylladb/scylladb/issues/14079
Closes #14081
* github.com:scylladb/scylladb:
create-relocatable-package.py: package build/node_export only for stripped version
create-relocatable-package.py: use positive condition when possible
this change is a follow-up of 4f5fcb02fd,
the goal is to avoid the programming oversights like
```c++
trace(trace_ptr, "foo {} with {} but {} is {}");
```
as `trace(const trace_state_ptr& p, const std::string& msg)` is
a better match than the templated one, i.e.,
`trace(const trace_state_ptr& p, fmt::format_string<T...> fmt, T&&...
args)`. so we cannot detect this with the compile-time format checking.
so let's just drop this overload, and update its callers to use
the other overload.
The change was suggested by Avi. the example also came from him.
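A minimal sketch of the compile-time idea behind fmt::format_string (a hypothetical helper, not {fmt}'s implementation): the placeholder count of a format string can be computed in a constexpr context, which is what allows a call passing the wrong number of arguments to be rejected before the program ever runs.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// counts "{}" placeholders at compile time (or at runtime for dynamic strings)
constexpr size_t count_placeholders(std::string_view fmt) {
    size_t n = 0;
    for (size_t i = 0; i + 1 < fmt.size(); ++i) {
        if (fmt[i] == '{' && fmt[i + 1] == '}') {
            ++n;
            ++i;    // skip the '}' we just matched
        }
    }
    return n;
}
```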
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14188
sel is a local variable, and it is not shared with anybody else.
so make it a unique_ptr<> for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14189
Uploading sinks have an internal semaphore limiting the maximum number of
uploading parts and pieces to two. This approach has
several drawbacks.
1. The number is arbitrary. It could as well be three, four, or any other number
2. Jumbo upload in fact violates this parallelism, because the limit applies to
the maximum number of pieces _and_ the maximum number of parts in each piece
that can be uploaded in parallel. Thus jumbo upload results in four
parts in parallel.
3. Multiple uploads don't sync with each other, so uploading N objects
would result in N * 2 (or even N * 4 with jumbo) uploads in parallel.
4. Single upload could benefit from using more sockets if no other
uploads happen in parallel. IOW -- limit should be shard-wide, not
single-upload-wide
Previous patches already put the per-shard parallelism under (some)
control, so this semaphore is in fact used as a way to collect
background uploading fibers on final flush and thus can be replaced with
a gate.
As a side effect, this fixes an issue that writes-after-flush shouldn't
happen (see #13320) -- when flushed the upload gate is closed and
subsequent writes would hit gate-closed error.
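A single-threaded toy of the seastar::gate semantics relied on here (illustrative, not seastar code): background upload fibers enter() the gate, the final flush close()s it, and any write attempted afterwards fails with a gate-closed error instead of racing with the flush.

```cpp
#include <cassert>
#include <stdexcept>

class toy_gate {
    int _count = 0;
    bool _closed = false;
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_count;
    }
    void leave() { --_count; }
    void close() { _closed = true; /* the real gate also waits for _count == 0 */ }
};

bool write_after_flush_fails() {
    toy_gate g;
    g.enter();                 // a background upload fiber starts...
    g.leave();                 // ...and completes
    g.close();                 // final flush closes the gate
    try {
        g.enter();             // a write after flush
    } catch (const std::runtime_error&) {
        return true;           // hits the gate-closed error, as intended
    }
    return false;
}
```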
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patch different sched groups got different http clients.
By default each client is started with 100 allowed connections. This can
be too much -- 100 * nr-sched-groups * smp::count can be a huge
number. Also, different groups should have different parallelism, e.g.
flush/compaction doesn't care that much about latency and can use fewer
sockets while query class is more welcome to have larger concurrency.
As a starter -- configure http clients with maximum shares/100 sockets.
Thus query class would have 10 and flush/compaction -- 1.
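The sizing rule can be sketched as below; the exact shares values are assumptions (e.g. 1000 shares for the query class, around 100 for flush/compaction), and the floor of one connection is added so every group can make progress.

```cpp
#include <algorithm>
#include <cassert>

// maximum connections for an http client, derived from the group's shares
unsigned max_connections_for(unsigned shares) {
    return std::max(1u, shares / 100);
}
```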
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The intent is to isolate workloads from different sched groups from each
other and not let one sched group consume all sockets from the http
client thus affecting requests made by other sched groups.
The contention happens on the maximum number of sockets an http client may
have (see scylladb/seastar#1652). If requests take time and the client is
asked to make more and more, it will eventually stop spawning new
connections and will get blocked internally, waiting for running
requests to complete and put a socket back into the pool. If one sched
group's workload (e.g. memtable flush) consumes all the available sockets,
then the workload from another group (e.g. query) will be blocked, thus
spoiling its latency (which is poor on its own, but still)
After this change S3 client maintains a sched_group:http_client map
thus making sure different sched groups don't clash with each other so
that e.g. query requests don't wait for flush/compaction to release a
socket.
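An illustrative sketch of the map (names hypothetical): one http client per scheduling group, created lazily, so one group exhausting its connection pool cannot starve requests issued from another group.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct toy_http_client {
    unsigned max_connections;
};

class toy_s3_client {
    std::map<std::string, std::unique_ptr<toy_http_client>> _clients;
public:
    toy_http_client& client_for(const std::string& sched_group, unsigned max_conn) {
        auto& slot = _clients[sched_group];
        if (!slot) {
            slot = std::make_unique<toy_http_client>(toy_http_client{max_conn});
        }
        return *slot;
    }
    size_t nr_clients() const { return _clients.size(); }
};

bool groups_get_distinct_clients() {
    toy_s3_client c;
    auto& q1 = c.client_for("query", 10);
    auto& q2 = c.client_for("query", 10);   // same group -> same client
    auto& fl = c.client_for("flush", 1);    // different group -> its own client
    return &q1 == &q2 && &q1 != &fl && c.nr_clients() == 2;
}
```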
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper call will serve several purposes.
First, it makes the necessary preparations to the request before sending it,
in particular calling authorize().
Second, there's the need to re-make requests that failed with
"connection closed" error (see #13736)
Third, one S3 client is shared between different scheduling groups. In
order to isolate the groups' workloads from each other, different http clients
should be used, and this helper will be in charge of selecting one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The estimated_partitions value is only known after the repair_meta is created.
Currently, the default estimated_partitions was used to create the
writer, which is not correct.
To fix this, use the updated estimated_partitions.
Reported by Petr Gusev
Closes #14179
Return 400 Bad Request instead of 500 Internal Server Error
when user requests task or module which does not exist through
task manager and task manager test apis.
Closes #14166
* github.com:scylladb/scylladb:
test: add test checking response status when requested module does not exist
api: fix indentation
api: throw bad_param_exception when requested task/module does not exists
in this series, we use {fmt}'s compile-time formatting check, and avoid deep copy when creating sstring from std::string.
Closes #14169
* github.com:scylladb/scylladb:
tracing: use std::string instead of sstring for event_record::message
tracing: use compile-time formatting check
Compaction task test should only check the intended group of task.
Thus, the tasks are filtered in each test.
In order to be able to run the tests in parallel, checks for the tasks
of the same type are grouped together.
Fixes: #14131.
Closes #14161
* github.com:scylladb/scylladb:
test: put compaction task checks of the same type together
test: filter tasks of given compaction type
there are chances that a Boost::test test fails to generate a
valid XML file after the test finishes, and
xml.etree.ElementTree.parse() throws when parsing it.
see https://github.com/scylladb/scylla-pkg/issues/3196
before this change, the exception is not handled, and test.py
aborts in this case. this does not help and could be misleading.
after this change, the exception is handled and printed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14180
Contains two tables describing all the possible operations seen in user, system and streaming semaphore diagnostics dumps.
Closes #14171
* github.com:scylladb/scylladb:
docs/dev/reader-concurrency-semaphore.md: add section about operations
docs/dev/reader-concurrency-semaphore.md: switch to # headers markings
reader_concurrency_semaphore: s/description/operation/ in diagnostics dumps
The helper is huge, in the form of a then-chain. Also it's generic enough not to limit itself to returning sstables' Data file names only.
refs: #14122 (detached from the one that needs more thinking about)
Closes #14174
* github.com:scylladb/scylladb:
table: Return shared sstable from get_sstables_by_partition_key()
table: Coroutinize get_sstables_by_partition_key()
* seastar afe39231...99d28ff0 (16):
> file/util: Include seastar.hh
> http/exception: Use http::reply explicitly
> http/client: Include lost condition-variable.hh
> util: file: drop unnecessary include of reactor.hh
> tests: perf: add a markdown printer
> http/client: Introduce unexpected_status_error for client requests
> sharded: avoid #include <seastar/core/reactor.hh> for run_in_background()
> code: Use std::is_invocable_r_v instead of InvokeReturns
> http/client: Add ability to change pool size on the fly
> http/client: Add getters for active/idle connections counts
> http/client: Count and limit the number of connections
> http/client: Add connection->client RAII backref
> build: use the user-specified compiler when building DPDK
> build: use proper toolchain based on specified compiler
> build: only pass CMAKE_C_COMPILER when building ingredients
> build: use specified compiler when building liburing
Two changes are folded into the commit:
1. a missing seastar/core/coroutine.hh include in one .cc file that
got it indirectly included before the seastar reactor.hh drop from
file.hh
2. http client now returns unexpected_status_error instead of
std::runtime_error, so s3 test is updated respectively
Closes #14168
There are some headers that include tracing/*.hh ones despite all they
need is forward-declared trace_state_ptr
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14155
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.
Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.
Fixes: #14029
Fixes: #14072
Closes #14167
* github.com:scylladb/scylladb:
raft: server: throw fewer `commit_status_unknown`s from `wait_for_entry`
raft: replication test: don't hang if `_seen` overshots `_apply_entries`
raft: replication test: print a warning when handling `commit_status_unknown`
In 297c75c6d8 I set the timeout to
5 minutes mainly due to debug mode which is often quite slow on Jenkins.
But 5 minutes is a bit of an overkill. It wouldn't be a problem but
there is a dtest that waits for a node to fail bootstrap; it's wasteful
for the test to sleep for an entire 5 minutes.
Set it to:
- 3 minutes in debug mode,
- 30 seconds in dev/release modes.
Ref: scylladb/scylla-dtest#3203
Closes #14140
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.
Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.
Fixes: #14029
As in the previous commit, if a command gets doubly applied due to
`commit_status_unknown`, this could lead to hard-to-debug failures;
one of them was the test hanging because we would never call
`_done.set_value()` in `state_machine::apply` due to `_seen`
overshooting `_apply_entries`.
Fix the problem and print a warning if we apply too many commands.
Fixes: #14072
`commit_status_unknown` may lead to double application and then a
hard-to-debug failure. But some tests actually rely on retrying it, so
print a warning and leave a FIXME for maybe a better future solution.
Ref: #14029
This commit adds a table (with 1 row) explaining Scylla-specific
materialized view options - which now consists of just
synchronous_updates.
Tested manually by running `make preview` from docs/ directory.
Closes #11150
The call is generic enough not to drop the sstable itself on return so
that callers can do whatever they need with it. The only today's caller
is API which will convert sstables to filenames on its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
when creating an event_record, the typical use case is to use a
string created using fmt::format(), which returns a std::string.
before this change, we always converted the std::string to a sstring,
and moved this shiny new sstring into a new event_record. but
when creating an sstring, we always perform a deep copy, which is not
necessary, as we own the std::string already.
so, in this change, instead of performing a deep copy, we just keep
the std::string and pass it all the way to where event_record is
created. please note, the std::string will be implicitly converted
to data_value, and it will be dropped on the floor after being
serialized in abstract_type::decompose(). so this deep copy is
inevitable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this change we pass the fmt string using fmt::format_string<T...>
in order to use {fmt}'s compile-time format checking, so we can identify
bad format specifiers or bad format placeholders at compile time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In the task manager and task manager test REST APIs, when a task or module
which does not exist is requested, we get Internal Server Error.
In such cases, wrap the thrown exceptions in bad_param_exception
to respond with a Bad Request code.
Modify the tests accordingly.
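The rewrapping pattern can be sketched as follows. All names here (`bad_param`, `find_task`) are hypothetical stand-ins; the real code uses seastar httpd's bad_param_exception, which the HTTP layer maps to a 400 response instead of a 500:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for seastar httpd's bad_param_exception (maps to 400 Bad Request).
struct bad_param : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A lookup failure (which would otherwise surface as an unhandled exception
// and a 500 Internal Server Error) is rewrapped as a client error.
std::string find_task(const std::map<std::string, std::string>& tasks,
                      const std::string& id) {
    try {
        return tasks.at(id);  // std::map::at throws std::out_of_range when absent
    } catch (const std::out_of_range&) {
        throw bad_param("task with id " + id + " not found");
    }
}
```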
statement_restrictions: forbid IS NOT NULL on columns outside the primary key
IS NOT NULL is currently allowed only when creating materialized views.
It's used to convey that the view will not include any rows that would make the view's primary key columns NULL.
Generally materialized views allow placing restrictions on the primary key columns, but restrictions on the regular columns are forbidden. The exception was IS NOT NULL - it was allowed to write regular_col IS NOT NULL. The problem is that this restriction isn't respected, it's just silently ignored (see #10365).
Supporting IS NOT NULL on regular columns seems to be as hard as supporting any other restrictions on regular columns.
It would be a big effort, and there are some reasons why we don't support them.
For now let's forbid such restrictions, it's better to fail than be wrong silently.
Throwing a hard error would be a breaking change.
To avoid breaking existing code, the reaction to an invalid IS NOT NULL restriction is controlled by the `strict_is_not_null_in_views` flag.
This flag can have the following values:
* `true` - strict checking. Having an `IS NOT NULL` restriction on a column that doesn't belong to the view's primary key causes an error to be thrown.
* `warn` - allow invalid `IS NOT NULL` restrictions, but throw a warning. The invalid restrictions are silently ignored.
* `false` - allow invalid `IS NOT NULL` restrictions, without any warnings or errors. The invalid restrictions are silently ignored.
The default values for this flag are `warn` in `db::config` and `true` in scylla.yaml.
This way the existing clusters will have `warn` by default, so they'll get a warning if they try to create such an invalid view.
New clusters with fresh scylla.yaml will have the flag set to `true`, as scylla.yaml overwrites the default value in `db::config`.
New clusters will throw a hard error for invalid views, but in older existing clusters it will just be a warning.
This way we can maintain backwards compatibility, but still move forward by rejecting invalid queries on new clusters.
Fixes: #10365
Closes #13013
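The default interplay described above boils down to one line in the shipped configuration file. The flag name and values are from the commit; the comment wording is illustrative:

```yaml
# Shipped in the default scylla.yaml of new installations: strict checking,
# i.e. IS NOT NULL on a column outside the view's primary key is an error.
# Existing clusters whose scylla.yaml lacks this line fall back to the
# db::config default of `warn`.
strict_is_not_null_in_views: true
```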
* github.com:scylladb/scylladb:
boost/restriction_test: test the strict_is_not_null_in_views flag
docs/cql/mv: columns outside of view's primary key can't be restricted
cql-pytest: enable test_is_not_null_forbidden_in_filter
statement_restrictions: forbid IS NOT NULL on columns outside the primary key
schema_altering_statement: return warnings from prepare_schema_mutations()
db/config: add strict_is_not_null_in_views config option
statement_restrictions: add get_not_null_columns()
test: remove invalid IS NOT NULL restrictions from tests
This is a regression introduced in f26179cd27.
Fixes: #14136
* 'gleb/set_group0' of github.com:scylladb/scylla-dev:
test: restart first node to see if it can boot after restart
service: move setting of group0 point in storage_service earlier
In test_compaction_task.py, tests concerning the same type of compaction
are squashed together so that they are run synchronously and there is
no data race when the tests are run in parallel.
Add unit tests for the strict_is_not_null_in_views flag.
This flag controls the behavior in case of invalid
IS NOT NULL restrictions on a materialized view column.
Materialized views allow only restricting columns
that belong to the view's primary key, all other
restrictions should be rejected.
There was a bug where IS NOT NULL restrictions
weren't rejected, but simply ignored instead.
This flag controls what should happen when the user
runs a query with such an invalid IS NOT NULL restriction.
strict_is_not_null_in_views can have the following values:
* `true` - strict checking, invalid queries are rejected
* `warn` - the query is allowed, but a warning is printed
* `false` - the query is allowed, the invalid restrictions
are silently ignored.
The tests are based on the ones for strict_allow_filtering,
which reside in the lines preceding the newly added tests.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We used to allow IS NOT NULL restrictions on columns
that were not part of the materialized view's primary key.
It turns out that such restrictions are silently ignored (see #10365),
so we no longer allow such restrictions.
Update the documentation to reflect that change.
Also there was a mistake in the documentation.
It said that restrictions are allowed on all columns
of the base table's primary key, but they are actually
allowed on all columns of the view table's primary key,
not the base tables.
This change also fixes that mistake.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
IS NOT NULL is now allowed only on the view's primary key columns,
so the xfail marker can be removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
IS NOT NULL is currently allowed only
when creating materialized views.
It's used to convey that the view will
not include any rows that would make the
view's primary key columns NULL.
Generally materialized views allow
to place restrictions on the primary key
columns, but restrictions on the regular
columns are forbidden. The exception was
IS NOT NULL - it was allowed to write
regular_col IS NOT NULL. The problem is
that this restriction isn't respected,
it's just silently ignored.
Supporting IS NOT NULL on regular columns
seems to be as hard as supporting
any other restrictions on regular columns.
It would be a big effort, and there are some
reasons why we don't support them.
For now let's forbid such restrictions,
it's better to fail than be wrong silently.
Throwing a hard error would be a breaking change.
To avoid breaking existing code the reaction to
invalid IS NOT NULL restrictions is controlled
by the `strict_is_not_null_in_views` flag.
The default values for this flag are `warn` in db::config
and `true` in scylla.yaml.
This way the existing clusters will have `warn` by default,
so they'll get a warning if they try to create such an
invalid view.
New clusters with fresh scylla.yaml will have the flag set
to `true`, as scylla.yaml overwrites the default value
in db::config.
New clusters will throw a hard error for invalid views,
but in older existing clusters it will just be a warning.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Validation of a CREATE MATERIALIZED VIEW statement takes place inside
the prepare_schema_mutations() method.
I would like to generate warnings during this validation, but there's
currently no way to pass them.
Let's add one more return value - a vector of CQL warnings generated
during the execution of this statement.
A new alias is added to make it clear what the function is returning:
```c++
// A vector of CQL warnings generated during execution of a statement.
using cql_warnings_vec = std::vector<sstring>;
```
Later the warnings will be sent to the user by the function
schema_altering_statement::execute(), which is the only caller
of prepare_schema_mutations().
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
IS NOT NULL shouldn't be allowed on columns
which are outside of the materialized view's primary key.
It's currently allowed to create views with such restrictions,
but they're silently ignored, which is a bug.
In the following commits restricting regular columns
with IS NOT NULL will be forbidden.
This is a breaking change.
Some users might have existing code that creates
views with such restrictions, we don't want to break it.
To deal with this a new feature flag is introduced:
strict_is_not_null_in_views.
By default it's set to `warn`. If a user tries to create
a view with such invalid restrictions they will get a warning
saying that this is invalid, but the query will still go through,
it's just a warning.
The default value in scylla.yaml will be `true`. This way new clusters
will have strict enforcement enabled and they'll throw errors when the
user tries to create such an invalid view.
Old clusters without the flag present in scylla.yaml will
have the flag set to warn, so they won't break on an update.
There's also the option to set the flag to `false`. It's dangerous,
as it silences information about a bug, but someone might want it
to silence the warnings for a moment.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
we have a dedicated facility for loading sstables since
68dfcf5256, and column_family (i.e. table)
is not responsible for loading new sstables. so update the comment
to reflect this change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14154
At that level no io_priority_class-es exist. Instead, all the IO happens
in the context of the current sched-group. The file API no longer accepts a prio
class argument (and makes the io_intent arg mandatory for impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-automatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass the intent to the lower file (if applicable)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairy -- it needs to use the task queue
list for IO class names and shares, and to detect whether it should, it
checks whether the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13963
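The mechanical file_impl patch can be illustrated with stand-in types. These are sketches only, not the real Seastar classes or signatures (the real API returns futures and lives in seastar's file layer):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for seastar's io_intent; the real one carries cancellation state.
struct io_intent {};

struct file_impl {
    // old signature (removed):
    //   read_dma(pos, buf, len, const io_priority_class& pc, io_intent* intent)
    // new signature: the IO is accounted to the current scheduling group, so
    // only the (nullable) io_intent* is still passed down.
    virtual std::size_t read_dma(std::size_t pos, void* buf, std::size_t len,
                                 io_intent* intent) = 0;
    virtual ~file_impl() = default;
};

// A trivial concrete impl that pretends the whole buffer was read.
struct null_file : file_impl {
    std::size_t read_dma(std::size_t, void*, std::size_t len, io_intent*) override {
        return len;
    }
};

// A wrapping file_impl (like the patched inheritants) simply forwards the
// intent to the lower file.
struct checked_file : file_impl {
    file_impl& lower;
    explicit checked_file(file_impl& f) : lower(f) {}
    std::size_t read_dma(std::size_t pos, void* buf, std::size_t len,
                         io_intent* intent) override {
        return lower.read_dma(pos, buf, len, intent);
    }
};
```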
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)
Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.
The fix is about not inserting shared sstables into the cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor should they be, because
resharding itself cleaned up its input files).
Dtest: https://github.com/scylladb/scylla-dtest/pull/3206
Fixes #14001.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14147
group0 pointer in storage_service should be set when group0 starts.
After f26179cd27 we start group0 earlier,
so we need to move setting of the group0 pointer as well.
Due to a simple programming oversight, one of the keyspace_metadata
constructors was using an empty user_types_metadata instead of the
passed one. Fix that.
Fixes #14139
Closes #14143
as an alternative to passing the link-args using the environment variable,
we can also use build script to pass the "-C link-args=<FLAG>" to the compiler.
see https://doc.rust-lang.org/nightly/cargo/reference/build-scripts.html#cargorustc-link-argflag
to ensure that cargo is called again by ninja after build.rs is
updated, build.rs is added as a dependency of the {wasm} files along with
Cargo.lock.
this change is verified using the following command
```
RUSTFLAGS='--print link-args' cargo build \
--target=wasm32-wasi \
--example=return_input \
--locked \
--manifest-path=Cargo.toml \
--target-dir=build/cmake/test/resource/wasm/rust
```
the output includes "-zstack-size=131072" in the arguments passed to lld:
```
Compiling examples v0.0.0 (/home/kefu/dev/scylladb/test/resource/wasm/rust)
LC_ALL="C"
PATH="/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin:/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin/self-contained:/home/kefu/.local/bin:/home/kefu/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin"
VSLANG="1033"
"lld"
"-flavor" "wasm" "--rsp-quoting=posix" "--export"
"_scylla_abi" "--export" "_scylla_free" "--export" "_scylla_malloc"
"--export" "return_input" "-z" "stack-size=1048576" "--stack-first"
"--allow-undefined" "--fatal-warnings" "--no-demangle"
...
"-L" "/usr/lib/rustlib/wasm32-wasi/lib"
"-L" "/usr/lib/rustlib/wasm32-wasi/lib/self-contained"
"-o"
"/home/kefu/dev/scylladb/build/cmake/test/resource/wasm/rust/wasm32-wasi/debug/examples/return_input-ef03083560989040.wasm"
"--gc-sections"
"--no-entry"
"-O0"
"-zstack-size=131072"
```
with this change, it'd be easier to build .wat files in CMake, so
we don't need to repeat the settings in both configure.py and
CMakeLists.txt
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14123
A recent Seastar update added the rpc::server::shutdown() method, which only isolates the server from the network but lets all internal handler callbacks continue running until stop() is called. This patch makes use of it in the messaging service, by calling this new shiny shutdown() in its shutdown() and calling the good old stop() in its stop().
This intentionally prevents scylla from freezing on drain in case some RPC handler gets stuck. It may later freeze on stop(), but that's less horrible. Also, chances are that by stop time some other handler's dependencies will have been drained/shut down, so the handler can wake up and stop normally.
Fixes: #14031
Closes #14115
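The two-phase teardown above can be sketched without any real networking. The class and member names here are hypothetical, chosen only to mirror the shutdown()/stop() split described in the commit:

```cpp
#include <cassert>

// Illustrative two-phase teardown: shutdown() only isolates the server from
// the network (no new work is accepted), while stop() can complete only once
// the in-flight handlers have drained.
class rpc_server_like {
    bool _accepting = true;
    int _in_flight = 0;
public:
    // A new connection arrives; rejected after shutdown().
    bool accept() {
        if (_accepting) {
            ++_in_flight;
        }
        return _accepting;
    }
    void handler_done() { --_in_flight; }  // an in-flight handler finishes
    void shutdown() { _accepting = false; } // phase 1: isolate from network
    bool stop() const { return _in_flight == 0; } // phase 2: done when drained
};
```

The drain path calls shutdown() early, so a stuck handler no longer blocks it; only the final stop() still waits for the handler.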
* github.com:scylladb/scylladb:
messaging_service: Shutdown rpc server on shutdown
messaging_service: Generalize stop_servers()
messaging_service: Restore indentation after previous patch
messaging_service: Coroutinize stop()
messaging_service: Coroutinize stop_servers()
this series replaces hard-coded values with variables. we will need to expand this test to cover more test cases when working on tiered-storage.
Closes #14137
* github.com:scylladb/scylladb:
s3/test: use variable for inserted data
s3/test: replace test_ks and test_cf with variables
s3/test: introduce format_tuples() for formatting CQL queries
instead of hardwiring the dataset in test, let's define them with
variables and use the variables instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in order to make the data set for testing more visible, format_tuples() is
introduced for formatting a dict into a set of structured values
consumable by CQL.
this function is added to test/cql-pytest/util.py in the hope that it
can be reused by other tests using CQL.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In main.cc, we have early commands which want to run prior to initializing
Seastar.
Currently, perf_fast_forward breaks this, since it defines
"app_template app" as a global variable.
To avoid that, we should defer running app_template's constructor to
scylla_fast_forward_main().
Fixes #13945
Closes #14026
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.
With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.
Use the same timeout configuration as other queries in this module do.
Fixes #13545.
Closes #14134
Adds preemption points used in Alternator when:
- sending bigger json response
- building results for BatchGetItem
I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:
auto start = std::chrono::steady_clock::now();
do { } while ((std::chrono::steady_clock::now() - start) < 100ms);
and seeing reactor stall times. After the patch they were not
increasing, while before they kept building up due to the lack of preemption.
Refs #7926
Fixes #13689
Closes #12351
* github.com:scylladb/scylladb:
alternator: remove redundant flush call in make_streamed
utils: yield when streaming json in print()
alternator: yield during BatchGetItem operation
Summary of the patch set:
- eliminates unneeded calls to rjson::find (~1% tps improvement in `perf-simple-query --write`)
- adds some very specific tests in this area (more general cases were covered already)
- fixes some minor validation bug
Fixes https://github.com/scylladb/scylladb/issues/13251
Closes #12675
* github.com:scylladb/scylladb:
alternator: fix unused ExpressionAttributeNames validation when used as a part of BatchGetItem
alternator: eliminate duplicated rjson::find() of ExpressionAttributeNames and ExpressionAttributeValues
The DeleteTable operation in Alternator should return a TableDescription
object describing the table which has just been deleted, similar to what
DescribeTable returns.
Fixes scylladb#11472
Closes #11628
* tools/cqlsh 8769c4c2...6e1000f1 (5):
> build: erase uid/gid information from tar archives
> Add github action to update the dockerhub description
> cqlsh: Add extension handler for "scylla_encryption_options"
> requirements.txt: update python-driver==3.26.0
> Add support for arm64 docker image
Closes #13878
we compile .wat files from .rs and .c source files since
6d89d718d9.
these .wat are used by test/cql-pytest/test_wasm.py . let's update
the CMake build system accordingly so these .wat files can also
be generated using the "wasm" target. since the ctest system is
not used, this change should allow us to perform this test manually.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14126
before this change, when consolidating the boost XML logger files,
we just practically concatenated all the tests' logger files into a single
one. sometimes, we run the tests multiple times, and these runs share
the same TestSuite and TestCase tags. this has two consequences:
1. there is a chance that a test has both successful and failed
runs, but jenkins' "Test Results" page cannot identify the failed
run; it just picks a random run when one clicks for the details of
the run, as it takes the TestCase's name as part of its identifier,
and we have multiple of them if the argument passed to the --repeat
option is greater than 1 -- this is the case when we promote the
"next" branch.
2. the testReport page of Jenkins' xUnit plugin created for the "next"
job is 3 times as large as the one for the regular "scylla-ci" run,
as all tests are repeated 3 times. but what we really care about is
the history of a certain test, not a certain run of it.
in this change, we just pick a representative run of a test if it is
repeated multiple times and add a "Message" tag including the
summary of the runs. this should address the problems above:
1. the failed tests always stand out, so we can always pinpoint them with
Jenkins's "Test Results" page.
2. the tests are deduped by their names.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14069
As per our roll out plan, make consistent_cluster_management (aka Raft
for schema changes) the default going forward. It means all
clusters which upgrade from the previous version and don't have
`consistent_cluster_management` explicitly set in scylla.yaml will begin
upgrading to Raft once all nodes in the cluster have moved to the new
version.
Fixes #13980
Closes #13984
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop, call the regular stop() to
finalize the stopping process.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename it to do_with_servers() and make it accept a method to call and a message
to print. This gives the ability to reuse this helper in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't block the thread, as that prevents concurrent tests from running
during this time. Use the dedicated `run_async` instead.
Also to silence `mypy` which complains that `manager.cql` is `Optional`
(so in theory might be `None`, e.g. after `driver_close`), use
`manager.get_cql()`.
Closes #14109
The series includes mostly cleanups and one bug fix.
The fix is for the race where messages that need to access the group0 server arrive
before the server is initialized.
* 'gleb/group0-sp-mm-race-v2' of github.com:scylladb/scylla-dev:
service: raft: fix typo
service: raft: split off setup_group0_if_exist from setup_group0
storage_service: do not allow override_decommission flag if consistent cluster management is enabled
storage_service: fix indentation after the previous patch
storage_service: co-routinize storage_service::join_cluster() function
storage_service: do not reload topology from peers table if topology over raft is enabled
storage_service: optimize debug logging code in case debug log is not enabled
In production environments the Scylla boot procedure includes various
sleeps such as 'ring delay' and 'waiting for gossip to settle'. We
disable those sleeps in test.py tests and we'd also like to disable
them, if possible, in dtests.
Unfortunately, disabling the sleeps causes problems with schema: a
bootstrapping node creates its own versions of distributed keyspaces and
tables (such as `system_distributed`) because it doesn't first wait for
gossip to settle, during which it would usually pull existing schemas of
those keyspaces/tables from existing nodes. This may cause schema
disagreement for the whole duration of the bootstrap procedure (the
other nodes don't pull schema from a bootstrapping node; pulls are only
allowed once it becomes NORMAL), which causes the bootstrapping node to
constantly pull schema in attempts to synchronize, which doesn't work
because it's the other nodes which don't have schema mutations, not this
node. Even when the bootstrapping node finishes, the existing nodes
won't automatically pull schema from that node - only once we perform
another schema change a pull will be triggered.
The continuous pulls and the lack of schema synchronization until manual
schema change cause problems in tests. For example we observed the test
timing out in debug mode because bootstrap took too long due to the node
having to perform ~700 schema pulls (it attempts to synchronize schema
on each range repair). There's also potential for permanent schema
divergence, although I haven't seen this yet - in my experiments, once
the existing nodes pull from the new node, schema would always converge.
In any case, the safe and robust solution is to ensure that the
bootstrapping node pulls schema from existing nodes early in the boot
procedure. Then it won't try to create its own versions of the
distributed keyspaces/tables because it'll see they are already present
in the cluster.
In fact there already is `storage_service::wait_for_ring_to_settle`
which is supposed to wait until schema is in agreement before
proceeding.
However, this schema agreement wait relied on an earlier wait at the
beginning of the function - for a node to show up in gossiper
(otherwise, if we're the only node in gossiper, the schema agreement
wait trivially finishes immediately).
Unfortunately, this wait would time out after `ring_delay` and proceed,
even if no other node was observed, instead of throwing an error...
To make it safe, modify the logic so if we timeout, we refuse to
bootstrap. To make it work in tests which set `ring_delay` to 0, make it
independent of `ring_delay` - just set the timeout to 5 minutes.
Fixes #14065
Fixes #14073
Closes #14105
Secondary index creation is asynchronous, meaning it
takes time for existing data to be reflected within
the index. However, new data added after the
index is created should appear in it immediately.
The test consisted of two parts. The first created
a series of indexes for one table, added
test data to the table, and then ran a series of checks.
In the second part, several new indexes were added to
the same table, and checks were made to make sure that
already existing data would appear in them. This
last part was flaky.
The patch just moves the index creation statements
from the second part to the first.
Fixes: #14076
Closes #14090
The function storage_service::wait_for_ring_to_settle() is called when
bootstrapping a new node in an existing cluster, and it's supposed to
wait until the caller has the right schema - to allow the bootstrap
to start (the bootstrap needs to copy all existing tables from other
nodes).
The code of this function mostly checks in-memory structures in the
gossiper and migration manager, and if they aren't ready, sleeps and
tries again (until a timeout of "ring_delay_ms"). Today we sleep a
whole second between each try, but that's excessive - the checks are
very cheap, and we can do them much more often, so we can stop the
loop much closer to when the schema becomes available.
This patch changes the sleep from 1 second to 10 milliseconds.
The benefit of this patch is not huge - on average I measured about
0.25 seconds saving on adding a node to a cluster. But I don't see
any downside either.
Noticed while looking into Refs #14073
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14101
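The polling loop described above can be sketched as a small helper. This is a hypothetical, single-threaded stand-in for the storage_service logic, not the actual code (which runs on Seastar futures rather than std::thread sleeps):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll a cheap readiness check every 10ms (instead of every second), up to
// the given deadline. Returns true as soon as the condition holds.
bool wait_for(std::function<bool()> ready, std::chrono::milliseconds timeout) {
    using clock = std::chrono::steady_clock;
    auto deadline = clock::now() + timeout;
    while (clock::now() < deadline) {
        if (ready()) {
            return true;
        }
        // The checks are very cheap, so a short sleep lets us stop the loop
        // much closer to the moment the schema becomes available.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return ready();  // one last check at the deadline
}
```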
read_mutation_from_flat_mutation_reader might throw
so we need to close the reader returned from
ms.make_fragment_v1_stream also on the error
path to avoid the internal error abort when the
reader is destroyed while still open.
Fixes #14098
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14099
In shard compaction tasks, per-table tasks will be created all at once
and then they will wait for their turn to run.
A function that allows waking up tasks one after another and a function
that makes a task wait for its turn are added.
Currently setup_group0 is responsible for starting the existing group0 on restart,
or creating a new one and joining the cluster with it during bootstrap. We
want to create the server for the existing group0 earlier, before we start
to accept messages, because some messages may assume that the server
already exists. For that we split the creation of the existing group0 server into
a separate function and call it on restart before the messaging service
starts accepting messages.
Fixes: #13887
After c7826aa910, sstable runs are cleaned up together.
The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.
Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.
Therefore cleanup is affected by the 100% space overhead.
To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.
New unit test reproduces the failure, and passes with the fix.
Fixes #14035.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #14038
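The shape of the fix can be sketched with a mutable input set. The names below (`cleanup_job`, `on_sstable_cleaned`) are hypothetical, and sstables are reduced to strings for illustration:

```cpp
#include <cassert>
#include <set>
#include <string>

// Sketch: the retryable cleanup job keeps a mutable input set and erases each
// sstable once incremental compaction has exhausted it, so the reference is
// dropped (no more 100% space overhead) and a retry on failure only
// re-processes the sstables that were not yet cleaned up.
struct cleanup_job {
    std::set<std::string> input;  // sstables still to be cleaned

    void on_sstable_cleaned(const std::string& sst) {
        input.erase(sst);  // released: no longer pins disk space
    }

    // On failure, the retry operates on whatever is left in `input`.
    const std::set<std::string>& remaining() const { return input; }
};
```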
The command prints the segment_manager address, because it's the manager
that is of interest, not the db::commitlog itself. Also, it prints out all
found segments, just for convenience -- segments are in a vector of
shared pointers and it's handy to have the object addresses instantly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #14088
By default `idl-compiler.py` emits code to pass parameters by value. There was an attribute `[[ref]]`, which makes it use `const&`, but it was not used systematically and in many cases parameters were redundantly copied. In this PR, all `verb` directives have been reviewed and the `[[ref]]` attribute has been added where it makes sense.
The parameters [are serialised synchronously](https://github.com/scylladb/seastar/blob/master/include/seastar/rpc/rpc_impl.hh#L471) so there should be no lifetime issues. This was not the case before, but the behaviour changed in [this commit](3942546d41). Now it's not a problem to get an object by reference when using `send_` methods.
Fixes: #12504
Closes #14003
* github.com:scylladb/scylladb:
tracing::trace_info: pass by ref
storage_proxy: pass inet_address_vector_replica_set by ref
raft: add [[ref]] attribute
repair: add [[ref]] attribute
forward_request: add [[ref]] attribute
storage_proxy: paxos:: add [[ref]] attribute
storage_proxy: read_XXX:: make read_command [[ref]]
storage_proxy: hint_mutation:: make frozen_mutation [[ref]]
storage_proxy: mutation:: make frozen_mutation [[ref]]
occasionally, we are observing build failures like:
```
17:20:54 FAILED: build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz
17:20:54 dist/debuginfo/scripts/create-relocatable-package.py --mode release 'build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz'
17:20:54 Traceback (most recent call last):
17:20:54 File "/jenkins/workspace/scylla-master/scylla-ci/scylla/dist/debuginfo/scripts/create-relocatable-package.py", line 60, in <module>
17:20:54 os.makedirs(f'build/{SCYLLA_DIR}')
17:20:54 File "<frozen os>", line 225, in makedirs
17:20:54 FileExistsError: [Errno 17] File exists: 'build/scylla-debuginfo-package'
```
to understand the root cause better, instead of swallowing the error,
let's raise the exception if it is not caused by a non-existing directory.
a similar change was applied to scripts/create-relocatable-package.py
in a0b8aa9b13, which was correct per se.
but the original intention was to understand the root cause of the
failure when packaging scylla-debuginfo-*.tar.gz, which is created
by the dist/debuginfo/scripts/create-relocatable-package.py.
so, in this change, the change is ported to this script.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14082
In compaction logs, a table is identified by {keyspace}.{table_id}.
Instead, the table name should be used in run_on_existing_tables
logs. To do so, the task manager's compaction tasks use table_info
instead of table_id.
The keyspace argument is copied into run_on_existing_tables
to ensure it stays alive.
Closes #13816
* github.com:scylladb/scylladb:
compaction: use table_info in compaction tasks
api: move table_info to schema/schema_fwd.hh
CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports
an issue on how the default argument is evaluated. this problem is
more obvious when it comes to how `std::source_location::current()`
is evaluated as a default argument. but not all compilers have the
same behavior, see https://godbolt.org/z/PK865KdG4.
notably, clang-15 evaluates the default argument at the callee
site. so we need to check the capability of compiler and fall back
to the one defined by util/source_location-compat.hh if the compiler
suffers from CWG 2631. and clang-16 implemented CWG2631 in
https://reviews.llvm.org/D136554. But unfortunately, this change
was not backported to clang-15.
before switching over to clang-16, to use std::source_location::current()
as the default parameter and get the behavior defined by CWG 2631,
we have to use the compatibility layer provided by Seastar. otherwise
we always end up having the source_location at the callee side, which
is not interesting under most circumstances.
so in this change, all places using the idiom of passing
std::source_location::current() as the default parameter are changed
to use seastar::compat::source_location::current(). although
we have `#include "seastarx.h"` for opening the seastar namespace,
to disambiguate from the "namespace compat" defined somewhere in scylladb,
the fully qualified name of
`seastar::compat::source_location::current()` is used.
see also 09a3c63345, where we used
std::source_location as an alias of std::experimental::source_location
if it was available. but this does not apply to the settings of our
current toolchain, where we have GCC-12 and Clang-15.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14086
When run like 'open-coredump.sh --help' the options parsing loop doesn't
run because $# == 1 and [ $# -gt 1 ] evaluates to false.
The simplest fix is to parse -h|--help on its own, as the options parsing
loop assumes that there's a core-file argument present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14075
read_command, partition_key and paxos::proposal
are marked with [[ref]]. partition_key contains
dynamic allocations and can be big. proposal
contains frozen_mutation, so it also
contains dynamic allocations.
The call sites are fine, they already pass
by reference.
We had redundant copies at the call sites of
these methods. Class read_command does not
contain dynamic allocations, but it's quite
big by itself (368 bytes).
We had a redundant copy in the receive_mutation_handler
forward_fn callback. This frozen_mutation is
dynamically allocated and can be arbitrarily large.
Fixes: #12504
Task manager compaction tasks need table names for logs.
Thus, compaction tasks store table infos instead of table ids.
get_table_ids function is deleted as it isn't used anywhere.
because we build the stripped and non-stripped packages in parallel
using ninja, there are chances that the non-stripped build job could
be adding the build/node_exporter directory to the tarball while the job
building the stripped package is using objcopy to extract the symbols from
the build/node_exporter/node_exporter executable. objcopy creates
temporary files when processing the executables, and these temporary
files can be spotted by the non-stripped build job. there are two
consequences:
1. the non-stripped build job includes the temporary files in its tarball,
even though they are not supposed to be distributed
2. the non-stripped build job fails to include the temporary file(s), as
they are removed after objcopy finishes its job. but the job did spot
them when preparing the tarball, so when the tarfile python module
tries to include the previously found temporary file(s), it throws.
neither of these consequences is expected. fortunately, this only
happens when packaging the non-stripped package: when packaging the
stripped package, the build/node_exporter directory is not in flux
anymore, as ninja ensures the dependencies between the jobs.
so, in this change, we do not add the whole directory when packaging
the non-stripped version, as all its ingredients have been added
separately as regular files. when packaging the stripped version,
we still use the existing step, as we don't have to list all the
files created by strip.sh:
node_exporter{,.debug,.dynsyms,.funcsyms,.keep_symbols,.minidebug.xz}
we could do so in this script, but the repetition is unnecessary and
error-prone. so, let's keep including the whole directory recursively,
so all the debug symbols are included.
Fixes#14079
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose:
to make it possible to cancel a handler for a view update write,
which means we stop waiting for a response to the write, timing out
the handler immediately. This was done to solve an issue with node
shutdown hanging because it was waiting for a view update to finish;
view updates were configured with a 5 minute timeout. See #3966, #4028.
Now we're having a similar problem with hint updates causing shutdown
to hang in tests (#8079).
`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy`
that a node is down.
To make it possible to reuse this algorithm for other handlers, move
the functionality into `abstract_write_response_handler`. We inherit
from `bi::list_base_hook`, which introduces a small memory overhead
(2 pointers) to each write handler; it was previously only paid by
view update handlers. But those handlers are already quite large, so
the overhead is small compared to their size.
Use this new functionality to also cancel hint write handlers when we
shutdown. This fixes#8079.
Closes#14047
* github.com:scylladb/scylladb:
test: reproducer for hints manager shutdown hang
test: pylib: ScyllaCluster: generalize config type for `server_add`
test: pylib: scylla_cluster: add explicit timeout for graceful server stop
service: storage_proxy: make hint write handlers cancellable
service: storage_proxy: rename `view_update_handlers_list`
service: storage_proxy: make it possible to cancel all write handler types
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probability, which blocks other work.
Fixes#13919Reopens#13747
When scylla starts it may go to sleep along the way before the "serving"
message appears. If SIGINT is sent at that time the whole thing unrolls
and the main code ends up catching the sleep_aborted exception, printing
the error in logs and exiting with a non-zero code. However, that's not an
error; the start was just interrupted earlier than expected by
the stop_signal thing.
fixes: #12898
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14034
Currently, partitioned_sstable_set::insert may erase a sstable
from the set inadvertently, if an exception is thrown while
(re-)inserting it.
To prevent that, simply return early after detecting that
the insertion didn't take place, based on the unordered_set::insert
result.
This issue is theoretical, as there are no known cases
of re-inserting sstables into the partitioned sstable set.
Fixes#14060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14061
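The shape of the fix can be sketched with plain standard containers (hypothetical simplified types, not the actual `partitioned_sstable_set` code): consult the `bool` in `unordered_set::insert`'s result and return early, so any cleanup path only ever erases an element that this very call inserted:

```cpp
#include <cassert>
#include <unordered_set>

// returns true only if this call actually inserted the element; when it
// returns false the set was left untouched, so no error-handling path
// may erase the pre-existing element
bool insert_once(std::unordered_set<int>& s, int v) {
    auto [it, inserted] = s.insert(v);
    if (!inserted) {
        return false;  // already present: do nothing, and never erase it
    }
    // ... (re-)insertion work into secondary structures would go here;
    // if it throws, erasing *it is safe, because we know this call is
    // the one that created the entry
    return true;
}
```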
Task manager's tasks that have a parent task inherit the sequence number
from their parents. Thus they do not need to have a new sequence number
generated, as it will be overwritten anyway.
Closes#14045
libxcrypt is used by auth subsystem, for instance, `crypt_r()` provided
by this library is used by passwords.cc. so let's link against it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14030
The _update_timer callback calls adjust(), which
depends on _current_backlog, and currently _current_backlog is
destroyed before _update_timer.
This is benign since there are no preemption points in
the destructor, but it's more correct and elegant
to destroy the timer first, before other members it depends on.
Fixes#14056
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14057
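The rule being relied on can be demonstrated in isolation (a sketch with tracing stand-ins, not the actual backlog/timer types): C++ destroys non-static data members in reverse declaration order, so declaring the timer last guarantees it is destroyed first:

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> destroyed;  // records destruction order

struct tracer {
    std::string name;
    ~tracer() { destroyed.push_back(name); }
};

// members are destroyed in reverse declaration order, so the member
// whose callback depends on the others (the timer) is declared last
struct controller {
    tracer backlog{"backlog"};
    tracer timer{"timer"};  // declared last => destroyed first
};
```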
occasionally, we are observing build failures like:
```
17:20:54 FAILED: build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz
17:20:54 dist/debuginfo/scripts/create-relocatable-package.py --mode release 'build/release/dist/tar/scylla-debuginfo-5.4.0~dev-0.20230522.5b2687e11800.x86_64.tar.gz'
17:20:54 Traceback (most recent call last):
17:20:54 File "/jenkins/workspace/scylla-master/scylla-ci/scylla/dist/debuginfo/scripts/create-relocatable-package.py", line 60, in <module>
17:20:54 os.makedirs(f'build/{SCYLLA_DIR}')
17:20:54 File "<frozen os>", line 225, in makedirs
17:20:54 FileExistsError: [Errno 17] File exists: 'build/scylla-debuginfo-package'
```
to understand the root cause better, instead of swallowing the error,
let's re-raise the exception if it is not caused by a non-existing directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13978
This series encapsulates ks/cf directory creation and deletion into keyspace and table class methods. This is needed to facilitate making the storage initialization storage-type aware in the future. It also makes the replica/ code less involved in formatting sstables' directory paths by hand.
refs: #13020
refs: #12707Closes#14048
* github.com:scylladb/scylladb:
keyspace: Introduce init_storage()
keyspace: Remove column_family_directory()
table: Introduce destroy_storage()
table: Simplify init_storage()
table: Coroutinize init_storage()
table: Relocate ks.make_directory_for_column_family()
distributed_loader: Use cf.dir() instead of ks.column_family_directory()
test: Don't create directory for system tables in cql_test_env
if we just want to build a single test and scylla executables, we
might want to use `configure.py` like:
./configure.py --mode debug --compiler clang++ --with scylla --with test/boost/database_test
which generates `build.ninja` for us, with the following rules:
build $builddir/debug/test/boost/database_test_g: link.debug ... | $builddir/debug/seastar/libseastar.so
$builddir/debug/seastar/libseastar_testing.so
libs = $seastar_libs_debug $libs -lthrift -lboost_system $seastar_testing_libs_debug
libs = $seastar_libs_debug
but the last line prevents database_test_g from linking against
the third-party libraries like libabsl, which could have been
pulled in by $libs. the second assignment expression just
makes the value of `libs` identical to that of `seastar_libs_debug`,
but that library does not include the libraries which are only
used by scylla. so we could run into a link failure with the
`build.ninja` generated with this command line, like:
```
FAILED: build/debug/test/boost/database_test_g
...
ld.lld: error: undefined symbol: seastar::testing::entry_point(int, char**)
>>> referenced by scylla_test_case.hh:22 (./test/lib/scylla_test_case.hh:22)
>>> build/debug/test/boost/database_test.o:(main)
...
ld.lld: error: undefined symbol: boost::unit_test::unit_test_log_t::set_checkpoint(boost::unit_test::basic_cstring<char const>, unsigned long, boost::unit_tes
t::basic_cstring<char const>)
>>> referenced by database_test.cc:298 (test/boost/database_test.cc:298)
>>> build/debug/test/boost/database_test.o:(require_exist(seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool))
...
```
with this change, the extra assignment expression is dropped. this
should not cause any regression, as f'$seastar_libs_{mode}' has
been included as a part of `local_libs` before the grand if-then-else
block in the for loop before this `f.write()` statement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14041
This reverts commit 3c54d5ec5e.
The reverted change fixed the FTBFS of the test in question with Clang 16,
which rightly stopped converting the LHS of `"hello" == sstring{"hello"}` to
a type acceptable to the member operator even though we have a
constructor for this conversion, like
class sstring {
public:
sstring(const char*);
bool operator==(const sstring&) const;
bool operator!=(const sstring&) const;
};
because we have an operator!=, as per the draft of C++ standard
https://eel.is/c++draft/over.match.oper#4 :
> A non-template function or function template F named operator==
> is a rewrite target with first operand o unless a search for the
> name operator!= in the scope S from the instantiation context of
> the operator expression finds a function or function template
> that would correspond ([basic.scope.scope]) to F if its name were
> operator==, where S is the scope of the class type of o if F is a
> class member, and the namespace scope of which F is a member
> otherwise.
in 397f4b51c3, the seastar submodule was
updated; in it, we now have a dedicated overload for the `const char*`
case, so the compiler is now able to compile expressions like
`"hello" == sstring{"hello"}` in C++20.
so, in this change, the workaround is reverted.
Closes#14040
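The effect can be illustrated with a self-contained stand-in (a hypothetical `tiny_string`, not seastar's actual `sstring`): the converting constructor lets a `const char*` operand be compared via the member `operator==`, and in C++20 the reversed form `"hello" == tiny_string{"hello"}` would also compile through rewritten candidates, unless a user-declared `operator!=` suppresses the rewrite as the quoted [over.match.oper] wording describes:

```cpp
#include <cassert>
#include <string>

// stand-in for the sstring case discussed above (hypothetical type)
struct tiny_string {
    std::string value;
    tiny_string(const char* s) : value(s) {}  // converting constructor
    bool operator==(const tiny_string& other) const {
        return value == other.value;
    }
    // note: declaring a matching operator!= here would, per
    // [over.match.oper]/4, stop this operator== from being used as a
    // rewrite target, so `"hello" == tiny_string{"hello"}` (with the
    // const char* on the left) would fail to compile in C++20
};
```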
When erasing an sstable, first check if its run_id
exists in _all_runs, otherwise do nothing in
that respect; then, if the run becomes empty
when erasing its last sstable (and it could have been
a single-sstable run from the get-go), erase the run
from `_all_runs`.
Fixes#14052
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14054
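The control flow reduces to this sketch (plain standard containers standing in for `_all_runs` and the sstable run machinery; all names are hypothetical):

```cpp
#include <cassert>
#include <map>
#include <set>

using run_map = std::map<int, std::set<int>>;  // run_id -> sstables in the run

// erase `sst` from its run; if the run_id is unknown, do nothing, and
// if the run becomes empty (it may have held a single sstable from the
// get-go), drop the run entry itself
void erase_sstable(run_map& all_runs, int run_id, int sst) {
    auto it = all_runs.find(run_id);
    if (it == all_runs.end()) {
        return;  // nothing to do for an unknown run
    }
    it->second.erase(sst);
    if (it->second.empty()) {
        all_runs.erase(it);  // last sstable gone: remove the run too
    }
}
```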
If the binary of a matching unit or boost test is not executable,
warn to the console and skip it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13982
If topology over raft is enabled, the source of truth for the
topology is in the group0 state machine and no other code should
create topology metadata.
If server shutdown hangs, the `manager.server_stop_gracefully` call
would eventually (after 5 minutes) time out with a cryptic
`TimeoutError`; it's a generic timeout for performing requests by the
tests to `ScyllaClusterManager`. It was non-obvious how to find what
actually caused the timeout - you'd have to browse multiple logs.
Introduce an explicit timeout in `ScyllaServer.stop_gracefully`. Set it
to 1 minute. Whether this is a good value may be arguable, but shutdown
taking longer than that probably indicates problems. The important thing
is that this timeout is shorter than the generic request timeout.
If this times out we get a nice error in the test:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/server/1/stop_gracefully, params: None, json: None, body:
E Stopping server ScyllaServer(1, 127.162.40.1, 826d5884-4696-4a22-80a7-cc872aa43102) gracefully took longer than 60s
```
Whether a write handler should be cancellable is now controlled by a
parameter passed to `create_write_response_handler`. We plumb it down
from `send_to_endpoint` which is called by hints manager.
This will cause hint write handlers to time out immediately when we
shut down or when a destination node is marked as dead.
Fixes#8079
The list will be used for non-view-update write handlers as well, so
generalize the name. Also generalize some variable names used in the
implementation.
This commit only renames things + some comments were added,
there are no logical changes.
The `view_update_write_response_handler` class, which is a subclass of
`abstract_write_response_handler`, was created for a single purpose: to
make it possible to cancel a handler for a view update write, which
means we stop waiting for a response to the write, timing out the
handler immediately. This was done to solve an issue with node shutdown
hanging because it was waiting for a view update to finish; view updates
were configured with a 5 minute timeout. See #3966, #4028.
Now we're having a similar problem with hint updates causing shutdown to
hang in tests (#8079).
`view_update_write_response_handler` implements cancelling by adding
itself to an intrusive list which we then iterate over to timeout each
handler when we shutdown or when gossiper notifies `storage_proxy` that
a node is down.
To make it possible to reuse this algorithm for other handlers, move the
functionality into `abstract_write_response_handler`. We inherit from
`bi::list_base_hook`, which introduces a small memory overhead (2
pointers) to each write handler; it was previously only paid by view
update handlers. But those handlers are already quite large, so the
overhead is small compared to their size.
Not all handlers are added to the cancelling list, this is controlled by
the `cancellable` parameter passed to the constructor. For now we're
only cancelling view handlers as before. In following commits we'll also
cancel hint handlers.
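The registration-and-cancel pattern looks roughly like this sketch (a `std::list` of pointers standing in for `boost::intrusive::list` and for the real handler types; `cancellable` mirrors the constructor parameter described above):

```cpp
#include <cassert>
#include <list>

// handlers that opt in register themselves on a shared list; on
// shutdown (or when a node is marked down) we walk the list and time
// every registered handler out immediately
struct write_handler {
    bool timed_out = false;
    std::list<write_handler*>* registry = nullptr;
    std::list<write_handler*>::iterator pos;

    write_handler(std::list<write_handler*>& cancellable_list, bool cancellable) {
        if (cancellable) {
            registry = &cancellable_list;
            pos = registry->insert(registry->end(), this);
        }
    }
    ~write_handler() {
        if (registry) {
            registry->erase(pos);  // unregister on completion
        }
    }
};

void cancel_all(std::list<write_handler*>& cancellable_list) {
    for (auto* h : cancellable_list) {
        h->timed_out = true;  // the real code triggers the timeout path
    }
}
```

An intrusive list avoids the per-node allocation of `std::list`, but the registration/unregistration discipline is the same.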
instead of using BOOST_REQUIRE() use, for instance
BOOST_REQUIRE_NE() and BOOST_REQUIRE_EQUAL() for better
error message when the test fails, as Boost::test would
print out the LHS and RHS of the comparison expression
if it fails.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14050
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/
Reviewed-by: Konstantin Osipov <kostja@scylladb.com>
* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
storage_service: avoid unneeded copies in on_change
storage_service: remove check that is always true
storage_service: rename handle_state_removing to handle_state_removed
storage_service: avoid string copy
storage_service: delete code that handled REMOVING_TOKENS state
gossiper: remove code related to advertising REMOVING_TOKEN state
migration_manager: add wait_for_schema_agreement() function
There's a helper verification_error() that prints a warning and returns
an exceptional future. It is converted into a void one that throws.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to the table class, the keyspace class also needs to create
a directory for itself for some reason. It looks excessive, as table
creation would call recursive_touch_directory() and would create the ks
directory too, but this call is there
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's no longer used outside of make_column_family_config(). So as not to
encourage people to use it, drop it and open-code it into that single
caller
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a table is DROP-ed, the directory with all its sstables is removed
(unless it contains snapshots). Wrap this into a table.destroy_storage()
method; later it will need to become sstable::storage-specific
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's no need to copy the datadirs vector to call parallel_for_each
on it. datadirs[0] is in fact the datadir field.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method, which initializes storage for a table, naturally belongs to
that class, so rename it while moving it. Also, there's no longer a need
to carry the table name and uuid as arguments; being a table method, it
can just get the paths to work on from the config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A BatchGetItem request is a map of table names and 'sub-requests'. ExpressionAttributeNames is defined at the
'sub-request' level, but the code was instead checking the top level, obtaining nullptr every time, which
effectively disabled the unused-names check.
Fixes#13251
rjson::find is not a very cheap function, it involves a bunch of function calls and loop iterations.
Overall it costs 120-170 instructions even for small requests. An example profile of alternator::executor::query
execution shows ~18 rjson::find calls, taking in total around 7% of query internal processing time (note that
JSON parse/print and http handling are not part of this function).
This patch eliminates 2 rjson::find calls for most request types. I saw a 1-2% tps improvement
in `perf-simple-query --write`; although it does improve tps, I suspect the real percentage is smaller and
I don't have much confidence in this particular number, as the observed benchmark variance is too high to measure it reliably.
We've observed sporadic failures of this test in CI related to driver reconnection after server restart.
Fixes#14032Closes#14027
* github.com:scylladb/scylladb:
test: test_tablets.py: Wait for driver to see the hosts after restart
test: test_tablets.py: Pass server id to server_restart()
test: test_tablets.py: Add missing await on server_restart()
Apparently, the driver may still be establishing connections in the
background after connecting to the cluster, and queries may fail with:
cassandra.cluster.NoHostAvailable
Replace reconnection with wait_for_cql_and_get_hosts(), which ensures
that the driver sees the host.
The current S3 uploading sink has an implicit limit on the final file size that comes from two places. First, the S3 protocol declares that upload part numbers range from 1 to 10000 (inclusive). Second, the uploading sink sends out parts once they grow above the S3 minimal part size, which is 5MB. Since sstables put data in 128kB (or smaller) portions, parts are almost exactly 5MB in size, so the total upload size cannot grow above ~50GB. That's too low.
To break the limit, the new sink (called the jumbo sink) uses the UploadPartCopy S3 call, which helps splice several objects into one right on the server. The jumbo sink starts uploading parts into an intermediate temporary object called a piece, named ${original_object}_${piece_number}. When the number of parts in the current piece grows above the configured limit, the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background; meanwhile a new piece is created and subsequent data is put into it. When the sink is flushed, the current piece is flushed as is and also squashed into the object.
The new jumbo sink is capable of uploading ~500TB of data, which looks like enough.
fixes: #13019Closes#13577
* github.com:scylladb/scylladb:
sstables: Switch data and index sink to use jumbo uploader
s3/test: Tune-up multipart upload test alignment
s3/test: Add jumbo upload test
s3/client: Wait for background upload fiber on close-abort
c3/client: Implement jumbo upload sink
s3/client: Move memory buffers to upload_sink from base
s3/client: Move last part upload out of finalize_upload()
s3/client: Merge do_flush() with upload_part()
s3/client: Rename upload_sink -> upload_sink_base
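The limits quoted above can be checked with quick arithmetic, under the simplifying assumption that a piece may also be assembled from the maximum number of parts (the configured per-piece limit is tunable in the real code, and other S3 per-part caps are ignored here):

```cpp
#include <cassert>

// S3 constants quoted in the text
constexpr long long part_size = 5LL * 1000 * 1000;  // minimal part size, ~5MB
constexpr long long max_parts = 10000;              // part numbers 1..10000

// plain multipart upload: every part is ~5MB, so the object tops out
// around 50GB
constexpr long long plain_limit = part_size * max_parts;

// jumbo sink: each of the object's parts is itself a piece spliced via
// UploadPartCopy, which (at the assumed maximum) multiplies the
// capacity to ~500TB
constexpr long long jumbo_limit = plain_limit * max_parts;
```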
This is a small follow-up for [this PR](https://github.com/scylladb/scylladb/pull/13715), it resolves some comments in the initial PR that didn't make their way into it.
* remove `noexcept` from `clear_gently`, since exceptions can be raised from move constructor;
* an optimisation for `vnode_effective_replication_map::get_range_addresses`, avoid redundant binary search.
Closes#14015
* github.com:scylladb/scylladb:
vnode_erm: optimize get_range_addresses
clear_gently: remove noexcept for rvalue references overload
* indent the nested paragraphs of list items
* use a table to format the time sequence for better
readability
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14016
"Enable_repair_based_node_ops" is the name of an option, and the leading
character should be a lowercase "e". so fix it.
Fixes#14017
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14018
After a schema change, memtable and cache have to be upgraded to the new schema. Currently, they are upgraded (on the first access after a schema change) atomically, i.e. all rows of the entry are upgraded in one non-preemptible call. This is one of the last vestiges of the times when partitions were treated atomically, and it is a well-known source of numerous large stalls.
This series makes schema upgrades gentle (preemptible). This is done by co-opting the existing MVCC machinery.
Before the series, all partition_versions in the partition_entry chain have the same schema, and an entry upgrade replaces the entire chain with a single squashed and upgraded version.
After the series, each partition_version has its own schema. A partition entry upgrade happens simply by adding an empty version with the new schema to the head of the chain. Row entries are upgraded to the current schema on-the-fly by the cursor during reads, and by the MVCC version merge ongoing in the background after the upgrade.
The series:
1. Does some code cleanup in the mutation_partition area.
2. Adds a schema field to partition_version and removes it from its containers (partition_snapshot, cache_entry, memtable_entry).
3. Adds upgrading variants of constructors and apply() for `row` and its wrappers.
4. Prepares partition_snapshot_row_cursor, mutation_partition_v2::apply_monotonically and partition_snapshot::merge_partition_versions for dealing with heterogeneous version chains.
5. Modifies partition_entry::upgrade to perform upgrades by extending the version chain with a new schema instead of squashing it to a single upgraded version.
Fixes#2577Closes#13761
* github.com:scylladb/scylladb:
test: mvcc_test: add a test for gentle schema upgrades
partition_version: make partition_entry::upgrade() gentle
partition_version: handle multi-schema snapshots in merge_partition_versions
mutation_partition_v2: handle schema upgrades in apply_monotonically()
partition_version: remove the unused "from" argument in partition_entry::upgrade()
row_cache_test: prepare test_eviction_after_schema_change for gentle schema upgrades
partition_version: handle multi-schema entries in partition_entry::squashed
partition_snapshot_row_cursor: handle multi-schema snapshots
partiton_version: prepare partition_snapshot::squashed() for multi-schema snapshots
partition_version: prepare partition_snapshot::static_row() for multi-schema snapshots
partition_version: add a logalloc::region argument to partition_entry::upgrade()
memtable: propagate the region to memtable_entry::upgrade_schema()
mutation_partition: add an upgrading variant of lazy_row::apply()
mutation_partition: add an upgrading variant of rows_entry::rows_entry
mutation_partition: switch an apply() call to apply_monotonically()
mutation_partition: add an upgrading variant of rows_entry::apply_monotonically()
mutation_fragment: add an upgrading variant of clustering_row::apply()
mutation_partition: add an upgrading variant of row::row
partition_version: remove _schema from partition_entry::operator<<
partition_version: remove the schema argument from partition_entry::read()
memtable: remove _schema from memtable_entry
row_cache: remove _schema from cache_entry
partition_version: remove the _schema field from partition_snapshot
partition_version: add a _schema field to partition_version
mutation_partition: change schema_ptr to schema& in mutation_partition::difference
mutation_partition: change schema_ptr to schema& in mutation_partition constructor
mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor
mutation_partition: add upgrading variants of row::apply()
partition_version: update the comment to apply_to_incomplete()
mutation_partition_v2: clean up variants of apply()
mutation_partition: remove apply_weak()
mutation_partition_v2: remove a misleading comment in apply_monotonically()
row_cache_test: add schema changes to test_concurrent_reads_and_eviction
mutation_partition: fix mixed-schema apply()
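The gentle-upgrade idea above reduces to this sketch (a list of per-version schema ids standing in for the real `partition_version` chain; all names are hypothetical): an upgrade no longer squashes the chain, it just prepends an empty version that carries the new schema:

```cpp
#include <cassert>
#include <forward_list>

// each element stands for a partition_version and holds its schema id
using version_chain = std::forward_list<int>;

// gentle upgrade: O(1) and preemption-friendly, because it only adds an
// empty head version with the new schema; older versions keep theirs
// and are upgraded incrementally later (by cursor reads and by the
// background MVCC version merge)
void upgrade(version_chain& chain, int new_schema_id) {
    chain.push_front(new_schema_id);
}
```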
A CREATE KEYSPACE query which specifies an empty string ('') as the replication factor value is currently allowed:
```cql
CREATE KEYSPACE bad_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': ''};
```
This is wrong, it's invalid to have an empty replication factor string.
It creates a keyspace without any replication, so the tables inside of it aren't writable.
Trying to create a `SimpleStrategy` keyspace with such a replication factor throws an error; `NetworkTopologyStrategy` should do the same.
The problem was in `prepare_options`, which treated an empty replication factor string as no replication factor.
Changing it to `std::optional` fixes the problem:
now `std::nullopt` means no replication factor, and `make_optional("")` means that there is a replication factor, but it's described by an empty string.
Fixes: https://github.com/scylladb/scylladb/issues/13986Closes#13988
* github.com:scylladb/scylladb:
test/network_topology_strategy_test: Test NTS with replication_factor option in test_invalid_dcs
ks_prop_defs: disallow empty replication factor string in NTS
The exception unrecognized_entity_exception used to have two fields:
* entity - the name that wasn't recognized
* relation_str - part of the WHERE clause that contained this entity
In 4e0a089f3e the places that throw
this exception were modified: the thrower started passing the unrecognized
column name to both fields - entity and relation_str. It was easier to
do things this way, as accessing the whole WHERE clause can be problematic.
The problem is that this caused error messages to get weird, e.g.:
"Undefined name x in where clause ('x')".
x is not the WHERE clause, it's the unrecognized name.
Let's remove the `relation_str` field as it isn't used anymore,
it only causes confusion. After this change the message would be:
"Unrecognized name x"
Which makes much more sense.
Refs #10632
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#13944
Found and fixed yet another place where an error message prints a column
name as "bytes" type which causes it to be printed as hexadecimal codes
instead of the actual characters of the name.
The specific error message fixed here is "Cannot use selection function
writeTime on PRIMARY KEY part k" which happens when you try to use
writetime() or ttl() on a key column (which isn't allowed today - see
issue #14019). Before this patch we got "6b" in the error message instead
of "k".
The patch also includes a regression test that verifies that this
error condition is recognized and the real name of the column is
printed. This test fails before this patch, and passes after it.
As usual, the test also passes on Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#14021
The sstable_directory::garbage_collect() scans /var/lib/scylla for
whatever sstable it's called for. S3-backed ones don't have anything
there, so the g.c. run is a no-op. Make this call a lister virtual
method, so that only the filesystem lister does this scan and the
ownership table lister becomes a real no-op. Later it will be filled
with code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable directory calls the lister it passes the _sstable_dir as an
argument. However, the very same _sstable_dir was used to construct the
lister, and by now all the lister implementations keep this value
on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The directory_lister _lister exists as a class member, but is only used
once -- when .process() is called -- and then is closed forever.
It's simpler to keep the lister on the .process() stack.
This change also makes filesystem lister keep the copy of directory as
class member, it will be useful for the next patch as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The filesystem components lister has private wrappers on top of the
directory lister it uses internally. These are leftovers from making the
sstable directory storage-aware; now they can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this string is used as the option description in the command line
help message, so it is a part of the user-facing interface.
in this change, the typo is fixed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14013
After consistent schema changes, remove schema pulls from gossiper
events if Raft is enabled, taking the Raft upgrade state into account.
Only disable the pull if Raft is fully enabled.
Fixes#12870
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13695
In get_range_addresses we are iterating
over vnode tokens; we don't need to do a
binary search for them in tmptr->first_token,
as they can be used directly as keys
for _replication_map.
Improve test/alternator/README.md by adding a better and more
beginner-friendly introduction to how to run the Alternator tests, as well
as a section about the philosophy of the Alternator test suite, and
some guidelines on how to write good tests in that framework.
Much of this text was copied from test/cql-pytest/README.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#13999
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `api::table_info` without the help of `operator<<`.
but the corresponding `operator<<()` is preserved in this change, as we
still have lots of callers relying on this << operator in storage_service.cc,
where std::vector<table_info> is formatted using operator<<(ostream&, const Range&)
defined in to_string.hh. we could have used fmt/ranges.h to print the
std::vector<table_info>. but the combination of operator<<(ostream&, const Range&)
and FMT_DEPRECATED_OSTREAM renders this impossible. because
unlike the builtin range formatter specializations, the fallback formatter
synthesized from the operator<< does not have brackets defined for
the range printer. the brackets are used as the left and right marks
of the range, for instance, the array-alike containers are printed
like [1,2,3], while the tuple-alike containers are printed like
(1,2,3). once we are allowed to remove FMT_DEPRECATED_OSTREAM, we
should be able to use the builtin range formatter, and remove the
operator<< for api::table_info by then.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13975
to reduce the size of the header file, in hope of speeding up compilation,
let's move the implementation of the format() function into the .cc file.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14010
Here's a simple test that can be used to check S3 object read latencies.
To run it, one must export the same variables as for any other S3 unit test:
- S3_SERVER_ADDRESS_FOR_TEST
- S3_SERVER_PORT_FOR_TEST
- S3_PUBLIC_BUCKET_FOR_TEST
and the AWS creds must be provided via the AWS_S3_EXTRA='$key:$secret:$region' env
variable.
Accepted options are:
--duration SEC -- test duration in seconds
--parallel NR -- number of fibers to run in parallel
--object-size BYTES -- object size to use (1MB by default)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13895
The manager is in charge of updating IO bandwidth on the respective prio class. Nowadays it uses the global priority-manager, but the effort to unify sched classes will require it to use a non-global streaming sched group. After this patch the sched class field is unused, but it's a preparation towards the huge (really huge) "switch to seastar API level 7" patch.
ref: #13963
Closes #13997
* github.com:scylladb/scylladb:
stream_manager: Add streaming sched group copy
cql_test_env: Move sched groups initialization up
In commit 0a71151bc4 I wanted to avoid
an incorrect deprecation warning from the Python driver but fixed it
in an incorrect way. I never noticed the fix was incorrect because
the test was already xfailing, and the incorrect fix just made it
fail differently... In this patch I revert that commit.
With this revert, I am *not* bringing back the spurious warning -
the Python driver bug was already fixed in
https://github.com/datastax/python-driver/pull/1103 - so developers
with a fairly recent version will no longer see the spurious warning.
Both old and new drivers will at least do the correct thing, as
it was before that unfortunate commit.
Fixes #8752.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #14002
There are a few places in sstables/ code that require the caller to specify a priority class to pass along to file stream options. All these callers use the default class, so it makes little sense to keep it. This change makes the sched classes unification mega patch a bit smaller.
ref: #13963
Closes #13996
* github.com:scylladb/scylladb:
sstables: Remove default prio class from rewrite_statistics()
sstables: Remove prio class from validate_checksums subs
sstables: Remove always default io-prio from validate_checksums()
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print
- raft::server_address
- raft::config_member
- raft::configuration
without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13976
not all tables in the system keyspace are volatile. among other things, system.sstables and system.tablets are persisted using sstables like regular user tables. so add a dedicated section for them. also, in this change, the raft table is added to the new section.
Closes #13981
* github.com:scylladb/scylladb:
docs/dev/system_keyspace: add raft table
docs/dev/system_keyspace: move sstables and tablets into another section
Several test cases use common operations on files like existence checking, content comparing, etc. with the help of home-brewed local helpers. This set makes use of some existing seastar:: ones and generalizes others into test/lib/. The primary intent here is `57 insertions(+), 135 deletions(-)`.
Closes #13936
* github.com:scylladb/scylladb:
test: Generalize touch_file() into test_utils.*
test/database: Generalize file/dir touch and exists checks
test/sstables: Use seastar::file_exists() to check
test/sstables: Remove sstdesc
test/sstables: Use compare_files from utils/ in sstable_test
test/sstables: Use compare_files() from utils/ in sstable_3_x_test
test/util: Add compare_file() helpers
not all tables in the system keyspace are volatile. among other things,
system.sstables and system.tablets are persisted using sstables like
regular user tables. so move them into the section where we have the
other regular tables.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There are several places in tests that either use default_priority_class() explicitly, or use some specific prio class obtained from the priority manager. There's currently ongoing work to remove all priority classes; this set makes the final patch a bit smaller and easier to review. In particular, in many cases default_priority_class() is implicit and can be omitted by callers. Also, using any specific prio class in a test is excessive; the (implicit) default_priority_class suffices.
ref: #13963
Closes #13991
* github.com:scylladb/scylladb:
test, memtable: Use default prio class
test, memtable: Add default value for make_flush_reader() last arg
test, view_build: Use default prio class
test, sstables: Use implicit default prio class in dma_write()
test, sstables: Use default sstable::get_writer()'s prio class arg
The way index_reader maintains io_priority_class can be relaxed a bit. The main intent is to shorten the #13963 final patch a bit, as a side effect index_reader gets its portion of API polishing.
ref: #13963
Closes #13992
* github.com:scylladb/scylladb:
index_reader: Introduce and use default arguments to constructor
index_reader: Use _pc field in get_file_input_stream_options() directly
index_reader: Move index_reader::get_file_input_stream_options to private: block
The metrics that are being deregistered (in this PR) caused Scylla to crash when a
table was dropped, the corresponding table object in memory was not
yet deallocated, and a new table with the same name was created. This
caused a double-metrics-registration exception to be thrown. In order to
avoid it, we now deregister the table's metrics as soon as the table is
marked to be disposed from the database. The table's representation in memory can
still live on, but it shouldn't forbid another table with the same name from being
created.
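The shape of the bug and the fix can be sketched with a toy registry (the names here are illustrative, not the actual seastar::metrics API):

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>

// Toy stand-in for a metrics registry: registering an already-registered
// name throws, which is what crashed Scylla when a dropped-but-still-alive
// table collided with a newly created table of the same name.
class metric_registry {
    std::unordered_set<std::string> _names;
public:
    void register_metric(const std::string& name) {
        if (!_names.insert(name).second) {
            throw std::runtime_error("double metrics registration: " + name);
        }
    }
    // The fix, in essence: call this as soon as the table is marked for
    // disposal, even though the table object itself may stay in memory
    // for a while longer.
    void deregister_metric(const std::string& name) {
        _names.erase(name);
    }
};
```

With eager deregistration, dropping a table and recreating one under the same name registers cleanly even while the old in-memory object lingers.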
Fixes #13548
Closes #13971
* Adds the new feature tutorial site to the docs
* fixes the unnecessary redirection (iot.scylladb.com)
Closes #13998
* github.com:scylladb/scylladb:
Skip unnecessary redirection
Add links to feature store tutorial
To speed up total test suite run, change configuration to schedule slow
topology tests first.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13948
we changed the type of generation column in system.sstables
from bigint to timeuuid in 74e9e6dd1a
but that change failed to update the document accordingly. so let's
update the document to reflect the change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13994
The manager in question is responsible for keeping the streaming
class IO bandwidth updated. Nowadays it does that via the priority manager's
global streaming IO priority class field, but it will need to switch to
the streaming sched group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The streaming manager will need to keep its copy of
streaming/maintenance group, so groups should be created early.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Close the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle an exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but it is executed once per page, so its cost should be insignificant when spread over an
entire page's worth of work.
Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.
Fixes: #13784
Closes #13790
* github.com:scylladb/scylladb:
test/boost/database_test: add unit test for semaphore mismatch on range scans
partition_slice_builder: add set_specific_ranges()
multishard_mutation_query: make reader_context::lookup_readers() exception safe
multishard_mutation_query: lookup_readers(): make inner lambda a coroutine
The method is called with an explicitly default priority class and puts it
into the fstream options. This whole chain can be avoided.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable.read_checksum() and .read_digest() accept a prio class
argument from validate_checksums(), but it's always the "default" one.
Remove the arg and remove the stream options initializations, as they'll pick
up the default prio class on default construction.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All calls to sstables::validate_checksums() happen with explicitly
default priority class. Just hard-code it as such in the method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most creators of index_reader construct it with the default prio class,
a null trace pointer and use_caching::yes. Assigning implicit defaults to
the constructor arguments keeps the code shorter and easier to read.
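The pattern can be sketched in isolation (names and argument types here are illustrative, not Scylla's actual index_reader signature):

```cpp
#include <string>

// Illustrative sketch: trailing constructor arguments carry the common
// defaults (default prio class, no tracing, caching enabled), so the
// typical call site only passes the file name.
enum class use_caching { no, yes };

struct index_reader {
    std::string file;
    int priority;             // stands in for the default prio class
    const void* trace_state;  // null trace pointer by default
    use_caching caching;

    explicit index_reader(std::string f,
                          int prio = 0,
                          const void* trace = nullptr,
                          use_caching uc = use_caching::yes)
        : file(std::move(f)), priority(prio), trace_state(trace), caching(uc) {}
};
```

A caller that needs tracing or a specific prio class can still pass the extra arguments explicitly; everyone else just writes `index_reader r{"Index.db"};`.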
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
* seastar f94b1bb9cb...aff87d5bb9 (9):
> prometheus.cc: change function in foreach_metrics to use const ref
Fixes #13929
> sstring: add == comparison operators
> Add a simple code example for promise in tutorial.md
> Add example for cpuset docs
> Merge 'httpd: modernize' from Avi Kivity
> Merge 'TLS: support for extracting certificate subject alt names from client certs' from Calle Wilund
> Merge 'Update IO stats label set' from Pavel Emelyanov
> file: s/(void)/()/ in function's parameter list
> scripts: addr2line: allow specifying kallsyms path
Closes #13985
A "while at it" cleanup. When patching the method (next patch) it turned
out that there are no callers other than the local class, so it _is_
private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helps users figure out whether the repair failed because a peer node
was down during repair.
For example:
```
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: Repair
1026 out of 1026 ranges, keyspace=ks2a, table={test_table, tb},
range=(9203128250168517738,+inf), peers={127.0.0.2}, live_peers={},
status=skipped_no_live_peers
INFO [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: stats:
repair_reason=repair, keyspace=ks2a, tables={test_table, tb}, ranges_nr=513,
round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=0,
tx_hashes_nr=0, rx_hashes_nr=0, duration=0 seconds, tx_row_nr=0, rx_row_nr=0,
tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={}, row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s, row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out
of 1026 ranges failed, keyspace=ks2a, tables={test_table, tb},
repair_reason=repair, nodes_down_during_repair={127.0.0.2}
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]:
repair_tracker run failed: std::runtime_error ({shard 0: std::runtime_error
(repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out of 1026 ranges failed,
keyspace=ks2a, tables={test_table, tb}, repair_reason=repair,
nodes_down_during_repair={127.0.0.2})})
```
In addition, change the `status=skipped` to `status=skipped_no_live_peers`
to make it clearer.
Closes #13928
There's a rather boring test_sstable_exists() helper in the test that
can be replaced with a more standard seastar API call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper class is used to transfer the directory name and generation int
value into the compare_sstables() helper. Remove both; the utils/ stuff
is useful enough not to need wrappers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's yet another implementation of read-the-whole-file and
check-file-contents-matches helpers in the test. Replace it with the
utils/ facility. Next patch will be able to wash more stuff out of
this test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a static helper under the same name that can be replaced with
utils/ one. The code here runs in async context to .get0() the result.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Many places call memtable::make_flush_reader() with default priority
class. Make it a default-arg for the method, other reader making methods
of memtable already have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test case tries to be "correct" and calls sst->write_components()
with the streaming priority class. It's a test anyway, no need to be too
diligent here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::get_writer()'s prio class argument has its default value.
No need to pass it explicitly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current documentation of real_dirty_memory_accounter is not
quite compelling enough.
One can find some proper documentation by digging through the
git log and reading the description of the commit which added it,
but it shouldn't be that way.
This patch replaces the current documentation of the class with something
more explanatory.
Closes #13927
As described in https://github.com/scylladb/scylladb/issues/8638,
we're moving away from `SimpleStrategy`, in the future
it will become deprecated.
We should remove all uses of it and replace them
with `NetworkTopologyStrategy`.
This change replaces `SimpleStrategy` with
`NetworkTopologyStrategy` in all unit tests,
or at least in the ones where it was reasonable to do so.
Some of the tests were written explicitly to test the
`SimpleStrategy` strategy, or changing the keyspace from
`SimpleStrategy` to `NetworkTopologyStrategy`.
These tests were left intact.
It's still a feature that is supported,
even if it's slowly getting deprecated.
The typical way to use `NetworkTopologyStrategy` is
to specify a replication factor for each datacenter.
This could be a bit cumbersome: we would have to fetch
the list of datacenters, set the repfactors, etc.
Luckily there is another way - we can just specify
a replication factor to use for each existing
datacenter, like this:
```cql
CREATE KEYSPACE {} WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1};
```
This makes the change rather straightforward - just replace all
instances of `'SimpleStrategy'` with `'NetworkTopologyStrategy'`.
Refs: https://github.com/scylladb/scylladb/issues/8638
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13990
Fixes https://github.com/scylladb/scylladb/issues/13969
This commit enables publishing the docs for branch-5.3
The documentation for version 5.3 will be marked as
unstable until it is released, and an appropriate
warning is in place.
Closes #13977
To speed up the total test suite run, split the raft upgrade tests and schedule the slow tests to run first.
Closes #13951
* github.com:scylladb/scylladb:
test/topology: run first slow raft upgrade tests
test/topology: split raft upgrade tests
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is being upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.
This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.
I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.
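The fallback logic can be sketched roughly like this (names and the plain-string representation are illustrative, not the actual messaging-service code):

```cpp
#include <string>
#include <vector>

// Illustrative sketch: match an incoming isolation cookie against the known
// statement tenants, then the system connection names. An unknown cookie is
// assumed to be a statement tenant this node doesn't know yet (e.g. an
// enterprise service level during a mixed-version upgrade), so it falls back
// to the default statement tenant instead of the default/system group.
std::string scheduling_group_for(const std::string& cookie,
                                 const std::vector<std::string>& known_tenants,
                                 const std::vector<std::string>& system_groups) {
    for (const auto& t : known_tenants) {
        if (t == cookie) {
            return t;
        }
    }
    for (const auto& s : system_groups) {
        if (s == cookie) {
            return s;
        }
    }
    // Unknown cookie: treat it as an unknown statement tenant.
    return "statement";  // stands in for the default statement tenant
}
```

The key change relative to the old behaviour is only the final return: previously an unknown cookie mapped to the default scheduling group rather than the statement tenant.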
Fixes: #13841
Fixes: #12552
Closes #13843
* github.com:scylladb/scylladb:
message: match unknown tenants to the default tenant
message: generalize per-tenant connection types
This refactoring is a follow-up to https://github.com/scylladb/scylladb/pull/13376: move per-keyspace data structures related to topology changes from `token_metadata` to `erm`.
We move `pending_endpoints` and `read_endpoints`, along with their computation logic, from `token_metadata` to `vnode_effective_replication_map`. The `vnode_effective_replication_map` seems more appropriate for them since it contains functionally similar `replication_map` and we will be able to reuse `pending_endpoints/read_endpoints` across keyspaces sharing the same `factory_key`.
At present, `pending_endpoints` and `read_endpoints` are updated in the `update_pending_ranges` function. The update logic comprises two parts - preparing data common to all keyspaces/replication_strategies, and calculating the `migration_info` for specific keyspaces. In this PR we introduce a new `topology_change_info` structure to hold the first part's data and create an `update_topology_change_info` function to update it. This structure will be used in `vnode_effective_replication_map` to compute `pending_endpoints` and `read_endpoints`. This enables the reuse of `topology_change_info` across all keyspaces, unlike the current `update_pending_ranges` implementation, which is another benefit of this refactoring.
The PR also optimises `replication_map` memory usage for the case `natural_endpoints_depend_on_token == false`: we store the endpoints list only once with a special key
instead of duplicating it for each `vnode` token.
The original `update_pending_ranges` remains unchanged during the PR commits, and will be removed entirely upon transitioning to the new implementation.
Closes #13715
* github.com:scylladb/scylladb:
token_metadata_test: add a test for everywhere strategy
token_metadata_test: check read_endpoints when bootstrapping first node
token_metadata_test: refactor tests, extract create_erm
token_metadata: drop has_pending_ranges and migration_info
effective_replication_map: add has_pending_ranges
token_metadata: drop update_pending_ranges
effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading
token_metadata_test.cc: create token_metadata and replication_strategy as shared pointers
vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading
calculate_effective_replication_map: compute pending_endpoints and read_endpoints
vnode_erm: optimize replication_map
vnode_erm::get_range_addresses: use sorted_tokens
abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token
sequenced_set: add extract_vector method
effective_replication_map: clone_endpoints_gently -> clone_data_gently
vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints
stall_free.hh: add clear_gently for rvalues
stall_free.hh: relax Container requirement
token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map
token_metadata: introduce topology_change_info
token_metadata: replace set_topology_transition_state with set_read_new
test_invalid_dcs is a test which has a list of incorrect replication factor
values, and tries to create keyspaces with these incorrect values.
The standard way of creating a NetworkTopologyStrategy keyspace
is to specify the replication factor for each specific datacenter,
but there's also a simpler way - a user can just write: 'replication_factor': X
to convey that all of the current datacenters should have replication_factor X.
This way of creating a NetworkTopologyStrategy wasn't tested by test_invalid_dcs,
let's add it to the test to improve coverage.
Refs: https://github.com/scylladb/scylladb/issues/13986
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A CREATE KEYSPACE query which specifies an empty string ('')
as the replication factor value is currently allowed:
```cql
CREATE KEYSPACE bad_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': ''};
```
This is wrong, it's invalid to have an empty replication factor string.
It creates a keyspace without any replication, so the tables inside of it aren't writable.
Trying to create a `SimpleStrategy` keyspace with such a replication factor throws an error;
`NetworkTopologyStrategy` should do the same.
The problem was in `prepare_options`: it treated an empty replication factor string
as no replication factor. Changing it to `std::optional` fixes the problem.
Now `std::nullopt` means no replication factor, and `make_optional("")` means
that there is a replication factor, but it's described by an empty string.
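The distinction the fix relies on can be shown with a minimal sketch (the function and its default are hypothetical, simplified from the real `prepare_options` logic):

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Sketch: std::optional distinguishes "replication_factor absent from the
// options map" (nullopt) from "replication_factor present but empty"
// (make_optional("")), which must now be rejected.
int parse_replication_factor(const std::optional<std::string>& rf) {
    if (!rf) {
        return 1;  // hypothetical default when no RF is specified at all
    }
    if (rf->empty()) {
        throw std::invalid_argument("invalid replication factor: ''");
    }
    return std::stoi(*rf);
}
```

With a plain string, both cases would collapse into "", which is exactly how the empty-string keyspace slipped through before.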
Fixes: https://github.com/scylladb/scylladb/issues/13986
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
If the executable of a matching unit or boost test is not present, warn
to console and skip.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13949
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.
It also works when you change the number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.
This PR implements the API on top of raft topology, it was originally
implemented using gossiper. It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.
Closes #13683
* github.com:scylladb/scylladb:
docs: update topology-over-raft.md
test: topology_experimental_raft: test `check_and_repair_cdc` API
raft topology: implement `check_and_repair_cdc_streams` API
raft topology: implement global request handling
raft topology: introduce `prepare_new_cdc_generation_data`
raft_topology: `get_node_to_work_on_opt`: return guard if no node found
raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
raft topology: separate `publish_cdc_generation` state
raft topology: non-node-specific `exec_global_command`
raft topology: introduce `start_operation()`
raft topology: non-node-specific `topology_mutation_builder`
topology_state_machine: introduce `global_topology_request`
topology_state_machine: use `uint16_t` for `enum_class`es
raft topology: make `new_cdc_generation_data_uuid` topology-global
0d4ffe1d69 introduced a regression where
it used the sha1 of the local "master" branch instead of the remote's
"master" branch in the title of the commit message.
in this change, let's use the origin/${branch}'s sha1 in the title.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13974
Fixes https://github.com/scylladb/scylladb/issues/13808
This commit moves cloud instance recommendations from the Requirements page to a new dedicated page.
The content of subsections is simply copy-pasted, but I added the introduction and metadata for better
searchability.
Closes #13935
* github.com:scylladb/scylladb:
doc: add a cloud instance recommendations page
Fixes sporadic failures to execute INSERT which follows the restart:
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.194.238.2:9042 datacenter1>: ConnectionShutdown('Connection to 127.194.238.2:9042 is closed')})
There could be system.tablet mutations in the schema commit log. We
need to see them before loading sstables of user tables because we
need sharding information.
this is part of a series migrating from `operator<<(ostream&, ..)`-based
formatting to fmtlib-based formatting. the goal here is to enable
fmtlib to print `counter_shard_view` and `counter_cell_view` without the
help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13967
We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
The function storage_service::update_pending_ranges is
turned into update_topology_changes_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
We already use the new pending_endpoints from erm through
the get_pending_ranges virtual function, in this commit
we update all the remaining places to use the new
implementation in erm, as well as remove the old implementation
in token_metadata.
We want to switch token_metadata_test to the new
implementation of pending_endpoints and read_endpoints in erm.
To do this, it is convenient to have token_metadata and
replication_strategy as shared pointers, as it fits better with the signature
of calculate_effective_replication_map. In this commit we don't
change the logic of the tests, we just migrate them to use pointers.
In this commit we introduce functions to erm for accessing
pending_endpoints and read_endpoints similar to the
corresponding functions in token_metadata. The only
difference - we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.
The get_pending_endpoints already existed. For tablets it
remained unchanged, while for vnode we just changed
it from calling on token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.
For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.
In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
We optimise memory usage of replication_map by
storing endpoints list only once in case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with
special key default_replication_map_key.
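The lookup side of this optimisation can be sketched with a simplified map (types and the sentinel value are illustrative, not Scylla's actual definitions):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Simplified sketch: when natural endpoints do not depend on the token,
// the endpoint list is stored exactly once under a sentinel key instead
// of being duplicated for every vnode token.
using endpoint_list = std::vector<std::string>;

constexpr long default_replication_map_key = -1;  // hypothetical sentinel

const endpoint_list& lookup(const std::unordered_map<long, endpoint_list>& rm,
                            long token, bool depends_on_token) {
    // Token-dependent strategies index by token; token-independent ones
    // (e.g. local/everywhere) always hit the single sentinel entry.
    return depends_on_token ? rm.at(token) : rm.at(default_replication_map_key);
}
```

For a token-independent strategy the map holds one entry regardless of how many vnode tokens exist, which is where the memory saving comes from.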
We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them, as empty sorted_tokens() is not a problem: we
store endpoints with a special key.
The function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by value,
but for for_each_natural_endpoint_until a reference is sufficient.
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.
It's possible that the callers of this function
are ok with a single range in case of
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going in that direction.
We need to account for the new fields in the clone implementation.
The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
Refactor ~vnode_effective_replication_map, use
our new clear_gently overload for rvalue references.
Add new fields _pending_endpoints and _read_endpoints
to the call.
vnode_effective_replication_map::clear_gently is removed as
it was not used.
We don't use the return value of erase, so
we can allow it to return anything. We'll
need this for ring_mapping, since
boost::icl::interval_map::erase(it)
returns void.
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains functionally
similar _replication_map and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.
At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data and create an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.
The update_topology_change_info implementation is mostly derived from
update_pending_ranges, there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.
The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Timeout if there is no progress within 5 minutes
to prevent hanging due to a view building bug.
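The wait shape described above can be sketched with plain threads (a simplified deadline rather than the real no-progress tracking, and not seastar code):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Sketch: poll until done() becomes true, sleeping `interval` between
// attempts, and give up if `timeout` elapses first. The real cleanup code
// retries once a second with a 5-minute timeout.
bool wait_until(const std::function<bool()>& done,
                std::chrono::milliseconds interval,
                std::chrono::milliseconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // gave up: no success within the timeout
        }
        std::this_thread::sleep_for(interval);
    }
    return true;
}
```

In the actual fix the timeout is reset whenever progress is made, so only a complete stall (e.g. view building stuck) trips it.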
Fixes #9559
Closes #13812
* github.com:scylladb/scylladb:
table: signal compaction_manager when staging sstables become eligible for cleanup
compaction_manager: perform_cleanup: wait until all candidates are cleaned up
compaction_manager: perform_cleanup: perform_offstrategy if needed
compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
sstable_set: add for_each_sstable_gently* helpers
Fixes https://github.com/scylladb/scylladb/issues/13808
This commit moves cloud instance recommendations from
the Requirements page to a new dedicated page.
The content of subsections is simply copy-pasted, but
I added the introduction and metadata for better
searchability.
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13514
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13931
Without the feature, the system schema doesn't have the table, and the
read will fail with:
Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)
We should not attempt to read tablet metadata if the experimental
feature is not enabled.
Fixes #13946
Closes #13947
Currently temporary directories with incomplete sstables and the pending deletion log are processed by the distributed loader on start. That's not nice, because for s3-backed sstables this code makes no sense (and is currently a no-op because of the incomplete implementation). This garbage collecting should be kept in sstable_directory, where it can off-load this work onto the lister component, which is storage-aware.
Once the g.c. code is moved, it allows us to clean the sstable class's list of static helpers a bit.
refs: #13024
refs: #13020
refs: #12707
Closes #13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbage collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.
`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.
`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.
The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.
It's cumbersome and people forget about it,
so let's change the default to `user`.
There are issues about expressions being printed
in a `strange` way; this fixes them.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #13916
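The two modes described above can be sketched as a default parameter on the formatting entry point (names such as `print_mode` and `column_ref` are hypothetical, chosen only to mirror the commit's example):

```cpp
#include <cassert>
#include <string>

// Sketch of the two printing modes: `user` output is plain CQL that can
// round-trip through the parser, `debug` output adds internal detail.
// Defaulting to `user` means user-facing messages (e.g. the text of an
// invalid_request_exception) are safe even when the caller forgets to
// pick a mode explicitly.
enum class print_mode { user, debug };

struct column_ref {
    std::string name;
    std::string to_string(print_mode mode = print_mode::user) const {
        if (mode == print_mode::debug) {
            return "unresolved_identifier(" + name + ")";
        }
        return name;  // plain CQL, can be parsed back
    }
};
```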
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.
Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.
This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).
Closes #13933
This implicit link is pretty bad, because the feature service is a low-level
one which lots of other services depend on. System keyspace is the opposite
-- a high-level one that needs e.g. the query processor and database to
operate. This inverse dependency is created by the feature service's need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).
The dependency can be cut. The only place where enabled features are
committed is when the gossiper enables features on join or upon receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.
Although the gossiper already has a system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.
Fixes: #13837
Closes #13172
* github.com:scylladb/scylladb:
gossiper: Remove features and sysks from gossiper
system_keyspace: De-static save_local_supported_features()
system_keyspace: De-static load_|save_local_enabled_features()
system_keyspace: Move enable_features_on_startup to feature_service (cont)
system_keyspace: Move enable_features_on_startup to feature_service
feature_service: Open-code persist_enabled_feature_info() into enabler
gms: Move feature enabler to feature_service.cc
gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
gms: Persist features explicitly in features enabler
feature_service: Make persist_enabled_feature_info() return a future
system_keyspace: De-static load_peer_features()
gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
gossiper: Enable features and register enabler from outside
gms: Add feature_service and system_keyspace to feature_enabler
Some state that is used to fill in the 'peers' table is still propagated
over gossiper. When moving a node into the normal state in the raft
topology code, use the data from the gossiper to populate the peers table,
because storage_service::on_change() will not do it if the node was not in
the normal state at the time it was called.
Fixes: #13911
Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone functions in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.
This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.
This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which became a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.
Refs: #11870
Closes #13869
* github.com:scylladb/scylladb:
db/system_keyspace: remove dependency on storage_proxy
db/system_keyspace: replace storage_proxy::query*() with replica:: equivalent
replica: add query.hh
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfies the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tends to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But in the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13908
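The restored ordering boils down to running the cheap grace-period check first and deferring the expensive max-purgeable lookup. A standalone sketch of that check order (all names illustrative; the real code works on sstable bloom filters, not an injected callback):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Sketch of the restored check order: the cheap gc_before comparison runs
// first, and the expensive max-purgeable computation (bloom-filter checks
// over every uncompacting sstable) is deferred until a tombstone is
// actually old enough to be garbage-collectable.
bool can_purge(std::int64_t tombstone_ts,
               std::int64_t deletion_time,
               std::int64_t gc_before,
               const std::function<std::int64_t()>& get_max_purgeable) {
    if (deletion_time >= gc_before) {
        return false;  // still within grace period: skip the expensive path
    }
    return tombstone_ts < get_max_purgeable();
}
```

In a tombstone-heavy compaction where most tombstones are newer than the grace period, the expensive callback is never invoked, which is exactly what removes the 7x regression described above.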
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13853
There are several places that need to carry a pointer to a table that's shard-wide accessible -- database snapshot and truncate code and distributed loader. The database code uses `get_table_on_all_shards()` returning a vector of foreign lw-pointers, the loader code uses its own global_column_family_ptr class.
This PR generalizes both into global_table_ptr facility.
Closes #13909
* github.com:scylladb/scylladb:
replica: Use global_table_ptr in distributed loader
replica: Make global_table_ptr a class
replica: Add type alias for vector of foreign lw-pointers
replica: Put get_table_on_all_shards() to header
replica: Rewrite get_table_on_all_shards()
instead of encoding the fact that we are using generation identifier
as a hint where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. this helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13846
`ep` is std::move'ed to get_endpoint_state_for_endpoint_ptr
but it's used later for logger.warn()
Fixes #13921
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13920
The loader has very similar global_column_family_ptr class for its
distributed loadings. Now it can use the "standard" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now all users of global_table know it's a vector and reference its
elements with this_shard_id() index. Making the global_table_ptr a class
makes it possible to stop using operator[] and "index" this_shard_id()
in its -> and * operators.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Use sharded<database>::invoke_on_all() instead of the open-coded equivalent.
Also don't access database's _column_families directly, use the
find_column_family() method instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The IS NOT NULL restrictions are handled in a special way.
Instead of putting them together with other restrictions,
statement_restrictions collects all columns restricted
by IS NOT NULL and puts them in the _not_null_columns field.
Add a getter to access this set of columns.
The field is private, so it can't be accessed without
a function that explicitly exposes it.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The IS NOT NULL restriction is currently supported
only in CREATE MATERIALIZED VIEW statements.
These restrictions work correctly for columns
that are part of the view's primary key,
but they're silently ignored on other columns.
The following commits will forbid placing
the IS NOT NULL restriction on columns
that aren't a part of the view's primary key.
The tests have to be modified in order
to pass, because some of them have
a useless IS NOT NULL restriction
on regular columns that don't belong
to the view's primary key.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The tempdir for filesystem-based sstables is the {generation}.sstable one.
There are two places that need to know the ".sstable" extension -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself, with
the path certainly being the pending-delete dir. So the method can be made
private and the is_pending_delete_dir() can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When non-static, the call can use the sstable_directory::_sstable_dir path,
not the provided argument. The main benefit is that the method can later
be moved onto lister so that filesystem and ownership-table listers can
process dangling bits differently.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Check if the deletion log exists in the handling helper, not outside of
it. This makes next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.
First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert the sstring dir into an fs::path dir and switch the logger.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the loader starts it first scans the directory for sstables'
tempdirs and pending deletion logs. Put both into one call so that it
can be moved more easily later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13915
This commit fixes broken links to the Enterprise docs.
They are links to the enterprise branch, which is not
published. The links to the Enterprise docs should include
"stable" instead of the branch name.
This commit must be backported to branch-5.2, because
the broken links are present in the published 5.2 docs.
Closes #13917
perform_cleanup may be waiting for those sstables
to become eligible for cleanup, so signal it
when table::move_sstables_from_staging detects an
sstable that requires cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.
Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.
Time out if there is no progress within 5 minutes
to prevent hanging due to a view building bug.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
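The retry policy described above (retry once a second, time out only after a fixed window with no progress) can be sketched synchronously by injecting the sleep and the work, so the shape is testable without a reactor. All names are illustrative:

```cpp
#include <cassert>
#include <functional>

// Sketch of the cleanup retry policy: keep retrying once per "tick"
// (one second in the real code) while progress is being made, and give
// up only after max_ticks_without_progress ticks with no progress
// (5 minutes = 300 ticks). remaining_after_pass runs one cleanup pass
// and returns how many sstables still need cleanup.
bool retry_until_done(const std::function<int()>& remaining_after_pass,
                      const std::function<void()>& sleep_1s,
                      int max_ticks_without_progress) {
    int last_remaining = remaining_after_pass();
    int idle_ticks = 0;
    while (last_remaining > 0) {
        if (idle_ticks >= max_ticks_without_progress) {
            return false;  // no progress within the timeout: give up
        }
        sleep_1s();
        int remaining = remaining_after_pass();
        if (remaining < last_remaining) {
            idle_ticks = 0;  // any progress resets the timeout window
        } else {
            ++idle_ticks;
        }
        last_remaining = remaining;
    }
    return true;
}
```

Resetting the idle counter on progress is what makes this a "no progress within 5 minutes" timeout rather than a hard 5-minute deadline.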
It is possible that cleanup will be executed
right after repair-based node operations,
in which case we have a 5-minute timer
before off-strategy compaction is started.
After marking the sstables that need cleanup,
perform offstrategy compaction, if needed.
This will implicitly cleanup those sstables
as part of offstrategy compaction, before
they are even passed for view update (if the table
has views/secondary index).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scan all sstables to determine which of them
requires cleanup before calling perform_task_on_all_files.
This allows for cheaper no-op return when
no sstable was identified as requiring cleanup,
and also it will allow triggering offstrategy
compaction if needed, after selecting the sstables
for cleanup, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently callers of `for_each_sstable` need to
use a seastar thread to allow preemption
in the for_each_sstable loop.
Provide for_each_sstable_gently and
for_each_sstable_gently_until to make using this
facility from a coroutine easier, without requiring
a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
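The `for_each_sstable_gently_until` shape can be sketched standalone: visit each element, allow the callback to stop the loop early, and yield between elements. Here the yield point is an injected callback rather than a real `co_await`; the names mirror the commit but the code is an illustration:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the *_gently_until helper shape: visit each element, let the
// predicate stop the loop early, and yield between elements so a
// coroutine-based caller stays preemptible without a seastar::thread.
// Returns true if the loop was stopped early.
template <typename T>
bool for_each_gently_until(const std::vector<T>& items,
                           const std::function<bool(const T&)>& stop_pred,
                           const std::function<void()>& yield) {
    for (const auto& item : items) {
        if (stop_pred(item)) {
            return true;   // stopped early
        }
        yield();           // preemption point between elements
    }
    return false;          // visited everything
}
```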
string_format_test was added in 1b5d5205c8,
so let's add it to CMake building system as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13912
with off-strategy, input list size can be close to 1k, which will
lead to unneeded reallocations when formatting the list for
logging.
in the past, we faced stalls in this area, and excessive reallocation
(log2 ~1k = ~10) may have contributed to that.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13907
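The fix's idea is a one-liner: estimate the final string size and reserve it up front, so joining ~1k names costs one allocation instead of ~log2(n) growth reallocations. A standalone sketch (the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: when joining ~1k sstable names into one log line, growing a
// string from empty costs ~log2(n) reallocations. Estimating the final
// size and reserving once avoids them.
std::string join_for_log(const std::vector<std::string>& names) {
    std::size_t total = 0;
    for (const auto& n : names) {
        total += n.size() + 2;  // element + ", " separator (slight over-estimate)
    }
    std::string out;
    out.reserve(total);         // single allocation up front
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (i) {
            out += ", ";
        }
        out += names[i];
    }
    return out;
}
```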
Currently we may start to receive requests before group0 is configured
during boot. If that happens those requests may try to pull schema and
issue raft read barrier which will crash the system because group0 is
not yet available. Work around it by pretending that raft is disabled in
this case and using the non-raft procedure. The proper fix should make sure
that storage proxy verbs are registered only after group0 is fully
functional.
Message-Id: <ZGOZkXC/MsiWtNGu@scylladb.com>
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.
In such case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.
This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.
Fixes #12462
Closes #13906
CI once failed due to mc being unable to configure the minio server. There are currently no clues as to why it could happen, so let's increase minio.py's verbosity a bit.
refs: #13896
Closes #13901
* github.com:scylladb/scylladb:
test,minio: Run mc with --debug option
test,minio: Log mc operations to log file
Currently, when a user creates a function or a keyspace, no
permissions on functions are updated.
Instead, the user should gain all permissions on the function
that they created, or on all functions in the keyspace they have
created. This is also the behavior in Cassandra.
However, if the user is granted permissions on a function after
performing a CREATE OR REPLACE statement, they may
actually only alter the function but still gain permissions to it
as a result of the approach above, which requires another
workaround added to this series.
Lastly, as of right now, when a user is altering a function, they
need both CREATE and ALTER permissions, which is incompatible
with Cassandra - instead, only the ALTER permission should be
required.
This series fixes the mentioned issues, and the tests are already
present in the auth_roles_test dtest.
Fixes #13747
Closes #13814
* github.com:scylladb/scylladb:
cql: adjust tests to the updated permissions on functions
cql: fix authorization when altering a function
cql: grant permissions on functions when creating a keyspace/function
cql: pass a reference to query processor in grant_permissions_to_creator
test_permissions: make tests pass on cassandra
This series introduces a new gossiper method: get_endpoints that returns a vector of endpoints (by value) based on the endpoint state map.
get_endpoints is used here by gossiper and storage_service for iterations that may preempt,
instead of iterating directly over the endpoint state map (`_endpoint_state_map` in gossiper or via `get_endpoint_states()`), so as to prevent a use-after-free that may potentially happen if the map is rehashed while the function yields, invalidating the loop iterators.
Fixes #13899
Closes #13900
* github.com:scylladb/scylladb:
storage_service: do not preempt while traversing endpoint_state_map
gossiper: do not preempt while traversing endpoint_state_map
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since [gms: heart_beat_state: use generation_type and version_type](4cdad8bc8b)
(merged in [Merge 'gms: define and use generation and version types'...](7f04d8231d))
Implement min/max correctly.
Fixes #13801
Closes #13880
* github.com:scylladb/scylladb:
storage_service: handle_state_normal: on_internal_error on "owns no tokens"
utils: tagged_integer: implement std::numeric_limits::{min,max}
test: add tagged_integer_test
The previous version of wasmtime had a vulnerability that possibly
allowed causing undefined behavior when calling UDFs.
We're directly updating to wasmtime 8.0.1, because the update only
requires a slight code modification and the Wasm UDF feature is
still experimental. As a result, we'll benefit from a number of
new optimizations.
Fixes #13807
Closes #13804
The auto& db = proxy.local().get_db() is called a few lines above this
patch, so the &db can be reused for the invoke_on_all() call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13896
These two can grow large. The non-jumbo sink is effectively limited to
10000 parts; since each is ~5Mb, the maximum uploadable data/index
happens to be 50Gb, which is too small.
Other components shouldn't grow that big and continue using simple and a
bit faster uploading sink.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the test uses a sequence of 1024-byte buffers. This lets the
minio server actively de-duplicate those blocks by page boundary (it's a
guess, but it's plausible because minio reports back equivalent ETags
for lots of the uploaded parts). Make the buffer size not a power of two
so that when squashed together the resulting 2^X buffers don't come out equal.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It re-uses most of the existing upload sink test, but configures the
jumbo sink with at most 3 parts in each intermediate object, so as not to
upload a 50Gb part before switching to the next one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When uploading a part (and a piece) there can be one or more background
fibers handling the upload. In case the client needs to abort the operation
it calls .close() without flush()ing. In this case the S3 API Abort is
made and the sink can be terminated. It's expected that background
fibers would resolve on their own eventually, but it's not quite the
case.
First, they hold units for the semaphore and the semaphore should be
alive by the time units are returned.
Second, the PUT (or copy) request can finish successfully and it may be
sitting in the reactor queue waiting for its continuation to get
scheduled. The continuation references the sink via a "this" capture to put
the part etag.
Finally, in case of piece uploading the copy fiber needs _client at the
end to issue delete-object API call dropping the no longer needed part.
That said -- background fibers must be waited upon on .close() if the
closing is aborting (if it's a successful close, then the fibers must
have been picked up by the final flush() call).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sink is also in charge of uploading large objects in parts, but this
time each part is put with the help of upload-part-copy API call, not
the regular upload-part one.
To make it work the new sink inherits from the uploading base class, but
instead of keeping memory_data_sink_buffers with parts it keeps a sink
to upload a temporary intermediate object with parts. When the object is
"full", i.e. the number of parts in it hits the limit, the object is
flushed, then copied into the target object with the S3 API call, and
then the intermediate object is deleted.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the buffer manipulations now happen in the upload_sink class and
the respective member can be removed from the base class. The base class
only messes with the buffers in its upload_part() call, but that's
unavoidable, as uploading a part implies sending its contents, which sit
in the buffers.
Now the base class can be re-used for uploading parts with the help of
copy-part API call (next patches)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change has two reasons. The first is to facilitate moving the
memory_data_sink_buffers from base class, i.e. -- continuation of the
previous patch. Also this fixes a corner case -- if final sink flush
happens right after the previous part was sent for uploading, the
finalization doesn't happen and sink closing aborts the upload even if
it was successful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The do_flush() helper is practically useless because what it does can be
done by the upload_part() itself. This merge also facilitates moving the
memory_data_sink_buffers from the base class to the uploader class in the next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Another sink will appear that implements multipart upload
with the help of the copy-part functionality. The current uploading code is
going to be partially re-used, so this patch moves all of it into the
base class in advance. Next patches will pick needed parts.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently everything minio.py does goes to test.py log, while mc (and
minio) output go to another log file. That's inconvenient, better to
keep minio.py's messages in minio log file.
Also, while at it, print a message if the local alias drop fails (it's a
benign failure, but it's good to have the note anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this option allows the user to use a specified linker instead of the
default one. this is more flexible than adding more linker
candidates to the known linkers.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13874
For unknown reasons, clang 16 rejects equality comparison
(operator==) where the left-hand-side is an std::string and the
right-hand-side is an sstring. gcc and older clang versions first
convert the left-hand-side to an sstring and then call the symmetric
equality operator.
I was able to hack sstring to support this asymmetric comparison,
but the solution is quite convoluted, and it may be that it's clang
at fault here. So instead this patch eliminates the three cases where
it happened. With it applied, we can build with clang 16.
Closes #13893
in this change, the type of the "generation" field of "sstable" in the
return value of RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".
this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.
this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData. and "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by the nodetool, which prints out returned SSTable info list with
a pretty formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
the nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, nor does it enforce the type -- it just
tries its best to pretty-print them as a table.
But the fields in CompositeData are typed; when the scylla JMX exporter
translates the returned SSTables from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So, we should be consistent on the type of "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
the dependencies on exactly the same version when shipping deb and rpm
packages, we should be safe when it comes to interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13834
Instead of simply throwing an exception. With just the exception, it is
impossible to find out what went wrong, as this API is very generic and
is used in a variety of places. The backtrace printed by
`on_internal_error()` will help zero in on the problem.
Fixes: #13876
Closes #13883
Although this condition should not happen,
we suspect that certain timing conditions might
lead to this state where the node in handle_state_normal
(possibly during shutdown) has no tokens.
Currently we call on_internal_error_noexcept, so
if abort_on_internal_error is false, we will just
print an error and continue on with handle_state_normal.
Change that to `on_internal_error` so as to throw an
exception in production in this unexpected state.
Refs #13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/13857
This commit adds the OS support for ScyllaDB Enterprise 2023.1.
The support is the same as for ScyllaDB Open Source 5.2, on which
2023.1 is based.
After this commit is merged, it must be backported to branch-5.2.
In this way, it will be merged to branch-2023.1 and available in
the docs for Enterprise 2023.1.
Closes: #13858
Separate cluster_size into a cluster section and specify this value as
initial_size.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#13440
Add a respective unit test.
It turns out that numeric_limits defines an implicit implementation
for std::numeric_limits<utils::tagged_integer<Tag, ValueType>>
which apparently returns a default-constructed tagged_integer
for min() and max(), and this broke
`gms::heart_beat_state::force_highest_possible_version_unsafe()`
since 4cdad8bc8b
(merged in 7f04d8231d)
Implement min/max correctly.
Fixes #13801
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable,
as after setting executor's default timeout right before creating
sharded executor instances, they never get updated with this option
anymore. but many users would like to set the driver timers based on
server timers. we need to enable them to configure timeout even
when the server is still running.
in this change,
* `alternator_timeout_in_ms` is marked as live-updateable
* `executor::_s_default_timeout` is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. and
it is now an updateable_value, so its variable name is updated
accordingly. this value is set in the ctor of executor, and
it is disconnected from the corresponding named_value<> option
in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so `executor::_timeout_in_ms` can
be initialized on a per-shard basis
* `executor::set_default_timeout()` is dropped, as we already pass
the option to executor in its ctor.
Fixes #12232
Closes #13300
* github.com:scylladb/scylladb:
alternator: split the param list of executor ctor into multi lines
alternator,config: make alternator_timeout_in_ms live-updateable
in this series, instead of hardwiring to an integer, we switch to the generation generator for creating new generations. this should help us migrate to a generation identifier which can also be represented by a UUID, and can potentially improve the testing coverage once we switch over to the UUID-based generation identifier. we will need to parameterize these tests by then, for sure.
Closes #13863
* github.com:scylladb/scylladb:
test: sstable: use generator to generate generations
test: sstable: pass generation_type in helper functions
test: sstable: use generator to generate generations
It is useful to know which node has the error. For example, when a node
has a corrupted sstable, with this patch, repair master node can tell
which node has the corrupted sstable.
```
WARN 2023-05-15 10:54:50,213 [shard 0] repair -
repair[2df49b2c-219d-411d-87c6-2eae7073ba61]: get_combined_row_hash: got
error from node=127.0.0.2, keyspace=ks2a, table=tb,
range=(8992118519279586742,9031388867920791714],
error=seastar::rpc::remote_verb_error (some error)
```
Fixes #13881
Closes #13882
Currently, error injections can be enabled either through HTTP or CQL.
While these mechanisms are effective for injecting errors after a node
has already started, they can't be reliably used to trigger failures
shortly after node start. In order to support this use case, this commit
adds the possibility to enable some error injections via config.
A configuration option `error_injections_at_startup` is added. This
option uses our existing configuration framework, so it is possible to
supply it either via CLI or in the YAML configuration file.
- When passed on the command line, the option is parsed as a
semicolon-separated list of error injection names that should be
enabled. Those error injections are enabled in non-oneshot mode.
The CLI option is marked as not used in release mode and does not
appear in the option list.
Example:
--error-injections-at-startup failure_point1;failure_point2
- When provided in YAML config, the option is parsed as a list of items.
Each item is either a string or a map of parameters. This method is
more flexible as it allows providing parameters for each injection
point. At this time, the only benefit is that it allows enabling
points in oneshot mode, but more parameters can be added in the future
if needed.
Explanatory example:
error_injections_at_startup:
- failure_point1 # enabled in non-oneshot mode
- name: failure_point2 # enabled in oneshot mode
one_shot: true # due to one_shot optional parameter
The primary goal of this feature is to facilitate testing of raft-based
cluster features. An error injection will be used to enable an
additional feature to simulate node upgrade.
Tests: manual
Closes #13861
Constructors of the trace_state class initialize most of the fields in the constructor body with the help of a non-inline helper method. It's possible, and better, to initialize as much as possible with initializer lists.
Closes #13871
* github.com:scylladb/scylladb:
tracing: List-initialize trace_state::_records
tracing: List-initialize trace_state::_props
tracing: List-initialize trace_state::_slow_query_threshold
tracing: Reorder trace_state fields initialization
tracing: Remove init_session_records()
tracing: List-initialize one_session_records::ttl
tracing: List-initialize one_session_records
tracing: List-initialize session_record
There are two layers of sstables deletion -- delete-atomically and wipe. The former is in fact the "API" method; it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner, so that a subsequent boot would pick up the dangling parts and proceed. The latter is a low-level removal function that can also fail in the middle, but handling that is not its concern.
Currently the atomic deletion is implemented with the help of the sstable_directory::delete_atomically() method that commits sstable file names into a deletion log, then calls wipe (indirectly), then drops the deletion log. On boot, all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though a deletion log is overkill for S3 -- it's better implemented with the help of the ownership table. In fact, S3 storage already implements atomic deletion in its wipe method, thus being overly careful.
So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic
fixes: #13016
note: Replaying sstables deletion from the ownership table on boot is not here, see #13024
Closes #13562
* github.com:scylladb/scylladb:
sstables: Implement atomic deleter for s3 storage
sstables: Get atomic deleter from underlying storage
sstables: Move delete_atomically to manager and rename
Add a basic test for tagged_integer arithmetic operations.
Remove const qualifier from `tagged_integer::operator[+-]=`
as these are add/sub-assign operators that need to modify
the value in place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similarly to how we handle Roles and Tables, we do not
allow permissions on non-existent objects, so the CREATE
permission on a specific function is meaningless, because
for the permission to be granted to someone, the function
must be already created.
This patch removes the CREATE permission from the set of
permissions applicable to a specific function.
Fixes #13822
Closes #13824
This is a translation of Cassandra's CQL unit test source file
validation/entities/UFTypesTest.java into our cql-pytest framework.
There are 7 tests, which reproduce one known bug:
Refs #13746: UDF can only be used in SELECT, and abort when used in WHERE, or in INSERT/UPDATE/DELETE commands
And uncovered two previously unknown bugs:
Refs #13855: UDF with a non-frozen collection parameter cannot be called on a frozen value
Refs #13860: A non-frozen collection returned by a UDF cannot be used as a frozen one
Additionally, we encountered an issue that can be treated as either a bug or a hole in documentation:
Refs #13866: Argument and return types in UDFs can be frozen
Closes #13867
Adding new APIs /column_family/tombstone_gc and /storage_service/tombstone_gc, that will allow for disabling tombstone garbage collection (GC) in compaction.
Mimics the existing APIs /column_family/autocompaction and /storage_service/autocompaction.
The column_family variant must specify a single table only, following the existing convention,
whereas the storage_service one can specify an entire keyspace, or a subset of tables in a keyspace.
column_family API usage
-----
```
The table name must be in keyspace:name format
Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"
```
storage_service API usage
-----
```
Tables can be specified using a comma-separated list.
Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"
Enable GC on a subset of tables
curl -s -X POST
"http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"
```
Closes #13793
* github.com:scylladb/scylladb:
test: Test new API for disabling tombstone GC
test: rest_api: extract common testing code into generic functions
Add API to disable tombstone GC in compaction
api: storage_service: restore indentation
api: storage_service: extract code to set attribute for a set of tables
tests: Test new option for disabling tombstone GC in compaction
compaction_strategy: bypass tombstone compaction if tombstone GC is disabled
table: Allow tombstone GC in compaction to be disabled on user request
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This series changes the code to issue a raft read
barrier before the pull, which guarantees that the keyspace is created
before the actual schema pull is performed.
database_test is failing sporadically and the cause was traced back
to commit e3e7c3c7e5.
The commit forces a subset of tests in database_test, to run once
for each of predefined x_log2_compaction_group settings.
That causes two problems:
1) test becomes 240% slower in dev mode.
2) queries on system.auth are timing out, and the reason is a small
table being spread across hundreds of compaction groups in each
shard. so to satisfy a range scan, there will be multiple hops,
making the overhead huge. additionally, the compaction group
aware sstable set is not merged yet. so even point queries will
unnecessarily scan through all the groups.
Fixes#13660.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #13851
This PR contains some small improvements to the safety of consuming/releasing resources to/from the semaphore:
* reader_permit: make the low-level `consume()/signal()` API private, making the only user (an RAII class) friend.
* reader_resources: split `reset()` into `noexcept` and potentially throwing variant.
* reader_resources::reset_to(): try harder to avoid calling `consume()` (when the new resource amount is smaller than the previous one)
Closes #13678
* github.com:scylladb/scylladb:
reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
reader_permit: split resource_units::reset()
reader_permit: make consume()/signal() API private
It consists of two parts -- a call to do_read_simple() with a lambda, and handling of its results. The PR coroutinizes it in two steps for review simplicity -- first the lambda, then the outer caller. Then it restores indentation.
Closes #13862
* github.com:scylladb/scylladb:
sstables: Restore indentation after previous patches
sstables: Coroutinize read_toc() outer part
sstables: Coroutinize read_toc() inner part
Currently s3::client is created for each sstable::storage. It's later shared between sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle making a new standalone client. Coupled with the seastar's http client spawning connections on demand, this makes it impossible to control the amount of opened connections to object storage server.
In order to put some policy on top of that (as well as apply workload prioritization) s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint on each shard and then share it between all the sstables, files and sinks.
There's one difficulty however, solving which is most of what this PR does. The file handle, used to transfer an sstable's file across shards, should carry everything it needs to re-create the file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager, and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owned by the database independently on each shard.
1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep a sharded peering service referenced by the sstables_manager, initialized in main and passed through the database constructor down to the sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow
This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager gets the object_storage_config updating facilities from the database (which is overloaded even without it already). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections it also needs a patch to seastar's http::client; there's a PR already doing that, and once (if) it's merged there'll come one more fix on top.
refs: #13458
refs: #13369
refs: scylladb/seastar#1652
Closes #13859
* github.com:scylladb/scylladb:
s3: Pick client from manager via handle
s3: Generalize s3 file handle
s3: Live-update clients' configs
sstables: Keep clients shared across sstables
storage_manager: Rewrap config map
sstables, database: Move object storage config maintenance onto storage_manager
sstables: Introduce sharded<storage_manager>
The existing storage::wipe() method of s3 is in fact an atomic deleter --
it commits "deleting" status into ownership table, deletes the objects
from server, then removes the entry from ownership table. So the atomic
deleter does the same and the .wipe() just removes the objects, because
it's not supposed to be atomic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
While the driver isn't known without the sstable itself, we have a
vector of them and can get it from the front element. This is not very
generic, but fortunately all sstables here belong to the same table and,
respectively, to the same storage and even prefix. The latter is also
assert-checked by the sstable_directory atomic deleter code.
For now S3 storage returns the same directory-based deleter, but next
patch will change that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to let manager decide which storage driver to call for atomic
sstables deletion in the next patch. While at it -- rename the
sstable_directory's method into something more descriptive (to make
compiler catch all callers of it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This field needs to call trace_state::ttl_by_type() which, in turn,
looks into _props. The latter should have been initialized already.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It takes props from constructor args and tunes them according to the
constructing "flavor" -- primary or secondary state. Adding two static
helpers code-documents the intent and makes list-initialization possible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction strategies know how to pick files that are most likely to
satisfy tombstone purge conditions (i.e. not shadowing data in files
that are not being compacted).
This logic can be bypassed if tombstone GC was disabled by user,
as it's a waste of effort to proceed with it until re-enabled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If tombstone GC was disabled, compaction will ensure that fully expired
sstables won't be bypassed and that no expired tombstones will be
purged. Changing the value takes immediate effect even on ongoing
compactions.
Not wired into an API yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The instance ptr and props have to be set up early, because other
members' initialization depends on them. It's currently OK, because
other members are initialized in the constructor body, but moving them
into the initializer list would require correct ordering.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now does nothing but wrap the make_lw_shared<one_session_records>()
call. Callers can do it on their own, thus facilitating further
list-initialization patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For that to happen the value evaluation is moved from the
init_session_records() into a private trace_state helper as it checks
the props values initialized earlier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This object is constructed via one_session_records, thus the latter
needs to pass some arguments along.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The methods that take storage_proxy as argument can now accept a
replica::database instead. So update their signatures and update all
callers. With that, system_keyspace.* no longer depends on storage_proxy
directly.
Use the recently introduced replica side query utility functions to
query the content of the system tables. This allows us to cut the
dependency of the system keyspace on storage proxy.
The methods still take storage proxy parameter, this will be replaced
with replica::database in the next patch.
There is still one hidden storage proxy dependency left, via
cql3::query_processor. This will be addressed later.
Containing utility methods to query data from the local replica.
Intended to be used to read system tables, completely bypassing storage
proxy in the process.
This duplicates some code already found in storage proxy, but that is a
small price to pay, to be able to break some circular dependencies
involving storage proxy, that have been plaguing us since time
immemorial.
One thing we lose with this is the SMP service level used in storage
proxy. If this becomes a problem, we can create one in database and use
it in these methods too.
Another thing we lose is increasing `replica_cross_shard_ops` storage
proxy stat. I think this is not a problem at all as these new functions
are meant to be used by internal users, which will reduce the internal
noise in this metric, which is meant to indicate users not using
shard-aware clients.
As a result of the preceding patches, permissions on a function
are now granted to its creator. Consequently, some permissions may
appear which we did not expect before.
In the test_udf_permissions_serialization, we create a function
as the superuser, and as a result, when we compare the permissions
we specifically granted to the ones read from the LIST PERMISSIONS
result, we get more than expected - this is fixed by granting
permissions explicitly to a new user and only checking this user's
permissions list.
In the test_grant_revoke_udf_permissions case, we test whether
the DROP permission is enforced on a function that we have previously
created as the same user - as a result we have the DROP permission
even without granting it directly. We fix this by testing the DROP
permission on a function created by a different user.
In the test_grant_revoke_alter_udf_permissions case, we previously
tested that we require both ALTER and CREATE permissions when executing
a CREATE OR REPLACE FUNCTION statement. The new permissions required
for this statement now depend on whether we actually CREATE or REPLACE
a function, so now we test that the ALTER permission is required when
REPLACING a function, and the CREATE permission is required when
CREATING a function. After the changes, the case no longer needs to
be artificially extracted from the previous one, so they are merged
now. Analogous adjustments are made in the test case
test_grant_revoke_alter_uda_permissions.
Currently, when a user is altering a function, they need
both CREATE and ALTER permissions, instead of just ALTER.
Additionally, after altering a function, the user is
treated as an owner of this function, gaining all access
permissions to it.
This patch fixes these two issues, by checking only the ALTER
permission when actually altering, and by not modifying the
user's permissions if the user did not actually create
the function.
When a user creates a function, they should have all permissions on
this function.
Similarly, when a user creates a keyspace, they should have all
permissions on functions in the keyspace.
This patch introduces GRANTs on the missing permissions.
In the following patch, the grant_permissions_to_creator method is going
to be also used to grant permissions on a newly created function. The
function resource may contain user-defined types which need the
query processor to be prepared, so we add a reference to it in advance
in this patch for easier review.
Despite the cql-pytests being intended to pass on both Scylla and
Cassandra, the test_permissions.py case was actually failing on
Cassandra in a few cases. The most common issue was a different
exception type returned by Scylla and Cassandra for an invalid
query. This was fixed by accepting 2 types of exceptions when
necessary.
The second issue was java UDF code that did not compile, which was
fixed simply by debugging the code.
The last issue was a case that was scylla_only with no good reason.
The missing java UDFs were added to that case, and the test was
adjusted so that the ALTER permission was checked in a
CREATE OR REPLACE statement only if the UDF already existed --
Scylla requires it in both cases, which will get resolved in the
next patch.
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.
this change uses generator to generate generations for
`make_sstable_for_all_shards()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
always avoid using generation_type if possible. this helps us to
hide the underlying type of generation identifier, which could also
be a UUID in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of assuming the integer-based generation id, let's use
the generation generator for creating a new generation id. this
helps us to improve the testing coverage once we migrate to the
UUID-based generation identifier.
this change uses generator to create generations for
`make_sstable_for_this_shard()`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a global factory onto the client that is
- cross-shard copyable
- generates a client from the local storage_manager for a given endpoint
With that the s3 file handle is fixed and also picks up shared s3
clients from the storage manager instead of creating its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the s3 file handle tries to carry the client's info via an
explicit host name and endpoint config pointer. This is buggy: the
latter pointer is shard-local and cannot be transferred across shards.
This patch prepares the fix by abstracting the client handle part.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the client is accessible directly via the storage_manager, when
the latter is requested to update its endpoint config, it can kick the
client to do the same.
The latter, in turn, can only update the AWS creds info for now. The
endpoint port and https usage are immutable for now.
Also, updating the endpoint address is not possible, but for another
reason -- the endpoint itself is part of the keyspace configuration, and
updating one in object_storage.yaml will have no effect on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently each sstable gets its own instance of an s3::client. This patch
keeps clients on the storage_manager's endpoints map; when creating a
storage for an sstable, it grabs the shared pointer from the map, thus
making one client serve all sstables there (except for those that
duplicated their files with the help of foreign-info, but that's to be
handled by the next patches).
Moving the ownership of a client to the storage_manager level also means
that the client has to be closed on manager's stop, not on sstable
destroy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the map is endpoint -> config_ptr. Wrap the config_ptr into an
s3_endpoint struct. Next patch will keep the client on this new wrapper
struct thus making them shared between sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the map<endpoint, config> sits on the sstables manager, and its
update is governed by the database (because it's peering and can kick
other shards to update it as well).
Having the sharded<storage_manager> at hand lets us free the database
from the need to update configs and keeps the sstables_manager a bit
smaller. Also, this will allow keeping s3 clients shared between
sstables via this map in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager in question keeps track of whatever the sstables_manager needs
to work with the storage (spoiler: only the S3 one). It's a main-local
sharded peering service, so that the container() call can be used by the
next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It just needs to catch the system_error of ENOENT and re-throw it as
malformed_sstable_exception.
Indentation is deliberately left broken. Again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One non-trivial change is the removal of buf temporary variable. That's
because it existed under the same name in the .then() lambda generating
name conflict after coroutinization.
Other than that it's pretty straightforward.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that a schema pull may issue a raft read barrier, it may get stuck if
a majority is not available. Make the operation abortable and abort it
during queries if the timeout is reached.
The immediate mode is similar to timeout mode with gc_grace_seconds
zero. Thus, the gc_before returned should be the query_time instead of
gc_clock::time_point::max in immediate mode.
With gc_before set to gc_clock::time_point::max, a row could be dropped
by compaction even if its TTL has not expired yet.
The following procedure reproduces the issue:
- Start 2 nodes
- Insert data
```
CREATE KEYSPACE ks2a WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 2 };
CREATE TABLE ks2a.tb (pk int, ck int, c0 text, c1 text, c2 text, PRIMARY
KEY(pk, ck)) WITH tombstone_gc = {'mode': 'immediate'};
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (10 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (20 ,1, 'x', 'y', 'z')
USING TTL 1000000;
INSERT into ks2a.tb (pk,ck, c0, c1, c2) values (30 ,1, 'x', 'y', 'z')
USING TTL 1000000;
```
- Run nodetool flush and nodetool compact
- Compaction drops all data
```
~128 total partitions merged to 0.
```
Fixes #13572
Closes #13800
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized either by special-casing this case
or, better, by skipping forward to the next owned range.
This will skip to end-of-stream since there are no
owned ranges.
Fixes #13634
Also, add a respective rest_api unit test
Closes #13849
* github.com:scylladb/scylladb:
test: rest_api: test_storage_service: add test_storage_service_keyspace_cleanup_with_no_owned_ranges
compaction_manager: perform_cleanup: handle empty owned ranges
Schema pull may fail because the pull does not contain everything that
is needed to instantiate a schema pointer. For instance it does not
contain a keyspace. This patch changes the code to issue a raft read
barrier before the pull, which guarantees that the keyspace is created
before the actual schema pull is performed.
Refs: #3760
Fixes: #13211
Fixes https://github.com/scylladb/scylladb/issues/13805
This commit fixes the redirection required by moving the Glossary
page from the top of the page tree to the Reference section.
As the change was only merged to master (not to branch-5.2),
it is not working for version 5.2, which is now the latest stable
version.
For this reason, "stable" in the path must be replaced with "master".
Closes#13847
the series drops some of the callers using SSTable generation as integer. as the generation of SSTable is but an identifier, we should not use it as an integer out of generation_type's implementation.
Closes #13845
* github.com:scylladb/scylladb:
test: drop unused helper functions
test: sstable_mutation_test: avoid using helper using generation_type::int_t
test: sstable_move_test: avoid using helper using generation_type::int_t
test: sstable_*test: avoid using helper using generation_type::int_t
test: sstable_3_x_test: do not use reuseable_sst() accepting integer
Updates to the compaction_group sstable sets are
never done in place. Instead, the update is done
on a mutable copy of the sstable set, and the lw_shared
result is set back in the compaction_group.
(see for example compaction_group::set_main_sstables)
Therefore, there's currently a risk in perform_cleanup's
`get_sstables` lambda: if it yields while in
set.for_each_sstable, the sstable_set might be replaced
and the copy it is traversing may be destroyed.
This was introduced in c2bf0e0b72.
To prevent that, hold on to set.shared_from_this()
around set.for_each_sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13852
all users of these two helpers have switched to their alternatives,
so there is no need to keep them.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of an integer. also, since
`generate_clustered()` is only used by functions in the same
compilation unit, let's take the opportunity to mark it `static`.
and there is no need to pass generation as a template parameter,
we just pass it as a regular parameter.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in helper functions, we just use
`generation_type`. please note, although we'd prefer generating
the generations using the generator, the SSTables used by the tests
modified by this change are stored in the repo; to ensure that the
tests are always able to find the SSTable files, we keep them
unchanged instead of using generation_generator or a random
generation for testing.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type by offering a default parameter, which is a
generation created from 1. this preserves the existing behavior.
we will divert other callers of `reusable_sst(...,
generation_type::int)` in following-up changes in different ways.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using the helper accepting int, we switch to the one which accepts
generation_type.
also, as no callers are using the last parameter of `make_test_sstable()`,
let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This series fixes an issue with altering permissions on UDFs with
parameter types that are UDTs with quoted names and adds
a test for it.
The issue was caused by the format of the temporary string
that represented the UDT in `auth::resource`. After parsing the
user input to a raw type, we created a string representing the
UDT using `ut_name::to_string()`. The segment of the resulting
string that represented the name of the UDT was not quoted,
making us unable to parse it again when the UDT was being
`prepare`d. Other than for this purpose, `ut_name::to_string()`
is used only for logging, so the solution was to modify it to
quote the UDT name when necessary.
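the "maybe quote" behavior can be sketched as follows; this is an illustrative helper, not the real `ut_name::to_string()` implementation, assuming CQL's rule that an identifier needs quoting unless it is all lower-case letters, digits, and underscores, and that embedded quotes are doubled:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// quote an identifier only when it would not survive re-parsing unquoted
inline std::string maybe_quote(const std::string& name) {
    bool needs_quoting = name.empty() ||
        !std::all_of(name.begin(), name.end(), [](unsigned char c) {
            return std::islower(c) || std::isdigit(c) || c == '_';
        });
    if (!needs_quoting) {
        return name;
    }
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') {
            out += '"';  // CQL doubles embedded quotes
        }
        out += c;
    }
    out += '"';
    return out;
}
```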
Ref: https://github.com/scylladb/scylladb/pull/12869
Closes #13257
* github.com:scylladb/scylladb:
cql-pytest: test permissions for UDTs with quoted names
cql: maybe quote user type name in ut_name::to_string()
cql: add a check for currently used stack in parser
cql-pytest: add an optional name parameter to new_type()
Currently, when creating a UDA, we only check for permissions
for creating functions. However, the creator gains all permissions
to the UDA, including the EXECUTE permission. This enables the
user to also execute the state/reduce/final functions that were
used in the UDA, even if they don't have the EXECUTE permissions
on them.
This patch adds checks for the missing EXECUTE permissions, so
that the UDA can only be created if the user has all required
permissions.
The permissions that are now required when creating a UDA
are granted in the existing UDA test.
Fixes #13818
Closes #13819
Currently, when a function has no arguments, the function_args()
method, which is supposed to return a vector of string_views
representing the function's arguments, returns nullopt instead,
as if it were a functions_resource covering all functions
or all functions in a keyspace. As a result, the functions_resource
can't be properly formatted.
This is fixed in this patch by returning an empty vector instead,
and the fix is confirmed in a cql-pytest.
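a sketch of the fix, with a hypothetical signature for `function_args()` (the real method lives in `auth::resource` and uses a different encoding):

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <vector>

// given the encoded argument segment of a function resource, return an
// empty vector for a zero-argument function; reserve nullopt for
// resources that cover all functions (no argument segment at all)
inline std::optional<std::vector<std::string_view>>
function_args(std::optional<std::string_view> encoded) {
    if (!encoded) {
        return std::nullopt;  // "all functions (in a keyspace)" resource
    }
    std::vector<std::string_view> args;
    std::string_view s = *encoded;
    while (!s.empty()) {
        auto sep = s.find('^');
        args.push_back(s.substr(0, sep));
        s = (sep == std::string_view::npos) ? std::string_view{} : s.substr(sep + 1);
    }
    return args;  // empty, not nullopt, for a zero-argument function
}
```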
Fixes #13842
Closes #13844
data_consume_rows keeps an input_stream member that must be closed,
in particular on the error path, where we may destroy it with
readaheads still in flight.
Fixes #13836
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13840
It is possible that a node will have no owned token ranges
in some keyspaces based on their replication strategy,
if the strategy is configured to have no replicas in
this node's data center.
In this case we should go ahead with cleanup that will
effectively delete all data.
Note that this is currently very inefficient, as we need
to filter every partition and drop it as unowned.
It can be optimized either by special-casing this case
or, better, by using skip-forward to the next owned range,
which will skip to end-of-stream since there are no
owned ranges.
Fixes #13634
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
CQL evolved several expression evaluation mechanisms: WHERE clause,
selectors (the SELECT clause), and the LWT IF clause are just some
examples. Most now use expressions, which use managed_bytes_opt
as the underlying value representation, but selectors still use bytes_opt.
This poses two problems:
1. bytes_opt generates large contiguous allocations when used with large blobs, impacting latency
2. trying to use expressions with bytes_opt will incur a copy, reducing performance
To solve the problem, we harmonize the data types to managed_bytes_opt
(#13216 notwithstanding). This is somewhat difficult since the sources
of the values are views into a bytes_ostream. However, luckily
bytes_ostream and managed_bytes_view are mostly compatible, so with a
little effort this can be done.
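the motivation for fragmented values can be illustrated with a toy model; it loosely mirrors the idea behind `managed_bytes` (fixed-size chunks, so no single allocation grows with the value size), not its actual layout:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// a fragmented value: the payload is split into fixed-size chunks,
// so a large blob never forces one large contiguous allocation
class fragmented_value {
    static constexpr size_t chunk_size = 128;
    std::vector<std::vector<uint8_t>> _chunks;
    size_t _size = 0;
public:
    void append(const uint8_t* data, size_t len) {
        _size += len;
        while (len) {
            if (_chunks.empty() || _chunks.back().size() == chunk_size) {
                _chunks.emplace_back();
                _chunks.back().reserve(chunk_size);
            }
            auto& back = _chunks.back();
            size_t n = std::min(len, chunk_size - back.size());
            back.insert(back.end(), data, data + n);
            data += n;
            len -= n;
        }
    }
    size_t size() const { return _size; }
    size_t max_allocation() const {
        size_t m = 0;
        for (auto& c : _chunks) {
            m = std::max(m, c.size());
        }
        return m;
    }
};
```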
The series is neutral wrt performance:
before:
```
222118.61 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
224250.14 tps ( 61.1 allocs/op, 12.1 tasks/op, 43094 insns/op, 0 errors)
224115.66 tps ( 61.1 allocs/op, 12.1 tasks/op, 43092 insns/op, 0 errors)
223508.70 tps ( 61.1 allocs/op, 12.1 tasks/op, 43107 insns/op, 0 errors)
223498.04 tps ( 61.1 allocs/op, 12.1 tasks/op, 43087 insns/op, 0 errors)
```
after:
```
220708.37 tps ( 61.1 allocs/op, 12.1 tasks/op, 43118 insns/op, 0 errors)
225168.99 tps ( 61.1 allocs/op, 12.1 tasks/op, 43081 insns/op, 0 errors)
222406.00 tps ( 61.1 allocs/op, 12.1 tasks/op, 43088 insns/op, 0 errors)
224608.27 tps ( 61.1 allocs/op, 12.1 tasks/op, 43102 insns/op, 0 errors)
225458.32 tps ( 61.1 allocs/op, 12.1 tasks/op, 43098 insns/op, 0 errors)
```
Though I expect with some more effort we can eliminate some copies.
Closes #13637
* github.com:scylladb/scylladb:
cql3: untyped_result_set: switch to managed_bytes_view as the cell type
cql3: result_set: switch cell data type from bytes_opt to managed_bytes_opt
cql3: untyped_result_set: always own data
types: abstract_type: add mixed-type versions of compare() and equal()
utils/managed_bytes, serializer: add conversion between buffer_view<bytes_ostream> and managed_bytes_view
utils: managed_bytes: add bidirectional conversion between bytes_opt and managed_bytes_opt
utils: managed_bytes: add managed_bytes_view::with_linearized()
utils: managed_bytes: mark managed_bytes_view::is_linearized() const
Currently, when a user has permissions on a function/all functions in
keyspace, and the function/keyspace is dropped, the user keeps the
permissions. As a result, when a new function/keyspace is created
with the same name (and signature), they will be able to use it even
if no permissions on it are granted to them.
Similarly to regular UDFs, the same applies to UDAs.
After this patch, the corresponding permissions on functions are dropped
when a function/keyspace is dropped.
Fixes #13820
Closes #13823
Task manager tasks covering scrub compaction at the top,
shard, and table levels.
At these levels we have common scrub tasks for all scrub
modes, since they share code. Scrub modes will be differentiated
at the compaction group level.
Closes #13694
* github.com:scylladb/scylladb:
test: extend test_compaction_task.py to test scrub compaction
compaction: add table_scrub_sstables_compaction_task_impl
compaction: add shard_scrub_sstables_compaction_task_impl
compaction: add scrub_sstables_compaction_task_impl
api: get rid of unnecessary std::optional in scrub
compaction: rename rewrite_sstables_compaction_task_impl
in this series, `data_dictionary::storage_options` is refactored so that each dedicated storage option takes care of itself, instead of putting all the logic into `storage_options`. cleaner this way. as the next step, i will add yet another set of options for the tiered_storage, which is backed by the s3_storage and the local filesystem_storage. with this change, we will be able to group the per-option functionalities together by the option they are designed for, instead of sharding them by the actual function.
Closes #13826
* github.com:scylladb/scylladb:
data_dictionary: define helpers in options
data_dictionary: only define operator== for storage options
The validator classes have their definition in a header located in mutation/, while their implementation is located in a .cc in readers/mutation_reader.cc.
This PR fixes this inconsistency by moving the implementation into mutation/mutation_fragment_stream_validator.cc. The only change is that the validator code gets a new logger instance (but the logger variable itself is left unchanged for now).
Closes #13831
* github.com:scylladb/scylladb:
mutation/mutation_fragment_stream_validator.cc: rename logger
readers,mutation: move mutation_fragment_stream_validator to mutation/
On connection setup, the isolation cookie of the connection is matched
to the appropriate scheduling group. This is achieved by iterating over
the known statement tenant connection types as well as the system
connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and
the remote node has a scheduling group the local one doesn't have. To
avoid demoting a scheduling group of unknown importance, in this case the
default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise
version, as the scheduling groups of the enterprise service levels will
match none of the statement tenants and will hence fall back to the
default scheduling group. As a consequence, while the cluster is mixed,
user workload on old (OSS) nodes will be executed under the system
scheduling group and concurrency semaphore. Not only does this mean that
user workloads are directly competing for resources with system ones,
but the two workloads are now sharing the semaphore too, reducing the
available throughput. This usually manifests in queries timing out on
the old (OSS) nodes in the cluster.
This patch proposes to fix this, by recognizing that the unknown
scheduling group is in fact a tenant this node doesn't know yet, and
matching it with the default statement tenant.
With this, order should be restored, with service-level connections
being recognized as user connections and being executed in the statement
scheduling group and the statement (user) concurrency semaphore.
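the matching logic boils down to something like this sketch (the group and cookie names here are illustrative, not the actual scheduling group names):

```cpp
#include <array>
#include <cassert>
#include <string_view>
#include <utility>

enum class group { statement, system, streaming };

// map an isolation cookie to a scheduling group; an unknown cookie is
// now assumed to be an unknown *tenant* (e.g. an enterprise service
// level) and is run as a user (statement) tenant, instead of falling
// back to the system-side default group
inline group classify(std::string_view cookie) {
    static constexpr std::array<std::pair<std::string_view, group>, 3> known = {{
        {"statement", group::statement},
        {"system", group::system},
        {"streaming", group::streaming},
    }};
    for (auto& [name, g] : known) {
        if (cookie == name) {
            return g;
        }
    }
    return group::statement;  // default statement tenant
}
```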
We have a fixed set of connection types for each tenant. The number of
these connection types can change. Although currently these are
hardcoded in a single place, soon (in the next patch) there will be yet
another place where they will be used. To avoid duplicating these
names, which would make future changes error-prone, centralize them in
a const array, generalizing the concept of a tenant connection type.
When new nodes are added or existing nodes are deleted, the topology
state machine needs to shunt reads from the old nodes to the new ones.
This happens in the `write_both_read_new` state. The problem is that
previously this state was not handled in any way in `token_metadata` and
the read nodes were only changed when the topology state machine reached
the final 'owned' state.
To handle `write_both_read_new` an additional `interval_map` inside
`token_metadata` is maintained similar to `pending_endpoints`. It maps
the ranges affected by the ongoing topology change operation to replicas
which should be used for reading. When the topology state machine
reaches the point when it needs to switch reads to a new topology, it
passes `request_read_new=true` in a call to `update_pending_ranges`.
This forces `update_pending_ranges` to compute the ranges based on the
new topology and store them in the `interval_map`. On the data plane,
when a read on the coordinator needs to decide which endpoints to use,
it first consults this `interval_map` in `token_metadata`, and only if
it doesn't contain a range for the current token does it use the normal
endpoints from `effective_replication_map`.
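the lookup order can be sketched with a toy range map (the real code uses a `boost::icl::interval_map` inside `token_metadata`; this stand-in uses a plain sorted map over (start, end] ranges):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using token = int64_t;
using replicas = std::vector<std::string>;

// ranges affected by an ongoing topology change map to the replicas
// that should serve reads; a miss falls through to the normal
// effective_replication_map endpoints
class read_overrides {
    // key = inclusive upper bound of a (start, end] range
    std::map<token, std::pair<token, replicas>> _ranges;
public:
    void add(token start, token end, replicas r) {
        _ranges[end] = {start, std::move(r)};
    }
    std::optional<replicas> find(token t) const {
        auto it = _ranges.lower_bound(t);
        if (it != _ranges.end() && t > it->second.first) {
            return it->second.second;
        }
        return std::nullopt;  // caller uses the normal endpoints
    }
};
```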
Closes #13376
* github.com:scylladb/scylladb:
storage_proxy, storage_service: use new read endpoints
storage_proxy: rename get_live_sorted_endpoints->get_endpoints_for_reading
token_metadata: add unit test for endpoints_for_reading
token_metadata: add endpoints for reading
sequenced_set: add extract_set method
token_metadata_impl: extract maybe_migration_endpoints helper function
token_metadata_impl: introduce migration_info
token_metadata_impl: refactor update_pending_ranges
token_metadata: add unit tests
token_metadata: fix indentation
token_metadata_impl: return unique_ptr from clone functions
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change restores the use of raft::internal::tagged_uint64
for the raft types and adds back an idl definition for
it that is not marked as final, similar to the way
raft::internal::tagged_id extends utils::tagged_uuid.
Fixes #13752
Closes #13774
* github.com:scylladb/scylladb:
raft, idl: restore internal::tagged_uint64 type
raft: define term_t as a tagged uint64_t
idl: gossip_digest: include required headers
in this series, we encode the value of the generation using UUID, to prepare for the UUID-based generation identifier. simpler this way, as we don't need two ways of encoding an integer or a timeuuid: a uuid with a zero timestamp, and a variant. also, add a `from_string()` factory method to convert a string to a generation, to hide the underlying type of the value from generation_type's users.
Closes #13782
* github.com:scylladb/scylladb:
sstable: use generation_type::from_string() to convert from string
sstable: encode int using UUID in generation_type
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the driver receives some information about the prepared statement, including names of values bound to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`) there's an expectation that the name assigned to this bind marker will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been unified with the code that handles generic function calls, and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being `"partition key token"`, so a change to `token(p1, p2)` broke some things.
This patch sets the name back to `"partition key token"`. To achieve this we detect any restrictions that match the pattern `token(p1, p2, p3) = X` and set the receiver name for X to `"partition key token"`.
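the detection can be sketched like this, with a hypothetical representation of a restriction (the real code works on prepared expressions):

```cpp
#include <cassert>
#include <string>
#include <vector>

// a binary restriction of the form `token(...) = <bind marker>`,
// reduced to the pieces the check needs (illustrative shape)
struct token_restriction {
    std::string function;           // e.g. "token"
    std::vector<std::string> args;  // e.g. {"p1", "p2"}
    std::string receiver_name;      // the name reported to the driver
};

// if the LHS is token() over exactly the partition key columns,
// force the receiver name the drivers expect
inline void fixup_receiver(token_restriction& r,
                           const std::vector<std::string>& pk_columns) {
    if (r.function == "token" && r.args == pk_columns) {
        r.receiver_name = "partition key token";
    }
}
```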
Fixes: #13769
Closes #13815
* github.com:scylladb/scylladb:
cql-pytest: test that bind marker is partition key token
cql3/prepare_expr: force token() receiver name to be partition key token
in this change,
* instead of using "\d+" to match the generation, use "[^-]",
* let generation_type convert a string to a generation
before this change, we cast the matched string in the SSTable file name
to an integer and then constructed a generation identifier from the
integer. this solution makes a strong assumption that the generation is
represented by an integer; we should not encode this assumption in
sstable.cc, instead we'd better let generation_type itself take care of
it. also, to relax the regex restriction for matching the generation,
let's just use any characters except for the delimiter -- "-".
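the effect of the relaxed pattern can be sketched like this; the surrounding file-name pattern is illustrative (only the not-the-delimiter generation component matters), and does not reproduce the real sstable file-name regex:

```cpp
#include <cassert>
#include <regex>
#include <string>

// extract the generation component of a toy sstable-like file name:
// "[^-]+" matches any run of non-delimiter characters, so both integer
// and UUID-style generations are accepted without assuming an integer
inline std::string extract_generation(const std::string& name) {
    static const std::regex re(R"(^[a-z]+-([^-]+)-.*)");
    std::smatch m;
    if (std::regex_match(name, m, re)) {
        return m[1];
    }
    return {};
}
```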
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we already use a UUID for encoding a bigint in the SSTable
registry table, let's use the same approach for encoding the bigint in
generation_type; it is more consistent and less repetitive this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We use set_topology_transition_state to set read_new state
in storage_service::topology_state_load
based on _topology_state_machine._topology.tstate.
This triggers update_pending_ranges to compute and store new ranges
for read requests. We use this information in
storage_proxy::get_endpoints_for_reading
when we need to decide which nodes to use for reading.
We are going to use remapped_endpoints_for_reading, and we need
to make sure we use it in the right place. The
get_live_sorted_endpoints function looks like what we
need: it's used in all read code paths.
Its name, however, did not make this obvious.
Also, we add the parameter ks_name, as we'll need
to pass it to remapped_endpoints_for_reading.
In this patch we add
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new, update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.
Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
instead of dispatching and implementing the per-option handling
right in `storage_option`, define these helpers in the dedicated
option themselves, so `storage_option` is only responsible for
dispatching.
much cleaner this way. this change also makes it easier to add yet
another storage backend.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as the only user of these comparison operators is
`storage_options::can_update_to()`, which just checks if the given
`storage_options` is equal to the stored one, there is no need to
define the <=> operator.
also, no need to add the `friend` specifier: the options are plain
structs, and all the member variables are public.
make the comparison operator a member function instead of a free
function, as in C++20 comparison operators are symmetric.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The validator classes have their definition in a header located in mutation/,
while their implementation is located in a .cc in readers/mutation_reader.cc.
This patch fixes this inconsistency by moving the implementation into
mutation/mutation_fragment_stream_validator.cc. The only change is that
the validator code gets a new logger instance (but the logger variable itself
is left unchanged for now).
this change extracts the storage class and its derived classes
out into their own source files, for a couple of reasons:
- for better readability. sstables.hh is over 1005 lines,
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Closes #13785
* github.com:scylladb/scylladb:
sstables: storage: coroutinize idempotent_link_file()
sstables: extract storage out
When preparing a query each bind marker gets a name.
For a query like:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
The bind marker's name should be `"partition key token"`.
The Java driver relies on this name; having something else,
like `"token(p1, p2)"`, as the name breaks the Java driver.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Let's say that we have a prepared statement with a token restriction:
```cql
SELECT * FROM some_table WHERE token(p1, p2) = ?
```
After calling `prepare` the driver receives some information
about the prepared statement, including names of values bound
to each bind marker.
In case of a partition token restriction (`token(p1, p2) = ?`)
there's an expectation that the name assigned to this bind marker
will be `"partition key token"`.
In a recent change the code handling `token()` expressions has been
unified with the code that handles generic function calls,
and as a result the name has changed to `token(p1, p2)`.
It turns out that the Java driver relies on the name being
`"partition key token"`, so a change to `token(p1, p2)`
broke some things.
This patch sets the name back to `"partition key token"`.
To achieve this we detect any restrictions that match
the pattern `token(p1, p2, p3) = X` and set the receiver
name for X to `"partition key token"`.
Fixes: #13769
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map.
Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
Now update_pending_ranges is quite complex, mainly
because it tries to act efficiently and update only
the affected intervals. However, it uses the function
abstract_replication_strategy::get_ranges, which calls
calculate_natural_endpoints for every token
in the ring anyway.
Our goal is to start reading from the new replicas for
ranges in write_both_read_new state. In the current
code structure this is quite difficult to do, so
in this commit we first simplify update_pending_ranges.
The main idea of the refactoring is to build a new version
of token_metadata based on all planned changes
(join, bootstrap, replace) and then for each token
range compare the result of calculate_natural_endpoints on
the old token_metadata and on the new one.
Those endpoints that are in the new version and
are not in the old version should be added to the pending_ranges.
The add_mapping function is extracted for the
future - we are going to use it to handle read mappings.
Special care is taken when replacing with the same IP.
The coordinator employs the
get_natural_endpoints_without_node_being_replaced function,
which excludes such endpoints from its result. If we compare
the new (merged) and current token_metadata configurations, such
endpoints will also be absent from pending_endpoints since
they exist in both. To address this, we copy the current
token_metadata and remove these endpoints prior to comparison.
This ensures that nodes being replaced are treated
like those being deleted.
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8.
Fixes #13752
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All Raft verbs include `dst_id`, the ID of the destination server, but
it isn't checked. `append_entries` will work even if it arrives at
completely the wrong server (but in the same group). It can cause
problems, e.g. in the scenario of replacing a dead node.
This commit adds verifying if `dst_id` matches the server's ID and if it
doesn't, the Raft verb is rejected.
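the verification amounts to a check like this (illustrative types; the real code lives in raft_group_registry and logs before rejecting the RPC):

```cpp
#include <cassert>
#include <cstdint>

using server_id = uint64_t;

// every raft verb carries the destination server id; the receiving
// node rejects verbs addressed to a different server, e.g. one that
// this node replaced
struct raft_server {
    server_id my_id;
    bool accept(server_id dst_id) const {
        if (dst_id != my_id) {
            return false;  // the real code logs and rejects the verb
        }
        return true;
    }
};
```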
Closes #12179
Testing
---
Testcase and scylla's configuration:
57d3ef14d8
It artificially lengthens the duration of replacing the old node,
which increases the chance of the new node sending an RPC command
to the replaced node.
In the logs of the node that replaced the old one, we can see logs in
the form:
```
DEBUG <time> [shard 0] raft_group_registry - Got message for server <dst_id>, but my id is <my_id>
```
It indicates that the Raft verb with the wrong `dst_id` was rejected.
This test isn't included in the PR because it doesn't catch any specific error.
Closes #13575
* github.com:scylladb/scylladb:
service/raft: raft_group_registry: Add verification of destination ID
service/raft: raft_group_registry: `handle_raft_rpc` refactor
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh, for a couple of reasons:
- for better readability. sstables.hh is over 1005 lines,
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the source
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Rename rewrite_sstables_compaction_task_impl to
sstables_compaction_task_impl, as the new name describes this class
of tasks better. Rewriting sstables is a slightly more fine-grained
type of sstable compaction task than the one needed here.
this series prepares for the UUID-based generation by replacing the general `value()` function with a function with a more specific name: `as_int()`.
Closes #13796
* github.com:scylladb/scylladb:
test: drop a reusable_sst() variant which accepts int as generation
treewide: replace generation_type::value() with generation_type::as_int()
It was already outdated before this PR.
Describe the version of the topology state machine implemented in this PR.
Fix some typos and make it proper markdown so it renders nicely on
GitHub etc.
The original API is gossiper-based. Since we're moving CDC generations
handling to Raft-based topology, we need to implement this API as well.
For now the API creates a new generation unconditionally, in a follow-up
I'll introduce a check to skip the creation if the current generation is
optimal.
Refactor the code, taking a bulk of the CDC-specific code used when
there's a bootstrap request to a separate function. We'll use it
elsewhere as well.
Previously the generation committed in `commit_cdc_generation` state
would be published by the coordinator in `write_both_read_old` state.
This logic assumed that we only create new CDC generations during node
bootstrap.
We'll allow committing new generations without bootstrap (without any
node transitions in fact), so we need this separate state.
After publishing the generation, we check whether there is a
transitioning node; if so, we'll enter `write_both_read_old` as next
state, otherwise we'll make the topology non-transitioning.
This function broadcasts a command to cluster members. It takes a
`node_to_work_on`. We'll need a version which works in situations where
there is no 'node to work on'.
This calls `raft_group0_client::start_operation` and checks if the term
is different from the term that the coordinator was initially created
with; if so, we must no longer continue coordinating the topology.
There was one direct call to `raft_group0_client::start_operation`
without a term check; replace it with the introduced function.
The existing `topology_mutation_builder` took a `raft::server_id` in its
constructor and immediately created a clustering row in the
`system.topology` mutation that it was building for the given node.
This does not allow building mutations which only affect the static
columns.
Split the class into two:
- `topology_mutation_builder` doesn't take `raft::server_id` in its
constructor and contains only the methods that are used to set static
columns. It also has a `with_node` method taking a `raft::server_id`
which returns a `topology_node_mutation_builder&`.
- `topology_node_mutation_builder` creates the clustering row and allows
setting its columns.
We'll use `topology_mutation_builder` when we only want to transition
the cluster-global topology state, without affecting any specific nodes'
states.
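the split can be sketched like this, with plain maps standing in for the real mutation machinery:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// per-node builder: bound to one clustering row (one node's state)
class topology_node_mutation_builder {
    std::map<std::string, std::string>& _row;
public:
    explicit topology_node_mutation_builder(std::map<std::string, std::string>& row)
        : _row(row) {}
    topology_node_mutation_builder& set(const std::string& col, const std::string& v) {
        _row[col] = v;
        return *this;
    }
};

// cluster-global builder: only sets static columns; with_node()
// creates the clustering row for a node on demand
class topology_mutation_builder {
    std::map<std::string, std::string> _static_columns;
    std::map<uint64_t, std::map<std::string, std::string>> _rows;
public:
    topology_mutation_builder& set_static(const std::string& col, const std::string& v) {
        _static_columns[col] = v;
        return *this;
    }
    topology_node_mutation_builder with_node(uint64_t server_id) {
        return topology_node_mutation_builder(_rows[server_id]);
    }
    bool has_row(uint64_t server_id) const { return _rows.count(server_id) > 0; }
    bool has_static(const std::string& col) const { return _static_columns.count(col) > 0; }
};
```

this way a builder that only touches static columns never fabricates a clustering row for any node.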
`topology` currently contains the `requests` map, which is suitable for
node-specific requests such as "this node wants to join" or "this node
must be removed". But for requests for operations that affect the
cluster as a whole, a separate request type and field is more
appropriate. Introduce one.
The enum currently contains the option `new_cdc_generation` for requests
to create a new CDC generation in the cluster. We will implement the
whole procedure in later commits.
- make it a static column in `system.topology`
- move it from node-specific `ring_slice` to cluster-global `topology`
We will use it in scenarios where no node is transitioning.
Also make it `std::optional` in topology for consistency with other
fields (previously, the 'no value' state for this field was represented
using default-constructed `utils::UUID`).
In addition to the data file itself. Currently validation avoids the
index altogether, using the crawling reader which only relies on the
data file and ignores the index+summary. This is because a corrupt
sstable usually has a corrupt index too and using both at the same time
might hide the corruption. This patch adds targeted validation of the
index, independent of and in addition to the already existing data
validation: it validates the order of index entries as well as whether
the entry points to a complete partition in the data file.
This will usually result in duplicate errors for out-of-order
partitions: one for the data file and one for the index file.
Fixes: #9611
Closes #11405
* github.com:scylladb/scylladb:
test/cql-pytest: add test_sstable_validation.py
test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
tools/scylla-sstables: write validation result to stdout
sstables/sstable: validate(): delegate to mx validator for mx sstables
sstables/mx/reader: add mx specific validator
mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter
sstables/mx/reader: template data_consume_rows_context_m on the consumer
sstables/mx/reader: move row_processing_result to namespace scope
sstables/mx/reader: use data_consumer::proceed directly
sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic)
compaction/compaction: remove now unused scrub_validate_mode_validate_reader()
compaction/compaction: move away from scrub_validate_mode_validate_reader()
tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
sstables/sstable: add validate() method
compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
compaction: scrub: use error messages from validator
mutation_fragment_stream_validator: produce error messages in low-level validator
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loop relies on admission checks, triggered by a read
releasing resources, to bring queued reads into the _ready_list while
it is executing the current read. But in some cases the current read
might not free any resources and thus fail to trigger an admission
check, and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner case on the off chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
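the corner case and the fix can be modeled with a toy loop: with the explicit admission check after the ready list drains, all queued permits are executed even though executing a read frees no resources (names and shapes are illustrative, not the real reader-concurrency semaphore):

```cpp
#include <cassert>
#include <deque>

// permits wait in _queue until admitted into _ready_list; the fix is
// the maybe_admit() call after the ready list drains, instead of
// relying solely on resource release to trigger admission
class execution_loop {
    std::deque<int> _queue;
    std::deque<int> _ready_list;
    int _capacity;
public:
    explicit execution_loop(int capacity) : _capacity(capacity) {}
    void enqueue(int permit) { _queue.push_back(permit); }
    bool can_admit() const { return (int)_ready_list.size() < _capacity; }
    void maybe_admit() {
        while (!_queue.empty() && can_admit()) {
            _ready_list.push_back(_queue.front());
            _queue.pop_front();
        }
    }
    int run() {
        int executed = 0;
        maybe_admit();
        while (!_ready_list.empty()) {
            _ready_list.pop_front();  // "execute" permit; frees nothing
            ++executed;
            if (_ready_list.empty()) {
                maybe_admit();  // the explicit check added by the fix
            }
        }
        return executed;
    }
};
```

without the explicit check inside run(), only the initially admitted permits would execute and the rest would sit in the queue.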
Fixes: #13540
Closes #13541
The discussion on the thread says that when we reformat a volume with
another filesystem, the kernel and libblkid may skip populating
/dev/disk/by-* since they detect two filesystem signatures, because
mkfs.xxx did not clear the previous filesystem signature.
To avoid this, we need to run wipefs before running mkfs.
Note that this runs wipefs twice: for the target disks and also for the
RAID device. wipefs for the RAID device is needed since wipefs on the
disks doesn't clear filesystem signatures on /dev/mdX (we may see the
previous filesystem signature on /dev/mdX when we construct the RAID
volume multiple times on the same disks).
Also drop the -f option from mkfs.xfs; it will verify that wipefs
worked as expected.
Fixes #13737
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#13738
With regards to closing the looked-up querier if an exception is thrown.
In particular, this requires closing the querier if a semaphore mismatch
is detected. Move the table lookup above the line where the querier is
looked up, to avoid having to handle the exception from it.
The test performs an `INSERT` followed by a `SELECT`, checking if the
previously inserted data is returned.
This may fail because we're using `ring_delay = 0` in tests and the two
queries may arrive at different nodes, whose `token_metadata` didn't
converge yet (it's eventually consistent based on gossiping).
I illustrated this here:
https://github.com/scylladb/scylladb/issues/12937#issuecomment-1536147455
Ensure that the nodes' token rings are synchronized (by waiting until
the token ring members on each node are the same as the group 0
configuration).
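A rough Python sketch of such a wait (the helper name and fetch function are hypothetical, not the actual test code):

```python
# Hypothetical helper: poll each node until its set of token ring members
# matches the group 0 configuration, or give up after a timeout.
import time

def wait_for_token_ring_sync(nodes, group0_members, fetch_ring_members,
                             timeout=60.0, poll=0.1):
    """fetch_ring_members(node) -> set of host IDs in that node's token ring."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(fetch_ring_members(n) == group0_members for n in nodes):
            return True
        time.sleep(poll)
    return False
```

Only after this returns True is it safe to assume any node can serve both the `INSERT` and the `SELECT` with a converged ring.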
Fixes #12937
Closes #13791
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.
However there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.
To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.
Fixes #13788
Closes #13789
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the source codes, and build Wasm when
necessary. This series adds build commands that
compile C/Rust sources to Wasm and uses them for Wasm
programs that we're already using.
After these changes, adding a new program that should be
compiled from Rust requires only adding its source code
and updating the `wasms` and `wasm_deps` lists in
`configure.py`.
All Wasm programs are built by default when building all
artifacts, artifacts in a given mode, or when building
tests. Additionally, a {mode}-wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/{mode}/wasm,
and are accessed in cql-pytests similarly to the way we're
accessing the scylla binary - using glob.
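A minimal sketch of the glob-based lookup (the helper name and exact path layout are assumptions):

```python
# Hypothetical helper: locate a generated Wasm file by globbing the build
# directory, similarly to how the scylla binary is found by the tests.
import glob
import os

def find_wasm_file(build_dir, name):
    matches = glob.glob(os.path.join(build_dir, "*", "wasm", name + ".wasm"))
    if not matches:
        raise FileNotFoundError(f"no {name}.wasm under {build_dir}")
    # Prefer the most recently built mode if several modes were built.
    return max(matches, key=os.path.getmtime)
```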
Closes #13209
* github.com:scylladb/scylladb:
wasm: replace wasm programs with their source programs
build: prepare rules for compiling wasm files
build: set the type of build_artifacts
test: extend capabilities of Wasm reading helper function
token_metadata takes token_metadata_impl as unique_ptr,
so it makes sense to create it that way in the first place
to avoid unnecessary moves.
token_metadata_impl constructor with shallow_copy parameter
was made public for std::make_unique. The effective
accessibility of this constructor hasn't changed though since
shallow_copy remains private.
After recent changes, we are able to store only the
C/Rust source codes for Wasm programs, and only build
them when necessary. This patch utilizes this
opportunity by removing most of the currently stored
raw Wasm programs, replacing them with C/Rust sources
and adding them to the new build system.
Currently, when we deal with a Wasm program, we store
it in its final WebAssembly Text form. This causes a lot
of code bloat and is hard to read. Instead, we would like
to store only the (C/Rust) source codes, and build Wasm
when necessary. This patch adds build commands that
compile C/Rust sources to Wasm.
After these changes, adding a new program that should be
compiled from Rust requires only adding its source code
and updating the wasms and wasm_deps lists in
configure.py.
All Wasm programs are built by default when building all
artifacts, all artifacts in a given mode, or when building
tests. Additionally, a ninja wasm target is added, so that
it's possible to build just the wasm files.
The generated files are saved in $builddir/wasm.
Currently, build_artifacts is of type set[str] | list, which prevents
us from performing set operations on it. In a future patch, we will
want to take set differences and set intersections with it, so we
initialize build_artifacts to a set in all cases.
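To illustrate why a set type helps, a hypothetical sketch of the kind of set algebra a later patch might do (artifact names are invented):

```python
# Set algebra over requested artifacts: difference to detect unknown names,
# intersection to narrow down to valid ones. Names are illustrative only.
all_artifacts = {"scylla", "iotune", "tests", "wasm"}

def artifacts_to_build(requested, default=all_artifacts):
    requested = set(requested)        # normalize a list input to a set
    unknown = requested - default     # set difference
    if unknown:
        raise ValueError(f"unknown artifacts: {sorted(unknown)}")
    return requested & default if requested else set(default)  # intersection
```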
Currently, we require that the Wasm file is named the same
as the function. In the future we may want multiple functions
with the same name, which we can't currently do due to this
limitation.
This patch allows specifying the function name, so that multiple
files can have a function with the same name.
Additionally, the helper method now escapes "'" characters, so
that they can appear in future Wasm files.
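The escaping itself is the standard CQL single-quote doubling; a minimal sketch (the helper name is invented):

```python
# Single quotes inside the Wasm source must be doubled before the source is
# embedded in a CQL string literal, per standard CQL escaping.
def escape_for_cql_string(text):
    return text.replace("'", "''")
```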
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Therefore, let's implement it.
Fixes https://github.com/scylladb/scylladb/issues/13649.
Closes #13713
* github.com:scylladb/scylladb:
s3: Provide timestamps in the s3 file implementation
s3: Introduce get_object_stats()
s3: introduce get_object_header()
SSTable relies on st.st_mtime for providing creation time of data
file, which in turn is used by features like tombstone compaction.
Fixes #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get_object_stats() will be used for retrieving content size and
also last modified.
The latter is required for filling st_mtim, etc, in the
s3::client::readable_file::stat() method.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since #13452, we switched most of the call sites from std::regex
to boost::regex. in this change, all occurrences of `#include <regex>`
are dropped unless std::regex is used in the same source file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13765
functionality-wise, `uint64_t_tri_compare()` is identical to the
three-way comparison operator, so there is no need to keep it. in this
change, it is dropped in favor of <=>.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13794
The expression system uses managed_bytes_opt for values, but result_set
uses bytes_opt. This means that processing values from the result set
in expressions requires a copy.
Out of the two, managed_bytes_opt is the better choice, since it prevents
large contiguous allocations for large blobs. So we switch result_set
to use managed_bytes_opt. Users of the result_set API are adjusted.
The db::function interface is not modified to limit churn; instead we
convert the types on entry and exit. This will be adjusted in a following
patch.
untyped_result_set is used for internal queries, where ease-of-use is more
important than performance. Currently, cells are held either by value or
by reference (managed_bytes_view). An upcoming change will cause the
result set to be built from managed_bytes_view, making it non-owning, but
the source data is not actually held, resulting in a use-after-free.
Rather than chase the source and force the data to be owned in this case,
just drop the possibility for a non-owning untyped_result_set. It's only
used in non-performance-critical paths and safety is more important than
saving a few cycles.
This also results in simplification: previously, we had a variant selecting
monostate (for NULL), managed_bytes_view (for a reference), and bytes (for
owning data); now we only have a bytes_opt since that already signifies
data-or-NULL.
Once result_set transitions to managed_bytes_opt, untyped_result_set
will follow. For now it's easier to use bytes_opt.
compare() and equal() can compare two unfragmented values or two
fragmented values, but a mix of a fragmented value and an unfragmented
value runs afoul of C++ conversion rules. Add more overloads to
make it simpler for users.
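The mixed comparison can be modeled in a few lines of Python (purely conceptual; the real overloads are C++ and avoid materializing either side contiguously):

```python
# Conceptual model: a value is either unfragmented (one contiguous bytes
# object) or fragmented (a sequence of chunks); equality should hold for the
# same byte sequence regardless of form. Names are invented.
from itertools import zip_longest

def iter_bytes(value):
    if isinstance(value, (bytes, bytearray)):
        yield from value          # unfragmented: one contiguous buffer
    else:
        for chunk in value:       # fragmented: iterate chunk by chunk
            yield from chunk

def fragmented_equal(a, b):
    # Compare byte-by-byte without copying either side into one buffer.
    return all(x == y for x, y in zip_longest(iter_bytes(a), iter_bytes(b)))
```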
The codebase evolved to have several different ways to hold a fragmented
buffer: fragmented_temporary_buffer (for data received from the network;
not relevant for this discussion); bytes_ostream (for fragmented data that
is built incrementally; also used for a serialized result_set), and
managed_bytes (used for lsa and serialized individual values in
expression evaluation).
One problem with this state of affairs is that using data in one
fragmented form with functions that accept another fragmented form
requires either a copy, or templating everything. The former is
unpalatable for fast-path code, and the latter is undesirable for
compile time and run-time code footprint. So we'd like to make
the various forms compatible.
In 53e0dc7530 ("bytes_ostream: base on managed_bytes") we changed
bytes_ostream to have the same underlying data structure as
managed_bytes, so all that remains is to add the right API. This
is somewhat difficult as the data is hidden in multiple layers:
ser::buffer_view<> is used to abstract a slice of bytes_ostream,
and this is further abstracted by using iterators into bytes_ostream
rather than directly using the internals. Likewise, it's impossible
to construct a managed_bytes_view from the internals.
Hack through all of these by adding extract_implementation() methods,
and a build_managed_bytes_view_from_internals() helper. These are all
used by new APIs buffer_view_to_managed_bytes_view() that extract
the internals and put them back together again.
Ideally we wouldn't need any of this, but unifying the type system
in this area is quite an undertaking, so we need some shortcuts.
Becomes useful in later patches.
To avoid double-compiling the call to func(), use an
immediately-invoked lambda to calculate the bytes_view we'll be
calling func() with.
The code was incorrectly passing a data_value of type bytes due to
implicit conversion of the result of serialize() (bytes_opt) to a
data_value object of type bytes_type via:
data_value(std::optional<NativeType>);
mutation::set_static_cell() accepts a data_value object, which is then
serialized using column's type in abstract_type::decompose(data_value&):
bytes b(bytes::initialized_later(), serialized_size(*this, value._value));
auto i = b.begin();
value.serialize(i);
Notice that serialized_size() is taken from the column type, but
serialization is done using data_value's type. The two types may have
a compatible CQL binary representation, but may differ in native
types. serialized_size() may incorrectly interpret the native type and
come up with the wrong size. If the size is too small, we end up with
stack or heap corruption later after serialize().
For example, if the column type is utf8 but value holds bytes, the
size will be wrong because even though both use the basic_sstring
type, they have a different layout due to max_size (15 vs 31).
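A toy model of the bug, with the corruption modeled as an exception instead of a stack smash (all names are invented):

```python
# Toy model: decompose() sizes the buffer using the *column's* type, but the
# value serializes itself using its *own* type. When the two layouts differ,
# the preallocated buffer has the wrong size; the real code corrupts the
# stack or heap, here we raise instead.
def decompose(column_serialized_size, value_bytes):
    size = column_serialized_size(value_bytes)  # size per the column's type
    buf = bytearray(size)
    data = value_bytes                          # bytes per the value's own type
    if len(data) != len(buf):
        raise BufferError(
            f"serializing {len(data)} bytes into a {len(buf)}-byte buffer")
    buf[:] = data
    return bytes(buf)
```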
Fixes #13717
Closes #13787
When requesting memory via `reader_permit::request_memory()`, the
requested amount is added to `_requested_memory` member of the permit
impl. This is because multiple concurrent requests may be blocked and
waiting at the same time. When the requests are fulfilled, the entire
amount is consumed and individual requests track their requested amount
with `resource_units` to release later.
There is a corner case related to this: if a reader permit is registered
as inactive while it is waiting for memory, its active requests are
killed with `std::bad_alloc`, but the `_requested_memory` field is not
cleared. If the read survives because the killed requests were part of
a non-vital background read-ahead, a later memory request will also
include the amount from the failed requests. This extra amount will not
be released and hence will cause a resource leak when the permit is
destroyed.
Fix by detecting this corner case and clearing the `_requested_memory`
field. Modify the existing unit test for the scenario of a permit
waiting on memory being registered as inactive, to also cover this
corner case, reproducing the bug.
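A simplified model of the accounting and the fix (the structure is invented; the real code is the reader permit implementation):

```python
# Toy model of the permit's memory bookkeeping. requested_memory tallies
# blocked requests; when they are fulfilled, the whole tally is consumed.
class PermitImpl:
    def __init__(self):
        self.requested_memory = 0   # sum over currently blocked requests
        self.consumed = 0

    def request_memory(self, amount):
        self.requested_memory += amount

    def kill_pending_requests(self):
        # The fix: clear the tally for requests that will never be fulfilled,
        # so a later request does not also consume the dead requests' amount.
        self.requested_memory = 0

    def fulfil(self):
        self.consumed += self.requested_memory
        self.requested_memory = 0
        return self.consumed
```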
Fixes: #13539
Closes #13679
this is one of the changes to reduce the usage of integer-based generations
in tests. in the future, we will need to expand the tests to exercise
UUID-based generations, or at least to be neutral to the underlying
generation's identifier type. so, removing the helpers which only accept
`generation_type::int_t` would help us make this happen.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()
because we will switch over to a UUID-based generation identifier, the member
function and the free function generation_value() cannot fulfill the needs
anymore. so, in this change, they are consolidated and replaced by
"as_int()", whose name is more specific, will still work, and won't be
misleading even after switching to a UUID-based generation identifier,
whereas `value()` would be confusing by then: it could be an integer or a UUID.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For some reason Scylla crashes on `aarch64` in release mode when calling
`fmt::format` in `raft_removenode` and `raft_decommission`. E.g. on this
line:
```
group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, fmt::format("decomission: request decomission for {}", raft_server.id()));
```
I found this in our configure.py:
```
def get_clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
# we see miscompiles with 1200 and above with format("{}", uuid)
# also coroutine miscompiles with 600
return 300
else:
return 2500
```
but reducing it to `0` didn't help.
I managed to get the following backtrace (with inline threshold 0):
```
void boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear_and_dispose<boost::intrusive::detail::null_disposer>(boost::intrusive::detail::null_disposer) at /usr/include/boost/intrusive/list.hpp:751
(inlined by) boost::intrusive::list_impl<boost::intrusive::mhtraits<seastar::thread_context, boost::intrusive::list_member_hook<>, &seastar::thread_context::_all_link>, unsigned long, false, void>::clear() at /usr/include/boost/intrusive/list.hpp:728
(inlined by) ~list_impl at /usr/include/boost/intrusive/list.hpp:255
void fmt::v9::detail::buffer<wchar_t>::append<wchar_t>(wchar_t const*, wchar_t const*) at ??:?
void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) at ??:?
fmt::v9::vformat[abi:cxx11](fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<fmt::v9::appender, char> >) at ??:?
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v9::format<utils::tagged_uuid<raft::server_id_tag>&>(fmt::v9::basic_format_string<char, fmt::v9::type_identity<utils::tagged_uuid<raft::server_id_tag>&>::type>, utils::tagged_uuid<raft::server_id_tag>&) at /usr/include/fmt/core.h:3206
(inlined by) service::storage_service::raft_removenode(utils::tagged_uuid<locator::host_id_tag>) at ./service/storage_service.cc:3572
```
Maybe it's a bug in `fmt` library?
In any case replacing the call with `::format` (i.e. `seastar::format`
from seastar/core/print.hh) helps.
Do it for the entire file for consistency (and avoiding this bug).
Also, for the future, replace `format` calls with `::format` - now it's
the same thing, but the latter won't clash with `std::format` once we
switch to libstdc++13.
Fixes #13707
Closes #13711
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 256 chars;
* the logic of the function made simpler and more maintainable.
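A sketch of a `read_last_line` with the listed properties, assuming the same rough approach as the patched helper (read backwards in blocks, search for the separator in the joined buffer so a separator split between blocks is still found, decode leniently, cap the result length):

```python
# Hypothetical read_last_line: block sizes, character limit, and error
# handling are assumptions, not the actual scylla_cluster.py/util.py code.
import os

def read_last_line(path, block=512, max_chars=256):
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        while pos > 0:
            step = min(block, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
            # Search the joined buffer, so a separator split between two
            # blocks is still found.
            idx = data.rstrip(b"\n").rfind(b"\n")
            if idx != -1:
                data = data[idx + 1:]
                break
    # Decode leniently so a partially-read multi-byte UTF-8 sequence does
    # not raise, then cap the length in characters, not bytes.
    return data.rstrip(b"\n").decode("utf-8", errors="replace")[-max_chars:]
```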
Closes #13427
* github.com:scylladb/scylladb:
pylib_test: add tests for read_last_line
pytest: add pylib_test directory
scylla_cluster.py: fix read_last_line
scylla_cluster.py: move read_last_line to util.py
to avoid the FTBFS after we bump up the Seastar submodule, which bumped its API level to v7. API v7 is a breaking change, so in order to unbreak the build we have to hardwire the API level to 6. `configure.py` also does this.
Closes #13780
* github.com:scylladb/scylladb:
build: cmake: disable deprecated warning
build: cmake: use Seastar API level 6
we introduced the linkage to Boost::unit_test_framework in
fe70333c19; this library is used by
test/lib/test_utils.cc, so update CMake accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13781
This is a follow-up to #13399, the patch
addresses the issues mentioned there:
* linesep can be split between blocks;
* linesep can be part of UTF-8 sequence;
* avoid excessively long lines, limit to 512 chars;
* the logic of the function made simpler and more
maintainable.
There are two of them currently, with slightly different declarations. Better to leave only one.
Closes #13772
* github.com:scylladb/scylladb:
test: Deduplicate test::filename() static overload
test: Make test::filename return fs::path
The method in question suffers from scylladb/seastar#1298. The PR fixes it and makes it a bit shorter along the way.
Closes #13776
* github.com:scylladb/scylladb:
sstable: Close file at the end
sstables: Use read_entire_stream_cont() helper
This commit changes the configuration in the conf.py
file to make branch-5.2 the latest version and
remove it from the list of unstable versions.
As a result, the docs for version 5.2 will become
the default for users accessing the ScyllaDB Open Source
documentation.
This commit should be merged as soon as version 5.2
is released.
Closes #13681
One is unused; the other is not really required to be public.
Closes #13771
* github.com:scylladb/scylladb:
file_writer: Remove static make() helper
sstable: Use toc_filename() to print TOC file path
The test case consists of two internal sub-test-cases. Making them
explicit kills three birds with one stone
- improves parallelizm
- removes env's tempdir wiping
- fixes code indentation
refs: #12707
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13768
since Seastar now deprecates a bunch of APIs which accept io_priority_class,
we started to get deprecation warnings. before migrating to the v7 API,
let's disable this warning.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to avoid the FTBFS after we bump up the Seastar submodule,
which bumped its API level to v7. API v7 is a breaking
change, so in order to unbreak the build we have to hardwire
the API level to 6. `configure.py` also does this.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 02d5a0d7c...f94b1bb9c (12):
> Merge 'Unify CPU scheduling groups and IO priority classes' from Pavel Emelyanov
> scripts: addr2line: relax regular expression for matching kernel traces
> add dirs for clangd to .gitignore
> http::client: Log failed requests' body
> build: always quote the ENVIRONMENT with quotes
> exception_hacks: Change guard check order to work around static init fail
> shared_future: remove support for variadic futures
> iotune: Don't close file that wasn't opened
Fixes #13439
> Merge 'Relax per tick IO grab threshold' from Pavel Emelyanov
> future: simplify constraint on then() a little
> Merge 'coroutine: generator: initialize const member variable and enable generator tests' from Kefu Chai
> future: drop libc++ std::tuple compatibility hack
Closes #13777
The thing is that when closing a file input stream the underlying file is
not .close()-d (see scylladb/seastar#1298). The remove_by_toc_name() is
buggy in this sense. Using with_closeable() fixes it and makes the code
shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The remove_by_toc_name() wants to read the whole stream into a sstring.
There's a convenience helper to facilitate that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a follow-up to said PR, completing the name changes: renaming all symbols, names, comments etc., so all is consistent and up-to-date.
Closes #13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
e2c9cdb576 moved the validation of the range tombstone change to the place where it is actually consumed, so we don't attempt to pass purged or discarded range tombstones to the validator. In doing so, however, the validation pass was moved after the consume call, which moves the range tombstone change, so the validator was being passed a moved-from range tombstone. Fix this by moving the validation to before the consume call.
Refs: #12575
Closes #13749
* github.com:scylladb/scylladb:
test/boost/mutation_test: add sanity test for mutation compaction validator
mutation/mutation_compactor: add validation level to compaction state query constructor
mutation/mutation_compactor: validate range tombstone change before it is moved
The current S3 client was tested over minio, and it takes a few more touches to work with Amazon S3.
The main challenge here is to support signed requests. The AWS S3 server explicitly bans unsigned multipart-upload requests, which in turn are an essential part of the sstables S3 backend, so we do need signing. Signing a request has many options and requirements; one of them is that the request _body_ may or may not be included in the signature calculations. This is called "(un)signed payload". Requests sent over plain HTTP require payload signing (i.e. the request body should be included in the signature calculations), which can be a bit troublesome, so instead the PR uses unsigned payload (i.e. it doesn't include the request body in the signature calculation, only the necessary headers and query parameters), but thus also needs HTTPS.
So what this set does is make the existing S3 client code sign requests. In order to sign a request the code needs to get the AWS key and secret (and region) from somewhere, and this somewhere is the conf/object_storage.yaml config file. The signature-generating code was previously merged (moved from alternator code) and updated to suit the S3 client's needs.
In order to properly support HTTPS the PR adds a special connection factory to be used with the seastar http client. The factory does DNS resolving of AWS endpoint names and configures gnutls system trust.
fixes: #13425
Closes #13493
* github.com:scylladb/scylladb:
doc: Add a document describing how to configure S3 backend
s3/test: Add ability to run boost test over real s3
s3/client: Sign requests if configured
s3/client: Add connection factory with DNS resolve and configurable HTTPS
s3/client: Keep server port on config
s3/client: Construct it with config
s3/client: Construct it with sstring endpoint
sstables: Make s3_storage with endpoint config
sstables_manager: Keep object storage configs onboard
code: Introduce conf/object_storage.yaml configuration file
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT *
FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
```
The reason for the regression is that the call to `do_for_each_sstable` in `collect_all_shared_sstables` to search for sstables that need cleanup caused the list of sstables in the sstable directory to be moved and cleared.
parallel_for_each_restricted moves the container passed to it into a `do_with` continuation. This is required for parallel_for_each_restricted.
However, moving the container is destructive, and so the decision whether to move or not needs to be the caller's, not the callee's.
This patch changes the signature of parallel_for_each_restricted to accept a container rather than an rvalue reference, allowing the callers to decide whether to move or not.
Most callers are converted to move the container, except for `do_for_each_sstable`, which copies `_unshared_local_sstables`, allowing callers to call `dir.do_for_each_sstable` multiple times without moving the list contents.
Closes #13526
* github.com:scylladb/scylladb:
sstable_directory: coroutinize parallel_for_each_restricted
sstable_directory: parallel_for_each_restricted: use std::ranges for template definition
sstable_directory: parallel_for_each_restricted: do not move container
There are two of them currently, both returning fs::path for sstable
components. One is static and can be dropped; callers are patched to use
the non-static one, making the code a tiny bit shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable::filename() is private and is not supposed to be used as a
path to open any files. However, tests are different, and they sometimes
use it as one. For that they use a test wrapper that has access to private
members and may make assumptions about the meaning of sstable::filename().
That said, test::filename() should return fs::path, not sstring.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit 1cb95b8cf caused a small regression in the debug printer.
After that commit, range tombstones are printed to stdout,
instead of the target stream.
In practice, this causes range tombstones to appear in test logs
out of order with respect to other parts of the debug message.
Fix that.
Closes #13766
The sstable::write_toc() gets TOC filename from file writer, while it
can get it from itself. This makes the file_writer::get_filename()
private and actually improves logging, as the writer is not required
to have the filename onboard, while sstable always has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All Raft verbs include dst_id, the ID of the destination server, but it isn't checked.
`append_entries` will work even if it arrives at completely the wrong server (as long as it is in the same group).
This can cause problems, e.g. in the scenario of replacing a dead node.
This commit adds verification that `dst_id` matches the server's ID; if it doesn't,
the Raft verb is rejected.
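A minimal model of the added check (the exception type and handler shape are invented):

```python
# Reject a Raft verb whose dst_id does not match this server's own ID,
# instead of silently handling a verb meant for another server.
class WrongDestination(Exception):
    pass

def handle_raft_rpc(my_id, dst_id, handler, *args):
    if dst_id != my_id:
        raise WrongDestination(f"verb for {dst_id} arrived at {my_id}")
    return handler(*args)
```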
Closes #12179
storage_service uses raft_group0, but during shutdown the latter is
destroyed before the former is stopped. This series moves raft_group0
destruction to after storage_service is stopped. For the move to work,
some existing dependencies of raft_group0 are dropped, since they are
not really needed during the object's creation.
Fixes #13522
In case an sstable unit test case is run individually, it would fail
with an exception saying that the S3_... environment is not set. It's better
to skip the test case rather than fail it. If someone wants to run it from
a shell, they will have to prepare an S3 server (minio/AWS public bucket) and
provide the proper environment for the test case.
refs: #13569
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13755
One-way RPC and two-way RPC have different semantics, i.e. with the former the
client doesn't need to wait for an answer.
This commit splits the logic of `handle_raft_rpc` to enable handling differences
in semantics, e.g. error handling.
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times; it only guarantees the
same order of events on all replicas.
When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.
To issue a read barrier it is sufficient to alter a non-existing
state: in order to validate the DDL the node needs to sync with the
leader and fetch its latest group0 state.
Fixes #13518 (flaky topology test).
Closes #13756
raft_group0 does not really depend on cdc::generation_service; it needs
it only transiently, so pass it to the appropriate methods of raft_group0
instead of during its creation.
Commit ecbd112979
`distributed_loader: reshard: consider sstables for cleanup`
caused a regression in loading new sstables using the `upload`
directory, as seen in e.g. https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/230/testReport/migration_test/TestMigration/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_migrate_sstable_without_compression_3_0_md_/
```
query = "SELECT COUNT(*) FROM cf"
statement = SimpleStatement(query)
s = self.patient_cql_connection(node, 'ks')
result = list(s.execute(statement))
> assert result[0].count == expected_number_of_rows, \
"Expected {} rows. Got {}".format(expected_number_of_rows, list(s.execute("SELECT * FROM ks.cf")))
E AssertionError: Expected 1 rows. Got []
E assert 0 == 1
E +0
E -1
```
The reason for the regression is that the call to `do_for_each_sstable`
in `collect_all_shared_sstables` to search for sstables that need
cleanup caused the list of sstables in the sstable directory to be
moved and cleared.
parallel_for_each_restricted moves the container passed to it
into a `do_with` continuation. This is required for
parallel_for_each_restricted.
However, moving the container is destructive, and so
the decision whether to move or not needs to be the
caller's, not the callee's.
This patch changes the signature of parallel_for_each_restricted
to accept an lvalue reference to the container rather than an rvalue
reference, allowing the callers to decide whether to move or not.
Most callers are converted to move the container, as they effectively do
today, and a new method, `filter_sstables`, was added for the
`collect_all_shared_sstables` use case, which allows the `func` that
processes each sstable to decide whether the sstable is kept
in `_unshared_local_sstables` or not.
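Conceptually, `filter_sstables` lets the callback decide what stays in the list instead of moving the list out; a rough Python model (names illustrative):

```python
# The per-sstable callback returns True to keep the sstable listed; the list
# is updated in place, so nothing is destructively moved out of it.
def filter_sstables(unshared_local_sstables, func):
    kept = [sst for sst in unshared_local_sstables if func(sst)]
    unshared_local_sstables[:] = kept  # update in place, no destructive move
```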
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the validate command uses the logger to output the result of
validation. This is inconsistent with other commands which all write
their output to stdout and log any additional information/errors to
stderr. This patch updates the validate command to do the same. While at
it, remove the "Validating..." message; it is not useful.
We have a more in-depth validator for the mx format, so delegate to that
if the validated sstable is of that format. For kl/la we fall-back to
the reader-level validator we used before.
Working with the low-level sstable parser and index reader, this
validator also cross-checks the index with the data file, making sure
all partitions are located at the position and in the order the index
describes. Furthermore, if the index also has promoted index, the order
and position of clustering elements is checked against it.
This is above the usual fragment kind order, partition key order and
clustering order checks that we already had with the reader-level
validator.
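The partition-level cross-check can be sketched like this (purely conceptual; the real validator walks sstable internals, not plain lists):

```python
# Cross-check the index against the data file: each indexed partition must
# start at the recorded position, and positions must be strictly increasing.
def validate_index_against_data(index_entries, data_partitions):
    """index_entries: [(key, position)]; data_partitions: {position: key}."""
    errors = []
    last_pos = -1
    for key, pos in index_entries:
        if pos <= last_pos:
            errors.append(f"index out of order at {key!r} (pos {pos})")
        last_pos = pos
        if data_partitions.get(pos) != key:
            errors.append(f"partition {key!r} not found at data position {pos}")
    return errors
```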
it turns out the only places where we have compiler warnings of -Wparentheses-equality are in the source code generated by ANTLR. strictly speaking, this is valid C++ code, just not quite readable from the hygienic point of view. so let's enable this warning in the source tree, but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang, so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
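A sketch of how such per-source suppression can look in CMake (the file paths and the wiring into the real targets are assumptions, not the actual scylladb build files):

```cmake
# enable the warning tree-wide...
add_compile_options(-Wparentheses-equality)
# ...but suppress it for the ANTLR-generated sources only
set_source_files_properties(
  ${CMAKE_BINARY_DIR}/cql3/CqlLexer.cpp
  ${CMAKE_BINARY_DIR}/cql3/CqlParser.cpp
  PROPERTIES
    COMPILE_OPTIONS "-Wno-parentheses-equality")
```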
Closes #13762
* github.com:scylladb/scylladb:
build: only apply -Wno-parentheses-equality to ANTLR generated sources
compaction: disambiguate format_to()
it turns out the only places where we have compiler warnings of
-Wparentheses-equality are in the source code generated by ANTLR. strictly
speaking, this is valid C++ code, just not quite readable from the
hygienic point of view. so let's enable this warning in the source tree,
but only disable it when compiling the sources generated by ANTLR.
please note, this warning option is supported by both GCC and Clang,
so no need to test if it is supported.
for a sample of the warnings, see:
```
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
if ( (LA4_0 == '$'))
~~~~~~^~~~~~
/home/kefu/dev/scylladb/build/cmake/cql3/CqlLexer.cpp:21752:38: note: remove extraneous parentheses around the comparison to silence this warning
if ( (LA4_0 == '$'))
~ ^ ~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Preceding commits in this patch series have extended the MVCC
mechanism to allow for versions with different schemas
in the same entry/snapshot, with on-the-fly and background
schema upgrades to the most recent version in the chain.
Given that, we can perform gentle schema upgrades by simply
adding an empty version with the target schema to the front
of the entry.
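A toy model of this idea, with plain structs standing in for the real MVCC types:

```cpp
#include <cassert>
#include <list>

// Toy model of the mechanism described above (not Scylla's actual MVCC
// types): an entry is a front-to-back chain of versions, each carrying
// its own schema id. A "gentle" upgrade just pushes an empty version
// with the target schema to the front; background merging (not modeled
// here) later collapses the chain to a single version.
struct version {
    int schema_id;
    bool empty;     // a freshly pushed upgrade version holds no data
};

struct entry {
    std::list<version> versions;    // front = newest

    void upgrade(int target_schema) {
        versions.push_front(version{target_schema, true});
    }
    int current_schema() const {
        return versions.front().schema_id;
    }
};
```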
This patch is intended to be the first and only behaviour-changing patch in the
series. Previous patches added code paths for multi-schema snapshots, but never
exercised them, because before this patch two different schemas within a single
MVCC chain never happened. This patch makes it happen and thus exercises all the
code in the series up until now.
Fixes #2577
Each partition_version is allowed to have a different schema now.
As of this patch, all versions reachable from a snapshot/entry always
have the same schema, but this will change in an upcoming patch.
This commit prepares merge_partition_versions() for that.
See code comments added in this patch for a detailed description.
The design chosen in this patch requires adding a bit of information to
partition_version. Due to alignment, it results in a regrettable waste of 8
bytes per partition. If we want, we can recover that in the future by squeezing
the bit into some free bit in other fields, for example the highest or lowest
bits of one of the pointers in partition_version.
After this patch, MVCC should be prepared for replacing the atomic schema
upgrade() of cache/memtable entries with a gentle upgrade().
To avoid reactor stalls during schema upgrades of memtable and cache entries,
we want to do them interruptibly, not atomically. To achieve that, we want
to reuse the existing gentle version merging mechanism. If we generalize
version merging algorithms to handle `mutation_partition`s with different
schemas, a schema upgrade will boil down simply to adding a new empty MVCC
version with the new schema.
In a previous patch, we already generalized the cursor to upgrade rows
on the fly when reading.
But we still have to generalize the other MVCC algorithm: the merging of
superfluous mutation_partition_v2 objects. This patch modifies the two-version
merging algorithm: apply_monotonically(). The next patch will update its caller,
merge_partition_versions(), to make use of the updated apply_monotonically()
properly.
The upcoming schema upgrade change will perform the schema upgrade by adding
a new version (with the new schema) to the partition entry.
To clean a multi-version entry, eviction is not enough - the versions have
to be merged and/or cleared first. drain() does just that.
An upcoming patch will enable multiple schemas within a single entry,
after the entry is upgraded.
partition_entry::squashed isn't prepared for that yet.
This patch prepares it.
To support gentle schema upgrades, each version has its own schema.
Currently this facility is unused, and the schema is equal for
all versions in a snapshot. But in upcoming commits this will change.
In the new design, after an entry upgrade, there will be a transitional
period where two versions with different schemas will coexist in a snapshot.
Eventually, these versions will be merged by mutation_cleaner into one
version with the current schema, but until then reads have to merge
multi-schema snapshots on the fly.
This commit implements support for per-version schemas in the cursor.
When in upcoming patches we allow multiple schema versions within a single
snapshot, reads will have to upgrade rows on the fly.
This also applies to squashed()
When in upcoming patches we allow multiple schema versions within a single
snapshot, reads will have to upgrade rows on the fly.
This also applies to the static row.
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
A helper which will be used during upcoming changes to
mutation_partition_v2::apply_monotonically(), which will extend it to merging
versions with different schemas.
It will be used in upcoming commits.
A factory function is used, rather than an actual constructor,
because we want to delegate the (easy) case of equal schemas
to the existing single-schema constructor.
And that's impossible to do with only constructors (at least without
invoking a copy/move constructor).
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
After adding a _schema field to each partition version,
the field in memtable_entry is redundant. It can be always recovered
from the latest version. Remove it.
After adding a _schema field to each partition version,
the field in cache_entry is redundant. It can be always recovered
from the latest version. Remove it.
After adding a _schema field to each partition version,
the field in partition_snapshot is redundant. It can be always recovered
from the latest version. Remove it.
Currently, partition_version does not reference its schema.
All partition_versions reachable from an entry/snapshot have the same schema,
which is referenced in memtable_entry/cache_entry/partition_snapshot.
To enable gentle schema upgrades, we want to use the existing background
version merging mechanism. To achieve that, we will move the schema reference
into partition_version, and we will allow neighbouring MVCC versions to have
different schemas, and we will merge them on-the-fly during reads and
persistently during background version merges.
This way, an upgrade will boil down to adding a new empty version with
the new schema.
This patch adds the _schema field to partition_version and propagates
the schema pointer to it from the version's containers (entry/snapshot).
Subsequent patches will remove the schema references from the containers,
because they are now redundant.
We don't have a convention for when to pass `schema_ptr` and when to pass
`const schema&` around.
In general, IMHO the natural convention for such a situation is to pass the
shared pointer if the callee might extend the lifetime of shared_ptr,
and pass a reference otherwise. But we convert between them willy-nilly
through shared_from_this().
While passing a reference to a function which actually expects a shared_ptr
can make sense (e.g. due to the fact that smart pointers can't be passed in
registers), the other way around is rather pointless.
This patch takes one occurrence of that and modifies the parameter to a reference.
Since enable_shared_from_this makes shared pointer parameters and reference
parameters interchangeable, this is a purely cosmetic change.
They will be used in upcoming patches which introduce incremental schema upgrades.
Currently, these variants always copy cells during upgrade.
This could be optimized in the future by adding a way to move them instead.
The comment refers to "other", but it means "pe". Fix that.
The patch also adds a bit of context to the mutation_partition jargon
("evictability" and "continuity"), by reminding how it relates
to the concrete abstractions: memtable and cache.
Most variants of apply() and apply_monotonically() in mutation_partition_v2
are leftovers from mutation_partition, and are unused. Thus they only
add confusion and maintenance burden. Since we will be modifying
apply_monotonically() in upcoming patches, let's clean them up, lest
the variants become stale.
This patch removes all unused variants of apply() and apply_monotonically()
and "manually inlines" the variants which aren't used often enough to carry
their own weight.
In the end, we are left with a single apply_monotonically() and two convenience
apply() helpers.
The single apply_monotonically() accepts two schema arguments. This facility
is unimplemented and unused as of this patch - the two arguments are always
the same - but it will be implemented and used in later parts of the series.
The comment suggests that the order of sentinel insertion is meaningful because
of the resulting eviction order. But the sentinels are added to the tracker
with the two-argument version of insert(), which inserts the second argument
into the LRU right before the (more recent) first argument.
Thus the eviction order of sentinels is decided explicitly, and it doesn't
rely on insertion order.
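A toy model of the two-argument insert semantics described above (the real tracker API differs):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <string>

// Toy LRU: insert(more_recent, item) places `item` immediately before
// `more_recent`, so the relative eviction order of the two is set
// explicitly rather than depending on insertion order.
struct lru {
    std::list<std::string> items;

    void insert(const std::string& more_recent, const std::string& item) {
        auto it = std::find(items.begin(), items.end(), more_recent);
        items.insert(it, item);  // lands right before `more_recent`
    }
};
```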
In some mixed-schema apply helpers for tests, the source mutation
is accidentally copied with the target schema. Fix that.
Nothing seems to be currently affected by this bug; I found it when it
was triggered by a new test I was adding.
we should always qualify `format_to` with its namespace. otherwise
we'd have the following failure when compiling with libstdc++ from GCC-13:
```
/home/kefu/dev/scylladb/compaction/table_state.hh:65:16: error: call to 'format_to' is ambiguous
return format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
^~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13760
Support the AWS_S3_EXTRA environment variable that is ':'-split; the
respective substrings are set as endpoint AWS configuration. This makes
it possible to run the boost S3 test over real S3.
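The ':'-splitting could be sketched like this (a hypothetical helper, not the actual test code):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a ':'-separated string such as the AWS_S3_EXTRA value into its
// substrings, each of which would then be applied to the endpoint
// configuration.
std::vector<std::string> split_extra(const std::string& raw) {
    std::vector<std::string> parts;
    std::istringstream in(raw);
    for (std::string item; std::getline(in, item, ':');) {
        parts.push_back(item);
    }
    return parts;
}
```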
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the endpoint config specifies AWS key, secret and region, all the
S3 requests get signed. Signature should have all the x-amz-... headers
included and should contain at least three of them. This patch includes
x-amz-date, x-amz-content-sha256 and host headers into the signing list.
The content can be left unsigned when sent over HTTPS, which is what this
patch does.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Existing seastar factories work on socket_address, but in S3 we have
an endpoint name, which is a DNS name in the case of real S3. So this patch
creates the http client for S3 with the custom connection factory that
does two things.
First, it resolves the provided endpoint name into address.
Second, it loads trust-file from the provided file path (or sets system
trust if configured that way).
Since s3 client creation is no-waiting code currently, the above
initialization is spawned in a fiber, and before creating the connection
this fiber is waited upon.
This code probably deserves living in seastar, but for now it can land
next to utils/s3/client.cc.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code temporarily assumes that the endpoint port is 9000.
This is what tests' local minio is started with. This patch keeps the
port number on the endpoint config and makes the test get the port number
from the minio starting code via the environment.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to the previous patch -- extend the s3::client constructor to take
the endpoint config value next to the endpoint string. For now the
configs are likely empty, but they are as yet unused too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the client is constructed with a socket_address which is prepared
by the caller from the endpoint string. That's not flexible enough,
because s3 client needs to know the original endpoint string for two
reasons.
First, it needs to lookup endpoint config for potential AWS creds.
Second, it needs this exact value as Host: header in its http requests.
So this patch just relaxes the client constructor to accept the endpoint
string and hard-code port 9000. The latter is temporary -- this is how
the local tests' minio is started -- but the next patch will make it configurable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch. The sstables::s3_storage gets the
endpoint config instance upon creation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The user sstables manager will need to provide endpoint config for
sstables' storage drivers. For that it needs to get it from db::config
and keep it in sync with its updates.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In order to access real S3 bucket, the client should use signed requests
over https. Partially this is due to security considerations, partially
this is unavoidable, because multipart uploading is banned for unsigned
requests on S3. Also, signed requests over plain HTTP require
signing the payload as well, which is a bit troublesome, so it's better
to stick to secure https and keep payload unsigned.
To prepare signed requests the code needs to know three things:
- aws key
- aws secret
- aws region name
The latter could be derived from the endpoint URL, but it's simpler to
configure it explicitly, all the more so since there's an option to use S3
URLs without a region name in them, which we may want to use some time.
To keep the described configuration, the proposed place is the
object_storage.yaml file with the format:

endpoints:
  - name: a.b.c
    port: 443
    aws_key: 12345
    aws_secret: abcdefghijklmnop
    ...
When loaded, the map gets into db::config and later will be propagated
down to sstables code (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Allow the validation level to be customized by whoever creates the
compaction state. Add a default value (the previous hardcoded level) to
avoid the churn of updating all call sites.
e2c9cdb576 moved the validation of the
range tombstone change to the place where it is actually consumed, so we
don't attempt to pass purged or discarded range tombstones to the
validator. In doing so however, the validate pass was moved after the
consume call, which moves the range tombstone change, the validator
having been passed a moved-from range tombstone. Fix this by moving the
validation to before the consume call.
Refs: #12575
in this series, we try to use `generation_type` as a proxy to hide its underlying type from the consumers. this paves the road to the UUID-based generation identifier, as by then we cannot assume the type of the `value()` without asking `generation_type` first; better to leave all the formatting and conversions to `generation_type`. also, this series changes the "generation" column of the sstable registry table to "uuid", and converts its value to the original generation_type when necessary. this paves the road to a world with a UUID-based generation id.
Closes #13652
* github.com:scylladb/scylladb:
db: use uuid for the generation column in sstable registry table
db, sstable: add operator data_value() for generation_type
db, sstable: print generation instead of its value
Currently there are only 2 tests for S3 -- the pure client test and the compound object_store test that launches scylla, creates an s3-backed table and CQL-queries it. At the same time there's a whole lot of small unit tests for sstables functionality, part of which can run over S3 storage too.
This PR adds this support and patches several test cases to use it. More test cases are to come later on demand.
Fixes: #13015
Closes #13569
* github.com:scylladb/scylladb:
test: Make resharding test run over s3 too
test: Add lambda to fetch bloom filter size
test: Tune resharding test use of sstable::test_env
test: Make datafile test case run over s3 too
test: Propagate storage options to table_for_test
test: Add support for s3 storage_options in config
test: Outline sstables::test_env::do_with_async()
test: Keep storage options on sstable_test_env config
sstables: Add and call storage::destroy()
sstables: Coroutinize sstable::destroy()
Currently mp_row_consumer_m creates an alias to data_consumer::proceed.
Code in the rest of the file uses both the unqualified name and
mp_row_consumer_m::proceed. Remove the alias and just use
`data_consumer::proceed` directly everywhere, which leads to cleaner code.
Test sstable::validate() instead. Also rename the unit test testing said
method from scrub_validate_mode_validate_reader_test to
sstable_validate_test to reflect the change.
At this point this test should probably be moved to
sstable_datafile_test.cc, but not in this patch.
Sadly this transition means we lose some test scenarios. Since now we
have to write the invalid data to sstables, we have to drop scenarios
which trigger errors on either the write or read path.
To replace the validate code currently in compaction/compaction.cc (not
in this commit). We want to push down this logic to the sstable layer,
so that:
* Non compaction code that wishes to validate sstables (tests, tools)
doesn't have to go through compaction.
* We can abstract how sstables are validated, in particular we want to
add a new more low-level validation method that only the more recent
sstable versions (mx) will support.
Currently said method creates a combined reader from all the sstables
passed to it then validates this combined reader.
Change it to validate each sstable (reader) individually in preparation
of the new validate method which can handle a single sstable at a time.
Note that this is not going to make much of an impact in practice; all
callers already pass a single sstable to this method.
Currently, error messages for validation errors are produced in several
places:
* the high-level validator (which is built on the low-level one)
* scrub compaction and validation compaction (scrub in validate mode)
* scylla-sstable's validate operation
We plan to introduce yet another place which would use the low-level
validator and hence would have to produce its own error messages. To cut
down all this duplication, centralize the production of error messages
in the low-level validator, which now returns a `validation_result`
object instead of bool from its validate methods. This object can be
converted to bool (so it's backwards compatible) and also contains an
error message if validation failed. In the next patches we will migrate
all users of the low level validator (be that direct or indirect) to use
the error messages provided in this result object instead of coming up
with one themselves.
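A minimal sketch of what such a result type could look like (the actual class in Scylla may differ in names and details):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Result of a validation pass: converts to bool for backwards
// compatibility, and carries a centralized error message on failure.
class validation_result {
    std::string _error;     // empty means "valid"
public:
    static validation_result valid() { return {}; }
    static validation_result invalid(std::string msg) {
        validation_result r;
        r._error = std::move(msg);
        return r;
    }
    // implicit conversion keeps existing bool-returning callers working
    operator bool() const { return _error.empty(); }
    const std::string& error_message() const { return _error; }
};
```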
The evictable reader must ensure that each buffer fill makes forward
progress, i.e. the last fragment in the buffer has a position larger
than the last fragment from the last buffer-fill. Otherwise, the reader
could get stuck in an infinite loop between buffer fills, if the reader
is evicted in-between.
The code guaranteeing this forward progress has a bug: when the next
expected position is a partition-start (another partition), the code
would loop forever, effectively reading all there is from the underlying
reader.
To avoid this, add a special case to ignore the progress guarantee loop
altogether when the next expected position is a partition start. In this
case, progress is guaranteed anyway, because there is exactly one
partition-start fragment in each partition.
Fixes: #13491
Closes #13563
This mini-series cleans up printing of ranges in utils/to_string.hh
It generalizes the helper function to work on a std::ranges::range,
with some exceptions, and adds a helper for boost::transformed_range.
It also changes the internal interface by moving `join` to the utils
namespace and using std::string rather than seastar::sstring.
Additional unit tests were added to test/boost/json_test
Fixes #13146
Closes #13159
* github.com:scylladb/scylladb:
utils: to_string: get rid of utils::join
utils: to_string: get rid of to_string(std::initializer_list)
utils: to_string: get rid of to_string(const Range&)
utils: to_string: generalize range helpers
test: add string_format_test
utils: chunked_vector: add std::ranges::range ctor
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a huge number, huge allocations and
stalls. This is highly undesirable.
This series adds more tests in this area covering additional corner cases,
and then fixes the issue by adding the missing verification where it's
needed. After the series, all 12 tests in test/alternator/test_number.py now pass.
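The limits described above can be sketched as a check over an already-normalized number; this is a hypothetical illustration, not Scylla's actual get_magnitude_and_precision():

```cpp
#include <cassert>
#include <string>

// A number normalized to significant digits plus a power-of-ten
// exponent for the leading digit (the "magnitude").
struct normalized_decimal {
    std::string digits;  // significant digits only, no leading/trailing zeros
    int exponent;        // decimal exponent (magnitude) of the number
};

// DynamoDB's documented limits: exponent in [-130, 125], and at most
// 38 significant decimal digits.
bool within_dynamodb_limits(const normalized_decimal& n) {
    int precision = static_cast<int>(n.digits.size());
    return precision <= 38 && n.exponent >= -130 && n.exponent <= 125;
}
```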
Fixes #6794
Closes #13743
* github.com:scylladb/scylladb:
alternator: unit test for number magnitude and precision function
alternator: add validation of numbers' magnitude and precision
test/alternator: more tests for limits on number precision and magnitude
test/alternator: reproducer for DoS in unlimited-precision addition
* change the "generation" column of sstable registry table from
bigint to uuid
* add a helper to convert UUID back to the original generation
in the long run, we encourage users to use uuid-based generation
identifiers. but in the transition period, both bigint-based and
uuid-based identifiers are used for the generation. so to cater to
both needs, we use a hackish way to store the integer in a UUID. to
differentiate the was-integer UUID from a genuine UUID, we check the
UUID's most_significant_bits. because we only support serializing
UUID v1, if the timestamp in the UUID is zero, we assume the UUID was
generated from an integer when converting it back to a generation
identifier.
also, please note, the only use case of using generation as a
column is the sstable_registry table, but since its schema is fixed,
we cannot store both a bigint and a UUID as the value of its
`generation` column; the simpler way forward is to use a single type
for the generation. to be more efficient and to preserve the type of
the generation, instead of using types like ascii string or bytes,
we will always store the generation as a UUID in this table. if the
generation's identifier is an int64_t, the value of the integer will
be used as the least significant bits of the UUID.
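The encoding round-trip described above can be sketched as follows (toy types and names, not Scylla's actual utils::UUID API; the "timestamp is zero" check is simplified to "most-significant half is zero"):

```cpp
#include <cassert>
#include <cstdint>

// Toy 128-bit UUID as two 64-bit halves.
struct uuid {
    int64_t msb;
    int64_t lsb;
};

// Store an integer generation in the least-significant bits; a zero
// most-significant half marks "this UUID was generated from an integer".
uuid encode_int_generation(int64_t gen) {
    return uuid{0, gen};
}

bool is_int_generation(const uuid& id) {
    return id.msb == 0;   // genuine v1 UUIDs carry a non-zero timestamp
}

int64_t decode_int_generation(const uuid& id) {
    assert(is_int_generation(id));
    return id.lsb;
}
```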
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cql-pytest
framework.
This test file checks various LWT conditional updates which involve
collections or UDTs (there is a separate test file for LWT conditional
updates which do not involve collections, which I haven't translated
yet).
The tests reproduce one known bug:
Refs #5855: lwt: comparing NULL collection with empty value in IF
condition yields incorrect results
And also uncovered three previously-unknown bugs:
Refs #13586: Add support for CONTAINS and CONTAINS KEY in LWT expressions
Refs #13624: Add support for UDT subfields in LWT expression
Refs #13657: Misformatted printout of column name in LWT error message
Beyond those bona-fide bugs, this test also demonstrates several places
where we intentionally deviated from Cassandra's behavior, forcing me
to comment out several checks. These deviations are known, and intentional,
but some of them are undocumented and it's worth listing here the ones
re-discovered by this test:
1. On a successful conditional write, Cassandra returns just True, Scylla
also returns the old contents of the row. This difference is officially
documented in docs/kb/lwt-differences.rst.
2. Scylla allows the test "l = [null]" or "s = {null}" with this weird
null element (the result is false), whereas Cassandra prints an error.
3. Scylla allows "l[null]" or "m[null]" (resulting in null), Cassandra
prints an error.
4. Scylla allows a negative list index, "l[-2]", resulting in null.
Cassandra prints an error in this case.
5. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
6. Scylla allows "IN null" (the condition just fails), Cassandra prints
an error in this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13663
Now when the test case and used lib/utils code is using storage-agnostic
approach, it can be extended to run over S3 storage as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The resharding test compares bloom filter sizes before and after reshard
runs. For that it gets the filter on-disk filename and stat()s it. That
won't work with S3 as it doesn't have accessible on-disk files.
Some time ago there existed the storage::get_stats() method, but now
it's gone. The new s3::client::get_object_stat() is coming, but it will
take time to switch to it. For now, generalize filter size fetching into
a local lambda. Next patch will make a stub in it for S3 case, and once
the get_object_stat() is there we'll be able to smoothly start using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`clustering_key_columns()` returns a range view, and `front()` returns
the reference to its first element. so we cannot assume the availability
of this reference after the expression is evaluated. to address this
issue, let's capture the returned range by value, and keep the first
element by reference.
this also silences warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:30: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ^~~~~~~~~~~~~
/home/kefu/dev/scylladb/db/schema_tables.cc:3654:79: note: the temporary was destroyed at the end of the full expression ‘(& v)->view_ptr::operator->()->schema::clustering_key_columns().boost::iterator_range<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> > >::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::random_access_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::bidirectional_traversal_tag>::<anonymous>.boost::iterator_range_detail::iterator_range_base<__gnu_cxx::__normal_iterator<const column_definition*, std::vector<column_definition> >, boost::iterators::incrementable_traversal_tag>::front()’
3654 | const column_definition& first_view_ck = v->clustering_key_columns().front();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
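The fix pattern, in a self-contained toy (stand-in types; the real code uses schema and boost::iterator_range):

```cpp
#include <cassert>
#include <vector>

// Stand-in for the real schema type: clustering_key_columns() returns
// a lightweight view object over long-lived member storage.
struct schema_like {
    std::vector<int> cols{7, 8, 9};
    struct view {
        const int* b;
        const int* e;
        const int& front() const { return *b; }
    };
    view clustering_key_columns() const {
        return view{cols.data(), cols.data() + cols.size()};
    }
};

int first_ck(const schema_like& s) {
    // the fix described above: keep the returned range in a local by
    // value, so it outlives the reference we take into it
    auto cks = s.clustering_key_columns();
    const int& first_view_ck = cks.front();
    return first_view_ck;
}
```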
Fixes#13720
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13721
The test case in question spawns an async context, then makes the test_env
instance on the stack (and a stopper for it too). There are helpers for the
above steps; better to use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the sstable_datafile test cases are capable of running with S3
storage, so this patch makes the simplest of them do it. Patching the
rest from this file is optional, because mostly the cases test how the
datafile data manipulations work without checking the files
manipulations. So even if making them all run over S3 is possible, it
will just increase the testing time w/o real test of the storage driver.
So this patch makes one test case run over local and S3 storages, more
patches to update more test cases with files manipulations are yet to
come.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach table_for_tests to use any storage options, not just the local one. For
now the only user that passes non-local options is sstables::test_env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the sstable test case wants to run over S3 storage it needs to
specify that in test config by providing the S3 storage options. So
first thing this patch adds is the helper that makes these options based
on the env left by minio launcher from test.py.
Next, in order to make sstables_manager work with S3 it needs the
plugged system keyspace which, in turn, needs query processor, proxy,
database, etc. All this stuff lives in cql_test_env, so the test case
running with S3 options will run in a sstables::test_env nested inside
cql_test_env. The latter would also need to plug its system keyspace to
the former's sstables manager and turn the experimental feature ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We have known for a long time (see issue #1703) that the quality of our
CQL "syntax error" messages leaves a lot to be desired, especially when
compared to Cassandra. This patch doesn't yet bring us great error
messages with great context - doing this isn't easy and it appears that
Antlr3's C++ runtime isn't as good as the Java one in this regard -
but this patch at least fixes **garbage** printed in some error messages.
Specifically, when the parser can deduce that a specific token is missing,
it used to print
line 1:83 missing ')' at '<missing '
After this patch we get rid of the meaningless string '<missing ':
line 1:83 : Missing ')'
Also, when the parser deduced that a specific token was unneeded, it
used to print:
line 1:83 extraneous input ')' expecting <invalid>
Now we got rid of this silly "<invalid>" and write just:
line 1:83 : Unexpected ')'
Refs #1703. I haven't yet marked that issue "fixed" because I think a
complete fix would also require printing the entire misparsed line and the
point of the parse failure. Scylla still prints a generic "Syntax Error"
in most cases now, and although the character number (83 in the above
example) can help, it's much more useful to see the actual failed
statement and where character 83 is.
Unfortunately some tests enshrine buggy error messages and had to be
fixed. Other tests enshrined strange text for a generic unexplained
error message, which used to say " : syntax error..." (note the two
spaces and ellipses) and after this patch is " : Syntax error". So
these tests are changed. Another message, "no viable alternative at
input" is deliberately kept unchanged by this patch so as not to break
many more tests which enshrined this message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #13731
So that it could be set to s3 by the test case on demand. Default is
local storage which uses env's tempdir or explicit path argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The s3_storage leaks the client when the sstable gets destroyed. So far this
went unnoticed, but a debug-mode unit test run over minio captured it. So
here's the fix.
When sstable is destroyed it also kicks the storage to do whatever
cleanup is needed. In case of s3 storage the cleanup is in closing the
on-boarded client. Until #13458 is fixed each sstable has its own
private version of the client and there's no other place where it can be
close()d in a co_await-able manner.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the previous patch we added a limit in Alternator for the magnitude
and precision of numbers, based on a function get_magnitude_and_precision
whose implementation was, unfortunately, rather elaborate and delicate.
Although we did add in the previous patches some end-to-end tests which
confirmed that the final decision made based on this function, to accept or
reject numbers, was a correct decision in a few cases, such an elaborate
function deserves a separate unit test for checking just that function
in isolation. In fact, this unit test uncovered some bugs in the first
implementation of get_magnitude_and_precision() which the other tests
missed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB limits the allowed magnitude and precision of numbers - valid
decimal exponents are between -130 and 125 and up to 38 significant
decimal digits are allowed. In contrast, Scylla uses the CQL "decimal"
type which offers unlimited precision. This can cause two problems:
1. Users might get used to this "unofficial" feature and start relying
on it, not allowing us to switch to a more efficient limited-precision
implementation later.
2. If huge exponents are allowed, e.g., 1e-1000000, summing such a
number with 1.0 will result in a number with a million digits, huge
allocations and stalls. This is highly undesirable.
After this patch, all tests in test/alternator/test_number.py now
pass. The various failing tests which verify magnitude and precision
limitations in different places (key attributes, non-key attributes,
and arithmetic expressions) now pass - so their "xfail" tags are removed.
Fixes #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have xfailing tests for issue #6794 - the missing checks on
precision and magnitudes of numbers in Alternator - but this patch adds
checks for additional corner cases. In particular we check the case that
numbers are used in a *key* column, which goes to a different code path
than numbers used in non-key columns, so it's worth testing as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As already noted in issue #6794, whereas DynamoDB limits the magnitude
of numbers to between 10^-130 and 10^125, Scylla does not. In this patch
we add yet another test for this problem. But unlike previous tests,
which just showed too much magnitude being allowed - which always sounded
like a benign problem - the test in this patch shows that this "feature"
can be used to DoS Scylla: a user can send a short request that
causes arbitrarily-large allocations, stalls and CPU usage.
The test is currently marked "skip" because it can cause Scylla to
take a very long time and/or run out of memory. It passes on DynamoDB
because the excessive magnitude is simply not allowed there.
Refs #6794
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It's unused.
Just in case, add a unit test case for using the fmt library to
format it (that includes fmt::to_string(std::initializer_list)).
Note that the existing to_string implementation
used square brackets to enclose the initializer_list
but the new, standardized form uses curly braces.
This doesn't break anything since to_string(initializer_list)
wasn't used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As seen in https://github.com/scylladb/scylladb/issues/13146
the current implementation is not general enough
to provide print helpers for all kinds of containers.
Modernize the implementation using templates based
on std::ranges::range and using fmt::join.
Extend the unit test to cover formatting different types of ranges,
boost::transformed ranges, and deque.
Fixes #13146
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when altering permissions on a functions resource, we
only check if it's a builtin function and not if it's all functions
in the "system" keyspace, which contains all builtin functions.
This patch adds a check of whether the function resource keyspace
is "system". This check actually covers both "single function"
and "all functions in keyspace" cases, so the additional check
for single functions is removed.
Closes #13596
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
lots of info-level messages to the log and call
`run_offstrategy_compaction`, which wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message at debug
level instead.
Fixes #13466
Also, add a group_id class and return it from compaction_group and table_state.
Use that to identify the compaction_group / table_state by "ks_name.cf_name compaction_group=idx/total" in log messages.
Fixes #13467
Closes #13520
* github.com:scylladb/scylladb:
compaction_manager: print compaction_group id
compaction_group, table_state: add group_id member
compaction_manager: offstrategy compaction: skip compaction if no candidates are found
This reverts commit d85af3dca4. It
restores the linear search algorithm, as we expect the search to
terminate near the origin. In this case linear search is O(1)
while binary search is O(log n).
A comment is added so we don't repeat the mistake.
Closes #13704
No point in going through the vector<mutation> entry-point
just to discover at run time that it was called
with a single-element vector, when we know that
in advance.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #13733
this is part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `cdc::generation_id` and `db_clock::time_point` without the help of `operator<<`.
the formatter of `cdc::generation_id` uses that of `db_clock::time_point` , so these two commits are posted together in a single pull request.
the corresponding `operator<<()` is removed in this change, as all its callers are now using fmtlib for formatting.
Refs #13245
Closes #13703
* github.com:scylladb/scylladb:
db_clock: specialize fmt::formatter<db_clock::time_point>
cdc: generation: specialize fmt::formatter<generation_id>
This series fixes a few issues caused by f1bbf705f9:
- table, compaction_manager: prevent cross shard access to owned_ranges_ptr
- Fixes #13631
- distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
- compaction: make_partition_filter: do not assert shard ownership
- allow the filtering reader now used during resharding to process tokens owned by other shards
Closes #13635
* github.com:scylladb/scylladb:
compaction: make_partition_filter: do not assert shard ownership
distributed_loader: distribute_reshard_jobs: pick one of the sstable shard owners
table, compaction_manager: prevent cross shard access to owned_ranges_ptr
this series silences the warnings from GCC 13. some of these changes are considered critical fixes, and are posted separately.
see also #13243
Closes #13723
* github.com:scylladb/scylladb:
cdc: initialize an optional using its value type
compaction: disambiguate type name
db: schema_tables: drop unused variable
reader_concurrency_semaphore: fix signed/unsigned comparison
locator: topology: disambiguate type names
raft: disambiguate promise name in raft::awaited_conf_changes
when compiling using GCC-13, it warns that:
```
/home/kefu/dev/scylladb/utils/s3/client.cc:224:9: error: stack usage might be 66352 bytes [-Werror=stack-usage=]
224 | sstring parse_multipart_upload_id(sstring& body) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
```
it turns out that `rapidxml::xml_document<>` can be very large, so
let's allocate it on the heap instead of on the stack to address this issue.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13722
`data_dictionary::database::find_keyspace()` returns a temporary
object, and `data_dictionary::keyspace::user_types()` returns a
reference pointing to a member of this temporary object, so we
cannot use the reference after the expression is evaluated. in
this change, we capture the return value of `find_keyspace()` using
a universal reference, and keep the return value of `user_types()`
as a reference, to ensure that we can use it later.
this change silences the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/statements/authorization_statement.cc:68:21: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
68 | const auto& utm = qp.db().find_keyspace(*keyspace).user_types();
| ^~~
```
Fixes#13725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13726
`current_scheduling_group()` returns a temporary value, and `name()`
returns a reference, so we cannot capture the return value by reference
and use the reference after this expression is evaluated. this would
cause undefined behavior. so let's just capture it by value.
this change also silences the following warning from GCC-13:
```
/home/kefu/dev/scylladb/transport/server.cc:204:11: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
204 | auto& cur_sg_name = current_scheduling_group().name();
| ^~~~~~~~~~~
/home/kefu/dev/scylladb/transport/server.cc:204:56: note: the temporary was destroyed at the end of the full expression ‘seastar::current_scheduling_group().seastar::scheduling_group::name()’
204 | auto& cur_sg_name = current_scheduling_group().name();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
```
Fixes#13719
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13724
`data_dictionary::database::find_column_family()` returns a temporary value,
and `data_dictionary::table::get_index_manager()` returns a reference into
this temporary value, so we cannot capture this reference and use it after
the expression is evaluated. in this change, we keep the return value
of `find_column_family()` by value, to extend the lifetime of the return
value of `get_index_manager()`.
this should address the warning from GCC-13, like:
```
/home/kefu/dev/scylladb/cql3/restrictions/statement_restrictions.cc:519:15: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
519 | auto& sim = db.find_column_family(_schema).get_index_manager();
| ^~~
```
Fixes#13727
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13728
Let's remove `expr::token` and replace all of its functionality with `expr::function_call`.
`expr::token` is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`, this will be internally represented as an expression which uses `expr::token` to represent the `token(p1, p2)` part.
The situation with `expr::token` is a bit complicated.
On the one hand it's supposed to represent the partition token, but sometimes it's also assumed that it can represent a generic call to the `token()` function, for example `token(1, 2, 3)` could be a `function_call`, but it could also be `expr::token`.
The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this, allowing `token(1, 2, 3)` to be represented as `expr::token` is dangerous - the query planning might think that it is `token(p1, p2, p3)` and plan the query based on this, which would be wrong.
Currently `expr::token` is created only in one specific case.
When the parser detects that the user typed in a restriction which has a call to `token` on the LHS it generates `expr::token`.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token, it stays a `function_call`. During preparation there is no check to see if a `function_call` to `token` could be turned into `expr::token`. This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented as `expr::token` and the query planner handles that, but sometimes it might be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of code duplication between a `function_call` and `expr::token`.
All of the evaluation and preparation is the same for `expr::token` as it's for a `function_call` to the token function.
Currently it's impossible to evaluate `expr::token` and preparation has some flaws, but implementing it would basically consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries.
With `expr::token` we turn a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple tables? The schema is different, so something that is represented as `expr::token` for one schema would be represented as `function_call` in the context of a different schema.
Translating expressions to different tables would require careful manipulation to convert `expr::token` to `function_call` and vice versa. This could cause trouble for index queries.
Overall I think it would be best to remove `expr::token`.
Although having a clear marker for the partition token is sometimes nice for query planning, in my opinion the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things; having two separate representations of the same thing without clear boundaries between them causes trouble.
Instead of having both `expr::token` and `function_call` we can just have the `function_call` and check if it represents a partition token when needed.
Refs: #12906
Refs: #12677
Closes: #12905
Closes #13480
* github.com:scylladb/scylladb:
cql3: remove expr::token
cql3: keep a schema in visitor for extract_clustering_prefix_restrictions
cql3: keep a schema inside the visitor for extract_partition_range
cql3/prepare_expr: make get_lhs_receiver handle any function_call
cql3/expr: properly print token function_call
expr_test: use unresolved_identifier when creating token
cql3/expr: split possible_lhs_values into column and token variants
cql3/expr: fix error message in possible_lhs_values
cql3: expr: reimplement is_satisfied_by() in terms of evaluate()
cql3/expr: add a schema argument to expr::replace_token
cql3/expr: add a comment for expr::has_partition_token
cql3/expr: add a schema argument to expr::has_token
cql3: use statement_restrictions::has_token_restrictions() wherever possible
cql3/expr: add expr::is_partition_token_for_schema
cql3/expr: add expr::is_token_function
cql3/expr: implement preparing function_call without a receiver
cql3/functions: make column family argument optional in functions::get
cql3/expr: make it possible to prepare expr::constant
cql3/expr: implement test_assignment for column_value
cql3/expr: implement test_assignment for expr::constant
We change the meaning and name of `replication_state`: previously it was meant
to describe the "state of tokens" of a specific node; now it describes the
topology as a whole - the current step in the 'topology saga'. It was moved
from `ring_slice` into `topology`, renamed into `transition_state`, and the
topology coordinator code was modified to switch on it first instead of node
state - because there may be no single transitioning node, but the topology
itself may be transitioning.
This PR was extracted from #13683, it contains only the part which refactors
the infrastructure to prepare for non-node specific topology transitions.
Closes #13690
* github.com:scylladb/scylladb:
raft topology: rename `update_replica_state` -> `update_topology_state`
raft topology: remove `transition_state::normal`
raft topology: switch on `transition_state` first
raft topology: `handle_ring_transition`: rename `res` to `exec_command_res`
raft topology: parse replaced node in `exec_global_command`
raft topology: extract `cleanup_group0_config_if_needed` from `get_node_to_work_on`
storage_service: extract raft topology coordinator fiber to separate class
raft topology: rename `replication_state` to `transition_state`
raft topology: make `replication_state` a topology-global state
this syntax is not supported by the standard. it seems clang
just silently constructs the value from the initializer list and
calls operator=, but GCC complains:
```
/home/kefu/dev/scylladb/cdc/split.cc:392:54: error: converting to ‘std::optional<partition_deletion>’ from initializer list would use explicit constructor ‘constexpr std::optional<_Tp>::optional(_Up&&) [with _Up = const tombstone&; typename std::enable_if<__and_v<std::__not_<std::is_same<std::optional<_Tp>, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::__not_<std::is_same<std::in_place_t, typename std::remove_cv<typename std::remove_reference<_Iter>::type>::type> >, std::is_constructible<_Tp, _Up>, std::__not_<std::is_convertible<_Iter, _Iterator> > >, bool>::type <anonymous> = false; _Tp = partition_deletion]’
392 | _result[t.timestamp].partition_deletions = {t};
| ^
```
to silence the error, and to be more standard-compliant,
let's use emplace() instead.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Let's remove expr::token and replace all of its functionality with expr::function_call.
expr::token is a struct whose job is to represent a partition key token.
The idea is that when the user types in `token(p1, p2) < 1234`,
this will be internally represented as an expression which uses
expr::token to represent the `token(p1, p2)` part.
The situation with expr::token is a bit complicated.
On the one hand it's supposed to represent the partition token,
but sometimes it's also assumed that it can represent a generic
call to the token() function, for example `token(1, 2, 3)` could
be a function_call, but it could also be expr::token.
The query planning code assumes that each occurrence of expr::token
represents the partition token without checking the arguments.
Because of this, allowing `token(1, 2, 3)` to be represented
as expr::token is dangerous - the query planning
might think that it is `token(p1, p2, p3)` and plan the query
based on this, which would be wrong.
Currently expr::token is created only in one specific case.
When the parser detects that the user typed in a restriction
which has a call to `token` on the LHS it generates expr::token.
In all other cases it generates an `expr::function_call`.
Even when the `function_call` represents a valid partition token,
it stays a `function_call`. During preparation there is no check
to see if a `function_call` to `token` could be turned into `expr::token`.
This is a bit inconsistent - sometimes `token(p1, p2, p3)` is represented
as `expr::token` and the query planner handles that, but sometimes it might
be represented as `function_call`, which the query planner doesn't handle.
There is also a problem because there's a lot of duplication
between a `function_call` and `expr::token`. All of the evaluation
and preparation is the same for `expr::token` as it's for a `function_call`
to the token function. Currently it's impossible to evaluate `expr::token`
and preparation has some flaws, but implementing it would basically
consist of copy-pasting the corresponding code from token `function_call`.
One more aspect is multi-table queries. With `expr::token` we turn
a call to the `token()` function into a struct that is schema-specific.
What happens when a single expression is used to make queries to multiple
tables? The schema is different, so something that is represented
as `expr::token` for one schema would be represented as `function_call`
in the context of a different schema.
Translating expressions to different tables would require careful
manipulation to convert `expr::token` to `function_call` and vice versa.
This could cause trouble for index queries.
Overall I think it would be best to remove expr::token.
Although having a clear marker for the partition token
is sometimes nice for query planning, in my opinion
the pros are outweighed by the cons.
I'm a big fan of having a single way to represent things,
having two separate representations of the same thing
without clear boundaries between them causes trouble.
Instead of having expr::token and function_call we can
just have the function_call and check if it represents
a partition token when needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The schema will be needed once we remove expr::token
and switch to using expr::is_partition_token_for_schema,
which requires a schema argument.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
get_lhs_receiver looks at the prepared LHS of a binary operator
and creates a receiver corresponding to this LHS expression.
This receiver is later used to prepare the RHS of the binary operator.
It's able to handle a few expression types - the ones that are currently
allowed to be on the LHS.
One of those types is `expr::token`, to handle restrictions like `token(p1, p2) = 3`.
Soon token will be replaced by `expr::function_call`, so the function will need
to handle `function_calls` to the token function.
Although we expect there to be only calls to the `token()` function,
as other functions are not allowed on the LHS, it can be made generic
over all function calls, which will help in future grammar extensions.
The function calls it can currently get are calls to the token function,
but they're not validated yet, so it could also be something like `token(pk, pk, ck)`.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Printing for function_call is a bit strange.
When printing an unprepared function it prints
the name and then the arguments.
For a prepared function it prints <anonymous function>
as the name and then the arguments.
Prepared functions have a name() method, but printing
doesn't use it - maybe not all functions have a valid name(?).
The token() function will soon be represented as a function_call
and it should be printable in a user-readable way.
Let's add an if which prints `token(arg1, arg2)`
instead of `<anonymous function>(arg1, arg2)` when printing
a call to the token function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
One test for expr::token uses a raw column identifier.
Let's change it to unresolved_identifier, which is
a standard representation of unresolved column
names in expressions.
Once expr::token is removed it will be possible
to create a function_call with unresolved_identifiers
as arguments.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
possible_lhs_values takes an expression and a column
and finds all possible values for the column that make
the expression true.
Apart from finding column values it's also capable of finding
all matching values for the partition key token.
When a nullptr column is passed, possible_lhs_values switches
into token values mode and finds all values for the token.
This interface isn't ideal.
It's confusing to pass a nullptr column when one wants to
find values for the token. It would be better to have a flag,
or just have a separate function.
Additionally in the future expr::token will be removed
and we will use expr::is_partition_token_for_schema
to find all occurrences of the partition token.
expr::is_partition_token_for_schema takes a schema
as an argument, which possible_lhs_values doesn't have,
so it would have to be extended to get the schema from
somewhere.
To fix these two problems let's split possible_lhs_values
into two functions - one that finds possible values for a column,
which doesn't require a schema, and one that finds possible values
for the partition token and requires a schema:
value_set possible_column_values(const column_definition* col, const expression& e, const query_options& options);
value_set possible_partition_token_values(const expression& e, const query_options& options, const schema& table_schema);
This will make the interface cleaner and enable smooth transition
once expr::token is removed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In possible_lhs_values there was a message talking
about is_satisfied_by. It looks like a badly
copy-pasted message.
Change it to possible_lhs_values as it should be.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Just like has_token, replace_token will use
expr::is_partition_token_for_schema to find all instances
of the partition token to replace.
Let's prepare for this change by adding a schema argument
to the function before making the big change.
It's unused at the moment, but having a separate commit
should make it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In the future expr::token will be removed and checking
whether there is a partition token inside an expression
will be done using expr::is_partition_token_for_schema.
This function takes a schema as an argument,
so all functions that will call it also need
to get the schema from somewhere.
Right now it's an unused argument, but in the future
it will be used. Adding it in a separate commit
makes it easier to review.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The statement_restrictions class has a method called has_token_restriction().
This method checks whether the partition key restrictions contain expr::token.
Let's use this function in all applicable places instead of manually calling has_token().
In the future has_token() will have an additional schema argument,
so eliminating calls to has_token() will simplify the transition.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function to check whether the expression
represents a partition token - that is, a call
to the token function with consecutive partition
key columns as the arguments.
For example for `token(p1, p2, p3)` this function
would return `true`, but for `token(1, 2, 3)` or `token(p3, p2, p1)`
the result would be `false`.
The function has a schema argument because a schema is required
to get the list of partition columns that should be passed as
arguments to token().
Maybe it would be possible to infer the schema from the information
given earlier during prepare_expression, but it would be complicated
and a bit dangerous to do this. Sometimes we operate on multiple tables
and the schema is needed to differentiate between them - a token() call
can represent the base table's partition token, but for an index table
this is just a normal function call, not the partition token.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add a function that can be used to check
whether a given expression represents a call
to the token() function.
Note that a call to token() doesn't mean
that the expression represents a partition
token - it could be something like token(1, 2, 3),
just a normal function_call.
The code for checking has been taken from functions::get.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Currently trying to do prepare_expression(function_call)
with a nullptr receiver fails.
It should be possible to prepare function calls without
a known receiver.
When the user types in: `token(1, 2, 3)`
the code should be able to figure out that
they are looking for a function with name `token`,
which takes 3 integers as arguments.
In order to support that we need to prepare
all arguments that can be prepared before
attempting to find a function.
Prepared expressions have a known type,
which helps to find the right function
for the given arguments.
Additionally the current code for finding
a function requires all arguments to be
assignment_testable, which requires preparing
some expression types, e.g. column_values.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
The method `functions::get` is used to get the `functions::function` object
of the CQL function called using `expr::function_call`.
Until now `functions::get` required the caller to pass both the keyspace
and the column family.
The keyspace argument is always needed, as every CQL function belongs
to some keyspace, but the column family isn't used in most cases.
The only case where having the column family is really required
is the `token()` function. Each variant of the `token()` function
belongs to some table, as the arguments to the function are the
consecutive partition key columns.
Let's make the column family argument optional. In most cases
the function will work without information about the column family.
In the case of the `token()` function there will be a check
and it will throw an exception if the argument is nullopt.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this also silences the warning from GCC-13:
```
/home/kefu/dev/scylladb/db/schema_tables.cc:1489:10: error: variable ‘ts’ set but not used [-Werror=unused-but-set-variable]
1489 | auto ts = db_clock::now();
| ^~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
a signed/unsigned comparison can overflow, and GCC-13 rightly points
this out. so let's use `std::cmp_greater_equal()` when comparing
unsigned and signed for greater-or-equal.
```
/home/kefu/dev/scylladb/reader_concurrency_semaphore.cc:931:76: error: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
931 | if (_resources.memory <= 0 && (consumed_resources().memory + r.memory) >= get_kill_limit()) [[unlikely]] {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change silences the `-Wmissing-braces` warning from
clang. in general, we can initialize an object without a constructor
using braces; this is called aggregate initialization.
the standard does allow us to initialize each element using
either copy-initialization or direct-initialization, but in our case
neither of them applies, so clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change, we add the suggested braces around each subobject.
also, take the opportunity to use structured bindings to simplify the
related code.
Closes #13705
* github.com:scylladb/scylladb:
build: reenable -Wmissing-braces
treewide: add braces around subobject
cql3/stats: use zero-initialization
S3 wasn't providing the filter size and accurate sizes for all SSTable components on disk.
First, the filter size is provided by taking advantage of the fact that its in-memory representation is roughly the same as the on-disk one.
Second, the size of all components is provided by piggybacking on the sstable parser and writer, so there is no longer a need to do a separate additional step after Scylla has either parsed or written all components.
Finally, sstable::storage::get_stats() is killed, so the burden is no longer pushed onto the storage type implementation.
Refs #13649.
Closes #13682
* github.com:scylladb/scylladb:
test: Verify correctness of sstable::bytes_on_disk()
sstable: Piggyback on sstable parser and writer to provide bytes_on_disk
sstable: restore indentation in read_digest() and read_checksum()
sstable: make all parsing of simple components go through do_read_simple()
sstable: Add missing pragma once to random_access_reader.hh
sstable: make all writing of simple components go through do_write_simple()
test: sstable_utils: reuse set_values()
sstable: Restore indentation in read_simple()
sstable: Coroutinize read_simple()
sstable: Use filter memory footprint in filter_size()
so we can apply `execute_cql()` on `generation_type` directly without
extracting its value using `generation.value()`. this paves the way to
adding a UUID based generation id to `generation_type`. by then, we
will have both UUID based and integer based `generation_type`, so
`generation_type::value()` will not be able to represent its value
anymore, and this method will be replaced by `operator data_value()` in
this use case.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change prepares for the change to use `variant<UUID, int64_t>`
as the value of `generation_type`. as after this change, the "value"
of a generation would be a UUID or an integer, and we don't want to
expose the variant in generation's public interface. so the `value()`
method would be changed or removed by then.
this change takes advantage of the fact that the formatter of
`generation_type` always prints its value. also, it's better to
reuse `generation_type` formatter when appropriate.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
try_prepare_expression(constant) used to throw an error
when trying to prepare expr::constant.
It would be useful to be able to do this
and it's not hard to implement.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Make it possible to do test_assignment for column_values.
It's implemented using the generic expression assignment
testing function.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
test_assignment checks whether a value of some type
can be assigned to a value of different type.
There is no implementation of test_assignment
for expr::constant, but I would like to have one.
Currently there is a custom implementation
of test_assignment for each type of expression,
but generally each of them boils down to checking:
```
type1->is_value_compatible_with(type2)
```
Instead of implementing another type-specific function
I added expression_test_assignment and used it to
implement test_assignment for constant.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
this change helps to silence the `-Wmissing-braces` warning from
clang. in general, we can initialize an object without a constructor
using braces; this is called aggregate initialization.
the standard also allows us to initialize each element using
either copy-initialization or direct-initialization, but in our case,
neither of them applies, so clang warns like
```
suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
options.elements.push_back({bytes(k.begin(), k.end()), bytes(v.begin(), v.end())});
^~~~~~~~~~~~~~~~~~~~~~~~~
{ }
```
in this change, braces are added around the initialization of
subobjects. also, take the opportunity to use structured bindings to
simplify the related code.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
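The warning and its fix can be reproduced with a minimal standalone snippet (the types here are illustrative, not taken from the tree):

```cpp
#include <cassert>

// `line` aggregates two `point` subobjects. Initializing it with a flat
// list -- `line l = {1, 2, 3, 4};` -- compiles, but clang suggests braces
// around each subobject under -Wmissing-braces. The nested braces below
// make the aggregate structure explicit and silence the warning.
struct point { int x; int y; };
struct line  { point a; point b; };

constexpr line l = {{1, 2}, {3, 4}};
```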
use {} instead of {0ul} for zero initialization. as `_query_cnt`
is a multi-dimensional array, each element in `_query_cnt` is yet
another array, so we cannot initialize it with `{0ul}`. but
to zero-initialize this array, we can just use `{}`, as per
https://en.cppreference.com/w/cpp/language/zero_initialization
> If T is array type, each element is zero-initialized.
so this should recursively zero-initialize all arrays in `_query_cnt`.
this change should silence the following warning:
stats.hh:88:60: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
[statements::statement_type::MAX_VALUE + 1] = {0ul};
^~~
{ }
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
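As a sketch, the same pattern on a hypothetical two-dimensional counter array (the names below are illustrative, not the real `_query_cnt`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// {} value-initializes the outer array, and zero-initialization recurses
// into every inner array element; {0ul} would only initialize the first
// subobject and trips -Wmissing-braces.
constexpr std::size_t kSources = 2;
constexpr std::size_t kStatements = 5;
uint64_t query_cnt[kSources][kStatements] = {};
```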
Closes#13691
* github.com:scylladb/scylladb:
adding documentation for integration with MindsDB
adding documentation for integration with MindsDB
this series syncs the CMake building system with `configure.py`, which was updated to introduce the tablets feature. also, this series includes a couple of cleanups.
Closes#13699
* github.com:scylladb/scylladb:
build: cmake: remove dead code
build: move test-perf down to test/perf
build: cmake: pick up tablets related changes
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `db_clock::time_point` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `generation_id` without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loading cores from Scylla executables installed in a non-standard
location can cause gdb to fail reading required libraries.
This is an example of a warning I got after trying to load a core
generated by a dtest jenkins job (using ./scripts/open-coredump.sh):
> warning: Can't open file /jenkins/workspace/scylla-master/dtest-daily-debug/scylla/.ccm/scylla-repository/0d64f327e1af9bcbb711ee217eda6df16e517c42/libreloc/libboost_system.so.1.78.0 during file-backed mapping note processing
Invocations of `scylla threads` command ended with an error:
> (gdb) scylla threads
> Python Exception <class 'gdb.error'>: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
> Error occurred in Python: Cannot find thread-local storage for LWP 2758, executable file (...)/scylla-debug-unstripped-5.3.0~dev-0.20230121.0d64f327e1af.x86_64/scylla/libexec/scylla:
> Cannot find thread-local variables on this target
An easy fix for this is to set solib-search-path to
/opt/scylladb/libreloc/.
This commit adds that set command to the suggested gdb command line
arguments. It's a good idea to always suggest setting
solib-search-path to that path, as it can save other people from
wasting time on figuring out why coredump opening does not work.
Closes#13696
the removed CMake script was designed to cater to the needs of the case
where Seastar's CMake script is not included in the parent project, but
this part was never tested and is dysfunctional, as the `target_sources()`
call misses the target parameter. we can add it back when it is actually needed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
We may catch exceptions that are not `marshal_exception`.
Print std::current_exception() in this case to provide
some context about the marshalling error.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#13693
Consider
- n1, n2, n3
- n3 is down
- n4 replaces n3 with the same ip address 127.0.0.3
- Inside the storage_service::handle_state_normal callback for 127.0.0.3 on n1/n2
```
auto host_id = _gossiper.get_host_id(endpoint);
auto existing = tmptr->get_endpoint_for_host_id(host_id);
```
host_id = new host id
existing = empty
As a result, del_replacing_endpoint() will not be called.
This means 127.0.0.3 will not be removed as a pending node on n1 and n2 when
replacing is done. This is wrong.
This is a regression since commit 9942c60d93
(storage_service: do not inherit the host_id of a replaced a node), where
the replacing node uses a different host id than the node being replaced.
To fix, call del_replacing_endpoint() when a node becomes NORMAL and existing
is empty.
Before:
n1:
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[cd1f187a-0eee-4b04-91a9-905ecc499cfc]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
storage_service - Set host_id=6f9ba4e8-9457-4c76-8e2a-e2be257fe123 to be owned by node=127.0.0.3
After:
n1:
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Added replacing_node=127.0.0.3 to replace existing_node=127.0.0.3, coordinator=127.0.0.3
token_metadata - Added node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - replace[28191ea6-d43b-3168-ab01-c7e7736021aa]: Marked ops done from coordinator=127.0.0.3
storage_service - Node 127.0.0.3 state jump to normal
token_metadata - Removed node 127.0.0.3 as pending replacing endpoint which replaces existing node 127.0.0.3
storage_service - Set host_id=72219180-e3d1-4752-b644-5c896e4c2fed to be owned by node=127.0.0.3
Tests: https://github.com/scylladb/scylla-dtest/pull/3126
Closes#13677
bytes_on_disk is the sum of all sstable components.
As read_simple() fetches the file size before parsing the component,
bytes_on_disk can be added incrementally rather than an additional
step after all components were already parsed.
Likewise, write_simple() tracks the offset for each new component,
and therefore bytes_on_disk can also be added incrementally.
This simplifies s3 life as it no longer has to care about feeding
bytes_on_disk, which is currently limited to data and index
sizes only.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all parsing of simple components going through do_read_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With all writing of simple components going through do_write_simple(),
common infrastructure can be reused (exception handling, debug logging,
etc), and also statistics spanning all components can be easily added.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For S3, filter size is currently set to zero, as we want to avoid
"fstat-ing" each file.
The on-disk representation of the bloom filter is similar to the in-memory
one, therefore let's use the memory footprint in filter_size().
The user of filter_size() is the API implementing "nodetool cfstats", and
it cares about the size of the bloom filter data (that's how it's
described).
This way, we provide the filter data size regardless of the
underlying storage type.
Refs #13649.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty. The change was made in order to maintain consistency with the
codebase and to add completeness to the empty check. This change has no
impact on other parts of the codebase.
Closes#13656
The new name is more generic and appropriate for topology transitions
which don't affect any specific replica but the entire cluster as a
whole (which we'll introduce later).
Also take `guard` directly instead of `node_to_work_on` in this more
generic function. Since we want `node_to_work_on` to die when we steal
its guard, introduce `take_guard` which takes ownership of the object
and returns the guard.
Previously the code assumed that there was always a 'node to work on' (a
node which wants to change its state) or there was no work to do at all.
It would find such a node, switch on its state (e.g. check if it's
bootstrapping), and in some states switch on the topology
`transition_state` (e.g. check if it's `write_both_read_old`).
We want to introduce transitions that are not node-specific and can work
even when all nodes are 'normal' (so there's no 'node to work on'). As a
first step, we refactor the code so it switches on `transition_state`
first. In some of these states, like `write_both_read_old`, there must
be a 'node to work on' for the state to make sense; but later in some
states it will be optional (such as `commit_cdc_generation`).
The lambdas defined inside the fiber are now methods of this class.
Currently `handle_node_transition` calls `handle_ring_transition`;
in a later commit we will reverse this: `handle_ring_transition` will
call `handle_node_transition`. We won't have to shuffle the functions
around because they are members of the same class, making the change
easier to review. In general, the code will be easier to maintain in
this new form (no need to deal with so many lambda captures etc.)
Also break up some lines which exceeded the 120 character limit (as per
Seastar coding guidelines).
if the visitor clauses are the same, we can just use the generic version
of it by specifying the parameter with `auto&`. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13626
The new name is more generic - it describes the current step of a
'topology saga' (a sequence of steps used to implement a larger topology
operation such as bootstrap).
Previously it was part of `ring_slice`, belonging to a specific node.
This commit moves it into `topology`, making it a cluster-global
property.
The `replication_state` column in `system.topology` is now `static`.
This will allow us to easily introduce topology transition states that
do not refer to any specific node. `commit_cdc_generation` will be such
a state, allowing us to commit a new CDC generation even though all
nodes are normal (none are transitioning). One could argue that the
other states are conceptually already cluster-global: for example,
`write_both_read_new` doesn't affect only the tokens of a bootstrapping
(or decommissioning etc.) node; it affects replica sets of other tokens
as well (with RFs greater than 1).
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side: divide the resources of a replica shard into tablets, with the goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the groundwork for them.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy,
they just alter the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, the compiler generates operator!=() if the corresponding
operator==() is already defined, as the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`!operator==()`, which matches the behavior of the
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the generated operator!=, C++20 also brings us
the defaulted operator==() -- the compiler is able to generate
operator==() as a member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if an operator==() is implemented as
a lexicographical comparison of all member variables of the
class/struct in question, it is replaced with the generated
one by removing its body and marking the function as
`default`. moreover, if the class happens to have other comparison
operators which are implemented using lexicographical comparison,
the defaulted `operator<=>` is used in place of
the defaulted `operator==`.
sometimes, we failed to mark operator== with the `const`
specifier; in this change, to fulfill the requirements of the C++
standard, and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but this is not always the case: in the
`version` class, we used `version` as the parameter type. to
fulfill the requirements of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantics of the comparison operator, and is a more idiomatic
way to pass a non-trivial struct as a function parameter.
please note, because in C++20 both operator== and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric forms of another variant; if they were
not removed, the compiler would, for instance, find an ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13687
Common compression libraries work on contiguous buffers.
Contiguous buffers are a problem for the allocator. However, as long as they are short-lived,
we can avoid the expensive allocations by reusing buffers across tasks.
This idea is already applied to the compression of CQL frames, but with some deficiencies.
`utils: redesign reusable_buffer` attempts to improve upon it in a few ways. See its commit message for an extended discussion.
Compression buffer reuse also happens in the zstd SSTable compressor, but the implementation is misguided. Every `zstd_processor` instance reuses a buffer, but each instance has its own buffer. This is very bad, because a healthy database might have thousands of concurrent instances (because there is one for each sstable reader). Together, the buffers might require gigabytes of memory, and the reuse actually *increases* memory pressure significantly, instead of reducing it.
`zstd: share buffers between compressor instances` aims to improve that by letting a single buffer be shared across all instances on a shard.
Closes#13324
* github.com:scylladb/scylladb:
zstd: share buffers between compressor instances
utils: redesign reusable_buffer
This commit moves the Glossary page to the Reference
section. In addition, it adds the redirection so that
there are no broken links because of this change
and fixes a link to a subsection of Glossary.
Closes#13664
The zstd implementation of `compressor` has a separate decompression and
compression context per instance. This is unreasonably wasteful. One
decompression buffer and one compression buffer *per shard* is enough.
The waste is significant. There might exist thousands of SSTable readers, each
containing its own instance of `compressor` with several hundred KiB worth of
unneeded buffers. This adds up to gigabytes of wasted memory and gigapascals
of allocator pressure.
This patch modifies the implementation of zstd_processor so that all its
instances on the shard share their contexts.
Fixes#11733
Large contiguous buffers put large pressure on the allocator
and are a common source of reactor stalls. Therefore, Scylla avoids
their use, replacing it with fragmented buffers whenever possible.
However, the use of large contiguous buffers is impossible to avoid
when dealing with some external libraries (i.e. some compression
libraries, like LZ4).
Fortunately, calls to external libraries are synchronous, so we can
minimize the allocator impact by reusing a single buffer between calls.
An implementation of such a reusable buffer has two conflicting goals:
to allocate as rarely as possible, and to waste as little memory as
possible. The bigger the buffer, the more likely that it will be able
to handle future requests without reallocation, but also the more
memory it ties up.
If request sizes are repetitive, the near-optimal solution is to
simply resize the buffer up to match the biggest seen request,
and never resize down.
However, if we anticipate pathologically large requests, which are
caused by an application/configuration bug and are never repeated
again after they are fixed, we might want to resize down after such
pathological requests stop, so that the memory they took isn't tied
up forever.
The current implementation of reusable buffers handles this by
resizing down to 0 every 100'000 requests.
This patch attempts to solve a few shortcomings of the current
implementation.
1. Resizing to 0 is too aggressive. During regular operation, we will
surely need to resize it back to the previous size again. If something
is allocated in the hole left by the old buffer, this might cause
a stall. We prefer to resize down only after pathological requests.
2. When resizing, the current implementation allocates the new buffer
before freeing the old one. This increases allocator pressure for no
reason.
3. When resizing up, the buffer is resized to exactly the requested
size. That is, if the current size is 1MiB, following requests
of 1MiB+1B and 1MiB+2B will both cause a resize.
It's preferable to limit the set of possible sizes so that every
reset doesn't tend to cause multiple resizes of almost the same size.
The natural set of sizes is powers of 2, because that's what the
underlying buddy allocator uses. No waste is caused by rounding up
the allocation to a power of 2.
4. The interval of 100'000 uses is both too low and too arbitrary.
This is up for discussion, but I think that it's preferable to base
the dynamics of the buffer on time, rather than the number of uses.
It's more predictable to humans.
The implementation proposed in this patch addresses these as follows:
1. Instead of resizing down to 0, we resize to the biggest size
seen in the last period.
As long as at least one maximal (up to a power of 2) "normal" request
appears each period, the buffer will never have to be resized.
2. The capacity of the buffer is always rounded up to the nearest
power of 2.
3. The resize down period is no longer measured in number of requests
but in real time.
Additionally, since a shared buffer in asynchronous code is quite a
footgun, some rudimentary refcounting is added to assert that only
one reference to the buffer exists at a time, and that the buffer isn't
downsized while a reference to it exists.
Fixes#13437
Fixes https://github.com/scylladb/scylladb/issues/13578
Now that the documentation is versioned, we can remove
the .. versionadded:: and .. versionchanged:: information
(especially that the latter is hard to maintain and now
outdated), as well as the outdated information about
experimental features in very old releases.
This commit removes that information and nothing else.
Closes#13680
std::rel_ops was deprecated in C++20, as C++20 provides a better solution for defining comparison operators, and all the use cases previously addressed by `using namespace std::rel_ops` have been addressed either by `operator<=>` or the generated `operator!=`.
so, in this series, to avoid using deprecated facilities, let's drop all these `using namespace std::rel_ops`. there are many more cases where we could use either `operator<=>` or the generated `operator!=` to simplify the implementation, but here we care mostly about `std::rel_ops`; we will drop most (if not all) of the explicitly defined `operator!=` and other comparison operators later.
Closes#13676
* github.com:scylladb/scylladb:
treewide: do not use std::rel_ops
dht: token: s/tri_compare/operator<=>/
Currently, the `reset_to()` implementation calls `consume(new_amount)` (if
not zero), then calls `signal(old_amount)`. This means that even if
`reset_to()` is a net reduction in the amount of resources, there is a
call to `consume()` which can now potentially throw.
Add a special case for when the new amount of resources is strictly
smaller than the old amount. In this case, just call `signal()` with the
difference. This not only avoids a potential `std::bad_alloc`, but also
helps relieve memory pressure when it is most needed, by not failing
calls that release memory.
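As a sketch of that special case (toy types below, not seastar's actual semaphore API):

```cpp
#include <cassert>
#include <cstddef>

// Toy semaphore: consume() may allocate (and throw) in the real code,
// while signal() is noexcept and can never fail.
struct toy_semaphore {
    std::size_t consumed = 0;
    void consume(std::size_t n) { consumed += n; }
    void signal(std::size_t n) noexcept { consumed -= n; }
};

// reset_to() with the special case: a net reduction goes through the
// non-throwing signal() path instead of consume() followed by signal().
struct resource_tracker {
    toy_semaphore sem;
    std::size_t amount = 0;
    void reset_to(std::size_t new_amount) {
        if (new_amount < amount) {
            sem.signal(amount - new_amount);   // releasing can't fail
        } else if (new_amount > amount) {
            sem.consume(new_amount - amount);  // may throw std::bad_alloc
        }
        amount = new_amount;
    }
};
```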
Into reset_to() and reset_to_zero(). The latter replaces `reset()` with
the default 0 resources argument, which was often called from noexcept
contexts. Splitting it out from `reset()` allows for a specialized
implementation that is guaranteed to be `noexcept` indeed, and thus
gives peace of mind.
This API is dangerous: all resource consumption should happen via RAII
objects that guarantee that all consumed resources are appropriately
released.
At this point, said API is just a low-level building block for
higher-level, RAII objects. To ensure nobody thinks of using it for
other purposes, make it private and make external users friends instead.
Fix two issues with the replace operation introduced by recent PRs.
Add a test which performs a sequence of basic topology operations (bootstrap,
decommission, removenode, replace) in a new suite that enables the `raft`
experimental feature (so that the new topology change coordinator code is used).
Fixes: #13651
Closes#13655
* github.com:scylladb/scylladb:
test: new suite for testing raft-based topology
test: remove topology_custom/test_custom.py
raft topology: don't require new CDC generation UUID to always be present
raft topology: include shard_count/ignore_msb during replace
std::rel_ops was deprecated in C++20, as C++20 provides a better
solution for defining comparison operators, and all the use cases
previously addressed by `using namespace std::rel_ops` have
been addressed either by `operator<=>` or the generated
`operator!=`.
so, in this change, to avoid using deprecated facilities, let's
drop all these `using namespace std::rel_ops`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
now that C++20 is able to generate the default comparison
operators for us, there is no need to define them manually. also,
`std::rel_ops::*` is deprecated in C++20.
also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `range_tombstone_list` and `range_tombstone_entry`
without the help of `operator<<`.
the corresponding `operator<<()` for `range_tombstone_entry` is moved
into the test where it is used, and the other one is dropped in this
change, as all its callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13627
there are two variants of `query_processor::for_each_cql_result()`;
both of them perform pagination of the results returned by a CQL
statement. the one which accepts a function returning an immediate
(non-future) value is not used now, so let's drop it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13675
Now the gossiper doesn't need those two as its dependencies, so they can
be removed, making the code shorter and the dependency graph simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
That's, in fact, an independent change, because the feature enabler doesn't
need this method. So this patch is a "while at it" thing, but on the
other hand it ditches one more qctx usage.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All callers now have the system keyspace instance at hand.
Unfortunately, this de-static doesn't allow more qctx drops, because
both methods use set_|get_scylla_local_param helpers that do use qctx
and are still in use by other static methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This code belongs to the feature service; the system keyspace shouldn't be
aware of any peculiarities of startup feature enabling, only of loading
and saving the feature lists.
For now the move happens only in terms of code declarations, the
implementation is kept in its old place to reduce the patch churn.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question is only called by the enabler and is short enough
to be merged into the caller. This kills two birds with one stone --
makes fewer walks over the features list and will make it possible to
de-static the system keyspace features load and save methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just moving the code. This makes the gossiper not
mess with enabling/persisting features, but just gossip them around.
The feature intersection code is still in the gossiper, but can be moved
to a more suitable location any time later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays features are persisted in feature_service::enable() and there
are four callers of it
- feature enabler via gossiper notifications
- boot kicks feature enabler too
- schema loader tool
- cql test env
None but the first case needs to persist features. The loader tool in
fact doesn't do it even now, by keeping qctx uninitialized. The cql test
env wires up the qctx, but it makes no difference for the test cases
themselves whether the features are persisted or not.
Boot-time is a bit trickier -- it loads the feature list from the system
keyspace and may filter out some of its entries, then enables them. In that case
committing the list back into system keyspace makes no sense, as the
persisted list doesn't extend.
The enabler, in turn, can call system keyspace directly via its explicit
dependency reference. This fixes the inverse dependency between system
keyspace and feature service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It now knows that it runs inside an async context, but things are
changing and soon it will be moved out of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the enabler that's responsible for enabling the features and,
implicitly, persisting them into the system keyspace. This patch moves
this logic from gossiper to feature_enabler; further patching will make
the persisting code explicit.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a bit hairy. maybe_enable_features() is called from two places --
by the feature_enabler upon notifications from gossiper, and directly by
gossiper from wait_for_gossip_to_settle().
The _latter_ is called only when the wait_for_gossip_to_settle() is
called for the first time because of the _gossip_settled checks in it.
For the first time this method is called by storage_service when it
tries to join the ring (next it's called from main, but that's not of
interest here).
Next, although feature_enabler is registered early -- when the gossiper
instance is constructed by sharded<gossiper>::start() -- it checks that
_gossip_settled is true before taking any action.
Considering both, calling maybe_enable_features() _and_ registering the
enabler after storage_service's call to wait_for_gossip_to_settle()
doesn't break the code logic, but makes further patching possible. In
particular, the feature_enabler will move to feature_service so as not
to pollute gossiper code with anything that's not gossiping.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And rename the guy. These dependencies will be used further, both are
available and started when the enabler is constructed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fix a few cases where instead of printing column names in error messages, we printed weird stuff like ASCII codes or the address of the name.
Fixes #13657
Closes #13658
* github.com:scylladb/scylladb:
cql3: fix printing of column_specification::name in some error messages
cql3: fix printing of column_definition::name in some error messages
this is a part of a series migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `storage_service::mode` without the help of `operator<<`.
the corresponding `operator<<()` for `storage_service::mode` is removed
in this change, as all its callers now use fmtlib for formatting.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13640
Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c992f. It's still useful
to keep the test to avoid regressions.
The bug manifests by reader throwing:
std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
The reason is that prior to the rework of the cache reader,
range_tombstone_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.
Closes #13665
raft_group0 does not really depend on migration_manager; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of passing it at creation time.
raft_group0 does not really depend on query_processor; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of passing it at creation time.
Introduce new test suite for testing the new topology coordinator
(runs under `raft` experimental flag). Add a simple test that performs a
basic sequence of topology operations.
raft_group0 does not really depend on storage_service; it needs it only
transiently, so pass it to the appropriate methods of raft_group0 instead
of passing it at creation time.
Consolidate `bytes_view_hasher` and abstract_replication_strategy `factory_key_hasher` which are the same into a reusable utils::basic_xx_hasher.
To be used in a followup series for netw:msg_addr.
Closes #13530
* github.com:scylladb/scylladb:
utils: hashing: use simple_xx_hasher
utils: hashing: add simple_xx_hasher
utils: hashers: add HasherReturning concept
hashing: move static_assert to source file
Add a test for getting/enabling/disabling auto_compaction via the
column_family API. Also add log messages for admin operations on that API.
Closes #13566
* github.com:scylladb/scylladb:
api: column_family: add log messages for admin operation
test: rest_api: add test_column_family
After the addition of the rust-std-static-wasm32-wasi target, we're
able to compile the Rust programs to Wasm binaries. However, we're still
only able to handle the Wasm UDFs in the Text format, so we need a tool
to translate the .wasm files to .wat. Additionally, the .wasm files
generated by default are unnecessarily large, which can be helped
using wasm-opt and wasm-strip.
The tool for translating wasm to wat (wasm2wat), and the tool for
stripping the wasm binaries (wasm-strip) are included in the `wabt`
package, and the optimization tool (wasm-opt) is included in the
`binaryen` package. Both packages are added to install-dependencies.sh
Closes #13282
[avi: regenerate frozen toolchain]
Closes #13605
The s3::readable_file::stat() call returns a hand-crafted stat structure
with some fields set to sane values, mostly constants. However, other
fields remain uninitialized, which sometimes leads to trouble. Better to
fill the stat with zeroes and revisit it later for saner values.
fixes: #13645
refs: #13649
Using designated initializers is not an option here, see PR #13499
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #13650
the temporary directory holding the log file collecting the scylla
subprocess's output is specified by the test itself, and it is
`test_tempdir`. but unfortunately, cql-pytest/run.py is not aware of
this, so `cleanup_all()` is not able to print out the logging messages
at exit. please note, cql-pytest/run.py always collects the "log" file
under the directory created using `pid_to_dir()`, where pid is that of
the spawned subprocess, but `object_store/run` uses the main process's
pid for its reusable tempdir.
so, with this change, we also register a cleanup func to print out the
logging messages when the test exits.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For concurrent schema changes test, log when the different stages of the
test are finished.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #13654
this change ensures that `dk._key` is formatted with the "pk" prefix.
as in 3738fcb, the `operator<<` for partition_key was removed, so the
compiler has to find an alternative when this operator is called.
fortunately, from the compiler's perspective, `partition_key` has a
non-explicit `operator managed_bytes_view`, and `managed_bytes_view`
does support `operator<<`. so this ends up with a
change in the format of `decorated_key` when it is printed using
`operator<<`. the code compiles. but unfortunately, the behavior is
changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the
partition_key is supposed to be printed like "pk{010203}" instead of
"010203". the latter is how `managed_bytes_view` is formatted.
a test is added accordingly to avoid future changes which break the
dtest.
Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #13653
column_specification::name is a shared pointer, so it should be
dereferenced before printing - because we want to print the name, not
the pointer.
Fix a few instances of this mistake in prepare_expr.cc. Other instances
were already correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Printing a column_definition::name() in an error message is wrong,
because it is "bytes" and printed as hexadecimal ASCII codes :-(
Some error messages in cql3/operation.cc incorrectly used name()
and should be changed to name_as_text(), as was correctly done in
a few other error messages in the same file.
This patch also fixes a few places in the test/cql approval tests which
"enshrined" the wrong behavior - printing things like 666c697374696e74
in error messages - and now needs to be fixed for the right behavior.
Fixes #13657
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
During node replace we don't introduce a new CDC generation, only during
regular bootstrap. Instead of checking that `new_cdc_generation_uuid`
must be present whenever there's a topology transition, only check it
when we're in `commit_cdc_generation` state.
Use simple_xx_hasher for bytes_view and effective_replication_map::factory_key
appending hashers instead of their custom, yet identical implementations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And a more specific HasherReturningBytes for hashers
that return bytes in finalize().
HasherReturning will be used by the following patch
also for simple hashers that return size_t from
finalize().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, it is responsible for injecting mutations of system.tablets
into schema changes.
Note that not all migrations are handled yet: dependent view or CDC
table drops are not handled.
tablet_aware_replication_strategy is a trait class meant to be
inherited by replication strategies which want to work with tablets. The
trait produces a per-table effective_replication_map which looks at
tablet metadata to determine replicas.
No replication strategy is changed to use tablets yet in this patch.
This change puts the reloading into topology_state_load(), which is a
function which reloads token_metadata from system.topology (the new
raft-based topology management). It clears the metadata, so needs to
reload tablet map too. In the future, tablet metadata could change as
part of topology transaction too, so we reload rather than preserve.
It is already set by schema_maker. In tablets_test we will depend on
the id being the same as that set in the schema_builder, so don't
change it to something else.
Currently, scans are splitting partition ranges around tokens. This
will have to change with tablets, where we should split at tablet
boundaries.
This patch introduces token_range_splitter which abstracts this
task. It is provided by effective_replication_map implementation.
This reverts commit 95bf8eebe0.
Later patches will adapt this code to work with token_range_splitter,
and the unit test added by the reverted commit will start to fail.
The unit test asks the query_ranges_to_vnodes_generator to split the range:
[t:end, t+1:start)
around token t, and expects the generator to produce an empty range
[t:end, t:end]
After adapting this code to token_range_splitter, the input range will
not be split because it is recognized as adjacent to t:end, and the
optimization logic will not kick in. Rather than adding more logic to
handle this case, I think it's better to drop the optimization, as it
is not very useful (rarely happens) and not required for correctness.
This allows update_pending_ranges(), invoked on keyspace creation, to
succeed in the presence of keyspaces with per-table replication
strategy. It will update only vnode-based erms, which is intended
behavior, since only those need pending ranges updated.
This change will also make node operations like bootstrap, repair,
etc. work (not fail) in the presence of keyspaces with per-table
erms; they will just not be replicated by those algorithms.
Before, these would fail inside get_effective_replication_map(), which
is forbidden for keyspaces with per-table replication.
It's meant to be used in places where currently
get_non_local_strategy_keyspaces() is used, but work only with
keyspaces which use vnode-based replication strategy.
Will be used by tablet-based replication strategies, for which
effective replication map is different per table.
Also, this patch adapts existing users of effective replication map to
use the per-table effective replication map.
For simplicity, every table has an effective replication map, even if
the erm is per keyspace. This way the client code can be uniform and
doesn't have to check whether replication strategy is per table.
Not all users of per-keyspace get_effective_replication_map() are
adapted yet to work per-table. Those algorithms will throw an
exception when invoked on a keyspace which uses per-table replication
strategy.
With tablet-based replication strategies it will represent replication
of a single table.
Current vnode_effective_replication_map can be adapted to this interface.
This will allow algorithms like those in storage_proxy to work with
both kinds of replication strategies over a single abstraction.
Add a formatter to compaction::table_state that
prints the table ks_name.cf_name and compaction group id.
Fixes #13467
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In many cases we trigger offstrategy compaction opportunistically
also when there's nothing to do. In this case we still print
to the log lots of info-level message and call
`run_offstrategy_compaction` that wastes more cpu cycles
on learning that it has nothing to do.
This change bails out early if the maintenance set is empty
and prints a "Skipping off-strategy compaction" message in debug
level instead.
Fixes #13466
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now, with f1bbf705f9
(Cleanup sstables in resharding and other compaction types),
we may filter sstables as part of resharding compaction,
so the assertion that all tokens are owned by the current
shard when filtering no longer holds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When distributing the resharding jobs, prefer one of
the sstable shard owners based on foreign_sstable_open_info.
This is particularly important for uploaded sstables
that are resharded since they require cleanup.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Seen after f1bbf705f9 in debug mode
distributed_loader collect_all_shared_sstables copies
compaction::owned_ranges_ptr (lw_shared_ptr<const
dht::token_range_vector>)
across shards.
Since update_sstable_cleanup_state is synchronous, it can
be passed a const reference to the token_range_vector instead.
It is ok to access the memory read-only across shards
and since this happens on start-up, there are no special
performance requirements.
Fixes #13631
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Update comments, test names, etc. that still use the old terminology for
permit state names; bring them up to date with the recent state name changes.
Similar to the storage_service api, print a log message
for admin operations like enabling/disabling auto_compaction,
running major compaction, and setting the table compaction
strategy.
Note that there is overlap in functionality
between the storage_service and the column_family api entry points.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
before this change, the line is 249 chars long, so split it into
multiple lines for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, alternator_timeout_in_ms is not live-updatable:
after the executor's default timeout is set, right before the sharded
executor instances are created, they never get updated with this option
again.
in this change,
* alternator_timeout_in_ms is marked as live-updateable
* executor::_s_default_timeout is changed to a thread_local variable,
so it can be updated by a per-shard updateable_value. it is now an
updateable_value, so its variable name is updated accordingly. this
value is set in the ctor of executor, and it is disconnected from the
corresponding named_value<> option in the dtor of executor.
* alternator_timeout_in_ms is passed to the constructor of
executor via sharded_parameter, so executor::_timeout_in_ms can
be initialized on a per-shard basis
* executor::set_default_timeout() is dropped, as we already pass
the option to executor in its ctor.
please note, in the ctor of executor, we always update the cached
value of `s_default_timeout` with the value of `_timeout_in_ms`,
and we set the default timeout to 10s in `alternator_test_env`.
this is a design decision to avoid bending the production code for
testing, as in production we always set the timeout with the value
specified either by the default or by the yaml conf file.
Fixes #12232
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, we only tested whether permissions with UDFs
that have quoted names work correctly. This patch adds
the missing test that confirms that we can also use UDTs
(as UDF parameter types) when altering permissions.
Currently, the ut_name::to_string() is used only in 2 cases:
the first one is in logs or as part of error messages, and the
second one is during parsing, temporarily storing the user
defined type name in the auth::resource for later preparation
with database and data_dictionary context.
This patch changes the string so that the 'name' part of the
ut_name (as opposed to the 'keyspace' part) is now quoted when
needed. This does not make the logging use cases any worse, but it
does help with parsing the resulting string when finishing the
preparation of the auth::resource.
After this modification, a more fitting name for the function
is "ut_name::to_cql_string()", so the function is renamed to that.
While in debug mode, we may switch from the default stack to
a larger one when parsing CQL. We may, however, invoke
the parser recursively, causing us to switch to the big
stack while already using it. After the reset, we
assume that the stack is empty, so after switching to
the same stack, we write over its previous contents.
This is fixed by checking whether we're already using the large
stack, which is achieved by comparing the address of
a local variable to the start and end of the large stack.
"summary":"Retrieve the mapping of endpoint to host ID",
"summary":"Retrieve the mapping of endpoint to host ID of all nodes that own tokens",
"type":"array",
"items":{
"type":"mapper"
@@ -1114,6 +1114,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ranges_parallelism",
"description":"An integer specifying the number of ranges to repair in parallel by user request. If this number is bigger than the max_repair_ranges_in_parallel calculated by Scylla core, the smaller one will be used.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
@@ -1946,7 +1954,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset local schema",
"summary":"Forces this node to recalculate versions of schema objects.",