Currently the datadir is ignored.
Use it to construct the table's base path.
Fixesscylladb/scylladb#15418Closesscylladb/scylladb#15480
* github.com:scylladb/scylladb:
distributed_loader: populate_keyspace: access cf by ref
distributed_loader: table_populator: use datadir for base_path
distributed_loader: populate_keyspace: issue table mark_ready_for_writes after all datadirs are processed
distributed_loader: populate_keyspace: fixup indentation
distributed_loader: populate_keyspace: iterate over datadirs in the inner loop
test: sstable_directory_test: add test_multiple_data_dirs
table: init_storage: create upload and staging subdirs on all datadirs
Issue #10357 is about a SELECT with a filter on a regular column which
incorrectly returns a static row without regular columns set (so the
filter would not have matched). We already have four tests reproducing
this issue, but each of them is a small part of a large tests translated
from Cassandra, making it hard to understand the scope of this bug.
So in this patch we add two new tests, one passing and one xfailing,
which clarify the scope of this bug. It turns out that the bug only
occurs when a partition has no clustering rows and only has a static
row. If the partition does have clustering rows - even if those don't
match the filter - the bug doesn't happen. The xfailing test is just
two statements long - a single INSERT and a single SELECT
Refs #10357.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#15120
in this series, we rename s3 credential related variable and option names so they are more consistent with AWS's official document. this should help with the maintainability.
Closesscylladb/scylladb#15529
* github.com:scylladb/scylladb:
main.cc: rename aws option
utils/s3/creds: rename aws_config member variables
So far generic describe (`DESC <name>`) followed Cassandra implementation and it only described keyspace/table/view/index.
This commit adds UDT/UDF/UDA to generic describe.
Fixes: #14170Closesscylladb/scylladb#14334
* github.com:scylladb/scylladb:
docs:cql: add information about generic describe
cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc
cql3:statements:describe_statement: include UDT/UDF/UDA in generic describe
- s/aws_key/aws_access_key_id/
- s/aws_secret/aws_secret_access_key/
- s/aws_token/aws_session_token/
rename them to more popular names, these names are also used by
boto's API. this should improve the readability and consistency.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, mark_ready_for_writes is called too early,
after the first data dir is processed, then the next
datadir will hit an assert in `table::mark_ready_for_writes`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a basic regression test that starts the cql test env
with multiple data directories.
It fails without the previous patch:
table: init_storage: create upload and staging subdirs on all datadirs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Also, while at it, add copyright/license blurbs for tests that were missing it.
Closesscylladb/scylladb#15495
* github.com:scylladb/scylladb:
test/topology_custom: add copyright/license blurb to tests
test/topology_custom: test_select_from_mutation_fragments.py: use async query api
The current read-loop fails to detect end-of-page and if the query
result buider cuts the page, it will just proceed to the next
partition. This will result in distorted query results, as the result
builder will request for the consumption to stop after each clustering
row.
To fix, check if the page was cut before moving on to the next
partition.
A unit test reproducing the bug was also added.
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
Fixes: #14710.
Closesscylladb/scylladb#15091
* github.com:scylladb/scylladb:
cql3: statements: delete execute override
cql3: statements: call check_restricted_table_properties in prepare
cql3: statements: pass data_dictionary::database to check_restricted_table_properties
Fix fromJson(null) to return null, not a error as it did before this patch.
We use "null" as the default value when unwrapping optionals
to avoid bad optional access errors.
Fixes: scylladb#7912
Signed-off-by: Michael Huang <michaelhly@gmail.com>
Closesscylladb/scylladb#15481
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Closesscylladb/scylladb#15400
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5Mb. When the necessary amount of bytes is accumulated, the part uploading fibers starts in the background. On flush the sink waits for all the fibers to complete and handles failure of any.
Uploading parallelism is nowadays limited by the means of the http client max-connections parameter. However, when a part uploading fibers waits for it connection it keeps the 5Mb+ buffers on the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.
This PR adds a shard-wide limiter on the number of background buffers S3 clients (and theirs http clients) may use.
Closesscylladb/scylladb#15497
* github.com:scylladb/scylladb:
s3::client: Track memory in client uploads
code: Configure s3 clients' memory usage
s3::client: Construct client with shared semaphore
sstables::storage_manager: Introduce config
The v.u.g. start stop is now spread over main() code heavily.
1. sharded<v.u.g.>.start() happens early enough to allow depending services register staging sstables on it
2. after the system is "more-or-less" alive the invoke_on_all(v.u.g.::start()) is called (conditionally) to activate the generator background fiber. Not 100% sure why it happens _that_ late, but somehow it's required that while scylla is joining the cluster the generation doesn't happen
3. early on stop the v.u.g. is fully stopped
The 3rd step is pretty nasty. It may happen that v.u.g. is not stopped if scylla start aborts before the last action is defer-scheduled. Also, when it happens, it leaves stopping dependencies with non-initialized v.u.g.'s local instances, which is not symmetrical to how they start.
Said that, this PR fixes the stopping sequence to happen later, i.e. -- being defer-scheduled right after sharded<v.u.g.> is started. Also it makes sure that terminating the background fiber happens as early as it is now. This is done the compaction_manager-style -- the v.u.g. subscribes on stop signal abort source and kicks the fiber to stop when it fires.
Closesscylladb/scylladb#15466
* github.com:scylladb/scylladb:
view_update_generator: Stop for real later
view_update_generator: Add logging to do_abort()
view_update_generator: Move abort kicking to do_abort()
view_update_generator: Add early abort subscription
Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.
Check table properties when create and alter statements are prepared.
The error is no longer returned as an exceptional future, but it
is thrown. Adjust the tests accordingly.
when accessing AWS resources, uses are allowed to long-term security
credentials, they can also the temporary credentials. but if the latter
are used, we have to pass a session token along with the keys.
see also https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
so, if we want to programatically get authenticated, we need to
set the "x-amz-security-token" header,
see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/RESTAuthentication.html#UsingTemporarySecurityCredentials
so, in this change, we
1. add another member named `token` in `s3::endpoint_config::aws_config`
for storing "AWS_SESSION_TOKEN".
2. populate the setting from "object_storage.yaml" and
"$AWS_SESSION_TOKEN" environment variable.
3. set "x-amz-security-token" header if
`s3::endpoint_config::aws_config::token` is not empty.
this should allow us to test s3 client and s3 object store backend
with S3 bucket, with the temporary credentials.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15486
Sstables in transitional states are marked with the respective 'status' in the registry. Currently there are two of such -- 'creating' and 'removing'. And the 'sealed' status for sstables in use.
On boot the distributed loader tries to garbage collect the dangling sstables. For filesystem storage it's done with the help of temorary sstables' dirs and pending deletion logs. For s3-backed sstables, the garbage collection means fetching all non-sealed entries and removing the corresponding objects from the storage.
Test included (last patch)
fixes#13024Closesscylladb/scylladb#15318
* github.com:scylladb/scylladb:
test: Extend object_store test to validate GC works
sstable_directory: Garbage collect S3 sstables on reboot
sstable_directory: Pass storage to garbage_collect()
sstable_directory: Create storage instance too
This sets the real limits on the memory semaphore.
- scylla sets it to 1% of total memory, 10Mb min, 100Mb max
- tests set it to 16Mb
- perf test sets it to all available memory
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore will be used to cap memory consumption by client. This
patch makes sure the reference to a semaphore exists as an argument to
client's constructor, not more than that.
In scylla binary, the semaphore sits on storage_manager. In tests the
semaphore is some local object. For now the semaphore is unused and is
initialized locked as this patch just pushes the needed argument all the
way around, next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just an empty config that's fed to storage_manager when constructed as a
preparation for further heavier patching
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
cql.execute_async() can now execute paged queries, use it instead of a
blocking API.
While at it, clean-up the test:
* remove unneded wait on ring0 settle
* address flake8 concerns:
- unused imports
- unused variables
- style
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).
If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).
When performing a schema change in Raft RECOVERY mode we also extend schema mutations which forces nodes to revert to the old way of calculating schema versions when necessary.
We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.
Fixes: #7620Fixes: #13957Closesscylladb/scylladb#15331
* github.com:scylladb/scylladb:
test: add test for group 0 schema versioning
test/pylib: log_browsing: fix type hint
feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
schema_tables: use schema version from group 0 if present
migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
migration_manager: migration_request handler: assume `canonical_mutation` support
system_keyspace: make `get/set_scylla_local_param` public
feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
schema_tables: refactor `scylla_tables(schema_features)`
migration_manager: add `std::move` to avoid a copy
schema_tables: remove default value for `reload` in `merge_schema`
schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
system_keyspace: fix outdated comment
Compaction tasks executors serve two different purposes - as compaction
manager related entity they execute compaction operation and as task
manager related entity they track compaction status.
When one role depends on the other, as it currently is for
compaction_task_impl::done() and compaction_task_executor::compaction_done(),
requirements of both roles need to be satisfied at the same time in each
corner case. Such complexity leads to bugs.
To prevent it, compaction_task_impl::done() of executors no longer depends
on compaction_task_executor::compaction_done().
Fixes: #14912.
Closesscylladb/scylladb#15140
* github.com:scylladb/scylladb:
compaction: warn about compaction_done()
compaction: do not run stopped compaction
compaction: modify lowest compaction tasks' run method
compaction: pass do_throw_if_stopping to compaction_task_executor
this target mirrors the target named `{mode}e-test` in the
`build.ninja` build script created by `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15448
Currently, when creating the table, permissions may be mistakenly
granted to the user even if the table is already existing. This
can happen in two cases:
The query has a IF NOT EXISTS clause - as a result no exception
is thrown after encountering the existing table, and the permission
granting is not prevented.
The query is handled by a non-zero shard - as a result we accept
the query with a bounce_to_shard result_message, again without
preventing the granting of permissions.
These two cases are now avoided by checking the result_message
generated when handling the query - now we only grant permissions
when the query resulted in a schema_change message.
Additionally, a test is added that reproduces both of the mentioned
cases.
CVE-2023-33972
Fixes#15467.
* 'no-grant-on-no-create' of github.com:scylladb/scylladb-ghsa-ww5v-p45p-3vhq:
auth: do not grant permissions to creator without actually creating
transport: add is_schema_change() method to result_message
instead of checking the availability of a required program, let's
use the `REQUIRED` argument introduced by CMake 3.18, simpler this
way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15447
when compiling the tests with -Wsign-compare, the compiler complains like:
```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/cmake/rust -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -MF test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o.d -o test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -c /home/kefu/dev/scylladb/test/boost/tablets_test.cc
/home/kefu/dev/scylladb/test/boost/tablets_test.cc:1335:53: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
for (int log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
```
in this case, it should be safe to use an signed int as the loop
variable to be compared with `tablet_count_bits`, but let's just
appease the compiler so we can enable the warning option project-wide
to prevent any potential issues caused by signed-unsigned comparision.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15449
We add garbage collection for the `CDC_GENERATIONS_V3` table to prevent
it from endlessly growing. This mechanism is especially needed because
we send the entire contents of `CDC_GENERATIONS_V3` as a part of the
group 0 snapshot.
The solution is to keep a clean-up candidate, which is one of the
already published CDC generations. The CDC generation publisher
introduced in #15281 continually uses this candidate to remove all
generations with timestamps not exceeding the candidate's and sets a new
candidate when needed.
We also add `test_cdc_generation_clearing.py` that verifies this new
mechanism.
Fixes#15323Closesscylladb/scylladb#15413
* github.com:scylladb/scylladb:
test: add test_cdc_generation_clearing
raft topology: remove obsolete CDC generations
raft topology: set CDC generation clean-up candidate
topology_coordinator: refactor publish_oldest_cdc_generation
system_keyspace: introduce decode_cdc_generation_id
system_keyspace: add cleanup_candidate to CDC_GENERATIONS_V3
Perform schema changes while mixing nodes in RECOVERY mode with nodes in
group 0 mode:
- schema changes originating from RECOVERY node use
digest-based schema versioning.
- schema changes originating from group 0
nodes use persisted versions committed through group 0.
Verify that schema versions are in sync after each schema change, and
that each schema change results in a different version.
Also add a simple upgrade test, performing a schema change before we
enable Raft (which also enables the new versioning feature) in the
entire cluster, then once upgrade is finished.
One important upgrade test is missing, which we should add to dtest:
create a cluster in Raft mode but in a Scylla version that doesn't
understand GROUP0_SCHEMA_VERSIONING. Then start upgrading to a version
that has this patchset. Perform schema changes while the cluster is
mixed, both on non-upgraded and on upgraded nodes. Such test is
especially important because we're adding a new column to the
`system.scylla_local` table (which we then redact from the schema
definition when we see that the feature is disabled).
As promised in earlier commits:
Fixes: #7620Fixes: #13957
Also modify two test cases in `schema_change_test` which depend on
the digest calculation method in their checks. Details are explained in
the comments.
This series introduces a scylla-native nodetool. It is invokable via the main scylla executable as the other native tools we have. It uses the seastar's new `http::client` to connect to the specified node and execute the desired commands.
For now a single command is implemented: `nodetool compact`, invokable as `scylla nodetool compact`. Once all the boilerplate is added to create a new tool, implementing a single command is not too bad, in terms of code-bloat. Certainly not as clean as a python implementation would be, but good enough. The advantages of a C++ implementation is that all of us in the core team know C++ and that it is shipped right as part of the scylla executable..
Closes#14841
* github.com:scylladb/scylladb:
test: add nodetool tests
test.py: add ToolTestSuite and ToolTest
tools/scylla-nodetool: implement compact operation
tools/scylla-nodetool: implement basic scylla_rest_api_client
tools: introduce scylla-nodetool
utils: export dns_connection_factory from s3/client.cc to http.hh
utils/s3/client: pass logger to dns_connection_factory in constructor
tools/utils: tool_app_template::run_async(): also detect --help* as --help
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.
Topology changes have now an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.
If load balancer is unable to find new tablet replicas, because RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be aborted. There is no infrastructure for
aborts yet, so this is not implemented.
Closes#15197
* github.com:scylladb/scylladb:
tablets, raft topology: Add support for decommission with tablets
tablet_allocator: Compute load sketch lazily
tablet_allocator: Set node id correctly
tablet_allocator: Make migration_plan a class
tablets: Implement cleanup step
storage_service, tablets: Prevent stale RPCs from running beyond their stage
locator: Introduce tablet_metadata_guard
locator, replica: Add a way to wait for table's effective_replication_map change
storage_service, tablets: Extract do_tablet_operation() from stream_tablet()
raft topology: Add break in the final case clause
raft topology: Fix SIGSEGV when trace-level logging is enabled
raft topology: Set node state in topology
raft topology: Always set host id in topology
When a column family's schema is changed new compaction
strategy type may be applied.
To make sure that it will behave as expected, compaction
strategy need to contain only the allowed options and values.
Methods throwing exception on invalid options are added.
Fixes: #2336.
Closes#13956
* github.com:scylladb/scylladb:
test: add test for compaction strategy validation
compaction: unify exception messages
compaction: cql3: validate options in check_restricted_table_properties
compaction: validate options used in different compaction strategies
compaction: validate common compaction strategy options
compaction: split compaction_strategy_impl constructor
compaction: validate size_tiered_compaction_strategy specific options
compaction: validate time_window_compaction_strategy specific options
compaction: add method to validate min and max threshold
compaction: split size_tiered_compaction_strategy_options constructor
compaction: make compaction strategy keys static constexpr
compaction: use helpers in validate_* functions
compaction: split time_window_compaction_strategy_options construtor
compaction: add validate method to compaction_strategy_options
time_window_compaction_strategy_options: make copy and move-able
size_tiered_compaction_strategy_options: make copy and move-able
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.
Topology changes have now an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.
If load balancer is unable to find new tablet replicas, because RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be paused so that admin becomes aware of the problem and
issues an abort on the topology change. There is no infrastructure for
aborts yet, so this is not implemented.
Testing the new scylla nodetool tool.
The tests can be run aginst both implementations of nodetool: the
scylla-native one and the cassandra one. They all pass with both
implementations.
There are several system tables with strict durability requirements.
This means that if we have written to such a table, we want to be sure
that the write won't be lost in case of node failure. We currently
accomplish this by accompanying each write to these tables with
`db.flush()` on all shards. This is expensive, since it causes all the
memtables to be written to sstables, which causes a lot of disk writes.
This overheads can become painful during node startup, when we write the
current boot state to `system.local`/`system.scylla_local` or during
topology change, when `update_peer_info`/`update_tokens` write to
`system.peers`.
In this series we remove flushes on writes to the `system.local`,
`system.peers`, `system.scylla_local` and `system.cdc_local` tables and
start using schema commitlog for durability.
Fixes: #15133Closes#15279
* github.com:scylladb/scylladb:
system_keyspace: switch CDC_LOCAL to schema commitlog
system_keyspace: scylla_local: use schema commitlog
database.cc: make _uses_schema_commitlog optional
system_keyspace: drop load phases
database.hh: add_column_family: add readonly parameter
schema_tables: merge_tables_and_views: delay events until tables/views are created on all shards
system_keyspace: switch system.peers to schema commitlog
system_keyspace: switch system.local to schema commitlog
main.cc: move schema commitlog replay earlier
sstables_format_selector: extract listener
sstables_format_selector: wrap when_enabled with seastar::async
main.cc: inline and split system_keyspace.setup
system_keyspace: refactor save_system_schema function
system_keyspace: move initialize_virtual_tables into virtual_tables.hh
system_keyspace: remove unused parameter
config.cc: drop db::config::host_id
main.cc:: extract local_info initialization into function
schema.cc: check static_props for sanity
system_keyspace: set null sharder when configuring schema commitlog
system_keyspace: rename static variables
system_keyspace: remove redundant wait_for_sync_to_commitlog