This patch adds a reproducing test for issue #11588, which is still open
so the test is expected to fail on Scylla ("xfail"), and passes on Cassandra.
The test shows that Scylla allows an out-of-range value to be written to
a timestamp column, but then it can't be read back.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11864
The PR prepares repair for task manager integration:
- Creates repair_module
- Keeps repair_module in repair_service
- Moves tracker methods to repair_module
- Changes UUID to task_id in repair module
Closes #11851
* github.com:scylladb/scylladb:
repair: check shutdown with abort source in repair module
repair: use generic module gate for repair module operations
repair: move tracker to repair module
repair: move next_repair_command to repair_module
repair: generate repair id in repair module
repair: keep shard number in repair_uniq_id
repair: change UUID to task_id
repair: add task_manager::module to repair_service
repair: create repair module and task
Repair module uses a gate to prevent starting new tasks on shutdown.
Generic module's gate serves the same purpose, thus we can
use it also in repair specific context.
Since both tracker and repair_module serve similar purpose,
it is confusing where to look for methods connected to them.
Thus, to make it more transparent, tracker class is deleted and all
its attributes and methods are moved to repair_module.
The number of the repair operation was counted both with
next_repair_command from tracker and the sequence number
from task_manager::module.
To get rid of redundancy next_repair_command was deleted and all
methods using its value were moved to repair_module.
Execution shard is one of the traits specific to repair tasks.
Child task should freely access shard id of its parent. Thus,
the shard number is kept in a repair_uniq_id struct.
Create repair_task_impl and repair_module inheriting from respectively
task manager task_impl and module to integrate repair operations with
task manager.
We currently avoid compiling C code in "pip3 install scylla-driver", but
we actually provide portable binary distributions of the package,
so we should use them via "pip3 install --only-binary=:all: scylla-driver".
The binary distribution contains its dependency libraries, so we won't have
a problem loading it on relocatable python3.
Closes #11852
The PR adds changes to task manager that allow more convenient integration with modules.
Introduced changes:
- adds an internal flag in task::impl that allows filtering out overly specific tasks before they are shown to the user
- renames `parent_data` to more appropriate name `task_info`
- creates `tasks/types.hh` which allows using some types connected with task manager without the necessity to include whole task manager
- adds more flexible version of `make_task` method
Closes #11821
* github.com:scylladb/scylladb:
tasks: add alternative make_task method
tasks: rename parent_data to task_info and move it
tasks: move task_id to tasks/types.hh
tasks: add internal flag for task_manager::task::impl
Prevent copying shared_ptr across shards
in do_sync_data_using_repair by allocating
a shared_ptr<abort_source> per shard in
node_ops_meta_data and respectively in node_ops_info.
Fixes #11826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11827
* github.com:scylladb/scylladb:
repair: use sharded abort_source to abort repair_info
repair: node_ops_info: add start and stop methods
storage_service: node_ops_abort_thread: abort all node ops on shutdown
storage_service: node_ops_abort_thread: co_return only after printing log message
storage_service: node_ops_meta_data: add start and stop methods
repair: node_ops_info: prevent accidental copy
The lsa-segment command tries to walk LSA segment objects by decoding
their descriptors and (!) object sizes as well. Some objects in LSA have
dynamic sizes, i.e. those depending on the object contents. The script
tries to drill down the object internals to get this size, but bad news
is that nowadays there are many dynamic objects that are not covered.
Once it steps on an unsupported object, scylla-gdb likely stops because
the "next" descriptor happens to be in the middle of the object and its
parsing throws.
This patch fixes this by taking advantage of the virtual size() call of
the migrate_fn_type all LSA objects are linked with (indirectly). It
gets the migrator object, the LSA object itself and calls
((migrate_fn_type*)<migrator_ptr>)->size((const void*)<object_ptr>)
with gdb. The evaluated value is the live dynamic size of the object.
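As a rough illustration of the expression above, a gdb Python script could build the string as follows. This is only a sketch: the helper name lsa_object_size_expr is hypothetical and is not taken from scylla-gdb.py, and the addresses are assumed to be plain integers.

```python
def lsa_object_size_expr(migrator_ptr, object_ptr):
    """Build the gdb expression that asks the migrator for an object's size.

    Both arguments are plain integer addresses. The helper name is
    hypothetical; only the shape of the expression comes from the text above.
    """
    return "((migrate_fn_type*){:#x})->size((const void*){:#x})".format(
        migrator_ptr, object_ptr)


# Inside a gdb session the string would then be evaluated, for example:
#   size = int(gdb.parse_and_eval(lsa_object_size_expr(migrator, obj)))
```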
fixes: #11792
refs: #2455
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11847
Currently, when specifying nodes to ignore for replace or removenode,
we support specifying them only using their ip address.
As discussed in https://github.com/scylladb/scylladb/issues/11839 for removenode,
we intentionally require the host uuid for specifying the node to remove,
so the nodes to ignore (that are also down, otherwise we need not ignore them),
should be consistent with that and be specified using their host_id.
The series extends the apis and allows either the nodes' ip address or their host_id
to be specified, for backward compatibility.
We should deprecate the ip address method over time and convert the tests and management
software to use the ignored nodes' host_id:s instead.
Closes #11841
* github.com:scylladb/scylladb:
api: doc: remove_node: improve summary
api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s
storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
locator: token_metadata: add parse_host_id_and_endpoint
api: storage_service: remove_node: validate host_id
The current summary of the operation is obscure.
It refers to a token in the ring and the endpoint associated with it,
while the operation uses a host_id to identify a whole node.
Instead, clarify the summary to refer to a node in the cluster,
consistent with the description for the host_id parameter.
Also, describe the effect the call has on the data the removed node
logically owned.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the api is inconsistent: requiring a uuid for the
host_id of the node to be removed, while the ignored nodes list
is given as comma-separated ip addresses.
Instead, support identifying the ignored_nodes either
by their host_id (uuid) or ip address.
Also, require all ignore_nodes to be of the same kind:
either UUIDs or ip addresses, as a mix of the 2 is likely
indicating a user error.
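The "all of the same kind" rule can be sketched as below. This is an illustrative Python model, not the actual C++ api code; the function name parse_ignore_nodes and its return shape are assumptions made for the example.

```python
import ipaddress
import uuid


def parse_ignore_nodes(spec):
    """Classify a comma-separated ignore_nodes list.

    Returns ("host_id", [UUID, ...]) or ("ip", [address, ...]) and raises
    ValueError if the list mixes the two kinds or holds an invalid entry.
    """
    host_ids, addresses = [], []
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        try:
            host_ids.append(uuid.UUID(item))
        except ValueError:
            # Not a UUID, so it has to be a valid ip address
            # (ip_address() raises ValueError otherwise).
            addresses.append(ipaddress.ip_address(item))
    if host_ids and addresses:
        raise ValueError("ignore_nodes mixes host ids and ip addresses")
    if host_ids:
        return "host_id", host_ids
    return "ip", addresses
```

Rejecting a mixed list early keeps a likely typo from being silently interpreted as a valid request.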
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for specifying nodes either by their
host_id or ip address and using the token_metadata
to resolve the mapping.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The node to be removed must be identified by its host_id.
Validate that at the api layer and pass the parsed host_id
down to storage_service::removenode.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the --disks option does not allow symlinks such as
/dev/disk/by-uuid/* or /dev/disk/azure/*.
To allow using them, is_unused_disk() should resolve the symlink to
its realpath before evaluating the disk path.
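A minimal sketch of the idea, assuming a much simplified is_unused_disk(): the real function performs other checks, and the used_devices parameter here is a hypothetical stand-in for them. The only point shown is that os.path.realpath() is applied before the path is evaluated.

```python
import os
import tempfile


def is_unused_disk(path, used_devices):
    """Hypothetical stand-in: a disk is unused if its real path is not listed.

    The real is_unused_disk() performs other checks; the point here is only
    that the symlink is resolved before the path is evaluated.
    """
    real = os.path.realpath(path)
    return real not in used_devices


def demo():
    """Create a fake device file plus a by-uuid style symlink and check both."""
    with tempfile.TemporaryDirectory() as tmp:
        device = os.path.join(tmp, "sdb")
        open(device, "w").close()
        link = os.path.join(tmp, "by-uuid-link")
        os.symlink(device, link)
        used = {os.path.realpath(device)}
        # Both spellings resolve to the same device, so both count as used.
        return is_unused_disk(device, used), is_unused_disk(link, used)
```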
Fixes #11634
Closes #11646
It seems that the distribution's original sysconfig files do not use double
quotes to set a parameter when the value does not contain a space.
Add a function to detect spaces in the value, and don't use double quotes
when none are detected.
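The quoting rule can be illustrated as follows. This is a sketch under assumptions: format_sysconfig_line is a hypothetical helper, not the actual sysconfig_parser code, and it treats any whitespace character as a "space".

```python
def format_sysconfig_line(key, value):
    """Quote the value only when it contains whitespace (hypothetical helper)."""
    if any(ch.isspace() for ch in value):
        return '{}="{}"'.format(key, value)
    return "{}={}".format(key, value)
```

For example, a value such as `eth0` would be written bare, while `--cpuset 0-3` would be wrapped in double quotes.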
Fixes #9149
Closes #9153
* github.com:scylladb/scylladb:
scylla_util.py: adding unescape for sysconfig_parser
scylla_util.py: on sysconfig_parser, don't use double quote when it's possible
Currently we use a single shared_ptr<abort_source>
that can't be copied across shards.
Instead, use a sharded<abort_source> in node_ops_info so that each
repair_info instance will use an (optional) abort_source*
on its own shard.
Added respective start and stop methods, plus a local_abort_source
getter to get the shard-local abort_source (if available).
Fixes #11826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for adding a sharded<abort_source> member.
Wire start/stop in storage_service::node_ops_meta_data.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
A later patch adds a sharded<abort_source> to node_ops_info.
On shutdown, we must orderly stop it, so use node_ops_abort_thread
shutdown path (where node_ops_singal_abort is called with a nullopt)
to abort (and stop) all outstanding node_ops by passing
a null_uuid to node_ops_abort, and let it iterate over all
node ops to abort and stop them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the function co_returns if (!uuid_opt)
so the log info message indicating it's stopped
is not printed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Delete node_ops_info copy and move constructors before
we add a sharded<abort_source> member for the per-shard repairs
in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Even though we have __escape() for escaping a " in the middle of a value when
writing the sysconfig file, we didn't unescape when reading it back.
So add __unescape() and call it in get().
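The intended symmetry between writing and reading might look like this. The two functions are hypothetical stand-ins for __escape() and __unescape(); the exact escaping rules used in scylla_util.py are assumed here, not quoted.

```python
def escape(value):
    """Assumed write-side rule: put a backslash before every double quote."""
    return value.replace('"', '\\"')


def unescape(value):
    """Read-side inverse: turn every backslash-quote pair back into a quote."""
    return value.replace('\\"', '"')
```

With both in place, a value read back through get() matches what was originally set.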
It seems that the distribution's original sysconfig files do not use double
quotes to set a parameter when the value does not contain a space.
Add a function to detect spaces in the value, and don't use double quotes
when none are detected.
Fixes #9149
Task manager tasks should be created with make_task method since
it properly sets information about child-parent relationship
between tasks. Though, sometimes we may want to keep additional
task data in classes inheriting from task_manager::task::impl.
Doing it with existing make_task method makes it impossible since
implementation objects are created internally.
The commit adds a new make_task that allows providing a task
implementation pointer created by the caller. All the fields except
for the ones connected with children and parent should be set beforehand.
parent_data struct contains info that is common for each task,
not only in parent-child relationship context. To use it this way
without confusion, its name is changed to task_info.
In order to be able to widely and comfortably use task_info,
it is moved from tasks/task_manager.hh to tasks/types.hh
and slightly extended.
It is convenient to create many different task implementations
representing more and more specific parts of the operation in
a module. Presenting all of them through the api makes it cumbersome
for user to navigate and track, though.
Flag internal is added to task_manager::task::impl so that the tasks
could be filtered before they are sent to user.
When doing shadow round for replacement the bootstrapping node needs to
know the dc/rack info about the node it replaces to configure it on
topology. This topology info is later used by e.g. repair service.
fixes: #11829
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11838
After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized which is :0 ipv6 address. Later below this
address is used as-is to check if it belongs to the same DC or not which
is not nice, as :0 is not an address of any existing endpoint.
Recent move of dc/rack data onto topology made this place reveal itself
by emitting the internal error due to :0 not being present on the
topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to "default_dc" datacenter which may or may
not match with the dc of the local node.
The fix is to explicitly distinguish a set extra_replica from an unset one.
fixes: #11825
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11833
Most software distributed as tar.gz has a sub-directory that contains
everything, to prevent extracting the contents into the current directory.
We should follow this style in our unified package too.
To do this we need to increment the relocatable package version to '3.0'.
Fixes #8349
Closes #8867
We added a UUID device file existence check in #11399. We expect the UUID
device file to be created before checking, and we wait for its creation with
"udevadm settle" after "mkfs.xfs".
However, we are actually getting an error which says the UUID device file is
missing. It probably means "udevadm settle" doesn't guarantee that the device
file is created under some conditions.
To avoid the error, use var-lib-scylla.mount to wait until the UUID device
file is ready, and run the file existence check when the service has
failed.
Fixes #11617
Closes #11666
Since the end bound is exclusive, the end position should be
before_key(), not after_key().
As far as I know, this affects only tests, since only there can we get an end
bound which is a clustering row position.
Would cause failures once row cache is switched to v2 representation
because of violated assumptions about positions.
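The difference between the two end positions can be seen in a toy model, where a position is a (key, weight) pair and weight -1, 0 or 1 means before, at or after the key. This is a simplified illustration with made-up helpers, not the real position_in_partition code.

```python
def before_key(key):
    return (key, -1)


def for_key(key):
    return (key, 0)


def after_key(key):
    return (key, 1)


def rows_in_range(rows, start, end):
    """Return the row keys whose position falls in the half-open range [start, end)."""
    return [key for key in rows if start <= for_key(key) < end]
```

In this model, ending at before_key(3) leaves row 3 out, which is what an exclusive end bound means, while ending at after_key(3) would wrongly include it.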
Introduced in 76ee3f029c
Closes #11823
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789
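To illustrate why the whole table directory has to be scanned, here is a small sketch. The file name layout (`<version>-<generation>-<format>-Data.db`) and the helper names are assumptions made for the example; the real logic lives in the C++ distributed_loader.

```python
import os
import re
import tempfile

# Assumed file name layout: <version>-<generation>-<format>-Data.db
_DATA_FILE = re.compile(r"^[a-z]+-(\d+)-[a-z]+-Data\.db$")


def highest_generation(table_dir):
    """Walk table_dir and all its subdirectories, return the highest generation."""
    highest = 0
    for _root, _dirs, files in os.walk(table_dir):
        for name in files:
            match = _DATA_FILE.match(name)
            if match:
                highest = max(highest, int(match.group(1)))
    return highest


def demo():
    """Base dir holds generation 7, staging holds generation 3."""
    with tempfile.TemporaryDirectory() as table_dir:
        staging = os.path.join(table_dir, "staging")
        os.mkdir(staging)
        open(os.path.join(table_dir, "md-7-big-Data.db"), "w").close()
        open(os.path.join(staging, "md-3-big-Data.db"), "w").close()
        # Looking at staging alone under-estimates the highest generation.
        return highest_generation(staging), highest_generation(table_dir)
```

In the demo, scanning only staging yields 3, so new sstables numbered 4 and up could eventually collide with generation 7 in the base directory.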
Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793
Closes #11795
* github.com:scylladb/scylladb:
distributed_loader: populate_column_family: reindent
distributed_loader: coroutinize populate_column_family
distributed_loader: table_population_metadata: start: reindent
distributed_loader: table_population_metadata: coroutinize start_subdir
distributed_loader: table_population_metadata: start_subdir: reindent
distributed_loader: pre-load all sstables metadata for table before populating it
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.
This series fixes this bug (it was a bug in our materialized views implementation) and adds a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).
Fixes #11801
Closes #11808
* github.com:scylladb/scylladb:
test/alternator: add test for issue 11801
MV: fix handling of view update which reassign the same key value
materialized views: inline used-once and confusing function, replace_entry()
The storage_service::stop() calls repair_service::abort_repair_node_ops() but at that time the sharded<repair_service> is already stopped and calling .local() on it just crashes.
The suggested fix is to remove explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed on the node_ops_meta_data's abort source and abort themselves automatically.
fixes: #10284
Closes #11797
* github.com:scylladb/scylladb:
repair: Remove ops_uuid
repair: Remove abort_repair_node_ops() altogether
repair: Subscribe on node_ops_info::as abortion
repair: Keep abort source on node_ops_info
repair: Pass node_ops_info arg to do_sync_data_using_repair()
repair: Mark repair_info::abort() noexcept
node_ops: Remove _aborted bit
node_ops: Simplify construction of node_ops_metadata
main: Fix message about repair service starting
The test wants to see that no allocations larger than 128k are present,
but sets the warning threshold to exactly 128k. Due to an off-by-one in
Seastar, this went unnoticed. However, now that the off-by-one in Seastar
is fixed [1], this test starts to fail.
Fix by setting the warning threshold to 128k + 1.
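The effect of the threshold change can be sketched as below. The comparison used (`size >= threshold`) is an assumption about how the fixed Seastar check behaves, inferred from the description above rather than taken from the Seastar source.

```python
KIB = 1024


def should_warn(size, threshold):
    """Assumed post-fix semantics: warn once the size reaches the threshold."""
    return size >= threshold
```

Under that assumption, an allocation of exactly 128k warns with a threshold of 128k but not with 128k + 1, so only allocations strictly larger than 128k trip the test.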
[1] 429efb5086
Closes #11817
The vector(initializer_list<T>) constructor copies the T since
initializer_list is read-only. Move the mutation instead.
This happens to fix a use-after-return on clang 15 on aarch64.
I'm fairly sure that's a miscompile, but the fix is worthwhile
regardless.
Closes #11818
This PR adds some unit tests for the `expr::evaluate()` function.
At first I wanted to add the unit tests as part of #11658, but their size grew and grew, until I decided that they deserve their own pull request.
I found a few places where I think it would be better to behave in a different way, but nothing serious.
Closes #11815
* github.com:scylladb/scylladb:
test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
cql3: expr: Add unit tests for bind_variable validation of collections
cql3: expr: Add test for subscripted list and map
cql3: expr: Add test for usertype_constructor
cql3: expr: Add test for tuple_constructor
cql3: expr: Add tests for evaluation of collection constructors
cql3: expr: Add tests for evaluation of column_values and bind_variables
cql3: expr: Add constant evaluation tests
test/boost: Add expr_test_utils.hh
cql3: Add ostream operator for raw_value
cql3: add is_empty_value() to raw_value and raw_value_view
expr_test_utils.hh was a header file with helper methods for
expression tests. All functions were inline, because I didn't
know how to create and link a .cc file in test/boost.
Now the header is split into expr_test_utils.hh and expr_test_utils.cc
and moved to test/lib, which is designed to keep this kind of files.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
"
Snitch was the junction of several services' deps because it was the
holder of endpoint->dc/rack mappings. Now this information is all on
topology object, so snitch can be finally made main-local
"
* 'br-deglobalize-snitch' of https://github.com/xemul/scylla:
code: Deglobalize snitch
tests: Get local reference on global snitch instance once
gossiper: Pass current snitch name into checker
snitch: Add sharded<snitch_ptr> arg to reset_snitch()
api: Move update_snitch endpoint
api: Use local snitch reference
api: Unset snitch endpoints on stop
storage_service: Keep local snitch reference
system_keyspace: Don't use global snitch instance
snitch: Add const snitch_ptr::operator->()
These cannot be meaningfully defined for a vector value like resources.
To prevent instinctive misuse, remove them. Operator bool is replaced
with `non_zero()` which hopefully better expresses what to expect.
The comparison operator is just removed and inlined into its only user,
which actually helps said user's readability.
Closes #11813
Evaluating a bind variable should validate
collection values.
Test that bound collection values are validated,
even in case of a nested collection.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Test that evaluate(tuple_constructor) works
as expected.
It was necessary to implement a custom function
for serializing tuples, because some tests
require the tuple to contain unset_value
or an empty value, which is impossible
to express using the existing code.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Test that evaluate(collection_constructor) works as expected.
Added a bunch of utility methods for creating
collection values to expr_test_utils.hh.
I was forced to write custom serialization of
collections. I tried to use data_value,
but it doesn't allow to express unset_value
and empty values.
The custom serialization isn't actually used
in this specific commit, but it's needed
in the following ones.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
All uses of snitch now have their own local reference. The global
instance can now be replaced with the one living in main (and tests)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some tests actively use global snitch instance. This patch makes each
test get a local reference and use it everywhere. Next patch will
replace global instance with local one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Gossiper makes sure local snitch name is the same as the one of other
nodes in the ring. It now gets global snitch to get the name, this patch
passes the name as an argument, because the caller (storage_service) has
snitch instance local reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method replaces snitch instance on the existing sharded<snitch_ptr>
and the "existing" is nowadays the global instance. This patch changes
it to use local reference passed from API code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now living in storage_service.cc, but non-global snitch is
available in endpoint_snitch.cc so move the endpoint handler there
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snitch/name endpoint needs snitch instance to get the name from.
Also the storage_service/reset_snitch endpoint will also need snitch
instance to call reset on.
This patch carries local snitch reference all the way through API setup
and patches the get_name() call. The reset_snitch() will come in the
next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some time soon snitch API handlers will operate on local snitch
reference capture, so those need to be unset before the target local
variable goes away
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection
At this point it's worth adding storage_service->snitch explicit
dependency and patch the above to use local reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places to patch: .start() and .setup() and both only need
snitch to get local dc/rack from, nothing more. Thus both can live with
the explicit argument for now
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Until now, authentication in alternator served only two purposes:
- refusing clients without proper credentials
- printing user information with logs
After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.
tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user
Fixes #11379
Closes #11380
* github.com:scylladb/scylladb:
alternator: propagate authenticated user in client state
client_state: add internal constructor with auth_service
alternator: pass auth_service and sl_controller to server
This reverts commit df8e1da8b2, reversing
changes made to 4ff204c028. It causes
a crash in debug mode on aarch64 (likely a coroutine miscompile).
Fixes #11809.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings of Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.
Fixes #11470
Closes #11693
To prevent stalls due to large number of tables.
Fixes scylladb/scylladb#11574
Closes #11689
* github.com:scylladb/scylladb:
schema_tables: merge_tables_and_views reindent
schema_tables: limit paralellism
Allow changing the total or initial resources the semaphore has. After calling `set_resources()` the semaphore will look as if it had been created with the specified amount of resources.
Use the new method in `replica::database::revert_initial_system_read_concurrency_boost()` so it doesn't lead to strange semaphore diagnostics output. Currently the system semaphore has 90/100 count units when there are no reads against it, which has led to some confusion.
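The diagnostics difference can be pictured with a toy counter. This sketch assumes that the previous revert worked by consuming units from the semaphore, which is an inference from the "90/100" figure above; the class and method names other than set_resources() are hypothetical.

```python
class ToyCountSemaphore:
    """Toy model tracking only available and initial count units."""

    def __init__(self, count):
        self.initial = count
        self.available = count

    def consume(self, units):
        # Assumed old approach: take units away without touching the initial total.
        self.available -= units

    def set_resources(self, count):
        # Adjust both numbers so the semaphore looks freshly created with `count`.
        self.available += count - self.initial
        self.initial = count

    def diagnostics(self):
        return "{}/{}".format(self.available, self.initial)
```

In this model an idle semaphore reverted by consuming units reports 90/100, while one reverted through set_resources(90) reports 90/90.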
I also plan on using the new facility in enterprise.
Closes #11772
* github.com:scylladb/scylladb:
replica/database: revert initial boost to system semaphore with set_resources()
reader_concurrency_semaphore: add set_resources()
When moving a SSTable from staging to base dir, we reused the generation
under the assumption that no SSTable in base dir uses that same
generation. But that's not always true.
When reshaping staging dir, reshape compaction can pick a generation
taken by a SSTable in base dir. That's because staging dir is populated
first and it doesn't have awareness of generations in base dir yet.
When that happens, view building will fail to move SSTable in staging
which shares the same generation as another in base dir.
We could have played with the order of population, populating the base dir
before the staging dir, but the fragility wouldn't be gone. Not
future proof at all.
We can easily make this safe by picking a new generation for the SSTable
being moved from staging, making sure no clash will ever happen.
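A toy model of the two strategies, for illustration only. ToyTable and its methods are hypothetical and track nothing but generation numbers; the real move involves files on disk.

```python
class ToyTable:
    """Tracks only the generation numbers present in the base directory."""

    def __init__(self, base_generations):
        self.base = set(base_generations)
        self._next = max(self.base, default=0) + 1

    def new_generation(self):
        generation = self._next
        self._next += 1
        return generation

    def move_reusing_generation(self, staging_generation):
        # Old behaviour: keep the staging generation, which may already be taken.
        if staging_generation in self.base:
            raise FileExistsError("generation %d already in base" % staging_generation)
        self.base.add(staging_generation)
        return staging_generation

    def move_with_new_generation(self, staging_generation):
        # New behaviour: the staging generation is deliberately ignored and a
        # fresh one is allocated, so a clash cannot happen.
        generation = self.new_generation()
        self.base.add(generation)
        return generation
```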
Fixes #11789.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11790
This patch adds a test reproducing issue #11801, and confirming that
the previous patch fixed it. Before the previous patch, the test passed
on DynamoDB but failed on Alternator.
The patch also adds four more passing tests which demonstrate that
issue #11801 only happened in the very specific case where:
1. A GSI has two key attributes which weren't key attributes in the
base, and
2. An update sets the second of those attributes to the same value
which it already had.
This bug was originally discovered and explained by @fee-mendes.
Refs #11801.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).
Fixes scylladb/scylladb#11793
Note: table_population_metadata::start_subdir is called
in a seastar thread to facilitate backporting to old versions
that do not support coroutines yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a materialized view has a key (in Alternator, this can be two
keys) which was a regular column in the base table, and a base
update modifies that regular column, there are two distinct cases:
1. If the old and new key values are different, we need to delete the
old view row, and create a new view row (with the different key).
2. If the old and new key values are the same, we just need to update
the pre-existing row.
It's important not to confuse the two cases: If we try to delete and
create the *same* view row in the same timestamp, the result will be
that the row will be deleted (a tombstone wins over data if they have the
same timestamp) instead of updated. This is what we saw in issue #11801.
We had a bug that was seen when an update set the view key column to
the old value it already had: To compare the old and new key values
we used the function compare_atomic_cell_for_merge(), but this compared
not just their values but also incorrectly compared metadata such as
the timestamp. Because setting a column to the same value changes its
timestamp, we wrongly concluded that these were different view keys
and used the delete-and-create code for this case, resulting in the
view row being deleted (as explained above).
The simple fix is to compare just the key values - not looking at
the metadata.
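The difference between the buggy and the fixed comparison can be modeled as follows. This is an illustrative Python sketch, not the C++ view update code; the Cell class and both function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: bytes
    timestamp: int


def view_actions_buggy(old, new):
    # Compares the whole cell, so a changed timestamp looks like a new key.
    if old == new:
        return ["update"]
    return ["delete_old_row", "create_new_row"]


def view_actions_fixed(old, new):
    # Compares only the key value and ignores metadata such as the timestamp.
    if old.value == new.value:
        return ["update"]
    return ["delete_old_row", "create_new_row"]
```

Rewriting the same value with a newer timestamp sends the buggy variant down the delete-and-create path, while the fixed variant just updates the existing row.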
See tests reproducing this bug and confirming its fix in the next patch.
Fixes #11801
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The replace_entry() function is nothing more than a convenience for
calling delete_old_entry() and then create_entry(). But it is only used
once in the code, and we can just open-code the two calls instead of
the one.
The reason I want to change it now is that the shortcut replace_entry()
helped hide a bug (#11801) - replace_entry() works incorrectly if the
old and new row have the same key, because if they do we get a deletion
and creation of the same row with the same timestamp - and the deletion
wins. Having the two calls not hidden by a convenience function makes
this potential problem more apparent.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add tests which test that evaluate(column_value)
and evaluate(bind_variable) work as expected.
Values of columns and bind variables are
kept in evaluation_inputs, so we need to mock
them in order for evaluate() to work.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Add unit test for evaluating expr::constant values.
evaluate(constant) just returns constant.value,
so there is no point in trying all the possible combinations.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
It was supposed to be a fix for #2455, but eventually it turned out that #11792 blocks this progress and takes more effort. So for now there are only a couple of small improvements (so as not to lose them by chance).
Closes #11794
* github.com:scylladb/scylladb:
scylla-gdb: Make regions iterable object
scylla-gdb: Dont print 0x0x
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto topology object and this set finalizes it. After it the
snitch service stops depending on gossiper and system keyspace and is
ready for de-globalization. As a nice side-effect the system keyspace no
longer needs to maintain the dc/rack info cache and its starting code gets
relaxed.
refs: #2737
refs: #2795
"
* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
system_keyspace: Dont maintain dc/rack cache
system_keyspace: Indentation fix after previous patch
system_keyspace: Coroutinuze build_dc_rack_info()
topology: Move all post-configuration to topology::config
snitch: Start early
gossiper: Do not export system keyspace
snitch: Remove gossiper reference
snitch: Mark get_datacenter/_rack methods const
snitch: Drop some dead dependency knots
snitch, code: Make get_datacenter() report local dc only
snitch, code: Make get_rack() report local rack only
storage_service: Populate pending endpoint in on_alive()
code: Populate pending locations
topology: Put local dc/rack on topology early
topology: Add pending locations collection
topology: Make get_location() errors more verbose
token_metadata: Add config, spread everywhere
token_metadata: Hide token_metadata_impl copy constructor
gosspier: Remove messaging service getter
snitch: Get local address to gossip via config
...
Add a header file which will contain
utilities for writing expression tests.
For now it contains simple functions
like make_int_constant(), but there
are many more to come.
I feel like it's cleaner to put
all these functions in a separate
file instead of having them spread randomly
between tests.
It also enables code reuse so that
future expression tests can reuse
these functions instead of writing
them from scratch.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
It's possible to print raw_value_view,
but not raw_value. It would be useful
to be able to print both.
Implement printing raw_value
by creating raw_value_view from it
and printing the view.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
An empty value is a value that is neither null nor unset,
but has 0 bytes of data.
Such values can be created by the user using
certain CQL functions, for example an empty int
value can be inserted using blobasint(0x).
Add a method to raw_value and raw_value_view,
which allows to check whether the value is empty.
This will be used in many places in which
we need to validate that a value isn't empty.
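The three-way distinction can be pictured with a toy model. ToyRawValue is a hypothetical stand-in for raw_value; only the is_empty_value() name comes from the commit.

```python
class ToyRawValue:
    """Toy stand-in: kind is "null", "unset" or "value"."""

    def __init__(self, kind, data=b""):
        self.kind = kind
        self.data = data

    def is_empty_value(self):
        # Empty means: an actual value (not null, not unset) with 0 bytes of data.
        return self.kind == "value" and len(self.data) == 0
```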
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
It used to be used to abort repair_info by the corresponding node-ops
uuid, but this code is no longer there, so it's good to drop the uuid as
well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When node_ops_meta_data aborts it also kicks repair to find and abort
all relevant repair_infos. Now it can be simplified by subscribing
repair_meta on the abort source and aborting it without explicit kick
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need to subscribe on node_ops_meta_data's abort source
inside repair code, so keep the pointer on node_ops_info too. At the
same time, the node_ops_info::abort becomes obsolete, because the same
check can be performed via the abort_source->abort_requested()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need to know more than the ops_uuid. The needed info
is (well -- will be) sitting on node_ops_info, so pass it along
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patch will call it inside abort_source subscription callback which
requires the calling code to be noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A short cleanup "while at it" -- the node_ops_meta_data doesn't need to
carry dedicated _aborted boolean -- the abort source that sets it is
available instantly
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods, which admit permits on user
request (the front door), and maybe_admit_waiters(), which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged over time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.
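A toy Python model of the idea, assuming a count-only resource where every
read holds exactly one unit. `ToySemaphore`, `mark_inactive()` and the
`_evicting` flag are illustrative names, not the real semaphore code.

```python
from collections import deque


class ToySemaphore:
    """Toy model: one shared admission check and a recursion guard."""

    def __init__(self, count: int):
        self.available = count
        self.inactive = []          # evictable readers, each holding 1 unit
        self.waiters = deque()      # names of queued reads
        self.admitted = []
        self._evicting = False      # guards against evict -> admit -> evict

    def _can_admit_read(self) -> bool:
        # Single admission condition shared by the front and the back door.
        # Inactive readers count as reclaimable resources.
        return self.available + len(self.inactive) >= 1

    def _signal(self, units: int) -> None:
        # Resources returned: this is what triggers the back door.
        self.available += units
        self.maybe_admit_waiters()

    def _admit(self, name: str) -> None:
        if self.available < 1:
            self._evicting = True
            try:
                self.inactive.pop(0)    # evict the oldest inactive reader
                self._signal(1)         # its unit comes back to the semaphore
            finally:
                self._evicting = False
        self.available -= 1
        self.admitted.append(name)

    def obtain_permit(self, name: str) -> None:      # front door
        if not self.waiters and self._can_admit_read():
            self._admit(name)
        else:
            self.waiters.append(name)

    def mark_inactive(self, name: str) -> None:
        # An admitted read becomes evictable while still holding its unit.
        self.admitted.remove(name)
        self.inactive.append(name)
        self.maybe_admit_waiters()

    def maybe_admit_waiters(self) -> None:           # back door
        if self._evicting:
            return      # already inside an eviction, do not recurse
        while self.waiters and self._can_admit_read():
            self._admit(self.waiters.popleft())
```

With one unit, admitting "a", queueing "b" and then marking "a" inactive
results in "a" being evicted and "b" admitted through the back door.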
Fixes: #11770
Closes #11784
The loop terminates when we run out of keys. There are extra
conditions such as for short read or page limit, but these are
truly discovered during the loop and qualify as special
conditions, if you squint enough.
It was just a crutch for do_with(), and now can be replaced with
ordinary coroutine-protected variables. The member names were renamed
to the final names they were assigned within the do_with().
Indentation and "infinite" for-loop left for later cleanup.
Note the last check for a utils::result<> failure is no longer needed,
since the previous checks for failure resulted in an immediate co_return
rather than propagating the failure into a variable as with continuations.
The lambda coroutine is stabilized with the new seastar::coroutine::lambda
facility.
This series is a step towards non-LRU cache algorithms.
Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.
However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.
This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.
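A minimal Python sketch of the difference, with hypothetical `Cache` and
`Item` classes; `close()` stands in for a C++ destructor and `total_size`
for whatever cache-wide metadata a smarter algorithm would maintain.

```python
class Item:
    def __init__(self, key: str, size: int):
        self.key = key
        self.size = size
        self.linked = False

    def close(self) -> None:
        # Stand-in for the destructor: the item must already be unlinked.
        if self.linked:
            raise RuntimeError("item destroyed while still linked")


class Cache:
    def __init__(self) -> None:
        self._items = {}        # key -> item
        self.total_size = 0     # cache-wide metadata an item cannot reach

    def add(self, item: Item) -> None:
        self._items[item.key] = item
        item.linked = True
        self.total_size += item.size

    def remove(self, item: Item) -> None:
        # Explicit unlink through the cache keeps the metadata accurate.
        del self._items[item.key]
        item.linked = False
        self.total_size -= item.size
```

Calling `close()` on an item that is still linked raises, which mirrors the
assert added at the end of the series.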
Closes #11716
* github.com:scylladb/scylladb:
utils: lru: assert that evictables are unlinked before destruction
utils: lru: remove unlink_from_lru()
cache: make all cache unlinks explicit
Previous patches introduce the assumption that evictables are manually unlinked
before destruction, to allow for correct bookkeeping within the cache.
This assert ensures that this assumption holds.
This is particularly important because the switch from automatic to explicit
unlinking had to be done manually. Destructor calls are invisible, so it's
possible that we have missed some automatic destruction site.
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Fix that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.
However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.
In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
This patch has two reproducing tests for issue #7432, which are cases
where a paged query with a restriction backed by a secondary-index
returns pages larger than the desired page size. Because these tests
reproduce a still-open bug, they are both marked "xfail". Both tests
pass on Cassandra.
The two tests involve quite dissimilar cases - one involves requesting
an entire partition (and Scylla forgetting to page through it), and the
other involves GROUP BY - so I am not sure these two bugs even have the
same underlying cause. But they were both reported in #7432, so let's
have reproducers for both.
Refs #7432
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11586
Unlike the current method (which uses consume()), this will also adjust the
initial resources, adjusting the semaphore as if it was created with the
reduced amount of resources in the first place. This fixes the confusing
90/100 count resources seen in diagnostics dump outputs.
Allow changing the total or initial resources the semaphore has.
After calling `set_resources()` the semaphore looks as if it had been
created with the specified amount of resources in the first place.
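A small Python sketch of why the two approaches look different in
diagnostics, assuming a count-only resource. `ResourceSemaphore` and its
`diagnostics()` format are made up for illustration.

```python
class ResourceSemaphore:
    """Toy semaphore tracking available and initial count resources."""

    def __init__(self, count: int):
        self.initial = count
        self.available = count

    def consume(self, n: int) -> None:
        # Old approach: only the available amount shrinks.
        self.available -= n

    def set_resources(self, count: int) -> None:
        # New approach: adjust both, as if created with `count` from the start.
        diff = count - self.initial
        self.initial = count
        self.available += diff

    def diagnostics(self) -> str:
        return f"{self.available}/{self.initial}"
```

Reducing a semaphore of 100 by `consume(10)` reports `90/100`, whereas
`set_resources(90)` reports `90/90`.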
Modern (as of Fedora 37) pytest has the "-sP" flags in the Python command
line, as found in /usr/bin/pytest. This means it will reject the
site-packages directory, where we install the Scylla Python driver. This
causes all the tests to fail.
Work around it by supplying an alternative pytest script that does not
have this change.
Closes #11764
The tests in test_permissions.py use the new_session() utility function
to create a new connection with a different logged-in user.
It models the new connection on the existing one, but incorrectly
assumed that the connection is NOT ssl. This made the test fail
when cql-pytest/run is passed the "--ssl" option.
In this patch we correctly infer the is_ssl state from the existing
cql fixture, instead of assuming it is false. After this patch,
"cql-pytest/run --ssl" works as expected for this test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11742
The test verifies that a row which participated in earlier merge, and
its cells lost on the timestamp check, behaves exactly like an empty
row and can accept any mutation.
This wasn't the case in versions prior to f006acc.
Closes #11787
It turned out that the boost/tracing test is not run because its name doesn't match the *_test.cc pattern. While fixing that, it turned out that the test cannot even start, because it uses future<>.get() calls outside of a seastar::thread context. While patching this place the trace-backend registry was removed for simplicity. A few more cleanups were made while at it.
Closes #11779
* github.com:scylladb/scylladb:
tracing: Wire tracing test back
tracing: Indentation fix after previous patch
tracing: Move test into thread
tracing: Dismantle trace-backend registry
tracing: Use class-registrator for backends
tracing: Add constraint to trace_state::begin()
tracing: Remove copy-n-paste comments from test
tracing: Outline may_create_new_session
Fix https://github.com/scylladb/scylladb/issues/11773
This PR fixes the notes by removing repetition and improving the clarity of the notes on the OS Support page.
In addition, "Scylla" was replaced with "ScyllaDB" on related pages.
Closes #11783
* github.com:scylladb/scylladb:
doc: replace Scylla with ScyllaDB
doc: add a comment to remove in future versions any information that refers to previous releases
doc: rewrite the notes to improve clarity
doc: remove the reperitions from the notes
Following up on 69aea59d97, which added fencing support
for simple reads and writes, this series does the same for the
complex ops:
- partition scan
- counter mutation
- paxos
With this done, the coordinator knows about all in-flight requests and
can delay topology changes until they are retired.
Closes #11296
* github.com:scylladb/scylladb:
storage_proxy: hold effective_replication_map for the duration of a paxos transaction
storage_proxy: move paxos_response_handler class to .cc file
storage_proxy: deinline paxos_response_handler constructor/destructor
storage_proxy: use consistent effective_replication_map for counter coordinator
storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
storage_proxy: query_singular: use fewer smart pointers
storage_proxy: query_singular: simplify lambda captures
locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
storage_proxy: use consistent token_metadata with rest of singular read
A collection of small cleanups, and a bug fix.
Closes #11750
* github.com:scylladb/scylladb:
dirty_memory_manager: move region_group data members to top-of-class
dirty_memory_manager: update region_group comment
dirty_memory_manager: remove outdated friend
dirty_memory_manager: fold region_group::push_back() into its caller
dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available
dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available
dirty_memory_manager: tidy up region_group::execution_permitted()
dirty_memory_manager: reindent region_group::release_queued_allocations()
dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine
dirty_memory_manager: move region_group::_releaser after _shutdown_requested
dirty_memory_manager: move region_group queued allocation releasing into a function
dirty_memory_manager: fold allocation_queue into region_group
dirty_memory_manager: don't ignore timeout in allocation_queue::push_back()
The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code.
In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were unclear and enforced only at runtime are now encoded in the type system.
The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC.
Closes #11763
* github.com:scylladb/scylladb:
service/raft: raft_address_map: get rid of `is_linked` checks
service/raft: raft_address_map: get rid of `to_list_iterator`
service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr`
service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
service/raft: raft_address_map: don't use intrusive set for timestamped entries
service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`
The boost/tracing test is not run, because test.py boost suite collects
tests that match *_test.cc pattern. The tracing one apparently doesn't.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test calls future<>.get()'s in its lambda which is only allowed in
seastar threads. It's not stepped upon because (surprise, surprise) this
test is not run at all. Next patch fixes it.
Meanwhile, the fix is to use cql_env_thread for the whole lambda,
which runs it in a seastar::async() context
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code uses its own class registration engine, but there's a
generic one in utils/ that applies here too. In fact, the tracing
backend registry is just a transparent wrapper over the generic one :\
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's a private method used purely in tracing.cc, no need in compiling it
every time the header is met somewhere else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The owner of `expiring_entry_ptr` was almost uniquely its corresponding
`timestamped_entry`; it would delete the expiring entry when it itself got
destroyed. There was one call to explicit `unlink_and_dispose`, which
made the picture unclear.
Make the picture clear: `timestamped_entry` now contains a `unique_ptr`
to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with
`_lru_entry = nullptr`.
We can also get rid of the back-reference from `expiring_entry_ptr` to
`timestamped_entry`.
The code becomes shorter and simpler.
If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.
In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.
Closes #11704
This adds support for stripped binary installation to the relocatable package.
After this change, the scylla and unified packages contain only stripped binaries,
and a new "scylla-debuginfo" package carries the debug symbols.
For the scylla-debuginfo package, the install.sh script extracts the debug symbols
to /opt/scylladb/<dir>/.debug.
Note that we need to keep unstripped version of relocatable package for rpm/deb,
otherwise rpmbuild/debuild fails to create debug symbol package.
This version is renamed to scylla-unstripped-$version-$release.$arch.tar.gz.
See #8918
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #9005
- Start n1, n2, n3 (127.0.0.3)
- Stop n3
- Change ip address of n3 to 127.0.0.33 and restart n3
- Decommission n3
- Start new node n4
The node n4 will learn from the gossip entry for 127.0.0.3 that node
127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of
the ring.
This patch prevents this by checking the status for the host id on all
the entries. If any of the entries shows the node with the host id is in
LEFT status, refuse to put the node in NORMAL status.
Fixes #11355
Closes #11361
Requests like `col IN NULL` used to cause
an error - Invalid null value for column col.
We would like to allow NULLs everywhere.
When a NULL occurs on either side
of a binary operator, the whole operation
should just evaluate to NULL.
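The intended semantics can be sketched in Python as follows, with `None`
standing in for CQL NULL. `eval_binary()` and `op_in()` are hypothetical
helpers, not the actual expression evaluation code.

```python
import operator
from typing import Any, Callable, Optional


def eval_binary(op: Callable[[Any, Any], Any],
                lhs: Optional[Any], rhs: Optional[Any]) -> Optional[Any]:
    """Evaluate `lhs <op> rhs`, with None standing in for CQL NULL."""
    if lhs is None or rhs is None:
        return None     # NULL on either side makes the whole result NULL
    return op(lhs, rhs)


def op_in(value: Any, candidates: list) -> bool:
    """IN operator over a list of candidate values."""
    return value in candidates
```

Under this model `eval_binary(op_in, 3, None)` yields `None` instead of
raising an error, and so does `eval_binary(operator.eq, None, 1)`.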
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes #11775
Luckily, all topology calculations are done in get_paxos_participants(),
so all we have to do is hold the effective_replication_map for the
duration of the transaction, and pass it to get_paxos_participants().
This ensures that the coordinator knows about all in-flight requests
and can fence them from topology changes.
Hold the effective_replication_map while talking to the counter leader,
to allow for fencing in the future. The code is somewhat awkward because
the API allows for multiple keyspaces to be in use.
The error code generation, already broken as it doesn't use the correct
table, continues to be broken in that it doesn't use the correct
effective_replication_map, for the same reason.
query_partition_key_range captures a token_metadata_ptr and uses
it consistently in sequential calls to query_partition_key_range_concurrent
(via tail recursion), but each invocation of
query_partition_key_range_concurrent captures its own
effective_replication_map_ptr. Since these are captured at different times,
they can be inconsistent after the first iteration.
Fix by capturing it once in the caller and propagating it everywhere.
Derive the token_metadata from the effective_replication_map rather than
getting it independently. Not a real bug since these were in the same
continuation, but safer this way.
Capture token_metadata by reference since we're protecting it with
the mighty effective_replication_map_ptr. This saves a few instructions
to manage smart pointers.
The lambdas in query_singular do not outlive the enclosing coroutine,
so they can capture everything by reference. This simplifies life
for a future update of the lambda, since there's one thing less to
worry about.
token_metadata is protected by holders of an effective_replication_map_ptr,
so it's just as safe and less expensive for them to obtain a reference
to token_metadata rather than a smart pointer, so give them that option with
a new accessor.
query_singular() uses get_token_metadata_ptr() and later, in
get_read_executor(), captures the effective_replication_map(). This
isn't a bug, since the two are captured in the same continuation and
are therefore consistent, but a way to ensure it stays so is to capture
the effective_replication_map earlier and derive the token_metadata from
it.
Nicer and faster. We have a rare case where we hold a lock for the duration
of a call but we don't want to hold it until the future it returns is
resolved, so we have to resort to a minor trick.
The function that is attached to _releaser depends on
_shutdown_requested. There is currently no use-before-init, since
the function (release_queued_allocations) starts with a yield(),
moving the first use to until after the initialization.
Since I want to get rid of the yield, reorder the fields so that
they are initialized in the right order.
Today, we're completely blind about the progress of view updating
on Staging files. We don't know how long it will take, nor how much
progress we've made.
This patch adds visibility with a new metric that will inform
the number of bytes to be processed from Staging files.
Before any work is done, the metric tells us the total size to be
processed. As view updating progresses, the metric value is
expected to decrease, unless work is being produced faster than
we can consume it.
We're piggybacking on sstables::read_monitor, which allows the
progress metric to be updated whenever the SSTable reader makes
progress.
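A simplified Python sketch of how such a gauge could be driven by per-file
read monitors. `ViewUpdateBacklog` and `ReadMonitor` are hypothetical names
and do not mirror the real sstables::read_monitor interface.

```python
class ViewUpdateBacklog:
    """Hypothetical gauge: bytes of Staging data still to be processed."""

    def __init__(self) -> None:
        self.bytes_to_process = 0

    def register_sstable(self, size: int) -> "ReadMonitor":
        # New work arrives: the gauge grows by the file size.
        self.bytes_to_process += size
        return ReadMonitor(self, size)


class ReadMonitor:
    """Reports reader progress for one file back to the shared gauge."""

    def __init__(self, backlog: ViewUpdateBacklog, size: int) -> None:
        self._backlog = backlog
        self._size = size
        self._position = 0

    def on_read_progress(self, position: int) -> None:
        # Clamp to [previous position, file size] so the gauge never
        # goes up or below zero because of this file.
        position = max(self._position, min(position, self._size))
        self._backlog.bytes_to_process -= position - self._position
        self._position = position
```

The gauge starts at the total size of the registered files and drops as
readers advance; registering more files makes it grow again.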
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #11751
std::experimental::source_location is provided by <experimental/source_location>,
not <source_location>. libstdc++ 12 insists, so change the header.
Closes #11766
The EC2 instance metadata service can be busy, so let's retry connecting
at an interval, just like we do in scylla-machine-image.
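A generic retry-with-interval helper might look like the Python sketch
below. `retry_with_interval()` and its default values are assumptions for
illustration, not the code used in the scripts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_interval(fetch: Callable[[], T], attempts: int = 5,
                        interval: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, sleeping `interval` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # urllib.error.URLError and socket errors derive from OSError.
            if attempt == attempts:
                raise           # out of retries: surface the last error
            sleep(interval)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that the helper can be exercised
without actually waiting.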
Fixes #10250
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #11688
`timestamped_entry` had two fields:
```
optional<clock_time_point> _last_accessed
expiring_entry_ptr* _lru_entry
```
The `raft_address_map` data structure maintained an invariant:
`_last_accessed` is set if and only if `_lru_entry` is not null.
This invariant could be broken for a while when constructing an expiring
`timestamped_entry`: the constructor was given an `expiring = true`
flag, which set the `_last_accessed` field; this was redundant, because
immediately after a corresponding `expiring_entry_ptr` was constructed
which again reset the `_last_accessed` field and set `_lru_entry`.
The code becomes simpler and shorter when we move `_last_accessed` field
into `expiring_entry_ptr`. The invariant is now guaranteed by the type
system: `_last_accessed` is no longer `optional`.
Intrusive data structures are harder to reason about. In
`raft_address_map` there's a good reason to use an intrusive list for
storing `expiring_entry_ptr`s: we move the entries around in the list
(when their expiration times change) but we want for the objects to stay
in place because `timestamped_entry`s may point to them (although we
could simply update the pointers using the existing back-reference...)
However, there's not much reason to store `timestamped_entry` in an
intrusive set. It was basically used in one place: when dropping expired
entries, we iterate over the list of `expiring_entry_ptr`s and we want
to drop the corresponding `timestamped_entry` as well, which is easy
when we have a pointer to the entry and it's a member of an intrusive
container. But we can deal with it when using non-intrusive containers:
just `find` the element in the container to erase it.
The code becomes shorter with this change.
I also use a map instead of a set because we need to modify the
`timestamped_entry` which wouldn't be possible if it was used as an
`unordered_set` key. In fact using map here makes more sense: we were
using the intrusive set similarly to a map anyway because all lookups
were performed using the `_id` field of `timestamped_entry` (now the
field was moved outside the struct, it's used as the map's key).
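The resulting shape can be approximated in Python as below: a plain map
keyed by id, plus a recency-ordered structure that holds the last-access
time only for expiring entries. `AddressMap` and its methods are
illustrative and assume a monotonic clock passed in by the caller.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    addr: str
    expiring: bool


class AddressMap:
    def __init__(self, ttl: float) -> None:
        self._ttl = ttl
        self._entries: Dict[int, Entry] = {}    # id -> entry, id is the map key
        # id -> last access time, oldest first; only expiring entries live here,
        # so "has a last-access time" and "is expiring" cannot disagree.
        self._lru: "OrderedDict[int, float]" = OrderedDict()

    def set(self, id_: int, addr: str, now: float, expiring: bool) -> None:
        self._entries[id_] = Entry(addr, expiring)
        self._lru.pop(id_, None)
        if expiring:
            self._lru[id_] = now

    def find(self, id_: int, now: float) -> Optional[str]:
        entry = self._entries.get(id_)
        if entry is None:
            return None
        if entry.expiring:
            self._lru[id_] = now
            self._lru.move_to_end(id_)      # refreshed entries become newest
        return entry.addr

    def drop_expired(self, now: float) -> None:
        while self._lru:
            id_, last = next(iter(self._lru.items()))
            if now - last <= self._ttl:
                break
            del self._lru[id_]
            del self._entries[id_]  # lookup by key replaces the intrusive unlink
```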
When code was moved to the new directory, a bug was reintroduced with `ssl` local hiding `ssl` module. Fix again.
Closes #11755
* github.com:scylladb/scylladb:
test.py: improve pylint score for conftest
test.py: fix variable name collision with ssl
fmt 9 deprecates automatic fallback to std::ostream formatting.
We should migrate, but in order to do so incrementally, first enable
the deprecated fallback so the code continues to compile.
Closes #11768
Abseil is not under our control, so if a header generates a
warning, we can do nothing about it. So far this wasn't a problem,
but under clang 15 it spews a harmless deprecation warning. Silence
the warning by treating the header as a system header (which it is,
for us).
Closes #11767
- Start a cluster with n1, n2, n3
- Full cluster shutdown n1, n2, n3
- Start n1, n2 and keep n3 as shutdown
- Add n4
Node n4 will learn the ip and uuid of n3 but it does not know the gossip
status of n3 since gossip status is published only by the node itself.
After full cluster shutdown, gossip status of n3 will not be present
until n3 is restarted again. So n4 will not think n3 is part of the
ring.
In this case, it is better to reject the bootstrap.
With this patch, one would see the following when adding n4:
```
ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed:
std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it
before adding new node to the cluster.)
```
The user needs to perform either of the following before adding a new node:
1) Run nodetool removenode to remove n3
2) Restart n3 to get it back to the cluster
Fixes #6088
Closes #11425
Refactor the existing upgrade tests, extracting some common functionality to
helper functions.
Add more tests. They are checking the upgrade procedure and recovery from
failure in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Add some new functionalities to `ScyllaRESTAPIClient` like injecting errors and
obtaining gossip generation numbers.
Extend the removenode function to allow ignoring dead nodes.
Improve checking for CQL availability when starting nodes to speed up testing.
Closes #11725
* github.com:scylladb/scylladb:
test/topology_raft_disabled: more Raft upgrade tests
test/topology_raft_disabled: refactor `test_raft_upgrade`
test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
test/pylib: rest_client: propagate errors from put_json
test/pylib: fix some type hints
test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
In preparation for supporting IP address changes of Raft Group 0:
1) Always use start_server_for_group0() to start a server for group 0.
This will provide a single extension point when it's necessary to
prompt raft_address_map with gossip data.
2) Don't use raft::server_address in discovery, since going forward
discovery won't store raft::server_address. By the same token, stop
using discovery::peer_set anywhere outside discovery (for persistence),
use a peer_list instead, which is easier to marshal.
Closes #11676
* github.com:scylladb/scylladb:
raft: (discovery) do not use raft::server_address to carry IP data
raft: (group0) API refactoring to avoid raft::server_address
raft: rename group0_upgrade.hh to group0_fwd.hh
raft: (group0) move the code around
raft: (discovery) persist a list of discovered peers, not a set
raft: (group0) always start group0 using start_server_for_group0()
- Start n1, n2, n3
- Apply network nemesis as below:
+ Block gossip traffic going from nodes 1 and 2 to node 3.
+ All the other rpc traffic flows normally, including gossip traffic
from node 3 to nodes 1 and 2 and responses to node_ops commands from
nodes 1 and 2 to node 3.
- Decommission n3
Currently, the decommission will be successful because all the network
traffic is ok. But n3 could not advertise status STATUS_LEFT to the rest
of the cluster due to the network nemesis applied. As a result, n1 and
n2 could not move n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will
stay in DL forever.
I know why the node stays DL forever. The problem is that with
node_ops_cmd based node operation, we still rely on the gossip status of
STATUS_LEFT from the node being decommissioned to notify other nodes
this node has finished decommission and can be moved from STATUS_LEAVING
to STATUS_LEFT.
This patch fixes it by checking gossip liveness before running
decommission. Reject if required peer nodes are down.
With the fix, the decommission of n3 will fail like this:
```
$ nodetool decommission -p 7300
nodetool: Scylla API server HTTP POST to URL
'/storage_service/decommission' failed: std::runtime_error
(decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected
decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2,
127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1})
```
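The pre-flight check can be sketched in Python roughly as follows.
`check_decommission_allowed()` and the wording of its error are
hypothetical; the real check lives in the C++ storage service.

```python
from typing import Iterable, List


def check_decommission_allowed(sync_nodes: Iterable[str],
                               ignore_nodes: Iterable[str],
                               live_nodes: Iterable[str]) -> List[str]:
    """Return the nodes to sync with, or raise if a required peer is down."""
    ignored = set(ignore_nodes)
    live = set(live_nodes)
    required = [n for n in sync_nodes if n not in ignored]
    nodes_down = sorted(n for n in required if n not in live)
    if nodes_down:
        # Reject up front instead of leaving the node stuck in DL later.
        raise RuntimeError(
            f"Rejected decommission operation, nodes_down={nodes_down}")
    return required
```

Checking liveness before starting means the operation fails fast, while
the node is still a normal member of the ring.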
Fixes #11302
Closes #11362
"
There's one via the database's compaction manager and large data handler
sub-services. Both need system keyspace to put their info into, but the
latter needs database naturally via query_processor->storage_proxy link.
The solution is to make c.m. | l.d.h. -> sys.ks. dependency be weak with
the help of shared_from_this(), described in detail in the patch #2 commit
message.
As a (not-that-)side effect this set removes a bunch of global qctx
calls.
refs: #11684 (this set seems to increase the chance of stepping on it)
"
* 'br-sysks-async-users' of https://github.com/xemul/scylla:
large_data_handler: Use local system_keyspace to update entries
system_keyspace: De-static compaction history update
compaction_manager: Relax history paths
database: Plug/unplug system_keyspace
system_keyspace: Add .shutdown() method
This patch adds a couple of simple tests for the USE statement: that
without USE one cannot create a table without explicitly specifying
a keyspace name, and with USE, it is possible.
Beyond testing this specific feature, this patch also serves as an
example of how to write more tests that need to control the effective USE
setting. Specifically, it adds a "new_cql" function that can be used to
create a new connection with a fresh USE setting. This is necessary
in such tests, because if multiple tests use the same cql fixture
and its single connection, they will share their USE setting and there
is no way to undo or reset it after being set.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11741
Some good news finally. The saved dc/rack info about the ring is now
only loaded once on start. So the whole cache is not needed and the
loading code in storage_service can be greatly simplified
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Because of snitch ex-dependencies some bits on topology were initialized
with nasty post-start calls. Now it all can be removed and the initial
topology information can be provided by topology::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Snitch code doesn't need anything to start working, but it is needed by
the low-level token-metadata, so move the snitch to start early (and to
stop late)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No users of it left. Although the gossiper->system_keyspace dependency is
not needed either, keep it alive because gossiper still updates system
keyspace with feature masks, so chances are it will be reactivated some
time later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It doesn't need gossiper any longer. This change will allow starting
snitch early by the next patch, and eventually improving the
token-metadata start-up sequence
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
They are in fact such, but weren't marked as const before because they
wanted to talk to non-const gossiper and system_keyspaces methods and
updated snitch internal caches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patches and merged branches snitch no longer needs its
method that gets dc/rack for endpoints from gossiper, system keyspace
and its internal caches.
This cuts the last but the biggest snitch->gossiper dependency.
Also this removes implicit snitch->system_keyspace dependency loop
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the code out there now calls snitch::get_rack() to get rack for the
local node. For other nodes the topology::get_rack(endpoint) is used.
Since now the topology is properly populated with endpoints, it can
finally be patched to stop using snitch and get rack from its internal
collections
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A special-purpose add-on to the previous patch.
When messaging service accepts a new connection it sometimes may want to
drop it early based on whether the client is from the same dc/rack or
not. However, at this stage the information might not yet have had a
chance to be spread via storage service pending-tokens updating paths,
so here's one more place -- the on_alive() callback
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patches added the concept of pending endpoints in the topology,
this patch populates endpoints in this state.
Also, the set_pending_ranges() is patched to make sure that the tokens
added for the endpoint(s) are added for something that's known by the
topology. Same check exists in update_normal_tokens()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Startup code needs to know the dc/rack of the local node early, way
before the node starts any communication with the ring. This information is
available when snitch activates, but it starts _after_ token-metadata,
so the only way to put local dc/rack in topology is via a startup-time
special API call. This new init_local_endpoint() is temporary and will
be removed later in this set
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays the topology object only keeps info about nodes that are normal
members of the ring. Nodes that are joining or bootstrapping or leaving
are out of it. However, one of the goals of this patchset is to make
topology object provide dc/rack info for _all_ nodes, even those in
transitional state.
The introduced _pending_locations is about to hold the dc/rack info for
transitional endpoints. When a node becomes a member of the ring it is moved
from pending (if it's there) to current locations, when it leaves the
ring it's moved back to pending.
For now the new collection is just added and the add/remove/get API is
extended to maintain it, but it's not really populated. It will come in
the next patch
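
The pending/current bookkeeping described above can be sketched roughly as
follows. This is an illustrative Python model, not the actual C++ topology
class; the method names add_pending(), join_ring() and leave_ring() are
hypothetical, only _current_locations and _pending_locations come from the
patch description.

```python
class Topology:
    """Toy model: endpoint -> (dc, rack) for ring members and pending nodes."""

    def __init__(self):
        self._current_locations = {}
        self._pending_locations = {}

    def add_pending(self, ep, dc, rack):
        # Hypothetical helper: remember dc/rack of a node not yet in the ring.
        self._pending_locations[ep] = (dc, rack)

    def join_ring(self, ep, location=None):
        # Move from pending (if it's there) to current locations.
        loc = self._pending_locations.pop(ep, location)
        if loc is None:
            raise KeyError(f"no dc/rack known for endpoint {ep}")
        self._current_locations[ep] = loc

    def leave_ring(self, ep):
        # A node leaving the ring is moved back to pending.
        self._pending_locations[ep] = self._current_locations.pop(ep)

    def get_location(self, ep):
        if ep in self._current_locations:
            return self._current_locations[ep]
        if ep in self._pending_locations:
            return self._pending_locations[ep]
        # A descriptive error instead of a bare out-of-range exception.
        raise KeyError(f"endpoint {ep} is not known to topology")
```

With this shape get_location() answers for an endpoint in either collection,
which is the stated goal of the patchset.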
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently if topology.get_location() doesn't find an entry in its
collection(s) it throws a standard out-of-range exception which is very
hard to debug.
Also, next patches will extend this method, and the if
(_current_locations.contains()) check introduced here makes this future
change look nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need to provide some early-start data for topology.
The standard way of doing it is via service config, so this patch adds
one. The new config is empty in this patch, to be filled later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Copying of token_metadata_impl is heavy operation and it's performed
internally with the help of the dedicated clone_async() method. This
method, in turn, doesn't copy the whole object in its copy-ctor, but
rather default-initializes it and carries the remaining fields later.
Having said that, the standard copy-ctor is better to be made private
and, for the sake of being more explicit, marked as shallow-copy-ctor
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The property-file snitch gossips listen_address as internal-IP state. To
get this value it gets it from snitch->gossiper->messaging_service
chain. This change provides the needed value via config thus cutting yet
another snitch->gossiper dependency and allowing gossiper not to export
messaging service in the future
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just keep some conditions from if()s as local
variables. This is the churn-reducing preparation for one of the
next patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When an endpoint is not in the ring the snitch/get_{rack|datacenter} API
still returns some value. The value is, in fact, the default one,
because this is how snitch resolves it -- when it cannot find a node in
gossiper and system keyspace it just returns defaults.
When this happens the API should better return some error (bad param?)
but there's a bug in nodetool -- when the 'status' command collects info
about the ring it first collects the endpoints, then gets status for
each. If between getting an endpoint and getting its status the endpoint
disappears, the API would fail, but nodetool doesn't handle it.
Next patches will make .get_rack/_dc calls use in-topology collections
that don't fall-back to default values if the entry is not found in it,
so prepare the API in advance to return back defaults.
refs: #11706
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.
Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others - would not.
So switch to an own data structure for the purposes of discovery,
discovery_peer, which contains a pair ip, raft server id.
Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
The l._d._h.'s way to update system keyspace is not like in other code.
Instead of a dedicated helper on the system_keyspace's side it executes
the insertion query directly with the help of qctx.
Now when the l._d._h. has the weak system keyspace reference it can
execute queries on _it_ rather than on the qctx.
Just like in previous patch, it needs to keep the sys._k.s. weak
reference alive until the query's future resolves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction manager now has the weak reference on the system keyspace
object and can use it to update its stats. It only needs to take care
and keep the shared pointer until the respective future resolves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a virtual method on table_state to update the entry in system
keyspace. It's an overkill to facilitate tests that don't want this.
With new system_keyspace weak referencing it can be made simpler by
moving the updating call to the compaction_manager itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execute local requests via
query_processor. The latter needs the former via compaction manager and
large data handler, database depends on both and these too need to
insert their entries into system keyspace.
To cut this loop the compaction manager and large data handler both get
a weak reference on the system keyspace. Once system keyspace starts it
activates this reference via the database call. When system keyspace is
shut down on stop, it deactivates the reference.
Technically the weak reference is implemented by marking the system_k.s.
object as async_sharded_service, and the "reference" in question is the
shared_from_this() pointer. When compaction manager or large data
handler need to update a system keyspace's table, they both hold an
extra reference on the system keyspace until the entry is committed,
thus making sure that sys._k.s. doesn't stop from under their feet. At
the same time, unplugging the reference on shutdown makes sure that no
new entry updates will appear and the system_k.s. will eventually be
released.
It's not a C++ classical reference, because system_keyspace starts after
and stops before database.
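
A rough sketch of the plug/unplug idea, written in Python for brevity. The
real code relies on async_sharded_service and shared_from_this(); the method
names plug_system_keyspace(), unplug_system_keyspace() and update_history()
below are hypothetical and only illustrate the pattern.

```python
class SystemKeyspace:
    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)


class CompactionManager:
    def __init__(self):
        # The "weak reference": empty until the system keyspace starts.
        self._sys_ks = None

    def plug_system_keyspace(self, sys_ks):
        # Called once the system keyspace has started.
        self._sys_ks = sys_ks

    def unplug_system_keyspace(self):
        # Called when the system keyspace stops; no new updates after this.
        self._sys_ks = None

    def update_history(self, row):
        # Take a local strong reference for the duration of the update, so
        # an unplug in between cannot pull the object from under our feet.
        sys_ks = self._sys_ks
        if sys_ks is None:
            return False
        sys_ks.insert(row)
        return True
```

Updates issued before the reference is plugged, or after it is unplugged,
are simply skipped in this sketch.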
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Replace raft::server_address in a few raft_group0 API
calls with raft::server_id.
These API calls do not need raft::server_address, i.e. the
address part, anyway, and since going forward raft::server_address
will not contain the IP address, stop using it in these calls.
This is a beginning of a multi-patch series to reduce
raft::server_address usage to core raft only.
Move load/store functions for discovered peers up,
since going forward they'll be used in start_server_for_group0(),
to extend the address map prior to start (and thus speed up
bootstrap).
We plan to reuse the discovery table to store the peers
after discovery is over, so load/store API must be generalized
to use outside discovery. This includes sending
the list of persisted peers over to a new member of the cluster.
When IP addresses are removed from raft::configuration, it's key
to initialize raft_address_map with IP addresses before we start group
0. Best place to put this initialization is start_server_for_group0(),
so make sure all paths which create group 0 use
start_server_for_group0().
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Added some new functionalities to `ScyllaRESTAPIClient` like injecting
errors and obtaining gossip generation numbers.
Many services out there have one (sometimes called .drain()) that's
called early on stop and that's responsible for preparing the service for
stop -- aborting pending/in-flight fibers and alike.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The `removenode` operation normally requires the removing node to
contact every node in the cluster except the one that is being removed.
But if more than 1 node is down it's possible to specify a list of nodes
to ignore for the operation; the `/storage_service/remove_node` endpoint
accepts an `ignore_nodes` param which is a comma-separated list of IPs.
Extend `ScyllaRESTAPIClient`, `ScyllaClusterManager` and `ManagerClient`
so it's possible to pass the list of ignored nodes.
We also modify the `/cluster/remove-node` Manager endpoint to use
`put_json` instead of `get_text` and pass all parameters except the
initiator IP (the IP of the node who coordinates the `removenode`
operation) through JSON. This simplifies the URL greatly (it was already
messy with 3 parameters) and more closely resembles Scylla's endpoint.
Change variable name to avoid collision with module ssl.
This bug was reintroduced when moving code.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This method just jumps into topology.has_endpoint(). The change is
for consistency with other users of it and as a preparation for
topology.has_endpoint() future enhancements
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous, it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from address_map.
In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.
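
The "mark as expiring" behaviour could look roughly like the Python model
below. It is only a sketch: the ttl parameter and the explicit now argument
are assumptions made to keep the example deterministic, and the class is not
the real raft address map.

```python
class AddressMap:
    def __init__(self, ttl):
        self._ttl = ttl          # assumed grace period for expiring entries
        self._entries = {}       # server_id -> [ip, expires_at or None]

    def add_server(self, server_id, ip):
        self._entries[server_id] = [ip, None]

    def remove_server(self, server_id, now):
        # Mark the entry as expiring instead of dropping it synchronously.
        if server_id in self._entries:
            self._entries[server_id][1] = now + self._ttl

    def find(self, server_id, now):
        entry = self._entries.get(server_id)
        if entry is None:
            return None
        ip, expires_at = entry
        if expires_at is not None and now >= expires_at:
            # The grace period is over, drop the entry for real.
            del self._entries[server_id]
            return None
        return ip
```

A send scheduled before remove_server() can still resolve the address as
long as it runs within the grace period.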
Fixes: #11228
Closes#11748
It's nicer to see a function release_queued_allocations() in a stack
trace rather than start_releaser(), which has done its work during
initialization.
allocation_queue was extracted out of region_group in
71493c253 and 34d532236. But now that region_group refactoring is
mostly done, we can move them back in. allocation_queue has just one
user and is not useful standalone.
In 34d5322368 ("dirty_memory_manager: move more allocation_queue
functions out of region_group") we accidentally started ignoring the
timeout parameter. Fix that.
No release branch has the breakage.
The `add_entry` and `modify_config` methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a `seastar::rpc::closed_error` would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling `modify_config` in `raft_group0`,
which sometimes broke the `removenode` command.
An `intermittent_connection_error` exception was added earlier to
solve a similar problem with the `read_barrier` method. In this patch it
is renamed to `transport_error`, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the destination and whether any mutations were made.
In case of `read_barrier` it does not matter and we just retry, in case
of `add_entry` and `modify_config` we cannot retry because of possible mutations,
so we convert this exception to `commit_status_unknown`, which
the client has to handle.
Explicit comments have also been added to `raft::server` methods
describing all possible exceptions.
Closes#11691
* github.com:scylladb/scylladb:
raft_group0: retry modify_config on commit_status_unknown
raft: convert raft::transport_error to raft::commit_status_unknown
modify_config can throw commit_status_unknown in case
of a leader change or when the leader is unavailable,
but the information about it has not yet reached the
current node. In this patch modify_config is run
again after some time in this case.
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.
An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.
In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.
Explicit comments have also been added to raft::server methods
describing all possible exceptions.
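
The difference between the two error-handling policies can be sketched as
follows. This is a Python illustration only: the exception classes mirror the
names in this message, while the function signatures and the max_attempts
parameter are assumptions.

```python
class TransportError(Exception):
    """Unknown whether the call reached the target node."""


class CommitStatusUnknown(Exception):
    """Unknown whether the entry was committed."""


def read_barrier(rpc_call, max_attempts=3):
    # Read barrier has no side effects that matter here: just retry.
    for attempt in range(max_attempts):
        try:
            return rpc_call()
        except TransportError:
            if attempt == max_attempts - 1:
                raise


def add_entry(rpc_call):
    # Not idempotent: a blind retry could apply the entry twice, so the
    # uncertainty is surfaced to the caller instead.
    try:
        return rpc_call()
    except TransportError as e:
        raise CommitStatusUnknown() from e
```

The caller of add_entry() or modify_config() then decides what to do with
commit_status_unknown, for example re-running modify_config after some time.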
Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env
Closes#11738
* github.com:scylladb/scylladb:
system_keyspace: Make get_{local|saved}_tokens non static
size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
cql_test_env: Keep sharded<system_keyspace> reference
size_estimate_virtual_reader: Keep system_keyspace reference
system_keyspace: Pass sys_ks argument to install_virtual_readers()
system_keyspace: Make make() non-static
distributed_loader: Pass sys_ks argument to init_system_keyspace()
system_keyspace: Remove dangling forward declaration
do_with() is a sure indicator for coroutinization, since it adds
an allocation (like the coroutine does with its frame). Therefore
translating a function with do_with is at least a break-even, and
usually a win since other continuations no longer allocate.
This series converts most of storage_proxy's functions that have
do_with to coroutines. Two remain, since they are not simple
to convert (the do_with() is kept running in the background and
its future is discarded).
Individual patches favor minimal changes over final readability,
and there is a final patch that restores indentation.
The patches leave some moves from coroutine reference parameters
to the coroutine frame, this will be cleaned up in a follow-up. I wanted
this series not to touch headers to reduce rebuild times.
Closes#11683
* github.com:scylladb/scylladb:
storage_proxy: reindent after coroutinization
storage_proxy: convert handle_read_digest() to a coroutine
storage_proxy: convert handle_read_mutation_data() to a coroutine
storage_proxy: convert handle_read_data() to a coroutine
storage_proxy: convert handle_write() to a coroutine
storage_proxy: convert handle_counter_mutation() to a coroutine
storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine
In August 2022, DynamoDB added a "S3 Import" feature, which we don't yet
support - so let's document this missing feature in the compatibility
document.
Refs #11739.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11740
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.
A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly.
A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold.
Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred.
Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.
In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.
Closes#11449
Closes#11674
* github.com:scylladb/scylladb:
sstables: mx/writer: optimize large data stats members order
sstables: mx/writer: keep large data stats entry as members
db: large_data_handler: dynamically update config thresholds
utils/updateable_value: add transforming_value_updater
db/large_data_handler: cql_table_large_data_handler: record large_collections
db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
db/large_data_handler: cql_table_large_data_handler: move ctor out of line
docs: large-rows-large-cells-tables: fix typos
db/system_keyspace: add collection_elements column to system.large_cells
gms/feature_service: add large_collection_detection cluster feature
test: sstable_3_x_test: add test_sstable_too_many_collection_elements
test: lib: simple_schema: add support for optional collection column
test: lib: simple_schema: build schema in ctor body
test: lib: simple_schema: cql: define s1 as static only if built this way
db/large_data_handler: maybe_record_large_cells: consider collection_elements
db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
sstables: mx/writer: add large_data_type::elements_in_collection
db/large_data_handler: get the collection_elements_count_threshold
db/config: add compaction_collection_elements_count_warning_threshold
test: sstable_3_x_test: add test_sstable_write_large_cell
test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
What's contained in this series:
- Refactored compaction tests (and utilities) for integration with multiple groups
- The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group.
- Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue)
- Many changes in replica::table for allowing integration with multiple groups
Next:
- Introduce for_each_compaction_group() for iterating over groups wherever needed.
- Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc).
- Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups
- Introduce static option for defining number of compaction groups and implement function to map a token to its respective group.
- Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting).
Closes#11592
* github.com:scylladb/scylladb:
sstable_resharding_test: Switch to table_for_tests
replica: Move compacted_undeleted_sstables into compaction group
replica: Use correct compaction_group in try_flush_memtable_to_sstable()
replica: Make move_sstables_from_staging() robust and compaction group friendly
test: Rename column_family_for_tests to table_for_tests
sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
test: Don't expose compound set in column_family_for_tests
test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
sstable_compaction_test: Switch to table_state in compact_sstables()
sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
Instead of `test.py.log`, use:
`test.py.dev.log`
when running with `--mode dev`,
`test.py.dev-release.log`
when running with `--mode dev --mode release`,
and so on.
This is useful in Jenkins which is running test.py multiple times in
different modes; a later run would overwrite a previous run's test.py
file. With this change we can preserve the test.py files of all of these
runs.
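
A minimal sketch of the naming scheme, assuming the modes are joined with
"-" in the order they were given on the command line (the helper name
log_file_name() is hypothetical and not taken from test.py):

```python
def log_file_name(modes):
    # e.g. ["dev"] -> "test.py.dev.log"
    return "test.py." + "-".join(modes) + ".log"
```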
Closes#11678
The test was added recently and since then causes CI failures.
We suspect that it happens if the node being removed was the Raft group
0 leader. The removenode coordinator tries to send to it the
`remove_from_group0` request and fails.
A potential fix is in review: #11691.
Now all callers have system_keyspace reference at hand. This removes one
more user of the global qctx object
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method statically calls system_keyspace::get_local_tokens(). Having the
system_keyspace reference will make this method non-static
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a test_get_local_ranges() call in size-estimate reader which
will need system keyspace reference. There's no other place for tests to
get it from but the cql_test_env thing
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The s._e._v._reader::fill_buffer() method needs system keyspace to get
node's local tokens. Now it's a static method, having system_keyspace
reference will make it non-static
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The size-estimate-virtual-reader will need it, now it's available as
"this" from system_keyspace::make() method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper needs system_keyspace reference and using "this" as this
looks natural. Also this de-static-ification makes it possible to put
some sense into the invoke_on_all() call from init_system_keyspace()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Its final destination is virtual tables registration code called from
init_system_keyspace() eventually
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It doesn't match the real system_keyspace_make() definition and is in
fact not needed, as there's another "real" one in database.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes a regression introduced in 80917a1054:
"scylla_prepare: stop generating 'mode' value in perftune.yaml"
When cpuset.conf contains a "full" CPU set the negation of it from
the "full" CPU set is going to generate a zero mask as an irq_cpu_mask.
This is an illegal value that will eventually end up in the generated
perftune.yaml, which in turn will make the scylla service fail to start
until the issue is resolved.
In such a case an irq_cpu_mask must represent a "full" CPU set mimicking
a former 'MQ' mode.
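
The intended mask computation can be illustrated with the Python sketch
below. It is not the scylla_prepare code; the function name and the
representation of CPU sets as iterables of CPU indices are assumptions.

```python
def irq_cpu_mask(all_cpus, scylla_cpus):
    # Build bit masks out of CPU indices.
    full = 0
    for cpu in all_cpus:
        full |= 1 << cpu
    compute = 0
    for cpu in scylla_cpus:
        compute |= 1 << cpu
    # IRQs go to the CPUs that are not given to Scylla ...
    mask = full & ~compute
    # ... unless none are left: a zero mask is illegal, so fall back to
    # the full CPU set, mimicking the former 'MQ' mode.
    return mask if mask != 0 else full
```

For example, with 2 vCPUs both assigned to Scylla the result is the full
mask 0x3 rather than 0x0.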
Fixes#11701
Tested:
- Manually on a 2 vCPU VM in an 'auto-selection' mode.
- Manually on a large VM (48 vCPUs) with an 'MQ' manually
enforced.
Message-Id: <20221004004237.2961246-1-vladz@scylladb.com>
Currently doing `CONTAINS NULL` or `CONTAINS KEY NULL` on a collection evaluates to `true`.
This is a really weird behaviour. Collections can't contain `NULL`, even if they wanted to.
Any operation that has a NULL on either side should evaluate to `NULL`, which is interpreted as `false`.
In Cassandra trying to do `CONTAINS NULL` causes an error.
Fixes: #10359
The only problem is that this change is not backwards compatible. Some existing code might break.
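
The new semantics can be modelled in a few lines of Python, with None
standing in for CQL NULL. This is only an illustration of the three-valued
logic described above, not the cql3 expression evaluator; the function names
are hypothetical.

```python
def contains(collection, value):
    # NULL on either side makes the whole operation NULL.
    if collection is None or value is None:
        return None
    return value in collection


def contains_key(mapping, key):
    if mapping is None or key is None:
        return None
    return key in mapping


def where_matches(result):
    # In a WHERE clause a NULL result is interpreted as false.
    return result is True
```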
Closes#11730
* github.com:scylladb/scylladb:
cql3: Make CONTAINS KEY NULL return false
cql3: Make CONTAINS NULL return false
The removenode_abort logic that follows the warning
may throw, in which case information about
the original exception was lost.
Fixes: #11722
Closes#11735
To keep the idl definition of plan_id from
getting out of sync with the one in stream_fwd.hh.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11720
The raft_group0 code needs system_keyspace and now it gets one from gossiper. This gossiper->system_keyspace dependency is in fact artificial, gossiper doesn't need system ks, it's there only to let raft and snitch call gossiper.get_system_keyspace().
This makes raft use system ks directly, snitch is patched by another branch
Closes#11729
* github.com:scylladb/scylladb:
raft_group0: Use local reference
raft_group0: Add system keyspace reference
Commit aba475fe1d accidentally fixed a race, which happens in
the following sequence of events:
1) storage service starts drain() via API for example
2) main's abort source is triggered, calling compaction_manager's do_stop()
via subscription.
2.1) do_stop() initiates the stop but doesn't wait for it.
2.2) compaction_manager's state is set to stopped, such that
compaction_manager::stop() called in defer_verbose_shutdown()
will wait for the stop and not start a new one.
3) drain() calls compaction_manager::drain() changing the state from
stopped to disabled.
4) main calls compaction_manager::stop() (as described in 2.2) and
incorrectly tries to stop the manager again, because the state was
changed in step 3.
aba475fe1d accidentally fixed this problem because drain() will no
longer take place if it detects the shutdown process was initiated
(it does so by ignoring drain request if abort source's subscription
was unlinked).
This shows us that looking at the state to determine if stop should be
performed is fragile, because once the state changes from A to B,
manager doesn't know the state was A. To make it robust, we can instead
check if the future that stores stop's promise is engaged, meaning that
the stop was already initiated and we don't have to start a new one.
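
The idea can be sketched with Python's asyncio standing in for Seastar
futures. Everything below is a simplified model: the class is not the real
compaction_manager, and the stops_started counter exists only to make the
behaviour observable.

```python
import asyncio


class CompactionManager:
    def __init__(self):
        self.state = "enabled"
        self._stop_future = None   # engaged once a stop was initiated
        self.stops_started = 0

    async def _really_do_stop(self):
        self.stops_started += 1
        await asyncio.sleep(0)     # stand-in for waiting on ongoing tasks
        self.state = "stopped"

    def do_stop(self):
        # Start the stop only if no stop was initiated before, regardless
        # of what the state field currently says.
        if self._stop_future is None:
            self._stop_future = asyncio.create_task(self._really_do_stop())

    def drain(self):
        # May overwrite the state, which is why the state is not a reliable
        # indicator of whether a stop is in progress.
        self.state = "disabled"

    async def stop(self):
        self.do_stop()
        await self._stop_future
```

Replaying the sequence from above (do_stop(), then drain(), then stop())
starts the stop exactly once in this model.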
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11711
This series undoes some recent damage to clarity, then
goes further by renaming terms around dirty_memory_manager
to be clearer. Documentation is added.
Closes#11705
* github.com:scylladb/scylladb:
dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
dirty_memory_manager: rename _virtual_region_group
api: column_family: fix memtable off-heap memory reporting
dirty_memory_manager: unscramble terminology
This is the continuation of the a980510654 that tries to catch ENOSPCs reported via storage_io_error similarly to how defer_verbose_shutdown() does on stop
Closes#11664
* github.com:scylladb/scylladb:
table: Handle storage_io_error's ENOSPC when flushing
table: Rewrap retry loop
Compacted undeleted sstables are relevant for avoiding data resurrection
in the purge path. As token ranges of groups won't overlap, it's
better to isolate this data, so to prevent one group from interfering
with another.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We need to pass the compaction_group received as a param, not the one
retrieved via as_table_state(). Needed for supporting multiple
groups.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy can happen in parallel to view building.
A semaphore is used to ensure they don't step on each other's
toes.
If off-strategy completes first, then move_sstables_from_staging()
won't find the SSTable alive and won't reach code to add
the file to the backlog tracker.
If view building completes first, the SSTable exists, but it's
not reshaped yet (has repair origin) and shouldn't be
added to the backlog tracker.
Off-strategy completion code will make sure new sstables added
to main set are accounted by the backlog tracker, so
move_sstables_from_staging() only needs to add to the tracker files
which are certainly not going through a reshape compaction.
So let's take these facts into account to make the procedure
more robust and compaction group friendly. Very welcome change
for when multiple groups are supported.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's important for multiple compaction groups. Once replica::table
supports multiple groups, there will be no table::as_table_state(),
so for testing table with a single group, we'll be relying on
column_family_for_tests::as_table_state().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The compound set shouldn't be exposed in main_sstables() because
once we complete the switch to column_family_for_tests::table_state,
it can happen that compaction will try to remove or add elements to its
set snapshot, and the compound set allows neither operation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Needed once we switch to column_family_for_tests::table_state, so unit
tests relying on correct value will still work
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change will make table_state_for_test the table_state of
column_family_for_tests. Today, a unit test has to keep a reference
to them both and logically couple them, but that's error prone.
This change is also important when replica::table supports multiple
compaction groups, so unit tests won't have to directly reference
the table_state of table, but rather use the one managed by
column_family_for_tests.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The switch is important once we have multiple compaction groups,
as a single table may own several groups. There will no longer be
a replica::table::as_table_state().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Lots of boilerplate is reduced, and will also help to complete the
switch from replica::table to compaction::table_state in the unit
tests.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A binary operator like this:
{1: 2, 3: 4} CONTAINS KEY NULL
used to evaluate to `true`.
This is wrong, any operation involving null
on either side of the operator should evaluate
to NULL, which is interpreted as false.
This change is not backwards compatible.
Some existing code might break.
partially fixes: #10359
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
A binary operator like this:
[1, 2, 3] CONTAINS NULL
used to evaluate to `true`.
This is wrong, any operation involving null
on either side of the operator should evaluate
to NULL, which is interpreted as false.
This change is not backwards compatible.
Some existing code might break.
partially fixes: #10359
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
It now grabs one from gossiper which is weird. A bit later it will be
possible to remove gossiper->system_keyspace dependency
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All calls in the try block have been noexcept for some time.
Remove the try...catch and the associated misleading comment to avoid confusing
source code readers.
Closes#11715
Since `_partition_size_entry` and `_rows_in_partition_entry`
are accessed at the same time when updated, and similarly
`_cell_size_entry` and `_elements_in_collection_entry`,
place the member pairs closely together to improve data
cache locality.
Follow the same order when preparing the
`scylla_metadata::large_data_stats` map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To save the map lookup on the hot write path,
keep each large data stats entry as a member in the writer
object and build a map for storing the disk_hash in the
scylla metadata only when finalizing it in consume_end_of_stream.
Fixes#11686
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Make the various large data thresholds live-updateable
and construct the observers and updaters in
cql_table_large_data_handler to dynamically update
the base large_data_handler class threshold members.
Fixes#11685
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Automatically updates a value from a utils::updateable_value,
where the two can be of different types.
An optional transform function can provide an additional transformation
when updating the value, like multiplying it by a factor for unit conversion,
for example.
To be used for auto-updating the large data thresholds
from the db::config.
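
The mechanism can be sketched in Python as below. This is a toy model of the
observer pattern only: the classes imitate utils::updateable_value and the
new updater in spirit, and all method names and constructor arguments are
assumptions.

```python
class UpdateableValue:
    """Toy stand-in for a live-updateable config value."""

    def __init__(self, value):
        self._value = value
        self._observers = []

    def get(self):
        return self._value

    def observe(self, callback):
        self._observers.append(callback)

    def set(self, value):
        self._value = value
        for cb in self._observers:
            cb(value)


class TransformingValueUpdater:
    """Keeps target.<attr> in sync with source, applying transform."""

    def __init__(self, target, attr, source, transform=lambda v: v):
        self._target = target
        self._attr = attr
        self._transform = transform
        self._update(source.get())     # initial value
        source.observe(self._update)   # later live updates

    def _update(self, value):
        setattr(self._target, self._attr, self._transform(value))
```

For instance, a threshold configured in megabytes can be kept in bytes on
the handler side by passing a transform that multiplies by 1024 * 1024.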
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set.
Fixes https://github.com/scylladb/scylladb/issues/11681.
Closes#11682
* github.com:scylladb/scylladb:
replica: Return all sstables in table::get_sstable_set()
sstables: Fix cloning of compound_sstable_set
get_sstable_set() as its name implies is not confined to the main
or maintenance set, nor to a specific compaction group, so let's
make it return the compound set which spans all groups, meaning
all sstables tracked by a table will be returned.
This is a regression introduced in 1e7a444. It affects the API
to return sstable list containing a partition key, as sstables
in maintenance would be missed, fooling users of the API like
tools that could trust the output.
Each compaction group is returning the main and maintenance set
in table_state's main_sstable_set() and maintenance_sstable_set(),
respectively.
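A small model of the regression and the fix, assuming a single compaction group for simplicity. The SSTable and Table classes here are hypothetical stand-ins, not the replica:: types.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class SSTable:
    name: str
    keys: FrozenSet[str]


class Table:
    def __init__(self, main: Iterable[SSTable], maintenance: Iterable[SSTable]) -> None:
        self._main: Set[SSTable] = set(main)
        self._maintenance: Set[SSTable] = set(maintenance)

    def get_sstable_set(self) -> Set[SSTable]:
        # Fixed behaviour: the compound view, not only the main set.
        return self._main | self._maintenance

    def get_sstables_by_partition_key(self, key: str) -> List[str]:
        return sorted(s.name for s in self.get_sstable_set() if key in s.keys)
```

If get_sstable_set() returned only self._main, a key that lives solely in a maintenance sstable would be missing from the result, which is the symptom described above.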
Fixes#11681.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The intention was that its clone() would actually clone the content
of an existing set into a new one, but the current impl is actually
moving the sets instead of copying them. So the original set
becomes invalid. Luckily, this problem isn't triggered as we're
not exposing the compound set in the table's interface, so the
compound_sstable_set::clone() method isn't being called.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.
In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).
I chose the name "spooled memory" for the portion of memtables that has
been written to disk. At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".
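The relationship between the terms is simple arithmetic. A minimal sketch, with a hypothetical function name and the assumption that spooled memory never exceeds real dirty memory:

```python
def unspooled_dirty_memory(real_dirty_bytes: int, spooled_bytes: int) -> int:
    """Unspooled memory = real dirty memory minus the part already written to disk."""
    if not 0 <= spooled_bytes <= real_dirty_bytes:
        # Assumption of this sketch: spooled memory is a portion of dirty memory.
        raise ValueError("spooled memory must be between 0 and real dirty memory")
    return real_dirty_bytes - spooled_bytes
```

For instance, with 100 MiB of real dirty memory of which 30 MiB has been written to a not-yet-sealed sstable, 70 MiB is unspooled.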
I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.
The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
We report virtual memory used, but that's not a real accounting
of the actual memory used. Use the correct real_memory_used() instead.
Note that this isn't a recent regression and was probably broken forever.
However nobody looks at this measure (and it's usually close to the
correct value) so nobody noticed.
Since it's so minor, I didn't bother filing an issue.
Before 95f31f37c1 ("Merge 'dirty_memory_manager: simplify
region_group' from Avi Kivity"), we had two region_group
objects, one _real_region_group and another _virtual_region_group,
each with a set of "soft" and "hard" limits and related functions
and members.
In 95f31f37c1, we merged _real_region_group into _virtual_region_group,
but unfortunately the _real_region_group members received the "hard"
prefix when they got merged. This overloads the meaning of "hard" -
is it related to soft/hard limit or is it related to the real/virtual
distinction?
This patch applies some renaming to restore consistency. Anything
that came from _virtual_region_group now has "virtual" in its name.
Anything that came from _real_region_group now has "real" in its name.
The terms are still pretty bad but at least they are consistent.
- Separate `aiohttp` client code
- Helper to access Scylla server REST API
- Use helper both in `ScyllaClusterManager` (test.py process) and `ManagerClient` (pytest process)
- Add `removenode` and `decommission` operations.
Closes#11653
* github.com:scylladb/scylladb:
test.py: Scylla REST methods for topology tests
test.py: rename server_id to server_ip
test.py: HTTP client helper
test.py: topology pass ManagerClient instead of...
test.py: delete unimplemented remove server
test.py: fix variable name ssl name clash
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations
For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch makes said class cheap to copy, by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).
Fixes: #11669
Closes#11670
The decision to reject a read operation can either be made by replicas,
or by the coordinator. In the second case, the
scylla_storage_proxy_coordinator_read_rate_limited
metric was not incremented, but it should have been. This commit fixes the issue.
Fixes: #11651Closes#11694
The seastar defer_stop() helper is cool, but it forwards any exception from the .stop() towards the caller. In case the caller is main(), the exception causes Scylla to abort(). This fires, for example, in compaction_manager::stop() when it steps on ENOSPC.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11662
* github.com:scylladb/scylladb:
compaction_manager: Swallow ENOSPCs in ::stop()
exceptions: Mark storage_io_error::code() with noexcept
The test `test_metrics.py::test_ttl_stats` tests the metrics associated
with Alternator TTL expiration events. It normally finishes in less than a
second (the TTL scanning is configured to run every 0.5 seconds), so we
arbitrarily set a 60 second timeout for this test to allow for extremely
slow test machines. But in some extreme cases even this was not enough -
in one case we measured the TTL scan to take 63 seconds.
So in this patch we increase the timeout in this test from 60 seconds
to 120 seconds. We already did the same change in other Alternator TTL
tests in the past - in commit 746c4bd.
Fixes#11695
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11696
When the large_collection_detection cluster feature is enabled,
select the internal_record_large_cells_and_collections method
to record the large collection cell, storing also the collection_elements
column.
We want to do that only when the cluster feature is enabled
to facilitate rollback in case rolling upgrade is aborted,
otherwise system.large_cells won't be backward compatible
and will have to be deleted manually.
Delete the sstable from system.large_cells if it contains
elements_in_collection above threshold.
Closes#11449
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For recording collection_elements of large_collections when
the large_collection_detection feature is enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And bump the schema version offset since the new schema
should be distinguishable from the previous one.
Refs scylladb/scylladb#11660
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And a corresponding db::schema_feature::SCYLLA_LARGE_COLLECTIONS
We want to enable the schema change supporting collection_elements
only when all nodes are upgraded so that we can roll back
if the rolling upgrade process is aborted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Keep the with_static ctor parameter as private member
to be used by the cql() method to define s1 either as static or not.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Detect large_collections when the number of collection_elements
is above the configured threshold.
Next step would be to record the number of collection_elements
in the system.large_cells table, when the respective
cluster feature is enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And update the sstable elements_in_collection
stats entry.
Next step would be to forward it to
large_data_handler().maybe_record_large_cells().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a new large_data_stats type and entry for keeping
the collection_elements_count_threshold and the maximum value
of collection_elements.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`set_group0_upgrade_state` writes the on-disk state first, then
in-memory state second, both under a write lock.
`get_group0_upgrade_state` would only take the lock if the in-memory
state was `use_pre_raft_procedures`.
If there's an external observer who watches the on-disk state to decide
whether Raft upgrade finished yet, the following could happen:
1. The node wrote `use_post_raft_procedures` to disk but didn't update
the in-memory state yet, which is still `synchronize`.
2. The external client reads the table and sees that the state is
`use_post_raft_procedures`, and deduces that upgrade has finished.
3. The external client immediately tries to perform a schema change. The
schema change code calls `get_group0_upgrade_state` which does not
take the read lock and returns `synchronize`. The schema change gets
denied because schema changes are not allowed in `synchronize`.
Make sure that `get_group0_upgrade_state` cannot execute in-between
writing to disk and updating the in-memory state by always taking the
read lock before reading the in-memory state. As it was before, it will
immediately drop the lock if the state is not `use_pre_raft_procedures`.
This is useful for upgrade tests, which read the on-disk state to decide
whether upgrade has finished and often try to perform a schema change
immediately afterwards.
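The ordering guarantee can be demonstrated with a small threaded model. This is a sketch under simplifying assumptions: a single mutex replaces the read/write lock, the getter holds the lock for the whole read instead of dropping it early, and the between_writes hook exists only to start the observer at the critical moment. The UpgradeState class is hypothetical.

```python
import threading
from typing import Callable, List, Optional, Tuple


class UpgradeState:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.on_disk = "synchronize"
        self._in_memory = "synchronize"

    def set_state(self, new_state: str,
                  between_writes: Optional[Callable[[], None]] = None) -> None:
        with self._lock:
            self.on_disk = new_state          # on-disk state is written first
            if between_writes is not None:
                between_writes()              # test hook: an observer shows up here
            self._in_memory = new_state       # in-memory state is written second

    def get_state(self) -> str:
        # Always take the lock, so a reader cannot run between the two writes.
        with self._lock:
            return self._in_memory


def observe_during_upgrade() -> Tuple[str, str]:
    state = UpgradeState()
    seen: List[str] = []
    reader = threading.Thread(target=lambda: seen.append(state.get_state()))
    # The reader starts while the setter still holds the lock and has only
    # updated the on-disk state; it blocks until the setter finishes.
    state.set_state("use_post_raft_procedures", between_writes=reader.start)
    reader.join()
    return state.on_disk, seen[0]
```

Because the reader must wait for the lock, it can never observe the stale in-memory value once the on-disk value has changed.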
Closes#11672
Provide a helper client for Scylla REST requests. Use it on both
ScyllaClusterManager (e.g. remove node, test.py process) and
ManagerClient (e.g. get uuid, pytest process).
For now keep using IPs as key in ScyllaCluster, but this will be changed
to UUID -> IP in the future. So, for now, pass both independently. Note
the UUID must be obtained from the server before stopping it.
Refresh client driver connection when decommissioning or removing
a node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In ScyllaCluster currently servers are tracked by the host IP. This is
not the host id (UUID). Fix the variable name accordingly.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Split aiohttp client to a shared helper file.
While there, move aiohttp session setup back to constructors. When there
were teardown issues, it looked like they could be caused by the aiohttp session
being created outside a coroutine. But this is proven not to be the case
after recent fixes. So move it back to the ManagerClient constructor.
On the other hand, create a close() coroutine to stop the aiohttp session.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
cql connection
When there are topology changes, the driver needs to be updated. Instead
of passing the CassandraCluster.Connection, pass the ManagerClient
instance which manages the driver connection inside of it.
Remove workaround for test_raft_upgrade.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.
A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
A do_with() makes this at least a break-even.
Some internal lambdas were not converted since they commonly
do not allocate or block.
A finally() continuation is converted to seastar::defer().
The do_with means the coroutine conversion is free, and conversion
of parallel_for_each to coroutine::parallel_for_each saves a
possible allocation (though it would not have been allocated usually).
An inner continuation is not converted since it usually doesn't
block, and therefore doesn't allocate.
When being stopped, the compaction manager may step on ENOSPC. This is not a
reason to fail the stopping process with an abort; it is better to log a warning
and proceed as if nothing happened.
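The pattern looks roughly like this. The sketch uses Python's OSError and errno as an analogue of the storage I/O exception and its error code; the function name and logger name are hypothetical.

```python
import errno
import logging
from typing import Callable

logger = logging.getLogger("compaction_manager_sketch")


def stop_swallowing_enospc(stop: Callable[[], None]) -> bool:
    """Run stop(); treat ENOSPC as a warning, re-raise everything else.

    Returns True on a clean stop and False when ENOSPC was swallowed.
    """
    try:
        stop()
    except OSError as e:
        if e.errno != errno.ENOSPC:
            raise  # other errors are still fatal for the caller
        logger.warning("stop hit ENOSPC, ignoring: %s", e)
        return False
    return True
```

Only the out-of-space condition is downgraded to a warning. Any other error still propagates to the caller.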
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It was passed to `raft_group_registry::direct_fd_proxy` by value. That
is a bug: we want to pass a reference to the instance that is living
inside `gossiper`.
Fortunately this bug didn't cause problems, because the pinger is only
used for one function, `get_address`, which looks up an address in a map
and if it doesn't find it, accesses the map that lives inside
`gossiper` on shard 0 (and then caches it in the local copy).
Explicitly delete the copy constructor of `direct_fd_pinger` so this
doesn't happen again.
Closes#11661
The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had.
Active tombstone validation is pushed down to the low level validator.
The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore.
Closes#11614
* github.com:scylladb/scylladb:
mutation_fragment_stream_validator: make interface more robust
mutation_fragment_stream_validator: add reset() to validating filter
mutation_fragment_stream_validator: move active tombstone validation into low level validator
region_group evolved as a tree, each node of which contains some
regions (memtables). Each node has some constraints on memory, and
can start flushing and/or stop allocation into its memtables and those
below it when those constraints are violated.
Today, the tree has exactly two nodes, only one of which can hold memtables.
However, all the complexity of the tree remains.
This series applies some mechanical code transformations that remove
the tree structure and all the excess functionality, leaving a much simpler
structure behind.
Before:
- a tree of region_group objects
- each with two parameters: soft limit and hard limit
- but only two instances ever instantiated
After:
- a single region_group object
- with three parameters - two from the bottom instance, one from the top instance
Closes#11570
* github.com:scylladb/scylladb:
dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config
dirty_memory_manager: simplify region_group::update()
dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers
dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief()
dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group
dirty_memory_manager: remove accessors around region_group::_under_hard_pressure
dirty_memory_manager: merge memory_hard_limit into region_group
dirty_memory_manager: rename members in memory_hard_limit
dirty_memory_manager: fold do_update() into region_group::update()
dirty_memory_manager: simplify memory_hard_limit's do_update
dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit
dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit)
dirty_memory_manager: adjust soft_limit threshold check
dirty_memory_manager: drop memory_hard_limit::_name
dirty_memory_manager: simplify memory_hard_limit configuration
dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group}
dirty_memory_manager: stop inheriting from region_group_reclaimer
dirty_memory_manager: test: unwrap region_group_reclaimer
dirty_memory_manager: change region_group_reclaimer configuration to a struct
dirty_memory_manager: convert region_group_reclaimer to callbacks
dirty_memory_manager: consolidate region_group_reclaimer constructors
dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief
dirty_memory_manager: drop unused parameter to memory_hard_limit constructor
dirty_memory_manager: drop memory_hard_limit::shutdown()
dirty_memory_manager: split region_group hierarchy into separate classes
dirty_memory_manager: extract code block from region_group::update
dirty_memory_manager: move more allocation_queue functions out of region_group
dirty_memory_manager: move some allocation queue related function definitions outside class scope
dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue
dirty_memory_manager: remove support for multiple subgroups
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of these to be
resurrected and the view builder to generate view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.
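The failure mode can be reproduced in a toy model. The sketch below covers only the partition tombstone case. The Compactor class, its methods and build_in_batches are hypothetical simplifications of the real compactor and view builder, and rows are reduced to their write timestamps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompactorState:
    partition_key: str
    partition_tombstone: Optional[int] = None  # deletion timestamp, if any


class Compactor:
    def __init__(self, state: Optional[CompactorState] = None) -> None:
        self._state = state

    def consume_partition_start(self, key: str, tombstone: Optional[int] = None) -> None:
        self._state = CompactorState(key, tombstone)

    def consume_row(self, row_timestamp: int) -> bool:
        """Return True if the row is live, i.e. not covered by the tombstone."""
        tombstone = self._state.partition_tombstone
        return tombstone is None or row_timestamp > tombstone

    def detach_state(self) -> CompactorState:
        return self._state


def build_in_batches(key: str, tombstone: Optional[int], row_timestamps: List[int],
                     batch_size: int, carry_state: bool) -> List[int]:
    live = []
    compactor = Compactor()
    compactor.consume_partition_start(key, tombstone)
    for i, ts in enumerate(row_timestamps):
        if i and i % batch_size == 0:
            # Suspend and resume: the reader is kept, the compactor is recreated.
            if carry_state:
                compactor = Compactor(compactor.detach_state())
            else:
                compactor = Compactor()
                compactor.consume_partition_start(key)  # key only, tombstone lost
        if compactor.consume_row(ts):
            live.append(ts)
    return live
```

With rows at timestamps 1 to 4, a partition tombstone at timestamp 10 and a batch size of 2, priming with the key only resurrects the rows of the second batch, while carrying the detached state keeps all four rows dead.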
Fixes: #11668
Closes#11671
* abseil 9e408e05...7f3c0d78 (193):
> Allows absl::StrCat to accept types that implement AbslStringify()
> Merge pull request #1283 from pateldeev:any_inovcable_rename_true
> Cleanup: SmallMemmove nullify should also be limited to 15 bytes
> Cleanup: implement PrependArray and PrependPrecise in terms of InlineData
> Cleanup: Move BitwiseCompare() to InlineData, and make it layout independent.
> Change kPower10Table bounds to be half-open
> Cleanup some InlineData internal layout specific details from cord.h
> Improve the comments on the implementation of format hooks adl tricks.
> Expand LogEntry method docs.
> Documentation: Remove an obsolete note about the implementation of `Cord`.
> `absl::base_internal::ReadLongFromFile` should use `O_CLOEXEC` and handle interrupts to `read`
> Allows absl::StrFormat to accept types which implement AbslStringify()
> Add common_policy_traits - a subset of hash_policy_traits that can be shared between raw_hash_set and btree.
> Split configuration related to cycle clock into separate headers
> Fix -Wimplicit-int-conversion and -Wsign-conversion warnings in btree.
> Implement Eisel-Lemire for from_chars<float>
> Import of CCTZ from GitHub.
> Adds support for "%v" in absl::StrFormat and related functions for bool values. Note that %v prints bool values as "true" and "false" rather than "1" and "0".
> De-pointerize LogStreamer::stream_, and fix move ctor/assign preservation of flags and other stream properties.
> Explicitly disallows modifiers for use with %v.
> Change the macro ABSL_IS_TRIVIALLY_RELOCATABLE into a type trait - absl::is_trivially_relocatable - and move it from optimization.h to type_traits.h.
> Add sparse and string copy constructor benchmarks for hash table.
> Make BTrees work with custom allocators that recycle memory.
> Update the readme, and (internally) fix some export processes to better keep it up-to-date going forward.
> Add the fact that CHECK_OK exits the program to the comment of CHECK_OK.
> Adds support for "%v" in absl::StrFormat and related functions for numeric types, including integer and floating point values. Users may now specify %v and have the format specifier deduced. Integer values will print according to %d specifications, unsigned values will use %u, and floating point values will use %g. Note that %v does not work for `char` due to ambiguity regarding the intended output. Please continue to use %c for `char`.
> Implement correct move constructor and assignment for absl::strings_internal::OStringStream, and mark that class final.
> Add more options for `BM_iteration` in order to see better picture for choosing trade off for iteration optimizations.
> Change `EndComparison` benchmark to not measure iteration. Also added `BM_Iteration` separately.
> Implement Eisel-Lemire for from_chars<double>
> Add `-llog` to linker options when building log_sink_set in logging internals.
> Apply clang-format to btree.h.
> Improve failure message: tell the values we don't like.
> Increase the number of per-ObjFile program headers we can expect.
> Fix "unsafe narrowing" warnings in absl, 8/n.
> Fix format string error with an explicit cast
> Add a case to detect when the Bazel compiler string is explicitly set to "gcc", instead of just detecting Bazel's default "compiler" string.
> Fix "unsafe narrowing" warnings in absl, 10/n.
> Fix "unsafe narrowing" warnings in absl, 9/n.
> Fix stacktrace header includes
> Add a missing dependency on :raw_logging_internal
> CMake: Require at least CMake 3.10
> CMake: install artifacts reflect the compiled ABI
> Fixes bug so that `%v` with modifiers doesn't compile. `%v` is not intended to work with modifiers because the meaning of modifiers is type-dependent and `%v` is intended to be used in situations where the type is not important. Please continue using `%s` if you require format modifiers.
> Convert algorithm and container benchmarks to cc_binary
> Merge pull request #1269 from isuruf:patch-1
> InlinedVector: Small improvement to the max_size() calculation
> CMake: Mark hash_testing as a public testonly library, as it is with Bazel
> Remove the ABSL_HAVE_INTRINSIC_INT128 test from pcg_engine.h
> Fix ClangTidy warnings in btree.h and btree_test.cc.
> Fix log StrippingTest on windows when TCHAR = WCHAR
> Refactors checker.h and replaces recursive functions with iterative functions for readability purposes.
> Refactors checker.h to use if statements instead of ternary operators for better readability.
> Import of CCTZ from GitHub.
> Workaround for ASAN stack safety analysis problem with FixedArray container annotations.
> Rollback of fix "unsafe narrowing" warnings in absl, 8/n.
> Fix "unsafe narrowing" warnings in absl, 8/n.
> Changes mutex profiling
> InlinedVector: Correct the computation of max_size()
> Adds support for "%v" in absl::StrFormat and related functions for string-like types (support for other builtin types will follow in future changes). Rather than specifying %s for strings, users may specify %v and have the format specifier deduced. Notably, %v does not work for `const char*` because we cannot be certain if %s or %p was intended (nor can we be certain if the `const char*` was properly null-terminated). If you have a `const char*` you know is null-terminated and would like to work with %v, please wrap it in a `string_view` before using it.
> Fixed header guards to match style guide conventions.
> Typo fix
> Added some more no_test.. tags to build targets for controlling testing.
> Remove includes which are not used directly.
> CMake: Add an option to build the libraries that are used for writing tests without requiring Abseil's tests be built (default=OFF)
> Fix "unsafe narrowing" warnings in absl, 7/n.
> Fix "unsafe narrowing" warnings in absl, 6/n.
> Release the Abseil Logging library
> Switch time_state to explicit default initialization instead of value initialization.
> spinlock.h: Clean up includes
> Fix minor typo in absl/time/time.h comment: "ToDoubleNanoSeconds" -> "ToDoubleNanoseconds"
> Support compilers that are unknown to CMake
> Import of CCTZ from GitHub.
> Change bit_width(T) to return int rather than T.
> Import of CCTZ from GitHub.
> Merge pull request #1252 from jwest591:conan-fix
> Don't try to enable use of ARM NEON intrinsics when compiling in CUDA device mode. They are not available in that configuration, even if the host supports them.
> Fix "unsafe narrowing" warnings in absl, 5/n.
> Fix "unsafe narrowing" warnings in absl, 4/n.
> Import of CCTZ from GitHub.
> Update Abseil platform support policy to point to the Foundational C++ Support Policy
> Import of CCTZ from GitHub.
> Add --features=external_include_paths to Bazel CI to ignore warnings from dependencies
> Merge pull request #1250 from jonathan-conder-sm:gcc_72
> Merge pull request #1249 from evanacox:master
> Import of CCTZ from GitHub.
> Merge pull request #1246 from wxilas21:master
> remove unused includes and add missing std includes for absl/status/status.h
> Sort INTERNAL_DLL_TARGETS for easier maintenance.
> Disable ABSL_HAVE_STD_IS_TRIVIALLY_ASSIGNABLE for clang-cl.
> Map the absl::is_trivially_* functions to their std impl
> Add more SimpleAtod / SimpleAtof test coverage
> debugging: handle alternate signal stacks better on RISCV
> Revert change "Fix "unsafe narrowing" warnings in absl, 4/n.".
> Fix "unsafe narrowing" warnings in absl, 3/n.
> Fix "unsafe narrowing" warnings in absl, 4/n.
> Fix "unsafe narrowing" warnings in absl, 2/n.
> debugging: honour `STRICT_UNWINDING` in RISCV path
> Fix "unsafe narrowing" warnings in absl, 1/n.
> Add ABSL_IS_TRIVIALLY_RELOCATABLE and ABSL_ATTRIBUTE_TRIVIAL_ABI macros for use with clang's __is_trivially_relocatable and [[clang::trivial_abi]].
> Merge pull request #1223 from ElijahPepe:fix/implement-snprintf-safely
> Fix frame pointer alignment check.
> Fixed sign-conversion warning in code.
> Import of CCTZ from GitHub.
> Add missing include for std::unique_ptr
> Do not re-close files on EINTR
> Renamespace absl::raw_logging_internal to absl::raw_log_internal to match (upcoming) non-raw logging namespace.
> Check for negative return values from ReadFromOffset
> Use HTTPS RFC URLs, which work regardless of the browser's locale.
> Avoid signedness change when casting off_t
> Internal Cleanup: removing unused internal function declaration.
> Make Span complain if constructed with a parameter that won't outlive it, except if that parameter is also a span or appears to be a view type.
> any_invocable_test: Re-enable the two conversion tests that used to fail under MSVC
> Add GetCustomAppendBuffer method to absl::Cord
> debugging: add hooks for checking stack ranges
> Minor clang-tidy cleanups
> Support [[gnu::abi_tag("xyz")]] demangling.
> Fix -Warray-parameter warning
> Merge pull request #1217 from anpol:macos-sigaltstack
> Undo documentation change on erase.
> Improve documentation on erase.
> Merge pull request #1216 from brjsp:master
> string_view: conditional constexpr is no longer needed for C++14
> Make exponential_distribution_test a bigger test (timeout small -> moderate).
> Move Abseil to C++14 minimum
> Revert commit f4988f5bd4176345aad2a525e24d5fd11b3c97ea
> Disable C++11 testing, enable C++14 and C++20 in some configurations where it wasn't enabled
> debugging: account for differences in alternate signal stacks
> Import of CCTZ from GitHub.
> Run flaky test in fewer configurations
> AnyInvocable: Move credits to the top of the file
> Extend visibility of :examine_stack to an upcoming Abseil Log.
> Merge contiguous mappings from the same file.
> Update versions of WORKSPACE dependencies
> Use ABSL_INTERNAL_HAS_SSE2 instead of __SSE2__
> PR #1200: absl/debugging/CMakeLists.txt: link with libexecinfo if needed
> Update GCC floor container to use Bazel 5.2.0
> Update GoogleTest version used by Abseil
> Release absl::AnyInvocable
> PR #1197: absl/base/internal/direct_mmap.h: fix musl build on mips
> absl/base/internal/invoke: Ignore bogus warnings on GCC >= 11
> Revert GoogleTest version used by Abseil to commit 28e1da21d8d677bc98f12ccc7fc159ff19e8e817
> Update GoogleTest version used by Abseil
> explicit_seed_seq_test: work around/disable bogus warnings in GCC 12
> any_test: expand the any emplace bug suppression, since it has gotten worse in GCC 12
> absl::Time: work around bogus GCC 12 -Wrestrict warning
> Make absl::StdSeedSeq an alias for std::seed_seq
> absl::Optional: suppress bogus -Wmaybe-uninitialized GCC 12 warning
> algorithm_test: suppress bogus -Wnonnull warning in GCC 12
> flags/marshalling_test: work around bogus GCC 12 -Wmaybe-uninitialized warning
> counting_allocator: suppress bogus -Wuse-after-free warning in GCC 12
> Prefer to fallback to UTC when the embedded zoneinfo data does not contain the requested zone.
> Minor wording fix in the comment for ConsumeSuffix()
> Tweak the signature of status_internal::MakeCheckFailString as part of an upcoming change
> Fix several typos in comments.
> Reformulate documentation of ABSL_LOCKS_EXCLUDED.
> absl/base/internal/invoke.h: Use ABSL_INTERNAL_CPLUSPLUS_LANG for language version guard
> Fix C++17 constexpr storage deprecation warnings
> Optimize SwissMap iteration by another 5-10% for ARM
> Add documentation on optional flags to the flags library overview.
> absl: correct the stack trace path on RISCV
> Merge pull request #1194 from jwnimmer-tri:default-linkopts
> Remove unintended defines from config.h
> Ignore invalid TZ settings in tests
> Add ABSL_HARDENING_ASSERTs to CordBuffer::SetLength() and CordBuffer::IncreaseLengthBy()
> Fix comment typo about absl::Status<T*>
> In b-tree, support unassignable value types.
> Optimize SwissMap for ARM by 3-8% for all operations
> Release absl::CordBuffer
> InlinedVector: Limit the scope of the maybe-uninitialized warning suppression
> Improve the compiler error by removing some noise from it. The "deleted" overload error is useless to users. By passing some dummy string to the base class constructor we use a valid constructor and remove the unintended use of the deleted default constructor.
> Merge pull request #714 from kgotlinux:patch-2
> Include proper #includes for POSIX thread identity implementation when using that implementation on MinGW.
> Rework NonsecureURBGBase seed sequence.
> Disable tests on some platforms where they currently fail.
> Fixed typo in a comment.
> Rollforward of commit ea78ded7a5f999f19a12b71f5a4988f6f819f64f.
> Add an internal helper for logging (upcoming).
> Merge pull request #1187 from trofi:fix-gcc-13-build
> Merge pull request #1189 from renau:master
> Allow for using b-tree with `value_type`s that can only be constructed by the allocator (ignoring copy/move constructors).
> Stop using sleep timeouts for Linux futex-based SpinLock
> Automated rollback of commit f2463433d6c073381df2d9ca8c3d8f53e5ae1362.
> time.h: Use uint32_t literals for calls to overloaded MakeDuration
> Fix typos.
> Clarify the behaviour of `AssertHeld` and `AssertReaderHeld` when the calling thread doesn't hold the mutex.
> Enable __thread on Asylo
> Add implementation of is_invocable_r to absl::base_internal for C++ < 17, define it as alias of std::is_invocable_r when C++ >= 17
> Optimize SwissMap iteration for aarch64 by 5-6%
> Fix detection of ABSL_HAVE_ELF_MEM_IMAGE on Haiku
> Don’t use generator expression to build .pc Libs lines
> Update Bazel used on MacOS CI
> Import of CCTZ from GitHub.
Closes #11687
To recap: the Nix devenv ({default,shell,flake}.nix and friends) in Scylla is a nicer (for those who consider it so, that is) alternative to dbuild: a completely deterministic build environment without Docker.
In theory we could support much more (creating installable packages, container images, various deployment affordances, etc. -- Nix is, among other things, a kind of parallel-to-everything-else devops realm) but there is clearly no demand and besides duplicating the work the release team is already doing (and doing just fine, needless to say) would be pointless and wasteful.
This PR reflects the accumulated changes that I have been carrying locally for the past year or so. The version currently in master _probably_ can still build Scylla, but that Scylla certainly would not pass unit tests.
What the previous paragraph seems to mean is, apparently I'm the only active user of Nix devenv for Scylla. Which, in turn, presents some obvious questions for the maintainers:
- Does this need to live in the Scylla source at all? (The changes to non-Nix-specific parts are minimal and unobtrusive, but they are still changes)
- If it's left in, who is going to maintain it going forward, should more users somehow appear? (I'm perfectly willing to fix things up when alerted, but no timeliness guarantees)
Closes #9557
* github.com:scylladb/scylladb:
nix: add README.md
build: improvements & upgrades to Nix dev environment
build: allow setting SCYLLA_RELEASE from outside
The combination is hard to read and modify.
Closes #11665
* github.com:scylladb/scylladb:
readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation
readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine
Include the unique test name (the unique name distinguishes between different test repeats) and the test case name where possible. Improve printing of clusters: include the cluster name and stopped servers. Fix some logging calls and add new ones.
Examples:
```
------ Starting test test_topology ------
```
became this:
```
------ Starting test test_topology.1::test_add_server_add_column ------
```
This:
```
INFO> Leasing Scylla cluster {127.191.142.1, 127.191.142.2, 127.191.142.3} for test test_add_server_add_column
```
became this:
```
INFO> Leasing Scylla cluster ScyllaCluster(name: 02cdd180-40d1-11ed-8803-3c2c30d32d96, running: {127.144.164.1, 127.144.164.2, 127.144.164.3}, stopped: {}) for test test_topology.1::test_add_server_add_column
```
Closes #11677
* github.com:scylladb/scylladb:
test/pylib: scylla_cluster: improve cluster printing
test/pylib: don't pass test_case_name to after-test endpoint
test/pylib: scylla_cluster: track current test case name and print it
test.py: pass the unique test name (e.g. `test_topology.1`) to cluster manager
test/pylib: scylla_cluster: pass the test case name to `before_test`
test/pylib: use "test_case_name" variable name when talking about test cases
reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter by a full
coarse clock tick from the requested threshold
(blocked-reactor-notify-ms).
Notably, if the stall threshold happens to be smaller than or equal to the
coarse clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.
Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.
Fixes #10981
Closes #11680
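A minimal sketch of the corrected check, written in Python rather than the C++ of the actual patch. The function name and the millisecond units are assumptions made for illustration; only the rule itself (the measured duration must exceed the threshold by at least one coarse clock tick) comes from the description above.

```python
def is_reportable_stall(measured_ms: float, threshold_ms: float, tick_ms: float) -> bool:
    """Decide whether a measured reclaim duration should be reported as a stall.

    A coarse clock can be off by up to one tick, so the measurement must be
    at least one tick longer than the configured threshold before it counts.
    """
    return measured_ms >= threshold_ms + tick_ms
```

With a 4 ms tick and a 4 ms threshold, a reclaim measured at 4 ms is no longer reported, while one measured at 8 ms still is.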
"
This series adds a long-awaited transition of our auto-generation
code to irq_cpu_mask instead of 'mode' in perftune.yaml.
And then it fixes a regression in scylla_prepare perftune.yaml
auto-generation logic.
"
* 'scylla_prepare_fix_regression-v1' of https://github.com/vladzcloudius/scylla:
scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions
scylla_prepare: stop generating 'mode' value in perftune.yaml
* Add some more useful stuff to the shell environment, so it actually
works for debugging & post-mortem analysis.
* Wrap ccache & distcc transparently (distcc will be used unless
NODISTCC is set to a non-empty value in the environment; ccache will
be used if CCACHE_DIR is not empty).
* Package the Scylla Python driver (instead of the C* one).
* Catch up to misc build/test requirements (including optional) by
requiring or custom-packaging: wasmtime 0.29.0, cxxbridge,
pytest-asyncio, liburing.
* Build statically-linked zstd in a saner and more idiomatic fashion.
* In pure builds (where sources lack Git metadata), derive
SCYLLA_RELEASE from source hash.
* Refactor things for more parameterization.
* Explicitly stub out installPhase (seeing that "nix build" succeeds
up to installPhase means we didn't miss any dependencies).
* Add flake support.
* Add copious comments.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
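The distcc/ccache selection rule from the list above can be sketched as follows. This is an illustration in Python, not the Nix code of the patch: the function name and the returned list are hypothetical, and only the two environment conditions are taken from the commit message.

```python
from typing import List, Mapping


def compiler_wrappers(env: Mapping[str, str]) -> List[str]:
    """Return the wrappers that would be enabled for a given environment.

    ccache is used if CCACHE_DIR is not empty; distcc is used unless
    NODISTCC is set to a non-empty value. The ordering of the result is
    an assumption of this sketch.
    """
    wrappers = []
    if env.get("CCACHE_DIR", ""):
        wrappers.append("ccache")
    if not env.get("NODISTCC", ""):
        wrappers.append("distcc")
    return wrappers
```

Note that an empty `NODISTCC` still leaves distcc enabled, matching the "non-empty value" wording.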
The extant logic for deriving the value of SCYLLA_RELEASE from the
source tree makes these assumptions:
* The tree being built includes Git metadata.
* The value of `date` is trustworthy and interesting.
* There are no uncommitted changes (those relevant to building,
anyway).
The above assumptions are either irrelevant or problematic in pure
build environments (such as the sandbox set up by `nix-build`):
* Pure builds use cleaned-up sources with all timestamps reset to Unix
time 0. Those cleaned-up sources are saved (in the Nix store, for
example) and content-hashed, so leaving the (possibly huge) Git
metadata increases the time to copy the sources and wastes disk
space (in fact, Nix in flake mode strips `.git` unconditionally).
* Pure builds run in a sandbox where time is, likewise, reset to Unix
time 0, so the output of `date` is neither informative nor useful.
Now, the only build step that uses Git metadata in the first place is
the SCYLLA_RELEASE value derivation logic. So, essentially, answering
the question "is the Git metadata needed to build Scylla" is a matter
of definition, and is up to us. If we elect to ignore Git metadata
and current time, we can derive SCYLLA_RELEASE value from the content
hash of the cleaned-up tree, regardless of the way that tree was
arrived at.
This change makes it possible to skip the derivation of SCYLLA_RELEASE
value from Git metadata and current time by way of setting
SCYLLA_RELEASE in the environment.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
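A minimal sketch of the override described above, assuming a hypothetical `derive_from_git` callback that stands in for the existing Git-and-date derivation. Treating an empty SCYLLA_RELEASE as unset is an assumption of this sketch, not something the commit message states.

```python
from typing import Callable, Mapping


def resolve_release(env: Mapping[str, str], derive_from_git: Callable[[], str]) -> str:
    """Prefer SCYLLA_RELEASE from the environment; otherwise derive it."""
    preset = env.get("SCYLLA_RELEASE")
    if preset:
        # The caller (e.g. a pure build) already decided the release value.
        return preset
    # Fall back to the pre-existing derivation (hypothetical callback).
    return derive_from_git()
```

A pure build can then pass a value computed from the source content hash without the build ever touching Git metadata or the clock.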
We notice there are two separate conditions controlling a call to
a single outcome, notify_pressure_relief(). Merge them into a single
boolean variable.
It started life as something shared between memory_hard_limit and
region_group, but now that they are back being the same thing, we
can make it a member again.
The two classes always have a 1:1 or 0:1 relationship, and
so we can just move all the members of memory_hard_limit
into region_group, with the functions that track the relationship
(memory_hard_limit::{add,del}()) removed.
The 0:1 relationship is maintained by initializing the
hard limit parameter with std::numeric_limits<size_t>::max().
The _hard_total_memory variable is always checked if it is
greater than this parameter in order to do anything, and
with this default it can never be.
In preparation for merging memory_hard_limit into region_group,
disambiguate similarly named members by adding the word "hard" in
random places.
memory_hard_limit and region_group are candidates for merging
because they constantly reference each other, and memory_hard_limit
does very little by itself.
do_fill_buffer() is an eclectic mix of coroutines and continuations.
That makes it hard to follow what is running sequentially and
concurrently.
Convert it into a pure coroutine by changing internal continuations
to lambda coroutines. These lambda coroutines are guarded with
seastar::coroutine::lambda. Furthermore, a future that is co_awaited
is converted to immediate co_await (without an intermediate future),
since seastar::coroutine::lambda only works if the coroutine is awaited
in the same statement it is defined on.
Print the cluster name and stopped servers in addition to the running
servers.
Fix a logging call which tried to print a server in place of a cluster
and even at that it failed (the server didn't have a hostname yet so it
printed as an empty string). Add another logging call.
Use `_before_test` calls to track the current test case name.
Concatenate it with the unique test name like this:
`test_topology.1::test_add_server_add_column`, and print it
instead of the test case name.
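The concatenation can be sketched as below. The helper name and parameter names are hypothetical; only the `::` separator and the example value come from the message above.

```python
def full_test_name(test_uname: str, test_case_name: str) -> str:
    """Join the unique test name and the test case name for log output."""
    return f"{test_uname}::{test_case_name}"
```

For example, `full_test_name("test_topology.1", "test_add_server_add_column")` yields the string shown in the commit message.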
We pass the test case name to `after_test` - so make it consistent.
Arguably, the test case name is more useful (as it's more precise) than
the test name.
Reduce the false dependencies on db/large_data_handler.hh by
not including it from commonly used header files, and rather including
it only in the source files that actually need it.
This is in preparation for https://github.com/scylladb/scylladb/issues/11449
Closes #11654
* github.com:scylladb/scylladb:
test: lib: do not include db/large_data_handler.hh in test_service.hh
test: lib: move sstable test_env::impl ctor out of line
sstables: do not include db/large_data_handler.hh in sstables.hh
api/column_family: add include db/system_keyspace.hh
The generator was first setting the marker then applied tombstones.
The marker was set like this:
row.marker() = random_row_marker();
Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.
However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.
This could generate rows with markers uncompacted with shadowable tombstones.
This broke row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.
Fix by making sure there are no key clashes.
Closes #11663
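One way to guarantee clash-free random keys is to remember the keys already handed out and redraw on a collision. The sketch below is in Python and only illustrates the idea; the actual generator is C++ and may avoid clashes differently. All names are hypothetical.

```python
import random
from typing import List


def unique_random_keys(rng: random.Random, count: int, key_space: int) -> List[int]:
    """Draw `count` distinct keys from range(key_space), redrawing on clashes."""
    if count > key_space:
        raise ValueError("cannot draw more distinct keys than the key space holds")
    seen = set()
    keys = []
    while len(keys) < count:
        key = rng.randrange(key_space)
        if key in seen:
            # A clash would overwrite an earlier row's marker; draw again.
            continue
        seen.add(key)
        keys.append(key)
    return keys
```

Because every row gets its own key, a later iteration can no longer replace a marker that was already compacted with its tombstones.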
The `server_remove` function did a very weird thing: it shut down a
server and made the framework 'forget' about it. From the point of view
of the Scylla cluster and the driver the server was still there.
Replace the function's body with `raise NotImplementedError`. In the
future it can be replaced with an implementation that calls
`removenode` on the Scylla cluster.
Remove `test_remove_server_add_column` from `test_topology`. It
effectively does the same thing as `test_stop_server_add_column`, except
that the framework also 'forgets' about the stopped server. This could
lead to weird situations because the forgotten server's IP could be
reused in another test that was running concurrently with this test.
Closes #11657
Commit a9805106 (table: seal_active_memtable: handle ENOSPC error)
made memtable flushing code withstand ENOSPC and continue flushing again
in the hope that the node administrator would provide some free space.
However, it looks like the IO code may report back ENOSPC with some
exception type this code doesn't expect. This patch tries to fix it.
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing loop is very branchy in its attempts to find out whether or
not to abort. The "allowed_retries" count can be a good indicator of the
decision taken. This makes the code notably shorter and easier to extend
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
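A rough illustration of a retry loop driven by a single retry budget, in Python. It is not the flushing code from the patch: the function name, the generic `Exception` handling and the way the budget is supplied are assumptions made for the sketch. The point is only that one counter decides between retrying and giving up.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(operation: Callable[[], T], allowed_retries: int) -> T:
    """Call `operation` until it succeeds or the retry budget is used up."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= allowed_retries:
                # Budget exhausted: propagate the last error to the caller.
                raise
            attempt += 1
```

Extending such a loop means adjusting how `allowed_retries` is chosen for a given failure, rather than adding another branch inside the loop.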
It was needed for defining and referencing nop_lp_handler
and in sstable_3_x_test for testing the large_data_handler.
Remove the include from the commonly used header file
to reduce the false dependencies on large_data_handler.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The logic to reject explicit snapshot of views/indexes was improved in aa127a2dbb. However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.
This is implemented in this patch.
The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.
Fixes #11612
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11616
* github.com:scylladb/scylladb:
database: automatically take snapshot of base table views
api: storage_service: reject snapshot of views in api layer
For db::system_keyspace::load_view_build_progress that currently
indirectly satisfied via sstables/sstables.hh ->
db/large_data_handler.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Due to issue #11567, Alternator does not yet support adding a GSI to an
existing table via UpdateTable with the GlobalSecondaryIndexUpdates
parameter.
However, currently, we print a misleading error message in this case,
complaining about the AttributeDefinitions parameter. This parameter
is also required with GlobalSecondaryIndexUpdates, but it's not the
main problem, and the user is likely to be confused why the error message
points to that specific parameter and what it means that this parameter
is claimed to be "not supported" (while it is supported, in CreateTable).
With this patch, we report that GlobalSecondaryIndexUpdates is not
supported.
This patch does not fix the unsupported feature - it just improves
the error message saying that it's not supported.
Refs #11567
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11650
When walking through the ranges, we should yield to prevent stalls. We
do a similar yield in other node operations.
Fix a stall in 5.1.dev.20220724.f46b207472a3 with build-id
d947aaccafa94647f71c1c79326eb88840c5b6d2
```
!INFO | scylla[6551]: Reactor stalled for 10 ms on shard 0. Backtrace:
0x4bbb9d2 0x4bba630 0x4bbb8e0 0x7fd365262a1f 0x2face49 0x2f5caff
0x36ca29f 0x36c89c3 0x4e3a0e1
```
Fixes #11146
Closes #11160
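The idea of yielding while walking a long list of ranges can be illustrated with Python's asyncio as an analogy. The real fix is in Seastar-based C++, so this is not the patch's code; `walk_ranges`, `process` and `yield_every` are hypothetical names.

```python
import asyncio
from typing import Callable, Iterable


async def walk_ranges(ranges: Iterable, process: Callable, yield_every: int = 1) -> None:
    """Process each range, periodically giving other tasks a chance to run."""
    for i, r in enumerate(ranges, start=1):
        process(r)
        if i % yield_every == 0:
            # Cooperative yield: lets the event loop run other pending tasks
            # so one long loop does not monopolize the thread.
            await asyncio.sleep(0)
```

Without the yield, a loop over thousands of ranges runs to completion before anything else on the same event loop gets scheduled.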
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.
To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.
The latter is no longer derived from raw::cf_statement,
and just stores a schema_ptr to get to the keyspace and column_family.
`test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.
Fixes #11408
Also, update docs/cql/ddl.rst truncate-statement section respectively.
Closes #11409
* github.com:scylladb/scylladb:
docs: cql-extensions: add TRUNCATE to USING TIMEOUT section.
docs: cql: ddl: add support for TRUNCATE USING TIMEOUT
cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT
cql3: selectStatement: restrict to USING TIMEOUT in grammar
cql3: deleteStatement: restrict to USING TIMEOUT|TIMESTAMP in grammar
The series contains fixes for system.large_* log warning and respective documentation.
This prepares the way for adding a new system.large_collections table (See #11449):
Fixes #11620
Fixes #11621
Fixes #11622
The respective fixes should be backported to different release branches, based on the respective patches they depend on (mentioned in each issue).
Closes #11623
* github.com:scylladb/scylladb:
docs: adjust to sstable base name
docs: large-partition-table: adjust for additional rows column
docs: debugging-large-partition: update log warning example
db/large_data_handler: print static cell/collection description in log warning
db/large_data_handler: separate pk and ck strings in log warning with delimiter
Fix the type of `create_server`, rename `topology_for_class` to `get_cluster_factory`, simplify the suite definitions and parameters passed to `get_cluster_factory`
Closes #11590
* github.com:scylladb/scylladb:
test.py: replace `topology` with `cluster_size` in Topology tests
test.py: rename `topology_for_class` to `get_cluster_factory`
test/pylib: ScyllaCluster: fix create_server parameter type
The test was disabled due to a bug in the Python driver which caused the
driver not to reconnect after a node was restarted (see
scylladb/python-driver#170).
Introduce a workaround for that bug: we simply create a new driver
session after restarting the nodes. Reenable the test.
Closes #11641
Extended the queries language to support bind variables which are bound in the
execution stage, before creating a raft command.
Adjusted `test_broadcast_tables.py` to prepare statements at the beginning of the test.
Fixed a small bug in `strongly_consistent_modification_statement::check_access`.
Closes #11525
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.
snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is greater than
max_log_size - (size of trailing entries).
Closes #11397
* github.com:scylladb/scylladb:
raft replication_test, make backpressure test to do actual backpressure
raft server, shrink_to_fit on log truncation
raft server, release memory if add_entry throws
raft server, log size limit in bytes
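The progress condition from the log size limit description above can be written down directly. This is a sketch in Python with hypothetical function and parameter names (all sizes in bytes); it restates the inequality from the message and does not reproduce the raft server code.

```python
def can_make_progress(next_item_size: int, max_log_size: int, trailing_size: int) -> bool:
    """Return True if the next item still fits next to the trailing entries.

    The fsm gets stuck when next_item_size > max_log_size - trailing_size,
    which is why the size of the trailing entries must itself be bounded.
    """
    return next_item_size <= max_log_size - trailing_size
```

Bounding the trailing entries (snapshot_trailing_size) keeps `max_log_size - trailing_size` large enough for the next command to be accepted.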
When there are errors starting the first cluster(s), the server logs are needed. So move `.start()` to the `try` block in `test.py` (out of `asynccontextmanager`).
While there, make `ScyllaClusterManager.start()` idempotent.
Closes #11594
* github.com:scylladb/scylladb:
test.py: fix ScyllaClusterManager start/stop
test.py: fix topology init error handling
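A minimal sketch of what an idempotent `start()` can look like. `ManagerSketch` and its attributes are hypothetical and are not the real `ScyllaClusterManager`; the sketch only shows the guard that makes a second call a no-op.

```python
import asyncio


class ManagerSketch:
    """Toy stand-in for a manager whose start() may be called repeatedly."""

    def __init__(self) -> None:
        self.is_running = False
        self.start_count = 0  # counts how often real startup work ran

    async def start(self) -> None:
        if self.is_running:
            # Already started: calling again must not repeat the work.
            return
        self.is_running = True
        self.start_count += 1
```

With this guard the caller can invoke `start()` from inside a `try` block without tracking whether an earlier code path already did so.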
We don't want to keep memory we don't use, shrink_to_fit guarantees that.
In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.
List the queries that support the TIMEOUT parameter.
Mention the newly added support for TRUNCATE
USING TIMEOUT.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.
To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.
The latter, statements::truncate_statement, is no longer derived
from raw::cf_statement, and just stores a schema_ptr to get to the
keyspace and column_family names.
`test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.
Fixes #11408
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is preferred to reject USING TTL / TIMESTAMP at the grammar
level rather than functionally validating the USING attributes.
test_using_timeout was adjusted respectively to expect the
`SyntaxException` error rather than `InvalidRequest`.
Note that cql3::statements::raw::select_statement validate_attrs
now asserts that the ttl or the timestamp attributes aren't set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is preferred to reject USING TTL / TIMESTAMP at the grammar
level rather than functionally validating the USING attributes.
test_using_timeout was adjusted respectively to expect the
`SyntaxException` error rather than `InvalidRequest`.
Note that now delete_statement ctor asserts that the ttl
attribute is not set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
First, a reminder of a few basic concepts in Scylla:
- "topology" is a mapping: for each node, its DC and Rack.
- "replication strategy" is a method of calculating replica sets in
a cluster. It is not a cluster-global property; each keyspace can have
a different replication strategy. A cluster may have multiple
keyspaces.
- "cluster size" is the number of nodes in a cluster.
Replication strategy is orthogonal to topology. Cluster size can be
derived from topology and is also orthogonal to replication strategy.
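To make the distinction concrete, here is a small Python sketch of "topology" as a mapping from node to DC and rack, with the cluster size derived from it. The type names, addresses and DC/rack labels are made up for illustration and do not come from test.py.

```python
from typing import Dict, NamedTuple


class Location(NamedTuple):
    dc: str
    rack: str


def cluster_size(topology: Dict[str, Location]) -> int:
    """Cluster size is simply the number of nodes in the topology."""
    return len(topology)


# Hypothetical three-node topology keyed by node address.
example_topology = {
    "127.0.0.1": Location("dc1", "rack1"),
    "127.0.0.2": Location("dc1", "rack2"),
    "127.0.0.3": Location("dc2", "rack1"),
}
```

Nothing in this mapping mentions replication: a replication strategy is chosen per keyspace and can vary while the topology stays the same.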
test.py was confusing the three concepts together. For some reason,
Topology suites were specifying a "topology" parameter which contained
replication strategy details - having nothing to do with topology. Also
it's unclear why a test suite would specify anything to do with
replication strategies - after all, a test may create keyspaces with
different replication strategies, and a suite may contain multiple
different tests.
Get rid of the "topology" parameter, replace it with a simple
"cluster_size". In the future we may re-introduce it when we actually
implement the possibility to start clusters with custom topologies
(which involves configuring the snitch etc.) Simplify the test.py code.
The validator has several API families with increasing amount of detail.
E.g. there is an `operator()(mutation_fragment_v2::kind)` and an
overload also taking a position. These different API families
currently cannot be mixed. If one uses one overload-set, one has to
stick with it; not doing so will generate false-positive failures.
This is hard to explain in documentation to users (provided they even
read it). Instead, just make the validator robust enough such that the
different API subsets can be mixed in any order. The validator will try
to make the most of the situation and validate as much as possible.
Behind the scenes all the different validation methods are consolidated
into just two: one for the partition level, the other for the
intra-partition level. All the different overloads just call these
methods passing as much information as they have.
A test is also added to make sure this works.
The previous name had nothing to do with what the function calculated
and returned (it returned a `create_cluster` function; the standard name
for a function that constructs objects would be 'factory', so
`get_cluster_factory` is an appropriate name for a function that returns
cluster factories).
The only usage of `ScyllaCluster` constructor passed a `create_server`
function which expected a `List[str]` for the second parameter, while
the constructor specified that the function should expect an
`Optional[List[str]]`. There was no reason for the latter, we can easily
fix this type error.
Also give a type hint for `create_cluster` function in
`PythonTestSuite.topology_for_class`. This is actually what caught the
type error.
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.
snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
The logic to reject explicit snapshot of views/indexes
was improved in aa127a2dbb.
However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.
This is implemented in this patch.
The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.
Fixes #11612
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than pushing the check to
`snapshot_ctl::take_column_family_snapshot`, just check
that explicitly when taking a snapshot of a particular
table by name over the api.
Other paths that call snapshot_ctl::take_column_family_snapshot
are internal and use it to snap views already.
With that, we can get rid of the allow_view_snapshots flag
that was introduced in aab4cd850c.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow the high level filtering validator to be reset() to a certain
position, so it can be used in situations where the consumption is not
continuous (fast-forwarding or paging).
Currently the active range tombstone change is validated in the high
level `mutation_fragment_stream_validating_stream`, meaning that users of
the low-level `mutation_fragment_stream_validator` don't benefit from
checking that tombstones are properly closed.
This patch moves the validation down to the low-level validator (which
is what the high-level one uses under the hood too), and requires all
users to pass information about changes to the active tombstone for each
fragment.
This test reproduces issue #10365: It shows that although "IS NOT NULL" is
not allowed in regular SELECT filters, in a materialized view it is allowed,
even for non-key columns - but then outright ignored and does not actually
filter out anything - a fact which already surprised several users.
The test also fails on Cassandra - it also wrongly allows IS NOT NULL
on the non-key columns but then ignores this in the filter. So the test
is marked with both xfail (known to fail on Scylla) and cassandra_bug
(fails on Cassandra because of what we consider to be a Cassandra bug).
Refs #10365
Refs #11606
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11615
The goal is to avoid default-initializing an object whose fields are about to be overwritten immediately by the code that follows.
Closes #11619
* github.com:scylladb/scylladb:
replication_strategy: Construct temp tokens in place
topology: Define copy-constructor with init-lists
boolean_factors is a function that takes an expression
and extracts all children of the top level conjunction.
The problem is that it returns a vector<expression>,
which is inefficient.
Sometimes we would like to iterate over all boolean
factors without allocations. for_each_boolean_factor
is implemented for this purpose.
boolean_factors() can be implemented using
for_each_boolean_factor, so it's done to
reduce code duplication.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
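The relationship between the two functions can be sketched in Python. The `Conjunction` class and the use of plain strings as leaf expressions are assumptions of this sketch (the real code works on cql3 expressions in C++), as is treating a non-conjunction as a single factor.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Conjunction:
    """Top-level AND of child expressions; leaves are plain strings here."""
    children: tuple


def for_each_boolean_factor(expr, visit: Callable[[object], None]) -> None:
    """Call `visit` on every boolean factor without building a list."""
    if isinstance(expr, Conjunction):
        for child in expr.children:
            visit(child)
    else:
        # Assumption: a non-conjunction is its own single factor.
        visit(expr)


def boolean_factors(expr) -> List[object]:
    """Allocating variant, implemented on top of the visitor."""
    factors: List[object] = []
    for_each_boolean_factor(expr, factors.append)
    return factors
```

Callers that only need to inspect each factor use the visitor directly, while `boolean_factors` stays available for code that wants the list.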
Snitch uses gossiper for several reasons, one of which is to re-gossip the topology-related app states when property-file snitch config changes. This set cuts this link by moving re-gossiping into the existing storage_service::snitch_reconfigured() subscription. Since initial snitch state gossiping happens in storage service as well, this change is not unreasonable.
Closes #11630
* github.com:scylladb/scylladb:
storage_service: Re-gossiping snitch data in reconfiguration callback
storage_service: Coroutinize snitch_reconfigured()
storage_service: Indentation fix after previous patch
storage_service: Reshard to shard-0 earlier
storage_service: Refactor snitch reconfigured kick
Since 244df07771 (scylla 5.1),
only the sstable basename is kept in the large_* system tables.
The base path can be determined from the keyspace and
table name.
Fixes #11621
Adjust the examples in documentation respectively.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since a7511cf600 (scylla 5.0),
sstables containing partitions with too many rows are recorded in system.large_partitions.
Adjust the doc respectively.
Fixes #11622
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The log warning format has changed since f3089bf3d1
and was fixed in the previous patch to include
a delimiter between the partition key, clustering key, and
column name.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When warning about a large cell/collection in a static row,
print that fact in the log warning to make it clearer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently (since f3089bf3d1),
when printing a warning to the log about large rows and/or cells
the clustering key string is concatenated to the partition key string,
rendering the warning confusing and much less useful.
This patch adds a '/' delimiter to separate the fields,
and also uses one to separate the clustering key from the column name
for large cells. In case of a static cell, the clustering key is null
hence the warning will look like: `pk//column`.
This patch does NOT change anything in the large_* system
table schema or contents. It changes only the log warning format
that need not be backward compatible.
Fixes #11620
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
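The delimiter rule for large cells can be sketched as follows. The helper name is hypothetical and the sketch covers only the key/column portion of the warning, not the full log line, whose exact wording is not given here.

```python
from typing import Optional


def describe_large_cell(pk: str, ck: Optional[str], column: str) -> str:
    """Join partition key, clustering key and column name with '/'.

    For a static cell there is no clustering key, so the middle field is
    empty and the result has the form 'pk//column'.
    """
    return "/".join([pk, ck or "", column])
```

Calling `describe_large_cell("pk", None, "column")` gives the `pk//column` shape mentioned above.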
Nowadays it's done inside snitch, and snitch needs to carry gossiper
reference for that. There's an ongoing effort in de-globalizing snitch
and fixing its dependencies. This patch cuts this snitch->gossiper link
to facilitate the mentioned effort.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patch will add more sleeping code to it and it's simpler if the new
call is co_await-ed rather than .then()-ed
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snitch_reconfigured calls update_topology with local node bcast
address argument. Things get simpler if the callee gets the address
itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add Cassandra-compatible functionality: show a warning/error when tombstone_warn_threshold/tombstone_failure_threshold is reached on select, by partition. Propagate the raw query_string from the coordinator to the replicas.
Closes #11356
* github.com:scylladb/scylladb:
add utf8:validate to operator<< partition_key with_schema.
Show warn message if `tombstone_warn_threshold` reached on querier.
Otherwise, the token_metadata object is default-initialized, then it's
move-assigned from another object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Otherwise the topology is default-constructed, then its fields
are copy-assigned with the data from the copy-from reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Found by a fragment stream validator added to the mutation-compactor (https://github.com/scylladb/scylladb/pull/11532). As that PR moves very slowly, the fixes for the issues found are split out into a PR of their own.
The first two of these issues seem benign, but it is important to remember that how benign an invalid fragment stream is depends entirely on the consumer of said stream. The present consumer of said streams may swallow the invalid stream without problem now but any future change may cause it to enter into a corrupt state.
The last one is a non-benign problem (again because the consumer reacts badly already) causing problems when building query results for range scans.
Closes #11604
* github.com:scylladb/scylladb:
shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
readers/mutation_readers: compacting_reader: remember injected partition-end
db/view: view_builder::execute(): only inject partition-start if needed
When the querier reads a page with more tombstones than the `tombstone_warn_threshold` limit, a warning message appears in the logs.
Setting `tombstone_warn_threshold: 0` disables the feature.
Refs scylladb#11410
do_update() has an output parameter (top_relief) which can either
be set to an input parameter or left alone. Simplify it by returning
bool and letting the caller reuse the parameter's value instead.
They are write-only.
This corresponds to the fact that memory_hard_limit does not do
flushing (which is initiated by crossing the soft limit), it only
blocks new allocations.
We have added the finished percentage for repair based node operations.
This patch adds the finished percentage for node ops using the old
streaming.
Example output:
scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000
In addition to the metrics, the finished percentage is added to the log:
[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646
Fixes #11600
Closes #11601
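Judging by the example output, the reported value is a fraction between 0 and 1: finished ranges divided by total ranges. A Python sketch of that arithmetic follows; the function name is hypothetical, and returning 1.0 when there are no ranges is an assumption of the sketch.

```python
def finished_percentage(finished_ranges: int, total_ranges: int) -> float:
    """Fraction of ranges already streamed, in the interval [0, 1]."""
    if total_ranges == 0:
        # Assumption: nothing to stream counts as fully finished.
        return 1.0
    return finished_ranges / total_ranges
```

For the log line above, `finished_percentage(2698, 2817)` is approximately 0.9578.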
We made this function a template to prevent code duplication, but now
memory_hard_limit was sufficiently simplified so that the implementations
can start to diverge.
Use `>` rather than `>=` to match the hard limit check. This will
aid simplification, since for memory_hard_limit the soft and hard limits
are identical.
This should not cause any material behavior change, we're not sensitive
to single byte accounting. Typical limits are on the order of gigabytes.
We observe that memory_hard_limit's reclaim_config is only ever
initialized as default, or with just the hard_limit parameter.
Since soft_limit defaults to hard_limit, we can collapse the two
into a limit. The reclaim callbacks are always left as the default
no-op functions, so we can eliminate them too.
This fits with memory_hard_limit only being responsible for the hard
limit, and for it not having any memtables to reclaim on its own.
region_group_reclaimer is used to initialize (by reference) instances of
memory_hard_limit and region_group. Now that it is a final class, we
can fold it into its users by pasting its contents into those users,
and using the initializer (reclaim_config) to initialize the users. Note
there is a 1:1 relationship between a region_group_reclaimer instance
and a {memory_hard_limit,region_group} instance.
It may seem like code duplication to paste the contents of one class into
two, but the two classes use region_group_reclaimer differently, and most
of the code is just used to glue different classes together, so the
next patches will be able to get rid of much of it.
Some notes:
- no_reclaimer was replaced by a default reclaim_config, as that's how
no_reclaimer was initialized
- all members were added as private, except when a caller required one
to be public
- an under_pressure() member already existed, forwarding to the reclaimer;
this was just removed.
This inheritance makes it harder to get rid of the class. Since
there are no longer any virtual functions in the class (apart from
the destructor), we can just convert it to a data member. In a few
places, we need forwarding functions to make formerly-inherited functions
visible to outside callers.
The virtual destructor is removed and the class is marked final to
verify it is no longer a base class anywhere.
In one test, region_group_reclaimer is wrapped in another class just
to toggle a bool, but with the new callbacks it's easy to just use
a bool instead.
It's just so much nicer.
The "threshold" limit was renamed to "hard_limit" to contrast it with
"soft_limit" (in fact threshold is a good name for soft_limit, since
it's a point where the behavior begins to change, but that's too much
of a change).
region_group_reclaimer is partially policy (deciding when to reclaim)
and partially mechanism (implementing reclaim via virtual functions).
Move the mechanism to callbacks. This will make it easy to fold the
policy part into region_group and memory_hard_limit. This folding is
expected to simplify things since most of region_group_reclaimer is
cross-class communication.
It clashes with region_group_reclaimer::notify_relief, which does something
different. Since we plan to merge region_group_reclaimer into
memory_hard_limit and region_group (this can simplify the code), we
need to avoid duplicate function names.
Currently, region_group forms a hierarchy. Originally it was a tree,
but previous work whittled it down to a parent-child relationship
(with a single, possibly optional parent, and a single child).
The actual behavior of the parent and child are very different, so
it makes sense to split them. The main difference is that the parent
does not contain any regions (memtables), but the child does.
This patch mechanically splits the class. The parent is named
memory_hard_limit (reflecting its role to prevent lsa allocation
above the memtable configured hard limit). The child is still named
region_group.
Details of the transformation:
- each function or data member in region_group is either moved to
memory_hard_limit, duplicated in memory_hard_limit, or left in
region_group.
- the _regions and _blocked_requests members, which were always
empty in the parent, were not duplicated. Any member that only accessed
them was similarly left alone.
- the "no_reclaimer" static member which was only used in the parent
was moved there. Similarly the constructor which accepted it
was moved.
- _child was moved to the parent, and _parent was kept in the child
(more or less the defining change of the split) Similarly
add(region_group*) and del(region_group*) (which manage _child) were moved.
- do_for_each_parent(), which iterated to the top of the tree, was removed
and its callers manually unroll the loop. For the parent, this is just
a single iteration (since we're iterating towards the root), for the child,
this can be two iterations, but the second one is usually simpler since
the parent has many members removed.
- do_update(), introduced in the previous patch, was made a template that
can act on either the parent or the child. It will be further simplified
later.
- some tests that check now-impossible topologies were removed.
- the parent's shutdown() is trivial since it has no _blocked_requests,
but it was kept to reduce churn in the callers.
A mechanical transformation intended to allow reuse later. The function
doesn't really deserve to exist on its own, so it will be swallowed back
by its callers later.
region_group currently fulfills two roles: in one role, when instantiated
as dirty_memory_manager::_virtual_region_group, it is responsible
for holding functions that allocate memtable memory (writes) and only
allowing them to run when enough dirty memory has been flushed from other
memtables. The other role, when instantiated as
dirty_memory_manager::_real_region_group, is to provide a hard stop when
the total amount of dirty memory exceeds the limit, since the other limit
is only estimated.
We want to simplify the whole thing, which means not using the same class
for two different roles (or rather, we can use it for both roles if we
simplify the internals significantly).
As a first step towards clarifying what functionality is used in what
role, move some classes related to holding allocating functions to a new
class allocation_queue. We will gradually move more content there, reducing
the amount of role confusion in region_group.
Type aliases are added to reduce churn.
We only have one parent/child relationship in the region group
hierarchy, so support for more is unneeded complexity. Replace
the subgroup vector with a single pointer, and delete a test
for the removed functionality.
Commit 8ab57aa added a yield to the buffer-copy loop, which means that
the copy can yield before it is done, and the multishard reader might see the
half-copied buffer and consider the reader done (because
`_end_of_stream` is already set). This results in dropping the remaining
part of the buffer, and in an invalid stream if the last copied fragment
wasn't a partition-end.
Fixes: #11561
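The ordering that the fix relies on can be sketched with a toy model. This is not the shard_reader code: `ShardReaderSketch` and its members are hypothetical, and each `yield` stands in for a point where the copy loop gives up the CPU and another task may inspect the reader.

```python
class ShardReaderSketch:
    """Toy model of a reader whose buffer copy can be preempted."""

    def __init__(self, source):
        self.source = list(source)
        self.buffer = []
        self.end_of_stream = False

    def fill_buffer(self):
        """Generator: every `yield` is a point where an observer may run."""
        for fragment in self.source:
            self.buffer.append(fragment)
            yield
        # Set the flag only after the whole buffer was copied, so an
        # observer never sees "end of stream" next to a partial buffer.
        self.end_of_stream = True


reader = ShardReaderSketch(["partition-start", "row", "partition-end"])
observations = []
for _ in reader.fill_buffer():
    # Another task runs here, while the copy is suspended.
    observations.append((len(reader.buffer), reader.end_of_stream))

print(observations, reader.end_of_stream)
```

In this model the observer can still see a half-copied buffer, but never together with `end_of_stream` set, so it has no reason to drop the remainder.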
Currently injecting a partition-end doesn't update
`_last_uncompacted_kind`, which will allow for a subsequent
`next_partition()` call to trigger injecting a partition-end, leading to
an invalid mutation fragment stream (partition-end after partition-end).
Fix by changing `_last_uncompacted_kind` to `partition_end` when
injecting a partition-end, making subsequent injection attempts noop.
Fixes: #11608
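A minimal sketch of why remembering the injected partition-end makes repeated injection a no-op. The class below is a toy model with hypothetical names (`CompactingReaderSketch`, `FragmentKind`); it tracks only the kind of the last fragment and is not the compacting_reader implementation.

```python
from enum import Enum, auto


class FragmentKind(Enum):
    PARTITION_START = auto()
    CLUSTERING_ROW = auto()
    PARTITION_END = auto()


class CompactingReaderSketch:
    """Toy model: remembers the kind of the last fragment seen or injected."""

    def __init__(self) -> None:
        self.last_uncompacted_kind = FragmentKind.PARTITION_END
        self.emitted = []

    def consume(self, kind: FragmentKind) -> None:
        self.last_uncompacted_kind = kind
        self.emitted.append(kind)

    def next_partition(self) -> None:
        # Close the current partition only if it is still open.
        if self.last_uncompacted_kind != FragmentKind.PARTITION_END:
            self.emitted.append(FragmentKind.PARTITION_END)
            # The fix: record the injection so a repeated call does nothing.
            self.last_uncompacted_kind = FragmentKind.PARTITION_END


reader = CompactingReaderSketch()
reader.consume(FragmentKind.PARTITION_START)
reader.consume(FragmentKind.CLUSTERING_ROW)
reader.next_partition()
reader.next_partition()  # second call must not emit another partition-end
print([kind.name for kind in reader.emitted])
```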
When resuming a build-step, the view builder injects the partition-start
fragment of the last processed partition, to bring the consumer
(compactor) into the correct state before it starts to consume the
remainder of the partition content. This results in an invalid fragment
stream when the partition was actually over or there is nothing left for
the build step. Make the injection conditional on whether the reader contains
more data for the partition.
Fixes: #11607
Update several aspects of the alternator/getting-started.md which were
not up-to-date:
* When the document was written, Alternator was moving quickly so we
recommended running a nightly version. This is no longer the case, so
we should recommend running the latest stable build.
* The link to the download page is no longer helpful for getting Docker
instructions (it shows some generic download options). Instead point to
our dockerhub page.
* Replace mentions of "Scylla" by the new official name, "ScyllaDB".
* Miscellaneous copy-edits.
Fixes#11218
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11605
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).
In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.
I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.
Fixes#11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783
Closes#11384
* github.com:scylladb/scylladb:
test/alternator: insert test names into Scylla logs
rest api: add a new /system/log operation
alternator ttl: log warning if scan took too long.
alternator,ttl: allow sub-second TTL scanning period, for tests
test/alternator: skip fewer Alternator TTL tests
test/alternator: test Alternator TTL metrics
The mutation fragment stream validator filter has a detailed debug log
in its constructor. To avoid putting together this message when the log
level is above debug, it is enclosed in an if, activated when log level
is debug or trace... at least that was intended. Actually the if is
activated when the log level is debug or above (info, warn or error) but
is only actually logged if the log level is exactly debug. Fix the logic
to work as intended.
Closes#11603
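The inverted guard can be illustrated with a small sketch. The enum ordering (more verbose levels have larger values) and the exact comparisons are assumptions chosen to match the behavior described above; they are not copied from the validator code, and `LogLevel`, `guard_buggy` and `guard_fixed` are hypothetical names.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    # Assumed ordering for this sketch: more verbose levels are larger.
    ERROR = 0
    WARN = 1
    INFO = 2
    DEBUG = 3
    TRACE = 4


def guard_buggy(configured: LogLevel) -> bool:
    # Builds the expensive message for ERROR, WARN, INFO and DEBUG,
    # i.e. for almost every level at which it will never be printed.
    return configured <= LogLevel.DEBUG


def guard_fixed(configured: LogLevel) -> bool:
    # Builds the message only when debug output is actually enabled.
    return configured >= LogLevel.DEBUG


for level in LogLevel:
    print(level.name, guard_buggy(level), guard_fixed(level))
```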
The token_metadata::calculate_pending_ranges_for_bootstrap() makes a
clone of itself and adds bootstrapping nodes to the clone to calculate
ranges. Currently, the added nodes lack the dc/rack, which skews the
calculations.
Unfortunately, the dc/rack for those nodes is not available on topology
(yet) and needs pretty heavy patching to have. Fortunately, the only
caller of this method has gossiper at hand to provide the dc/rack from.
fixes: #11531
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11596
applier_fiber could create multiple snapshots between
io_fiber runs. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.
In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.
_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.
The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.
Closes#11530
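The bookkeeping described above can be sketched as follows. This is a simplified model with hypothetical class and function names (`FsmSketch`, `io_fiber_step`); only the field names `snp` and `snps_to_drop` come from the description, and a plain set of ids stands in for the snapshots that exist in the state machine.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FsmOutput:
    snp: Optional[int] = None                 # newest snapshot id to store
    snps_to_drop: List[int] = field(default_factory=list)


class FsmSketch:
    def __init__(self, initial_snapshot_id: int) -> None:
        self.current_snapshot_id = initial_snapshot_id
        self.output = FsmOutput()

    def apply_snapshot(self, new_snapshot_id: int) -> None:
        # Remember the snapshot being replaced instead of losing track of it.
        self.output.snps_to_drop.append(self.current_snapshot_id)
        self.current_snapshot_id = new_snapshot_id
        self.output.snp = new_snapshot_id


def io_fiber_step(fsm: FsmSketch, live_snapshots: Set[int]) -> None:
    """Consume the accumulated output: keep the newest snapshot, drop the rest."""
    batch, fsm.output = fsm.output, FsmOutput()
    if batch.snp is not None:
        live_snapshots.add(batch.snp)         # stands in for storing the descriptor
    for old_id in batch.snps_to_drop:
        live_snapshots.discard(old_id)        # stands in for dropping a snapshot


live = {1}
fsm = FsmSketch(initial_snapshot_id=1)
for new_id in (2, 3, 4):                      # several snapshots between io_fiber runs
    live.add(new_id)
    fsm.apply_snapshot(new_id)
io_fiber_step(fsm, live)
print(live)
```

With a single `snp` slot and no list, snapshots 2 and 3 in this model would never be dropped.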
For some reason, the test is currently flaky on Jenkins. Apparently the
Python driver does not reconnect to the cluster after the cluster
restarts (well it does, but then it disconnects from one of the nodes
and never reconnects again). This causes the test to hang on "waiting
until driver reconnects to every server" until it times out.
Disable it for now so it doesn't block next promotion.
Fix https://github.com/scylladb/scylladb/issues/11373
- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.
Closes#11417
* github.com:scylladb/scylladb:
doc: add a comment to remove the note in version 5.1
doc: update the information on the Counting all rows page and add the recommendation to upgrade ScyllaDB
doc: add a note to the description of COUNT with a reference to the KB article
doc: add COUNT to the list of acceptable selectors of the SELECT statement
Check existing is_running member to avoid re-starting.
While there, set it to false after stopping.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Start ScyllaClusterManager within error handling so the ScyllaCluster
logs are available in case of error starting up.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Tools want to be as non-disruptive as possible to the environment they run in, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator. For that, they need a seastar option to instruct seastar to *not* configure and use the seastar allocator, and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.
Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.
Closes#11510
* github.com:scylladb/scylladb:
test/boost: add alternative variant of logalloc test
tools: use standard allocator
utils/logalloc: add use_standard_allocator_segment_pool_backend()
utils/logalloc: introduce segment store backend for standard allocator
utils/logalloc: rebase release segment-store on segment-store-backend
utils/logalloc: introduce segment_store_backend
utils/logalloc: push segment alloc/dealloc to segment_store
test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Fix https://github.com/scylladb/scylladb/issues/11376
This PR adds the upgrade guide from version 5.0 to 5.1. It involves adding new files (5.0-to-5.1) and language/formatting improvements to the existing content (shared by several upgrade guides).
Closes#11577
* github.com:scylladb/scylladb:
doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1
doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1
This PR adds the missing upgrade guides for upgrading the ScyllaDB image to a patch release:
- ScyllaDB 5.0: /upgrade/upgrade-opensource/upgrade-guide-from-5.x.y-to-5.x.z/upgrade-guide-from-5.x.y-to-5.x.z-image/
- ScyllaDB Enterprise: /upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image/ (the file name is wrong and will be fixed with another PR)
In addition, the section regarding the recommended upgrade procedure has been improved.
Fixes https://github.com/scylladb/scylladb/issues/11450
Fixes https://github.com/scylladb/scylladb/issues/11452
Closes#11460
* github.com:scylladb/scylladb:
doc: update the commands to upgrade the ScyllaDB image
doc: fix the filename in the index to resolve the warnings and fix the link
doc: apply feedback by adding the step to load the new repo and fixing the links
doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
doc: update the Enterprise guide to include the Enterprise-only image file
doc: update the image files
doc: split the upgrade-image file to separate files for Open Source and Enterprise
doc: clarify the alternative upgrade procedures for the ScyllaDB image
doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z
It points to a private scylladb repo, which has no place in user-facing
documentation. For now there is no public replacement, but a similar
functionality is in the works for Scylla Manager.
Fixes: #11573
Closes#11580
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.
Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).
Fixes#11524.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11576
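The idea of rejecting the option at table-creation time, rather than failing during a memtable flush, can be sketched like this. The function and exception names are hypothetical, the upper bound of 1.0 is an assumption, and only the minimum of 6.71e-5 comes from the text above.

```python
MIN_SUPPORTED_FP_CHANCE = 6.71e-5  # minimum quoted in the commit message


class ConfigurationError(Exception):
    """Hypothetical error type for rejected table options."""


def validate_bloom_filter_fp_chance(fp_chance: float) -> float:
    """Validate the option up front, long before any sstable is written."""
    # Assumption: values above 1.0 are rejected as well.
    if not (MIN_SUPPORTED_FP_CHANCE <= fp_chance <= 1.0):
        raise ConfigurationError(
            f"bloom_filter_fp_chance must be between {MIN_SUPPORTED_FP_CHANCE} "
            f"and 1.0, got {fp_chance}"
        )
    return fp_chance


print(validate_bloom_filter_fp_chance(0.01))
try:
    validate_bloom_filter_fp_chance(1e-6)
except ConfigurationError as e:
    print("rejected:", e)
```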
Changes done to avoid pitfalls and fix issues of sstable-related unit tests
Closes#11578
* github.com:scylladb/scylladb:
test: Make fake sstables implicitly belong to current shard
test: Make it clearer that sstables::test::set_values() modify data size
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.
It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.
The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
Closes#11563
* github.com:scylladb/scylladb:
test/topology_raft_disabled: write basic raft upgrade test
test: setup logging in topology suites
Fake SSTables will be implicitly owned by the shard that created them,
allowing them to be called on procedures that assert the SSTables
are owned by the current shard, like the table's one that rebuilds
the sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By adding a param with default value, we make it clear in the interface
that the procedure modifies sstable data size.
It can happen one calls this function without noticing it overrides
the data size previously set using a different function.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.
It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.
The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
Make it possible to use logging from within tests in the topology
suites. The tests are executed using `pytest`, which uses a `pytest.ini`
file for logging configuration.
Also cleanup the `pytest.ini` files a bit.
In compatibility.md where we refer to the missing ability to add a GSI
to an existing table - let's refer to a new issue specifically about this
feature, instead of the old bigger issue about UpdateItem.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11568
when using ":attrs" attribute' from Nadav Har'El
This PR improves the testing for issue #5009 and fixes most of it (but
not all - see below). Issue #5009 is about what happens when a user
tries to use the name `:attrs` for an attribute - while Alternator uses
a map column with that name to hold all the schema-less attributes of an
item. The tests we had for this issue were partial, and missed the
worst cases which could result in Scylla crashing on specially-crafted
PutItem or UpdateItem requests.
What the tests missed were the cases that `:attrs` is used as a
**non-key**. So in this PR we add additional tests for this case,
several of them fail or even crash Scylla, and then we fix all these
cases.
Issue #5009 remains open because using `:attrs` as the name of a **key**
is still not allowed. But because it results in a clean error message
when attempting to create a table with such a key, I consider this
remaining problem very minor.
Refs #5009.
Closes#11572
* github.com:scylladb/scylladb:
alternator: fix crashes and errors when using ":attrs" attribute
alternator: improve tests for reserved attribute name ":attrs"
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.
The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".
This fix makes the relevant tests, which used to crash or fail, now pass.
This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.
Refs #5009
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
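The special-cased check can be sketched as follows. This is an illustration only: `is_stored_in_own_column` and `store_attribute` are hypothetical helpers, the schema is modeled as a set of column names, and an item is modeled as a plain dictionary.

```python
ATTRS_COLUMN_NAME = ":attrs"


def is_stored_in_own_column(schema_columns: set, attribute_name: str) -> bool:
    """Decide whether an attribute has a dedicated column in the schema."""
    if attribute_name == ATTRS_COLUMN_NAME:
        # A column with this name exists, but it is the map itself: an
        # attribute called ":attrs" must be stored *in* the map.
        return False
    return attribute_name in schema_columns


def store_attribute(item: dict, schema_columns: set, name: str, value) -> None:
    if is_stored_in_own_column(schema_columns, name):
        item[name] = value
    else:
        item.setdefault(ATTRS_COLUMN_NAME, {})[name] = value


columns = {"p", "c", ATTRS_COLUMN_NAME}
item = {}
store_attribute(item, columns, "p", "key")
store_attribute(item, columns, "color", "red")
store_attribute(item, columns, ":attrs", "legal in DynamoDB")
print(item)
```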
As explained in issue #5009, Alternator currently forbids the special
attribute name ":attrs", whereas DynamoDB allows any string of appropriate
length (including the specific string ":attrs") to be used.
We had only a partial test for this incompatibility, and this patch
improves the testing of this issue. In particular, we were missing a
test for the case that the name ":attrs" was used for a non-key
attribute (we only tested the case it was used as a sort key).
It turns out that Alternator crashes on the new test, when the test tries
to write to a non-key attribute called ":attrs", so we needed to mark
the new test with "skip". Moreover, it turns out that different code paths
handle the attribute name ":attrs" differently, and also crash or fail
in other ways - so we added more than one xfailing and skipped tests
that each fails in a different place (and also a few tests that do pass).
As usual, we checked that the new tests pass on DynamoDB.
Refs #5009
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This tiny series fixes some small errors and out-of-date information in Alternator documentation and code comments.
Closes#11547
* github.com:scylladb/scylladb:
alternator ttl: comment fixes
docs/alternator: fix mention of old alternator-test directory
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.
This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.
Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
the borrower is then responsible for cleaning up the object.
Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.
Closes#11558
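The three-function interface described above can be sketched with asyncio. This is a simplified model, not the `Pool` class from test.py: the constructor signature, the `total` counter and the use of `asyncio.Condition` are assumptions made for the illustration.

```python
import asyncio


class Pool:
    """Sketch of a pool where objects are built lazily, only inside get()."""

    def __init__(self, max_size, build):
        self.max_size = max_size
        self.build = build           # async factory for new objects
        self.free = []
        self.total = 0               # objects that are free, borrowed or being built
        self.cond = asyncio.Condition()

    async def get(self):
        async with self.cond:
            await self.cond.wait_for(
                lambda: bool(self.free) or self.total < self.max_size)
            if self.free:
                return self.free.pop()
            self.total += 1          # reserve a slot before building
        try:
            return await self.build()
        except BaseException:
            async with self.cond:
                self.total -= 1      # give the slot back if the build failed
                self.cond.notify()
            raise

    async def put(self, obj):
        async with self.cond:
            self.free.append(obj)
            self.cond.notify()       # wake up an outstanding get()

    async def steal(self):
        # The borrower keeps (and cleans up) the object; only the slot is freed.
        async with self.cond:
            self.total -= 1
            self.cond.notify()


async def demo():
    built = []

    async def build():
        built.append(len(built))
        return built[-1]

    pool = Pool(2, build)
    a = await pool.get()             # builds object 0
    await pool.put(a)                # returned clean, so it can be reused
    b = await pool.get()             # reuses object 0, nothing is built
    await pool.steal()               # b is dirty: free its slot, do not return it
    c = await pool.get()             # builds object 1, on demand only
    return [a, b, c, len(built)]


print(asyncio.run(demo()))
```

Because building happens only in `get`, a `steal` with no waiting borrower frees a slot without creating a replacement object.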
Which initializes LSA with use_standard_allocator_segment_pool_backend()
running the logalloc_test suite on the standard allocator segment pool
backend. To avoid duplicating the test code, the new test-file pulls in
the test code via #include. I'm not proud of it, but it works and we
test LSA with both the debug and standard memory segment stores without
duplicating code.
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.
This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.
Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not available either.
The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are grouped into several
"indices" and a socket is created for each index. Thus, the "gossiping" index that
includes non-gossiper verbs will create a topology-less socket for all
verbs in it. Worse -- raft sends messages w/o solicited topology, the
corresponding socket is created with the assumption that the peer lives
in default dc and rack which doesn't match the local nodes' dc/rack and
the whole index group gets the "randomly" configured socket.
Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"
* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
messaging_service: Fix gossiper verb group
messaging_service: Mind the absence of topology data when creating sockets
messaging_service: Templatize and rename remove_rpc_client_one
Use the new seastar option to instruct seastar to not initialize and use
the seastar allocator, relying on the standard allocator instead.
Configure LSA with the standard allocator based segment store backend:
* scylla-types reserves 1MB for LSA -- in theory nothing here should use
LSA, but just in case...
* scylla-sstable reserves 100MB for LSA, to avoid excessive thrashing in
the sstable index caches.
With this, tools now should allocate memory on demand, without reserving
a large chunk of (or all of) the available memory, as regular seastar
apps do.
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For this latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
There's a bunch of helpers for CDC gen service in db/system_keyspace.cc. All are static and use global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods via it, thus reducing the global qctx usage.
Closes#11557
* github.com:scylladb/scylladb:
system_keyspace: De-static get_cdc_generation_id()
system_keyspace: De-static cdc_is_rewritten()
system_keyspace: De-static cdc_set_rewritten()
system_keyspace: De-static update_cdc_generation_id()
- Raise on response not HTTP 200 for `.get_text()` helper
- Fix API paths
- Close and start a fresh driver when restarting a server and it's the only server in the cluster
- Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions)
Closes#11496
* github.com:scylladb/scylladb:
test.py: handle duplicate result from driver
test.py: log server restarts for topology tests
test.py: log actions for topology tests
Revert "test.py: restart stopped servers before...
test.py: ManagerClient API fix return text
test.py: ManagerClient raise on HTTP != 200
test.py: ManagerClient fix paths to updated resource
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend which now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
one -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for debug build,
we want something more similar to the seastar standard memory allocator
based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
Said test creates two vectors, the vector storage being allocated with
the default allocator, while its content being allocated on LSA. If an
exception is thrown however, both are freed via the default allocator,
triggering an assert in LSA code. Move the cleanup into a `defer()` so
the correct cleanup sequence is executed even on exceptions.
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.
This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:
    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();
If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subsequent stop
will queue its stopping lambdas after barrier's ones.
However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.
fixes: #11303
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11553
The test is supposed to give a helpful error message when the user forgets to
run --populate before the benchmark. But this must have become broken at some
point, because execute_cql() terminates the program with an unhelpful
("unconfigured table config") message, which doesn't mention --populate.
Fix that by catching the exception and adding the helpful tip.
Closes#11533
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.
The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).
Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.
Fixes#11549
Closes#11555
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes#11202
Sometimes the driver calls the callback twice on a ready (done) future,
with a None result. Log it and avoid setting the local future twice.
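
A minimal sketch of such a guard, using a plain asyncio future rather than
the driver's own future type. The names `make_callback`, `on_done` and
`local_future` are hypothetical and only illustrate the idea:

```python
import asyncio
import logging

logger = logging.getLogger("sketch")


def make_callback(local_future):
    """Return a callback that completes local_future at most once."""
    def on_done(result):
        if local_future.done():
            # Repeated call (possibly with a None result): log and ignore.
            logger.warning("callback called again with %r, ignoring", result)
            return
        local_future.set_result(result)
    return on_done


async def demo():
    fut = asyncio.get_running_loop().create_future()
    cb = make_callback(fut)
    cb("rows")  # first call sets the result
    cb(None)    # second call is logged and ignored
    return await fut


result = asyncio.run(demo())
```

Without the `done()` check, the second call would raise
`asyncio.InvalidStateError` from `set_result()`.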
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Introduces support for splitting large partitions during compaction. Today, compaction can only split input data at partition boundaries, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction also cannot fulfill its promise, as the file storing the large partition can only be released once exhausted.
The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order in which the disjoint files should be opened for incremental reading.
The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run.
The final step was to introduce the option for compaction to split a large partition being written once it has exceeded the size threshold.
What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant established by the second step.
Closes#11233
* github.com:scylladb/scylladb:
test: Add test for large partition splitting on compaction
compaction: Add support to split large partitions
sstable: Extend sstable_run to allow disjointness on the clustering level
sstables: simplify will_introduce_overlapping()
test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
test: lib: Fix inefficient merging of mutations in make_sstable_containing()
sstables: Keep track of first partition's first pos and last partition's last pos
sstables: Rename min/max position_range to a descriptive name
sstables_manager: Add sstable metadata reader concurrency semaphore
sstables: Add ability to find first or last position in a partition
teardown..."
This reverts commit df1ca57fda.
In order to prevent timeouts on teardown queries, the previous commit
added functionality to restart servers that were down. This issue is
fixed in fc0263fc9b so there's no longer need to restart stopped servers
on test teardown.
For ManagerClient request API, don't return status, raise an exception.
Server side errors are signaled by status 500, not text body.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Halted background fibers render raft server effectively unusable, so
report this explicitly to the clients.
Fix: #11352
Closes#11370
* github.com:scylladb/scylladb:
raft server, status metric
raft server, abort group0 server on background errors
raft server, provide a callback to handle background errors
raft server, check aborted state on public server public api's
Pool.get() might have waiting callers, so if an item is not returned
to the pool after use, tell the pool to add a new one and tell the pool
an entry was taken (used for total running entries, i.e. clusters).
Use it when a ScyllaCluster is dirty and not returned.
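
The idea can be sketched roughly as follows. This is not the actual
test/pylib pool; the class layout and the `replace_dirty` method name are
assumptions made for illustration only:

```python
import asyncio


class Pool:
    """Hypothetical async pool that keeps at most max_size items alive."""

    def __init__(self, max_size, build):
        self.max_size = max_size
        self.build = build            # async factory for a new item
        self.total = 0                # items alive: pooled or borrowed
        self.pool = asyncio.Queue()

    async def get(self):
        if self.pool.empty() and self.total < self.max_size:
            self.total += 1
            return await self.build()
        return await self.pool.get()  # may wait until an item shows up

    async def put(self, item):
        await self.pool.put(item)

    async def replace_dirty(self):
        # The borrowed item is dropped by the caller and a fresh one is
        # queued, so total stays the same and waiters in get() wake up.
        await self.pool.put(await self.build())


async def demo():
    built = 0

    async def build():
        nonlocal built
        built += 1
        return f"cluster-{built}"

    pool = Pool(1, build)
    first = await pool.get()
    waiter = asyncio.create_task(pool.get())  # pool is full, so this waits
    await asyncio.sleep(0)                    # let the waiter start
    await pool.replace_dirty()                # first is dirty, never returned
    second = await waiter
    return first, second, pool.total


first, second, total = asyncio.run(demo())
```

If the dirty item were simply dropped without telling the pool, the waiter
in `get()` would never be woken.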
While there improve logging and docstrings.
Issue reported by @kbr-.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11546
These two are just getting in the way when touching inter-component
dependencies around messaging service. Without them, messaging service
start/stop just looks like that of any other service out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11535
It's been ~1 year (2bf47c902e) since we set restrict_dtcs
config option to WARN, meaning users have been warned about the
deprecation process of DTCS.
Let's set the config to TRUE, meaning that create and alter statements
specifying DTCS will be rejected at the CQL level.
Existing tables will still be supported. But the next step will
be about throwing DTCS code into the shadow realm, and after that,
Scylla will automatically fall back to STCS (or ICS) for users who
ignored the deprecation process.
Refs #8914.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11458
When a server is down, the driver expects multiple schema timeouts
within the same request to handle it properly.
Found by @kbr-
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11544
If user stops off-strategy via API, compaction manager can decide
to give up on it completely, so data will sit unreshaped in
maintenance set, preventing it from being compacted with data
in the main set. That's problematic because it will probably lead
to a significant increase in read and space amplification until
off-strategy is triggered again, which cannot happen anytime
soon.
Let's handle it by moving data in maintenance set into main one,
even if unreshaped. Then regular compaction will be able to
continue from where off-strategy left off.
Fixes#11543.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11545
This patch fixes a few errors and out-of-date descriptions in comments
in alternator/ttl.cc. No functional changes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The directory that used to be called alternator-test is now (and has
been for a long time) really test/alternator. So let's fix the
references to it in docs/alternator/alternator.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.
fixes: #11465
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a socket is created to serve a verb there may be no topology
information regarding the target node. In this case current code
configures socket as if the peer node lived in "default" dc and rack of
the same name. If topology information appears later, the client is not
re-connected, even though it could provide a more relevant configuration
(e.g. -- w/o encryption).
This patch checks if the topology info is needed (sometimes it's not)
and if missing it configures the socket in the most restrictive manner,
but notes that the socket ignored the topology on creation. When
topology info appears -- and this happens when a node joins the cluster
-- the messaging service is kicked to drop all sockets that ignored the
topology, so that they reconnect later.
The mentioned "kick" comes from storage service on-join notification.
A more correct fix would be for topology to have an on-change notification
and for messaging service to subscribe to it, but there are two cons:
- currently dc/rack do not change on the fly (though they can, e.g. if
gossiping property file snitch is updated without restart) and
topology update effectively comes from a single place
- updating topology on token-metadata is not like topology.update()
call. Instead, a clone of token metadata is created, then update
happens on the clone, then the clone is committed into t.m. It's
possible to find out at commit time which nodes changed their
topology, but since that only happens on join, this complexity likely
isn't worth the effort (yet)
fixes: #11514
fixes: #11492
fixes: #11483
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It actually finds and removes a client, and in its new form it also
applies a filtering function to it, so a better name is called for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Adds support for splitting large partitions during compaction.
Large partitions introduce many problems, like memory overhead and
breaking the incremental compaction promise. We want to split large
partitions across fixed-size fragments. We'll allow a partition
to exceed size limit by 10%, as we don't want to unnecessarily split
partitions that just crossed the limit boundary.
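
The 10% tolerance can be illustrated with a small sketch. The function name
and the byte-based interface are hypothetical; the real check lives in the
C++ compaction writer and may look different:

```python
def should_split(partition_bytes: int, limit_bytes: int,
                 slack_percent: int = 10) -> bool:
    """Split only once the partition exceeds the limit by more than the slack.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return partition_bytes * 100 > limit_bytes * (100 + slack_percent)
```

With a limit of 1000 bytes, a partition of 1100 bytes is still left alone,
while one of 1101 bytes is split.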
To avoid having to open a minimum of 2 fragments in a read, the partition
tombstone will be replicated to every fragment storing the
partition.
The splitting isn't enabled by default, and can be used by
strategies that are run aware like ICS. LCS still cannot support
it as it's still using physical level metadata, not run id.
An incremental reader for sstable runs will follow soon.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
An element S1 is completely ordered before S2, if S1's last key is
lower than S2's first key.
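
As a rough illustration of that ordering rule (not the actual
will_introduce_overlapping() code), with integer keys standing in for
partition keys and a hypothetical `Fragment` type:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    first_key: int
    last_key: int


def ordered_before(s1: Fragment, s2: Fragment) -> bool:
    """S1 is completely ordered before S2 if S1 ends before S2 begins."""
    return s1.last_key < s2.first_key


def is_disjoint_run(fragments) -> bool:
    """A run has no key overlap if neighbours, sorted by first key, are ordered."""
    ordered = sorted(fragments, key=lambda f: f.first_key)
    return all(ordered_before(a, b) for a, b in zip(ordered, ordered[1:]))
```

Fragments covering [0, 5] and [10, 20] form a disjoint run; [0, 10] and
[10, 20] do not, because key 10 appears in both.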
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
make_sstable_containing() was absurdly slow when merging thousands of
mutations belonging to the same key, as it was unnecessarily copying
the mutation for every merge, producing bad complexity.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With the first partition's first position and the last partition's last
position, we'll be able to determine which fragments composing an
sstable run store a large partition that was split.
Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.
Fixes#10637.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The new descriptive name is important to make a distinction when
sstable stores position range for first and last rows instead
of min and max.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new method allows sstable to load the first row of the first
partition and last row of last partition.
That's useful for incremental reading of sstable run which will
be split at clustering boundary.
To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We introduce `server_get_config` to fetch the entire configuration dict
and `update_config` to update a value under the given key.
Closes#11493
* github.com:scylladb/scylladb:
test/pylib: APIs to read and modify configuration from tests
test/pylib: ScyllaServer: extract _write_config_file function
test/pylib: ScyllaCluster: extend ActionReturn with dict data
test/pylib: ManagerClient: introduce _put_json
test/pylib: ManagerClient: replace `_request` with `_get`, `_get_text`
test: pylib: store server configuration in `ScyllaServer`
`_request` performed a GET request and extracted a text body out of the
response.
Split it into `_get`, which only performs the request, and `_get_text`,
which calls `_get` and extracts the body as text.
Also extract a `_resource_uri` function which will be used for other
request types.
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.
The suite will be used to test the Raft upgrade procedure.
The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.
Closes#11487
* github.com:scylladb/scylladb:
test.py: PythonTestSuite: sum default config params with user-provided ones
test: add a topology suite with Raft disabled
test: pylib: use Python dicts to manipulate `ScyllaServer` configuration
test: pylib: store `config_options` in `ScyllaServer`
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.
Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.
Fixes#11508
Closes#11536
Due to slow debug machines timing out, bump up all timeouts
significantly.
The cause was ExecutionProfile request_timeout. Also set a high
heartbeat timeout and bump already set timeouts to be safe, too.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11516
Per-partition rate limiting added a new error type which should be
returned when Scylla decides to reject an operation due to per-partition
rate limit being exceeded. The new error code requires drivers to
negotiate support for it, otherwise Scylla will report the error as
`Config_error`. The existing error code override logic works properly,
however due to a mistake Scylla will report the `Config_error` code even
if the driver correctly negotiated support for it.
This commit fixes the problem by specifying the correct error code in
`rate_limit_exception`'s constructor.
Tested manually with a modified version of the Rust driver which
negotiates support for the new error. Additionally, tested what happens
when the driver doesn't negotiate support (Scylla properly falls back to
`Config_error`).
Branches: 5.1
Fixes: #11517
Closes#11518
This series introduces two configurable options when working with TWCS tables:
- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - which forbids the user from creating TWCS tables whose window count (buckets) is past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.
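
A sketch of how the two guard rails could combine. It assumes the window
count is the default TTL divided by the window size, rounded up, and that
the tri-mode value is one of "true", "warn" or "false"; both are
assumptions, and `check_twcs` is a hypothetical helper, not Scylla's
implementation:

```python
import math


def check_twcs(default_ttl_s: int, window_size_s: int,
               max_window_count: int = 50,
               restrict_default_ttl: str = "warn"):
    """Return a list of (severity, message) pairs for a TWCS table definition."""
    problems = []
    if default_ttl_s == 0:
        message = "TWCS table created without default_time_to_live"
        if restrict_default_ttl == "true":
            problems.append(("error", message))
        elif restrict_default_ttl == "warn":
            problems.append(("warning", message))
    elif max_window_count > 0:  # 0 disables the window count check
        # Assumed formula: number of windows needed to cover the TTL.
        count = math.ceil(default_ttl_s / window_size_s)
        if count > max_window_count:
            problems.append(
                ("error", f"window count {count} exceeds {max_window_count}"))
    return problems
```

For example, a 90-day TTL with 1-day windows yields 90 windows and is
rejected under the default threshold of 50, but accepted when the threshold
is set to 0.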
Refs: #6923
Fixes: #9029
Closes#11445
* github.com:scylladb/scylladb:
tests: cql_query_test: add mixed tests for verifying TWCS guard rails
tests: cql_query_test: add test for TWCS window size
tests: cql_query_test: add test for TWCS tables with no TTL defined
cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
cql: add max window restriction for TimeWindowCompactionStrategy
time_window_compaction_strategy: reject invalid window_sizes
cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
* seastar 2b2f6c08...cbb0e888 (10):
> memory: allow user to select allocator to be used at runtime
> perftune.py: correct typos
> Merge 'seastar-addr2line: support more flexible syslog-style backtraces' from Benny Halevy
> Fix instruction count for start_measuring_time
> build: s/c-ares::c-ares/c-ares::cares/
> Merge 'shared_ptr_debug_helper: turn assert into on_internal_error_abort' from Benny Halevy
> test: fix use after free in the loopback socket
> doc/tutorial.md: fix docker command for starting hello-world_demo
> httpd: add a ctor without addr parameter
> dns: dns_resolver: sock_entry: move-construct tcp/udp entries in place
Closes#11526
"
This set makes messaging service notify connection drop listeners
when connection is dropped for _any_ reason and cleans things up
around it afterwards
"
* 'br-messaging-notify-connection-drop' of https://github.com/xemul/scylla:
messaging_service: Relax connection drop on re-caching
messaging_service: Simplify remove_rpc_client_one()
messaging_service: Notify connection drop when connection is removed
`feature_service` provided two sets of features: `known_feature_set` and
`supported_feature_set`. The purpose of both and the distinction between
them was unclear and undocumented.
The 'supported' features were gossiped by every node. Once a feature is
supported by every node in the cluster, it becomes 'enabled'. This means
that whatever piece of functionality is covered by the feature, it can
be used by the cluster from now on.
The 'known' set was used to perform feature checks on node start; if the
node saw that a feature is enabled in the cluster, but the node does not
'know' the feature, it would refuse to start. However, if the feature
was 'known', but wasn't 'supported', the node would not complain. This
means that we could in theory allow the following scenario:
1. all nodes support feature X.
2. X becomes enabled in the cluster.
3. the user changes the configuration of some node so feature X will
become unsupported but still known.
4. The node restarts without error.
So now we have a feature X which is enabled in the cluster, but not
every node supports it. That does not make sense.
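
The scenario can be modelled in a few lines. This is a toy model with
hypothetical function names, not the feature_service code:

```python
def enabled_features(supported_by_node):
    """A feature is enabled once every node in the cluster supports it."""
    if not supported_by_node:
        return set()
    return set.intersection(*supported_by_node.values())


def can_start(checked_set, cluster_enabled):
    """Start-up check: every enabled feature must be in the checked set."""
    return cluster_enabled <= checked_set


# Steps 1-2: all nodes support X, so X becomes enabled.
enabled = enabled_features({"n1": {"X"}, "n2": {"X"}})

# Step 3: on n1, X becomes unsupported but is still known.
known_n1, supported_n1 = {"X"}, set()

# Step 4: checking against 'known' lets the node restart ...
check_with_known = can_start(known_n1, enabled)
# ... while checking against 'supported' would refuse to start.
check_with_supported = can_start(supported_n1, enabled)
```

Here `check_with_known` is true and `check_with_supported` is false, which
is exactly the inconsistency described above.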
It is not clear whether it was accidental or purposeful that we used the
'known' set instead of the 'supported' set to perform the feature check.
What I think is clear, is that having two sets makes the entire thing
unnecessarily complicated and hard to think about.
Fortunately, at the base to which this patch is applied, the sets are
always the same. So we can easily get rid of one of them.
I decided that the name which should stay is 'supported'; I think it's
more specific than 'known' and it matches the name of the corresponding
gossiper application state.
Closes#11512
Since we fail to write files to $USER/.config on Jenkins jobs, we need
an option to skip installing systemd units.
Let's add --without-systemd to do that.
Also, to detect the option availability, we need to increment
relocatable package version.
See scylladb/scylla-dtest#2819
Closes#11345
Previously, if the suite.yaml file provided
`extra_scylla_config_options` but didn't provide values for `authorizer`
or `authenticator` inside the config options, the harness wouldn't give
any defaults for these keys. It would only provide defaults for these
keys if suite.yaml didn't specify `extra_scylla_config_options` at all.
It makes sense to give the user the ability to provide extra options
while relying on harness defaults for `authenticator` and `authorizer`
if the user doesn't care about them.
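
A small sketch of the intended behaviour. The default values and the
`effective_options` helper are illustrative assumptions, not the harness's
real defaults or code:

```python
from typing import Optional

# Illustrative defaults only; the real harness defaults may differ.
DEFAULT_OPTIONS = {
    "authenticator": "PasswordAuthenticator",
    "authorizer": "CassandraAuthorizer",
}


def effective_options(suite_options: Optional[dict]) -> dict:
    """Defaults first, user-provided options second, so the user wins."""
    return DEFAULT_OPTIONS | (suite_options or {})
```

A suite that only sets some unrelated option still gets both defaults,
and a suite that sets `authorizer` explicitly overrides just that key.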
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.
The suite will be used to test the Raft upgrade procedure.
The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.
Previously we used a formattable string to represent the configuration;
values in the string were substituted by Python's formatting mechanism
and the resulting string was stored to obtain the config file.
This approach had some downsides, e.g. it required boilerplate work to
extend: to add a new config option, you would have to modify this
template string.
Instead we can represent the configuration as a Python dictionary. Dicts
are easy to manipulate, for example you can sum two dicts; if a key
appears in both, the second dict 'wins':
```
{1:1} | {1:2} == {1:2}
```
This makes the configuration easy to extend without having to write
boilerplate: if the user of `ScyllaServer` wants to add or override a
config option, they can simply add it to the `config_options` dict and
that's it - no need to modify any internal template strings in
`ScyllaServer` implementation like before. The `config_options` dict is
simply summed with the 'base' config dict of `ScyllaServer`
(`config_options` is the right summand so anything in there overrides
anything in the base dict).
An example of this extensibility is the `authenticator` and `authorizer`
options which no longer appear in `scylla_cluster.py` module after this
change, they only appear in the suite.yaml file.
Also, use "workdir" option instead of specifying data dir, commitlog
dir etc. separately.
Previously the code extracted `authenticator` and `authorizer` keys from
the config options and stored them.
Store the entire dict instead. The new code is easier to extend if we
want to make more options configurable.
When messaging_service::get_rpc_client() picks up cached socket and
notices error on it, it drops the connection and creates a new one. The
method used to drop the connection is the one that re-lookups the verb
index again, which is excessive. Tune this up while at it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The output of test/alternator/run ends in Scylla's full log file, where
it is hard to understand which log messages are related to which test.
In this patch, we add a log message (using the new /system/log REST API)
every time a test is started and ends.
The messages look like this:
INFO 2022-08-29 18:07:15,926 [shard 0] api - /system/log:
test/alternator: Starting test_ttl.py::test_describe_ttl_without_ttl
...
INFO 2022-08-29 18:07:15,930 [shard 0] api - /system/log:
test/alternator: Ended test_ttl.py::test_describe_ttl_without_ttl
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add a new REST API operation, taking a log level and a message, and
printing it into the Scylla log.
This can be useful when a test wants to mark certain positions in the
log (e.g., to see which other log messages we get between the two
positions). An alternative way to achieve this could have been for the
test to write directly into the log file - but an on-disk log file is
only one of the logging options that Scylla supports, and the approach
in this patch allows adding log messages regardless of how Scylla keeps
the logs.
The motivation for this feature is that in the following patch the
test/alternator framework will add log messages when starting and
ending tests, which can help debug test failures.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, we log at "info" level how much time remained at the end of
a full TTL scan until the next scanning period (we sleep for that time).
If the scan was slower than the period, we didn't print anything.
Let's print a warning in this case - it can be useful for debugging,
and also users should know when their desired scan period is not being
honored because the full scan is taking longer than the desired scan
period.
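
The logic can be sketched as follows. The real code is C++ inside
Alternator's expiration thread; the function name and log texts here are
hypothetical:

```python
import logging

logger = logging.getLogger("ttl-sketch")


def time_until_next_scan(scan_started: float, scan_ended: float,
                         period: float) -> float:
    """Return how long to sleep before the next scan, logging either way."""
    elapsed = scan_ended - scan_started
    remaining = period - elapsed
    if remaining > 0:
        logger.info("scan done, sleeping %.2f seconds until next period",
                    remaining)
        return remaining
    # The scan was slower than the period: warn instead of staying silent.
    logger.warning("scan took %.2f seconds, longer than the %.2f second period",
                   elapsed, period)
    return 0.0
```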
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Alternator has the "alternator_ttl_period_in_seconds" parameter for
controlling how often the expiration thread looks for expired items to
delete. It is usually a very large number of seconds, but for tests
to finish quickly, we set it to 1 second.
With 1 second expiration latency, test/alternator/test_ttl.py took 5
seconds to run.
In this patch, we change the parameter to allow a floating-point number
of seconds instead of just an integer. Then, this allows us to halve the
TTL period used by tests to 0.5 seconds, and as a result, the run time of
test_ttl.py halves to 2.5 seconds. I think this is fast enough for now.
I verified that even if I change the period to 0.1, there is no noticeable
slowdown to other Alternator tests, so 0.5 is definitely safe.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Most of the Alternator TTL tests are extremely slow on DynamoDB because
item expiration may be delayed up to 24 hours (!), and in practice for
10 to 30 minutes. Because of this, we marked most of these tests
with the "veryslow" mark, causing them to be skipped by default - unless
pytest is given the "--runveryslow" option.
The result was that the TTL tests were not run in the normal test runs,
which can allow regressions to be introduced (luckily, this hasn't happened).
However, this "veryslow" mark was excessive. Many of the tests are very
slow only on DynamoDB, but aren't very slow on Scylla. In particular,
many of the tests involve waiting for an item to expire, something that
happens after the configurable alternator_ttl_period_in_seconds, which
is just one second in our tests.
So in this patch, we remove the "veryslow" mark from 6 of the Alternator TTL
tests, and instead use two new fixtures - waits_for_expiration and
veryslow_on_aws - to only skip the test when running on DynamoDB or
when alternator_ttl_period_in_seconds is high - but in our usual test
environment they will not get skipped.
Because 5 of these 6 tests wait for an item to expire, they take one
second each and this patch adds 5 seconds to the Alternator test
runtime. This is unfortunate (it's more than 25% of the total Alternator
test runtime!) but not a disaster, and we plan to reduce this 5 second
time in the following patch, by decreasing the TTL scanning period even
further.
This patch also increases the timeout of several of these tests, to 120
seconds from the previous 10 seconds. As mentioned above, normally,
these tests should always finish in alternator_ttl_period_in_seconds
(1 second) with a single scan taking less than 0.2 seconds, but in
extreme cases of debug builds on overloaded test machines, we saw even
60 seconds being passed, so let's increase the maximum. I also needed
to make the sleep time between retries smaller, not a function of the
new (unrealistic) timeout.
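
A generic sketch of such a retry loop, where the sleep between retries is a
fixed short interval and not derived from the (large) timeout. `wait_until`
is a hypothetical helper, not the function used in test_ttl.py:

```python
import time


def wait_until(predicate, timeout: float = 120.0, interval: float = 0.1) -> bool:
    """Poll predicate until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # short, fixed sleep, independent of timeout
```

In the normal case the predicate becomes true after a few short sleeps, so
the generous timeout only matters on overloaded machines.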
4 more tests remain "veryslow" (and won't run by default) because they
take 5-10 seconds each (e.g., a test which waits to see that an item
does *not* get expired, and a test involving writing a lot of data).
We should reconsider this in the future - to perhaps run these tests in
our normal test runs - but even for now, the 6 extra tests that we
start running are a much better protection against regressions than what
we had until now.
Fixes#11374
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
x
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test for the metrics generated by the background
expiration thread run for Alternator's TTL feature.
We test three of the four metrics: scylla_expiration_scan_passes,
scylla_expiration_scan_table and scylla_expiration_items_deleted.
The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the
number of times that this node took over another node's expiration duty,
so it requires a multi-node cluster to test, and we can't test it in the
single-node cluster test framework.
To see TTL expiration in action this test may need to wait up to the
setting of alternator_ttl_period_in_seconds. For a setting of 1
second (the default set by test/alternator/run), this means this
test can take up to 1 second to run. If alternator_ttl_period_in_seconds
is set higher, the test is skipped unless --runveryslow is requested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The purpose of this PR is to update the information about the default SStable format.
It
Closes#11431
* github.com:scylladb/scylladb:
doc: simplify the information about default formats in different versions
doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format
doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page
doc: fix additional information regarding the ME format on the SStable 3.x page
doc: add the ME format to the table
add a comment to remove the information when the documentation is versioned (in 5.1)
doc: replace Scylla with ScyllaDB
doc: fix the formatting and language in the updated section
doc: fix the default SStable format
"
The test in question plays with snitches to simulate the topology
over which tokens are spread. This set replaces explicit snitch
usage with temporary topology object.
Some snitch traces are still left, but those are for token_metadata
internal which still call global snitch for DC/RACK.
"
* 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla:
network_topology_strategy_test: Use topology instead of snitch
network_topology_strategy_test: Populate explicit topology
dirty_memory_manager tracks lsa regions (memtables) under region_group:s,
in order to be able to pick up the largest memtable as a candidate for
flushing.
Just as region_group:s contain regions, they can also contain other
region_group:s in a nested structure. It also tracks the nested region_group
that contains the largest region in a binomial heap.
This latter facility is no longer used. It saw use when we had the system
dirty_memory_manager nested under the user dirty_memory_manager, but
that proved too complicated so it was undone. We still nest a virtual
region_group under the real region_group, and in fact it is the
virtual region_group that holds the memtables, but it is accessed
directly to find the largest memtable (region_group::get_largest_region)
and so all the mechanism that sorts region_group:s is bypassed.
Start to dismantle this house of cards by removing the subgroup
sorting. Since the hierarchy has exactly one parent and one child,
it's clearly useless. This is seen by the fact that we can just remove
everything related.
We still need the _subgroups member to hold the virtual region_group;
it's replaced by a vector. I verified that the non-intrusive vector
is exception safe since push_back() happens at the very end; in any
case this is early during setup where we aren't under memory pressure.
A few tests that check the removed functionality are deleted.
Closes#11515
A compaction group can be defined as a set of files that can be compacted together. Today, all sstables belonging to a table in a given shard belong to the same group. So we can say there's one group per table per shard. As we want to eventually allow isolation of data that shouldn't be mixed, e.g. data from different vnodes, we want to have more than one group per table per shard. That's why compaction groups are being introduced here.
Today, all memtables and sstables are stored in a single structure per table. After compaction groups, there will be memtables and sstables for each group in the table.
As we're taking an incremental approach, the table still supports a single group. But work was done on preparing the table for supporting multiple groups. Completing that work is actually the next step. Also, a procedure for deriving the group from a token is introduced, but today it always returns the single group owned by the table. Once multiple groups are supported, that procedure should be implemented to map a token to a group.
No semantics was changed by this series.
Closes#11261
* github.com:scylladb/scylladb:
replica: Move memtables to compaction_group
replica: move compound SSTable set to compaction group
replica: move maintenance SSTable set to compaction_group
replica: move main SSTable set to compaction_group
replica: Introduce compaction_group
replica: convert table::stop() into coroutine
compaction_manager: restore indentation
compaction_manager: Make remove() and stop_ongoing_compactions() noexcept
test: sstable_compaction_test: Don't reference main sstable set directly
test: sstable_utils: Set data size fields for fake SSTable
test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
Introduce a task manager for observing and managing long-running, asynchronous tasks
in Scylla, with a user-facing interface. It will allow listing of tasks, getting detailed
task status and progress, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.
At first it will support repair and compaction tasks, and possibly more in the future.
Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.
Task manager's tasks are implemented in two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overridden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.
Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.
Users can access the task manager REST API to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish
To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.
Fixes: #9809
Closes#11216
* github.com:scylladb/scylladb:
test: task manager api test
task_manager: test api layer implementation
task_manager: add test specific classes
task_manager: test api layer
task_manager: api layer implementation
task_manager: api layer
task_manager: keep task_manager reference in http_context
start sharded task manager
task_manager: create task manager object
This patch adds a set of 10 scenarios that were uncovered during additional testing.
In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled -
may break the guardrails safe-mode. The situations covered are:
- STCS->TWCS with no TTL defined
- STCS->TWCS with small TTL
- STCS->TWCS with large TTL value
- TWCS table with small to large TTL
- No TTL TWCS to large TTL and then small TTL
- twcs_max_window_count LiveUpdate - Decrease TTL
- twcs_max_window_count LiveUpdate - Switch CompactionStrategy
- No TTL TWCS table to STCS
- Large TTL TWCS table, modify attribute other than compaction and default_time_to_live
- Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined
This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy
with an incorrect number of compaction windows.
The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the
test in order to ensure that users can effectively disable the enforcement, should they want.
This patch adds a testcase for TimeWindowCompactionStrategy tables created with no
default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_without_default_ttl
parameter in order to determine whether TWCS tables without TTL should be forbidden or not.
The test replays all 3 possible variations of the tri_mode_restriction and verifies tables
are correctly created/altered according to the current setting on the replica which receives
the request.
TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth.
However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0.
The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn":
We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be set, to guard against applications that ingest data while omitting the `USING TTL` keyword.
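The three modes can be illustrated with a minimal Python sketch. This is not the code from the patch: the names `TriMode`, `RestrictedTableError` and `check_twcs_default_ttl` are hypothetical, and the sketch only models the decision described above (reject on "true", warn on "warn", allow on "false").

```python
import enum
from typing import Optional


class TriMode(enum.Enum):
    """Hypothetical model of a tri-mode restriction option."""
    TRUE = "true"
    FALSE = "false"
    WARN = "warn"


class RestrictedTableError(Exception):
    """Raised when the restriction is enforced (mode "true")."""


def check_twcs_default_ttl(mode: TriMode, default_ttl_seconds: int) -> Optional[str]:
    """Return None if the table is fine, a warning text in "warn" mode,
    or raise RestrictedTableError in "true" mode."""
    # A positive default TTL is always acceptable; "false" disables the check.
    if default_ttl_seconds > 0 or mode is TriMode.FALSE:
        return None
    message = ("TWCS table has no default_time_to_live; "
               "compaction windows may grow without bound")
    if mode is TriMode.TRUE:
        raise RestrictedTableError(message)
    return message  # TriMode.WARN: let the statement through, but warn
```

With the default "warn" mode, a TWCS table without a default TTL is still created and the caller gets the warning text back; switching to "true" turns the same situation into an error.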
The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation where users of TWCS underestimate their number of window buckets. Unfortunately, scenarios in which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough.
Such a configuration is known to only harm a workload: as more and more windows are created, the number of SSTables grows at the same pace, and the situation only gets worse as the number of shards increases.
This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check.
Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.
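To make the ratio concrete, here is a small Python sketch of the check. It is an illustration only: the function names are hypothetical, and integer division is an assumption (the real code may round differently). It does follow the rules stated above: a threshold of 0 skips the check, and a default TTL of 0 is not rejected.

```python
from datetime import timedelta


def twcs_window_count(default_ttl: timedelta, window_size: timedelta) -> int:
    """Number of potential windows: default_time_to_live / window size.
    Assumption: plain integer division."""
    return default_ttl // window_size


def exceeds_window_limit(default_ttl: timedelta,
                         window_size: timedelta,
                         max_window_count: int = 50) -> bool:
    """True if CREATE/ALTER should be refused under this sketch's rules."""
    # max_window_count == 0 disables the check; default TTL 0 is allowed.
    if max_window_count == 0 or default_ttl == timedelta(0):
        return False
    return twcs_window_count(default_ttl, window_size) > max_window_count


# The "1 year TTL, 30 minute window" case from the text:
# 365 days * 48 half-hours per day = 17520 windows, far above 50.
one_year_windows = twcs_window_count(timedelta(days=365), timedelta(minutes=30))
```

For comparison, a 1 day TTL with 1 hour windows gives 24 windows and stays under the default threshold.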
Scylla mistakenly allows a user to configure an invalid TWCS window_size <= 0, which effectively breaks the notion of compaction windows.
Interestingly enough, a <= 0 window size should be considered undefined behavior: either we would create a new window every 0 duration (?) or the table would behave as STCS. The reader is encouraged to figure out which one of these is true. :-)
Cassandra, on the other hand, will properly throw a ConfigurationException when receiving such invalid window sizes and we now match the behavior to the same as Cassandra's.
Refs: #2336
Now memtables live in compaction_group. Also introduced function
that selects group based on token, but today table always returns
the single group managed by it. Once multiple groups are supported,
then the function should interpret token content to select the
group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The group is now responsible for providing the compound set.
table still has one compound set, which will span all groups for
the cases we want to ignore the group isolation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit is restricted to moving main set into compaction_group.
Next, we'll move maintenance set into it and finally the memtable.
A method is introduced to figure out which group a sstable belongs
to, but it's still unimplemented as table is still limited to
a single group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction group is a new abstraction used to group SSTables
that are eligible to be compacted together. By this definition,
a table in a given shard has a single compaction group.
The problem with this approach is that data from different vnodes
is intermixed in the same sstable, making it hard to move data
in a given sstable around.
Therefore, we'll want to have multiple groups per table.
A group can be thought of as an isolated LSM tree where its memtable
and sstable files are isolated from other groups.
As for the implementation, the idea is to take a very incremental
approach.
In this commit, we're introducing a single compaction group to
table.
Next, we'll migrate sstable and maintenance set from table
into that single compaction group. And finally, the memtable.
Cache will be shared among the groups, for simplicity.
It works due to its ability to invalidate a subset of the
token range.
There will be 1:1 relationship between compaction_group and
table_state.
We can later rename table_state to compaction_group_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
await_pending_ops() is today marked noexcept, so doesn't have to
be implemented with finally() semantics.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
stop_ongoing_compactions() is made noexcept too as it's called from
remove() and we want to make the latter noexcept, to allow compaction
group to qualify its stop function as noexcept too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Preparatory change for main sstable set to be moved into compaction
group. After that, tests can no longer directly access the main
set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
So methods that look at data size and require it to be higher than 0
will work on fake SSTables created using set_values().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
column_family_test::add_sstable will soon be changed to run in a thread,
and it's not needed in this procedure, so let's remove its usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Otherwise it crashes some python versions.
The cast was there before a2dd64f68f
explicitly dropped one while moving the code between files.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11511
It was not and won't be used for anything.
Note that the feature was always disabled or masked so no node ever
announced it, thus it's safe to get rid of.
Closes#11505
This std::function causes allocations, both on construction
and in other operations. This costs ~2200 instructions
for a DC-local query. Fix that.
Closes#11494
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.
Fixes#7478
Fixes#9500
Fixes#11503
Closes#11506
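The intended YAML behavior can be sketched in Python as follows. This only illustrates the "long_name,short_name" convention described above; the helper names are hypothetical and this is not the actual fix.

```python
from typing import Optional, Tuple


def split_option_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a "long_name,short_name" option name into its parts.
    Names without a comma have no short name."""
    long_name, _, short_name = name.partition(",")
    return long_name, (short_name or None)


def yaml_key(name: str) -> str:
    """The key a YAML file is expected to use: the long name only."""
    return split_option_name(name)[0]
```

Under this sketch, an option declared as "workdir,W" is matched against the YAML key `workdir`, not against the verbatim string "workdir,W".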
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.
The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.
In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.
In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing
Closes#11164
* github.com:scylladb/scylladb:
raft: broadcast_tables: add broadcast_kv_store test
raft: broadcast_tables: add returning query result
raft: broadcast_tables: add execution of intermediate language
raft: broadcast_tables: add compilation of cql to intermediate language
raft: broadcast_tables: add definition of intermediate language
db: system_keyspace: add broadcast_kv_store table
db: config: add BROADCAST_TABLES feature flag
The implementation of a test api that helps testing task manager
api. It provides methods to simulate the operations that can happen
on modules and their tasks. Through the api the user can: register
and unregister the test module and the tasks belonging to the module,
and finish the tasks with success or custom error.
The test api that helps testing task manager api. It can be used
to simulate the operations that can happen on modules and their
tasks. Through the api the user can: register and unregister the test
module and the tasks belonging to the module, and finish the tasks
with success or custom error.
The implementation of a task manager api layer. It provides
methods to list the modules registered in task_manager, list
tasks belonging to the given module, abort, wait for or retrieve
a status of the given task.
The task manager api layer. It can be used to list the modules
registered in task_manager, list tasks belonging to the given
module, abort, wait for or retrieve a status of the given task.
Implementation of a task manager that allows tracking
and managing asynchronous tasks.
The tasks are represented by task_manager::task class providing
members common to all types of tasks. The methods that differ
among tasks of different modules can be overridden in a class
inheriting from task_manager::task::impl class. Each task stores
its status containing parameters like id, sequence number, begin
and end time, state etc. After the task finishes, it is kept
in memory for configurable time or until it is unregistered.
Tasks need to be created with make_task method.
Each module is represented by task_manager::module type and should
have an access to task manager through task_manager::module methods.
That makes it easy to separate and collectively manage data
belonging to each module.
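As a rough illustration of the lifecycle described above, here is a toy Python model. It is not the C++ implementation: the class names, the state strings and the injected clock are all assumptions made for the sketch. It only shows a module creating tasks with an id and sequence number, and finished tasks staying visible until a configurable ttl elapses.

```python
import itertools
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TaskStatus:
    """Toy status record: id, sequence number, state, begin and end time."""
    id: uuid.UUID
    sequence_number: int
    start_time: float
    state: str = "running"           # hypothetical state names
    end_time: Optional[float] = None


class Module:
    """Toy module that owns its tasks and forgets finished ones after a ttl."""

    def __init__(self, name: str, task_ttl: float, clock: Callable[[], float]):
        self.name = name
        self._task_ttl = task_ttl
        self._clock = clock
        self._tasks: Dict[uuid.UUID, TaskStatus] = {}
        self._sequence = itertools.count(1)

    def make_task(self) -> TaskStatus:
        status = TaskStatus(id=uuid.uuid4(),
                            sequence_number=next(self._sequence),
                            start_time=self._clock())
        self._tasks[status.id] = status
        return status

    def finish(self, task_id: uuid.UUID, error: Optional[str] = None) -> None:
        status = self._tasks[task_id]
        status.state = "failed" if error else "done"
        status.end_time = self._clock()

    def list_tasks(self) -> List[TaskStatus]:
        # Drop finished tasks whose ttl has elapsed, then report the rest.
        now = self._clock()
        expired = [tid for tid, s in self._tasks.items()
                   if s.end_time is not None and now - s.end_time >= self._task_ttl]
        for tid in expired:
            del self._tasks[tid]
        return list(self._tasks.values())
```

Passing the clock in as a callable keeps the sketch testable without sleeping; a real deployment would use a timer instead of expiring lazily on listing.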
There are two methods to close an RPC socket in m.s. -- one that's
called on error path of messaging_service::send_... and the other one
that's called upon gossiper down/leave/cql-off notifications. The former
one notifies listeners about connection drop, the latter one doesn't.
The only listener is the storage-proxy which, in turn, kicks database to
release per-table cache hitrate data. As a result, when a node goes down
(or when an operator shuts down its transport) the hit-rate stats
regarding this node are leaked.
This patch moves notification so that any socket drop calls notification
and thus releases the hitrates.
fixes: #11497
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- To isolate the different pytest suites, remove the top level conftest
and move needed contents to existing `test/pylib/cql_repl/conftest.py`
and `test/topology/conftest.py`.
- Add logging to CQL and Python suites.
- Log driver version for CQL and topology tests.
Closes#11482
* github.com:scylladb/scylladb:
test.py: enable log capture for Python suite
test.py: log driver name/version for cql/topology
test.py: remove top level conftest.py
The test queries Scylla with the following statements:
* SELECT value FROM system.broadcast_kv_store WHERE key = CONST;
* UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST;
* UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST IF value = CONST;
where CONST is a string randomly chosen from a small set of random strings,
and half of the conditional updates have a condition that compares against
the last written value.
The intermediate language added a new layer of abstraction between the cql
statement and querying mutations, thus this commit adds a new layer of
abstraction between mutations and returning the query result.
Result can't be directly returned from `group0_state_machine::apply`, so
we decided to hold query results in map inside `raft_group0_client`. It can
be safely read after `add_entry_unguarded`, because this method waits
for applying raft command. After translating result to `result_message`
or in case of exception, map entry is erased.
Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`.
Added `add_entry_unguarded` method in `raft_group0_client` for dispatching raft
commands without `group0_guard`.
Queries on group0_kv_store are executed in `group0_state_machine::apply`,
but for now don't return results. They don't use previous state id, so they will
block concurrent schema changes, but these changes won't block queries.
In this version snapshots are ignored.
We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement`
and `strongly_consistent_select_statement`. Statements operating on
system.broadcast_kv_store will be compiled to these new subclasses if
BROADCAST_TABLES flag is enabled.
If the query is executed on a shard other than 0, it's bounced to shard 0.
An instance may be invalidated before we try to recycle it.
We perform this by setting its value to a nullopt.
This patch adds a check for it when calculating its size.
This behavior didn't cause issues before because the catch
clause below caught errors caused by calling value() on
a nullopt, even though it was intended for errors from
get_instance_size.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#11500
With EverywhereStrategy, we know that all tokens will be on the same node and the data is typically sparse like LocalStrategy.
Result of testing the feature:
Cluster: 2 DC, 2 nodes in each DC, 256 tokens per node, 14 shards per node
Before: 154 scanning operations
After: 14 scanning operations (~10x improvement)
On a bigger cluster, it will probably be even more efficient.
Closes#11403
In broadcast tables, a raft command contains a whole program to be executed.
Sending the entire CQL statement and parsing it on each node is inefficient,
thus we decided to compile it to an intermediate language which can be
easily serialized.
This patch adds a definition of such a language. For now, only the following
types of statements can be compiled:
* select value from system.broadcast_kv_store where key = CONST;
* update system.broadcast_kv_store set value = CONST where key = CONST;
* update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST;
where CONST is a string literal.
sstable::get_filename() constructs the filename from components, which
takes some work. It happens to be called on every
index_reader::index_reader() call even though it's only used for TRACE
logs. That's 1700 instructions (~1% of a full query) wasted on every
SSTable read. Fix that.
Closes#11485
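The general pattern behind this kind of fix can be sketched in Python: build the expensive string only if the log level will actually emit it. This is an analogy under assumptions, not the C++ change itself; `FakeSSTable` and `open_index_reader` are hypothetical stand-ins, and Python's DEBUG level plays the role of TRACE.

```python
import logging


class FakeSSTable:
    """Stand-in for an sstable; counts how often the filename is built."""

    def __init__(self, components):
        self._components = components
        self.filename_calls = 0

    def get_filename(self) -> str:
        # Pretend this is the costly construction from components.
        self.filename_calls += 1
        return "-".join(self._components)


def open_index_reader(sst: FakeSSTable, log: logging.Logger) -> None:
    # Only pay for get_filename() when the message will really be logged.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("index_reader: opening %s", sst.get_filename())
```

With the logger at WARNING, `open_index_reader` never calls `get_filename()`; with the logger at DEBUG, it is called once per reader.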
Change a8ad385ecd introduced
```
thread_local std::unordered_map<utils::UUID, seastar::lw_shared_ptr<repair_history_map>> repair_history_maps;
```
We're trying to avoid global scoped variables as much as we can so this should probably be embedded in some sharded service.
This series moves the thread-local `repair_history_maps` instances to `compaction_manager`
and passes a reference to the shard compaction_manager to functions that need it for compact_for_query
and compact_for_compaction.
Since some paths don't need it and don't have access to the compaction_manager,
the series introduced `utils::optional_reference<T>` that allows passing nullopt.
In this case, `get_gc_before_for_key` behaves in `tombstone_gc_mode::repair` as if the table wasn't repaired and tombstones are not garbage-collected.
Fixes#11208
Closes#11366
* github.com:scylladb/scylladb:
tombstone_gc: deglobalize repair_history_maps
mutation_compactor: pass tombstone_gc_state to compact_mutation_state
mutation_partition: compact_for_compaction_v2: get tombstone_gc_state
mutation_partition: compact_for_compaction: get tombstone_gc_state
mutation_readers: pass tombstone_gc_state to compacting_reader
sstables: get_gc_before_*: get tombstone_gc_state from caller
compaction: table_state: add virtual get_tombstone_gc_state method
db: view: get_tombstone_gc_state from compaction_manager
db: view: pass base table to view_update_builder
repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager
replica: database: get_tombstone_gc_state from compaction_manager
compaction_manager: add tombstone_gc_state
replica: table: add get_compaction_manager function
tombstone_gc: introduce tombstone_gc_state
repair_service: simplify update_repair_time error handling
tombstone_gc: update_repair_time: get table_id rather than schema_ptr
tombstone_gc: delete unused forward declaration
database: do not drop_repair_history_map_for_table in detach_column_family
The scope of this PR:
- Removing support for Ubuntu 16.04 and Debian 9.
- Adding support for Debian 11.
Closes#11461
* github.com:scylladb/scylladb:
doc: remove support for Debian 9 from versions 2022.1 and 2022.2
doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2
doc: add support for Debian 11 to versions 2022.1 and 2022.2
Enable pytest log capture for Python suite. This will help debugging
issues in remote machines.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Remove top level conftest so different suites have their own (as it was
before).
Move minimal functionality into existing test/pylib/cql_repl/conftest.py
so cql tests can run on their own.
Move param setting into test/topology/conftest.py.
Use uuid module for unique keyspace name for cql tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before/after test checks are done per test case, so there's no longer a need
to check after pytest finishes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11489
As check_restricted_table_properties() is invoked both within CREATE TABLE and ALTER TABLE CQL statements,
we currently have no way to determine whether the operation was a CREATE or an ALTER. In many situations
it is important to be able to distinguish between the two, for example to tell whether a table already has
a particular property set or whether we are defining it within the statement.
This patch simply adds a std::optional<schema_ptr> to check_restricted_table_properties() and updates its caller.
Whenever a CREATE TABLE statement is issued, the method is called with std::nullopt, whereas if an ALTER TABLE is
issued instead, we call it with a schema_ptr.
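The idea can be modelled in Python with an `Optional` argument standing in for std::optional<schema_ptr>. The `Schema` class and `effective_properties` helper below are hypothetical and much simpler than the real check; they only show how an absent schema (CREATE) differs from a present one (ALTER) when a statement leaves a property unspecified.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Schema:
    """Hypothetical, minimal stand-in for an existing table's schema."""
    default_time_to_live: int
    compaction_strategy: str


def effective_properties(statement_props: Dict[str, object],
                         existing: Optional[Schema]) -> Dict[str, object]:
    """Properties the table would have after the statement.

    existing is None     -> CREATE TABLE: only the statement's properties count.
    existing is a Schema -> ALTER TABLE: unspecified properties keep their
                            current values from the existing schema.
    """
    if existing is None:
        base: Dict[str, object] = {}
    else:
        base = {
            "default_time_to_live": existing.default_time_to_live,
            "compaction_strategy": existing.compaction_strategy,
        }
    return {**base, **statement_props}
```

For example, altering only the compaction strategy of a table that already has a TTL keeps that TTL visible to the check, which a CREATE with the same properties would not have.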
When importing from `pylib`, don't modify `sys.path` but use the fact
that both `test/` and `test/pylib/` directories contain an `__init__.py`
file, so `test.pylib` is a valid module if we start with `test/` as the
Python package root.
Both `pytest` and `mypy` (and I guess other tools) understand this
setup.
Also add an `__init__.py` to `test/topology/` so other modules under the
`test/` directory will be able to import stuff from `test/topology/`
(i.e. from `test.topology.X import Y`).
Closes#11467
I created new issues for each missing field in DescribeTable's
response for GSIs and LSIs, so in this patch we edit the xfail
messages in the test to refer to these issues.
Additionally, we only had a test for these fields for GSIs, so this
patch also adds a similar test for LSIs. It turns out there is a
difference between the two tests - the two fields IndexStatus and
ProvisionedThroughput are returned for GSIs, but not for LSIs.
Refs #7750
Refs #11466
Refs #11470
Refs #11471
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11473
Move the thread-local instances of the
per-table repair history maps into compaction_manager.
Fixes#11208
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass the tombstone_gc_state from the compaction_strategy
to sstables get_gc_before_* functions using the table state
to get to the tombstone_gc_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.
It is going to be used in the next patches to pass the gc state
from the compaction_strategy down to sstables and compaction.
table_state_for_test was modified to just keep a null
tombstone_gc_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a tombstone_gc_state member and methods to get it.
Currently the tombstone_gc_state is default constructed,
but a following patch will move the thread-local
repair history maps into the compaction_manager as a member
and then the _tombstone_gc_state member will be initialized
from that member.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and use it to access the repair history maps.
In this introductory patch, we use default-constructed
tombstone_gc_state to access the thread-local maps
temporarily and those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's no need for per-shard try/catch here.
Just catch exceptions from the overall sharded operation
to update_repair_time.
Also, update warning to indicate that only updating the repair history
time failed, not "Loading repair history".
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
drop_repair_history_map_for_table is called on each shard
when database::truncate is done, and the table is stopped.
Dropping it before the table is stopped is too early.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fix https://github.com/scylladb/scylla-doc-issues/issues/816
Fix https://github.com/scylladb/scylla-docs/issues/1613
This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487).
To make sure the same CQL version is used across the docs, we should use the `|cql-version|` variable rather than hardcode the version number on several pages.
The variable is specified in the conf.py file:
```
rst_prolog = """
.. |cql-version| replace:: 3.3.1
"""
```
Closes#11320
* github.com:scylladb/scylladb:
doc: add the Cassandra version on which the tools are based
doc: fix the version number
doc: update the Enterprise version where the ME format was introduced
doc: add the ME format to the Cassandra Compatibility page
doc: replace Scylla with ScyllaDB
doc: rewrite the Interfaces table to the new format to include more information about CQL support
doc: remove the CQL version from pages other than Cassandra compatibility
doc: fix the CQL version in the Interfaces table
Most of the test's cases use rack-inferring snitch driver and get
DC/RACK from it via the test_dc_rack() helper. The helper was introduced
in one of the previous sets to populate token metadata with some DC/RACK
as normal tokens manipulations required respective endpoint in topology.
This patch removes the usage of global snitch and replaces it with the
pre-populated topology. The pre-population is done in rack-inferring
snitch like manner, since token_metadata still uses global snitch and
the locations from snitch and this temporary topology should match.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a test case that makes its own snitch driver that generates
pre-calculated DC/RACK data for test endpoints. This patch replaces this
custom snitch driver with a standalone topology object.
Note: to get DC/RACK info from this topo the get_location() is used
since the get_rack()/get_datacenter() are still wrappers around global
snitch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently abort-mode scrub exits with a message which basically says
"some problem was found", with no details on what problem it found. Add
a detailed error report on the found problem before aborting the scrub.
Closes#11418
`tokenof` calculates and prints the token of a partition-key.
`shardof` calculates the token and finds the owner shard of a partition-key. The number of shards has to be provided by the `--shards` parameter. The ignore-msb-bits value can be tweaked with the `--ignore-msb-bits` parameter, which defaults to 12.
Examples:
```
$ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888
$ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1
```
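For readers who want to check the `shardof` output by hand, the token-to-shard step can be sketched in Python. The mapping below (bias the signed token to unsigned, drop the most significant bits, scale by the shard count) is an assumption about the sharding algorithm and is not taken from this patch; the function name is hypothetical. It does reproduce the shard printed in the example above.

```python
MASK64 = (1 << 64) - 1


def shard_of(token: int, shards: int, ignore_msb_bits: int = 12) -> int:
    """Map a signed 64-bit token to a shard (assumed algorithm, see lead-in)."""
    # Shift the signed token range [-2^63, 2^63) onto [0, 2^64).
    biased = (token + (1 << 63)) & MASK64
    # Discard the most significant bits, keeping 64-bit wraparound.
    shifted = (biased << ignore_msb_bits) & MASK64
    # Scale the remaining value onto [0, shards).
    return (shifted * shards) >> 64


# Token from the example above, with --shards=7 and the default of 12 bits.
example_shard = shard_of(-5043005771368701888, 7)  # → 1
```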
Closes#11436
* github.com:scylladb/scylladb:
tools/scylla-types: add shardof action
tools/scylla-types: pass variable_map to action handlers
tools/scylla-types: add tokenof action
tools/scylla-types: extract printing code into functions
Continuation to debfcc0e (snitch: Move sort_by_proximity() to topology).
The passed addresses are not modified by the helper. They are not yet
const because the method was copy-n-pasted from snitch where they weren't
const either.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220906074708.29574-1-xemul@scylladb.com>
The mutable get_datacenter_endpoints() and get_datacenter_racks() are
dangerous since they expose internal members without enforcing class
invariants. Fortunately they are unused, so delete them.
Closes#11454
"
There are two helpers on snitch that manipulate lists of nodes taking their
dc/rack into account. This set moves these methods from snitch to topology
and storage proxy.
"
* 'br-snitch-move-proximity-sorters' of https://github.com/xemul/scylla:
snitch: Move sort_by_proximity() to topology
topology: Add "enable proximity sorting" bit
code: Call sort_endpoints_by_proximity() via topology
snitch, code: Remove get_sorted_list_by_proximity()
snitch: Move is_worth_merging_for_range_query to proxy
There's one corner case in nodes sorting by snitch. The simple snitch
code overloads the call and doesn't sort anything. The same behavior
should be preserved by (future) topology implementation, but it doesn't
know the snitch name. To address that the patch adds a boolean switch on
topology that's turned off by main code when it sees the snitch is
"simple" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is about to be moved from snitch to topology, this patch
prepares the rest of the code to use the latter to call it. The
topology's method just calls snitch, but it's going to change in the
next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two sorting methods in snitch -- one sorts the list of
addresses in place, the other one creates a sorted copy of the passed
const list (in fact -- the passed reference is not const, but it's not
modified by the method). However, both callers of the latter anyway
create their own temporary list of addresses, so they don't really benefit
from snitch generating another copy.
So this patch leaves just one sorting method -- the in-place one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Proxy is the only place that calls this method. Also the method name
suggests it's not something "generic", but rather an internal logic of
proxy's query processing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In #11399, I mistakenly committed the bug fix for the first patch (40134ef) into the second one (8835a34).
So the script is broken when only 40134ef is applied, which is a problem when we backport it to older versions.
Let's revert both commits and merge them into a single commit.
Closes#11448
* github.com:scylladb/scylladb:
scylla_raid_setup: prevent mount failed for /var/lib/scylla
Revert "scylla_raid_setup: check uuid and device path are valid"
Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla"
The first implementation of strongly consistent everywhere tables operates on a simple table
representing a string-to-string map.
Add hard-coded schema for broadcast_kv_store table (key text primary key,
value text). This table is under system keyspace and is created if and only if
BROADCAST_TABLES feature is enabled.
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.
Also added code to make sure the uuid and the uuid-based device path are valid.
Fixes#11359
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
From now on, when an alternator user correctly passed an authentication
step, their assigned client_state will have that information,
which also means proper access to service level configuration.
Previously the username was only used in tracing.
The constructor can be used as backdoor from frontends other than
CQL to create a session with an authenticated user, with access
to its attached service level information.
Fix https://github.com/scylladb/scylladb/issues/11430
@tzach I've added support for Ubuntu 22.04 to the row for version 2022.2. Does that version support Debian 11? That information is also missing (it was only added to OSS 5.0 and 5.1).
Closes#11437
* github.com:scylladb/scylladb:
doc: add support for Ubuntu 22.04 to the Enterprise table
doc: rename the columns in the Enterpise section to be in sync with the OSS section
Decorates a partition key and calculates which shard it belongs to,
given the shard count (--shards) and the ignore msb bits
(--ignore-msb-bits) parameters. The latter is optional and is defaulted to
12.
Example:
$ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1
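The token-to-shard step can be sketched as follows. This is a sketch under
an assumption: that the mapping biases the signed 64-bit token into the
unsigned range, shifts out the ignored most-significant bits and then
scales the result by the shard count. The function name `shard_of` and
its signature are chosen for this example only; the token value is the
one from the example output above.

```python
MASK64 = (1 << 64) - 1


def shard_of(token: int, shards: int, ignore_msb_bits: int = 12) -> int:
    """Map a signed 64-bit token to a shard index (assumed algorithm)."""
    # Bias the signed token into the unsigned 64-bit range.
    biased = (token + (1 << 63)) & MASK64
    # Drop the most-significant bits that sharding should ignore.
    shifted = (biased << ignore_msb_bits) & MASK64
    # Scale the 64-bit value into [0, shards).
    return (shifted * shards) >> 64


print(shard_of(-5043005771368701888, shards=7))  # 1
```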
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.
Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.
Fixes: #11421
Closes#11422
Instead of the entire object. Repair meta is a large object, its
printout floods the output of the command. Print only its address, the
user can print the objects it is interested in.
Closes#11428
Ubuntu 20.04 has less than 3 years of OS support remaining.
We should switch to Ubuntu 22.04 to reduce the need for OS upgrades in newly installed clusters.
Closes#11440
"
Messaging needs to know DC/RACK for nodes to decide whether it needs to
do encryption or compression depending on the options. Like all the other
services used to, it still uses snitch to get it, but simply switching to
topology needs extra care.
The thing is that messaging can use internal IP instead of endpoints.
Currently it's snitch who tries har^w somehow to resolve this, in
particular -- if the DC/RACK is not found for the given argument it
assumes that it might be internal IP and calls back messaging to convert
it to the endpoint. However, messaging does know when it uses which
address and can do this conversion itself.
So this set eliminates a few more global snitch usages and drops the
knot tying snitch, gossiper and messaging with each other.
"
* 'br-messaging-use-topology-1.2' of https://github.com/xemul/scylla:
messaging: Get DC/RACK from topology
messaging, topology: Keep shared_token_metadata* on messaging
messaging: Add is_same_{dc|rack} helpers
snitch, messaging: Dont relookup dc/rack on internal IP
Recent change in topology (commit 4cbe6ee9 titled
"topology: Require entry in the map for update_normal_tokens()")
made token_metadata::update_normal_tokens() require the entry's presence
in the embedded topology object. Accordingly, the commit in question
equipped most callers of update_normal_tokens() with a preceding
topology update call to satisfy the requirement.
However, tokens are put into token_metadata not only for normal state,
but also for bootstrapping, and one place that added bootstrapping
tokens erroneously got a topology update. This is wrong -- a node must
not be present in the topology until switching into normal state. As
a result several tests with bootstrapping nodes started to fail.
The fix removes the topology update for bootstrapping nodes, but this
change reveals a few other places that piggy-backed on this mistaken
update, so now _they_ need to update topology themselves.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/
update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair
update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc
repair_based_node_operations_test.py::test_lcs_reshape_efficiency
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220902082753.17827-1-xemul@scylladb.com>
* seastar f2d70c4a17...2b2f6c080e (4):
> perftune.py: special case a former 'MQ' mode in the new auto-detection code
> iostream: Generalize flush and batched flush
> Merge "Equip sharded<>::invoke_on_all with unwrap_sharded_args" from Pavel E
> Merge "perftune.py: cosmetic fixes" from VladZ
Closes#11434
Fix https://github.com/scylladb/scylladb/issues/11393
- Rename the tool names across the docs.
- Update the examples to replace `scylla-sstable` and `scylla-types` with `scylla sstable` and `scylla types`, respectively.
Closes#11432
* github.com:scylladb/scylladb:
doc: update the tool names in the toctree and reference pages
doc: rename the scylla-types tool as Scylla Types
doc: rename the scylla-sstable tool as Scylla SStable
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.
However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When getting dc/rack snitch may perform two lookups -- first time it
does it using the provided IP, if nothing is found snitch assumes that
the IP is internal one, gets the corresponding public one and searches
again.
The thing is that the only code that may come to snitch with internal
IP is the messaging service. It does so in two places: when it tries
to connect to the given endpoint and when it accepts a connection.
In the former case messaging performs public->internal IP conversion
itself and goes to snitch with the internal IP value. This place can get
simpler by just feeding the public IP to snitch, and converting it to the
internal only to initiate the connection.
In the latter case the accepted IP can be either, but messaging service
has the public<->private map onboard and can do the conversion itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just like 4a8ed4cc6f, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.
Also added code to make sure the uuid and the uuid-based device path are valid.
Fixes#11359
Closes#11399
* github.com:scylladb/scylladb:
scylla_raid_setup: prevent mount failed for /var/lib/scylla
scylla_raid_setup: check uuid and device path are valid
this setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.
in this change, `storage_service_config` class is removed, and all
places referencing it are updated accordingly.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes#11415
Google Groups recently started rewriting the From: header, garbling
our git log. This script rewrites it back, using the Reply-To header
as a still working source.
Closes#11416
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.
This patch changes the description to (I hope) better address these
issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11404
* github.com:scylladb/scylladb:
doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
doc: cql-extensions.md: improve description of synchronous views
This is a very important aspect of the tool that was completely missing from the document before. Also add a comparison with SStableDump.
Fixes: https://github.com/scylladb/scylladb/issues/11363
Closes#11390
* github.com:scylladb/scylladb:
docs: scylla-sstable.rst: add comparison with SStableDump
docs: scylla-sstable.rst: add section about providing the schema
It was recently decided that the database should be referred to as
"ScyllaDB", not "Scylla". This patch changes existing references
in docs/cql/cql-extensions.md.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.
This patch changes the description to (I hope) better address these
issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
The topology object maintains all sort of node/DC/RACK mappings on
board. When new entries are added to it the DC and RACK are taken
from the global snitch instance which, in turn, checks gossiper,
system keyspace and its local caches.
This set makes the topology population API require DC and RACK via the
call argument. In most of the cases the populating code is the
storage service that knows exactly where to get those from.
After this set it will be possible to remove the dependency knot
consisting of snitch, gossiper, system keyspace and messaging.
"
* 'br-topology-dc-rack-info' of https://github.com/xemul/scylla:
toplogy: Use the provided dc/rack info
test: Provide testing dc/rack infos
storage_service: Provide dc/rack for snitch reconfiguration
storage_service: Provide dc/rack from system ks on start
storage_service: Provide dc/rack from gossiper for replacement
storage_service: Provide dc/rack from gossiper for remotes
storage_service,dht,repair: Provide local dc/rack from system ks
system_keyspace: Cache local dc-rack on .start()
topology: Some renames after previous patch
topology: Require entry in the map for update_normal_tokens()
topology: Make update_endpoint() accept dc-rack info
replication_strategy: Accept dc-rack as get_pending_address_ranges argument
dht: Carry dc-rack over boot_strapper and range_streamer
storage_service: Make replacement info a real struct
* seastar f9f5228b74...f2d70c4a17 (51):
> cmake: attach property to Valgrind not to hwloc
> Create the seastar_memory logger in all builds
> drop unused parameters
> Merge "Unify pollable_fd shutdown and abort_{reader|writer}" from Pavel E
> > pollable_fd: Replace two booleans with a mask
> > pollable_fd: Remove abort_reader/_writer
> Merge "Improve Rx channels assignment" from Vlad
> > perftune.py: fix comments of IRQ ordering functors
> > perftune.py: add VIRTIO fast path IRQs ordering functor
> > perftune.py: reduce number of Rx channels to the number of IRQ CPUs
> > perftune.py: introduce a --num-rx-queues parameter
> program_options: enable optional selection_value
> .gitignore: ignore the directories generated by VS Code and CLion.
> httpd: compare the Connection header value in a case-insensitive manner.
> httpd: move the logic of keepalive to a separate method.
> register one default priority class for queue
> Reset _total_stats before each run
> log: add colored logging support
> Merge "perftune.py: add NUMA aware auto-detection for big machines" from Vlad
> > perftune.py: mention 'irq_cpu_mask' in the description of the script operation
> > perftune.py: NetPerfTuner: fix bits counting in self.irqs_cpu_mask wider than 32 bits
> > perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): cosmetics: fix a comment and a variable name
> > perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): take into account omitted zero components of the mask
> > perftune.py: PerfTuneBase.compute_cpu_mask_for_mode(): cosmetics: fix a variable name
> > perftune.py: stop printing 'mode' in --dump-options-file
> > perftune.py: introduce a generic auto_detect_irq_mask(cpu_mask) function
> > perftune.py: DiskPerfTuner: use self.irqs_cpu_mask for tuning non-NVME disks
> > perftune.py: stop auto-detecting and using 'mode' internally
> > perftune.py: introduce --get-irq-cpu-mask command line parameter
> > perftune.py: introduce --irq-core-auto-detection-ratio parameter
> build: add a space after function name
> Update HACKING.md
> log: do not inherit formatter<seastar::log_level> from formatter<string_view>
> Merge "Mark connected_socket::shutdown_...'s internals noexcept" from Pavel E
> > native-stack: Mark tcp::in_state() (and its wrappers) const noexcept
> > native-stack: Mark tcb::close and tcb::abort_reader noexcept
> > native-stack: Mark tcp::connection::close_{read|write}() noexcept
> > native-stack: Mark tcb::clear_delayed_ack() and tcb::stop_retransmit_timer() noexcept
> > tls: Mark session::close() noexcept
> > file_desc: Add fdinfo() helper
> > posix-stack: Mark posix_connected_socket_impl::shutdown_{input|output}() noexcept
> > tests: Mark loopback_buffer::shutdown() noexcept
> Merge "Enhance RPC connection error injector" from Pavel E
> > loopback_socket: Shuffle error injection
> > loopback_socket: Extend error injection
> > loopback_socket: Add one-shot errors
> > loopback_socket: Add connection error injection
> > rpc_test: Extend error injector with kind
> > rpc_test: Inject errors on all paths
> > rpc_test: Use injected connect error
> > rpc_test: De-duplicate test socket creation
> Merge 'tls: vec_push: handle async errors rather than throwing on_internal_error' from Benny Halevy
> > tls: do_handshake: handle_output_error of gnutls_handshake
> > tls: session: vec_push: return output_pending error
> > tls: session: vec_push: reindent
> log: disambiguate formatter<log_level> from operator<<
> tls_test: Fix spurious fail in test_x509_client_with_builder_system_trust_multiple (et al)
Fixes scylladb/scylladb#11252
Closes#11401
This PR is related to https://github.com/scylladb/scylla-docs/issues/4124 and https://github.com/scylladb/scylla-docs/issues/4123.
**New Enterprise Upgrade Guide from 2021.1 to 2022.2**
I've added the upgrade guide for ScyllaDB Enterprise image. It consists of 3 files:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst
upgrade/_common/upgrade-image.rst
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst
**Modified Enterprise Upgrade Guides 2021.1 to 2022.2**
I've modified the existing guides for Ubuntu and Debian to use the same files as above, but exclude the image-related information:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst + /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst = /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian.rst
To make things simpler and remove duplication, I've replaced the guides for Ubuntu 18 and 20 with a generic Ubuntu guide.
**Modified Enterprise Upgrade Guides from 4.6 to 5.0**
These guides included a bug: they included the image-related information (about updating OS packages), because a file that includes that information was included by mistake. What's worse, it was duplicated. After the includes were removed, image-related information is no longer included in the Ubuntu and Debian guides (this fixes https://github.com/scylladb/scylla-docs/issues/4123).
I've modified the index file to be in sync with the updates.
Closes#11285
* github.com:scylladb/scylladb:
doc: reorganize the content to list the recommended way of upgrading the image first
doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file
doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information
doc: update the guides for Ubuntu and Debian to remove image information and the OS version number
doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1
Having an error while pinging a peer is not a critical error. The code
retries and moves on. Let's log the message with less severity since
sometimes those errors may happen (for instance during node replace
operation some nodes refuse to answer to pings) and dtest complains that
there are unexpected errors in the logs.
Message-Id: <Ywy5e+8XVwt492Nc@scylladb.com>
on_compaction_completion() is not very descriptive. let's rename
it, following the example of
update_sstable_lists_on_off_strategy_completion().
Also let's coroutinize it, so to remove the restriction of running
it inside a thread only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11407
Test teardown involves dropping the test keyspace. If there are stopped servers occasionally we would see timeouts.
Start stopped servers after a test is finished (and passed).
Revert previous commit making teardown async again.
Closes#11412
* github.com:scylladb/scylladb:
test.py: restart stopped servers before teardown...
Revert "test.py: random tables make DDL queries async"
Said command is broken since 4.6, as the type of `reader_concurrency_semaphore::_permit_list` was changed without an accompanying update to this command. This series updates said command and adds it to the list of tested commands so we notice if it breaks in the future.
Closes#11389
* github.com:scylladb/scylladb:
test/scylla-gdb: test scylla read-stats
scylla-gdb.py: read_stats: update w.r.t. post 4.5 code
scylla-gdb.py: improve string_view_printer implementation
This PR changes the CentOS 8 support to Rocky, and adds 5.1, 2022.1 and 2022.2 rows to the list of Scylla releases
Closes#11383
* github.com:scylladb/scylladb:
OS support page: use CentOS not Centos
OS support page: add 5.1, 2022.1 and 2022.2
OS support page: Update CentOS 8 to Rocky 8
for topology tests
Test teardown involves dropping the test keyspace. If there are stopped
servers occasionally we would see timeouts.
Start stopped servers after a test is finished.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:
scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.
The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.
If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.
The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.
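The effect of the proposed tombstone can be illustrated with a toy merge
of schema mutations. This is a simplified model, not the real schema
tables code: the dictionary layout, the `merge()` and `create_table()`
helpers and the choice of `timestamp - 1` for the tombstone are all
assumptions made for the example.

```python
def merge(*mutations):
    """Merge mutations: newest cell wins, a tombstone shadows older cells."""
    tombstone = max(m.get("tombstone", -1) for m in mutations)
    cells = {}
    for m in mutations:
        for key, (ts, value) in m["cells"].items():
            if ts <= tombstone:
                continue  # shadowed by the tombstone
            if key not in cells or cells[key][0] < ts:
                cells[key] = (ts, value)
    return {key: value for key, (ts, value) in cells.items()}


def create_table(ts, pk_name, shadow_older):
    """Build a toy CREATE TABLE mutation with one partition key column."""
    mutation = {
        "cells": {("columns", pk_name): (ts, {"kind": "partition_key", "position": 0})}
    }
    if shadow_older:
        # Assumed: the tombstone covers everything written before this statement.
        mutation["tombstone"] = ts - 1
    return mutation


# Two concurrent CREATE TABLE statements naming the partition key differently.
conflicting = merge(create_table(10, "pk1", False), create_table(20, "pk2", False))
shadowed = merge(create_table(10, "pk1", True), create_table(20, "pk2", True))

print(sorted(name for _, name in conflicting))  # ['pk1', 'pk2']
print(sorted(name for _, name in shadowed))     # ['pk2']
```

Without the tombstone both columns claim position 0 of the partition key,
which is the state that trips the assertion; with it only the later
definition survives the merge.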
Fixes#11396
Closes#11398
* github.com:scylladb/scylladb:
db: schema_tables: Make table creation shadow earlier concurrent changes
db: schema_tables: Fix formatting
db: schema_mutations: Make operator<<() print all mutations
schema_mutations: Make it a monoid by defining appropriate += operator
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:
scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.
The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.
If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.
The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.
Fixes#11396
Patch 765d2f5e46 did not
evaluate the #if SCYLLA_BUILD_MODE directives properly
and it always matched SCYLLA_BUILD_MODE == release.
This change fixes that by defining numerical codes
for each build mode and using macro expansion to match
the define SCYLLA_BUILD_MODE against these codes.
Also, ./configure.py was changed to pass SCYLLA_BUILD_MODE
to all .cc source files, and makes sure it is defined
in build_mode.hh.
Support was added for coverage build mode,
and an #error was added if SCYLLA_BUILD_MODE
was not recognized by the #if ladder directives.
Additional checks verifying the expected SEASTAR_DEBUG
against SCYLLA_BUILD_MODE were added as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11387
There are async timeouts for ALTER queries. Seems related to other issues
with the driver and async.
Make these queries synchronous for now.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11394
This commit introduces the following changes to Alternator compatibility doc:
* As of https://github.com/scylladb/scylladb/pull/11298 Alternator will return ProvisionedThroughput in DescribeTable API calls. We add the fact that tables will default to a BillingMode of PAY_PER_REQUEST (this wasn't made explicit anywhere in the docs), and that the values for RCUs/WCUs are hardcoded to 0.
* Mention the fact that the ScyllaDB (and thus Alternator) hashing function is different from the AWS proprietary implementation for DynamoDB. This is an implementation detail rather than a bug, but it may cause user confusion when comparing the ResultSet returned from Table Scans between DynamoDB and Alternator.
Refs: https://github.com/scylladb/scylladb/issues/11222
Fixes: https://github.com/scylladb/scylladb/issues/11315
Closes#11360
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.
Fixes#11359
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
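A minimal sketch of such an up-front check is shown below. The exception
type, the function name and the message wording are hypothetical; only the
idea of rejecting an oversized command early with a descriptive error
comes from the description above.

```python
class CommandTooLarge(Exception):
    """Raised when a command exceeds the configured size limit."""


def check_command_size(command: bytes, max_command_size: int) -> None:
    # Reject the command before it reaches the log, so the client gets a
    # specific error instead of a timeout.
    if len(command) > max_command_size:
        raise CommandTooLarge(
            f"command size {len(command)} exceeds the limit of "
            f"{max_command_size} bytes"
        )
```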
Closes#11318
* github.com:scylladb/scylladb:
raft, use max_command_size to satisfy commitlog limit
raft, limit for command size
Previous patches made all the callers of topology.update_endpoint()
(via token_metadata.update_topology()) provide correct dc/rack info
for the endpoint. It's now possible to stop using global snitch by
topology and just rely on the dc/rack argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a test that's sensitive to correct dc/rack info for testing
entries. To populate them it uses global rack-inferring snitch instance
or a special "testing" snitch. To make it continue working add a helper
that would populate the topology properly (spoiler: next branch will
replace it with explicitly populated topology object).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When snitch reconfigures (gossiper-property-file one) it kicks storage
service so that it updates itself. This place also needs to update the
dc/rack info about itself, the correct (new) values are taken from the
snitch itself.
There's a bug here -- the system.local table is not updated with new data
until restart.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a node starts it loads the information about peers from
system.peers table and populates token metadata and topology with this
information. The dc/rack are taken from the sys-ks cache here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a node is started to replace another node it updates token metadata
and topology with the target information early. The tokens are now taken
from gossiper shadow round, this patch does the same for dc/rack info.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a node is notified about other nodes state change it may want to
update the topology information about it. In all those places the
dc/rack info about the peer is provided by the gossiper.
Basically, these updates mirror the relevant updates of tokens on the
token metadata object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a node starts it adds itself to the topology. Mostly it's done in
the storage_service::join_cluster() and whoever it calls. In all those
places the dc/rack for the added node is taken from the system keyspace
(its cache was populated with local dc/rack by the previous patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a cache of endpoint:{dc,rack} on system keyspace cache, but the
local node is not there, because this data is populated from the peers
table, while local node's dc/rack is in snitch (or system.local table).
At the same time, storage_service::join_cluster() and whoever it calls
(e.g. -- the repair) will need this info on start and it's convenient
to have this data on sys-ks cache.
It's not on the peers part of the cache because next branch removes this
map and it's going to be very clumsy to have a whole container with just
one entry in it.
There's a peer code in system_keyspace::setup() that gets the local node
dc/rack and commits it into the system.local table. However, putting
the data into cache is done on .start(). This is because cql-test-env
needs this data cached too, but it doesn't call sys_ks.setup(). Will be
cleaned some other day.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology::update_endpoint() is now a plain wrapper over private
::add_endpoint() method of the same class. It's simpler to merge them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question tries to be on the safest side and adds the
endpoint for which it updates the tokens into the topology. From now on
it's up to the caller to put the endpoint into topology in advance.
So most of what this patch does is places topology.update_endpoint()
into the relevant places of the code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question populates topology's internal maps with endpoint
vs dc/rack relations. As for today the dc/rack values are taken from the
global snitch object (which, in turn, goes to gossiper, system keyspace
and its internal non-updateable cache for that).
This patch prepares the ground for providing the dc/rack externally via
argument. By now it's just an argument with empty strings, but next
patches will populate it with real values (spoiler: in 99% it's storage
service that calls this method and each call will know where to get it
from for sure)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method creates a copy of token metadata and pushes an endpoint (with
some tokens) into it. Next patches will require providing dc/rack info
together with the endpoint, this patch prepares for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both classes may populate (temporary clones of) token metadata object
with endpoint:tokens pairs for the endpoint they work with. Next patches
will require that endpoint comes with the dc/rack info. This patch makes
sure dht classes have the necessary information at hand (for now it's
just empty pair of strings).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
scylla_read_stats is not up-to-date w.r.t. the type of
`reader_concurrency_semaphore::_permit_list`, which was changed in 4.6.
Bring it up-to-date, keeping it backwards compatible with 4.5 and older
releases.
The `_M_str` member of an `std::string_view` is not guaranteed to be a
valid C string (e.g. be null-terminated). Printing it directly often
resulted in printing partial strings or printing gibberish, affecting in
particular the semaphore diagnostics dumps (scylla read-stats).
Use a more reliable method: read `_M_len` amount of bytes from `_M_str`
and decode as UTF-8.
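The approach can be sketched outside of gdb like this. The helper name
`read_string_view` and the injected `read_memory` callable are
illustrative; inside a gdb session the callable would presumably be
something like `gdb.selected_inferior().read_memory`. Using
`errors="replace"` is an assumption of this sketch, chosen so that stray
bytes do not abort the printer.

```python
def read_string_view(read_memory, m_str: int, m_len: int) -> str:
    """Decode a string_view given its data pointer and length.

    `read_memory(address, length)` must return `length` raw bytes.
    """
    raw = bytes(read_memory(m_str, m_len))
    # Never scan for a terminating NUL: the view may point into the middle
    # of a larger buffer.
    return raw.decode("utf-8", errors="replace")


# Fake inferior memory: the view covers only the first 9 bytes.
memory = b"semaphore-followed-by-unrelated-bytes"


def fake_read_memory(address, length):
    return memory[address:address + length]


print(read_string_view(fake_read_memory, 0, 9))  # semaphore
```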
This patch fixes the regression introduced by 3a51e78 which broke
a very important contract: perftune.yaml should not be "touched"
by Scylla scriptology unless explicitly requested.
And a call for scylla_cpuset_setup is such an explicit request.
The issue that the offending patch was intending to fix was that
cpuset.conf was always generated anew for every call of
scylla_cpuset_setup - even if a resulting cpuset.conf would come
out exactly the same as the one present on the disk before that call.
And since the original code was following the contract mentioned above
it was also deleting perftune.yaml every time too.
However, this was just an unavoidable side-effect of that cpuset.conf
re-generation.
The above also means that if scylla_cpuset_setup doesn't write to cpuset.conf
we should not "touch" perftune.yaml and vice versa.
This patch implements exactly that together with reverting the dangerous
logic introduced by 3a51e78.
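The rule "only drop perftune.yaml when cpuset.conf is actually rewritten"
can be sketched as follows. The function name, the return value and the
cpuset.conf line format are assumptions made for this example, not the
contents of the real script.

```python
import os


def write_cpuset_conf(conf_path: str, perftune_path: str, cpuset: str) -> bool:
    """Rewrite cpuset.conf only if its content changes.

    Returns True if the file was rewritten (and perftune.yaml dropped).
    """
    new_content = f'CPUSET="--cpuset {cpuset}"\n'  # illustrative format
    try:
        with open(conf_path) as f:
            if f.read() == new_content:
                return False  # nothing changes: leave perftune.yaml alone
    except FileNotFoundError:
        pass  # no previous configuration, fall through and write it
    with open(conf_path, "w") as f:
        f.write(new_content)
    # The CPU set changed, so a previously generated perftune.yaml is stale.
    if os.path.exists(perftune_path):
        os.remove(perftune_path)
    return True
```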
Fixes#11385
Fixes#10121
Modern perftune.py supports a more generic way of defining IRQ CPUs:
'irq_cpu_mask'.
This patch makes our auto-generation code create a perftune.yaml
that uses this new parameter instead of using outdated 'mode'.
As a side effect, this change eliminates the notion of "incorrect"
value in cpuset.conf - every value is valid now as long as it fits into
the 'all' CPU set of the specific machine.
Auto-generated 'irq_cpu_mask' is going to include all bits from 'all'
CPU mask except those defined in cpuset.conf.
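The mask arithmetic described above amounts to the following. Masks are
plain integers here; the helper names are invented for this sketch, and
the formatting of the value written to perftune.yaml is left out.

```python
def cpus_to_mask(cpus) -> int:
    """Turn an iterable of CPU indices into a bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def auto_irq_cpu_mask(all_mask: int, cpuset_mask: int) -> int:
    """All CPUs of the machine except the ones listed in cpuset.conf."""
    if cpuset_mask & ~all_mask:
        raise ValueError("cpuset does not fit into the 'all' CPU set")
    return all_mask & ~cpuset_mask


all_mask = cpus_to_mask(range(8))        # 8-CPU machine
cpuset_mask = cpus_to_mask(range(1, 8))  # cpuset.conf lists CPUs 1-7
print(hex(auto_irq_cpu_mask(all_mask, cpuset_mask)))  # 0x1
```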
Fixes#9903
Currently SCYLLA_BUILD_MODE is defined as a string by the cxxflags
generated by configure.py. This is not very useful since one cannot use
it in an #if preprocessor directive.
Instead, use -DSCYLLA_BUILD_MODE=release, for example, and define a
SCYLLA_BUILD_MODE_STR as the string representation of it.
In addition define the respective
SCYLLA_BUILD_MODE_{RELEASE,DEV,DEBUG,SANITIZE} macros that can be easily
used in #ifdef (or #ifndef :)) for conditional compilation.
The planned use case for it is to enable a task_manager test module only
in non-release modes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11357
Currently, if a keyspace has an aggregate and the keyspace
is dropped, the keyspace becomes corrupted and another keyspace
with the same name cannot be created again.
This is caused by the fact that when removing an aggregate, we
call create_aggregate() to get values for its name and signature.
In create_aggregate(), we check whether the row and final
functions for the aggregate exist.
Normally, that's not an issue, because when dropping an existing
aggregate alone, we know that its UDFs also exist. But when dropping
an entire keyspace, we first drop the UDFs, making us unable to drop
the aggregate afterwards.
This patch fixes this behavior by removing the create_aggregate()
from the aggregate dropping implementation and replacing it with
specific calls for getting the aggregate name and signature.
Additionally, a test that would previously fail is added to
cql-pytest/test_uda.py where we drop a keyspace with an aggregate.
Fixes #11327
Closes #11375
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
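The shape of the fix can be sketched with Python's asyncio. This is an analogy only, not the Seastar code: the `Server` class, the `_non_joint_commit` field and the `AbortRequested` exception are hypothetical stand-ins for the promise field and the `abort_source`:

```python
import asyncio


class AbortRequested(Exception):
    """Stand-in for the exception delivered on abort (hypothetical)."""


class Server:
    def __init__(self):
        # Stand-in for the promise used to wait on the non-joint entry.
        self._non_joint_commit = None

    async def set_configuration(self, abort: asyncio.Event):
        # Two concurrent calls must not share the promise.
        assert self._non_joint_commit is None
        fut = asyncio.get_running_loop().create_future()
        self._non_joint_commit = fut

        async def on_abort():
            # "Subscribe" to the abort source: resolve the promise
            # exceptionally if an abort is requested.
            await abort.wait()
            if not fut.done():
                fut.set_exception(AbortRequested())

        watcher = asyncio.create_task(on_abort())
        try:
            await fut  # resolved when the non-joint entry commits
        finally:
            watcher.cancel()
            # Reset the field so the next call starts clean.
            self._non_joint_commit = None
```

Without the watcher, a caller that lost leadership between the two entries would wait on `fut` forever, regardless of the abort request.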
Fixes#11288.
Closes#11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
"
On token_metadata there are two update_normal_tokens() overloads --
one updates tokens for a single endpoint, another one -- for a set
(well -- std::map) of them. Other than updating the tokens both
methods also may add an endpoint to the t.m.'s topology object.
There's an ongoing effort in moving the dc/rack information from
snitch to topology, and one of the changes made in it is -- when
adding an entry to topology, the dc/rack info should be provided
by the caller (which in 99% of the cases is the storage service).
The batched tokens update is extremely unfriendly to the latter
change. Fortunately, this helper is only used by tests, the core
code always uses fine-grained tokens updating.
"
* 'br-tokens-update-relax' of https://github.com/xemul/scylla:
token_metadata: Indentation fix after previous patch
token_metadata: Remove excessive empty tokens check
token_metadata: Remove batch tokens updating method
tests: Use one-by-one tokens updating method
Some cases in test_wasm.py assumed that all cases
are run in the same order every time and depended
on values that should have been added to tables in
previous cases. Because of that, they were sometimes
failing. This patch removes this assumption by
adding the missing inserts to the affected cases.
Additionally, the assert that confirms a low miss
rate of UDFs is made more precise, and a comment is added
to explain it clearly.
Closes#11367
It could happen that we accessed failure detector service after it was
stopped if a reconfiguration happened in the 'right' moment. This would
result in an assertion failure. Fix this.
Closes#11326
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.
Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
schema operations.
With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).
This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.
Read the design doc, then read the commits in topological order
for best reviewing experience.
---
I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.
As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).
Closes#10835
* github.com:scylladb/scylladb:
service/raft: raft_group0: implement upgrade procedure
service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
service/raft: raft_group0: introduce local loggers for group 0 and upgrade
service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
service/raft: raft_group0_client: prepare for upgrade procedure
service/raft: introduce `group0_upgrade_state`
db: system_keyspace: introduce `load_peers`
idl-compiler: introduce cancellable verbs
message: messaging_service: cancellable version of `send_schema_check`
After the previous patch empty passed tokens make the helper co_return
early, so this `if` is dead code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No users left.
The endpoint_tokens.empty() check is removed, only tests could trigger
it, but they didn't and are patched out.
Indentation is left broken
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tests are the only users of batch tokens updating "sugar" which
actually makes things more complicated
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_pending_address_ranges() accepting a single token is not in use,
its peer that accepts a set of tokens is
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11358
Currently the state of LSA is scattered across a handful of global variables. This series consolidates all these into a single one: the shard tracker. Beyond reducing the number of globals (the fewer globals, the better) this paves the way for a planned de-globalization of the shard tracker itself.
There is one separate global left, the static migrators registry. This is left as-is for now.
Closes#11284
* github.com:scylladb/scylladb:
utils/logalloc: remove reclaim_timer:: globals
utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg
utils/logalloc: move global stat accessors to tracker
utils/logalloc: allocating_section: don't use the global tracker
utils/logalloc: pass down tracker::impl reference to segment_pool
utils/logalloc: move segment pool into tracker
utils/logalloc: add tracker member to basic_region_impl
utils/logalloc: make segment independent of segment pool
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.
Fixes#11245
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11246
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by dropping any remaining entry waiters (if
the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes#11235.
Closes#11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
Fixes#11349
In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).
This functionality is however required for certain features of the enterprise version (ear).
As such it needs to be restored and reenabled. This patch set does so, adapted
to the recent version of this file.
Closes#11350
* github.com:scylladb/scylladb:
distributed_loader: Restore separate processing of keyspace init prio/normal
Revert "distributed_loader: Remove unused load-prio manipulations"
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).
The listener starts the `upgrade_to_group0` procedure.
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
`system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
for schema operations (only those for now).
The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.
`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.
If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.
We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.
To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
(the keys are `raft_group0_id` and `raft_server_id`).
Then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.
Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consists of schema tables, and the upgrade
procedure makes sure to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
whenever we extend group 0 with new stuff...)
Add some more logging to `randomized_nemesis_test` such as logging the
start and end of a reconfiguration operation in a way that makes it easy
to find one given the other in the logs.
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes#11288.
Fixes#11349
In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).
This functionality is however required for certain features of the enterprise version (ear).
As such it needs to be restored and reenabled. This patch and the revert before it do so, adapted
to the recent version of this file.
This reverts commit 7396de72b1.
In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).
This functionality is however required for certain features of the enterprise version (ear).
As such it needs to be restored and reenabled. This reverts the actual commit; the patch after it
ensures we use the prio set.
This series turns plan_id from a generic UUID into a strong type so it can't be used interchangeably with other UUIDs.
While at it, streaming/stream_fwd.hh was added for forward declarations and the definition of plan_id.
Also, `stream_manager::update_progress` parameter name was renamed to plan_id to represent its assumed content, before changing its type to `streaming::plan_id`.
Closes#11338
* github.com:scylladb/scylladb:
streaming: define plan_id as a strong tagged_uuid type
stream_manager: update_progress: rename cf_id param to plan_id
streaming: add forward declarations in stream_fwd.hh
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
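A minimal sketch of such an up-front check, in Python for brevity. The exception name, the function name and the way the limit is passed in are all hypothetical; the source only states that an explicit limit and a clear error message were introduced:

```python
class CommandTooLarge(Exception):
    """Raised before the command reaches the log (hypothetical name)."""


def check_command_size(command: bytes, max_command_size: int) -> None:
    """Reject an oversized command early, with a descriptive message,
    instead of letting it fail later while the log entry is persisted."""
    if len(command) > max_command_size:
        raise CommandTooLarge(
            f"command size {len(command)} exceeds the limit of "
            f"{max_command_size} bytes"
        )
```

Checking at submission time means the client sees the cause directly rather than a timeout.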
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.
Instead, iterate over the range_tombstone_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.
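The difference between the two approaches can be sketched in Python. The `RangeTombstone` model below is a deliberately simplified, hypothetical stand-in (integer bounds instead of clustering positions); it only shows lazy, one-at-a-time reversal instead of building a reversed copy of the whole list:

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class RangeTombstone:
    start: int
    end: int
    timestamp: int


def reversed_range_tombstones(
    rts: Sequence[RangeTombstone],
) -> Iterator[RangeTombstone]:
    """Walk the list back to front, reversing each element on demand.

    Only one reversed tombstone exists at a time, so the cost per step is
    constant no matter how many range tombstones the list holds.
    """
    for rt in reversed(rts):
        # In reversed order the bounds swap roles.
        yield RangeTombstone(start=rt.end, end=rt.start, timestamp=rt.timestamp)
```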
While at it, this series contains some other cleanups on this path
to improve the code readability and maybe make the compiler's life
easier as for optimizing the cleaned-up code.
Closes#11271
* github.com:scylladb/scylladb:
mutation: consume_clustering_fragments: get rid of reversed_range_tombstones
mutation: consume_clustering_fragments: reindent
mutation: consume_clustering_fragments: shuffle emit_rt logic around
mutation: consume, consume_gently: simplify partition_start logic
mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
mutation: consume_clustering_fragments: keep the reversed schema in cookie
mutation: clustering_iterators: get rid of current_rt
mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
These are pretend free functions, accessing globals in the background.
Make them members of the tracker instead, which has everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
For now this member is initialized from the global tracker instance. But
it allows the members of region impl to be detached from said global,
making a step towards removing it.
segment has some members, which simply forward the call to a
segment_pool method, via the global segment_pool instance. Remove these
and make the callers use the segment pool directly instead.
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.
Pass 60 seconds timeout (default is 10) to match the session's.
Fixes https://github.com/scylladb/scylladb/issues/11289
Closes #11348
* github.com:scylladb/scylladb:
test.py: bump schema agreement timeout for topology tests
test.py: bump timeout of async requests for topology
test.py: fix bad indent
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.
This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.
Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
Add a unit test that verifies that.
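The ordering requirement can be illustrated with a simplified Python model. Positions are plain integers and fragments are `(position, payload)` tuples, all of which are assumptions made for the sketch; the real code works on position_in_partition and the visitor interfaces:

```python
def consume_in_order(range_tombstones, rows):
    """Merge two position-sorted sequences into one position-ordered stream.

    Each input is a list of (position, payload) tuples sorted by position.
    On equal positions the range tombstone is emitted first, so a consumer
    tracking a lower bound never sees it move backwards.
    """
    out = []
    i = j = 0
    while i < len(range_tombstones) and j < len(rows):
        if range_tombstones[i][0] <= rows[j][0]:
            out.append(("rt",) + tuple(range_tombstones[i]))
            i += 1
        else:
            out.append(("row",) + tuple(rows[j]))
            j += 1
    out.extend(("rt",) + tuple(rt) for rt in range_tombstones[i:])
    out.extend(("row",) + tuple(r) for r in rows[j:])
    return out
```

Consuming all of `range_tombstones` first and all of `rows` afterwards, as the old code did, would produce positions that go backwards as soon as a row sorts before the last tombstone.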
Fixes #11198
Closes #11269
* github.com:scylladb/scylladb:
mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
frozen_mutation: consume and consume_gently in-order
frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.
Pass 60 seconds timeout (default is 10) to match the session's.
This will hopefully fix timeout failures in debug mode.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
So that the frozen_mutation consumer can return
stop_iteration::yes if it wishes to stop consuming at
some clustering position.
In this case, on_end_of_partition must still be called
so a closing range_tombstone_change can be emitted to the consumer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.
This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.
Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
Add a unit test that verifies that.
Fixes#11198
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Improve the randomness of this test, making it a bit easier to
reproduce the scenarios that the test aims to catch.
Increase timeouts a bit to account for this additional randomness.
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by dropping any remaining entry waiters (if
the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Fixes#11235.
In `std::holds_alternative`+`std::get` version, the `get` performs a
redundant check. Also `std::visit` gives a compile-time exhaustiveness
check (whether we handled all possible cases of the `variant`).
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.
Instead, iterate over the range_tombstone_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.
The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.
So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.
Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.
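For illustration, a client-side check of the fields in question might look like the Python sketch below. The function name is hypothetical, and `table` is assumed to be the "Table" object of a DescribeTable response; only the fields discussed above are examined:

```python
def has_pay_per_request_throughput(table: dict) -> bool:
    """Return True if the table description reports PAY_PER_REQUEST billing
    together with a ProvisionedThroughput structure of zeros."""
    billing_mode = table.get("BillingModeSummary", {}).get("BillingMode")
    throughput = table.get("ProvisionedThroughput")
    if throughput is None:
        # This is the case that tripped up connectors before the patch.
        return False
    return (
        billing_mode == "PAY_PER_REQUEST"
        and throughput.get("ReadCapacityUnits") == 0
        and throughput.get("WriteCapacityUnits") == 0
    )


# Relevant fragment of a table description after this patch.
example = {
    "BillingModeSummary": {"BillingMode": "PAY_PER_REQUEST"},
    "ProvisionedThroughput": {"ReadCapacityUnits": 0, "WriteCapacityUnits": 0},
}
print(has_pay_per_request_throughput(example))
```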
Fixes#11222
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11298
On recent versions of systemd, StandardOutput=syslog is obsolete.
We should use StandardOutput=journal instead, but since it's the default value,
we can just drop it.
Fixes #11322
Closes #11339
Provides separate control over debuginfo for perf tests
since enabling --tests-debuginfo affects both today
causing the Jenkins archives of perf tests binaries to
inflate considerably.
Refs https://github.com/scylladb/scylla-pkg/issues/3060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11337
Before changing its type to streaming::plan_id
this patch clarifies that the parameter actually represents
the plan id and not the table id as its name suggests.
For reference, see the call to update_progress in
`stream_transfer_task::execute`, as well as the function
using _stream_bytes, whose map key is the plan id.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
from Tomasz Grabiec
This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.
No known production impact.
Refs https://github.com/scylladb/scylladb/issues/11307
Closes #11312
* github.com:scylladb/scylladb:
test: mutation_test: Add explicit test for mutation commutativity
test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
db: mutation_partition: Drop unnecessary maybe_shadow()
db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
mutation_partition: row: make row marker shadowing symmetric
Now, whether a 'group 0 operation' (today it means schema change) is
performed using the old or new methods, doesn't depend on the local RAFT
feature being enabled, but on the state of the upgrade procedure.
In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.
The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.
To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.
We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).
Some additional comments were written.
Define an enum class, `group0_upgrade_state`, describing the state of
the upgrade procedure (implemented in later commits).
Provide IDL definitions for (de)serialization.
The node will have its current upgrade state stored on disk in
`system.scylla_local` under the `group0_upgrade_state` key. If the key
is not present we assume `use_pre_raft_procedures` (meaning we haven't
started upgrading yet or we're at the beginning of upgrade).
Introduce `system_keyspace` accessor methods for storing and retrieving
the on-disk state.
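The defaulting rule can be sketched in Python. The enum mirrors the state names used in these commit messages; the string values, the dict standing in for `system.scylla_local`, and the function name are assumptions made for the sketch:

```python
from enum import Enum


class Group0UpgradeState(Enum):
    USE_PRE_RAFT_PROCEDURES = "use_pre_raft_procedures"
    SYNCHRONIZE = "synchronize"
    USE_NEW_PROCEDURES = "use_new_procedures"
    RECOVERY = "recovery"


def load_group0_upgrade_state(scylla_local: dict) -> Group0UpgradeState:
    """Read the persisted upgrade state from a key-value view of the table.

    A missing key means the upgrade has not started yet (or has only just
    begun), so the pre-Raft procedures are assumed.
    """
    raw = scylla_local.get("group0_upgrade_state")
    if raw is None:
        return Group0UpgradeState.USE_PRE_RAFT_PROCEDURES
    return Group0UpgradeState(raw)
```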
The compiler allowed passing a `with_timeout` flag to a verb definition;
it then generated functions for sending and handling RPCs that accepted
a timeout parameter.
We would like to generate functions that accept an `abort_source` so an
RPC can be cancelled from the sender side. This is both more and less
powerful than `with_timeout`. More powerful because you can abort on
other conditions than just reaching a certain point in time. Less
powerful because you can't abort the receiver. In any case, sometimes
useful.
For this the `cancellable` flag was added.
You can't use `with_timeout` and `cancellable` on the same verb.
Note that this uses an already existing function in RPC module,
`send_message_cancellable`.
This RPC will be used during the Raft upgrade procedure during schema
synchronization step.
Make a version which can be cancelled when the upgrade procedure gets
aborted.
- Remove `ScyllaCluster.__getitem__()` (pending request by @kbr- in a previous pull request), for this remove all direct access to servers from caller code
- Increase Python driver timeouts (req by @nyh)
- Improve `ManagerClient` API requests: use `http+unix://<sockname>/<resource>` instead of `http://localhost/<resource>` and callers of the helper method only pass the resource
- Improve lint and type hints
Closes#11305
* github.com:scylladb/scylladb:
test.py: remove ScyllaCluster.__getitem__()
test.py: ScyllaCluster check keyspace with any server
test.py: ScyllaCluster server error log method
test.py: ScyllaCluster read_server_log()
test.py: save log point for all running servers
test.py: ScyllaCluster provide endpoint
test.py: build host param after before_test
test.py: manager client disable lint warnings
test.py: scylla cluster lint and type hint fixes
test.py: increase more timeouts
test.py: ManagerClient improve API HTTP requests
Dtest fails if it sees unknown errors in the logs. This series
reduces severity of some errors (since they are actually expected during
shutdown) and removes some others that duplicate already existing errors
that dtest knows how to deal with. Also fix one case of unhandled
exception in schema management code.
* 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla:
raft: getting abort_requested_exception exception from a sm::apply is not a critical error
schema_registry: fix abandoned feature warning
service: raft: silence rpc::closed_errors in raft_rpc
Given 3 row mutations:
m1 = {
marker: {row_marker: dead timestamp=-9223372036854775803},
tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}
m2 = {
marker: {row_marker: timestamp=-9223372036854775805}
}
m3 = {
tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}
}
We get different shadowable tombstones depending on the order of merging:
(m1 + m2) + m3 = {
marker: {row_marker: dead timestamp=-9223372036854775803},
tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}
}
m1 + (m2 + m3) = {
marker: {row_marker: dead timestamp=-9223372036854775803},
tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}
The reason is that in the second case the shadowable tombstone in m3
is shadowed by the row marker in m2. In the first case, the marker in
m2 is cancelled by the dead marker in m1, so shadowable tombstone in
m3 is not cancelled (the marker in m1 does not cancel because it's
dead).
This wouldn't happen if the dead marker in m1 was accompanied by a
hard tombstone of the same timestamp, which would effectively make the
difference in shadowable tombstones irrelevant.
Found by row_cache_test.cc::test_concurrent_reads_and_eviction.
I'm not sure if this situation can be reached in practice (dead marker
in mv table but no row tombstone).
Work around it in tests by producing a row tombstone whenever there is a
dead marker.
Refs #11307
When the row has a live row marker which shadows the shadowable
tombstone, the shadowable tombstone should not be effective. The code
assumes that _shadowable always reflects the current tombstone, so
maybe_shadow() needs to be called whenever marker or regular tombstone
changes. This was not ensured by row::apply(tombstone).
This causes problems in tests which use random_mutation_generator,
which generates mutations which would violate this invariant, and as a
result, mutation commutativity would be violated.
I am not aware of problems in production code.
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakiness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.
This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.
There is still one vulnerability left: `row_marker& row_marker()`, which
allows overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.
Fixes: #9483
Tests: unit(dev)
Closes#9519
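The symmetric shadowing check described above can be illustrated with a toy model. This is only a sketch: `ToyRow`, `apply_marker` and `apply_shadowable` are hypothetical names, timestamps are plain integers, and regular tombstones, TTLs and timestamp ties are ignored. It is not Scylla's actual `deletable_row` code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRow:
    """Toy row holding a row marker and a shadowable tombstone (hypothetical model)."""
    marker_ts: Optional[int] = None
    marker_live: bool = False
    shadowable_ts: Optional[int] = None

    def _maybe_shadow(self) -> None:
        # A live marker newer than the shadowable tombstone cancels the tombstone.
        if (self.marker_live
                and self.shadowable_ts is not None
                and self.marker_ts > self.shadowable_ts):
            self.shadowable_ts = None

    def apply_marker(self, ts: int, live: bool) -> None:
        # Keep the newer marker, then re-check shadowing.
        if self.marker_ts is None or ts > self.marker_ts:
            self.marker_ts, self.marker_live = ts, live
        self._maybe_shadow()

    def apply_shadowable(self, ts: int) -> None:
        # Keep the newer tombstone, then re-check shadowing; this second
        # check is what makes the two apply orders agree.
        if self.shadowable_ts is None or ts > self.shadowable_ts:
            self.shadowable_ts = ts
        self._maybe_shadow()
```

Without the `_maybe_shadow()` call in `apply_shadowable()`, applying the marker first and the tombstone second would leave the tombstone in place, while the opposite order would drop it.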
and set crs and rts only in the block where they are used,
so we can get rid of reversed_range_tombstones.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than reversing the schema on every call
just keep the potentially reversed schema in cookie.
Otherwise, cookie.schema was write only.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide server error logs to caller (test.py).
Avoids direct access to list of servers.
To be done later: pick the failed server. For now it just provides the
log of one server.
While there, fix type hints.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Instead of accessing the first server, now test.py asks ScyllaCluster
for the server log.
In a later commit, ScyllaCluster will pick the appropriate server.
Also removes another direct access to the list of servers we want to get
rid of.
For error reporting, before a test a mark of the log point in time is
saved. Previously, only the log of the first server was saved. Now it's
done for all running servers.
While there, remove direct access to servers on test.py.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
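A minimal sketch of the "log mark" idea follows, assuming the mark is simply the log file's size in bytes before the test starts. The helper names `take_log_marks` and `read_since_mark` are hypothetical and not taken from test.py.

```python
import os
from typing import Dict


def take_log_marks(log_paths: Dict[str, str]) -> Dict[str, int]:
    """Record the current size of every running server's log (hypothetical helper)."""
    return {server: os.path.getsize(path) for server, path in log_paths.items()}


def read_since_mark(path: str, mark: int) -> str:
    """Return only what was appended to the log after the mark was taken."""
    with open(path, "rb") as f:
        f.seek(mark)
        return f.read().decode("utf-8", errors="replace")
```

Taking one mark per server means an error report can show the new output of whichever server failed, not only the first one.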
If no server started, there is no server in the cluster list. So only
build the pytest --host param after before_test check is done.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Increase Python driver connection timeouts to deal with extreme cases
for slow debug builds on slow machines as done (and explained) in
95bd02246a.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use the AF Unix socket name as host name instead of localhost and avoid
repeating the full URL for callers of _request() for the Manager API
requests from the client.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
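As a rough illustration of the URL scheme, the sketch below builds an `http+unix://<sockname>/<resource>` URL from a socket path and a resource. The function name `manager_url` is hypothetical, and percent-encoding the socket path is an assumption borrowed from the convention used by libraries such as requests-unixsocket. The commit itself does not spell out the encoding.

```python
from urllib.parse import quote


def manager_url(sock_path: str, resource: str) -> str:
    """Build an http+unix URL; callers pass only the resource (hypothetical helper)."""
    # Encode "/" in the socket path so it is not parsed as part of the URL path.
    host = quote(sock_path, safe="")
    return f"http+unix://{host}/{resource.lstrip('/')}"
```

With a helper like this, a caller only writes `manager_url(sock, "/cluster/up")` instead of repeating the full URL at every call site.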
If the consumer returns stop_iteration::yes for a flushed
row (static or clustered), we should return early and
not consume any more fragments until `on_end_of_partition`,
where we may still consume a closing range_tombstone_change
past the last consumed row.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Consuming the static row is the first opportunity for
the consumer to return stop_iteration::yes, so there's no
point in checking `_stop_consuming` before consuming it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We already flushed rt_gen when building the current_row.
When we get to flush_rows_and_tombstones, we should
just consume it, as the passed position is not that of the
current_row but rather a position following it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In commit 7eda6b1e90, we increased the
request_timeout parameter used by cql-pytest tests from the default of
10 seconds to 120 seconds. 10 seconds was usually more than enough for
finishing any Scylla request, but it turned out that in some extreme
cases of a debug build running on an extremely over-committed machine,
the default timeout was not enough.
Recently, in issue #11289 we saw additional cases of timeouts which
the request_timeout setting did *not* solve. It turns out that the Python
CQL driver has two additional timeout settings - connect_timeout and
control_connection_timeout, which default to 5 seconds and 2 seconds
respectively. I believe that most of the timeouts in issue #11289
come from the control_connection_timeout setting - by changing it
to a tiny number (e.g., 0.0001) I got the same error messages as those
reported in #11289. The default of that timeout - 2 seconds - is
certainly low enough to be reached on an extremely over-committed
machine.
So this patch significantly increases both connect_timeout and
control_connection_timeout to 60 seconds. We don't care that this timeout
is ridiculously large - under normal operations it will never be reached.
There is no code which loops for this amount of time, for example.
Refs #11289 (perhaps even Fixes, we'll need to see that the test errors
go away).
NOTE: This patch only changes test/cql-pytest/util.py, which is only
used by the cql-pytest test suite. We have multiple other test suites which
copied this code, and those test suites might need fixing separately.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11295
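The three timeouts discussed above could be wired into the Python driver roughly as follows. This is a sketch, not the actual contents of test/cql-pytest/util.py: `cluster_kwargs` and `make_cluster` are hypothetical helper names, and the driver import is deferred so the snippet loads even where the driver is not installed.

```python
REQUEST_TIMEOUT = 120            # seconds, per-request timeout
CONNECT_TIMEOUT = 60             # seconds, driver default is 5
CONTROL_CONNECTION_TIMEOUT = 60  # seconds, driver default is 2


def cluster_kwargs() -> dict:
    """Connection-level timeouts passed to the Cluster constructor."""
    return {
        "connect_timeout": CONNECT_TIMEOUT,
        "control_connection_timeout": CONTROL_CONNECTION_TIMEOUT,
    }


def make_cluster(contact_points):
    """Create a Cluster with all three timeouts raised (requires the CQL driver)."""
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    profile = ExecutionProfile(request_timeout=REQUEST_TIMEOUT)
    return Cluster(contact_points=contact_points,
                   execution_profiles={EXEC_PROFILE_DEFAULT: profile},
                   **cluster_kwargs())
```

Note that `request_timeout` lives on the execution profile while the other two are arguments of the cluster itself, which is one reason they are easy to overlook.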
Right now, if there's a node for which we don't know the features
supported by this node (they are neither persisted locally, nor gossiped
by that node), we would skip this node in calculating the set
of enabled features and potentially enable a feature which shouldn't be
enabled - because that node may not know it. We should only enable a
feature when we know that all nodes have upgraded and know the feature.
This bug caused us problems when we tried to move RAFT out of
experimental. There are dtests such as `partitioner_tests.py` in which
nodes would enable features prematurely, which caused the Raft upgrade
procedure to break (the procedure starts only when all nodes upgrade
and announce that they know the SUPPORTS_RAFT cluster feature).
Closes#11225
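The rule "enable a feature only when every node is known to support it" can be sketched as a set intersection that refuses to proceed when any node's feature set is unknown. The function name and the use of `None` for "unknown" are assumptions made for this illustration. The real logic lives in Scylla's C++ feature and gossip code.

```python
from typing import Dict, Optional, Set


def enabled_features(supported: Dict[str, Optional[Set[str]]]) -> Set[str]:
    """Features safe to enable, given each node's supported set (None = unknown)."""
    # If we know nothing about some node, it may lack any feature:
    # enable nothing rather than skipping that node.
    if not supported or any(feats is None for feats in supported.values()):
        return set()
    return set.intersection(*(set(feats) for feats in supported.values()))
```

The buggy behaviour corresponds to filtering out the `None` entries before intersecting, which would let two upgraded nodes enable a feature behind the back of a third node.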
This pull request introduces global secondary-indexing for non-frozen collections.
The intent is to enable such queries:
```
CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);
SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```
Here we use the familiar materialized views (MVs). Scylla treats all
collections the same way - they're a list of pairs (key, value). In the case
of sets, the value type is a dummy one. In the case of lists, the key type is
TIMEUUID. When describing the design, I will forget that there is more
than one collection type. Suppose that the columns in the base table
were as follows:
```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```
The MV schema is as follows (the names of columns which are not the same
as in base might be different). All the columns here form the primary
key.
```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```
The reason for the last additional column is that the values from a collection might not be unique.
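To see why the values index needs the extra key column, here is a small sketch that derives view primary keys from one base row. It is a simplification under stated assumptions: `view_keys` is a hypothetical name, and `idx_token` and the base clustering columns are left out for brevity.

```python
from typing import Dict, List, Tuple


def view_keys(kind: str, pkey: int, coll: Dict[int, str]) -> List[Tuple]:
    """View primary keys produced by one base row, per index kind (toy model)."""
    if kind == "keys":
        return [(k, pkey) for k in coll]
    if kind == "entries":
        return [((k, v), pkey) for k, v in coll.items()]
    if kind == "values":
        # The collection key is appended so equal values stay distinct rows.
        return [(v, pkey, k) for k, v in coll.items()]
    raise ValueError(f"unknown index kind: {kind}")
```

For a map `{1: "x", 2: "x"}` the values index yields two view rows, `("x", pkey, 1)` and `("x", pkey, 2)`. Without the trailing key they would collapse into one, and removing a single map entry would wrongly remove the row from the index.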
Fixes#2962
Fixes#8745
Fixes#10707
This patch does not implement **local** secondary indexes for collection columns: Refs #10713.
Closes#10841
* github.com:scylladb/scylladb:
test/cql-pytest: un-xfail yet another passing collection-indexing test
secondary index: fix paging in map value indexing
test/cql-pytest: test for paging with collection values index
cql, view: rename and explain bytes_with_action
cql, index: make collection indexing a cluster feature
test/cql-pytest: failing tests for oversized key values in MV and SI
cql: fix secondary index "target" when column name has special characters
cql, index: improve error messages
cql, index: fix default index name for collection index
test/cql-pytest: un-xfail several collecting indexing tests
test/cql-pytest/test_secondary_index: verify that local index on collection fails.
docs/design-notes/secondary_index: add `VALUES` to index target list
test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
test/boost/secondary_index_test: update for non-frozen indexes on collections
test/cql-pytest: Uncomment collection indexes tests that should be working now
cql, index: don't use IS NOT NULL on collection column
cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accommodate for indexes over collections
cql3, index: Use entries() indexes on collections for queries
cql3, index: Use keys() and values() indexes on collections for queries.
types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
db/view/view.cc: compute view_updates for views over collections
view info: has_computed_column_depending_on_base_non_primary_key
column_computation: depends_on_non_primary_key_column
schema, index/secondary_index_manager: make schema for index-induced mv
index/secondary_index_manager: extract keys, values, entries types from collection
cql3/statements/: validate CREATE INDEX for index over a collection
cql3/statements/create_index_statement,index_target: rewrite index target for collection
column_computation.hh, schema.cc: collection_column_computation
column_computation.hh, schema.cc: compute_value interface refactor
Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
Commit 23acc2e848 broke the "--ssl" option of test/cql-pytest/run
(which makes Scylla - and cqlpytest - use SSL-encrypted CQL).
The problem was that there was a confusion between the "ssl" module
(Python's SSL support) and a new "ssl" variable. A rename and a missing
"import" solves the breakage.
We never noticed this because Jenkins does *not* run cql-pytest/run
with --ssl (actually, it no longer runs cql-pytest/run at all).
It is still a useful option for checking SSL-related problems in Scylla
and Seastar.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11292
Test schema changes when there was an underlying topology change.
- per test case checks of cluster health and cycling
- helper class to do cluster manager API requests
- tests can perform topology changes: stop/start/restart servers
- modified clusters are marked dirty and discarded after the test case
- cql connection is updated per topology change and per cluster change
Closes#11266
* github.com:scylladb/scylladb:
test.py: test topology and schema changes
test.py: ClusterManager API mark cluster dirty
test.py: call before/after_test for each test case
test.py: handle driver connection in ManagerClient
test.py: ClusterManager API and ManagerClient
test.py: improve topology docstring
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.
Fixes: https://github.com/scylladb/scylladb/issues/11264
Closes#11273
* github.com:scylladb/scylladb:
querier: querier_cache: remove now unused evict_all_for_table()
database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
reader_concurrency_semaphore: add evict_inactive_reads_for_table()
It should have had one, derived instances are stored and destroyed via
the base-class. The only reason this hasn't caused bugs yet is that
derived instances happen to not have any non-trivial members yet.
Closes#11293
A mixed bag of improvements developed as part of another PR (https://github.com/scylladb/scylladb/pull/10736). Said PR was closed so I'm submitting these improvements separately.
Closes#11294
* github.com:scylladb/scylladb:
test/lib: move convenience table config factory to sstable_test_env
test/lib/sstable_test_env: move members to impl struct
test/lib/sstable_utils: use test_env::do_with_async()
Instead of querier_cache::evict_all_for_table(). The new method covers
all queriers and, in addition, any other inactive reads registered on the
semaphore. In theory, by the time we detach a table, no regular inactive
reads should be in the semaphore anymore, but if any are still there, we
had better evict them before the table is destroyed, since they might
attempt to access it when they are destroyed later.
All users of `column_family_test_config()`, get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
All present members of sstable_test_env are std::unique_ptr<>:s because
they require stable addresses. This makes their handling somewhat
awkward. Move all of them into an internal `struct impl` and make that
member a unique ptr.
Fixes#11184
Fixes#11237
In prev (broken) fix for https://github.com/scylladb/scylladb/issues/11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.
This effectively prevents us from creating segments iff we have tight limits. Since we nowadays do quite a bit of inserts _before_ commitlog replay (system.local, but...) we can end up in a situation where we deadlock start because we cannot get to the actual replay that will eventually free things.
Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances - even though only shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.
Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved.
To further emphasize this, don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at least don't store
pointless string lists in shards 1 to X, and also we can potentially ask for the list twice.
More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
Closes#11251
* github.com:scylladb/scylladb:
commitlog: Make get_segments_to_replay on-demand
commitlog: Revert/modify fac2bc4 - do footprint add in delete
Fix https://github.com/scylladb/scylladb/issues/11197
This PR adds a new page where specifying workload attributes with service levels is described and adds it to the menu.
Also, I had to fix some links because of the warnings.
Closes#11209
* github.com:scylladb/scylladb:
doc: remove the redundant space from index
doc: update the syntax for defining service level attributes
doc: rewording
doc: update the links to fix the warnings
doc: add the new page to the toctree
doc: add the description of specifying workload attributes with service levels
doc: add the definition of workloads to the glossary
In preparation for effective_replication_map hygiene, convert
some counter functions to coroutines to simplify the changes.
Closes#11291
* github.com:scylladb/scylladb:
storage_proxy: mutate_counters_on_leader: coroutinize
storage_proxy: mutate_counters: coroutinize
storage_proxy: mutate_counters: reorganize error handling
Simplify ahead of refactoring for consistent effective_replication_map.
This is probably a pessimization of the error case, but the error case
will be terrible in any case unless we resultify it.
Move the error handling function where it's used so the code
is more straightforward.
Due to some std::move()s later, we must still capture the schema early.
Move the termination condition to the front of the loop so it's
clear why we're looping and when we stop.
It's less than perfectly clean since we widen the scope of some variables
(from loop-internal to loop-carried), but IMO it's clearer.
It's much easier to maintain this way. Since it uses ranges_to_vnodes,
it interacts with topology and needs integration into
effective_replication_map management.
The patch leaves bad indentation and an infinite-looking loop in
the interest of minimization, but that will be corrected later.
Note, the test for `!r.has_value()` was eliminated since it was
short-circuited by the test for `!rqr.has_value()` returning from
the coroutine rather than propagating an error.
We use result_wrap() in two places, but that makes coroutinizing the
containing function a little harder, since it's composed of more lambdas.
Remove the wrappers, gaining a bit of performance in the error case.
The function `check_exists` checks whether a given table exists, giving
an error otherwise. It previously used `on_internal_error`.
`check_exists` is used in some old functions that insert CDC metadata to
CDC tables. These tables are no longer used in newer Scylla versions
(they were replaced with other tables with different schema), and this
function is no longer called. The table definitions were removed and
these tables are no longer created. They will only exist in clusters
that were upgraded from old versions of Scylla (4.3) through a sequence
of upgrades.
If you tried to upgrade from a very old version of Scylla which had
neither the old nor the new tables to a modern version, say from 4.2 to
5.0, you would get `on_internal_error` from this `check_exists`
function. Fortunately:
1. we don't support such upgrade paths
2. `on_internal_error` in production clusters does not crash the system,
only throws. The exception would be caught, printed, and the system
would run (just without CDC - until you finished upgrade and called
the proper nodetool command to fix the CDC module).
Unfortunately, there is a dtest (`partitioner_tests.py`) which performs
an unsupported upgrade scenario - it starts Scylla from Cassandra (!)
work directories, which is like upgrading from a very old version of
Scylla.
This dtest was not failing due to another bug which masked the problem.
When we try to fix the bug - see #11225 - the dtest starts hitting the
assertion in `check_exists`. Because it's a test, we configure
`on_internal_error` to crash the system.
The point of this commit is to not crash the system in this rare
scenario which happens only in some weird tests. We now throw
`std::runtime_error` instead of calling `on_internal_error`. In the
dtest, we already ignore the resulting CDC error appearing in the logs
(see scylladb/scylla-dtest#2804). Together with this change, we'll be
able to fix the #11225 bug and pass this test.
Closes#11287
After collection indexing has been implemented, yet another test which
failed because of #2962 now passes. So remove the "xfail" marker.
Refs #2962
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When indexing a map column's *values*, if the same value appears more
than once, the same row will appear in the index more than once. We had
code that removed these duplicates, but this deduplication did not work
across page boundaries. We had two xfailing tests to demonstrate this bug.
In this patch we fix this bug by looking at the page's start and not
generating the same row again, thereby getting the same deduplication
we had inside pages - now across pages.
The previously-xfailing tests now pass, and their xfail tag is removed.
I also added another test, for the case where the base table has only
partition keys without clustering keys. This second test is important
because the code path for the partition-key-only case is different,
and the second test exposed a bug in it as well (which is also fixed
in this patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
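The cross-page deduplication can be sketched as follows, relying on the fact that duplicate base primary keys come out of the index consecutively. This is a simplified model: `dedup_pages` is a hypothetical name, and the last key of the previous page is carried in a local variable, whereas the real code derives it from the page's start position.

```python
from typing import Hashable, List

_NO_KEY = object()  # sentinel: no key emitted yet


def dedup_pages(pages: List[List[Hashable]]) -> List[List[Hashable]]:
    """Drop consecutive duplicate keys, including ones split across a page boundary."""
    last = _NO_KEY
    result = []
    for page in pages:
        kept = []
        for key in page:
            if key != last:
                kept.append(key)
            last = key  # remembered across pages, not reset per page
        result.append(kept)
    return result
```

Resetting `last` at the start of every page reproduces the old bug: a key ending one page and starting the next would be returned twice.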
If a map has several keys with the same value, then the "values(m)" index
must remember all of them as matching the same row - because later we may
remove one of these keys from the map but the row would still need to
match the value because of the remaining keys.
We already had a test (test_index_map_values) verifying that although the
same row appears more than once for this value, when we search for this value
the result only returns the row once. Under the hood, Scylla does find the
same value multiple times, but then eliminates the duplicate matched row
and returns it only once.
But there is a complication, that this de-duplication does not easily
span *paging*. So in this patch we add a test that checks that paging
does not cause the same row to be returned more than once.
Unfortunately, this test currently fails on Scylla so it is marked "xfail".
It passes on Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The structure "bytes_with_action" was very hard to understand because of
its mysterious and general-sounding name, and no comments.
In this patch I add a large comment explaining its purpose, and rename
it to a more suitable name, view_key_and_action, which suggests that
each such object is about one view key (where to add a view row), and
an additional "action" that we need to take beyond adding the view row.
This is the best I can do to make this code easier to understand without
completely reorganizing it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Prevent a user from creating a secondary index on a collection column if
the cluster has any nodes which don't support this feature. Such nodes
will not be able to correctly handle requests related to this index,
so better not allow creating one.
Attempting to create an index on a collection before the entire cluster
supports this feature will result in the error:
Indexing of collection columns not supported by some older nodes
in this cluster. Please upgrade them.
Tested by manually disabling this feature in feature_service.cc and
seeing this error message during collection indexing test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In issue #9013, we noticed that if a value larger than 64 KB is indexed,
the write fails in a bad way, and we fixed it. But the test we wrote
when fixing that issue already suggested that something was still wrong:
Cassandra failed the write cleanly, with an InvalidRequest, while Scylla
failed with a mysterious WriteFailure (with a relevant error message
only in the log).
This patch adds several xfailing tests which demonstrate what's still
wrong. This is also summarized in issue #8627:
1. A write of an oversized value to an indexed column returns the wrong
error message.
2. The same problem also exists when indexing a collection, and the indexed
key or value is oversized.
3. The situation is even less pleasant when adding an index to a table
with pre-existing data and an oversized value. In this case, the
view building will fail on the bad row, and never finish.
4. We have exactly the same bugs not just with indexes but also with
materialized views. Interestingly, Cassandra has similar bugs in
materialized views as well (but not in the secondary index case,
where Cassandra does behave as expected).
Refs #8627.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Unfortunately, we encode the "target" of a secondary index in one of
three ways:
1. It can be just a column name
2. It can be a string like keys(colname) - for the new type of
collection indexes introduced in this series.
3. It can be a JSON map ({ ... }). This form is used for local indexes.
The code parsing this target - target_parser::parse() - must not
confuse these different formats. Before this patch, if the column name
contains special characters like braces or parentheses (this is allowed
in CQL syntax, via quoting), we could confuse cases 1, 2, and 3: A column
named "keys(colname)" will be confused for case 2, and a column named
"{123}" will be confused with case 3.
This problem can break indexing of some specially-crafted column names -
as reproduced by test_secondary_index.py::test_index_quoted_names.
The solution adopted in this patch is that the column name in case 1
should be escaped somehow so it cannot possibly be confused with either
case 2 or 3. The way we chose is to convert the column name to CQL (with
column_definition::as_cql_name()). In other words, if the column name
contains non-alphanumeric characters, it is wrapped in quotes and also
quotes are doubled, as in CQL. The result of this can't be confused
with case 2 or 3, neither of which may begin with a quote.
This escaping is not the minimal we could have done, but incidentally it
is exactly what Cassandra does as well, so I used it as well.
This change is *mostly* backward compatible: Already-existing indexes will
still have unescaped column names stored for their "target" string,
and the unescaping code will see they are not wrapped in quotes, and
not change them. Backward compatibility will only fail on existing indexes
on columns whose names begin and end with the quote character - but this
case is extremely unlikely.
This patch illustrates how un-ideal our index "target" encoding is,
but isn't what made it un-ideal. We should not have used three different
formats for the index target - the third representation (JSON) should
have sufficed. However, the two other representations are identical
to Cassandra's, so using them when we can has its compatibility
advantages.
The patch makes test_secondary_index.py::test_index_quoted_names pass.
Fixes#10707.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
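The escaping scheme described above can be sketched like this. The function names are hypothetical, and the rule for "needs no quoting" (a lowercase letter followed by lowercase letters, digits or underscores) is an assumption about what counts as a plain CQL identifier. The actual logic is in `column_definition::as_cql_name()` and `target_parser`.

```python
import re

_PLAIN_IDENTIFIER = re.compile(r"[a-z][a-z0-9_]*")  # assumed unquoted-name rule


def escape_column_target(name: str) -> str:
    """Encode a column name so it cannot look like keys(...) or a JSON map."""
    if _PLAIN_IDENTIFIER.fullmatch(name):
        return name
    # Wrap in double quotes and double any embedded quote, as in CQL.
    return '"' + name.replace('"', '""') + '"'


def unescape_column_target(target: str) -> str:
    """Decode a stored target; unquoted (pre-existing) targets pass through unchanged."""
    if len(target) >= 2 and target.startswith('"') and target.endswith('"'):
        return target[1:-1].replace('""', '"')
    return target
```

Because an escaped name always starts with a quote, it can never start with `keys(` or `{`, which is what keeps case 1 apart from cases 2 and 3.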
Before this patch, trying to create an index on entries(x) where x is
not a map results in an error message:
Cannot create index on index_keys_and_values of column x
The string "index_keys_and_values" is strange - Cassandra prints the
easier to understand string "entries()" - which better corresponds to
what the user actually did.
It turns out that this string "index_keys_and_values" comes from an
elaborate set of variables and functions spanning multiple source files,
used to convert our internal target_type variable into such a string.
But although this code was called "index_option" and sounded very
important, it was actually used just for one thing - error messages!
So in this patch we drop the entire "index_option" abstraction,
replacing it by a static trivial function defined exactly where
it's used (create_index_statement.cc), which prints a target type.
While at it, we print "entries()" instead of "index_keys_and_values" ;-)
After this patch, the
test_secondary_index.py::test_index_collection_wrong_type
finally passes (the previous patch fixed the default table names it
assumes, and this patch fixes the expected error messages), so its
"xfail" tag is removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When creating an index "CREATE INDEX ON tbl(keys(m))", the default name
of the index should be tbl_m_idx - with just "m". The current code
incorrectly used the default name tbl_m_keys_idx, so this patch adds
a test (which passes on Cassandra, and after this patch also on Scylla)
and fixes the default name.
It turns out that the default index name was based on a mysterious
index_target::as_string(), which printed the target "keys(m)" as
"m_keys" without explaining why it was so. This method was actually
used only in three places, and all of them wanted just the column
name, without the "_keys" suffix! So in this patch we rename the
mysterious as_string() to column_name(), and use this function instead.
Now that the default index name uses column_name() and gets just
column_name(), the correct default index name is generated, and the
test passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
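The naming rule boils down to using only the column name of the target, whatever the index kind. A tiny sketch, with `IndexTarget` and `default_index_name` as hypothetical stand-ins for the C++ `index_target` and its `column_name()` method:

```python
from typing import NamedTuple


class IndexTarget(NamedTuple):
    kind: str    # "keys", "values", "entries", or "" for a plain column
    column: str


def default_index_name(table: str, target: IndexTarget) -> str:
    """Default name ignores the target kind and uses just the column name."""
    return f"{table}_{target.column}_idx"
```

So `CREATE INDEX ON tbl(keys(m))` gets the default name `tbl_m_idx`, matching Cassandra, rather than `tbl_m_keys_idx`.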
After the previous patches implemented collection indexing, several
tests in test/cql-pytest/test_secondary_index.py that were marked
with "xfail" started to pass - so here we remove the xfail.
Only three collection indexing tests continue to xfail:
test_secondary_index.py::test_index_collection_wrong_type
test_secondary_index.py::test_index_quoted_names (#10707)
test_secondary_index.py::test_local_secondary_index_on_collection (#10713)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
collection fails.
Collection indexing is being tracked by #2962. Global secondary index
over collection is enabled by #10123. Leave this test to track this
behaviour.
Related issue: #10713
ported from Cassandra is expected to fail, since Scylla assumes that
comparison with null doesn't throw error, just evaluates to false. Since
it's not a bug, but expected behavior from the perspective of Scylla, we
don't mark it as xfail.
When the secondary-index code builds a materialized view on column x, it
adds "x IS NOT NULL" to the where-clause of the view, as required.
However, when we index a collection column, we index individual pieces
of the collection (keys, values), not the entire collection, so checking
if the entire collection is null does not make sense. Moreover, for a
collection column x, "x IS NOT NULL" currently doesn't work and throws
errors when evaluating that expression when data is written to the table.
The solution used in this patch is to simply avoid adding the "x IS NOT
NULL" when creating the materialized view for a collection index.
Everything works just fine without it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
don't emit duplicate rows
The index on collection values is special in a way, as its clustering key contains not only the
base primary key, but also a column that holds the keys of the cells in the collection, which
makes it possible to distinguish cells with different keys but the same value.
This has an unwanted consequence: it's possible to receive two identical base table primary
keys from indexed_table_select_statement::find_index_clustering_rows. Thankfully, the duplicate
primary keys are guaranteed to occur consecutively.
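Because the duplicates are adjacent, they can be dropped in a single pass without remembering every key seen so far. A minimal Python sketch of that idea follows; the function name and the tuple representation of primary keys are illustrative assumptions, not the C++ code from this commit:

```python
from itertools import groupby


def drop_consecutive_duplicates(primary_keys):
    """Yield each primary key once, assuming equal keys are always adjacent."""
    # groupby() groups runs of equal neighbours, so one key is emitted per run.
    for key, _run in groupby(primary_keys):
        yield key


# Hypothetical (pk, ck) pairs as they might come back from a values() index.
rows = [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")]
print(list(drop_consecutive_duplicates(rows)))
```

This only works because of the adjacency guarantee; if equal keys could appear far apart, a set of seen keys would be needed instead.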
Previous commit added the ability to use GSI over non-frozen collections in queries,
but only the keys() and values() indexes. This commit adds support for the missing
index type - entries() index.
Signed-off-by: Karol Baryła <karol.baryla@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Previous commits added the possibility of creating GSI on non-frozen collections.
This (and next) commit allow those indexes to actually be used by queries.
This commit enables both keys() and values() indexes, as they are pretty similar.
didn't miss returning from function
GCC doesn't consider switches over enums to be exhaustive. Replace the
bogus return value after a switch where each of the cases returns with
an exception.
For collection indexes, the logic of computing values for each of the
columns needed to change, since a single column might produce more
than one value as a result.
The liveness info from individual cells of the collection impacts the
liveness info of resulting rows. Therefore it is needed to rewrite the
control flow - instead of functions getting a row from get_view_row and
later computing row markers and applying it, they compute these values
by themselves.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In case of secondary indexes, if an index does not contain any column from
the base which makes up for the primary key, then it is assumed that
during update, a change to some cells from the base table cannot cause
that we're dealing with a different row in the view. This however
doesn't take into account the possibility of computed columns which in
fact do depend on some non-primary-key columns. Introduce additional
property of an index,
has_computed_column_depending_on_base_non_primary_key.
depends_on_non_primary_key_column for a column computation is needed to
detect a case where the primary key of a materialized view depends on a
non primary key column from the base table, but at the same time, the view
itself doesn't have non-primary key columns. This is an issue, since
until now it was assumed that no non-primary key columns in view schema
meant that the update cannot change the primary key of the view, and
therefore the update path can be simplified.
Indexes over collections use materialized views. Supposing that we're
dealing with global indexes, and that pk, ck were the partition and
clustering keys of the base table, the schema of the materialized view,
apart from having idx_token (which is used to preserve the order on the
entries in the view), has a computed column coll_value (the name is not
guaranteed to be exactly this) and potentially also
coll_keys_for_values_index, if the index was over collection values.
This is needed, since values in a specific collection need not be
unique.
To summarize, the primary key is as follows:
coll_value, idx_token, pk, ck, coll_keys_for_values_index?
where coll_value is the computed value from the collection, be it a key
from the collection, a value from the collection, or the tuple containing
both.
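The layout above can be sketched in Python as follows. This is only an illustration: the function name and the "keys"/"values"/"entries" labels are assumptions, and the column names are the placeholder names used in this message, which are not guaranteed to match the real ones.

```python
def index_view_primary_key(base_pk, base_ck, index_kind):
    """Return the view's primary key columns for a global collection index."""
    if index_kind not in ("keys", "values", "entries"):
        raise ValueError(f"unknown index kind: {index_kind}")
    columns = ["coll_value", "idx_token", *base_pk, *base_ck]
    if index_kind == "values":
        # Values need not be unique within a collection, so the collection
        # key is appended to tell apart cells that hold the same value.
        columns.append("coll_keys_for_values_index")
    return columns


print(index_view_primary_key(["pk"], ["ck"], "values"))
```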
These functions are relevant for indexes over collections (creating
schema for a materialized view related to the index).
Signed-off-by: Michał Radwański <michal.radwanski@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Allow CQL like this:
CREATE INDEX idx ON table(some_map);
CREATE INDEX idx ON table(KEYS(some_map));
CREATE INDEX idx ON table(VALUES(some_map));
CREATE INDEX idx ON table(ENTRIES(some_map));
CREATE INDEX idx ON table(some_set);
CREATE INDEX idx ON table(VALUES(some_set));
CREATE INDEX idx ON table(some_list);
CREATE INDEX idx ON table(VALUES(some_list));
This is needed to support creating indexes on collections.
The syntax used for creating indexes on collections that is present in
Cassandra is unintuitive from the internal representation point of view.
For instance, index on VALUES(some_set) indexes the set elements, which
in the internal representation are keys of collection. Rewrite the index
target after receiving it, so that the index targets are consistent with
the representation.
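A rough Python sketch of such a rewrite is shown below. Only the set case is taken from this message; the function name, the string labels, and the pass-through behaviour for every other combination are assumptions made for illustration.

```python
def rewrite_index_target(collection_kind, cql_target):
    """Map a CQL index target to the part of the collection indexed internally."""
    if collection_kind == "set" and cql_target == "values":
        # Set elements are stored as collection keys in the internal
        # representation, so VALUES(some_set) really indexes keys.
        return "keys"
    # Assumption: other combinations already match the internal representation.
    return cql_target


print(rewrite_index_target("set", "values"))
```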
This type of column computation will be used for creating updates to
materialized views that are indexes over collections.
This type features additional function, compute_values_with_action,
which depending on an (optional) old row and new row (the update to the
base table) returns multiple bytes_with_action, a vector of pairs
(computed value, some action), where the action signifies whether a
deletion of the row with a specific key is needed, or its creation.
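To illustrate the shape of the result, here is a simplified Python sketch for an index over map keys. It ignores timestamps and liveness, and the names (Action, the dict-based rows) are hypothetical stand-ins rather than the C++ types used in the patch.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    DELETE = "delete"
    CREATE = "create"


def compute_values_with_action(old: Optional[dict], new: Optional[dict]):
    """Return (computed value, action) pairs for a keys() index on a map."""
    old_keys = set(old or {})
    new_keys = set(new or {})
    # Keys that disappeared need their view rows deleted ...
    result = [(key, Action.DELETE) for key in sorted(old_keys - new_keys)]
    # ... and keys that appeared need new view rows created.
    result += [(key, Action.CREATE) for key in sorted(new_keys - old_keys)]
    return result


print(compute_values_with_action({"a": 1, "b": 2}, {"b": 2, "c": 3}))
```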
The compute_value function of column_computation has had previously the
following signature:
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
This is superfluous, since never in the history of Scylla was the last
parameter (row) used in any implementation, and never did it happen
that it returned bytes_opt. The absurdity of this interface can be seen
especially when looking at call sites like the following, where a dummy
empty row was created:
```
token_column.get_computation().compute_value(
*_schema, pkv_linearized, clustering_row(clustering_key_prefix::make_empty()));
```
Brings support of cql syntax `INDEX ON table(VALUES(collection))`, even
though there is still no support for indexes over collections.
Previously, index_target::target_type::values was referring to values of
a regular (non-collection) column. Rename it to `regular_values`.
Fixes #8745.
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:
- The `broadcast_rpc_address` is the address that the drivers are
supposed to connect to. `rpc_address` is the address that the node
binds to - it can be set for example to 0.0.0.0 so that Scylla listens
on all addresses, however this gives no useful information to the
driver.
- The `system.peers` table also has the `rpc_address` column and it
already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.
Fixes: #11201
Closes #11204
* github.com:scylladb/scylladb:
db/system_keyspace: fix indentation after previous patch
db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values.
Instead of from_string()/to_string() the cql parser is used
for creating a value from a string (the same , and to_parsable_string()
is used for converting a value into a string.
A test using a list as an accumulator is added to
cql-pytest/test_uda.py.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes #11250
* github.com:scylladb/scylladb:
cql3: enable collections as UDA accumulators
cql3: extend implementation of to_bytes for raw_value
Fix https://github.com/scylladb/scylla-doc-issues/issues/438
In addition, I've replaced "Scylla" with "ScyllaDB" on that page.
Closes #11281
* github.com:scylladb/scylladb:
doc: replace Scylla with ScyllaDB on the Fault Tolerance page
doc: fis the typo in the note
This patch fixes the test test_scan.py::test_scan_paging_missing_limit
which failed in a Jenkins run once (that we know of).
That test verifies that an Alternator Scan operation *without* an explicit
"Limit" is nevertheless paged: DynamoDB (and also Scylla) wanted this page
size to be 1 MB, but it turns out (see #10327) that because of the details
of how Scylla's scan works, the page size can be larger than 1 MB. How much
larger? I ran this test hundreds of times and never saw it exceed a 3 MB
page - so the test asserted the page must be smaller than 4 MB. But now
in one run - we got to this 4 MB and failed the test.
So in this patch we increase the table to be scanned from 4 MB to 6 MB,
and assert the page size isn't the full 6 MB. The chance that this size will
eventually fail as well should be (famous last words...) very small for
two reasons: First because 6 MB is even higher than the maximum I saw
in practice, and second because empirically I noticed that adding more
data to the table reduces the variance of the page size, so it should
become closer to 1 MB and reduce the chance of it reaching 6 MB.
Refs #10327
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11280
Fix https://github.com/scylladb/scylla-doc-issues/issues/857
Closes #11253
* github.com:scylladb/scylladb:
doc: language improvemens to the Counrers page
doc: fix the external link
doc: clarify the disclaimer about reusing deleted counter column values
Fix https://github.com/scylladb/scylla-doc-issues/issues/867
Plus some language, formatting, and organization improvements.
Closes #11248
* github.com:scylladb/scylladb:
doc: language, formatting, and organization improvements
doc: add a disclaimer about not supporting local counters by SSTableLoader
Replication is a mix of several inputs: tokens and token->node mappings (topology),
the replication strategy, replication strategy parameters. These are all captured
in effective_replication_map.
However, if we use effective_replication_map:s captured at different times in a single
query, then different uses may see different inputs to effective_replication_map.
This series protects against that by capturing an effective_replication_map just
once in a query, and then using it. Furthermore, the captured effective_replication_map
is held until the query completes, so topology code can know when a topology is no
longer in use (although this isn't exploited in this series).
Only the simple read and write paths are covered. Counters and paxos are left for
later.
I don't think the series fixes any bugs - as far as I could tell everything was happening
in the same continuation. But this series ensures it.
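The capture-once pattern described above can be illustrated with a toy Python model. The Keyspace class and run_query function are hypothetical stand-ins and do not mirror the storage_proxy code; the point is only that every step of a query sees the same snapshot.

```python
class Keyspace:
    """Toy keyspace whose replication map is replaced, never mutated in place."""

    def __init__(self, replicas):
        self._erm = tuple(replicas)

    def get_effective_replication_map(self):
        return self._erm

    def update_topology(self, replicas):
        self._erm = tuple(replicas)


def run_query(keyspace, steps):
    # Capture the map once; every step of the query sees the same snapshot,
    # even if the keyspace is updated while the query is running.
    erm = keyspace.get_effective_replication_map()
    return [step(erm) for step in steps]
```

If each step instead called get_effective_replication_map() itself, a topology update between two steps would make them work from different inputs.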
Closes #11259
* github.com:scylladb/scylladb:
storage_proxy: use consistent topology
storage_proxy: use consistent replication map on read path
storage_proxy: use consistent replication map on write path
storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
consistency_level: accept effective_replication_map as parameter, rather than keyspace
consistency_level: be more const when using replication_strategy
Add support for topology changes: add/stop/remove/restart/replace node.
Test simple schema changes when changing topology.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Preparing for topology tests with changing clusters, run before and
after checks per test case.
Change scope of pytest fixtures to function as we need them per test
case.
Add server and client API logic.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add an API via Unix socket to Manager so pytests can query information
about the cluster. Requests are managed by ManagerClient helper class.
The socket is placed inside a unique temporary directory for the
Manager (as a safe temporary socket filename is not possible in Python).
Initial API services are manager up, cluster up, if cluster is dirty,
cql port, configured replicas (RF), and list of host ids.
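One common way to get a safely placed socket is to create a private temporary directory first and put the socket file inside it. The sketch below shows that approach with the standard library; the function name, the "manager-" prefix, and the "api" file name are assumptions for illustration and are not taken from test.py.

```python
import os
import tempfile


def make_manager_socket_path(socket_name="api"):
    """Create a private temporary directory and return a socket path inside it."""
    # mkdtemp() creates the directory securely, accessible only to the
    # creating user, so nobody else can pre-create the socket file in it.
    directory = tempfile.mkdtemp(prefix="manager-")
    return os.path.join(directory, socket_name)


path = make_manager_socket_path()
print(path)
```

The caller is responsible for removing the directory when the Manager shuts down, since mkdtemp() does not clean up after itself.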
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Derive the topology from captured and stable effective_replication_map
instead of getting a fresh topology from storage_proxy, since the
fresh topology may be inconsistent with the running query.
digest_read_resolver did not capture an effective_replication_map, so
that is added.
Capture a replication map just once in
abstract_read_executor::_effective_replication_map_ptr. Although it isn't
used yet, it serves to keep a reference count on topology (for fencing),
and some accesses to topology within reads still remain, which can be
converted to use the member in a later patch.
Capture a replication map just once in
abstract_write_handler::_effective_replication_map_ptr and use it
in all write handlers. A few accesses to get the topology still remain,
they will be fixed up in a later patch.
A keyspace is a mutable object that can change from time to time. An
effective_replication_map captures the state of a keyspace at a point in
time and can therefore be consistent (with care from the caller).
Change consistency_level's functions to accept an effective_replication_map.
This allows the caller to ensure that separate calls use the same
information and are consistent with each other.
Current callers are likely correct since they are called from one
continuation, but it's better to be sure.
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values. For example, a list with string elements:
'a, b', 'c' would be represented as a string 'a, b, c', while now
it is represented as "['a, b', 'c']".
Some types that were parsable are now represented in a different
way. For example, a tuple ('a', null, 0) was represented as
"a:\@:0", and now it is "('a', null, 0)".
Instead of from_string()/to_string() the cql parser is used
for creating a value from a string (the same , and to_parsable_string()
is used for converting a value into a string.
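The separator ambiguity can be reproduced in a few lines of Python. Both functions below are illustrative stand-ins written for this example, not the actual to_string() or to_parsable_string() implementations; the second one follows CQL's rule of doubling single quotes inside string literals.

```python
def naive_to_string(values):
    """Join elements with a separator and no quoting (the old, lossy format)."""
    return ", ".join(values)


def parsable_to_string(values):
    """Render a list of strings as a CQL list literal that can be parsed back."""
    quoted = ("'" + value.replace("'", "''") + "'" for value in values)
    return "[" + ", ".join(quoted) + "]"


two_elements = ["a, b", "c"]
three_elements = ["a", "b", "c"]

# Both lists collapse to the same string, so the original cannot be recovered.
print(naive_to_string(two_elements) == naive_to_string(three_elements))
# The quoted form keeps the element boundaries visible.
print(parsable_to_string(two_elements))
```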
A test using a list as an accumulator is added to
cql-pytest/test_uda.py.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
only once and reuse the result for all tokens.
Note that ideally the replication_map could contain only
a single token range for this case, but that doesn't seem to work yet.
Add `maybe_yield()` calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.
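The optimization can be sketched in Python with asyncio. This is an analogy rather than the Seastar code: `await asyncio.sleep(0)` stands in for maybe_yield(), and the function name, the `uniform` flag, and the yield interval of 128 iterations are assumptions chosen for the example.

```python
import asyncio


async def build_replication_map(tokens, calculate_natural_endpoints, uniform):
    """Map every token to its endpoints, computing them once if they are uniform."""
    replication_map = {}
    cached = None
    for i, token in enumerate(tokens):
        if uniform:
            if cached is None:
                # Same answer for every token, so compute it a single time.
                cached = calculate_natural_endpoints(token)
            endpoints = cached
        else:
            endpoints = calculate_natural_endpoints(token)
        # Copying the (possibly long) endpoint list is the remaining per-token cost.
        replication_map[token] = list(endpoints)
        if i % 128 == 127:
            # Give other tasks a chance to run during the long loop.
            await asyncio.sleep(0)
    return replication_map
```

A caller would run it with, for example, `asyncio.run(build_replication_map(tokens, strategy_fn, uniform=True))`, where `strategy_fn` is whatever function computes the endpoints for a token.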
Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370
about similar patch by Geoffrey Beausire:
994c6ecf3c
> Yep. That dropped our startup from 3000+ seconds to about 40.
Fixes #10337
Closes #11277
* github.com:scylladb/scylladb:
abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
abstract_replication_strategy: add has_uniform_natural_endpoints
During shutdown it is normal to get abort_requested_exception exception
from a state machine "apply" method. Do not rethrow it as
state_machine_error, just abort an applier loop with an info message.
Before the patch if an RPC connection was established already then the
close error was reported by the RPC layer and then duplicated by
raft_rpc layer. If a connection cannot be established because the remote
node is already dead RPC does not report the error since we decided that
in that case gossiper and failure detector messages can be used to
detect the dead node case and there is no reason to pollute the logs
with recurring errors. This aligns raft behaviour with what we already
have in storage_proxy that does not report closed errors as well.
If the leader was unavailable during read_barrier,
closed_error occurs, which was not handled in any way
and eventually reached the client. This patch adds retries in this case.
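A retry loop of this kind might look like the following Python sketch. The ClosedError class, the function name, and the fixed attempt limit are assumptions for illustration; the message above does not say how many retries the patch performs or how they are spaced.

```python
class ClosedError(Exception):
    """Stand-in for the RPC closed_error raised when the leader is unreachable."""


def read_barrier_with_retries(read_barrier, max_attempts=3):
    """Call read_barrier(), retrying when the connection to the leader is closed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return read_barrier()
        except ClosedError:
            if attempt == max_attempts:
                # Out of attempts: let the error reach the caller.
                raise
```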
Fix: scylladb#11262
Refs: #11278
Closes #11263
This patch reduces the number of metrics ScyllaDB generates.
Motivation: The combination of per-shard with per-scheduling group
generates a lot of metrics. When combined with histograms, which require
many metrics, the problem becomes even bigger.
The two tools we are going to use:
1. Replace per-shard histograms with summaries
2. Do not report unused metrics.
The storage_proxy stats holds information for the API and the metrics
layer. We replaced timed_rate_moving_average_and_histogram and
time_estimated_histogram with the unified
timed_rate_moving_average_summary_and_histogram which gives us an option
to report per-shard summaries instead of histograms.
All the counters, histograms, and summaries were marked as
skip_when_empty.
The API was modified to use
timed_rate_moving_average_summary_and_histogram.
Closes #11173
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
only once and reuse the result for all tokens.
Note that ideally the replication_map could contain only
a single token range for this case, but that doesn't seem to work yet.
Add maybe_yield() calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.
Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370
about similar patch by Geoffrey Beausire:
994c6ecf3c
> Yep. That dropped our startup from 3000+ seconds to about 40.
Fixes #10337
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So that using calculate_natural_endpoints can be optimized
for strategies that return the same endpoints for all tokens,
namely everywhere_replication_strategy and local_strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refs #11237
Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at least don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
This reverts commit 8e892426e2 and fixes
the code in a different way:
That commit moved the scylla_inject_error function from
test/alternator/util.py to test/cql-pytest/util.py and renamed
test/alternator/util.py. I found the rename confusing and unnecessary.
Moreover, the moved function isn't even usable today by the test suite
that includes it, cql-pytest, because it lacks the "rest_api" fixture :-)
so test/cql-pytest/util.py wasn't the right place for it anyway.
test/rest_api/rest_util.py could have been a good place for this function,
but there is another complication: Although the Alternator and rest_api
tests both had a "rest_api" fixture, it has a different type, which led
to the code in rest_api which used the moved function to have to jump
through hoops to call it instead of just passing "rest_api".
I think the best solution is to revert the above commit, and duplicate
the short scylla_inject_error() function. The duplication isn't an
exact copy - the test/rest_api/rest_util.py version now accepts the
"rest_api" fixture instead of the URL that the Alternator version used.
In the future we can remove some of this duplication by having some
shared "library" code but we should do it carefully and starting with
agreeing on the basic fixtures like "rest_api" and "cql", without that
it's not useful to share small functions that operate on them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #11275
When called with a null_value or an unset_value,
raw_value::to_bytes() threw an std::get error for
wrong variant. This patch adds a description for
the errors thrown, and adds a to_bytes_opt() method
that instead of throwing returns a std::nullopt.
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as a short one, even if the page is currently empty. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.
The upgrade sanity test was conducted as follows:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started with `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.
Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op, 12.0 tasks/op, 43007 insns/op, 0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op, 12.0 tasks/op, 43181 insns/op, 0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instructions/op.
Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672
Closes #11053
* github.com:scylladb/scylladb:
test/cql-pytest: add test for query tombstone page limit
query-result-writer: stop when tombstone-limit is reached
service/pager: prepare for empty pages
service/storage_proxy: set smallest continue pos as query's continue pos
service/storage_proxy: propagate last position on digest reads
query: result_merger::get() don't reset last-pos on short-reads and last pages
query: add tombstone-limit to read-command
service/storage_proxy: add get_tombstone_limit()
query: add tombstone_limit type
db/config: add config item for query tombstone limit
gms: add cluster feature for empty replica pages
tree: don't use query::read_command's IDL constructor
Fix https://github.com/scylladb/scylla-doc-issues/issues/842
This PR changes the default for the `overprovisioned` option from `disabled` to `enabled`, according to https://github.com/scylladb/scylla-doc-issues/issues/842.
In addition, I've used this opportunity to replace "Scylla" with "ScyllaDB" on the updated page.
Closes #11256
* github.com:scylladb/scylladb:
doc: replace Scylla with ScyllaDB in the product name
doc: change the default for the overprovisioned option
Give cluster control to pytests. While at it, add the missing graceful stop and add-server operations to ScyllaCluster.
Clusters can be marked dirty but they are not recycled yet. This will be done in a later series.
Closes #11219
* github.com:scylladb/scylladb:
test.py: ScyllaCluster add_server() mark dirty
test.py: ScyllaCluster add server management
test.py: improve seeds for new servers
test.py: Topology tests and Manager for Scylla clusters
test.py: rename scylla_server to scylla_cluster
test.py: function for python driver connection
test.py: ScyllaCluster add_server helper
test.py: shutdown control connection during graceful shutdown
test.py: configurable authenticator and authorizer
test.py: ScyllaServer stop gracefully
test.py: FIXME for bad cluster log handling logic
Currently messaging_service.o takes the longest of all core objects to
compile. For a full build of build/release/scylla, with current ninja
scheduling, on a 32-hyperthread machine, the last ~16% of the total
build time is spent just waiting on messaging_service.o to finish
compiling.
Moving the file to the top of the list makes ninja start its compilation
early and gets rid of that single-threaded tail, improving the total build
time.
Closes #11255
Fixes #11184
Fixes #11237
In the previous (broken) fix for #11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.
This effectively prevents us from creating segments if we have tight
limits. Since we nowadays do quite a bit of inserts _before_ commitlog
replay (system.local, but...) we can end up in a situation where we
deadlock at startup because we cannot get to the actual replay that will
eventually free things.
Another, not thought through, consequence is that we add a single
footprint to _all_ commitlog shard instances - even though only
shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.
Simplest fix is to add the footprint in delete call instead. This will
lock out segment creation until delete call is done, but this is fast.
Also ensures that only replay shard is involved.
Check that the replica returns empty pages as expected, when a large
tombstone prefix/span is present. Large = larger than the configured
query_tombstone_limit (using a tiny value of 10 in the test to avoid
having to write many tombstones).
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large span of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
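The behaviour described above can be modelled with a small Python sketch. It is a toy model, not the query-result-writer code: the function and exception names, the (position, kind) item representation, and the returned triple are all assumptions made for illustration.

```python
class TombstoneLimitExceeded(Exception):
    """Raised for unpaged reads that hit the tombstone limit."""


def build_page(items, row_limit, tombstone_limit, paged=True):
    """Build one page from (position, kind) items, kind being 'row' or 'tombstone'.

    Returns (rows, last_position, is_short).
    """
    rows = []
    tombstones = 0
    last_position = None
    for position, kind in items:
        last_position = position
        if kind == "row":
            rows.append(position)
            if len(rows) == row_limit:
                return rows, last_position, False
        else:
            tombstones += 1
            # A limit of 0 disables the check (previous behaviour).
            if tombstone_limit and tombstones >= tombstone_limit:
                if not paged:
                    raise TombstoneLimitExceeded(f"hit {tombstones} tombstones")
                # Cut the page here, possibly empty, and report where we stopped.
                return rows, last_position, True
    return rows, last_position, False
```

The returned last position is what lets the next page resume after the tombstones already processed, even when the page carries no rows.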
The pager currently assumes that an empty page means the query is
exhausted. Lift this assumption, as we will soon have empty short
pages.
Paging with filtering also needs to use the replica-provided
last-position when the page is empty.
We expect each replica to stop at exactly the same position when the
digests match. Soon however, if replicas have a lot of tombstones, some
may stop earlier than the others. As long as all digests match, this is
fine but we need to make sure we continue from the smallest such
position on the next page.
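In other words, the coordinator has to resume from the minimum of the positions reported by the replicas, otherwise it would skip data that the slowest replica has not covered yet. A minimal Python sketch, assuming positions are comparable tuples and that a replica may report no position at all (both are assumptions for this example):

```python
def next_page_start(replica_last_positions):
    """Pick the position to continue from: the smallest one any replica reported."""
    known = [pos for pos in replica_last_positions if pos is not None]
    return min(known) if known else None


# Hypothetical (partition, clustering) positions from three replicas.
print(next_page_start([(1, 50), (1, 20), (1, 35)]))
```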
We want to transmit the last position as determined by the replica on
both result and digest reads. Result reads already do that via the
query::result, but digest reads don't yet as they don't return the full
query::result structure, just the digest field from it. Add the last
position to the digest read's return value and collect these in the
digest resolver, along with the returned digests.
When merging multiple query-results, we use the last-position of the
last result in the combined one as the combined result's last position.
This only works however if said last result was included fully.
Otherwise we have to discard the last-position included with the result
and the pager will use the position of the last row in the combined
result as the last position.
The commit introducing the above logic mistakenly discarded the last
position when the result is a short read or a page is not full. This is
not necessary and even harmful as it can result in an empty combined
result being delivered to the pager, without a last-position.
When changing topology, tests will add servers. Make add_server mark the
cluster dirty. But mark the cluster as not dirty after calling
add_server when installing the cluster.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Preparing for topology changes, implement the primitives for managing
ScyllaServers in ScyllaCluster. The states are started, stopped, and
removed. Started servers can be stopped or restarted. Stopped servers
can be started. Stopped servers can be removed (destroyed).
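The allowed transitions can be written down as a small table. The following Python sketch is only a model of the states listed above; the names TRANSITIONS and apply_operation are hypothetical and are not part of test.py.

```python
# (current state, operation) -> next state, as described in the text above.
TRANSITIONS = {
    ("started", "stop"): "stopped",
    ("started", "restart"): "started",
    ("stopped", "start"): "started",
    ("stopped", "remove"): "removed",
}


def apply_operation(state, operation):
    """Return the next server state, or raise if the operation is not allowed."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"cannot {operation} a {state} server") from None


print(apply_operation("started", "stop"))
```

Note that a started server cannot be removed directly in this model; it has to be stopped first.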
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Instead of only using last started server as seed, use all started
servers as seed for new servers.
This also avoids tracking last server's state.
Pass empty list instead of None.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Preparing to cycle clusters modified (dirty) and use multiple clusters
per topology pytest, introduce Topology tests and Manager class to
handle clusters.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Isolate python driver connection on its own function. Preparing for
harness client fixture to handle the connection.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For scylla servers, keep default PasswordAuthenticator and
CassandraAuthorizer but allow this to be configurable per test suite.
Use AllowAll* for topology test suite.
Disabling authentication avoids complications later for topology tests
as system_auth keyspace starts with RF=1 and tests take down nodes. The
keyspace would need to change RF and run repair. Using AllowAll avoids
this problem altogether.
A different cql fixture is created without auth for topology tests.
Topology tests require servers without auth from scylla.yaml conf.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The code in test.py using a ScyllaCluster is getting a server id and
taking logs from only the first server.
If there is a failure in another server it's not reported properly.
And CQL connection will go only to the first server.
Also, it might be better to have ScyllaCluster to handle these matters
and be more opaque.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.
To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold an effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).
Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.
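The effect of the predicate can be sketched as follows. This is a minimal illustration, not the code from the series: `strategy`, `build_replication_map`, and the `static_endpoints` and `calculate` members are hypothetical stand-ins for `has_static_natural_endpoints` and `calculate_natural_endpoints`.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

using token = std::int64_t;
using endpoint_set = std::vector<std::string>;

// Hypothetical stand-in for a replication strategy.
struct strategy {
    // Plays the role of has_static_natural_endpoints().
    bool static_endpoints;
    // Plays the role of calculate_natural_endpoints().
    std::function<endpoint_set(token)> calculate;
};

// Builds a token -> endpoints map. When the strategy reports static
// endpoints, calculate() is called once and the result is reused for
// every token; otherwise it is called per token.
inline std::map<token, endpoint_set>
build_replication_map(const strategy& s, const std::vector<token>& tokens) {
    std::map<token, endpoint_set> result;
    if (tokens.empty()) {
        return result;
    }
    if (s.static_endpoints) {
        const endpoint_set shared = s.calculate(tokens.front());
        for (token t : tokens) {
            result.emplace(t, shared);
        }
    } else {
        for (token t : tokens) {
            result.emplace(t, s.calculate(t));
        }
    }
    return result;
}
```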
Refs #11005
Doesn't fix the issue, as the large allocation still remains until we make dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we also require the ability to push to the front when unwrapping).
Closes#11009
* github.com:scylladb/scylladb:
effective_replication_map: make get_range_addresses asynchronous
range_streamer: add_ranges and friends: get erm as param
storage_service: get_new_source_ranges: get erm as param
storage_service: get_changed_ranges_for_leaving: get erm as param
storage_service: get_ranges_for_endpoint: get erm as param
repair: use get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces_erms
database: add get_non_local_strategy_keyspaces
storage_service: coroutinize update_pending_ranges
effective_replication_map: add get_replication_strategy
effective_replication_map: get_range_addresses: use the precalculated replication_map
abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
abstract_replication_strategy: reindent
utils: sequenced_set: expose set and `contains` method
abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
utils: sequenced_set: templatize VectorType
utils: sanitize sequenced_set
utils: sequenced_set: delete mutable get_vector method
Ubuntu 22.04 is supported by both ScyllaDB Open Source 5.0 and Enterprise 2022.1.
Closes#11227
* github.com:scylladb/scylladb:
doc: add the redirects from Ubuntu version specific to version generic pages
doc: remove version-specific content for Ubuntu and add the generic page to the toctree
doc: rename the file to include Ubuntu
doc: remove the version number from the document and add the link to Supported Versions
doc: add a generic page for Ubuntu
doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1
1) Start node1,2,3
2) Stop node3
3) Run nodetool removenode $host_id_of_node3
4) Restart node3
Step 4 is wrong and not allowed. If it happens it will bring back node3
to the cluster.
This patch adds a check during node restart to detect such an operational
error and reject the restart.
With this patch, we would see the following in step 4.
```
init - Startup failed: std::runtime_error (The node 127.0.0.3 with
host_id fa7e500a-8617-4de4-8efd-a0e177218ee8 is removed from the
cluster. Can not restart the removed node to join the cluster again!)
```
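The shape of such a startup check can be sketched as below. This is only an illustration under assumptions: `removed_hosts` and `check_not_removed` are hypothetical names, and how the real patch learns which host IDs were removed is not described here.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical: host IDs the cluster has recorded as removed.
using removed_hosts = std::set<std::string>;

// Throws if the restarting node's host_id is in the removed set,
// mirroring the error text shown above.
inline void check_not_removed(const removed_hosts& removed,
                              const std::string& address,
                              const std::string& host_id) {
    if (removed.count(host_id) != 0) {
        throw std::runtime_error(
            "The node " + address + " with host_id " + host_id +
            " is removed from the cluster. Can not restart the removed node"
            " to join the cluster again!");
    }
}
```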
Refs #11217
Closes#11244
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
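The core of the problem can be modelled with a plain ordered set. The sketch below is a simplified model, not the row cache code: positions are integers (20 stands for pos=2 and 21 for after(2)), and `advance_to_buggy` and `advance_to_checked` are hypothetical names. The checked variant only illustrates the missing condition; it is not necessarily the fix applied in the patch.

```cpp
#include <set>

using position = int;
using row_set = std::set<position>;

// True when no entry lies strictly after `a` and strictly before `b`.
// Note that `a` itself is never examined.
inline bool no_row_between(const row_set& rows, position a, position b) {
    auto it = rows.upper_bound(a);
    return it == rows.end() || *it >= b;
}

// Buggy variant: reports "cursor has not moved" based only on the
// open interval, so it cannot notice that `pos` itself was evicted.
inline bool advance_to_buggy(const row_set& rows, position pos) {
    auto it = rows.lower_bound(pos);
    if (it == rows.end()) {
        return false;
    }
    return no_row_between(rows, pos, *it);
}

// Checked variant: the cursor is unchanged only if an entry at `pos`
// is still present.
inline bool advance_to_checked(const row_set& rows, position pos) {
    auto it = rows.lower_bound(pos);
    return it != rows.end() && *it == pos;
}
```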
Fixes#11239
Closes#11240
* github.com:scylladb/scylladb:
test: mvcc: Fix illegal use of maybe_refresh()
tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
tests: row_cache_test: Introduce one_shot mode to throttle
row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
We can use std::in_place_type<> to avoid constructing op before
calling emplace_back(). As a result, we can avoid reserving space. The
reserving was there to avoid the need to roll back in case
emplace_back() throws.
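A minimal sketch of the technique, assuming the container holds a `std::variant`. The types `erase_op`, `insert_op`, `undo_op` and the function `record_insert` are hypothetical and only illustrate constructing the alternative in place:

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical operation types stored in the log.
struct erase_op {
    int index;
    explicit erase_op(int i) : index(i) {}
};
struct insert_op {
    int index;
    std::string value;
    insert_op(int i, std::string v) : index(i), value(std::move(v)) {}
};
using undo_op = std::variant<erase_op, insert_op>;

// Constructs the insert_op alternative directly in the vector's storage,
// forwarding the arguments through std::in_place_type, instead of building
// a temporary undo_op first and then moving it in.
inline void record_insert(std::vector<undo_op>& log, int index, std::string value) {
    log.emplace_back(std::in_place_type<insert_op>, index, std::move(value));
}
```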
Kudos to Kamil for suggesting this.
Closes#11238
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.
The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.
Closes#11243
Avoid about log2(256)=8 reallocations when pushing partition ranges to
be fetched. Additionally, avoid copying each range into the ranges
container. After being moved from, current_range no longer holds the
last range, but it is still engaged at the end of the loop, allowing
the next iteration to happen as expected.
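Both points can be illustrated with standard containers. This is a sketch under assumptions: `range` is a hypothetical stand-in for a partition range, and `collect_ranges` is not the function touched by the patch.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using range = std::string;  // hypothetical stand-in for a partition range

// Reserves once up front, then moves each range into the container.
inline std::vector<range> collect_ranges(std::size_t count) {
    std::vector<range> ranges;
    ranges.reserve(count);  // single allocation instead of repeated growth
    std::optional<range> current_range = range("range-0");
    std::size_t i = 0;
    while (current_range && i < count) {
        // Moves the value out; the optional stays engaged (moved-from).
        ranges.push_back(std::move(*current_range));
        ++i;
        *current_range = "range-" + std::to_string(i);
    }
    return ranges;
}

// Moving from the contained value does not disengage the optional.
inline bool stays_engaged_after_move() {
    std::optional<range> current = range("r");
    range taken = std::move(*current);
    return current.has_value() && taken == "r";
}
```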
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11242
To be used by coordinator side code to determine the correct tombstone
limit to pass to read-command (tombstone limit field added in the next
commit). When this limit is non-zero, the replica will start cutting
pages after the tombstone limit is surpassed.
This getter works similarly to `get_max_result_size()`: if the cluster
feature for empty replica pages is set, it will return the value
configured via db::config::query_tombstone_limit. System queries always
use a limit of 0 (unlimited tombstones).
This will be the value used to break pages, after processing the
specified amount of tombstones. The page will be cut even if empty.
We could maybe use the already existing tombstone_{warn,fail}_threshold
instead and use them as a soft/hard limit pair, like we did with page
sizes.
It is not type safe: has multiple limits passed to it as raw ints, as
well as other types that ints implicitly convert to. Furthermore the row
limit is passed in two separate fields (lower 32 bits and upper 32
bits). All this makes this constructor a minefield for humans to use. We
have had a safer constructor for some time but some users of the old one
remain. Move them to the safe one.
maybe_refresh() can only be called if the cursor is pointing at a row.
The code was calling it before the cursor was advanced, and was
thus relying on implementation detail.
Scenario:
cache = [
row(pos=2, continuous=false),
row(pos=after(2), dummy=true)
]
Scanning read starts, starts populating [-inf, before(2)] from sstables.
row(pos=2) is evicted.
cache = [
row(pos=after(2), dummy=true)
]
Scanning read finishes reading from sstables.
Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.
The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).
Fixes#11239
Rather than getting it in the callee, let the caller
(e.g. storage_service)
hold the erm and pass it down to potentially multiple
async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than getting it in the callee, let the caller
hold the erm and pass it down to potentially multiple
async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than getting it in the callee, let the caller
hold the erm and pass it down to potentially multiple
async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Let its caller pass the effective_replication_map ptr
so we can get it at the top level and keep it alive
(and coherent) through multiple asynchronous calls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use get_non_local_strategy_keyspaces_erms for getting
a coherent set of keyspace names and their respective
effective replication strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for getting a coherent set of all keyspaces
with non-local replication strategy and their respective
effective_replication_map.
As an example, use it in this patch in
storage_service::update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For node operations, we currently call get_non_system_keyspaces
but really want to work on all keyspaces that have non-local
replication strategy as they are replicated on other nodes.
Reflect that in the replica::database function name.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it in storage_service::get_changed_ranges_for_leaving.
A following patch will pass the e_r_m to
storage_service::get_changed_ranges_for_leaving, rather than
getting it there.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to call get_natural_endpoints for every token
in sorted_tokens order, since we can just get the precalculated
per-token endpoints already in the _replication_map member.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reduce large allocations and reactor stalls seen in #11005
by open coding `get_address_ranges` and using std::vector::insert
to efficiently append the ranges returned by `get_primary_ranges_for`
onto the returned token_range_vector in contrast to building
an unordered_multimap<inet_address, dht::token_range> first in
`get_address_ranges` and traversing it and adding one token_range
at a time.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use that in sites using the endpoint set
returned by abstract_replication_strategy::calculate_natural_endpoints.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And templatize its Vector type so it can be used
with a small_vector for inet_address_vector_replica_set.
Mark the methods const/noexcept as needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is dangerous to use since modifying the
sequenced_set vector will make it go out of sync
with the associated unordered_set member, making
the object unusable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When starting the `Build` job, the `x86` and `arm` builds can
start on different dates, causing the whole process to fail.
As suggested by @avikivity, add a date-stamp parameter and pass
it through the downstream jobs to get one release for each job.
Ref: scylladb/scylla-pkg#3008
Closes#11234
We would like to define more distinct types
that are currently defined as aliases to utils::UUID to identify
resources in the system, like table id and schema
version id.
As with counter_id, the motivation is to restrict
the usage of the distinct types so they can be used
(assigned, compared, etc.) only with objects of
the same type. Using with a generic UUID will
then require explicit conversion, that we want to
expose.
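The idea of a tag-parameterized strong type can be sketched as follows. This is a simplified illustration, not the utils::tagged_uuid implementation: `raw_uuid` is a hypothetical stand-in for utils::UUID, and only the members needed for the example are shown.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a generic UUID value.
struct raw_uuid {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    bool operator==(const raw_uuid& o) const noexcept {
        return msb == o.msb && lsb == o.lsb;
    }
};

// Each Tag yields a distinct type, so values of different kinds
// cannot be assigned to or compared with one another by accident.
template <typename Tag>
class tagged_uuid {
    raw_uuid _id;
public:
    tagged_uuid() = default;
    // Explicit: converting from a generic UUID must be spelled out.
    explicit tagged_uuid(raw_uuid id) noexcept : _id(id) {}
    // Explicit accessor back to the generic UUID.
    const raw_uuid& uuid() const noexcept { return _id; }
    bool operator==(const tagged_uuid& o) const noexcept { return _id == o._id; }
};

using table_id = tagged_uuid<struct table_id_tag>;
using table_schema_version = tagged_uuid<struct table_schema_version_tag>;

// No implicit conversion from the generic UUID, and none between kinds.
static_assert(!std::is_convertible_v<raw_uuid, table_id>);
static_assert(!std::is_convertible_v<table_id, table_schema_version>);
static_assert(!std::is_constructible_v<table_id, table_schema_version>);
```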
This series starts with cleaning up the idl header definition
by adding support for `import` and `include` statements in the idl-compiler.
These allow the idl header to become self-sufficient
and then remove manually-added includes from source files.
The latter usually need only the top level idl header
and it, in turn, should include other headers if it depends on them.
Then, a UUID_class template was defined as shared boilerplate
for the various uuid classes. First, we convert counter_id to use it,
rather than mimicking utils::UUID on its own.
On top of utils::UUID_class<T>, we define table_id,
table_schema_version, and query_id.
Following up on this series, we should define more commonly used
types like: host_id, streaming_plan_id, paxos_ballot_id.
Fixes#11207
Closes#11220
* github.com:scylladb/scylladb:
query-request, everywhere: define and use query_id as a strong type
schema, everywhere: define and use table_schema_version as a strong type
schema, everywhere: define and use table_id as a strong type
schema: include schema_fwd.hh in schema.hh
system_keyspace: get_truncation_record: delete unused lambda capture
utils: uuid: define appending_hash<utils::tagged_uuid<Tag>>
utils: tagged_uuid: rename to_uuid() to uuid()
counters: counter_id: use base class create_random_id
counters: base counter_id on utils::tagged_uuid
utils: tagged_uuid: mark functions noexcept
utils: tagged_uuid: bool: reuse uuid::bool operator
raft: migrate tagged_id definition to utils::tagged_uuid
utils: uuid: mark functions noexcept
counters: counter_id delete requirement for triviality
utils: bit_cast: require TriviallyCopyable To
repair: delete unused include of utils/bit_cast.hh
bit_cast: use std::bit_cast
idl: make idl headers self-sufficient
db: hints: sync_point: do not include idl definition file
db/per_partition_rate_limit: tidy up headers self-sufficiency
idl-compiler: include serialization impl and visitors in generated dist.impl.hh files
idl-compiler: add include statements
idl_test: add a struct depending on UUID
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.
Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.
Fixes#11207
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than defining generate_random,
use the base class create_random_id, in unit tests as well.
(It was inherited from raft::internal::tagged_id.)
This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the common base class for uuid-based types.
tagged_uuid::to_uuid is defined here for backward
compatibility, but it will be renamed in the next patch
to uuid().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So it can be used for other types in the system outside
of raft, like counter_id, table_id, table_schema_version,
and more.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This stemmed from utils/bit_cast overly strict requirement.
Now that it was relaxed, there is no need for this static assert
as counter_id is trivially copyable, and that is checked
by bit_cast {read,write}_unaligned
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that scylla requires c++20 there's no
need to define our own implementation in utils/bit_cast.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add include statements to satisfy dependencies.
Delete, now unneeded, include directives from the upper level
source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
idl definition files are not intended for direct
inclusion in .cc files.
Data types it represents are supposed to be defined
in a regular C++ header, so define them in db/hints/sync_point.hh
and include it rather than idl/hinted_handoff.idl.hh.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They are generally required by the serialization implementation.
This will simplify using them without having to hand pick
what header to include in the .cc file that includes them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For generating #include directives in the generated files,
so we don't have to hand-craft the includes of the dependencies
in the right order.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series is aimed at fixing #11132.
To get there, the series untangles the functions that currently depend on the cross-shard coordination in table::snapshot,
namely database::truncate and consequently database::drop_column_family.
database::get_table_on_all_shards is added here as a helper to get a foreign shared ptr of the table shard from all shards,
and it is later used by multiple functions to truncate and then take a snapshot of the sharded table.
database::truncate_table_on_all_shards is defined to orchestrate the truncate process end-to-end, flushing or clearing all table shards before taking a snapshot if needed, using the newly defined table::snapshot_on_all_shards, and by that leaving only the discard_sstables job to the per-shard database::truncate function.
The latter, snapshot_on_all_shards, orchestrates the snapshot process on all shards - getting rid of the per-shard table::snapshot function (after refactoring take_snapshot and finalize_snapshot out of it), and the associated dreaded data structures: snapshot_manager and pending_snapshots.
Fixes#11132.
Closes#11133
* github.com:scylladb/scylladb:
table: reindent write_schema_as_cql
table: coroutinize write_schema_as_cql
table: seal_snapshot: maybe_yield when iterating over the table names
table: reindent seal_snapshot
table: coroutinize seal_snapshot
table: delete unused snapshot_manager and pending_snapshots
table: delete unused snapshot function
table: snapshot_on_all_shards: orchestrate snapshot process
table: snapshot: move pending_snapshots.erase from seal_snapshot
table: finalize_snapshot: take the file sets as a param
table: make seal_snapshot a static member
table: finalize_snapshot: reindent
table: refactor finalize_snapshot out of snapshot
table: snapshot: keep per-shard file sets in snapshot_manager
table: take_snapshot: return foreign unique ptr
table: take_snapshot: maybe yield in per-sstable loop
table: take_snapshot: simplify tables construction code
table: take_snapshot: reindent
table: take_snapshot: simplify error handling
table: refactor take_snapshot out of snapshot
utils: get rid of joinpoint
database: get rid of timestamp_func
database: truncate: snapshot table in all-shards layer
database: truncate: flush table and views in all-shards layer
database: truncate: stop and disable compaction in all-shards layer
database: truncate: move call to set_low_replay_position_mark to all-shards layer
database: truncate: enter per-shard table async_gate in all-shards layer
database: truncate: move check for schema_tables keyspace to all-shards layer.
database: snapshot_table_on_all_shards: reindent
table: add snapshot_on_all_shards
database: add snapshot_table_on_all_shards
database: rename {flush,snapshot}_on_all and make static
database: drop_table_on_all_shards: truncate and stop table in upper layer
database: drop_table_on_all_shards: get all table shards before drop_column_family on each
database: drop_column_family: define table& cf
database: drop_column_family: reuse uuid for evict_all_for_table
database: drop_column_family: move log message up a layer
database: truncate: get rid of the unused ks param
database: add truncate_table_on_all_shards
database: drop_table_on_all_shards: do not accept a truncated_at timestamp_func
database: truncate: get optional snapshot_name from caller
database: truncate: fix assert about replay_position low_mark
database_test: apply_mutation on the correct db shard
Add maybe_yield calls in tight loop, potentially
over thousands of sstable names to prevent reactor stalls.
Although the per-sstable cost is very small, we've experienced
stalls related to printing in O(#sstables) in compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Handle exceptions, making sure the output
stream is properly closed in all cases,
and an intermediate error, if any, is returned as the
final future.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that snapshot orchestration in snapshot_on_all_shards
doesn't use snapshot_manager, get rid of the data structure.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that snapshot orchestration is done solely
in snapshot_on_all_shards, the per-shard
snapshot function can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call take_snapshot on each shard and collect the
returned snapshot_file_set.
When all are done, move the vector<snapshot_file_set>
to finalize_snapshot.
All that without resorting to using the snapshot_manager
nor calling table::snapshot.
Both will be deleted in the following patches.
Fixes#11132
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that seal_snapshot doesn't need to look up
the snapshot_manager in pending_snapshots to
get to the file_sets, erasing the snapshot_manager
object can be done in table::snapshot which
also inserted it there.
This will make it easier to get rid of it in a later patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
and pass it to seal_snapshot, so that the latter won't
need to look up and access the snapshot_manager object.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Write schema.cql and the files manifest in finalize_snapshot.
Currently call it from table::snapshot, but it will
be called in a later patch by snapshot_on_all_shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To simplify processing of the per-shard file names
for generating the manifest.
We only need to print them to the manifest at the
end of the process, so there's no point in copying
them around in the process, just move the
foreign unique unordered_set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the sstable file names are created
and destroyed on each shard and are copied by the
"coordinator" shard using submit_to, while the
coroutine holds the source on its stack frame.
To prepare for the next patches that refactor this
code so that the coordinator shard will submit_to
each shard to perform `take_snapshot` and return
the set of sstrings in the future result, we need
to wrap the result in a foreign_ptr so it gets
freed on the shard that created it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There could be thousands of sstables so we better
consider yielding in the tight loop that copies
the sstable names into the unordered_set we return.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Don't catch exceptions but rather just return
them in the return future, as the exception
is handled by the caller.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Do the actual snapshot-taking code in a per-shard
take_snapshot function, to be called from
snapshot_on_all_shards in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pass an optional truncated_at time_point to
truncate_table_on_all_shards instead of the over-complicated
timestamp_func that returns the same time_point on all shards
anyhow, and was only used for coordination across shards.
Since now we synchronize the internal execution phase in
truncate_table_on_all_shards, there is no longer need
for this timestamp_func.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With that the database layer no longer
needs to invoke the private table::snapshot function,
so it can be defriended from class table.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Start moving the per-shard state establishment logic
to truncate_table_on_all_shards, so that we would eventually
do only the truncate logic per se in the per-shard truncate function.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the per-shard truncate function is called
only from truncate_table_on_all_shards, we can reject the schema_tables
keyspace in the upper layer. There's no need to check that on each shard.
While at it, reuse `is_system_keyspace`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Called from the respective database entry points.
Will be called also from the database drop / truncate path
and will be used for central coordination of per-shard
table::snapshot so we don't have to depend on the snapshot_manager
mechanism that is fragile and currently causes abort if we fail
to allocate it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
truncate the table on all shards then stop it on shards
in the upper layer rather than in the per-shard drop_column_family()
function, so we can further refactor truncate later, flushing
and taking snapshot on all shards, before truncating.
With that, rename drop_column_family to detach_column_family
as now it only deregisters the column family from containers
that refer to it (even via its uuid) and then its caller
is responsible to take it from there.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
cf->schema()->id() is the same one returned
by find_uuid(ks_name, cf_name);
As a follow up, we should define a concrete
table_id type and rename schema::id() to schema::table_id()
to return it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Print once on "coordinator" shard.
And promote to info level as it's important to log
when we're dropping a table (and if we're going to take a snapshot).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
timestamp_func
Since in the drop_table case we want to discard ALL
sstables in the table, not only those with `max_data_age()`
up until drop started.
Fixes#11232
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we change drop_table_on_all_shards to always
pass db_clock::time_point::max() in the next patch,
let it pass a unique snapshot name, otherwise
the snapshot name will always be based on the constant, max
time_point.
Refs #11232
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This assert was tweaked several times:
Introduced in 83323e155e,
then fixed in b2b1a1f7e1 to account
for no rp from discard_sstables, then in
9620755c7f to account for
cases we do not flush the table, then again in
71c5dc82df to make that more accurate.
But, the assert wasn't correct in the first place
in the sense that we first get `low_mark` which
represents the highest replay_position at the time truncate
was called, but then we call discard_sstables with a time_point
of `truncated_at` that we get from the caller via the timestamp_func,
and that one could be in the past, before truncate was called -
hence discard_sstables with that timestamp may very well
return a replay_position from older sstables, prior to flush
that can be smaller than the low_mark.
Fix this assert to account for that case.
The real fix to this issue is to have a truncate_tombstone
that will carry an authoritative api::timestamp (#11230)
Fixes#11231
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Following up on 1c26d49fba,
apply mutations on the correct db shard in all test cases
before we define and use database::truncate_table_on_all_shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).
This can cause reactor stalls and availability issues when replicas
apply such deletions.
This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.
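The difference between the two reservation policies can be observed with a plain `std::vector`. This sketch only counts capacity changes and does not model chunked_vector; the function names are hypothetical, and the exact count for the first variant depends on the standard library's `reserve` behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reserves exactly one extra slot before every push, as in the old code.
// Returns how many times the capacity changed while appending n entries.
inline std::size_t capacity_changes_exact_reserve(std::size_t n) {
    std::vector<int> log;
    std::size_t changes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = log.capacity();
        log.reserve(log.size() + 1);
        if (log.capacity() != before) {
            ++changes;  // each change copies all existing entries: O(N)
        }
        log.push_back(0);
    }
    return changes;
}

// Reserves a doubled capacity only when the vector is full, so the
// number of reallocations grows logarithmically with n.
inline std::size_t capacity_changes_geometric_reserve(std::size_t n) {
    std::vector<int> log;
    std::size_t changes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = log.capacity();
        if (log.size() == log.capacity()) {
            log.reserve(std::max<std::size_t>(1, log.capacity() * 2));
        }
        if (log.capacity() != before) {
            ++changes;
        }
        log.push_back(0);
    }
    return changes;
}
```

With implementations where `reserve(n)` allocates exactly `n` elements, the first variant reallocates on every push, which is what makes the whole application quadratic.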
Fixes#11211
Closes#11215
These are the first commits out of #10815.
It starts by moving pytest logic out of the common `test/conftest.py`
and into `test/topology/conftest.py`, including removing the async
support as it's not used anywhere else.
There's a fix of a bug of leaving tables in `RandomTables.tables` after
dropping all of them.
Keyspace creation is moved out of `conftest.py` into `RandomTables` as
it makes more sense and this way topology tests avoid all the
workarounds for old versions (topology needs ScyllaDB 5+ for Raft,
anyway).
And a minor fix.
Closes#11210
* github.com:scylladb/scylladb:
test.py: fix type hint for seed in ScyllaServer
test.py: create/drop keyspace in tables helper
test.py: RandomTables clear list when dropping all tables
test.py: move topology conftest logic to its own
test.py: async topology tests auto run with pytest_asyncio
Since all topology tests will use the helper, create the keyspace in the
helper.
Avoid the need of dropping all tables per test and just drop the
keyspace.
While there, use blocking CQL execution so it can be used in the
constructor and avoids possible issues with scheduling on cleanup. Also,
creation and drop should happen only once per cluster and no test should
be running changes (either not started or finished).
All topology tests are for Scylla with Raft. So don't use the Cassandra
this_dc workaround as it's unnecessary for Scylla.
Remove return type of random_tables fixture to match other fixtures
everywhere else.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Clear the list of active tables when dropping them.
While there do the list element exchange atomically across active and
removed tables lists.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move asyncio, Raft checks, and RandomTables to topology test suite's own
conftest file.
While there, use non-async version of pre-checks to avoid unnecessary
complexity (we want async tests, not async setup, for now).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Async tests and fixtures in the topology directory are expected to run
with pytest_asyncio (not other async frameworks). Force this with auto
mode.
CI has an older pytest_asyncio version lacking pytest_asyncio.fixture.
Auto mode helps avoiding the need of it and tests and fixtures can just
be marked with regular @pytest.mark.asyncio.
This way tests can run in both older and newer versions of the packages.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
"
There are several helpers in this .cc file that need to get datacenter
for endpoints. For it they use global snitch, because there's no other
place out there to get that data from.
The whole dc/rack info is now moving to topology, so this set patches
the consistency_level.cc to get the topology. This is done two ways.
First, the helpers that have keyspace at hand may get the topology via
ks's effective_replication_map.
Two difficult cases are db::is_local() and db::count_local_endpoints()
because both have just inet_address at hand. Those are patched to be
methods of topology itself and all their callers already mess with
token metadata and can get topology from it.
"
* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
consistency_level: Remove is_local() and count_local_endpoints()
storage_proxy: Use topology::local_endpoints_count()
storage_proxy: Use proxy's topology for DC checks
storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
storage_proxy: Use topology local_dc_filter in its methods
storage_proxy: Mark some digest_read_resolver methods private
forwarding_service: Use topology local_dc_filter
storage_service: Use topology local_dc_filter
consistency_level: Use topology local_dc_filter
consistency_level: Call count_local_endpoints from topology
consistency_level: Get datacenter from topology
replication_strategy: Remove hold snitch reference
effective_replication_map: Get datacenter from topology
topology: Add local-dc detection sugar
No code uses them now -- switched to use topology -- so these two can be
dropped together with their calls for global snitch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A continuation of the previous patches -- now all the code that needs
this helper has a proxy pointer at hand
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several proxy helper classes need to filter endpoints by datacenter.
Since they now have shared_ptr<proxy> on board, they can get topology
via proxy's token metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed to get token metadata from proxy. The resolver in
question is created and maintained by abstract_read_executor which
already has shared_ptr<proxy>, so it just gives its copy
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The proxy has token metadata pointer, so it can use its topology
reference to filter endpoints by datacenter
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The service needs to filter out non-local endpoints for its needs. The
service carries token metadata pointer and can get topology from it to
fulfill this goal
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage-service API calls use db::is_local() helper to filter out
tokens from non-local datacenter. In all those places topology is
available from the token metadata pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to previous patch, in those places with keyspace object at
hand the topology can be obtained from ks' replication map
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some of db/consistency_level.cc helpers the topology can be
obtained from keyspace's effective replication map
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the strategy is constructed there's no place to get snitch from
so the global instance is used. However, after previous patch the
replication strategy no longer needs snitch, so this dependency can
be dropped
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it gets it from snitch, but the dc/rack info is being relocated
onto topology. The topology is in turn already there
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes #11184
Not including it here can cause our estimate of "delete or not" after replay
to be skewed in favour of retaining segments as (new) recycles (or even flip
a counter), and if we have repeated crash+restarts we could be accumulating
an effectively ever-increasing segment footprint
Closes #11205
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.
The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.
The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.
This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.
Closes #11088
* github.com:scylladb/scylladb:
service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
service: storage_proxy: `remote`: make `_gossiper` a const reference
gms: gossiper: mark some member functions const
db: consistency_level: `filter_for_query`: take `const gossiper&`
replica: table: `get_hit_rate`: take `const gossiper&`
gms: gossiper: move `endpoint_filter` to `storage_proxy` module
service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
service: storage_proxy: establish private section in `remote`
service: storage_proxy: remove `migration_manager` pointer
service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
service: storage_proxy: remove `_gossiper` field
alternator: ttl: pass `gossiper&` to `expiration_service`
service: storage_proxy: move `truncate_blocking` implementation to `remote`
service: storage_proxy: introduce `is_alive` helper
service: storage_proxy: remove `_messaging` reference
service: storage_proxy: move `connection_dropped` to `remote`
service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
service: storage_proxy: move `handle_write` to `remote`
service: storage_proxy: move `handle_paxos_prune` to `remote`
service: storage_proxy: move `handle_paxos_accept` to `remote`
service: storage_proxy: move `handle_paxos_prepare` to `remote`
service: storage_proxy: move `handle_truncate` to `remote`
service: storage_proxy: move `handle_read_digest` to `remote`
service: storage_proxy: move `handle_read_mutation_data` to `remote`
service: storage_proxy: move `handle_read_data` to `remote`
service: storage_proxy: move `handle_mutation_failed` to `remote`
service: storage_proxy: move `handle_mutation_done` to `remote`
service: storage_proxy: move `handle_paxos_learn` to `remote`
service: storage_proxy: move `receive_mutation_handler` to `remote`
service: storage_proxy: move `handle_counter_mutation` to `remote`
service: storage_proxy: remove `get_local_shared_storage_proxy`
service: storage_proxy: (de)register RPC handlers in `remote`
service: storage_proxy: introduce `remote`
Previously, if pytest itself failed (e.g. bad import or unexpected
parameter), there was no output file but test.py tried to copy it and
failed.
Change the logic of handling the output file to first check if the
file is there. Then if it's worth keeping it, *move* it to the test
directory for easier comparison and maintenance. Else, if it's not worth
keeping, discard it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes #11193
When update_streams_description() fails it spawns a fiber and retries
the update in the background once every 60s. If the sleep between
attempts is aborted, the resulting exceptional future is ignored,
which produces a warning in the logs.
fixes: #11192
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220802132148.20688-1-xemul@scylladb.com>
`migration_manager` lifetime is longer than the lifetime of "storage
proxy's messaging service part" - that is, `init_messaging_service` is
called after `migration_manager` is started, and `uninit_messaging_service`
is called before `migration_manager` is stopped. Thus we don't need to
hold an owning pointer to `migration_manager` here.
Later, when `init_messaging_service` actually constructs `remote`,
this will be a reference, not a pointer.
Also observe that `_mm` in `remote` is only used in handlers, and
handlers are unregistered before `_mm` is nullified, which ensures that
handlers are not running when `_mm` is nullified. (This argument shows
why the code made sense regardless of our switch from shared_ptr to raw
ptr).
The function only uses one public function of `gossiper` (`is_alive`)
and is used only in one place in `storage_proxy`.
Make it a static function private to the `storage_proxy` module.
The function used a `default_random_engine` field in `gossiper` for
generating random numbers. Turn this field into a static `thread_local`
variable inside the function - no other `gossiper` members used the
field.
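A minimal sketch of the pattern described above, not the actual Scylla code: the engine becomes a function-local `static thread_local` instead of a member of a service object. The function name `pick_random_index` and its signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical helper (name and signature are illustrative only): returns a
// random index in [0, size). The caller must pass size > 0.
inline std::size_t pick_random_index(std::size_t size) {
    // One engine per thread, created lazily on first call. Nothing outside
    // this function needs to own or even see the engine.
    static thread_local std::default_random_engine engine{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, size - 1);
    return dist(engine);
}
```

Keeping the engine inside the function removes a field from the owning class and lets the function become static.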
Access `gossiper` through `_remote`.
Later, all those accesses will handle missing `remote`.
Note that there are also accesses through the `remote()` internal getter.
The plan is as follows:
- direct accesses through `_remote` will be modified to handle missing
`_remote` (these won't cause an error)
- `remote()` will throw if `_remote` is missing (`remote()` is only used
for operations which actually need to send a message to a remote node).
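The two access styles in the plan can be sketched as follows. This is an illustration under assumptions, not the real `storage_proxy`: the `_sketch` types, `attach_remote()`, `send()` and `can_reach_peers()` are all hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical, heavily simplified stand-in for the `remote` part.
struct remote_sketch {
    std::string send(const std::string& msg) const { return "sent:" + msg; }
};

class storage_proxy_sketch {
    std::unique_ptr<remote_sketch> _remote; // may be missing
public:
    void attach_remote() { _remote = std::make_unique<remote_sketch>(); }

    // Getter used by operations that must message a remote node:
    // a missing remote part is an error.
    remote_sketch& remote() {
        if (!_remote) {
            throw std::runtime_error("remote part is not initialized");
        }
        return *_remote;
    }

    // Direct access through `_remote`: a missing remote part is handled
    // quietly instead of raising an error.
    bool can_reach_peers() const { return _remote != nullptr; }
};
```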
The truncate operation always truncates a table on the entire cluster,
even for local tables. And it always does it by sending RPCs (the node
sends an RPC to itself too). Thus it fits in the remote class.
If we want to add a possibility to "truncate locally only" and/or change
the behavior for local tables, we can add a branch in
`storage_proxy::truncate_blocking`.
Refs: #11087
A helper is introduced both in `remote` and in `storage_proxy`.
The `storage_proxy` one calls the `remote` one. In the future it will
also handle a missing `remote`. Then it will report only the local node
to be alive and other nodes dead while `remote` is missing.
The change reduces the number of functions using the `_gossiper` field
in `storage_proxy`.
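A sketch of the intended future behavior, assuming a string stands in for the endpoint address and a plain set stands in for the gossiper's view. All type and member names here are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>

using endpoint = std::string; // stand-in for the real address type

// Hypothetical stand-in for `remote`: answers liveness from a fixed set
// instead of asking a gossiper.
struct remote_sketch {
    std::set<endpoint> live;
    bool is_alive(const endpoint& ep) const { return live.count(ep) > 0; }
};

class storage_proxy_sketch {
    endpoint _self;
    std::unique_ptr<remote_sketch> _remote; // may be missing
public:
    explicit storage_proxy_sketch(endpoint self) : _self(std::move(self)) {}

    void attach_remote(std::set<endpoint> live) {
        _remote = std::make_unique<remote_sketch>(remote_sketch{std::move(live)});
    }

    // Delegates to `remote` when present. While it is missing, only the
    // local node is reported alive and every other node is reported dead.
    bool is_alive(const endpoint& ep) const {
        return _remote ? _remote->is_alive(ep) : ep == _self;
    }
};
```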
"
The helper is in charge of receiving INTERNAL_IP app state from
gossiper join/change notifications, updating system.peers with it
and kicking messaging service to update its preferred ip cache
along with initiating clients reconnection.
Effectively this helper duplicates the topology tracking code in
storage-service notifiers. Removing it leaves less code and drops
a bunch of unwanted cross-components dependencies, in particular:
- one qctx call is gone
- snitch (almost) no longer needs to get messaging from gossiper
- public:private IP cache becomes local to messaging and can be
moved to topology at low cost
A nice minor side effect -- this helper was left unsubscribed
from gossiper on stop and snitch rename. Now it's all gone.
"
* 'br-remove-reconnectible-snitch-helper-2' of https://github.com/xemul/scylla:
snitch: Remove reconnectable snitch helper
snitch, storage_service: Move reconnect to internal_ip kick
snitch, storage_service: Move system.peers preferred_ip update
snitch: Export prefer-local
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:
- The `broadcast_rpc_address` is the address that the drivers are
supposed to connect to. `rpc_address` is the address that the node
binds to - it can be set for example to 0.0.0.0 so that Scylla listens
on all addresses, however this gives no useful information to the
driver.
- The `system.peers` table also has the `rpc_address` column and it
already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.
Fixes: #11201
Its remaining uses are trivial to remove.
Note: in `handle_counter_mutation` we had this piece of code:
```
}).then([trace_state_ptr = std::move(trace_state_ptr), &mutations, cl, timeout] {
auto sp = get_local_shared_storage_proxy();
return sp->mutate_counters_on_leader(...);
```
Obtaining a `shared_ptr` to `storage_proxy` at this point is
no different from obtaining a regular pointer:
- The pointer is obtained inside `then` lambda body, not in the capture
list. So if the goal of obtaining a `shared_ptr` here was to keep
`storage_proxy` alive until the `then` lambda body is executed, that
goal wasn't achieved because the pointer was obtained too late.
- The `shared_ptr` is destroyed as soon as `mutate_counters_on_leader`
returns, it's not stored anywhere. So it doesn't prolong the lifetime
of the service.
I replaced this with a simple capture of `this` in the lambda.
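The lifetime argument can be illustrated with plain `std::shared_ptr` and `std::function` instead of Seastar futures. Everything in this sketch is hypothetical (`service_sketch`, `global_service()`, the two factory functions); it only shows when the owning pointer is taken.

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct service_sketch { int value = 42; };

// Stand-in for a "get the local shared service" accessor.
inline std::shared_ptr<service_sketch>& global_service() {
    static std::shared_ptr<service_sketch> s;
    return s;
}

// Pointer obtained inside the body: nothing keeps the service alive while
// the continuation is waiting to run.
inline std::function<int()> make_late_lookup() {
    return [] {
        auto sp = global_service(); // looked up only when the body runs
        return sp ? sp->value : -1; // -1 marks "service already gone"
    };
}

// Pointer obtained in the capture list: the continuation itself owns a
// reference, so the service outlives the wait.
inline std::function<int()> make_early_capture() {
    return [sp = global_service()] { return sp ? sp->value : -1; };
}
```

Only the second form extends the lifetime; the first is equivalent to using a non-owning pointer, which is why the change to capturing `this` does not alter behavior.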
It's often needed to check if an endpoint sits in the same DC as the
current node. It can be done by
topo.get_datacenter() == topo.get_datacenter(endpoint)
but in some cases a RAII filter function can be helpful.
Also there's a db::count_local_endpoints() that is surprisingly in use,
so add it to topology as well. Next patches will make use of both.
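A rough sketch of the two helpers, assuming a string-keyed map from endpoint to datacenter. The class `topology_sketch` and the method names `get_local_dc_filter()` and `count_local_endpoints()` are placeholders and may not match the real topology API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using endpoint = std::string; // stand-in for the real address type

class topology_sketch {
    endpoint _self;
    std::map<endpoint, std::string> _dc; // endpoint -> datacenter name
public:
    topology_sketch(endpoint self, std::map<endpoint, std::string> dc)
        : _self(std::move(self)), _dc(std::move(dc)) {}

    const std::string& get_datacenter() const { return _dc.at(_self); }
    const std::string& get_datacenter(const endpoint& ep) const { return _dc.at(ep); }

    // Returns a predicate that is true for endpoints in the local DC.
    // The predicate refers to this topology and must not outlive it.
    std::function<bool(const endpoint&)> get_local_dc_filter() const {
        return [this, local = get_datacenter()](const endpoint& ep) {
            return get_datacenter(ep) == local;
        };
    }

    // Counts how many of the given endpoints sit in the local DC.
    std::size_t count_local_endpoints(const std::vector<endpoint>& eps) const {
        return static_cast<std::size_t>(
            std::count_if(eps.begin(), eps.end(), get_local_dc_filter()));
    }
};
```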
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same thing as in previous patch -- when gossiper issues
on_join/_change notification, storage service can kick messaging
service to update its internal_ip cache and reconnect to the peer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the INTERNAL_IP state is updated using reconnectable helper
by subscribing on on_join/on_change events from gossiper. The same
subscription exists in storage service (it's a bit more elaborate, as it
checks whether the node is part of the ring, which is OK).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The boolean bit says whether "the system" should prefer connecting to
the address gossiped around via INTERNAL_IP. Currently only gossiping
property file snitch allows tuning it and ec2-multiregion snitch
prefers internal IP unconditionally.
So exporting consists of 2 pieces:
- add prefer_local() snitch method that's false by default or returns
the (existing) _prefer_local bit for production snitch base
- set the _prefer_local to true by ec2-multiregion snitch
While at it the _prefer_local is moved to production_snitch_base for
uniformity with the new prefer_local() call
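The resulting class layout might look roughly like this. It is a sketch with hypothetical `_sketch` class names and constructors; only `prefer_local()` and `_prefer_local` come from the description above.

```cpp
#include <cassert>

// Base snitch interface: does not prefer the internal IP by default.
struct snitch_sketch {
    virtual ~snitch_sketch() = default;
    virtual bool prefer_local() const noexcept { return false; }
};

// Production base: owns the bit and exports it through prefer_local().
class production_snitch_base_sketch : public snitch_sketch {
protected:
    bool _prefer_local = false;
public:
    bool prefer_local() const noexcept override { return _prefer_local; }
};

// Gossiping property file snitch: the bit is tunable (here passed in
// directly; reading it from a properties file is omitted).
class property_file_snitch_sketch : public production_snitch_base_sketch {
public:
    explicit property_file_snitch_sketch(bool configured) { _prefer_local = configured; }
};

// ec2-multiregion snitch: prefers the internal IP unconditionally.
class ec2_multi_region_snitch_sketch : public production_snitch_base_sketch {
public:
    ec2_multi_region_snitch_sketch() { _prefer_local = true; }
};
```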
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:04 +03:00
667 changed files with 26914 additions and 10450 deletions
"summary":"Removes token (and all data associated with enpoint that had it) from the ring",
"summary":"Removes a node from the cluster. Replicated data that logically belonged to this node is redistributed among the remaining nodes.",
"type":"void",
"nickname":"remove_node",
"produces":[
@@ -1245,7 +1245,7 @@
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ingore in removenode operation",
"description":"Comma-separated list of dead nodes to ignore in removenode operation. Use the same method for all nodes to ignore: either Host IDs or ip addresses.",
_c.log_debug("Splitting large partition {} in order to respect SSTable size limit of {}",*_current_partition.dk,pretty_printed_data_size(_c._max_sstable_size));
// Close partition in current writer, and open it again in a new writer.
do_consume_end_of_partition();
stop_current_writer();
do_consume_new_partition(*_current_partition.dk);
// Replicate partition tombstone to every fragment, allowing the SSTable run reader
Some files were not shown because too many files have changed in this diff